DWDM-viva Question PDF
1. Which of the following forms the logical subset of the complete data warehouse?
(a)Dimensional model
(b)Fact table
(c)Dimensional table
(d)Operational Data Store
(e)Data Mart.
3.Which of the following is a dimension that means the same thing with every possible fact table to
which it can be joined?
(a)Permissible snowflaking
(b)Conformed Dimensions
(c)Degenerate dimensions
(d)Junk Dimensions
(e)Monster Dimensions.
4.Which of the following is not a managing issue in the modeling process?
(a)Content of primary units column
(b)Document each candidate data source
(c)Do regions report to zones
(d)Walk through business scenarios
(e)Ensure that the transaction edit flag is used for analysis.
5.Which of the following criteria is not used for selecting the data sources?
(a)Data Accessibility
(b)Platform
(c)Data accuracy
(d)Longevity of the feed
(e)Project scheduling.
6.Which of the following does not relate to the data modeling tool?
(a)Link to the dimension table designs
(b)Business user Documentation
(c)Helps assure consistency in naming
(d)Length of the logical column.
(e)Generates physical object DDL.
7.Which of the following is true when building a matrix for the Data warehouse bus architecture?
(a)Data marts as columns and dimensions as rows
(b)Dimensions as rows and facts as columns
(c)Data marts as rows and dimensions as columns
(d)Data marts as rows and facts as columns
(e)Facts as rows and data marts as columns.
8.Which of the following should not be considered for each dimension attribute?
(a)Attribute name
(b)Rapid changing dimension policy
(c)Attribute definition
(d)Sample data
(e)Cardinality.
9.Which of the following forms the set of data created to support a specific short-lived business situation?
(a)Personal Data Marts
(b)Application Models
(c)Downstream systems
(d)Disposable Data Marts
(e)Data mining models.
11.What is the special kind of clustering that identifies events or transactions that occur
simultaneously?
(a)Affinity grouping
(b)Classifying
(c)Clustering
(d)Estimating
(e)Predicting.
12.Of the following team members, who does not form the audience for Data warehousing?
(a)Data architects
(b)DBAs
(c)Business Intelligence experts
(d)Managers
(e)Customers/users.
15.Which of the following employs data mining techniques to analyze the intent of a user query,
providing additional generalized or associated information relevant to the query?
(a)Iceberg Query Method
(b)Data Analyzer
(c)Intelligent Query answering
(d)DBA
(e)Query Parser.
16.Of the following clustering algorithms, which method initially creates a hierarchical
decomposition of the given set of data objects?
(a)Partitioning Method
(b)Hierarchical Method
(c)Density-based method
(d)Grid-based Method
(e)Model-based Method.
17.Which one of the following can be performed using the attribute-oriented induction in a manner
similar to concept characterization?
(a)Analytical characterization
(b)Concept Description.
(c)OLAP based approach
(d)Concept Comparison
(e)Data Mining.
18.Which one of the following is an efficient association rule mining algorithm that explores
level-wise mining?
(a)FP-tree algorithm
(b)Apriori Algorithm
(c)Level-based Algorithm
(d)Partitioning Algorithm
(e)Base Algorithm.
19.What allows users to focus the search for rules by providing metarules and additional mining
constraints?
(a)Correlation rule mining
(b)Multilevel Association rule mining
(c)Single level Association rule mining
(d)Constraint based rule mining
(e)Association rule mining.
20.Which of the following can be used in describing central tendency and data description from the
descriptive statistics point of view?
(a)Concept measures
(b)Statistical measures
(c)T-weight
(d)D-weight
(e)Generalization.
21.Which of the following is the collection of data objects that are similar to one another within the
same group?
(a)Partitioning
(b)Grid
(c)Cluster
(d)Table
(e)Data source.
22.In which of the following binning strategies does each bin have approximately the same number
of tuples assigned to it?
(a)Equiwidth binning
(b)Equidepth binning
(c)Homogeneity-based binning
(d)Equilength binning
(e)Frequent predicate set.
23.In which of the following binning strategies is the interval size of each bin the same?
(a)Equiwidth binning
(b)Ordinary binning
(c)Heterogeneity-based binning
(d)Un-Equaling binning
(e)Predicate Set.
25.What algorithms attempt to improve accuracy by removing tree branches reflecting noise in the
data?
(a)Partitioning
(b)Apriori
(c)Clustering
(d)FP tree
(e)Pruning.
26.Which of the following processes includes data cleaning, data integration, data selection, data
transformation, data mining, pattern evaluation, and knowledge presentation?
(a)KDD Process
(b)ETL Process
(c)KTL Process
(d)MDX process
(e)DW&DM.
27.What is the target physical machine on which the data warehouse is organized and stored for
direct querying by end users, report writers, and other applications?
(a)Presentation server
(b)Application server
(c)Database server
(d)Interface server
(e)Data staging server.
30.Type 1: Overwriting the dimension record, thereby losing the history; Type 2: Creating a new
additional dimension record using a new value of the surrogate key; and Type 3: Creating an "old"
field in the dimension record to store the immediately previous attribute value. These belong to:
(a)Slowly Changing Dimensions
(b)Rapidly changing Dimensions
(c)Artificial Dimensions
(d)Degenerate Dimensions
(e)Caveats.
END OF SECTION A
2.
a. How do you optimize the Backup process for a Data Warehouse?
b. Compare and Contrast Naïve Bayesian classification and Bayesian Belief networks.
(4 + 6 = 10 marks)
3.
Discuss why analytical characterization is needed and how it can be performed with an example.
(10 marks)
4.
Briefly outline how to compute the dissimilarity between objects described by the following types of
variables.
i. Asymmetric binary variables.
ii. Nominal Variables.
iii. Ratio-Scaled Variables.
iv. Interval-Scaled Variables.
(10 marks)
5.
Analyze and give the Benefits of having a data warehouse architecture.
(10 marks)
END OF SECTION B
Section C : Applied Theory (20 Marks)
This section consists of questions with serial number 6 - 7.
Answer all questions.
Marks are indicated against each question.
Do not spend more than 25-30 minutes on Section C.
6.
Describe the Data Warehouse architecture framework.
(10 marks)
7.
Write short notes on any two of the following.
a. Factless fact tables.
b. Web mining.
c. Market Basket Analysis.
(5 + 5 = 10 marks)
END OF SECTION C
Suggested Answers
Data Warehousing and Data Mining (MC332) :
January 2007
Section A : Basic Concepts
1.
Answer : (e)
Reason : Data Mart forms the logical subset of the complete data warehouse.
2.
Answer : (e)
Reason : Metadata driven models are not included in Modeling Applications.
3.
Answer : (b)
Reason : Conformed dimensions mean the same thing with every possible fact table to which they
can be joined.
4.
Answer : (e)
Reason : Ensuring that the transaction edit flag is used for analysis is not a managing issue in the
modeling process.
5.
Answer : (b)
Reason : Platform is not a criterion used for selecting data sources.
6.
Answer : (d)
Reason : Length of the logical column does not relate to the data modeling tool.
7.
Answer : (c)
Reason : Data marts as rows and dimensions as columns is true on building a Matrix for Data warehouse
bus architecture.
8.
Answer : (b)
Reason : Rapid changing dimension policy should not be considered for each dimension attribute.
9.
Answer : (d)
Reason : Disposable data marts form the set of data created to support a specific short-lived business
situation.
10.
Answer : (b)
Reason : Report Linking does not form future access services.
11.
Answer : (a)
Reason : Affinity grouping is the special kind of clustering that identifies events or transactions that occur
simultaneously.
12.
Answer : (e)
Reason : Customers/users do not form the audience for Data warehousing.
13.
Answer : (c)
Reason : Aggregates are the precalculated summary values.
14.
Answer : (a)
Reason : Online Analytical Processing.
15.
Answer : (c)
Reason : Intelligent query answering employs data mining techniques to analyze the intent of a user
query, providing additional generalized or associated information relevant to the query.
16.
Answer : (b)
Reason : Hierarchical method is a clustering algorithm which first creates a hierarchical decomposition
of the given set of data objects.
17.
Answer : (d)
Reason : Concept comparison can be performed using the attribute-oriented induction in a manner similar
to concept characterization.
18.
Answer : (b)
Reason : The Apriori algorithm is an efficient association rule mining algorithm that explores level-wise mining.
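The level-wise search that makes Apriori work can be sketched in a few lines of Python. This is an illustrative toy implementation, not a library function; the names `apriori`, `support` and `min_support` are ours:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining: candidates of size k are joined
    from frequent (k-1)-itemsets, and any candidate with an infrequent
    subset is pruned before its support is counted."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        # join step: combine (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
        k += 1
    return result
```

On the toy transactions `[{'bread','milk'}, {'bread','beer'}, {'bread','milk','beer'}, {'milk'}]` with `min_support=2`, the candidate `{bread, milk, beer}` is pruned at level 3 because its subset `{milk, beer}` is infrequent.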
19.
Answer : (d)
Reason : Constraint based rule mining allows users to focus the search for rules by providing metarules
and additional mining constraints.
20.
Answer : (c)
Reason : Statistical Measures can be used in describing central tendency and data description from the
descriptive statistics point of view.
21.
Answer : (c)
Reason : Cluster is the collection of data objects that are similar to one another within the same group.
22.
Answer : (b)
Reason : Equidepth binning is a strategy where each bin has approximately the same number of tuples
assigned to it.
23.
Answer : (a)
Reason : Equiwidth binning is the binning strategy where the interval size of each bin is the same.
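The two binning strategies in questions 22 and 23 can be contrasted with a short illustrative sketch; the helper names are ours:

```python
def equiwidth_bins(values, k):
    """Equiwidth binning: split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # clamp the max value into the last bin
        bins[idx].append(v)
    return bins

def equidepth_bins(values, k):
    """Equidepth binning: distribute the sorted values so each bin holds
    approximately the same number of tuples."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // k:(i + 1) * n // k] for i in range(k)]
```

For the values `[4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]` with k=3, equidepth gives three bins of four tuples each, while equiwidth gives bins of uneven occupancy covering intervals of width 10.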
24.
Answer : (b)
Reason : Boolean association shows relationships between discrete objects.
25.
Answer : (e)
Reason : Pruning algorithms attempt to improve accuracy by removing tree branches reflecting noise in the data.
26.
Answer : (a)
Reason : The KDD process includes data cleaning, data integration, data selection, data
transformation, data mining, pattern evaluation, and knowledge presentation.
27.
Answer : (a)
Reason : Presentation Server is the target physical machine on which the data warehouse data is organized
and stored for direct querying by end users, report writers, and other applications.
28.
Answer : (e)
Reason : Clustering queries cannot form a category of queries.
29.
Answer : (c)
Reason : Equally unavailable is not related to dimension table attributes.
30.
Answer : (a)
Reason : Slowly Changing Dimensions include Type 1: overwriting the dimension record, thereby
losing the history; Type 2: creating a new additional dimension record using a new value of
the surrogate key; and Type 3: creating an "old" field in the dimension record to store the
immediately previous attribute value.
Section B : Problems
1.
Central Data Warehouse Design
This represents the “wholesale” level of the data warehouse, which is used to supply data
marts with data. The most important requirement of the central data warehouse is that it
provides a consistent, integrated and flexible source of data. We argue that traditional
data modeling techniques (Entity Relationship models and normalization) are most
appropriate at this level. A normalized database design ensures maximum consistency
and integrity of the data. It also provides the most flexible data structure: new data can be
easily added to the warehouse in a modular way, and the database structure will support
any analysis requirements. Aggregation or denormalization at this stage will lose
information and restrict the kinds of analyses that can be carried out. An enterprise data
model, if one exists, should be used as the basis for structuring the central data
warehouse.
Data Mart Design
Data marts represent the “retail” level of the data warehouse, where data is accessed
directly by end users. Data is extracted from the central data warehouse into data marts to
support particular analysis requirements. The most important requirement at this level is
that data is structured in a way that is easy for users to understand and use. For this
reason, dimensional modeling techniques are most appropriate at this level. This ensures
that data structures are as simple as possible in order to simplify user queries. The next
section describes an approach for developing dimensional models from an enterprise data model.
DATA WAREHOUSE DESIGN
A simple example is used to illustrate the design approach. Following figure shows an
operational data model for a sales application. The highlighted attributes indicate the
primary keys of each entity.
Such a model is typical of the data models used by operational (OLTP) systems and is
well suited to a transaction processing environment. It contains no
redundancy, thus maximizing efficiency of updates, and explicitly shows all the data and
the relationships between them. Unfortunately most decision makers would find this
schema incomprehensible. Even quite simple queries require multi-table joins and
complex subqueries. As a result, end users will be dependent on technical specialists to
write queries for them.
Step 1. Classify Entities
The first step in producing a dimensional model from an Entity Relationship model is to
classify the entities into three categories:
Transaction Entities
Transaction entities record details about particular events that occur in the business, for
example, orders, insurance claims, salary payments and hotel bookings. Invariably, it is these events that
decision makers want to understand and analyze. The key characteristics of a transaction
entity are:
It describes an event that happens at a point in time
It contains measurements or quantities that may be summarized e.g. dollar amounts,
weights, volumes.
For example, an insurance claim records a particular business event and (among other
things) the amount claimed. Transaction entities are the most important entities in a data
warehouse, and form the basis for constructing fact tables in star schemas. Not all
transaction entities will be of interest for decision support, so user input will be required
in identifying which transactions are important.
Component Entities
A component entity is one which is directly related to a transaction entity via a one-to-
many relationship.
Component entities define the details or “components” of each business transaction.
Component entities answer the “who”, “what”, “when”, “where”, “how” and “why” of a
business event. For example, a sales transaction may be defined by a number of
components:
Customer: who made the purchase
Product: what was sold
Location: where it was sold
Period: when it was sold
An important component of any transaction is time: historical analysis is an important
part of any data warehouse. Component entities form the basis for constructing dimension
tables in star schemas.
Classification Entities
Classification entities are entities which are related to component entities by a chain of
one-to-many relationships; that is, they are functionally dependent on a component entity
(directly or transitively). Classification entities represent hierarchies embedded in the
data model, which may be collapsed into component entities to form dimension tables in
a star schema.
Figure shows the classification of the entities in the example data model. In the diagram,
Black entities represent Transaction entities
Grey entities indicate Component entities
White entities indicate Classification entities
Resolving Ambiguities
In some cases, entities may fit into multiple categories. We therefore define a precedence
hierarchy for resolving such ambiguities:
1. Transaction entity (highest precedence)
2. Classification entity
3. Component entity (lowest precedence)
For example, if an entity can be classified as either a classification entity or a component
entity, it should be classified as a classification entity. In practice, some entities will not
fit into any of these categories. Such entities do not fit the hierarchical structure of a
dimensional model, and cannot be included in star schemas.
This is where real world data sometimes does not fit the star schema “mould”.
Step 2. Identify Hierarchies
Hierarchies are an extremely important concept in dimensional modelling, and form the
primary basis for deriving dimensional models from Entity Relationship models. As
mentioned, most dimension tables in star schemas contain embedded hierarchies. A
hierarchy in an Entity Relationship model is any sequence of entities joined together by
one-to-many relationships, all aligned in the same direction. Figure shows a hierarchy
extracted from the example data model, with State at the top and Sale Item at the bottom.
In hierarchical terminology:
State is the “parent” of Region
Region is the “child” of State
Sale Item, Sale, Location and Region are all “descendants” of State
Sale, Location, Region and State are all “ancestors” of Sale Item
Maximal Hierarchy
A hierarchy is called maximal if it cannot be extended upwards or downwards by
including another entity. In all, there are 14 maximal hierarchies in the example data
model:
Customer Type-Customer-Sale-Sale Fee
Customer Type-Customer-Sale-Sale Item
Fee Type-Sale Fee
Location Type-Location-Sale-Sale Fee
Location Type-Location-Sale-Sale Item
Period (posted)-Sale-Sale Fee
Period (posted)-Sale-Sale Item
Period (sale)-Sale-Sale Fee
Period (sale)-Sale-Sale Item
Product Type-Product-Sale Item
State-Region-Customer-Sale-Sale Fee
State-Region-Customer-Sale-Sale Item
State-Region-Location-Sale-Sale Fee
State-Region-Location-Sale-Sale Item
An entity is called minimal if it is at the bottom of a maximal hierarchy and maximal if it
is at the top of one. Minimal entities can be easily identified as they are entities with no
one-to-many relationships (or “leaf” entities in hierarchical terminology), while maximal
entities are entities with no many-to-one relationships (or “root” entities). In the example
data model there are:
Two minimal entities: Sale Item and Sale Fee
Six maximal entities: Period, Customer Type, State, Location Type, Product Type and Fee Type.
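The minimal ("leaf") and maximal ("root") entities described above can be found mechanically from the one-to-many relationships. The edge list below is our partial reading of the example sales model, so treat it as an illustration rather than the exact schema:

```python
# One-to-many relationships as (parent, child) pairs, following our reading
# of the example sales data model (a partial sketch, for illustration).
edges = [
    ("Customer Type", "Customer"), ("Customer", "Sale"),
    ("Location Type", "Location"), ("Location", "Sale"),
    ("Product Type", "Product"), ("Product", "Sale Item"),
    ("Fee Type", "Sale Fee"), ("Period", "Sale"),
    ("State", "Region"), ("Region", "Customer"), ("Region", "Location"),
    ("Sale", "Sale Item"), ("Sale", "Sale Fee"),
]

parents = {p for p, _ in edges}
children = {c for _, c in edges}
entities = parents | children

# Maximal ("root") entities have no many-to-one relationship upwards;
# minimal ("leaf") entities have no one-to-many relationship downwards.
maximal = entities - children
minimal = entities - parents
```

On this edge list the computation recovers the two minimal entities (Sale Item, Sale Fee) and six maximal entities named in the text.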
Step 3. Produce Dimensional Models
Operators For Producing Dimensional Models
We use two operators to produce dimensional models from Entity Relationship models.
Collapse Hierarchy
Higher level entities can be “collapsed” into lower level entities within hierarchies.
Figure shows the State entity being collapsed into the Region entity. The Region entity
contains its original attributes plus the attributes of the collapsed table. This introduces
redundancy in the form of a transitive dependency, which is a violation of third normal
form. Collapsing a hierarchy is therefore a form of denormalisation.
Aggregation
The aggregation operator can be applied to a transaction entity to create a new entity
containing summarized data. A subset of attributes is chosen from the source entity to
aggregate (the aggregation attributes) and another subset of attributes chosen to
aggregate by (the grouping attributes). Aggregation attributes must be numerical
quantities.
For example, we could apply the aggregation operator to the Sale Item entity to create a
new entity called Product Summary as in Figure. This aggregated entity shows for each
product the total sales amount (quantity*price), the average quantity per order and
average price per item on a daily basis. The aggregation attributes are quantity and price,
while the grouping attributes are Product ID and Date. The key of this entity is the
combination of the attributes used to aggregate by (grouping attributes). Note that
aggregation loses information: we cannot reconstruct the details of individual sale items
from the product summary table.
Figure shows the snowflake schema which results from the Sale transaction entity.
2.
a. The following approaches can be used to optimize the back up process of a data
warehouse:
Partitioning can be used to increase operational flexibility.
Incremental backup to reduce elapsed time to complete an operation.
Parallel processing can be used to divide and conquer large data volumes.
Concurrent backup can be used to extend availability.
RAID is used to recover from media failure.
b. Naïve Bayesian classification and Bayesian belief networks are based on Bayes
theorem of posterior probability. Unlike Naïve Bayesian classification, Bayesian
belief networks allow class conditional independencies to be defined between subsets
of variables.
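The class-conditional independence assumption of naïve Bayesian classification can be made concrete with a tiny sketch: the unnormalized posterior for a class is the prior times the product of per-attribute likelihoods. All probabilities below are made-up illustrative numbers:

```python
def naive_bayes_posterior(prior, cond_probs):
    """Naive Bayes assumes attributes are conditionally independent given
    the class, so the (unnormalized) posterior score is the class prior
    times the product of per-attribute conditional probabilities."""
    score = prior
    for p in cond_probs:
        score *= p
    return score

# Hypothetical values: class C1 has prior 0.6 and attribute likelihoods
# 0.5 and 0.4; class C2 has prior 0.4 and likelihoods 0.3 and 0.2.
s1 = naive_bayes_posterior(0.6, [0.5, 0.4])
s2 = naive_bayes_posterior(0.4, [0.3, 0.2])
```

A Bayesian belief network would replace the flat product with a product of conditionals that follow the network's dependency structure, which is exactly the flexibility the answer above describes.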
3.
Example: Analytical Characterization
Task
○ Mine general characteristics describing graduate students using analytical
characterization
Given
○ attributes name, gender, major, birth_place, birth_date, phone#, and gpa
○ Gen(ai) = concept hierarchies on ai
○ Ui = attribute analytical thresholds for ai
○ Ti = attribute generalization thresholds for ai
○ R = attribute relevance threshold
2. Analytical generalization using Ui
○ attribute removal
- remove name and phone#
○ attribute generalization
- generalize major, birth_place, birth_date and gpa
- accumulate counts
○ candidate relation: gender, major, birth_country, age_range and gpa
Example: Analytical characterization (2)
gender major birth_country age_range gpa count
M Science Canada 20-25 Very_good 16
F Science Foreign 25-30 Excellent 22
M Engineering Foreign 25-30 Excellent 18
F Science Foreign 25-30 Excellent 25
M Science Canada 20-25 Excellent 21
F Engineering Canada 20-25 Excellent 18
Candidate relation for Target class: Graduate students (Σ = 120); Contrasting class: Undergraduate students (Σ = 130)
For an arbitrary tuple, the expected information needed to classify it is:
I(s1, s2) = I(120, 130) = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988
where 120 and 130 are the numbers of graduate and undergraduate students; the analogous per-attribute counts (e.g. graduate versus undergraduate students in “Science”) are used when computing each attribute's entropy and relevance.
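The expected-information figure of 0.9988 can be checked directly with a few lines of Python (the function name is ours):

```python
from math import log2

def expected_info(counts):
    """Expected information I(s1, ..., sm) = -sum_i (si/s) * log2(si/s),
    where s is the total number of samples."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)

# 120 graduate vs 130 undergraduate students out of 250.
i_grad = expected_info([120, 130])
```

The near-even 120/130 split is why the value is so close to 1 bit, the maximum for two classes.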
4.
Interval-scaled variables:
Interval-scaled variables are continuous measurements on a roughly linear scale.
Typical examples include weight and height, and latitude and longitude coordinates. The
measurement unit used can affect the clustering analysis.
For example, changing the measurement unit for height from meters to inches can change the resulting clusters.
How can the data for a variable be standardized? To standardize measurements, one
choice is to convert the original measurements to unitless variables. Given
measurements x1f, ..., xnf for a variable f, this can be performed as follows
Standardize data
○ Calculate the mean absolute deviation:
sf = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|)
where mf = (1/n)(x1f + x2f + ... + xnf) is the mean of f
○ Calculate the standardized measurement (z-score):
zif = (xif - mf) / sf
Using the mean absolute deviation is more robust than using the standard deviation,
since deviations from the mean are not squared and the effect of outliers is reduced
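A minimal sketch of this standardization (the function name is ours):

```python
def standardize(values):
    """Standardize measurements for one variable f using the mean absolute
    deviation s_f, which is more robust to outliers than the standard
    deviation because deviations from the mean are not squared."""
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in values]
```

For `[1, 2, 3, 4, 5]` the mean is 3 and the mean absolute deviation is 1.2, so the z-scores are symmetric about zero.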
Binary Variables:
A contingency table for binary data
Object j
Object I
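A sketch of the asymmetric dissimilarity computed directly from two 0/1 vectors (the function name is ours):

```python
def asymmetric_binary_dissimilarity(i, j):
    """d(i, j) = (r + s) / (q + r + s): the negative 0/0 matches t are
    ignored because, for asymmetric binary variables, joint absences
    carry no information."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # positive matches
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    return (r + s) / (q + r + s)
```

For the vectors (1,0,1,0,0,0) and (1,0,1,0,1,0) there are q=2 positive matches and one mismatch, so d = 1/3; the three 0/0 positions do not enter the computation.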
Nominal Variables:
A generalization of the binary variable in that it can take more than 2 states, e.g., red,
yellow, blue, green
Method 1: Simple matching
○ m: # of matches, p: total # of variables; d(i, j) = (p - m) / p
Method 2: use a large number of binary variables
○ creating a new binary variable for each of the M nominal states
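Simple matching (Method 1) can be sketched as follows (the function name is ours):

```python
def simple_matching_dissimilarity(i, j):
    """d(i, j) = (p - m) / p, where m is the number of variables on which
    objects i and j take the same nominal state and p is the total number
    of variables."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p
```

For example, two objects described by three color attributes that agree on two of them have d = 1/3.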
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately
exponential (e.g., Ae^(Bt))
○ treat them like interval-scaled variables — not a good choice!
○ apply logarithmic transformation
yif = log(xif)
○ treat them as continuous ordinal data and treat their rank as interval-scaled.
5.
Benefits of having a data warehouse architecture
Provides an organizing framework - the architecture draws the lines on the map in terms
of what the individual components are, how they fit together, who owns what parts, and
priorities.
Improved flexibility and maintenance - allows you to quickly add new data sources,
interface standards allow plug and play, and the model and metadata allow impact
analysis and single-point changes.
Faster development and reuse - warehouse developers are able to understand the data
warehouse process, database contents, and business rules more quickly.
Management and communications tool - define and communicate direction and scope to
set expectations, identify roles and responsibilities, and communicate requirements to
vendors.
Coordinate parallel efforts - multiple, relatively independent efforts have a chance to
converge successfully. Also, data marts without architecture become the stovepipes of
tomorrow.