DWDM-viva Question PDF



Question Paper

Data Warehousing and Data Mining (MC332) :


January 2007

Section A : Basic Concepts (30 Marks)

• This section consists of questions with serial numbers 1 to 30.
• Answer all questions.
• Each question carries one mark.
• Maximum time for answering Section A is 30 minutes.

1. Which of the following forms the logical subset of the complete data warehouse?
(a)Dimensional model
(b)Fact table
(c)Dimensional table
(d)Operational Data Store
(e)Data Mart.

2.Which of the following is not included in Modeling Applications?


(a)Forecasting models
(b)Behavior scoring models
(c)Allocation models
(d)Data mining Models
(e)Metadata driven models.

3.Which of the following is a dimension that means the same thing with every possible fact table to
which it can be joined?
(a)Permissible snowflaking
(b)Conformed Dimensions
(c)Degenerate dimensions
(d)Junk Dimensions
(e)Monster Dimensions.

4.Which of the following is not a managing issue in the modeling process?
(a)Content of primary units column
(b)Document each candidate data source
(c)Do regions report to zones
(d)Walk through business scenarios
(e)Ensure that the transaction edit flag is used for analysis.

5.Which of the following criteria is not used for selecting the data sources?
(a)Data Accessibility
(b)Platform
(c)Data accuracy
(d)Longevity of the feed
(e)Project scheduling.

6.Which of the following does not relate to the data modeling tool?
(a)Link to the dimension table designs
(b)Business user Documentation
(c)Helps assure consistency in naming
(d)Length of the logical column.
(e)Generates physical object DDL.

7.Which of the following is true on building a Matrix for Data warehouse bus architecture?
(a)Data marts as columns and dimensions as rows
(b)Dimensions as rows and facts as columns
(c)Data marts as rows and dimensions as columns
(d)Data marts as rows and facts as columns
(e)Facts as rows and data marts as columns.

8.Which of the following should not be considered for each dimension attribute?
(a)Attribute name
(b)Rapid changing dimension policy
(c)Attribute definition
(d)Sample data
(e)Cardinality.

9.Which of the following forms the set of data created to support a specific short-lived business situation?
(a)Personal Data Marts
(b)Application Models
(c)Downstream systems
(d)Disposable Data Marts
(e)Data mining models.

10.Which of the following does not form part of future access services?


(a)Authentication
(b)Report linking
(c)Push toward centralized services
(d)Vendor consolidation
(e)Web based customer access.

11.What is the special kind of clustering that identifies events or transactions that occur
simultaneously?
(a)Affinity grouping
(b)Classifying
(c)Clustering
(d)Estimating
(e)Predicting.

12.Of the following team members, who does not form part of the audience for data warehousing?
(a)Data architects
(b)DBAs
(c)Business Intelligence experts
(d)Managers
(e)Customers/users.

13.The precalculated summary values are called


(a)Assertions
(b)Triggers
(c)Aggregates
(d)Schemas
(e)Indexes.

14.OLAP stands for


(a)Online Analytical Processing
(b)Online Attribute Processing
(c)Online Assertion Processing
(d)Online Association Processing
(e)Online Allocation Processing.

15.Which of the following employs data mining techniques to analyze the intent of a user query, providing additional generalized or associated information relevant to the query?
(a)Iceberg Query Method
(b)Data Analyzer
(c)Intelligent Query answering
(d)DBA
(e)Query Parser.

16.Of the following clustering methods, which one initially creates a hierarchical decomposition of the given set of data objects?
(a)Partitioning Method
(b)Hierarchical Method
(c)Density-based method
(d)Grid-based Method
(e)Model-based Method.

17.Which one of the following can be performed using attribute-oriented induction in a manner similar to concept characterization?
(a)Analytical characterization
(b)Concept Description.
(c)OLAP based approach
(d)Concept Comparison
(e)Data Mining.

18.Which one of the following is an efficient association rule mining algorithm that explores level-wise mining?
(a)FP-tree algorithm
(b)Apriori Algorithm
(c)Level-based Algorithm
(d)Partitioning Algorithm
(e)Base Algorithm.

19.What allows users to focus the search for rules by providing metarules and additional mining
constraints?
(a)Correlation rule mining
(b)Multilevel Association rule mining
(c)Single level Association rule mining
(d)Constraint based rule mining
(e)Association rule mining.

20.Which of the following can be used in describing central tendency and data description from the
descriptive statistics point of view?
(a)Concept measures
(b)Statistical measures
(c)T-weight
(d)D-weight
(e)Generalization.

21.Which of the following is the collection of data objects that are similar to one another within the
same group?
(a)Partitioning
(b)Grid
(c)Cluster
(d)Table
(e)Data source.

22.In which of the following binning strategies does each bin have approximately the same number of tuples assigned to it?
(a)Equiwidth binning
(b)Equidepth binning
(c)Homogeneity-based binning
(d)Equilength binning
(e)Frequent predicate set.

23.In which of the following binning strategies is the interval size of each bin the same?
(a)Equiwidth binning
(b)Ordinary binning
(c)Heterogeneity-based binning
(d)Un-Equaling binning
(e)Predicate Set.

24.Which of the following association shows relationships between discrete objects?


(a)Quantitative
(b)Boolean
(c)Single Dimensional
(d)Multidimensional
(e)Bidirectional.

25.What algorithms attempt to improve accuracy by removing tree branches reflecting noise in the
data?
(a)Partitioning
(b)Apriori
(c)Clustering
(d)FP tree
(e)Pruning.

26.Which of the following processes includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation?
(a)KDD Process
(b)ETL Process
(c)KTL Process
(d)MDX process
(e)DW&DM.

27.What is the target physical machine on which the data warehouse is organized and stored for
direct querying by end users, report writers, and other applications?
(a)Presentation server
(b)Application server
(c)Database server
(d)Interface server
(e)Data staging server.

28.Which of the following cannot form a category of queries?


(a)Simple constraints
(b)Correlated subqueries
(c)Simple behavioral queries
(d)Derived Behavioral queries
(e)Clustering queries.

29.Which of the following is not related to dimension table attributes?


(a)Verbose
(b)Descriptive
(c)Equally unavailable
(d)Complete
(e)Indexed.

30.Type 1: Overwriting the dimension record, thereby losing the history; Type 2: Create a new additional dimension record using a new value of the surrogate key; and Type 3: Create an “old” field in the dimension record to store the immediate previous attribute value. These belong to:
(a)Slowly Changing Dimensions
(b)Rapidly changing Dimensions
(c)Artificial Dimensions
(d)Degenerate Dimensions
(e)Caveats.

END OF SECTION A

Section B : Problems (50 Marks)


• This section consists of questions with serial numbers 1 to 5.
• Answer all questions.
• Marks are indicated against each question.
• Detailed workings should form part of your answer.
• Do not spend more than 110 - 120 minutes on Section B.
1.
One of the most important assets of an organization is its information, and the data warehouse is one such asset. To deliver the data to end users, build a dimensional model for a Sales application, starting from an ER diagram.
(10 marks)

2.
a. How do you optimize the Backup process for a Data Warehouse?
b. Compare and Contrast Naïve Bayesian classification and Bayesian Belief networks.
(4 + 6 = 10 marks)

3.
Discuss why analytical characterization is needed and how it can be performed with an example.
(10 marks)

4.
Briefly outline how to compute the dissimilarity between objects described by the following types of
variables.
i. Asymmetric binary variables.
ii. Nominal Variables.
iii. Ratio-Scaled Variables.
iv. Interval-Scaled Variables.
(10 marks)

5.
Analyze and give the Benefits of having a data warehouse architecture.
(10 marks)

END OF SECTION B
Section C : Applied Theory (20 Marks)
• This section consists of questions with serial numbers 6 and 7.
• Answer all questions.
• Marks are indicated against each question.
• Do not spend more than 25 - 30 minutes on Section C.

6.
Describe the Data Warehouse architecture framework.
(10 marks)

7.
Write short notes on any two of the following.
a. Factless fact tables.
b. Web mining.
c. Market Basket Analysis.
(5 + 5 = 10 marks)

END OF SECTION C

END OF QUESTION PAPER

Suggested Answers
Data Warehousing and Data Mining (MC332) :
January 2007
Section A : Basic Concepts
1.
Answer : (e)
Reason : Data Mart forms the logical subset of the complete data warehouse.

2.
Answer : (e)
Reason : Metadata driven models are not included in Modeling Applications.

3.
Answer : (b)
Reason : Conformed dimensions mean the same thing with every possible fact table to which they can be joined.

4.
Answer : (e)
Reason : Ensuring that the transaction edit flag is used for analysis is not a managing issue in the modeling process.

5.
Answer : (b)
Reason : Platform is not a criterion used for selecting data sources.

6.
Answer : (d)
Reason : Length of the logical column does not relate to the data modeling tool.

7.
Answer : (c)
Reason : Data marts as rows and dimensions as columns is true on building a Matrix for Data warehouse
bus architecture.

8.
Answer : (b)
Reason : Rapid changing dimension policy should not be considered for each dimension attribute.

9.
Answer : (d)
Reason : Disposable data marts form the set of data created to support a specific short-lived business situation.

10.
Answer : (b)
Reason : Report linking does not form part of future access services.

11.
Answer : (a)
Reason : Affinity grouping is the special kind of clustering that identifies events or transactions that occur
simultaneously.

12.
Answer : (e)
Reason : Customers/users do not form part of the audience for data warehousing.

13.
Answer : (c)
Reason : Aggregates are the precalculated summary values.

14.
Answer : (a)
Reason : Online Analytical Processing.

15.
Answer : (c)
Reason : Intelligent query answering employs data mining techniques to analyze the intent of a user query, providing additional generalized or associated information relevant to the query.

16.
Answer : (b)
Reason : Hierarchical method is a clustering algorithm which first creates a hierarchical decomposition
of the given set of data objects.

17.
Answer : (d)
Reason : Concept comparison can be performed using the attribute-oriented induction in a manner similar
to concept characterization.

18.
Answer : (b)
Reason : The Apriori algorithm is an efficient association rule mining algorithm that explores level-wise mining.

19.
Answer : (d)
Reason : Constraint based rule mining allows users to focus the search for rules by providing metarules
and additional mining constraints.

20.
Answer : (b)
Reason : Statistical Measures can be used in describing central tendency and data description from the
descriptive statistics point of view.

21.
Answer : (c)
Reason : Cluster is the collection of data objects that are similar to one another within the same group.

22.
Answer : (b)
Reason : Equidepth binning is a strategy where each bin has approximately the same number of tuples
assigned to it.

23.
Answer : (a)
Reason : Equiwidth binning is the binning strategy where the interval size of each bin is the same.
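To make the contrast between the two strategies concrete, here is a small illustrative Python sketch (not part of the original paper; the data values are invented):

    # Illustrative sketch: equiwidth vs. equidepth binning (invented data).
    def equiwidth_bins(values, k):
        # Split the value range into k intervals of equal width.
        lo, hi = min(values), max(values)
        width = (hi - lo) / k
        bins = [[] for _ in range(k)]
        for v in values:
            i = min(int((v - lo) / width), k - 1)  # clamp the maximum value
            bins[i].append(v)
        return bins

    def equidepth_bins(values, k):
        # Give each bin approximately the same number of tuples.
        ordered = sorted(values)
        size = len(ordered) // k
        return [ordered[i * size:] if i == k - 1 else ordered[i * size:(i + 1) * size]
                for i in range(k)]

    data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    print(equiwidth_bins(data, 3))  # equal interval sizes, uneven counts
    print(equidepth_bins(data, 3))  # roughly equal counts, uneven interval sizes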

24.
Answer : (b)
Reason : Boolean association shows relationships between discrete objects.

25.
Answer : (d)
Reason : FP tree attempt to improve accuracy by removing tree branches reflecting noise in the data.

26.
Answer : (a)
Reason : The KDD process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.

27.
Answer : (a)
Reason : Presentation Server is the target physical machine on which the data warehouse data is organized
and stored for direct querying by end users, report writers, and other applications.

28.
Answer : (e)
Reason : Clustering queries cannot form a category of queries.

29.
Answer : (c)
Reason : Equally unavailable is not related to dimension table attributes.

30.
Answer : (a)
Reason : Slowly Changing Dimensions cover Type 1: overwriting the dimension record, thereby losing the history; Type 2: creating a new additional dimension record using a new value of the surrogate key; and Type 3: creating an “old” field in the dimension record to store the immediate previous attribute value.
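For illustration only (this sketch is not part of the original answer, and the record layout and surrogate-key scheme are hypothetical), the three types can be contrasted in a few lines of Python:

    # Hypothetical sketch of SCD Types 1-3 on a tiny customer dimension.
    dim = [{"sk": 1, "cust_id": "C100", "city": "Pune", "prev_city": None}]

    def scd_type1(rows, cust_id, new_city):
        # Type 1: overwrite in place, losing the history.
        for r in rows:
            if r["cust_id"] == cust_id:
                r["city"] = new_city

    def scd_type2(rows, cust_id, new_city):
        # Type 2: add a new record under a new surrogate key.
        new_sk = max(r["sk"] for r in rows) + 1
        rows.append({"sk": new_sk, "cust_id": cust_id,
                     "city": new_city, "prev_city": None})

    def scd_type3(rows, cust_id, new_city):
        # Type 3: keep the immediate previous value in an "old" field.
        for r in rows:
            if r["cust_id"] == cust_id:
                r["prev_city"], r["city"] = r["city"], new_city

    scd_type2(dim, "C100", "Mumbai")  # history preserved as a second row
    print(dim)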

Section B : Problems
1.
Central Data Warehouse Design
This represents the “wholesale” level of the data warehouse, which is used to supply data
marts with data. The most important requirement of the central data warehouse is that it
provides a consistent, integrated and flexible source of data. We argue that traditional
data modeling techniques (Entity Relationship models and normalization) are most
appropriate at this level. A normalized database design ensures maximum consistency
and integrity of the data. It also provides the most flexible data structure: new data can be easily added to the warehouse in a modular way, and the database structure will support any analysis requirements. Aggregation or denormalisation at this stage will lose information and restrict the kinds of analyses that can be carried out. An enterprise data
model, if one exists, should be used as the basis for structuring the central data
warehouse.
Data Mart Design
Data marts represent the “retail” level of the data warehouse, where data is accessed
directly by end users. Data is extracted from the central data warehouse into data marts to
support particular analysis requirements. The most important requirement at this level is
that data is structured in a way that is easy for users to understand and use. For this
reason, dimensional modeling techniques are most appropriate at this level. This ensures
that data structures are as simple as possible in order to simplify user queries. What follows describes an approach for developing dimensional models from an enterprise data model.
DATA WAREHOUSE DESIGN
A simple example is used to illustrate the design approach. The following figure shows an
operational data model for a sales application. The highlighted attributes indicate the
primary keys of each entity.

Such a model is typical of data models used by operational (OLTP) systems, and is well suited to a transaction processing environment. It contains no redundancy, thus maximizing efficiency of updates, and explicitly shows all the data and
the relationships between them. Unfortunately most decision makers would find this
schema incomprehensible. Even quite simple queries require multi-table joins and
complex subqueries. As a result, end users will be dependent on technical specialists to
write queries for them.
Step 1. Classify Entities
The first step in producing a dimensional model from an Entity Relationship model is to
classify the entities into three categories:
Transaction Entities
Transaction entities record details about particular events that occur in the business: for example, orders, insurance claims, salary payments and hotel bookings. Invariably, it is these events that
decision makers want to understand and analyze. The key characteristics of a transaction
entity are:
• It describes an event that happens at a point in time
• It contains measurements or quantities that may be summarized, e.g. dollar amounts, weights, volumes.
For example, an insurance claim records a particular business event and (among other
things) the amount claimed. Transaction entities are the most important entities in a data
warehouse, and form the basis for constructing fact tables in star schemas. Not all
transaction entities will be of interest for decision support, so user input will be required
in identifying which transactions are important.
Component Entities
A component entity is one which is directly related to a transaction entity via a one-to-many relationship.
Component entities define the details or “components” of each business transaction.
Component entities answer the “who”, “what”, “when”, “where”, “how” and “why” of a
business event. For example, a sales transaction may be defined by a number of
components:
• Customer: who made the purchase
• Product: what was sold
• Location: where it was sold
• Period: when it was sold
An important component of any transaction is time: historical analysis is an important
part of any data warehouse. Component entities form the basis for constructing dimension
tables in star schemas.
Classification Entities
Classification entities are entities which are related to component entities by a chain of
one-to-many relationships - that is, they are functionally dependent on a component entity
(directly or transitively). Classification entities represent hierarchies embedded in the
data model, which may be collapsed into component entities to form dimension tables in
a star schema.
Figure shows the classification of the entities in the example data model. In the diagram,
• Black entities represent Transaction entities
• Grey entities indicate Component entities
• White entities indicate Classification entities

Resolving Ambiguities
In some cases, entities may fit into multiple categories. We therefore define a precedence
hierarchy for resolving such ambiguities:
1. Transaction entity (highest precedence)
2. Classification entity
3. Component entity (lowest precedence)
For example, if an entity can be classified as either a classification entity or a component
entity, it should be classified as a classification entity. In practice, some entities will not
fit into any of these categories. Such entities do not fit the hierarchical structure of a
dimensional model, and cannot be included in star schemas.
This is where real world data sometimes does not fit the star schema “mould”.
Step 2. Identify Hierarchies
Hierarchies are an extremely important concept in dimensional modelling, and form the
primary basis for deriving dimensional models from Entity Relationship models. As
mentioned, most dimension tables in star schemas contain embedded hierarchies. A
hierarchy in an Entity Relationship model is any sequence of entities joined together by
one-to-many relationships, all aligned in the same direction. Figure shows a hierarchy
extracted from the example data model, with State at the top and Sale Item at the bottom.

In hierarchical terminology:
• State is the “parent” of Region
• Region is the “child” of State
• Sale Item, Sale, Location and Region are all “descendants” of State
• Sale, Location, Region and State are all “ancestors” of Sale Item
Maximal Hierarchy
A hierarchy is called maximal if it cannot be extended upwards or downwards by
including another entity. In all, there are 14 maximal hierarchies in the example data
model:
• Customer Type-Customer-Sale-Sale Fee
• Customer Type-Customer-Sale-Sale Item
• Fee Type-Sale Fee
• Location Type-Location-Sale-Sale Fee
• Location Type-Location-Sale-Sale Item
• Period (posted)-Sale-Sale Fee
• Period (posted)-Sale-Sale Item
• Period (sale)-Sale-Sale Fee
• Period (sale)-Sale-Sale Item
• Product Type-Product-Sale Item
• State-Region-Customer-Sale-Sale Fee
• State-Region-Customer-Sale-Sale Item
• State-Region-Location-Sale-Sale Fee
• State-Region-Location-Sale-Sale Item
An entity is called minimal if it is at the bottom of a maximal hierarchy and maximal if it
is at the top of one. Minimal entities can be easily identified as they are entities with no
one-to-many relationships (or “leaf” entities in hierarchical terminology), while maximal
entities are entities with no many to one relationships (or “root” entities). In the example
data model there are
• Two minimal entities: Sale Item and Sale Fee
• Six maximal entities: Period, Customer Type, State, Location Type, Product Type and Fee Type.
Step 3. Produce Dimensional Models
Operators For Producing Dimensional Models
We use two operators to produce dimensional models from Entity Relationship models.
Higher level entities can be “collapsed” into lower level entities within hierarchies.
Figure shows the State entity being collapsed into the Region entity. The Region entity
contains its original attributes plus the attributes of the collapsed table. This introduces
redundancy in the form of a transitive dependency, which is a violation of third normal
form. Collapsing a hierarchy is therefore a form of denormalisation.

Figure 8. State Entity “collapsed” into region


Figure shows Region being collapsed into Location.
We can continue doing this until we reach the bottom of the hierarchy, and end up with a
single table (Sale Item).
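As a rough sketch of the collapse operator (the entity attributes below are invented for illustration), each Region row simply absorbs the attributes of its parent State row, introducing the transitive dependency described above:

    # Toy illustration of collapsing State into Region (invented attributes).
    states = {"NSW": {"state_name": "New South Wales"}}
    regions = [{"region_id": "R1", "region_name": "Sydney Metro", "state_id": "NSW"}]

    # Collapse: each Region row absorbs the attributes of its parent State.
    collapsed = [dict(r, **states[r["state_id"]]) for r in regions]
    print(collapsed)
    # region_id -> state_id -> state_name is now a transitive dependency.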

Aggregation
The aggregation operator can be applied to a transaction entity to create a new entity
containing summarized data. A subset of attributes is chosen from the source entity to
aggregate (the aggregation attributes) and another subset of attributes chosen to
aggregate by (the grouping attributes). Aggregation attributes must be numerical
quantities.
For example, we could apply the aggregation operator to the Sale Item entity to create a
new entity called Product Summary as in Figure. This aggregated entity shows for each
product the total sales amount (quantity*price), the average quantity per order and
average price per item on a daily basis. The aggregation attributes are quantity and price,
while the grouping attributes are Product ID and Date. The key of this entity is the
combination of the attributes used to aggregate by (grouping attributes). Note that
aggregation loses information: we cannot reconstruct the details of individual sale items
from the product summary table.

Figure 10. Aggregation Operator
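A minimal Python sketch of the aggregation operator (the sale-item rows are invented): the grouping attributes are Product ID and Date, and the aggregation computes the total sales amount:

    from collections import defaultdict

    # Invented sale-item rows; aggregate quantity * price by (product_id, date).
    sale_items = [
        {"product_id": "P1", "date": "2007-01-05", "quantity": 2, "price": 9.5},
        {"product_id": "P1", "date": "2007-01-05", "quantity": 1, "price": 9.5},
        {"product_id": "P2", "date": "2007-01-05", "quantity": 4, "price": 3.0},
    ]

    summary = defaultdict(float)
    for row in sale_items:
        # Grouping attributes form the key; quantity * price is aggregated.
        summary[(row["product_id"], row["date"])] += row["quantity"] * row["price"]

    print(dict(summary))  # {('P1', '2007-01-05'): 28.5, ('P2', '2007-01-05'): 12.0}

As the text notes, the individual sale items cannot be reconstructed from this summary.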


Dimensional Design Options
There is a wide range of options for producing dimensional
models from an Entity Relationship model.
These include:
• Flat schema
• Terraced schema
• Star schema
• Snowflake schema
• Star cluster schema
Each of these options represents different trade-offs between complexity and redundancy.
Here we discuss how the operators previously defined may be used to produce different
dimensional models.
Option 1: Flat Schema
A flat schema is the simplest schema possible without losing information. This is formed
by collapsing all entities in the data model down into the minimal entities. This
minimizes the number of tables in the database and therefore the possibility that joins will
be needed in user queries. In a flat schema we end up with one table for each minimal
entity in the original data model. Figure 11 shows the flat schema which results from the
example data model.
Figure 11. Flat Schema
Such a schema is similar to the “flat files” used by analysts using statistical packages
such as SAS and SPSS. Note that this structure does not lose any information from the
original data model. It contains redundancy, in the form of transitive and partial
dependencies, but does not involve any aggregation. One problem with a flat schema is
that it may lead to aggregation errors when there are hierarchical relationships between
transaction entities. When we collapse numerical amounts from higher level transaction
entities into another, they will be repeated. In the example data model, if a Sale consists of three Sale Items,
the discount amount will be stored in three different rows in the Sale Item table. Adding
the discount amounts together then results in double-counting (or in this case, triple
counting). Another problem with flat schemas is that they tend to result in tables with
large numbers of attributes, which may be unwieldy. While the number of tables (system
complexity) is minimised, the complexity
of each table (element complexity) is greatly increased.
Option 2: Terraced Schema
A terraced schema is formed by collapsing entities down maximal hierarchies, but
stopping when they reach a transaction entity. This results in a single table for each
transaction entity in the data model. Figure shows the terraced schema that results from
the example
data model. This schema is less likely to cause problems for an inexperienced user,
because the separation between levels of transaction entities is explicitly shown.
Figure 12. Terraced Schema
Option 3: Star Schema
A star schema can be easily derived from an Entity Relationship model. Each star schema
is formed in the following way:
• A fact table is formed for each transaction entity. The key of the table is the combination of the keys of its associated component entities.
• A dimension table is formed for each component entity, by collapsing hierarchically related classification entities into it.
• Where hierarchical relationships exist between transaction entities, the child entity inherits all dimensions (and key attributes) from the parent entity. This provides the ability to “drill down” between transaction levels.
• Numerical attributes within transaction entities should be aggregated by key attributes (dimensions). The aggregation attributes and functions used depend on the application.
Figure shows the star schema that results from the Sale transaction entity. This star
schema has four dimensions, each of which contains embedded hierarchies. The
aggregated fact is Discount amount.

Figure 13. Sale Star Schema


Figure shows the star schema which results from the Sale Item transaction entity. This
star schema has five dimensions. This includes four dimensions from its “parent”
transaction entity (Sale) and one of its own (Product). The aggregated facts are quantity
and item cost (quantity * price).

Figure 14. Sale item star schema


A separate star schema is produced for each transaction table in the original data model.
Constellation Schema
Instead of a number of discrete star schemas, the example data model can be transformed
into a constellation schema. A constellation schema consists of a set of star schemas with
hierarchically linked fact tables.
The links between the various fact tables provide the ability to “drill down” between
levels of detail (e.g. from Sale to Sale Item). The constellation schema which results from
the example data model is shown in Figure 15; links between fact tables are shown in bold.

Figure 15. Sales Constellation Schema


Galaxy Schema
More generally, a set of star schemas or constellations can be combined together to form
a galaxy. A galaxy is a collection of star schemas with shared dimensions. Unlike a
constellation schema, the fact tables in a galaxy do not need to be directly related.
Option 4: Snowflake Schema
In a star schema, hierarchies in the original data model are collapsed or denormalised to
form dimension tables. Each dimension table may contain multiple independent
hierarchies. A snowflake schema is a star schema with all hierarchies explicitly shown. A
snowflake
schema can be formed from a star schema by expanding out (normalizing) the hierarchies
in each dimension. Alternatively, a snowflake schema can be produced directly from an
Entity Relationship model by the following procedure:
• A fact table is formed for each transaction entity. The key of the table is the combination of the keys of the associated component entities.
• Each component entity becomes a dimension table.
• Where hierarchical relationships exist between transaction entities, the child entity inherits all relationships to component entities (and key attributes) from the parent entity.
• Numerical attributes within transaction entities should be aggregated by the key attributes. The attributes and functions used depend on the application.

Figure shows the snowflake schema which results from the Sale transaction entity.

Option 5: Star Cluster Schema


Kimball (1996) argues that “snowflaking” is undesirable, because it adds complexity to
the schema and requires extra joins. Clearly, expanding all hierarchies defeats the
purpose of producing simple, user friendly database designs - in the example above, it more than doubles the number of tables in the schema. Here, we argue that neither a
“pure” star schema (fully collapsed hierarchies) nor a “pure” snowflake schema (fully
expanded hierarchies) results in the best solution. As in many design problems, the
optimal solution is a balance between two extremes.
The problem with fully collapsing hierarchies occurs when hierarchies overlap, leading to
redundancy between dimensions when they are collapsed. This can result in confusion for
users, increased complexity in extract processes and inconsistent results from queries if
hierarchies become inconsistent. For these reasons, we require that dimensions should be
orthogonal.
Overlapping dimensions can be identified via “forks” in hierarchies. A fork occurs when
an entity acts as a “parent” in two different dimensional hierarchies. This results in the
entity and all of its ancestors being collapsed into two separate dimension tables.
Fork entities can be identified as classification entities with multiple one-to-many
relationships. The exception to this rule occurs when the hierarchy converges again lower
down - Dampney (1996) calls this a commuting loop.
In the example data model, a fork occurs at the Region entity. Region is a parent of
Location and Customer, which are both components of the Sale transaction. In the star
schema representation, State and Region would be included in both the Location and
Customer
dimensions when the hierarchies are collapsed. This results in overlap between the
dimensions.

Figure 17. Intersecting Hierarchies in Example Data Model


We define a star cluster schema as one which has the minimal number of tables while
avoiding overlap between dimensions. It is a star schema which is selectively
“snowflaked” to separate out hierarchical segments or subdimensions which are shared
between different dimensions. Subdimensions effectively represent the “highest common
factor” between dimensions. A star cluster schema may be produced from an Entity
relationship model using the following procedure.
Each star cluster is formed as follows:
• A fact table is formed for each transaction entity. The key of the table is the combination of the keys of the associated component entities.
• Classification entities should be collapsed down their hierarchies until they reach either a fork entity or a component entity. If a fork is reached, a subdimension table should be formed, consisting of the fork entity plus all its ancestors. Collapsing should begin again after the fork entity. When a component entity is reached, a dimension table should be formed.
• Where hierarchical relationships exist between transaction entities, the child entity should inherit all dimensions (and key attributes) from the parent entity.
• Numerical attributes within transaction entities should be aggregated by the key attributes (dimensions). The attributes and functions used depend on the application.
Figure 18 shows the star cluster schema that results from the model fragment of Figure 17.

Figure 18. Star Cluster Schema


Figure shows how entities in the original data model were clustered to form the star
cluster schema. The overlap between hierarchies has now been removed.

Figure 19. Revised Clustering


If required, views may be used to reconstruct a star schema from a star cluster schema.
This gives the best of both worlds: the simplicity of a star schema while preserving
consistency between dimensions. As with star schemas, star clusters may be combined
together to
form constellations or galaxies.
Step 4. Evaluation and Refinement
In practice, dimensional modelling is an iterative process. The clustering procedure
described in Step 3 is useful for producing a first cut design, but this will need to be
refined to produce the final data mart design. Most of these modifications have to do with further simplifying the model and dealing with nonhierarchical patterns in the data.
Combining Fact Tables
Fact tables with the same primary keys (i.e. the same dimensions) should be combined.
This reduces the number of star schemas and facilitates comparison between related facts
(e.g. budget and actual figures).
Combining Dimension Tables
Creating dimension tables for each component entity often results in a large number of
dimension tables. To simplify the data mart structure, related dimensions should be
consolidated together into a single dimension table.
Many to Many Relationships
Most of the complexities which arise in converting a traditional Entity Relationship
model to a dimensional model result from many-to-many relationships or intersection
entities. Many-to-many relationships cause problems in dimensional modelling because
they represent a “break” in the hierarchical chain, and cannot be collapsed. There are a
number of options for dealing with many-to-many relationships:
(a) Ignore the intersection entity (eliminate it from the data mart)
(b) Convert the many-to-many relationship to a one-to-many relationship, by defining a “primary” relationship
(c) Include it as a many-to-many relationship in the data mart; such entities may be useful to expert analysts but will not be amenable to analysis using an OLAP tool.
For example, in the model below, each client may be involved in a number of industries.
The intersection entity Client Industry breaks the hierarchical chain and cannot be
collapsed into Client.
Figure 20. Multiple Classification
The options are (a) to exclude the industry hierarchy, (b) convert it to a one-to-many
relationship or (c) include it as a many-to-many relationship.

Figure 21. Design Options


Handling Subtypes
Super type/subtype relationships can be converted to a hierarchical structure by removing
the subtypes and creating a classification entity to distinguish between subtypes. This can
then be converted to a dimensional model in a straightforward manner.

Figure 22. Conversion of subtypes to Hierarchical Form

2.
a. The following approaches can be used to optimize the back up process of a data
warehouse:
• Partitioning can be used to increase operational flexibility.
• Incremental backups reduce the elapsed time needed to complete an operation.
• Parallel processing can be used to divide and conquer large data volumes.
• Concurrent backups can be run to extend availability.
• RAID is used to recover from media failure.
b. Naïve Bayesian classification and Bayesian belief networks are both based on Bayes' theorem of posterior probability. Naïve Bayesian classification assumes class conditional independence of all attributes; unlike it, Bayesian belief networks allow class conditional independencies to be defined between subsets of variables.
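To illustrate the contrast, here is a toy Python sketch of naïve Bayesian scoring (all probabilities below are invented): the joint likelihood is a plain product of per-attribute likelihoods, which is exactly the independence assumption that a belief network relaxes by conditioning on parent variables:

    # Toy naive Bayesian scoring: P(C|x) proportional to P(C) * product of P(x_i|C).
    prior = {"buys": 0.6, "not": 0.4}           # invented class priors
    likelihood = {                              # invented P(attribute value | class)
        "buys": {"age=youth": 0.2, "income=high": 0.5},
        "not":  {"age=youth": 0.6, "income=high": 0.1},
    }

    def score(cls, attrs):
        p = prior[cls]
        for a in attrs:
            p *= likelihood[cls][a]  # naive: attributes independent given class
        return p

    x = ["age=youth", "income=high"]
    print(max(prior, key=lambda c: score(c, x)))  # "buys" (0.06 vs 0.024)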

3.
Example: Analytical Characterization
• Task
○ Mine general characteristics describing graduate students using analytical characterization
• Given
○ attributes name, gender, major, birth_place, birth_date, phone#, and gpa
○ Gen(ai) = concept hierarchies on ai
○ Ui = attribute analytical thresholds for ai
○ Ti = attribute generalization thresholds for ai
○ R = attribute relevance threshold


1. Data collection
○ collect data for the target class (graduate students) and the contrasting class (undergraduate students)

2. Analytical generalization using Ui
○ attribute removal
- remove name and phone#
○ attribute generalization
- generalize major, birth_place, birth_date and gpa
- accumulate counts
○ candidate relation: gender, major, birth_country, age_range and gpa
Example: Analytical characterization (2)
gender major birth_country age_range gpa count
M Science Canada 20-25 Very_good 16
F Science Foreign 25-30 Excellent 22
M Engineering Foreign 25-30 Excellent 18
F Science Foreign 25-30 Excellent 25
M Science Canada 20-25 Excellent 21
F Engineering Canada 20-25 Excellent 18
Candidate relation for Target class: Graduate students (total count = 120)

gender major birth_country age_range gpa count


M Science Foreign <20 Very_good 18
F Business Canada <20 Fair 20
M Business Canada <20 Fair 22
F Science Canada 20-25 Fair 24
M Engineering Foreign 20-25 Very_good 22
F Engineering Canada <20 Excellent 24

Candidate relation for Contrasting class: Undergraduate students (total count = 130)

Example: Analytical characterization (3)

3. Relevance analysis
• Calculate the expected information required to classify an arbitrary tuple:

I(s1, s2) = I(120, 130) = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988

• Calculate the entropy of each attribute, e.g. major:

For major = “Science”: s11 = 84, s21 = 42, I(s11, s21) = 0.9183
For major = “Engineering”: s12 = 36, s22 = 46, I(s12, s22) = 0.9892
For major = “Business”: s13 = 0, s23 = 42, I(s13, s23) = 0

(Here s1j is the number of graduate students with the j-th value of major and s2j the number of undergraduate students, e.g. 84 graduate and 42 undergraduate students in “Science”.)

Example: Analytical characterization (4)

• Calculate the expected information required to classify a given sample if S is partitioned according to the attribute:

E(major) = (126/250) I(s11, s21) + (82/250) I(s12, s22) + (42/250) I(s13, s23) = 0.7873

• Calculate the information gain for each attribute:

Gain(major) = I(s1, s2) - E(major) = 0.2115

• Information gain for all attributes:

Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(major) = 0.2115
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971

Example: Analytical characterization (5)

4. Initial working relation (W0) derivation


• R = 0.1
• remove irrelevant/weakly relevant attributes from the candidate relation => drop gender, birth_country
• remove the contrasting class candidate relation
major age_range gpa count
Science 20-25 Very_good 16
Science 25-30 Excellent 47
Science 20-25 Excellent 21
Engineering 20-25 Excellent 18
Engineering 25-30 Excellent 18

Initial target class working relation W0: Graduate students


5. Perform attribute-oriented induction on W0 using Ti
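The entropy and information gain figures above can be reproduced with a short Python sketch (the class counts come from the candidate relations; the helper function is my own):

    import math

    def info(counts):
        # Expected information I(s1, s2, ...) for a class distribution.
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    i_total = info([120, 130])      # 120 graduate, 130 undergraduate -> ~0.9988

    # Per-value (graduate, undergraduate) counts for 'major'.
    major = {"Science": (84, 42), "Engineering": (36, 46), "Business": (0, 42)}
    n = 250
    e_major = sum((g + u) / n * info([g, u]) for g, u in major.values())  # ~0.7873
    gain_major = i_total - e_major  # ~0.2115

    print(round(i_total, 4), round(e_major, 4), round(gain_major, 4))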

4.
Interval-Scaled Variables:
• Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, and latitude and longitude coordinates. The measurement unit used can affect the clustering analysis - for example, changing measurement units from meters to inches for height.
• How can the data for a variable be standardized? To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be performed as follows:
○ Calculate the mean absolute deviation:
sf = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|)
where mf = (1/n)(x1f + x2f + ... + xnf) is the mean of variable f.
○ Calculate the standardized measurement (z-score):
zif = (xif - mf) / sf
• Using the mean absolute deviation is more robust than using the standard deviation.
Binary Variables:
• A contingency table for binary data, where q, r, s and t count the variables on which objects i and j take the value combinations shown (p = q + r + s + t):

               Object j
                1     0    sum
Object i   1    q     r    q+r
           0    s     t    s+t
          sum  q+s   r+t    p

• Simple matching coefficient (invariant, if the binary variable is symmetric):
d(i, j) = (r + s) / (q + r + s + t)
• Jaccard coefficient (noninvariant if the binary variable is asymmetric):
d(i, j) = (r + s) / (q + r + s)
The Jaccard coefficient gives the dissimilarity between asymmetric binary variables, since negative (0-0) matches carry no information.

Nominal Variables:
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
○ d(i, j) = (p - m) / p, where m: # of matches, p: total # of variables
• Method 2: use a large number of binary variables
○ creating a new binary variable for each of the M nominal states
Ratio-Scaled Variables:
• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale. Methods:
○ treat them like interval-scaled variables - not a good choice, since the scale may be distorted!
○ apply a logarithmic transformation: yif = log(xif)
○ treat them as continuous ordinal data and treat their rank as interval-scaled
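A brief Python sketch tying the variable types together (the function names and input values are invented for illustration):

    import math

    def z_scores(xs):
        # Interval-scaled: standardize using the mean absolute deviation.
        m = sum(xs) / len(xs)
        s = sum(abs(x - m) for x in xs) / len(xs)
        return [(x - m) / s for x in xs]

    def jaccard_dissim(a, b):
        # Asymmetric binary: d = (r + s) / (q + r + s); 0-0 matches ignored.
        q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
        r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
        s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
        return (r + s) / (q + r + s)

    def nominal_dissim(a, b):
        # Nominal: d = (p - m) / p, with m matches out of p variables.
        m = sum(1 for x, y in zip(a, b) if x == y)
        return (len(a) - m) / len(a)

    print(jaccard_dissim([1, 0, 1, 0], [1, 1, 0, 0]))  # 2/3
    # Ratio-scaled: log-transform first, then treat as interval-scaled.
    print(z_scores([math.log(x) for x in [10.0, 100.0, 1000.0]]))  # [-1.5, 0.0, 1.5]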

5.
Benefits of having a data warehouse architecture
Provides an organizing framework - the architecture draws the lines on the map in terms
of what the individual components are, how they fit together, who owns what parts, and
priorities.
Improved flexibility and maintenance - allows you to quickly add new data sources; interface standards allow plug and play; and the model and metadata allow impact analysis and single-point changes.
Faster development and reuse - warehouse developers are better able to understand the data warehouse process, database contents, and business rules more quickly.
Management and communications tool - define and communicate direction and scope to
set expectations, identify roles and responsibilities, and communicate requirements to
vendors.
Coordinate parallel efforts - multiple, relatively independent efforts have a chance to
converge successfully. Also, data marts without architecture become the stovepipes of
tomorrow.

Section C: Applied Theory


6.
In the information systems world, an architecture adds value in much the same way as a blueprint for a construction project. An effective architecture will increase the flexibility of the system, facilitate learning, and improve productivity.
For data warehousing, the architecture is a description of the elements and services of the
warehouse, with details showing how the components will fit together and how the
system will grow over time. Like the house analogy, the warehouse architecture is a set
of documents, plans, models, drawings, and specifications, with separate sections for
each key component area and enough detail to allow their implementation by skilled
professionals.
Key Component Areas
• A complete data warehouse architecture includes data and technical elements.
• Thornthwaite breaks down the architecture into three broad areas.
• The first, data architecture, is centered on business processes.
• The next area, infrastructure, includes hardware, networking, operating systems, and desktop machines.
• Finally, the technical area encompasses the decision-making technologies that will be needed by the users, as well as their supporting structures.

Data Architecture (Columns)


The data architecture portion of the overall data warehouse architecture is driven by
business processes. For example, in a manufacturing environment the data model might
include orders, shipping, and billing. Each area draws on a different set of dimensions.
But where dimensions intersect in the data model, these data items should have a common structure and content, and involve a single process to create and maintain them.
Business requirements essentially drive the architecture, so talk to business managers,
analysts, and power users. From your interviews look for major business issues, as well
as indicators of business strategy, direction, frustrations, business processes, timing,
availability, and performance expectations. Document everything well.
From an IT perspective, talk to existing data warehouse/DSS support staff, OLTP
application groups, and DBAs; as well as networking, OS, and desktop support staff.
Also speak with architecture and planning professionals. Here you want to get their
opinions on data warehousing considerations from the IT viewpoint. Learn if there are
existing architecture documents, IT principles, organizational power centers, etc.
Not many standards exist for data warehousing, but there are standards for a lot of the
components. The following are some to keep in mind:
• Middleware - ODBC, OLE, OLE DB, DCE, ORBs, and JDBC.
• Database connectivity - ODBC, JDBC, OLE DB, and others.
• Data management - ANSI SQL and FTP.
• Network access - DCE, DNS, and LDAP.
Technical Architecture
When you develop the technical architecture model, draft the architecture requirements
document first.
Next to each business requirement write down its architecture implications. Group these
implications according to architecture areas (remote access, staging, data access tools,
etc.). Understand how each area fits in with the others.
Capture the definition of the area and its contents. Then refine and document the model.
Technical Architecture covers the processes and tools we apply to the data. This area
answers the question “how”: how do we get at this data at its source, put it in a form that meets the business requirements, and move it to a place that is accessible? The technical architecture is made
up of the tools, utilities, code, and so on that bring the warehouse to life.
Two main subsets of the technical architecture area have requirements different enough to warrant independent consideration. These two areas are the back room and the front room.
The back room is the part responsible for gathering and preparing the data. Another common term for the back room is data staging.
The front room is the part responsible for delivering data to the user community. Another common term for the front room is data access.
Infrastructure Architecture Area.
Infrastructure is about the platforms that host the data and processes.
The infrastructure is the physical plant of the data warehouse.
Defining the Levels of Detail (the rows)
Business Requirements level.
The business requirements level is explicitly non-technical.
The systems planner must understand the major business forces and boundary conditions
that affect the data warehouse project.
Architectural Level
Architecture models are the first level of serious response to the requirements.
An architecture model proposes the major components of the architecture that must be
available to address the requirements.
At this level the system perspective addresses whether the various technical components
can communicate with each other or not.
Detailed Models Level
Detailed models are the functional specifications of each of the architectural components, at a significant level of detail.
The detailed models must include enough information to serve as a reliable implementation guide for the team members.
Detailed models must also be complete enough to serve as a contract, so that when the work is done it can be held up to the functional specification to see whether the implementation is complete and conforms to the specification.
Implementation Level
The implementation level is the response to the detailed models. For a software deliverable, it is the code itself. For the data area, it is the data definition language used to build the database.
At the implementation level, all of the above must be documented.
7.
a. A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. Factless fact tables are often used to record events or coverage information.
Common examples of factless fact tables include:
• Identifying product promotion events (to determine promoted products that didn’t sell)
• Tracking student attendance or registration events
• Tracking insurance-related accident events
• Identifying building, facility, and equipment schedules for a hospital or university
b. Web mining, viewed in data mining terms, can be said to have three operations of interest: clustering (finding natural groupings of users, pages, etc.), associations (which URLs tend to be requested together), and sequential analysis (the order in which URLs tend to be accessed). As in most real-world problems, the clusters and associations in Web mining do not have crisp boundaries and often overlap considerably. In addition, bad exemplars (outliers) and incomplete data can easily occur in the data set, due to a wide variety of reasons inherent to web browsing and logging. Thus, Web mining and personalization require modeling of an unknown number of overlapping sets in the presence of significant noise and outliers (i.e., bad exemplars). Moreover, the data sets in Web mining are extremely large.
c. Market basket analysis studies the buying habits of customers by searching for sets of items that are frequently purchased together or in sequence.
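For example, the support and confidence of a candidate rule can be counted directly from the transactions (the baskets below are invented):

    # Invented transactions; measure the rule {bread} -> {butter}.
    baskets = [{"bread", "butter"}, {"bread", "milk"},
               {"bread", "butter", "milk"}, {"milk"}]

    both = sum(1 for b in baskets if {"bread", "butter"} <= b)
    bread = sum(1 for b in baskets if "bread" in b)

    support = both / len(baskets)   # 2/4 = 0.5
    confidence = both / bread       # 2/3, about 0.67
    print(support, confidence)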
