Professional Documents
Culture Documents
Introduction:
Data warehouses generalize and consolidate data in multidimensional space.
The construction of data warehouses involves data cleaning, data integration, and
data transformation, and can be viewed as an important preprocessing step for data
mining.
Moreover, data warehouses provide online analytical processing (OLAP) tools for the
interactive analysis of multidimensional data of varied granularities, which facilitates
effective data generalization and data mining.
Many other data mining functions, such as association, classification, prediction, and
clustering, can be integrated with OLAP operations to enhance interactive mining of
knowledge at multiple levels of abstraction.
Data warehousing:
OLTP OLAP
Orientation Transaction Analysis
Users clerk, DBA knowledge workers (e.g., manager,
executive, analyst)
Function day to day operations decision support
DB design ER based, application- star/snowflake, subject-oriented
oriented
Data current, up-to-date historical, summarized,
detailed, flat relational multidimensional
integrated, consolidated
View detailed, flat relational summarized, multidimensional
Access read/write Mostly read
Unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
Operations index/hash on primary key lots of scans
A major reason for such a separation is to help promote the High performance for
both systems
– DBMS— tuned for OLTP: access methods (Read/write), indexing on Primary
Key, concurrency control, recovery mechanism.
– Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view,
consolidation
Different functions and different data:
– Missing data: Decision support requires historical data which operational
DBs do not typically maintain
– Data consolidation: DS requires consolidation (aggregation, summarization)
of data from heterogeneous sources
– Data quality: Different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
2. A multi-dimensional data model
Data warehouses and OLAP tools are based on a multidimensional data model.
This model views data in the form of a data cube.
What is a data cube? A data cube allows data to be modeled and viewed in multiple
dimensions (Eg. 2D,3D etc.).
It is defined by dimensions and facts
A data cube, such as sales, allows data to be modeled and viewed in multiple
dimensions
o Fact table contains measures (such as dollars_sold) and keys to each of the
related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-
D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The
lattice of cuboids forms a data cube
A Lattice of Cuboids: Each cuboid represents a different degree of summarization
all
0-D(apex) cuboid
time,location,supplier
3-D cuboids
time,item,supplier item,location,supplier
4-D(base) cuboid
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional
Data Models
The most popular data model for a data warehouse is a multidimensional model, which can exist in the
formof a star schema, a snowflake schema, or a fact constellation schema
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2)
a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph
resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact
table.
Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The resulting
schema graph forms a shape similar to a snowflake.
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a
fact constellation.
A Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts.
Consider a concept hierarchy for the dimension location. City values for location include
Vancouver, Toronto, New York, and Chicago.
Each city, however, can be mapped to the province or state to which it belongs.
Typical OLAP Operations
In the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels ofabstraction defined by concept hierarchies.
This organization provides users with the flexibility to view data from different perspectives.
A number of OLAP data cube operations exist to materialize these different views, allowing
interactive querying and analysis of the data at hand.
Hence, OLAP provides a user-friendly environment for interactive data analysis.
• Reverse of roll-up
• From higher level summary to lower level summary or detailed data, or introducing new
dimensions
3. Dice: Dice operation defines a subcube by performing a selection on two or more
dimensions.
4. Slice: Slice operation performs a selection on one dimension of the given cube, resulting in a
subcube.
5. Pivot (rotate):
Visualization operation that rotates the data axes in view in order to provide an alternative
presentation of the data.
Iceberg Cube
Computing only the cuboid cells whose count or other aggregates satisfying the
condition like
HAVING COUNT(*) >= minsup
o Motivation
Only a small portion of cube cells may be “above the water’’ in a sparse
cube (For many cells in a cuboid, the measure (count or sum)value will be
zero. If zero valued tuples are stored in the cuboid, then we say that the
cuboid is sparse)
Only calculate “interesting” cells—data above certain threshold
Eg. Count >= 10 , sales $100.
Avoid explosive growth of the cube
Partially materialized cubes are known as iceberg cubes.
The minimum threshold is called the minimum support threshold, or minimum
support (min_sup)
Eg. Iceberg cube.
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
2. Join indexing method gained popularity from its use in relational database query
processing.
Join indexing registers the joinable rows of two relations from a relational
database.
The linkage between a fact table and its corresponding dimension tables
comprises the fact table’s foreign key and the dimension table’s primary key
Example: the “Main Street” value in the location dimension table joins with tuples
T57, T238, and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item
dimension table joins with tuples T57 and T459 of the sales fact table.
Suppose that there are 360 time values, 100 items, 50 branches, 30 locations, and 10
million sales tuples in the sales star data cube. If the sales fact table has recorded
sales for only 30 items, the remaining 70 items will obviously not participate in joins