
Data Warehousing - 1

What is a Data Warehouse?


 Defined in many different ways, but not rigorously.
◦ A decision support database that is maintained
separately from the organization’s operational database
◦ Supports information processing by providing a solid
platform of consolidated, historical data for analysis.

 “A data warehouse is a subject-oriented, integrated, time-variant,


and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon

 Data warehousing:
◦ The process of constructing and using data warehouses

01/03/23 2
Data Warehouse: Subject-Oriented
 Organized around major subjects, such as customer, product,
sales.
 Focuses on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
 Provides a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process.

Data Warehouse—Integrated
 Constructed by integrating multiple, heterogeneous data
sources
◦ relational databases, flat files, on-line transaction
records
 Data cleaning and data integration techniques are applied.
◦ Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
 E.g., Hotel price: currency, tax, breakfast covered, etc.
◦ When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
 The time horizon for the data warehouse is significantly longer
than that of operational systems.
◦ Operational database: current value data.
◦ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
◦ Contains an element of time, explicitly or implicitly
◦ But the key of operational data may or may not
contain “time element”.

Data Warehouse—Non-Volatile

 A physically separate store of data transformed from the
operational environment.
 Operational update of data does not occur in the data
warehouse environment.
◦ Does not require transaction processing, recovery,
and concurrency control mechanisms.
◦ Requires only two operations in data accessing:
 initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS

 Traditional heterogeneous DB integration:


◦ Build wrappers/mediators on top of heterogeneous databases
◦ Query driven approach
 When a query is posed to a client site, a meta-dictionary is
used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results are
integrated into a global answer set
 Complex information filtering, compete for resources
 Data warehouse: update-driven, high performance
◦ Information from heterogeneous sources is integrated in advance and
stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS
 OLTP (on-line transaction processing)
◦ Major task of traditional relational DBMS
◦ Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
◦ Major task of data warehouse system
◦ Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
◦ User and system orientation: customer vs. market
◦ Data contents: current, detailed vs. historical, consolidated
◦ Database design: ER + application vs. star + subject
◦ View: current, local vs. evolutionary, integrated
◦ Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

                    OLTP                        OLAP
users               clerk, IT professional      knowledge worker
function            day-to-day operations       decision support
DB design           application-oriented        subject-oriented
data                current, up-to-date,        historical, summarized,
                    detailed, flat relational,  multidimensional,
                    isolated                    integrated, consolidated
usage               repetitive                  ad hoc
access              read/write,                 lots of scans
                    index/hash on primary key
unit of work        short, simple transaction   complex query
# records accessed  tens                        millions
# users             thousands                   hundreds
DB size             100 MB to GB                100 GB to TB
metric              transaction throughput      query throughput, response time

Why Separate Data Warehouse?
 High performance for both systems
◦ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
◦ Warehouse—tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation.

 Different functions and different data:


◦ missing data: Decision support requires historical data which operational
DBs do not typically maintain
◦ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
◦ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled

A Multidimensional Data Model
 Data warehouses and OLAP tools are based on a
multidimensional data model.

 This model views data in the form of a data cube.

 Data cube allows data to be modeled and viewed in multiple
dimensions.

 Data cube defined by dimensions and facts.

 Facts are numerical measures.


◦ dollars sold (sales amount in dollars), units sold (number of units sold),
and amount budgeted.
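As a minimal sketch (not from the slides, names illustrative), a data cube can be modeled as a mapping from dimension-value tuples to fact measures:

```python
# A data cube cell maps a point in dimension space (time, item, location)
# to its numerical measures (facts).
cube = {
    ("Q1", "computer", "Toronto"):   {"dollars_sold": 1000, "units_sold": 10},
    ("Q1", "computer", "Vancouver"): {"dollars_sold": 600,  "units_sold": 6},
    ("Q2", "phone",    "Toronto"):   {"dollars_sold": 400,  "units_sold": 8},
}

def total(measure):
    """Aggregate one numerical measure over every cell of the cube."""
    return sum(cell[measure] for cell in cube.values())

print(total("dollars_sold"))  # 2000
print(total("units_sold"))    # 24
```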

Data Cubes
• Dimensions are the perspectives or entities with respect to
which an organization wants to keep records.
– sales data warehouse: time, item, branch, and location.

• Each dimension may have a table associated with it, called a


dimension table, which further describes the dimension.
– For example, a dimension table for item may contain the attributes item name, brand,
and type.

• Dimension tables can be specified by users or experts, or


automatically generated and adjusted based on data
distributions.

• The fact table contains the names of the facts, or measures,


as well as keys to each of the related dimension tables.
A Simple 2-D Data Cube
[figure]

3-D Data Cube
[figure]

Conceptual Representation of 3-D Data Cube
[figure]

4-D Data Cube Representation
[figure]

Base and Apex Cuboids
 The important thing to remember is that data cubes are
n-dimensional and do not confine data to 3-D.

 In the data warehousing research literature, a data cube such as
each of the above is often referred to as a cuboid.

 The cuboid that holds the lowest level of summarization is called


the base cuboid.
 The 0-D cuboid, which holds the highest level of summarization,
is called the apex cuboid.
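The lattice of cuboids can be sketched by enumerating every subset of a cube's dimensions; an n-D cube yields 2^n cuboids. Dimension names below are illustrative:

```python
from itertools import combinations

# Hypothetical 4-D cube over these dimensions:
dimensions = ("time", "item", "location", "supplier")

# Each cuboid summarizes the data by one subset of the dimensions.
cuboids = [subset
           for k in range(len(dimensions) + 1)
           for subset in combinations(dimensions, k)]

base_cuboid = dimensions  # lowest level of summarization (all dimensions)
apex_cuboid = ()          # 0-D cuboid: highest level (grand total)

print(len(cuboids))  # 16 = 2^4
```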

Cube: A Lattice of Cuboids
[figure]

Conceptual Modeling of Data
Warehouses
 The entity-relationship data models are appropriate for on-line
transaction processing.

A data warehouse requires a concise, subject-oriented schema


that facilitates on-line data analysis.

 The most popular data model for a data warehouse is a
multidimensional model.
◦ Exists in the form of star schema, a snowflake schema, or a fact
constellation schema.

Star Schema
 The most common modeling paradigm.

 Data warehouse contains


◦ a large central table (fact table) containing the bulk of
the data, with no redundancy,
◦ a set of smaller attendant tables (dimension tables),
one for each dimension.

 The schema graph resembles a starburst, with the dimension


tables displayed in a radial pattern around the central fact table.

Example of Star Schema

Sales Fact Table
◦ keys: time_key, item_key, branch_key, location_key
◦ measures: units_sold, dollars_sold, avg_sales

Dimension tables
◦ time: time_key, day, day_of_the_week, month, quarter, year
◦ item: item_key, item_name, brand, type, supplier_type
◦ branch: branch_key, branch_name, branch_type
◦ location: location_key, street, city, province_or_state, country

Star Schema
• In the star schema, each dimension is represented by only one
table, and each table contains a set of attributes.

• For example, the location dimension table contains the attribute


set location_key, street, city, state, country.

• This constraint may introduce some redundancy.

• For example, “Vijayawada” and “Hyderabad” are both cities in


the Andhra Pradesh state of India. Entries for such cities in the
location dimension table will create redundancy among the
attributes state and country, that is, (..., Hyderabad, Andhra
Pradesh, India) and (..., Vijayawada, Andhra Pradesh, India).

• Moreover, the attributes within a dimension table may form


either a hierarchy (total order) or a lattice (partial order).
Snowflake Schema
 The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further
splitting the data into additional tables.

 The dimension tables are kept in normalized form to reduce


redundancies.
 Such a table is easy to maintain and saves storage space.

 However, this saving of space is negligible in comparison to the magnitude of the fact table.


 The snowflake structure can reduce the effectiveness of
browsing, since more joins will be needed to execute a query.
 The system performance may be adversely impacted.

Example of Snowflake Schema

Sales Fact Table
◦ keys: time_key, item_key, branch_key, location_key
◦ measures: units_sold, dollars_sold, avg_sales

Dimension tables (normalized)
◦ time: time_key, day, day_of_the_week, month, quarter, year
◦ item: item_key, item_name, brand, type, supplier_key
◦ supplier: supplier_key, supplier_type
◦ branch: branch_key, branch_name, branch_type
◦ location: location_key, street, city_key
◦ city: city_key, city, province_or_state, country

Fact Constellation
• Sophisticated applications may require multiple fact tables to
share dimension tables.

• This kind of schema can be viewed as a collection of stars, and


hence is called a galaxy schema or a fact constellation.

• A fact constellation schema allows dimension tables to be


shared between fact tables.

• For example, the dimension tables for time, item, and location
are shared between both the sales and shipping fact tables.
– See the figure…

Example of Fact Constellation

Sales Fact Table
◦ keys: time_key, item_key, branch_key, location_key
◦ measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
◦ keys: time_key, item_key, shipper_key, from_location, to_location
◦ measures: dollars_cost, units_shipped

Dimension tables (time, item, and location are shared by both fact tables)
◦ time: time_key, day, day_of_the_week, month, quarter, year
◦ item: item_key, item_name, brand, type, supplier_type
◦ branch: branch_key, branch_name, branch_type
◦ location: location_key, street, city, province_or_state, country
◦ shipper: shipper_key, shipper_name, location_key, shipper_type

Data Marts
A data warehouse collects information about subjects that
span the entire organization, such as customers, items, sales,
assets, and personnel, and thus its scope is enterprise-wide.
◦ For data warehouses, the fact constellation schema is commonly used,
since it can model multiple, interrelated subjects.

A data mart, on the other hand, is a departmental subset of the
data warehouse that focuses on selected subjects, and thus its
scope is department-wide.
◦ For data marts, the star or snowflake schema are commonly used, since
both are geared toward modeling single subjects, although the star
schema is more popular and efficient.

DMQL: Language Primitives
A data mining query language (DMQL) can be used to specify
data mining tasks.
 Data warehouses and data marts can be defined using two
language primitives.

 The cube definition statement has the following syntax:


define cube <cube name> [<dimension list>]: <measure list>

 The dimension definition statement has the following syntax:


define dimension <dimension name> as (<attribute or
dimension list>)

Defining a Star Schema in
DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)

Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))

Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in
cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Measures: Their Categorization and
Computation
• A data cube measure is a numerical function that can be
evaluated at each point in the data cube space.

• A measure value is computed for a given point by aggregating


the data corresponding to the respective dimension-value pairs
defining the given point.

• Based on the kind of aggregate functions used, measures can


be organized into three categories:
– distributive, algebraic, holistic

• Most large data cube applications require efficient computation


of distributive and algebraic measures.

Measures: Three Categories
• Distributive: if the result derived by applying the function to n
aggregate values is the same as that derived by applying the
function on all the data without partitioning.
• count(), sum(), min(), max().
• Algebraic: if it can be computed by an algebraic function with
M arguments (where M is a bounded integer), each of which is
obtained by applying a distributive aggregate function.
• avg(), min_N(), standard_deviation().
• Holistic: if there does not exist an algebraic function with M
arguments (where M is a constant) that characterizes the
computation.
• median(), mode(), rank().
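The three categories can be checked directly on a small data set. This sketch (data values are illustrative) shows why sum() is distributive, avg() is algebraic with M = 2 arguments, and median() is holistic:

```python
import statistics

data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:4], data[4:]]  # e.g., one chunk per warehouse partition

# Distributive: aggregating partial sums equals summing all the data at once.
assert sum(sum(p) for p in partitions) == sum(data)

# Algebraic: avg() is computable from two distributive results per partition,
# (sum, count), so M = 2 is a bounded integer.
pairs = [(sum(p), len(p)) for p in partitions]
avg = sum(s for s, _ in pairs) / sum(n for _, n in pairs)
assert avg == sum(data) / len(data)

# Holistic: the median of partition medians is NOT the overall median;
# no bounded set of sub-aggregates characterizes the computation.
assert statistics.median(
    [statistics.median(p) for p in partitions]) != statistics.median(data)

print(avg)                       # 3.875
print(statistics.median(data))   # 3.5
```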

Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set
of low-level concepts to higher-level, more general concepts.
 Many concept hierarchies are implicit within the database
schema.
◦ Location: street < city < province or state < country.
◦ Time: day < {month <quarter; week} < year
 Concept hierarchies may also be defined by discretizing or
grouping values for a given dimension or attribute, resulting in
a set-grouping hierarchy.
 There may be more than one concept hierarchy for a given
attribute or dimension, based on different user viewpoints.
◦ Provided manually by system users, domain experts, or knowledge
engineers, or may be automatically generated based on statistical analysis
of the data distribution.
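A concept hierarchy is essentially a chain of mappings from lower-level to higher-level concepts. The sketch below (street and city values are illustrative, not from a real schema) shows both a schema-derived hierarchy for location and a set-grouping hierarchy for price:

```python
# Schema hierarchy for location: street < city < province_or_state < country.
city_of = {"MG Road": "Vijayawada", "Tank Bund": "Hyderabad"}
state_of = {"Vijayawada": "Andhra Pradesh", "Hyderabad": "Andhra Pradesh"}
country_of = {"Andhra Pradesh": "India"}

def generalize(street):
    """Climb the hierarchy from a low-level concept to the most general one."""
    city = city_of[street]
    state = state_of[city]
    return country_of[state]

# Set-grouping hierarchy: discretize a numeric attribute (cutoffs assumed).
def price_range(price):
    return "low" if price < 100 else "medium" if price < 500 else "high"

print(generalize("MG Road"))  # India
print(price_range(250))       # medium
```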
Concept Hierarchy of location
[figure]

Hierarchical and lattice structures of attributes
[figure]

Concept hierarchy of price
[figure]

OLAP Operations in the
Multidimensional Data Model
 In the multidimensional model, data are organized into
multiple dimensions, and each dimension contains multiple
levels of abstraction defined by concept hierarchies.
 This organization provides users with the flexibility to view
data from different perspectives.
 A number of OLAP data cube operations exist to materialize
these different views, allowing interactive querying and
analysis of the data at hand.
 OLAP provides a user-friendly environment for interactive
data analysis.

Data cube for sales
[figure]

Roll-up
• The roll-up operation performs aggregation on a data
cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction.

• Also called the drill-up operation.

• When roll-up is performed by dimension reduction,


one or more dimensions are removed from the given
cube.
– Location hierarchy: “street < city < province or state < country.”
– The roll-up operation shown aggregates the data by ascending the location
hierarchy from the level of city to the level of country.
– In other words, rather than grouping the data by city, the resulting cube
groups the data by country.
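The roll-up from city to country described above amounts to re-aggregating the measure under the higher-level grouping. A minimal sketch with illustrative sales figures:

```python
from collections import defaultdict

# Sales cells at the city level of the location hierarchy (values assumed).
sales_by_city = {"Toronto": 1200, "Vancouver": 800,
                 "Chicago": 500, "New York": 700}
country_of = {"Toronto": "Canada", "Vancouver": "Canada",
              "Chicago": "USA", "New York": "USA"}

# Roll-up: climb location from city to country and re-aggregate the measure.
sales_by_country = defaultdict(int)
for city, dollars in sales_by_city.items():
    sales_by_country[country_of[city]] += dollars

print(dict(sales_by_country))  # {'Canada': 2000, 'USA': 1200}
```

Drill-down is the reverse mapping: it would replace each country total with the finer-grained city cells.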

roll-up on location
(from cities to countries)
[figure]
Drill-down
 Drill-down is the reverse of roll-up.

 It navigates from less detailed data to more detailed data.

 Drill-down can be realized by either stepping down a concept


hierarchy for a dimension or introducing additional
dimensions.

 Because a drill-down adds more detail to the given data, it can
also be performed by adding new dimensions to a cube.
drill-down on time
(from quarters to months)
[figure]
Slice, Dice and Pivot
 The slice operation performs a selection on one dimension of
the given cube, resulting in a subcube.

 The dice operation defines a subcube by performing a


selection on two or more dimensions.

 Pivot (rotate): Pivot (also called rotate) is a visualization


operation that rotates the data axes in view in order to provide
an alternative presentation of the data.
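Slice and dice are selections over the cube's dimension space: slice fixes one dimension, dice restricts two or more. A sketch over illustrative cells:

```python
# Cube cells as (time, item, location) -> units_sold (values assumed).
cells = {
    ("Q1", "computer", "Toronto"):   10,
    ("Q1", "phone",    "Vancouver"):  6,
    ("Q2", "computer", "Toronto"):    8,
    ("Q3", "phone",    "New York"):   4,
}

# Slice: select on ONE dimension (time = "Q1"), yielding a 2-D subcube.
slice_q1 = {(i, l): v for (t, i, l), v in cells.items() if t == "Q1"}

# Dice: select on TWO OR MORE dimensions.
dice = {k: v for k, v in cells.items()
        if k[0] in ("Q1", "Q2") and k[2] in ("Toronto", "Vancouver")}

print(slice_q1)   # {('computer', 'Toronto'): 10, ('phone', 'Vancouver'): 6}
print(len(dice))  # 3
```

Pivot does not change the cells at all; it only reorders which dimensions run along which display axes.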

slice for time
[figure]
dice for
(location = "Toronto" or "Vancouver") and
(time = "Q1" or "Q2") and
(item = "home entertainment" or "computer")
[figure]
Pivot
[figure]
Other OLAP operations
 Some OLAP systems offer additional drilling
operations.
◦ drill across: involving (across) more than one
fact table
◦ drill through: through the bottom level of the
cube to its back-end relational tables (using
SQL)

Data Warehouse Architecture
 How to design and construct a data warehouse.
 Three-tier data warehouse architecture.
 Back-end tools and utilities for data warehouses.
 Metadata repository.
 Various types of warehouse servers for OLAP
processing.

Design of a Data Warehouse: A
Business Analysis Framework
 Four views regarding the design of a data warehouse
◦ Top-down view
 allows selection of the relevant information necessary for the
data warehouse
◦ Data source view
 exposes the information being captured, stored, and managed
by operational systems
◦ Data warehouse view
 consists of fact tables and dimension tables
◦ Business query view
 sees the perspectives of data in the warehouse from the
viewpoint of the end user
Data Warehouse Design Process
 Top-down, bottom-up approaches or a combination of both
◦ Top-down: Starts with overall design and planning (mature)
◦ Bottom-up: Starts with experiments and prototypes (rapid)
 From software engineering point of view
◦ Waterfall: structured and systematic analysis at each step before
proceeding to the next
◦ Spiral: rapid generation of increasingly functional systems, with
short turnaround times
 Typical data warehouse design process
◦ Choose a business process to model, e.g., orders, invoices, etc.
◦ Choose the grain (atomic level of data) of the business process
◦ Choose the dimensions that will apply to each fact table record
◦ Choose the measure that will populate each fact table record

Back-End Tools and Utilities
Three-Tier Architecture
[figure]

Three-Tier Architecture

◦ Data Sources: operational DBs and other external sources, fed through
Extract / Transform / Load / Refresh utilities, with a monitor and
integrator maintaining metadata
◦ Data Storage: the data warehouse and data marts
◦ OLAP Engine: an OLAP server over the stored data
◦ Front-End Tools: analysis, query/reporting, and data mining tools

Three-Tier Architecture
 The bottom tier is a warehouse database server that is almost
always a relational database system.
 Back-end tools and utilities are used to feed data into the bottom
tier from operational databases or other external sources.
 The data are extracted using application program interfaces known
as gateways.
 A gateway is supported by the underlying DBMS and allows client
programs to generate SQL code to be executed at a server.
◦ ODBC (Open Database Connectivity)
◦ OLE DB (Object Linking and Embedding, Database)
◦ JDBC (Java Database Connectivity)

Back-End Tools and Utilities
 Data warehouse systems use back-end tools and utilities to
populate and refresh their data.
 These tools and utilities include the following functions:

◦ Data extraction
◦ Data cleaning
◦ Data transformation
◦ Load
◦ Refresh
 Data cleaning and data transformation are important steps in
improving the quality of the data and, subsequently, of the data
mining results.
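These back-end functions can be sketched as a toy pipeline. The record layout and cleaning/transformation rules below are assumptions for illustration, not prescribed by the slides:

```python
# Dirty source rows with inconsistent naming and a non-numeric measure.
raw = [
    {"city": " toronto ", "amount": "1200"},
    {"city": "Toronto",   "amount": "800"},
    {"city": None,        "amount": "x"},    # dirty record
]

def extract(source):
    # pull rows out of the operational source (here, just a list)
    return list(source)

def clean(rows):
    # drop rows with missing keys or non-numeric measures
    return [r for r in rows if r["city"] and r["amount"].isdigit()]

def transform(rows):
    # reconcile naming conventions and encode measures consistently
    return [{"city": r["city"].strip().title(), "amount": int(r["amount"])}
            for r in rows]

def load(rows, warehouse):
    # aggregate into the warehouse store (a dict standing in for a fact table)
    for r in rows:
        warehouse[r["city"]] = warehouse.get(r["city"], 0) + r["amount"]

warehouse = {}
load(transform(clean(extract(raw))), warehouse)
print(warehouse)  # {'Toronto': 2000}
```

Refresh would re-run this pipeline against whatever changed in the sources.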

Metadata Repository
 Metadata are data about data.
 In a data warehouse, metadata are the data that define warehouse
objects.
 A metadata repository should contain the following:

◦ A description of the structure of the data warehouse


◦ Operational metadata
◦ The algorithms used for summarization
◦ The mapping from the operational environment to the
data warehouse
◦ Data related to system performance
◦ Business metadata

Three-Tier Architecture
• The middle tier is an OLAP server that is typically implemented
using
– a relational OLAP (ROLAP) model or
– a multidimensional OLAP (MOLAP) model.

• ROLAP model is an extended relational DBMS that maps operations


on multidimensional data to standard relational operations.

• MOLAP model, that is, a special-purpose server that directly


implements multidimensional data and operations.

• The top tier is a front-end client layer, which contains query and
reporting tools, analysis tools, and/or data mining tools (e.g., trend
analysis, prediction, and so on).

Three Data Warehouse Models
 From the architecture point of view, there are three data warehouse models:
enterprise warehouse, data mart, and virtual warehouse.
 Enterprise warehouse
◦ Collects all of the information about subjects spanning the entire organization
 Data Mart
◦ A subset of corporate-wide data that is of value to a specific group of users.
Its scope is confined to specific, selected groups, such as a marketing data mart.
 Independent vs. dependent (data sourced directly from
warehouse) data mart
 Virtual warehouse
◦ A set of views over operational databases
◦ Only some of the possible summary views may be materialized

Data Warehouse Development
Approaches
Top-down approach:
◦ A systematic solution and minimizes integration problems.
◦ It is expensive, takes a long time to develop, and lacks
flexibility due to the difficulty in achieving consistency and
consensus for a common data model for the entire
organization.
Bottom-up approach:
◦ Develop, and deploy independent data marts.
◦ Provides flexibility, low cost, and rapid return of investment.
◦ It can lead to problems when integrating various disparate
data marts into a consistent enterprise data warehouse.

Data Warehouse Development: A
Recommended Approach

An incremental and evolutionary approach:
◦ Define a high-level corporate data model
◦ Develop data marts and an enterprise data warehouse in parallel, refining
the model as they are built
◦ Grow these into distributed data marts and, ultimately, a multi-tier data
warehouse

Types of OLAP Servers
 Logically, OLAP servers present business users with
multidimensional data from data warehouses or data marts, without
concerns regarding how or where the data are stored.

 Implementations of a warehouse server for OLAP processing


include the following types:
◦ Relational OLAP (ROLAP) servers
◦ Multidimensional OLAP (MOLAP) servers
◦ Hybrid OLAP (HOLAP) servers
◦ Specialized SQL servers

OLAP Servers
 Relational OLAP (ROLAP)
◦ These are the intermediate servers that stand in between a relational back-
end server and client front-end tools.
◦ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware to support missing pieces.
 Greater scalability

OLAP Servers
• Multidimensional OLAP (MOLAP)
– Array-based multidimensional storage engine (sparse matrix techniques)
 Faster Computation
– These will be discussed in further detail in subsequent sections.

• Hybrid OLAP (HOLAP)


– Combines both ROLAP and MOLAP Technology.
– Provides greater scalability and faster computation.
– User flexibility, e.g., low level: relational, high-level: array
– Detailed data is stored in a relational database, while the summarized
data is stored in a separate MOLAP store (arrays)

• Specialized SQL servers


– specialized support for SQL queries over star/snowflake schemas

From Data Warehousing to Data Mining
Data Warehouse applications
◦ Information processing
 supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
◦ Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling, pivoting
◦ Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
From Data Warehousing to Data Mining

 Information processing is not data mining!!


◦ Information processing, based on queries, can find
useful information.
◦ Answers to such queries reflect the information
directly stored in databases or computable by
aggregate functions. They do not reflect sophisticated
patterns or regularities buried in the database.

From Data Warehousing to Data Mining
• On-line analytical processing vs. data mining
– OLAP is a data summarization/aggregation tool that helps simplify data
analysis, whereas data mining allows the automated discovery of
implicit patterns and interesting knowledge hidden in large amounts of
data.
– OLAP tools are targeted toward simplifying and supporting interactive
data analysis, whereas the data mining tools aim to automate as much of
the process as possible, while still allowing users to guide the process.
– OLAP systems can present general descriptions of data from data
warehouses (by drilling, pivoting, slicing, dicing, and other operations),
whereas data mining is not confined to the analysis of data stored in
data warehouses.
– Data mining involves more automated and deeper analysis than OLAP.
• Hence data mining is one step beyond traditional
OLAP.
From On-Line Analytical Processing to
On Line Analytical Mining (OLAM)
 Why online analytical mining?
◦ High quality of data in data warehouses
 DW contains integrated, consistent, cleaned data
◦ Available information processing structure surrounding data warehouses
 ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
◦ OLAP-based exploratory data analysis
 mining with drilling, dicing, pivoting, etc.
◦ On-line selection of data mining functions
 integration and swapping of multiple mining functions,
algorithms, and tasks.
 Architecture of OLAM
An OLAM Architecture

◦ Layer 4 (User Interface): mining queries in, mining results out, via a GUI API
◦ Layer 3 (OLAP/OLAM): an OLAM engine alongside an OLAP engine, both over a
Data Cube API
◦ Layer 2 (MDDB): the multidimensional database, with its metadata
◦ Layer 1 (Data Repository): databases and the data warehouse, populated
through data cleaning, data integration, and filtering via a Database API