You are on page 1of 23

Chapter-2

Data warehouse

1
What is Data Mining?

• Data mining is the process of identifying valid,


novel, useful and understandable patterns in large
amount of data.
• Also known as KDD (Knowledge Discovery in
Databases).
• “We’re drowning in information, but starving for
knowledge.” (John Naisbett)

2
What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
• A decision support methods that is maintained
separately from the organization’s operational
database
• Support information processing by providing a
solid platform of consolidated, historical data for
analysis.
• “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in
support of management’s decision-making process.”—
W. H. Inmon
• Data warehousing:
• The process of constructing and using data
warehouses 3
Data Warehouse—Subject-Oriented

• Organized around major subjects, such as customer,


product, sales.
• Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.
• Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process.
4
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous
data sources
• relational databases, flat files, on-line transaction
records
• Data cleaning and data integration techniques are
applied.
• Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
• When data is moved to the warehouse, it is
converted.
5
Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly
longer than that of operational systems.
• Operational database: current value data.
• Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not
contain “time element”. 6
Data Warehouse—Non-Volatile

• A physically separate store of data transformed from


the operational environment.
• Operational update of data does not occur in the data
warehouse environment.
• Does not require daily transaction processing,
recovery, and concurrency control mechanisms
• Requires only two operations in data accessing:
• initial loading of data and access of data.
7
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
• Major task of traditional relational DBMS
• Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
• Major task of data warehouse system
• Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
• User and system orientation: customer vs. market
• Data contents: current, detailed vs. historical, consolidated
• Database design: ER + application vs. star + subject
• View: current, local vs. evolutionary, integrated 8
• Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
(DB) OLTP (DW) OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response

9
Conceptual Modeling of Data
Warehouses

• Modeling data warehouses: dimensions & measures


• Star schema: A fact table in the middle connected to a set of
dimension tables

• Snowflake schema: A refinement of star schema where


some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to star.

10
When compared star with snowflake model, star
model is the best one, because the snowflake is the
normalized form to reduce redundancies.
-easy to maintain.
-save storage space.
-reduce the effectiveness of browsing.
-More joins will be needed to execute the query.

11
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold province_or_street
country
avg_sales
Measures
12
Example of Snowflake Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key

branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key city
branch_type
dollars_sold
city_key
avg_sales city
province_or_street
Measures country

13
A Concept Hierarchy: Dimension (location)

all all

region Europe ... North_America

country Germany ... Spain Canada ... Mexico

city Frankfurt ... Vancouver ... Toronto

office L. Chan ... M. Wind

14
Typical OLAP Operations
1. Roll up (drill-up): summarize data
• by climbing up hierarchy or by dimension
reduction
• Example: street<city<state<country.
• Aggregates the data by ascending the location
hierarchy from the level city to the level country.
• Dimension reduction –one or more dimensions
are removed
• From the given cube.

15
2. Drill down (roll down): reverse of roll-up
• Navigate from less detailed data to more detailed
data, or introducing new dimensions.
• Day<month<quarter<year
• Aggregates from the level month to day.
• By descending order.
• Additional Dimension-adding new dimension to
the given cube.
Cont..
3. Slice and dice:
• Slice operation performs selection on one
dimension of a given cube.
• Example: time=q1.
• Dice operation performs selection on two or
more operations.
• Example: location=q1 or q2, time= t1 or t2.
4. Pivot (rotate):
• Visualization operation that rotates the data axes
in view in order to provide an alternative
presentation of data. 17
Data warehouse architecture
Steps:
1.top-down view
-select relevant information.
-information matches current and future
business needs.
2.Data source view
-information being collected, stored
managed by operating systems.
- information may be documented at various
levels of detail and accuracy.
18
3.Data warehouse view
-fact table and dimension table.
-represents the information that is stored
inside the data warehouse.

4.Business query view


-perspective of data in warehouse from view
point of end user.

19
Process of data warehouse design

1.top-down approach
-planning and designing
-technology mature and well known
-business problem solved clearly and
well understood.
2.Bottom-up approach
-experiments and prototypes.
-business modeling and technology
development.
20
Data Warehouse Design Process

• Choose the grain (atomic level of data) of the business


process
• Choose a business process to model, e.g., orders,
invoices, etc.
• Choose the dimensions that will apply to each fact
table record
• Choose the measure that will populate each fact table
record

21
Data Warehouse Usage

• Three kinds of data warehouse applications


• Information processing
• supports querying, basic statistical analysis,
and reporting using crosstables, tables, charts
and graphs
• Analytical processing
• multidimensional analysis of data warehouse
data
• supports basic OLAP operations, slice-dice,
drilling, pivoting
22
Data Warehouse Usage(Cont..)
• Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical
models, performing classification and
prediction, and presenting the mining results
using visualization tools

You might also like