Datamining Lecture 2

DATA MINING
LECTURE 2
Data warehouse
By: Basha K | MSc in Information Technology | FCSE

DM BY Basha K, 2022 Data Mining 2
What is Data Mining?

• Data mining is the process of identifying valid, novel,
useful, and understandable patterns in large amounts of
data.
• Also known as KDD (Knowledge Discovery in
Databases).
• “We’re drowning in information, but starving for
knowledge.” (John Naisbitt)
What is a Data Warehouse?

• Defined in many different ways, but not rigorously.
• A decision support database that is maintained separately
from the organization’s operational database.
• Supports information processing by providing a solid platform
of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of
management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
• The process of constructing and using data warehouses
…
• Subject Oriented:
• A data warehouse is subject oriented because it organizes
information around a subject rather than around the organization's
ongoing operations.
• These subjects can be products, customers, suppliers, sales, revenue,
etc.
• The data warehouse does not focus on the ongoing operations;
rather, it focuses on modelling and analysis of data for decision
making.
[Figure: the subject "Customer", with customer data, customer activity, and customer activity detail kept for separate periods between 1985 and 1991]
…
• Integrated:
• A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat
files, etc.
• This integration enables more effective analysis of the data.
• Data preprocessing techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute
measures, and so on.
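As an illustration of the consistency point above, here is a minimal sketch of integrating two sources that disagree on naming, encoding, and units. The source names (`crm_rows`, `billing_rows`), field names, and code table are hypothetical and chosen only for the example.

```python
# Hypothetical records from two heterogeneous sources (all names are invented).
crm_rows = [{"cust_id": 1, "sex": "M", "height_cm": 180.0}]
billing_rows = [{"CustomerID": 2, "gender": "female", "height_in": 65.0}]

# One agreed encoding for the warehouse.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_crm(row):
    # Rename fields to the warehouse naming convention.
    return {
        "customer_id": row["cust_id"],
        "gender": GENDER_CODES[row["sex"].lower()],
        "height_cm": row["height_cm"],
    }


def from_billing(row):
    # Rename fields, unify the encoding, and convert inches to centimetres.
    return {
        "customer_id": row["CustomerID"],
        "gender": GENDER_CODES[row["gender"].lower()],
        "height_cm": round(row["height_in"] * 2.54, 1),
    }


integrated = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
print(integrated)
```

After this step every record uses the same attribute names, the same gender codes, and the same unit of measure, whichever source it came from.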
…
• Time Variant:
• The data collected in a data warehouse is identified with
a particular time period.
• The data in a data warehouse provides information from
a historical point of view (e.g., the past 5-10 years).
• In other words, a data warehouse stores historical data.
…
• Non-volatile:
• Nonvolatile means that previous data is not removed when new data is
added.
• The data warehouse is kept separate from the operational database,
so frequent changes in the operational database are not reflected in the
data warehouse.
• Data, once recorded, is not updated in place.
• A data warehouse requires only two operations:
• Initial loading of data
• Access of data
Need/importance of data warehouse
• Source of information for report generation
• Increase quality and flexibility of enterprise analysis
• Ability to maintain better customer relationships
• Source for data analysis and data mining
• More cost-effective decision making and policy formulation
Knowledge Discovery (KDD) Process

Steps of KDD:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis task are
retrieved from the database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods
are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness
measures)
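The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two sources, the purchase records, and the `MIN_SUPPORT` threshold are hypothetical, and simple frequency counting stands in for a real mining method.

```python
from collections import Counter

# Two hypothetical sources of (customer, item) purchases; None marks a noisy record.
store_a = [("ann", "milk"), ("bob", None), ("ann", "bread")]
store_b = [("cat", "milk"), ("cat", "milk"), ("dan", "tea")]


def clean(rows):
    # Step 1: drop records with missing values and exact duplicates.
    seen, out = set(), []
    for row in rows:
        if None in row or row in seen:
            continue
        seen.add(row)
        out.append(row)
    return out


# Step 2: integration, combining the cleaned sources.
integrated = clean(store_a) + clean(store_b)

# Step 3: selection, keeping only the attribute relevant to the task.
items = [item for _customer, item in integrated]

# Step 4: transformation, aggregating records into counts.
counts = Counter(items)

# Step 5: data mining, here a stand-in that treats item counts as patterns.
# Step 6: pattern evaluation, keeping patterns above an interestingness threshold.
MIN_SUPPORT = 2  # assumed threshold, chosen only for this example
patterns = {item: n for item, n in counts.items() if n >= MIN_SUPPORT}
print(patterns)
```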
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
• A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprise the
majority of the work of building a data warehouse
Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
• Major task of traditional relational DBMS
• Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
• Major task of data warehouse system
• Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
• User and system orientation: customer vs. market
• Data contents: current, detailed vs. historical, consolidated
• Database design: ER + application vs. star + subject
• View: current, local vs. evolutionary, integrated
• Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day to day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date,          historical,
                     detailed, flat relational,    summarized, multidimensional,
                     isolated                      integrated, consolidated
usage                repetitive                    ad-hoc
access               read/write                    lots of scans
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100MB-GB                      100GB-TB
metric               transaction throughput        query throughput, response
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a set of
dimension tables
• Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to a snowflake.
• Fact constellations: Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore called galaxy schema
or fact constellation
…
• Comparing the star and snowflake models:
• The star model is usually the better choice for querying, while the
snowflake model is the normalized form that reduces redundancy.
• Compared with the star model, the snowflake model is:
- easier to maintain.
- more economical in storage space.
- less effective for browsing, because more joins are needed to execute a query.
Example of Star Schema

Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, Brand_type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
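A star schema like the one above can be tried out with an in-memory SQLite database. The sketch below is a cut-down version with only two dimensions; the table names `time_dim` and `item_dim` and all sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (reduced to a few attributes for brevity).
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);

-- Fact table: foreign keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);

INSERT INTO time_dim VALUES (1, 'Q1', 2022), (2, 'Q2', 2022);
INSERT INTO item_dim VALUES (10, 'laptop'), (11, 'phone');
INSERT INTO sales_fact VALUES (1, 10, 2, 2000.0), (1, 11, 5, 2500.0), (2, 10, 1, 1000.0);
""")

# Each dimension is exactly one join away from the fact table.
star_rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
print(star_rows)
```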
Example of Snowflake Schema

Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city: city_key, city, province_or_state, country
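The cost of normalization can be seen in a query. In the sketch below, which assumes a reduced snowflake schema with invented sample rows, reaching `country` from the fact table takes two joins (fact to `location`, then `location` to `city`) where a star schema would need one.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- The location dimension is normalized: city attributes live in their own table.
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city(city_key)
);
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL
);

INSERT INTO city VALUES (1, 'Toronto', 'Canada'), (2, 'Frankfurt', 'Germany');
INSERT INTO location VALUES (100, 'King St', 1), (101, 'Queen St', 1), (102, 'Zeil', 2);
INSERT INTO sales_fact VALUES (100, 50.0), (101, 70.0), (102, 30.0);
""")

# Two joins are needed to reach the country attribute.
snowflake_rows = db.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN location l ON f.location_key = l.location_key
    JOIN city c ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(snowflake_rows)
```

Storing each city's country once, instead of repeating it on every location row, is what saves space and simplifies maintenance.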
A Concept Hierarchy: Dimension (location)
all:     all
region:  Europe ... North_America
country: Germany ... Spain, Canada ... Mexico
city:    Frankfurt ..., Vancouver ... Toronto
office:  L. Chan ... M. Wind

Typical OLAP Operations

• Roll up (drill-up): summarize data
• by climbing up a concept hierarchy or by dimension reduction
• Example hierarchy: street < city < state < country.
• Climbing the hierarchy aggregates the data by ascending the location
hierarchy from the level of city to the level of country.
• Dimension reduction: one or more dimensions are removed
from the given cube.
• Drill down (roll down): reverse of roll-up
• Navigate from less detailed data to more detailed data, either by
stepping down a hierarchy or by introducing new dimensions.
Cont..
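• Drill-down example hierarchy: day < month < quarter < year.
• Drilling down descends the time hierarchy, e.g., from the level of
month to the level of day, yielding more detailed data.
• Adding a dimension: a new dimension is added to the
given cube.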
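Roll-up can be sketched on a tiny cube held in a Python dictionary. The sales figures and the `CITY_TO_COUNTRY` mapping below are hypothetical; the two functions illustrate climbing the location hierarchy and dimension reduction respectively.

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, quarter) -> dollars_sold.
sales = {
    ("Toronto", "Q1"): 100,
    ("Toronto", "Q2"): 150,
    ("Vancouver", "Q1"): 80,
    ("Frankfurt", "Q1"): 60,
}

# One level of the location concept hierarchy: city < country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Frankfurt": "Germany"}


def roll_up_location(cube):
    # Roll-up by climbing the hierarchy: aggregate cities into countries.
    out = defaultdict(int)
    for (city, quarter), value in cube.items():
        out[(CITY_TO_COUNTRY[city], quarter)] += value
    return dict(out)


def drop_time(cube):
    # Roll-up by dimension reduction: remove the time dimension entirely.
    out = defaultdict(int)
    for (place, _quarter), value in cube.items():
        out[place] += value
    return dict(out)


by_country = roll_up_location(sales)
by_country_total = drop_time(by_country)
print(by_country)
print(by_country_total)
```

Drill-down is the reverse navigation, from `by_country` back to `sales`. It cannot be computed from the aggregate alone, so the detailed data has to remain available.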
Cont..
• Slice and dice:
• The slice operation performs a selection on one dimension of a
given cube.
• Example: time = q1.
• The dice operation performs a selection on two or more dimensions.
• Example: location = l1 or l2, time = t1 or t2.
• Pivot (rotate):
• A visualization operation that rotates the data axes in view in
order to provide an alternative presentation of the data.
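The three operations above can be sketched on a small dictionary-based cube. The cell values and the helper names `slice_time`, `dice`, and `pivot` are assumptions made for the example, not part of any OLAP product's API.

```python
# Hypothetical cube cells: (location, time, item) -> units_sold.
cube = {
    ("Toronto", "Q1", "phone"): 5,
    ("Toronto", "Q2", "phone"): 7,
    ("Vancouver", "Q1", "laptop"): 3,
    ("Frankfurt", "Q3", "phone"): 4,
}


def slice_time(cells, quarter):
    # Slice: select on one dimension; the result no longer carries that dimension.
    return {(loc, item): v for (loc, t, item), v in cells.items() if t == quarter}


def dice(cells, locations, quarters):
    # Dice: select on two dimensions at once, producing a sub-cube.
    return {k: v for k, v in cells.items() if k[0] in locations and k[1] in quarters}


def pivot(table):
    # Pivot: swap the two axes of a 2-D view; the values are unchanged.
    return {(b, a): v for (a, b), v in table.items()}


q1_slice = slice_time(cube, "Q1")
sub_cube = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"})
rotated = pivot(q1_slice)
print(q1_slice)
print(sub_cube)
print(rotated)
```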
Data warehouse architecture

• Four views:
1. Top-down view
- allows the selection of relevant information.
- this information matches current and future
business needs.
2. Data source view
- exposes the information being captured, stored, and
managed by operational systems.
…
- this information may be documented at various
levels of detail and accuracy.
3. Data warehouse view
- consists of fact tables and dimension tables.
- represents the information that is stored
inside the data warehouse.
4. Business query view
- the perspective of data in the warehouse from the
viewpoint of the end user.
Process of data warehouse design

1. Top-down approach
- starts with overall planning and design.
- suitable when the technology is mature and well known,
- and when the business problems to be solved are clear and
well understood.
2. Bottom-up approach
- starts with experiments and prototypes.
- suitable in the early stages of business modeling and technology
development.
Data Warehouse Design Process
• Choose a business process to model, e.g., orders, invoices, etc.
• Choose the grain (atomic level of data) of the business process
• Choose the dimensions that will apply to each fact table record
• Choose the measures that will populate each fact table record
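One way to record these design choices is to write down the shape of a single fact table record. The sketch below assumes the business process is orders and the grain is one order line; the class name `OrderLineFact` and the sample values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OrderLineFact:
    # Business process: orders. Grain: one record per order line (assumed).
    # Dimensions: a foreign key into each dimension table.
    time_key: int
    item_key: int
    branch_key: int
    location_key: int
    # Measures: the numeric values analysed for each record.
    units_sold: int
    dollars_sold: float


row = OrderLineFact(time_key=1, item_key=10, branch_key=3, location_key=100,
                    units_sold=2, dollars_sold=2000.0)
dimension_keys = [f.name for f in fields(row) if f.name.endswith("_key")]
print(dimension_keys)
```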
Three tier architecture

• Back-end tools feed data into the bottom tier.
• These tools perform data extraction, cleaning, transformation,
loading, and refreshing.
• Data is extracted using application program
interfaces (APIs) called gateways, such as JDBC, ODBC,
and OLE DB.
Three-Tiered Architecture
[Figure: data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed into data storage (data warehouse, data marts, and metadata, with a monitor and integrator); an OLAP server (OLAP engine) serves the front-end tools for analysis, query, reports, and data mining]

Recommended approach
• Enterprise warehouse
• collects all of the information about subjects spanning the entire
organization
• Data Mart
• a subset of corporate-wide data that is of value to a specific
group of users. Its scope is confined to specific, selected
subjects, such as a marketing data mart
• Independent vs. dependent (sourced directly from the warehouse) data mart
• Virtual warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized
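The virtual warehouse idea can be illustrated with a plain SQL view over an operational table. The table `orders`, the view `sales_by_region`, and the rows are invented for the example. SQLite views are not materialized, so this sketch shows only the non-materialized case.

```python
import sqlite3

ops = sqlite3.connect(":memory:")
ops.executescript("""
-- A hypothetical operational table.
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'north', 10.0), (2, 'north', 15.0), (3, 'south', 7.5);

-- The "warehouse" is just a summary view defined over the operational data.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

before = ops.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()

# Because the view is not materialized, new operational rows show up immediately.
ops.execute("INSERT INTO orders VALUES (4, 'south', 2.5)")
after = ops.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
print(before, after)
```

The trade-off is that every query on the view runs against the operational database, which is why some summary views are often materialized instead.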
Data Warehouse Development: A Recommended Approach
[Figure: define a high-level corporate data model; through model refinement, build data marts and the enterprise data warehouse; these lead to distributed data marts and finally a multi-tier data warehouse]

Types of OLAP
• Relational OLAP (ROLAP)
• Uses a relational or extended-relational DBMS to store and
manage warehouse data, and OLAP middleware to support
query language extensions.
• Greater scalability
• Multidimensional OLAP (MOLAP)
• Array-based multidimensional storage engine
• Handles both sparse and dense datasets.
• Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) = ROLAP + MOLAP
• User flexibility: combines the scalability of ROLAP with the fast computation of MOLAP.
• Example: Microsoft SQL Server 2000.
Metadata Repository
• Metadata is the data defining warehouse objects. It includes the following
kinds:
• Description of the structure of the warehouse
• schema, views, dimensions, hierarchies, derived data definitions, data mart
locations and contents
• Operational metadata
• data lineage (history of migrated data and transformation path), currency of
data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
• The algorithms used for summarization (aggregation, reports, etc.)
• Business metadata (e.g., policies)
Data Warehouse Back-End Tools and Utilities

• Data extraction:
• get data from multiple, heterogeneous, and external sources
• Data cleaning:
• detect errors in the data and rectify them when possible
• Data transformation:
• convert data from legacy or host format to warehouse
format
• Load:
• sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
• Refresh
• propagate the updates from the data sources to the
warehouse
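The five utilities above can be sketched as small functions chained together. This is a toy illustration, not a real ETL tool: the raw records, the field names, and the dictionary standing in for the warehouse are all assumptions.

```python
from collections import defaultdict

# Hypothetical raw records from a source system; every value arrives as text.
source_rows = [
    {"id": "1", "amount": "10.50", "country": "ca"},
    {"id": "2", "amount": "n/a", "country": "DE"},
    {"id": "3", "amount": "4.00", "country": "de"},
]


def extract(rows):
    # Extraction: read the records from the source.
    return list(rows)


def clean(rows):
    # Cleaning: detect records whose amount is not numeric and drop them.
    cleaned = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        cleaned.append(row)
    return cleaned


def transform(rows):
    # Transformation: convert from the source format to the warehouse format.
    return [
        {"id": int(r["id"]), "amount": float(r["amount"]), "country": r["country"].upper()}
        for r in rows
    ]


def load(warehouse, rows):
    # Load: sort the rows and add them without overwriting existing records.
    for row in sorted(rows, key=lambda r: r["id"]):
        warehouse.setdefault(row["id"], row)


warehouse = {}
load(warehouse, transform(clean(extract(source_rows))))

# Refresh: propagate records that appeared in the source after the initial load.
refresh_rows = [{"id": "4", "amount": "1.25", "country": "ca"}]
load(warehouse, transform(clean(extract(refresh_rows))))

# Summarize: total amount per country over the loaded data.
totals = defaultdict(float)
for row in warehouse.values():
    totals[row["country"]] += row["amount"]
totals = dict(totals)
print(totals)
```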
Data Warehouse Usage

• Three kinds of data warehouse applications
• Information processing
• supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
• Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations: slice-dice, drilling, pivoting
• Data mining
• knowledge discovery from hidden patterns
• supports finding associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools.
OLAM ARCHITECTURE
• OLAM and OLAP servers both accept on-line queries
via a graphical user interface API and work with the data cube via a cube
API.
• A metadata directory guides access to the data cube.
• The data cube is constructed by accessing and integrating multiple
databases via an MDDB API,
• or by filtering a data warehouse via a database API.
Cont..
• OLAM performs data mining tasks such as classification,
prediction, clustering, concept description, etc.
• An OLAM server is more sophisticated than an OLAP server.
• OLAM relies on the high quality of data in the warehouse.
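An OLAM Architecture
[Figure: four layers. Layer 4, user interface: mining queries go in and mining results come out through the user GUI API. Layer 3, OLAP/OLAM: the OLAM engine and OLAP engine, which sit on the data cube API. Layer 2, MDDB: the multidimensional database and metadata, built by filtering and integration through the database API. Layer 1, data repository: databases and the data warehouse, fed by data cleaning and data integration]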
