
DATA MINING

LECTURE 2
Data warehouse

By: Basha K | MSc in Information Technology | FCSE


DM BY Basha K, 2022 Data Mining 2

What is Data Mining?


• Data mining is the process of identifying valid, novel, useful, and understandable patterns in large amounts of data.
• Also known as KDD (Knowledge Discovery in Databases).
• “We’re drowning in information, but starving for knowledge.” (John Naisbitt)

What is a Data Warehouse?


• Defined in many different ways, but not rigorously.
• A decision support database that is maintained separately from the organization’s operational database.
• Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
• The process of constructing and using data warehouses


• Subject Oriented:
• A data warehouse is subject oriented because it provides information organized around a subject rather than around the organization's ongoing operations.
• These subjects can be products, customers, suppliers, sales, revenue, etc.
• The data warehouse does not focus on ongoing operations; rather, it focuses on the modelling and analysis of data for decision-making.

[Figure: the subject "Customer", with blocks of customer data and customer activity detail covering periods between 1985 and 1991]


• Integrated:
• A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc.
• This integration enables effective analysis of the data.
• Data preprocessing is applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
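
The consistency step above can be sketched in a few lines of Python. This is only an illustration: the two sources, their field names, and the gender codes are hypothetical, chosen to show one naming conflict, one encoding conflict, and one unit conflict.

```python
# Hypothetical records from two heterogeneous sources (all names and values are illustrative).
crm_rows = [{"cust_id": 1, "gender": "M", "height_cm": 180.0}]
billing_rows = [{"CustomerID": 2, "sex": "female", "height_in": 65.0}]

# One agreed encoding for the warehouse.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_crm(row):
    """Rename the key attribute and normalize the gender code; units already match."""
    return {"customer_id": row["cust_id"],
            "gender": GENDER_CODES[row["gender"].lower()],
            "height_cm": row["height_cm"]}


def from_billing(row):
    """Rename attributes, recode gender, and convert inches to centimetres."""
    return {"customer_id": row["CustomerID"],
            "gender": GENDER_CODES[row["sex"].lower()],
            "height_cm": round(row["height_in"] * 2.54, 1)}


# The integrated collection uses one naming convention, one encoding, one unit.
integrated = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

After this step every record has the same three attributes, so later analysis does not need to know which source a row came from.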


• Time Variant:
• The data collected in a data warehouse is identified with
a particular time period.
• The data in a data warehouse provides information from
a historical point of view. e.g. past 5-10 years
• Data warehouse stores historical data.


• Non-volatile:
• Non-volatile means that previous data is not removed when new data is added.
• The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.
• Data, once recorded, is not updated in place.
• A data warehouse requires only two data-access operations:
• Initial loading of data
• Access of data

Need/importance of data warehouse

• Source of information for report generation
• Increased quality and flexibility of enterprise analysis
• Ability to maintain better customer relationships
• Source for data analysis and data mining
• More cost-effective decision making and policy formulation

Knowledge Discovery (KDD) Process



Steps of KDD:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
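
The six steps above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the two store datasets, the function names, and especially the "mining" step, which is just a ranking of items by total quantity rather than a real mining algorithm.

```python
from collections import Counter

# Two hypothetical sources of (customer, item, quantity); None marks a missing value.
store_a = [("alice", "milk", 2), ("bob", "bread", None), ("alice", "milk", 2)]
store_b = [("carol", "milk", 1), ("dave", "eggs", 3)]


def clean(rows):
    """Step 1: drop rows with missing values and exact duplicates."""
    seen, out = set(), []
    for row in rows:
        if None in row or row in seen:
            continue
        seen.add(row)
        out.append(row)
    return out


def integrate(*sources):
    """Step 2: combine multiple sources into one collection."""
    return [row for src in sources for row in src]


def select(rows):
    """Step 3: keep only the attributes relevant to the task (item, quantity)."""
    return [(item, qty) for _customer, item, qty in rows]


def transform(rows):
    """Step 4: consolidate by aggregating quantity per item."""
    totals = Counter()
    for item, qty in rows:
        totals[item] += qty
    return totals


def mine(totals):
    """Step 5: a stand-in for a mining method: rank items by total quantity."""
    return totals.most_common()


def evaluate(patterns, min_quantity=3):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return [(item, qty) for item, qty in patterns if qty >= min_quantity]


cleaned = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(cleaned))))
```

Each function corresponds to one numbered step, so swapping in a real cleaning rule or mining method changes one function without touching the rest.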

Why Is Data Preprocessing Important?

• No quality data, no quality mining results!
• Quality decisions must be based on quality data.
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• A data warehouse needs consistent integration of quality data.
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.
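
A small sketch of how a duplicate record can distort a statistic. The order data is hypothetical: order 102 is assumed to have been loaded twice, and order 104 is assumed to be missing its amount.

```python
from statistics import mean

# Hypothetical (order_id, amount) records.
raw_orders = [(101, 20.0), (102, 200.0), (102, 200.0), (103, 20.0), (104, None)]

# Naive statistic: skips the missing value but counts the duplicate twice.
naive_mean = mean(amount for _id, amount in raw_orders if amount is not None)

# Cleaned statistic: one amount per order id, missing values excluded.
deduplicated = dict(raw_orders)  # later duplicates overwrite earlier ones by order id
clean_amounts = [a for a in deduplicated.values() if a is not None]
clean_mean = mean(clean_amounts)
```

Here the duplicated large order pulls the naive average well above the cleaned one, which is the kind of misleading statistic the slide warns about.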

Data Warehouse vs. Operational DBMS


• OLTP (on-line transaction processing)
• Major task of traditional relational DBMS
• Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
• Major task of data warehouse system
• Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
• User and system orientation: customer vs. market
• Data contents: current, detailed vs. historical, consolidated
• Database design: ER + application vs. star + subject
• View: current, local vs. evolutionary, integrated
• Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP


                     OLTP                             OLAP
users                clerk, IT professional           knowledge worker
function             day-to-day operations            decision support
DB design            application-oriented             subject-oriented
data                 current, up-to-date,             historical, summarized,
                     detailed, flat relational,       multidimensional,
                     isolated                         integrated, consolidated
usage                repetitive                       ad-hoc
access               read/write                       lots of scans
unit of work         short, simple transaction        complex query
# records accessed   tens                             millions
# users              thousands                        hundreds
DB size              100 MB to GB                     100 GB to TB
metric               transaction throughput           query throughput, response time
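
The contrast in access patterns can be illustrated with Python's built-in `sqlite3` module. This is a sketch only: the `sales` table and its rows are hypothetical, and a single in-memory database stands in for what would normally be two separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [("north", 2021, 100.0), ("north", 2022, 150.0),
     ("south", 2021, 80.0), ("south", 2022, 120.0)],
)

# OLTP-style work: a short transaction that touches one record by its key.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 10 WHERE id = ?", (1,))

# OLAP-style work: a read-only query that scans and summarizes many records.
olap_result = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The update reads and writes a single row, while the summary query reads every row and returns a consolidated view per region.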

Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a set of dimension tables
• Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
• Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation


• Comparing the star and snowflake models:
• The star model is usually preferred because queries need fewer joins; the snowflake model is its normalized form, which reduces redundancy.
• Compared with the star model, the snowflake model:
- is easier to maintain.
- saves storage space.
- reduces the effectiveness of browsing.
- needs more joins to execute a query.

Example of Star Schema


[Figure: star schema for sales, one fact table surrounded by four dimension tables]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
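
A cut-down version of this star schema can be tried out with Python's `sqlite3` module. The sketch below is an assumption-laden illustration: table names such as `time_dim` and all row values are made up, the branch dimension is left out, and each dimension keeps only a few of its attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);

INSERT INTO time_dim VALUES (1, 'Q1', 2022), (2, 'Q2', 2022);
INSERT INTO item_dim VALUES (1, 'laptop'), (2, 'phone');
INSERT INTO location_dim VALUES (1, 'Toronto', 'Canada'), (2, 'Frankfurt', 'Germany');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 2, 2000.0),
    (1, 2, 1, 5, 2500.0),
    (2, 1, 2, 1, 1000.0);
""")

# A typical star query: join the fact table to the dimensions it needs, then aggregate.
star_result = conn.execute("""
    SELECT l.country, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.country, t.quarter
    ORDER BY l.country, t.quarter
""").fetchall()
```

Every dimension is exactly one join away from the fact table, which is what keeps star queries simple.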

Example of Snowflake Schema


[Figure: snowflake schema for sales, with the item and location dimensions normalized]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country
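
The cost of normalization shows up as an extra join. The following `sqlite3` sketch keeps only the location branch of the snowflake; table names and rows are hypothetical, and the other dimensions are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- City attributes live in their own table instead of being repeated per location.
CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city_dim(city_key)
);
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL
);

INSERT INTO city_dim VALUES (1, 'Toronto', 'Canada'), (2, 'Vancouver', 'Canada');
INSERT INTO location_dim VALUES (10, '1 King St', 1), (11, '2 Queen St', 1), (12, '3 Main St', 2);
INSERT INTO sales_fact VALUES (10, 100.0), (11, 50.0), (12, 70.0);
""")

# Reaching the city name now takes two joins: fact -> location -> city.
snowflake_result = conn.execute("""
    SELECT c.city, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    JOIN city_dim AS c ON c.city_key = l.city_key
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
```

The city and country are stored once per city rather than once per street address, which saves space but lengthens the join path.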

A Concept Hierarchy: Dimension (location)

all all

region Europe ... North_America

country Germany ... Spain Canada ... Mexico

city Frankfurt ... Vancouver ... Toronto

office L. Chan ... M. Wind



Typical OLAP Operations


• Roll up (drill-up): summarize data
• by climbing up a hierarchy or by dimension reduction
• Example hierarchy: street < city < state < country.
• Aggregates the data by ascending the location hierarchy, e.g. from the level of city to the level of country.
• Dimension reduction: one or more dimensions are removed from the given cube.
• Drill down (roll down): reverse of roll-up
• Navigates from less detailed data to more detailed data, or introduces new dimensions.

Cont..
• Example hierarchy: day < month < quarter < year.
• Drill-down descends the time hierarchy, e.g. from the level of month to the more detailed level of day.
• Adding a dimension: a new dimension is added to the given cube.
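
Roll-up can be sketched with plain dictionaries. The cube below is hypothetical: cells are keyed by (city, month), and `CITY_TO_COUNTRY` is an assumed one-level piece of the location hierarchy.

```python
from collections import defaultdict

# Hypothetical concept hierarchy and base cuboid: (city, month) -> units sold.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Frankfurt": "Germany"}
sales = {
    ("Toronto", "2022-01"): 5,
    ("Toronto", "2022-02"): 7,
    ("Vancouver", "2022-01"): 3,
    ("Frankfurt", "2022-01"): 4,
}


def roll_up_location(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, month), units in cube.items():
        out[(CITY_TO_COUNTRY[city], month)] += units
    return dict(out)


def remove_time_dimension(cube):
    """Roll up by dimension reduction: drop the time dimension entirely."""
    out = defaultdict(int)
    for (place, _month), units in cube.items():
        out[place] += units
    return dict(out)


by_country = roll_up_location(sales)
by_city_only = remove_time_dimension(sales)
```

Drill-down is the reverse navigation, from `by_country` back to `sales`. It cannot be computed from the summary alone, so the more detailed data has to be kept available.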

Cont..
• Slice and dice:
• The slice operation performs a selection on one dimension of a given cube.
• Example: time = Q1.
• The dice operation performs a selection on two or more dimensions.
• Example: (location = l1 or l2) and (time = t1 or t2).
• Pivot (rotate):
• A visualization operation that rotates the data axes in view to provide an alternative presentation of the data.
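
The three operations above can be sketched on a tiny cube stored as a dictionary. The cell values and the function names `slice_cube`, `dice_cube`, and `pivot` are illustrative assumptions, not part of any OLAP product.

```python
# Hypothetical cube cells: (location, time, item) -> units sold.
cube = {
    ("Toronto", "Q1", "laptop"): 5,
    ("Toronto", "Q2", "laptop"): 6,
    ("Vancouver", "Q1", "phone"): 3,
    ("Frankfurt", "Q1", "laptop"): 4,
}


def slice_cube(cells, time):
    """Slice: select on one dimension (time), leaving a (location, item) sub-cube."""
    return {(loc, item): v for (loc, t, item), v in cells.items() if t == time}


def dice_cube(cells, locations, times):
    """Dice: select on two or more dimensions at once."""
    return {k: v for k, v in cells.items() if k[0] in locations and k[1] in times}


def pivot(table):
    """Pivot: swap the two axes of a 2-D view; the values are unchanged."""
    return {(b, a): v for (a, b), v in table.items()}


q1 = slice_cube(cube, "Q1")
diced = dice_cube(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"})
rotated = pivot(q1)
```

Slice reduces the number of dimensions by one, dice keeps all dimensions but fewer cells, and pivot changes only how the same cells are laid out.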

Data warehouse architecture


• Four views to consider in the design:
1. Top-down view
- selects the relevant information needed for the data warehouse.
- this information matches current and future business needs.
2. Data source view
- exposes the information being captured, stored, and managed by operational systems.


- this information may be documented at various levels of detail and accuracy.
3. Data warehouse view
- includes fact tables and dimension tables.
- represents the information that is stored inside the data warehouse.
4. Business query view
- the perspective of the data in the warehouse from the viewpoint of the end user.

Process of data warehouse design


1. Top-down approach
- starts with overall planning and design.
- useful when the technology is mature and well known,
- and when the business problems to be solved are clear and well understood.
2. Bottom-up approach
- starts with experiments and prototypes.
- useful in the early stage of business modeling and technology development.

Data Warehouse Design Process

• Choose a business process to model, e.g., orders, invoices, etc.
• Choose the grain (atomic level of data) of the business process
• Choose the dimensions that will apply to each fact table record
• Choose the measures that will populate each fact table record

Three tier architecture


• Back-end tools feed data into the bottom tier.
• These tools perform data extraction, cleaning, transformation, loading, and refreshing.
• Data is extracted using application program interfaces (APIs) called gateways, such as ODBC, OLE DB, and JDBC.

Three-Tiered Architecture

[Figure: three-tier data warehouse architecture]
Data Sources: operational DBs and other sources, fed in through Extract, Transform, Load, Refresh
Data Storage: data warehouse and data marts, with metadata and a monitor & integrator
OLAP Engine: OLAP server
Front-End Tools: analysis, query, reports, data mining



Recommended approach

• Enterprise warehouse
• collects all of the information about subjects spanning the entire organization
• Data mart
• a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
• Independent vs. dependent (sourced directly from the warehouse) data mart
• Virtual warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized
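
The virtual warehouse idea can be sketched with `sqlite3`. Note the assumptions: the `orders` table is hypothetical, and because SQLite has no built-in materialized views, a table created with `CREATE TABLE ... AS SELECT` is used here to stand in for a materialized summary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
INSERT INTO orders (customer, amount) VALUES ('alice', 30.0), ('alice', 20.0), ('bob', 15.0);

-- Virtual warehouse: a summary view defined directly over the operational table.
CREATE VIEW customer_totals AS
    SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;

-- Stand-in for a materialized summary: the view's current result stored as a table.
CREATE TABLE customer_totals_mat AS SELECT * FROM customer_totals;
""")

# A new operational record arrives after the summary was materialized.
conn.execute("INSERT INTO orders (customer, amount) VALUES ('bob', 5.0)")

live = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
stored = conn.execute(
    "SELECT customer, total FROM customer_totals_mat ORDER BY customer"
).fetchall()
```

The plain view is recomputed from the operational data on every query, so it is always current but adds load to the operational database. The stored copy is cheap to read but stays stale until it is refreshed.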

Data Warehouse Development: A Recommended Approach

[Figure: starting from "Define a high-level corporate data model", model refinement leads to data marts and to the enterprise data warehouse; these lead to distributed data marts and finally to a multi-tier data warehouse]



Types of OLAP

• Relational OLAP (ROLAP)
• Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support query language extensions.
• Greater scalability
• Multidimensional OLAP (MOLAP)
• Array-based multidimensional storage engine
• Handles sparse and dense datasets.
• Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) = ROLAP + MOLAP
• User flexibility: combines the scalability of ROLAP with the fast computation of MOLAP.
• Example: Microsoft SQL Server 2000.

Metadata Repository
• Metadata is the data defining warehouse objects. It includes the following kinds:
• Description of the structure of the warehouse
• schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents
• Operational metadata
• data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
• The algorithms used for summarization (aggregation, reports, etc.)
• Business metadata (e.g. policies)

Data Warehouse Back-End Tools and Utilities


• Data extraction:
• get data from multiple, heterogeneous, and external sources
• Data cleaning:
• detect errors in the data and rectify them when possible
• Data transformation:
• convert data from legacy or host format to warehouse format
• Load:
• sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh:
• propagate the updates from the data sources to the warehouse
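
The five back-end functions above can be sketched as small Python functions. This is a toy under stated assumptions: the source rows, the "legacy" text format, and the rule that a refresh simply re-runs the pipeline and appends unseen ids are all invented for illustration.

```python
# Hypothetical source records in a "legacy" format: (id, name, amount), all as text.
source_a = [("1", " Alice ", "10.50"), ("2", "Bob", "n/a")]
source_b = [("3", "carol", "7.25")]


def extract(*sources):
    """Extraction: gather rows from multiple sources."""
    return [row for src in sources for row in src]


def clean(rows):
    """Cleaning: drop rows whose amount cannot be parsed as a number."""
    good = []
    for row in rows:
        try:
            float(row[2])
        except ValueError:
            continue
        good.append(row)
    return good


def transform(rows):
    """Transformation: convert the legacy text fields to the warehouse format."""
    return [{"id": int(i), "name": n.strip().title(), "amount": float(a)}
            for i, n, a in rows]


def load(warehouse, rows):
    """Load: sort by id and append only rows not already in the warehouse."""
    known = {r["id"] for r in warehouse}
    warehouse.extend(r for r in sorted(rows, key=lambda r: r["id"])
                     if r["id"] not in known)
    return warehouse


def refresh(warehouse, *sources):
    """Refresh: propagate new source rows by re-running the pipeline."""
    return load(warehouse, transform(clean(extract(*sources))))


warehouse = refresh([], source_a, source_b)       # initial load
source_b.append(("4", "dave", "3.00"))            # a source changes later
warehouse = refresh(warehouse, source_a, source_b)
```

Existing warehouse rows are never modified by the refresh; only new rows are appended, which matches the non-volatile property described earlier.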

Data Warehouse Usage


• Three kinds of data warehouse applications
• Information processing
• supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs
• Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
• Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools.

OLAM ARCHITECTURE
• OLAM and OLAP servers both accept on-line queries
• via a graphical user interface, and work with the data cube via a cube API.
• A metadata directory guides access to the data cube.
• The data cube is constructed by accessing and integrating multiple databases via an MDDB API,
• and/or by filtering a data warehouse via a database API.

Cont..
• OLAM performs data mining tasks such as classification, prediction, clustering, and concept description.
• It is more sophisticated than an OLAP server.
• It benefits from the high quality of data in the warehouse.

An OLAM Architecture
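[Figure: OLAM architecture in four layers]
Layer 4, User Interface: user GUI API; mining queries go in, mining results come out
Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, on top of the data cube API
Layer 2, MDDB: multidimensional database with metadata, built through filtering & integration over the database API
Layer 1, Data Repository: databases and data warehouse, prepared by data cleaning, data integration, and filtering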
