
DATA WAREHOUSING AND DATA MINING
10BI23

Instructor: Dr. S. Natarajan
Professor and Key Resource Person
Department of Information Science and Engineering
PES Institute of Technology
Bangalore

Text Books:
1. Fundamentals of Data Warehouses – M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis – Springer Verlag, 2003
2. The Data Warehouse Toolkit – Ralph Kimball – Wiley, 2002
3. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations – I. Witten and E. Frank – Morgan Kaufmann, 1999
4. Data Mining: Concepts and Techniques – J. Han and M. Kamber – Morgan Kaufmann, 2000
Textbook authors: M. Jarke, Ralph Kimball, Bill Inmon, Ian Witten, Jiawei Han
[Photo slide]
Introduction to Data Warehouses:


• Introduction
• Heterogeneous information, Integration problem
• Warehouse architecture
• Data Warehousing, Warehouse vs. DBMS
DW Best Practices:
The Most Important Metrics
• Employee satisfaction
– Without it, long-term customer satisfaction is impossible
• Customer satisfaction
– That’s the nature of the Information Services career field
– Some people in our profession still don’t get it
• We are here to serve

• The Organizational Laugh Metric


– How many times do you hear laughter in the day-to-day operations of
your team?
– It is the single most important vital sign to organizational health and
business success
Data Warehousing History

[Timeline: "Newspaper Rock", 100 B.C. → American Retail, 2005 A.D. — "Lots of stuff happened"]
What Happened in the Cloud?
• Stage 1: Laziness
– Operators grew tired of hanging tapes
• In response to requests for historical financial data
– They stored data on-line, in “unauthorized” mainframe databases

• Stage 2: End of the mainframe bully


– Computing moved out from finance to the rest of the business
– Unix and relational databases
– Distributed computing created islands of information

• Stage 2.1: The government gets involved


– Consolidating IRS and military databases to save money on mainframes
– “Hey, look what I can do with this data…”

• Stage 3: Deming comes along


– Push towards constant business “reengineering”
– Cultural emphasis on “continuous quality improvement” and “business
innovation” drives the need for data

• Stage 4: Data warehousing has its own language


– Ralph Kimball publishes “The Data Warehouse Toolkit”

The Real Truth
• Data warehousing is a symptom of a problem
– Technological inability to deploy single-platform
information systems that:
• Capture data once and reuse it throughout an
enterprise
• Support high-transaction rates (single record CREATE,
SELECT, UPDATE, DELETE) and analytic queries on the same
computing platform, with the same data, at the same
time

– Someday, maybe we will address the root cause


• Until then, it’s a good way to make a living

The “Ideal Library” Analogy
• Stores all of the books and other reference material you need to conduct your
research
– The Enterprise data warehouse
• A single place to visit
– One database environment
• Contents are kept current and refreshed
– Timely, well-choreographed data loads
• Staffed with friendly, knowledgeable people that can help you find your way
around
– Your Data Warehouse team
• Organized for easy navigation and use
– Metadata
– Data models
– “User friendly” naming conventions
• Solid architectural infrastructure
– Hardware, software, standards, metrics

Example of DW in Healthcare
[Series of diagram slides; acronyms used: VCT – Voluntary Counselling Therapy, ART – Anti-Retroviral Therapy, ARV – Anti-Retroviral]
Problem: Heterogeneous Information Sources
"Heterogeneities are everywhere"
[Sources: Personal Databases, World Wide Web, Scientific Databases, Digital Libraries]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
Problem: Data Management in Large Enterprises
 Vertical fragmentation of informational systems (vertical stove pipes)
 Result of application (user)-driven development of operational systems
[Example silos – Sales Administration: Sales Planning, Stock Mngmt; Finance: Suppliers, Debt Mngmt; Manufacturing: Num. Control, Inventory; ...]
Technological Issues
 Problem: Acquire data from a set of
sources for a particular application
 Typical architecture: wrappers and mediators
 Core problem: specify and implement
mediators
 Paper focus: Data warehouses
Data Warehouse Integration
 Most sources internal to organization
 Need global corporate view of data
 Conceptual model defines sources and
data warehouse (local-as-view)
 Three levels of architecture
 Conceptual: Global model
 Logical: Query specifications for sources and
warehouse
 Physical: Wrappers and mediators
implementing query specifications
Goal: Unified Access to Data
[Diagram: Integration System drawing on Personal Databases, World Wide Web, Digital Libraries, Scientific Databases]
 Collects and combines information
 Provides integrated view, uniform user interface
 Supports sharing
Why a Warehouse?
 Two Approaches:
 Query-Driven (Lazy)
 Warehouse (Eager)
Data warehouses
[Diagram: Interfaces → Data Warehouse → Mediator → Wrappers → sources (PDB, SwissProt, SCoP, dbEST)]
 Interfaces
 provide intuitive access to the data
 possibly change data format to meet user expectations
 Warehouse
 stores a consistent view of data in a local repository
 Mediator
 transforms data from source format to warehouse format
 Wrappers
 read data from source into internal representation
Metadata
Metadata is defined as data providing information about one or more aspects
of the data, such as:
 Means of creation of the data
 Purpose of the data
 Time and date of creation
 Creator or author of data
 Placement on a computer network where the data was created
 Standards used
Examples: Libraries
Metadata has been used in various forms as a means of cataloguing archived
information
The Dewey Decimal System employed by libraries for the classification of
library materials is an early example of metadata usage. Library catalogues
used 3x5 inch cards to display a book's title, author, subject matter, and a
brief plot synopsis along with an abbreviated alpha-numeric identification
system

Metadata (continued)
Photographs
 Metadata may be written into a digital photo file that will identify who owns
it, copyright & contact information, what camera created the file, along with
exposure information and descriptive information such as keywords about the
photo, making the file searchable on the computer and/or the Internet
Video
 Metadata is particularly useful in video, where information about its contents
(such as transcripts of conversations and text descriptions of its scenes) are not
directly understandable by a computer, but where efficient search is desirable.
Web pages
 Web pages often include metadata in the form of meta tags
 Description and keywords meta tags are commonly used to describe the Web
page's content
 Most search engines use this data when adding pages to their search index.
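As a small illustration of the meta tags just described, the sketch below pulls name/content pairs out of a page using only Python's standard html.parser; the page snippet itself is invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment using the description and keywords
# meta tags described above.
PAGE = """
<html><head>
<meta name="description" content="Course notes on data warehousing">
<meta name="keywords" content="OLAP, ETL, metadata">
</head><body>...</body></html>
"""

class MetaTagParser(HTMLParser):
    """Collects name -> content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"]] = d["content"]

parser = MetaTagParser()
parser.feed(PAGE)
```

A search engine indexer would run something like this over every fetched page before adding it to the index.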

Metadata (continued)
• Data Warehouse Metadata
– All of the information in the data warehouse environment that is not the
actual data itself
• Operational Database Metadata Elements
– Tables (names, descriptions, definitions)
– Attributes (names, descriptions, definitions)
– Relationships
– Formulae
• Data Warehouse Metadata Elements
– – Transformations
Tables (names, descriptions, definitions)
– – Synonyms
Attributes (names, descriptions, definitions)
– – Alias
Relationships
– – Source/target info
Formulae
– Versions
– etc.
25
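To make the element lists concrete, here is a minimal sketch of one warehouse metadata entry as a Python dataclass; the class and field names are illustrative inventions, not the schema of any real metadata repository.

```python
from dataclasses import dataclass, field

# Hypothetical metadata record; fields mirror the warehouse metadata
# elements listed above (transformations, synonyms, source info, version).
@dataclass
class TableMetadata:
    name: str
    description: str
    source: str = ""                     # source/target info
    transformations: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    version: int = 1

sales = TableMetadata(
    name="fact_sales",
    description="Daily sales facts",
    source="orders (OLTP)",
    transformations=["aggregate by day", "currency to USD"],
    synonyms=["sales_fact"],
)
```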
The Traditional Research Approach
 Query-driven (lazy, on-demand)
Clients

Integration System Metadata

...
Wrapper Wrapper Wrapper

...
Source Source Source

26
Disadvantages of Query-Driven
Approach
 Delay in query processing
 Slow or unavailable information sources
 Complex filtering and integration
 Inefficient and potentially expensive for
frequent queries
 Competes with local processing at sources
 Hasn’t caught on in industry

Advantages of Warehousing Approach
 High query performance
 But not necessarily most current information
 Doesn’t interfere with local processing at sources
 Complex queries at warehouse
 OLTP at information sources
 Information copied at warehouse
 Can modify, annotate, summarize, restructure, etc.
 Can store historical information
 Security, no auditing
 Has caught on in industry

Warehouse Architecture: Central
[Diagram: Clients → Central Data Warehouse → Sources]

Central DW
• All data in one, central DW
• All client queries directly on the central DW
• Pros
– Simplicity
– Easy to manage
• Cons
– Bad performance due to no redundancy / workload distribution
Warehouse Architecture: Federated
[Diagram: End Users → local data marts (Financial, Marketing, Distribution) → logical Data Warehouse → Sources]

Federated DW
• Data stored in separate data marts, aimed at special departments
• The DW is only logical, i.e., "virtual"
• The data marts contain detail data
• Like Kimball's DW Bus concept
• Pros
– Performance due to distribution
• Cons
– More complex
Warehouse Architecture: Tiered
[Diagram: Central Data Warehouse → data marts → aggregated cubes]

Tiered Architecture
• Central DW is materialized
• Data is distributed to data marts in one or more tiers
• Only aggregated data in cube tiers
• Data is aggregated/reduced as it moves through tiers
• Pros
– Best performance due to redundancy + distribution
• Cons
– Most complex
– Hard to manage
Traditional data warehouse architecture
[Diagram slide]

Hend MADHOUR, Data warehouse, EPFL 2005
Data Warehouse: A Multi-Tiered Architecture
[Diagram – Data Sources: Operational DBs, other sources → Extract / Transform / Load / Refresh → Data Storage: Data Warehouse, Data Marts, with Metadata and Monitor & Integrator → OLAP Engine: OLAP Server → Front-End Tools: Analysis, Query, Reports, Data mining]

Source: Data Mining: Concepts and Techniques (J. Han and M. Kamber)
Organizational Data Flow and Data Storage Components
Loading the Data Warehouse
[Diagram: Source Systems (OLTP) → Data Staging Area → Data Warehouse]
 Data is periodically extracted
 Data is cleansed and transformed
 Users query the data warehouse
Data Warehouse Architectures
 Generic Two-Level Architecture
 Independent Data Mart
 Dependent Data Mart and Operational
Data Store
 Logical Data Mart and @ctive Warehouse
 Three-Layer architecture

All involve some form of extraction, transformation and loading (ETL)


Views on DW Metadata

 Most DW projects see DW architecture as a "stepwise flow" of information from source to analyst
 No conceptual domain model used for integration → some questions cannot be answered
 DWQ project: extended metamodel to capture all relevant aspects
DWQ Metadata

 Three metadata perspectives must be captured
– Conceptual (enterprise)
– Logical (data model)
– Physical (data flow)
 This framework is instantiated by conceptual, logical, and physical information models
 However, DW quality is mostly dependent on the DW processes rather than schemas
 Thus, a process meta model is needed to capture process definitions, and the relationships to DW quality
Data Warehousing in the context of an enterprise
[Diagram: numbered steps 1) – 5) tracing information flow through the enterprise, from operational departments up to the analyst]
Using DW Metadata in the Enterprise

 Analyst wants to know something about the enterprise
 Must gather information from operational departments through OLTP systems
 Question travels through 1) – 5)
 However, a traditional DW (like on the previous slide) only describes steps 3) + 4) precisely
 Cannot answer questions like "why can't I answer question X?"
 Conceptual relationships between enterprise model, operational models + DW must be captured
 Everything is a view on the enterprise model ("local as view") – unlike previous slide
Proposed Data Warehouse metadata framework
[Diagram: repository structure covering the product, process, and quality of data warehousing]
Generic two-level architecture (src: J. Hoffer, M. Prescott, F. McFadden)
[Diagram: E, T, L into one, company-wide warehouse]
 Periodic extraction → data is not completely current in warehouse
Independent Data Mart
[Diagram: separate E, T, L flows into each data mart]
 Data marts: mini-warehouses, limited in scope
 Separate ETL for each independent data mart
 Data access complexity due to multiple data marts
Dependent data mart with operational data store
[Diagram: single E, T, L into the enterprise data warehouse; data marts loaded from it]
 ODS provides option for obtaining current data
 Single ETL for enterprise data warehouse (EDW)
 Simpler data access
 Dependent data marts loaded from EDW
Logical data mart and @ctive data warehouse
[Diagram: near real-time E, T, L into the warehouse]
 ODS and data warehouse are one and the same
 Near real-time ETL for @ctive Data Warehouse
 Data marts are NOT separate databases, but logical views of the data warehouse
 → Easier to create new data marts
Operational view of the construction of a Data Warehouse
[Layered diagram, bottom to top:
– Data sources
– PREPARATION: Data Extraction; Data Archival (logs, deltas and histories); Data Cleaning
– INTEGRATION: Data Integration; integrated data; operational source data
– AGGREGATION: History; high-level aggregation; Data Archiving; Corporate Data Warehouse
– CUSTOMIZATION: Data Marts]

ETL/DW Refreshment
[Refreshment workflow diagram]
Data Reconciliation
• Typical operational data is:
– Transient – not historical
– Not normalized (perhaps due to denormalization for
performance)
– Restricted in scope – not comprehensive
– Sometimes poor quality – inconsistencies and errors
• After ETL, data should be:
– Detailed – not summarized yet
– Historical – periodic
– Normalized – 3rd normal form or higher
– Comprehensive – enterprise-wide perspective
– Quality controlled – accurate with full integrity

The ETL Process

• Capture
• Scrub or data cleansing
• Transform
• Load and Index

ETL = Extract, transform, and load

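The four steps just listed can be sketched as one small pipeline over plain Python dicts; every name here (etl, scrub_fn, amt, and so on) is invented for illustration, not part of any real ETL tool.

```python
# Minimal ETL pipeline sketch: capture -> scrub -> transform -> load.
def etl(source_rows, scrub_fn, transform_fn, warehouse):
    captured = list(source_rows)                      # capture: snapshot
    cleaned = [scrub_fn(r) for r in captured]         # scrub: cleanse
    transformed = [transform_fn(r) for r in cleaned]  # transform
    for row in transformed:                           # load and index
        warehouse[row["id"]] = row                    # dict keyed by id
    return warehouse

wh = etl(
    [{"id": 1, "amt": "10"}],                              # raw source row
    scrub_fn=lambda r: {**r, "amt": int(r["amt"])},        # fix a type error
    transform_fn=lambda r: {**r, "amt_usd": r["amt"]},     # reformat a field
    warehouse={},
)
```

The next four slides expand each stage of this pipeline in turn.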
Steps in data reconciliation (src: J. Hoffer, M. Prescott, F. McFadden)

 Capture = extract… obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
 Static extract = capturing a snapshot of the source data at a point in time
 Incremental extract = capturing changes that have occurred since the last static extract
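A hedged sketch of the two extract modes, on invented in-memory rows; the updated_at change-tracking column is an assumption, standing in for whatever change-capture mechanism a real source provides.

```python
from datetime import datetime

# Hypothetical source table rows.
source_rows = [
    {"id": 1, "amount": 100, "updated_at": datetime(2020, 1, 5)},
    {"id": 2, "amount": 250, "updated_at": datetime(2020, 2, 10)},
    {"id": 3, "amount": 75,  "updated_at": datetime(2020, 3, 1)},
]

def static_extract(rows):
    """Snapshot of the full chosen subset at a point in time."""
    return list(rows)

def incremental_extract(rows, since):
    """Only rows that changed since the last extract."""
    return [r for r in rows if r["updated_at"] > since]

snapshot = static_extract(source_rows)
delta = incremental_extract(source_rows, since=datetime(2020, 2, 1))
```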
Steps in data reconciliation (continued)

 Scrub = cleanse… uses pattern recognition and AI techniques to upgrade data quality
 Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
 Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Steps in data reconciliation (continued)

 Transform = convert data from format of operational system to format of data warehouse
 Record-level:
– Selection – data partitioning
– Joining – data combining
– Aggregation – data summarization
 Field-level:
– Single-field – from one field to one field
– Multi-field – from many fields to one, or one field to many
Steps in data reconciliation (continued)

 Load/Index = place transformed data into the warehouse and create indexes
 Refresh mode: bulk rewriting of target data at periodic intervals
 Update mode: only changes in source data are written to data warehouse
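Refresh and update mode, sketched with an in-memory dict standing in for the warehouse table (keyed by primary key; all names are assumptions):

```python
def load_refresh(warehouse, transformed_rows):
    """Refresh mode: bulk rewrite of the target data."""
    warehouse.clear()
    warehouse.update({r["id"]: r for r in transformed_rows})

def load_update(warehouse, changed_rows):
    """Update mode: only changed source rows are written."""
    for r in changed_rows:
        warehouse[r["id"]] = r

wh = {}
load_refresh(wh, [{"id": 1, "v": "old"}, {"id": 2, "v": "x"}])
load_update(wh, [{"id": 1, "v": "new"}])   # only row 1 changed
```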
Architecture is the proper arrangement of the components.
[Diagram – Source Data: Production, Internal, Archived, External → Data Staging → Data Storage: Data Warehouse DBMS, Data Marts, Multi-dimensional DBs → Information Delivery: Data Mining, OLAP, Report/Query; Metadata and Management & Control span all components]
Data Staging
 This function is time-consuming
 Initial load moves very large volumes of data
 The business conditions determine the refresh cycles
[Diagram: Data Sources → DATA WAREHOUSE; base data load, followed by daily, monthly, quarterly, and yearly refresh cycles]
Definition
 Data Warehouse:
 A subject-oriented, integrated, time-variant, non-updatable
collection of data used in support of management decision-
making processes
 Subject-oriented: e.g. customers, patients, students,
products
 Integrated: Consistent naming conventions, formats,
encoding structures; from multiple data sources
 Time-variant: Can study trends and changes
 Nonupdatable: Read-only, periodically refreshed
 Data Mart:
 A data warehouse that is limited in scope

Characteristics of a Data Warehouse

 Subject oriented – organized based on use


 Integrated – inconsistencies removed
 Nonvolatile – stored in read-only format
 Time variant – data are normally time series
 Summarized – in decision-usable format
 Large volume – data sets are quite large
 Non normalized – often redundant
 Metadata – data about data are stored
 Data sources – comes from nonintegrated sources
— W. H. Inmon
A Data Warehouse is
Subject Oriented
Data in a Data Warehouse are Integrated
The extracted data are derived from processing systems, where data must be accurate when accessed by end users.

At the moment the data are extracted for the data warehouse, they are accurate, but moments later they will no longer reflect the true state of the business.

But because the data were accurate as of some meaningful moment of time (e.g., last hour, last month, and so forth), they are said to be time-variant.

Another distinguishing characteristic of the data collected for a data warehouse is that there is absolutely no attempt to keep the data up to date. Therefore, there is essentially no UPDATE operation on the data warehouse data.

The data are read-only and are said to be nonvolatile.

OLTP vs. OLAP
• OLTP: On-Line Transaction Processing
– Many short transactions (queries + updates)
– Examples:
• Update account balance
• Enroll in course
• Add book to shopping cart
– Queries touch small amounts of data (one record or a few records)
– Updates are frequent
– Concurrency is biggest performance concern
• OLAP: On-Line Analytical Processing
– Long transactions, complex queries
– Examples:
• Report total sales for each department in each month
• Identify top-selling books
• Count classes with fewer than 10 students
– Queries touch large amounts of data
– Updates are infrequent
– Individual queries can require lots of resources
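The contrast can be seen in miniature with sqlite3 from Python's standard library: an OLTP-style single-record update next to an OLAP-style full-table aggregation (table and column names are invented).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, dept TEXT, amount REAL)")
con.executemany("INSERT INTO sales (dept, amount) VALUES (?, ?)",
                [("books", 10.0), ("books", 15.0), ("toys", 7.0)])

# OLTP-style: touch one record, update it.
con.execute("UPDATE sales SET amount = 12.0 WHERE id = 1")

# OLAP-style: scan the whole table, aggregate per department.
totals = dict(con.execute(
    "SELECT dept, SUM(amount) FROM sales GROUP BY dept"))
```

On a table of three rows both are instant; at warehouse scale the full scan is exactly the workload that competes with transactions, as the next slides explain.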
Data Warehouse vs. Operational DBMS

                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day-to-day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date;          historical; summarized,
                     detailed, flat relational;    multidimensional;
                     isolated                      integrated, consolidated
usage                repetitive                    ad-hoc
access               read/write,                   lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100 MB – GB                   100 GB – TB
metric               transaction throughput        query throughput, response time
Why OLAP & OLTP don’t mix (1)
Different performance requirements
• Transaction processing (OLTP):
– Fast response time important (< 1 second)
– Data must be up-to-date, consistent at all times
• Data analysis (OLAP):
– Queries can consume lots of resources
– Can saturate CPUs and disk bandwidth
– Operating on static “snapshot” of data usually OK
• OLAP can “crowd out” OLTP transactions
– Transactions are slow → unhappy users
• Example:
– Analysis query asks for sum of all sales
– Acquires lock on sales table for consistency
– New sales transaction is blocked
Why OLAP & OLTP don’t mix (2)
Different data modeling requirements

• Transaction processing (OLTP):


– Normalized schema for consistency
– Complex data models, many tables
– Limited number of standardized queries and updates
• Data analysis (OLAP):
– Simplicity of data model is important
• Allow semi-technical users to formulate ad hoc queries
– De-normalized schemas are common
• Fewer joins → improved query performance
• Fewer tables → schema is easier to understand
Why OLAP & OLTP don’t mix (3)
Analysis requires data from many sources
• An OLTP system targets one specific process
– For example: ordering from an online store
• OLAP integrates data from different processes
– Combine sales, inventory, and purchasing data
– Analyze experiments conducted by different labs
• OLAP often makes use of historical data
– Identify long-term patterns
– Notice changes in behavior over time
• Terminology, schemas vary across data sources
– Integrating data from disparate sources is a major challenge
Data Warehouses
• Doing OLTP and OLAP in the same database
system is often impractical
– Different performance requirements
– Different data modeling requirements
– Analysis queries require data from many sources
• Solution: Build a “data warehouse”
– Copy data from various OLTP systems
– Optimize data organization, system tuning for OLAP
– Transactions aren’t slowed by big analysis queries
– Periodically refresh the data in the warehouse
Need for Data Warehousing
 Integrated, company-wide view of high-quality
information (from disparate databases)
 Separation of operational and informational systems
and data (for improved performance)

Source: adapted from Strange (1997).