
DATA WAREHOUSING AND DATA MINING
10BI23

Instructor: Dr. S. Natarajan
Professor and Key Resource Person
Department of Information Science and Engineering
PES Institute of Technology
Bangalore

Text Books:
1. Fundamentals of Data Warehouses – M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis – Springer Verlag, 2003
2. The Data Warehouse Toolkit – Ralph Kimball – Wiley, 2002
3. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations – I. Witten and E. Frank – Morgan Kaufmann, 1999
4. Data Mining: Concepts and Techniques – J. Han and M. Kamber – Morgan Kaufmann, 2000
Textbook authors: M. Jarke, Ralph Kimball, Bill Inmon, Ian Witten, Jiawei Han
[Photo slide]
Introduction to Data Warehouses:


• Introduction
• Heterogeneous information, Integration problem
• Warehouse architecture
• Data Warehousing, Warehouse vs. DBMS
DW Best Practices:
The Most Important Metrics
• Employee satisfaction
– Without it, long-term customer satisfaction is impossible
• Customer satisfaction
– That’s the nature of the Information Services career field
– Some people in our profession still don’t get it
• We are here to serve

• The Organizational Laugh Metric


– How many times do you hear laughter in the day-to-day operations of
your team?
– It is the single most important vital sign to organizational health and
business success
Data Warehousing History

[Timeline: "Newspaper Rock", 100 B.C. → American Retail, 2005 A.D. — "Lots of stuff happened"]
What Happened in the Cloud?
• Stage 1: Laziness
– Operators grew tired of hanging tapes
• In response to requests for historical financial data
– They stored data on-line, in “unauthorized” mainframe databases

• Stage 2: End of the mainframe bully


– Computing moved out from finance to the rest of the business
– Unix and relational databases
– Distributed computing created islands of information

• Stage 2.1: The government gets involved


– Consolidating IRS and military databases to save money on mainframes
– “Hey, look what I can do with this data…”

• Stage 3: Deming comes along


– Push towards constant business “reengineering”
– Cultural emphasis on “continuous quality improvement” and “business
innovation” drives the need for data

• Stage 4: Data warehousing has its own language


– Ralph Kimball publishes “The Data Warehouse Toolkit”

The Real Truth
• Data warehousing is a symptom of a problem
– Technological inability to deploy single-platform
information systems that:
• Capture data once and reuse it throughout an
enterprise
• Support high-transaction rates (single record CREATE,
SELECT, UPDATE, DELETE) and analytic queries on the same
computing platform, with the same data, at the same
time

– Someday, maybe we will address the root cause


• Until then, it’s a good way to make a living

The “Ideal Library” Analogy
• Stores all of the books and other reference material you need to conduct your
research
– The Enterprise data warehouse
• A single place to visit
– One database environment
• Contents are kept current and refreshed
– Timely, well-choreographed data loads
• Staffed with friendly, knowledgeable people that can help you find your way
around
– Your Data Warehouse team
• Organized for easy navigation and use
– Metadata
– Data models
– “User friendly” naming conventions
• Solid architectural infrastructure
– Hardware, software, standards, metrics

Example of DW in Healthcare
[Series of diagram slides; acronyms used: VCT – Voluntary Counselling Therapy, ART – Anti-Retroviral Therapy, ARV – Anti-Retroviral]
Problem: Heterogeneous Information Sources
"Heterogeneities are everywhere"
[Sources: Personal Databases, World Wide Web, Scientific Databases, Digital Libraries]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
Problem: Data Management in Large Enterprises
 Vertical fragmentation of informational systems (vertical stove pipes)
 Result of application (user)-driven development of operational systems
[Example silos – Sales Administration: Sales Planning, Stock Mngmt; Finance: Suppliers, Debt Mngmt; Manufacturing: Num. Control, Inventory; ...]
Technological Issues
 Problem: Acquire data from a set of
sources for a particular application
 Typical architecture: wrappers and mediators
 Core problem: specify and implement
mediators
 Paper focus: Data warehouses
Data Warehouse Integration
 Most sources internal to organization
 Need global corporate view of data
 Conceptual model defines sources and
data warehouse (local-as-view)
 Three levels of architecture
 Conceptual: Global model
 Logical: Query specifications for sources and
warehouse
 Physical: Wrappers and mediators
implementing query specifications
Goal: Unified Access to Data
[Diagram: Integration System drawing on Personal Databases, World Wide Web, Digital Libraries, Scientific Databases]
 Collects and combines information
 Provides integrated view, uniform user interface
 Supports sharing
Why a Warehouse?
 Two Approaches:
 Query-Driven (Lazy)
 Warehouse (Eager)
Data warehouses
[Diagram: Interfaces → Data Warehouse → Mediator → Wrappers → sources (PDB, SwissProt, SCoP, dbEST)]
 Interfaces
 provide intuitive access to the data
 possibly change data format to meet user expectations
 Warehouse
 stores a consistent view of data in a local repository
 Mediator
 transforms data from source format to warehouse format
 Wrappers
 read data from source into internal representation
Metadata
Metadata is defined as data providing information about one or more aspects
of the data, such as:
 Means of creation of the data
 Purpose of the data
 Time and date of creation
 Creator or author of data
 Placement on a computer network where the data was created
 Standards used
Examples: Libraries
Metadata has been used in various forms as a means of cataloguing archived
information
The Dewey Decimal System employed by libraries for the classification of
library materials is an early example of metadata usage. Library catalogues
used 3x5 inch cards to display a book's title, author, subject matter, and a
brief plot synopsis along with an abbreviated alpha-numeric identification
system

Metadata (continued)
Photographs
 Metadata may be written into a digital photo file that will identify who owns
it, copyright & contact information, what camera created the file, along with
exposure information and descriptive information such as keywords about the
photo, making the file searchable on the computer and/or the Internet
Video
 Metadata is particularly useful in video, where information about its contents
(such as transcripts of conversations and text descriptions of its scenes) are not
directly understandable by a computer, but where efficient search is desirable.
Web pages
 Web pages often include metadata in the form of meta tags
 Description and keywords meta tags are commonly used to describe the Web
page's content
 Most search engines use this data when adding pages to their search index.
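As a small illustration of the meta tags just described, the sketch below pulls name/content pairs out of a page using only Python's standard html.parser; the page snippet itself is invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment using the description and keywords
# meta tags described above.
PAGE = """
<html><head>
<meta name="description" content="Course notes on data warehousing">
<meta name="keywords" content="OLAP, ETL, metadata">
</head><body>...</body></html>
"""

class MetaTagParser(HTMLParser):
    """Collects name -> content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"]] = d["content"]

parser = MetaTagParser()
parser.feed(PAGE)
```

A search engine indexer would run something like this over every fetched page before adding it to the index.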

Metadata (continued)
• Data Warehouse Metadata
– All of the information in the data warehouse environment that is not the
actual data itself
• Operational Database Metadata Elements
– Tables (names, descriptions, definitions)
– Attributes (names, descriptions, definitions)
– Relationships
– Formulae
• Data Warehouse Metadata Elements
– – Transformations
Tables (names, descriptions, definitions)
– – Synonyms
Attributes (names, descriptions, definitions)
– – Alias
Relationships
– – Source/target info
Formulae
– Versions
– etc.
25
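To make the element lists concrete, here is a minimal sketch of one warehouse metadata entry as a Python dataclass; the class and field names are illustrative inventions, not the schema of any real metadata repository.

```python
from dataclasses import dataclass, field

# Hypothetical metadata record; fields mirror the warehouse metadata
# elements listed above (transformations, synonyms, source info, version).
@dataclass
class TableMetadata:
    name: str
    description: str
    source: str = ""                     # source/target info
    transformations: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    version: int = 1

sales = TableMetadata(
    name="fact_sales",
    description="Daily sales facts",
    source="orders (OLTP)",
    transformations=["aggregate by day", "currency to USD"],
    synonyms=["sales_fact"],
)
```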
The Traditional Research Approach
 Query-driven (lazy, on-demand)
Clients

Integration System Metadata

...
Wrapper Wrapper Wrapper

...
Source Source Source

26
Disadvantages of Query-Driven
Approach
 Delay in query processing
 Slow or unavailable information sources
 Complex filtering and integration
 Inefficient and potentially expensive for
frequent queries
 Competes with local processing at sources
 Hasn’t caught on in industry

Advantages of Warehousing Approach
 High query performance
 But not necessarily most current information
 Doesn’t interfere with local processing at sources
 Complex queries at warehouse
 OLTP at information sources
 Information copied at warehouse
 Can modify, annotate, summarize, restructure, etc.
 Can store historical information
 Security, no auditing
 Has caught on in industry

Warehouse Architecture: Central
[Diagram: Clients → Central Data Warehouse → Sources]

Central DW
• All data in one, central DW
• All client queries directly on the central DW
• Pros
– Simplicity
– Easy to manage
• Cons
– Bad performance due to no redundancy / workload distribution
Warehouse Architecture: Federated
[Diagram: End Users → local data marts (Financial, Marketing, Distribution) → logical Data Warehouse → Sources]

Federated DW
• Data stored in separate data marts, aimed at special departments
• The DW is only logical, i.e., "virtual"
• The data marts contain detail data
• Like Kimball's DW Bus concept
• Pros
– Performance due to distribution
• Cons
– More complex
Warehouse Architecture: Tiered
[Diagram: Central Data Warehouse → data marts → aggregated cubes]

Tiered Architecture
• Central DW is materialized
• Data is distributed to data marts in one or more tiers
• Only aggregated data in cube tiers
• Data is aggregated/reduced as it moves through tiers
• Pros
– Best performance due to redundancy + distribution
• Cons
– Most complex
– Hard to manage
Traditional data warehouse architecture
[Diagram slide]

Hend MADHOUR, Data warehouse, EPFL 2005
Data Warehouse: A Multi-Tiered Architecture
[Diagram – Data Sources: Operational DBs, other sources → Extract / Transform / Load / Refresh → Data Storage: Data Warehouse, Data Marts, with Metadata and Monitor & Integrator → OLAP Engine: OLAP Server → Front-End Tools: Analysis, Query, Reports, Data mining]

Source: Data Mining: Concepts and Techniques (J. Han and M. Kamber)
Organizational Data Flow and Data Storage Components
Loading the Data Warehouse
[Diagram: Source Systems (OLTP) → Data Staging Area → Data Warehouse]
 Data is periodically extracted
 Data is cleansed and transformed
 Users query the data warehouse
Data Warehouse Architectures
 Generic Two-Level Architecture
 Independent Data Mart
 Dependent Data Mart and Operational
Data Store
 Logical Data Mart and @ctive Warehouse
 Three-Layer architecture

All involve some form of extraction, transformation and loading (ETL)


Views on DW Metadata

 Most DW projects see DW architecture as a "stepwise flow" of information from source to analyst
 No conceptual domain model used for integration → some questions cannot be answered
 DWQ project: extended metamodel to capture all relevant aspects
DWQ Metadata

 Three metadata perspectives must be captured
– Conceptual (enterprise)
– Logical (data model)
– Physical (data flow)
 This framework is instantiated by conceptual, logical, and physical information models
 However, DW quality is mostly dependent on the DW processes rather than schemas
 Thus, a process meta model is needed to capture process definitions, and the relationships to DW quality
Data Warehousing in the context of an enterprise
[Diagram: numbered steps 1) – 5) tracing information flow through the enterprise, from operational departments up to the analyst]
Using DW Metadata in the Enterprise

 Analyst wants to know something about the enterprise
 Must gather information from operational departments through OLTP systems
 Question travels through 1) – 5)
 However, a traditional DW (like on the previous slide) only describes steps 3) + 4) precisely
 Cannot answer questions like "why can't I answer question X?"
 Conceptual relationships between enterprise model, operational models + DW must be captured
 Everything is a view on the enterprise model ("local as view") – unlike previous slide
Proposed Data Warehouse metadata framework
[Diagram: repository structure covering the product, process, and quality of data warehousing]
Generic two-level architecture (src: J. Hoffer, M. Prescott, F. McFadden)
[Diagram: E, T, L into one, company-wide warehouse]
 Periodic extraction → data is not completely current in warehouse
Independent Data Mart
[Diagram: separate E, T, L flows into each data mart]
 Data marts: mini-warehouses, limited in scope
 Separate ETL for each independent data mart
 Data access complexity due to multiple data marts
Dependent data mart with operational data store
[Diagram: single E, T, L into the enterprise data warehouse; data marts loaded from it]
 ODS provides option for obtaining current data
 Single ETL for enterprise data warehouse (EDW)
 Simpler data access
 Dependent data marts loaded from EDW
Logical data mart and @ctive data warehouse
[Diagram: near real-time E, T, L into the warehouse]
 ODS and data warehouse are one and the same
 Near real-time ETL for @ctive Data Warehouse
 Data marts are NOT separate databases, but logical views of the data warehouse
 → Easier to create new data marts
Operational view of the construction of a Data Warehouse
[Layered diagram, bottom to top:
– Data sources
– PREPARATION: Data Extraction; Data Archival (logs, deltas and histories); Data Cleaning
– INTEGRATION: Data Integration; integrated data; operational source data
– AGGREGATION: History; high-level aggregation; Data Archiving; Corporate Data Warehouse
– CUSTOMIZATION: Data Marts]

ETL/DW Refreshment
[Refreshment workflow diagram]
Data Reconciliation
• Typical operational data is:
– Transient – not historical
– Not normalized (perhaps due to denormalization for
performance)
– Restricted in scope – not comprehensive
– Sometimes poor quality – inconsistencies and errors
• After ETL, data should be:
– Detailed – not summarized yet
– Historical – periodic
– Normalized – 3rd normal form or higher
– Comprehensive – enterprise-wide perspective
– Quality controlled – accurate with full integrity

The ETL Process

• Capture
• Scrub or data cleansing
• Transform
• Load and Index

ETL = Extract, transform, and load

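The four steps just listed can be sketched as one small pipeline over plain Python dicts; every name here (etl, scrub_fn, amt, and so on) is invented for illustration, not part of any real ETL tool.

```python
# Minimal ETL pipeline sketch: capture -> scrub -> transform -> load.
def etl(source_rows, scrub_fn, transform_fn, warehouse):
    captured = list(source_rows)                      # capture: snapshot
    cleaned = [scrub_fn(r) for r in captured]         # scrub: cleanse
    transformed = [transform_fn(r) for r in cleaned]  # transform
    for row in transformed:                           # load and index
        warehouse[row["id"]] = row                    # dict keyed by id
    return warehouse

wh = etl(
    [{"id": 1, "amt": "10"}],                              # raw source row
    scrub_fn=lambda r: {**r, "amt": int(r["amt"])},        # fix a type error
    transform_fn=lambda r: {**r, "amt_usd": r["amt"]},     # reformat a field
    warehouse={},
)
```

The next four slides expand each stage of this pipeline in turn.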
Steps in data reconciliation (src: J. Hoffer, M. Prescott, F. McFadden)

 Capture = extract… obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
 Static extract = capturing a snapshot of the source data at a point in time
 Incremental extract = capturing changes that have occurred since the last static extract
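A hedged sketch of the two extract modes, on invented in-memory rows; the updated_at change-tracking column is an assumption, standing in for whatever change-capture mechanism a real source provides.

```python
from datetime import datetime

# Hypothetical source table rows.
source_rows = [
    {"id": 1, "amount": 100, "updated_at": datetime(2020, 1, 5)},
    {"id": 2, "amount": 250, "updated_at": datetime(2020, 2, 10)},
    {"id": 3, "amount": 75,  "updated_at": datetime(2020, 3, 1)},
]

def static_extract(rows):
    """Snapshot of the full chosen subset at a point in time."""
    return list(rows)

def incremental_extract(rows, since):
    """Only rows that changed since the last extract."""
    return [r for r in rows if r["updated_at"] > since]

snapshot = static_extract(source_rows)
delta = incremental_extract(source_rows, since=datetime(2020, 2, 1))
```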
Steps in data reconciliation (continued)

 Scrub = cleanse… uses pattern recognition and AI techniques to upgrade data quality
 Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
 Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Steps in data reconciliation (continued)

 Transform = convert data from format of operational system to format of data warehouse
 Record-level:
– Selection – data partitioning
– Joining – data combining
– Aggregation – data summarization
 Field-level:
– Single-field – from one field to one field
– Multi-field – from many fields to one, or one field to many
Steps in data reconciliation (continued)

 Load/Index = place transformed data into the warehouse and create indexes
 Refresh mode: bulk rewriting of target data at periodic intervals
 Update mode: only changes in source data are written to data warehouse
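Refresh and update mode, sketched with an in-memory dict standing in for the warehouse table (keyed by primary key; all names are assumptions):

```python
def load_refresh(warehouse, transformed_rows):
    """Refresh mode: bulk rewrite of the target data."""
    warehouse.clear()
    warehouse.update({r["id"]: r for r in transformed_rows})

def load_update(warehouse, changed_rows):
    """Update mode: only changed source rows are written."""
    for r in changed_rows:
        warehouse[r["id"]] = r

wh = {}
load_refresh(wh, [{"id": 1, "v": "old"}, {"id": 2, "v": "x"}])
load_update(wh, [{"id": 1, "v": "new"}])   # only row 1 changed
```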
Architecture is the proper arrangement of the components.
[Diagram – Source Data: Production, Internal, Archived, External → Data Staging → Data Storage: Data Warehouse DBMS, Data Marts, Multi-dimensional DBs → Information Delivery: Data Mining, OLAP, Report/Query; Metadata and Management & Control span all components]
Data Staging
 This function is time-consuming
 Initial load moves very large volumes of data
 The business conditions determine the refresh cycles
[Diagram: Data Sources → DATA WAREHOUSE; base data load, followed by daily, monthly, quarterly, and yearly refresh cycles]
Definition
 Data Warehouse:
 A subject-oriented, integrated, time-variant, non-updatable
collection of data used in support of management decision-
making processes
 Subject-oriented: e.g. customers, patients, students,
products
 Integrated: Consistent naming conventions, formats,
encoding structures; from multiple data sources
 Time-variant: Can study trends and changes
 Nonupdatable: Read-only, periodically refreshed
 Data Mart:
 A data warehouse that is limited in scope

Characteristics of a Data Warehouse

 Subject oriented – organized based on use


 Integrated – inconsistencies removed
 Nonvolatile – stored in read-only format
 Time variant – data are normally time series
 Summarized – in decision-usable format
 Large volume – data sets are quite large
 Non normalized – often redundant
 Metadata – data about data are stored
 Data sources – comes from nonintegrated sources
— W. H. Inmon
A Data Warehouse is
Subject Oriented
Data in a Data Warehouse are Integrated
The extracted data are derived from processing systems, where data must be accurate when accessed by end users.

At the moment the data are extracted for the data warehouse, they are accurate, but moments later they will no longer reflect the true state of the business.

But because the data were accurate as of some meaningful moment of time (e.g., last hour, last month, and so forth), they are said to be time-variant.

Another distinguishing characteristic of the data collected for a data warehouse is that there is absolutely no attempt to keep the data up to date. Therefore, there is essentially no UPDATE operation on the data warehouse data.

The data are read-only and are said to be nonvolatile.

OLTP vs. OLAP
• OLTP: On-Line Transaction Processing
– Many short transactions (queries + updates)
– Examples:
• Update account balance
• Enroll in course
• Add book to shopping cart
– Queries touch small amounts of data (one record or a few records)
– Updates are frequent
– Concurrency is biggest performance concern
• OLAP: On-Line Analytical Processing
– Long transactions, complex queries
– Examples:
• Report total sales for each department in each month
• Identify top-selling books
• Count classes with fewer than 10 students
– Queries touch large amounts of data
– Updates are infrequent
– Individual queries can require lots of resources
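The contrast can be seen in miniature with sqlite3 from Python's standard library: an OLTP-style single-record update next to an OLAP-style full-table aggregation (table and column names are invented).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, dept TEXT, amount REAL)")
con.executemany("INSERT INTO sales (dept, amount) VALUES (?, ?)",
                [("books", 10.0), ("books", 15.0), ("toys", 7.0)])

# OLTP-style: touch one record, update it.
con.execute("UPDATE sales SET amount = 12.0 WHERE id = 1")

# OLAP-style: scan the whole table, aggregate per department.
totals = dict(con.execute(
    "SELECT dept, SUM(amount) FROM sales GROUP BY dept"))
```

On a table of three rows both are instant; at warehouse scale the full scan is exactly the workload that competes with transactions, as the next slides explain.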
Data Warehouse vs. Operational DBMS

                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day-to-day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date;          historical; summarized,
                     detailed, flat relational;    multidimensional;
                     isolated                      integrated, consolidated
usage                repetitive                    ad-hoc
access               read/write,                   lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100 MB – GB                   100 GB – TB
metric               transaction throughput        query throughput, response time
Why OLAP & OLTP don’t mix (1)
Different performance requirements
• Transaction processing (OLTP):
– Fast response time important (< 1 second)
– Data must be up-to-date, consistent at all times
• Data analysis (OLAP):
– Queries can consume lots of resources
– Can saturate CPUs and disk bandwidth
– Operating on static “snapshot” of data usually OK
• OLAP can “crowd out” OLTP transactions
– Transactions are slow → unhappy users
• Example:
– Analysis query asks for sum of all sales
– Acquires lock on sales table for consistency
– New sales transaction is blocked
Why OLAP & OLTP don’t mix (2)
Different data modeling requirements

• Transaction processing (OLTP):


– Normalized schema for consistency
– Complex data models, many tables
– Limited number of standardized queries and updates
• Data analysis (OLAP):
– Simplicity of data model is important
• Allow semi-technical users to formulate ad hoc queries
– De-normalized schemas are common
• Fewer joins → improved query performance
• Fewer tables → schema is easier to understand
Why OLAP & OLTP don’t mix (3)
Analysis requires data from many sources
• An OLTP system targets one specific process
– For example: ordering from an online store
• OLAP integrates data from different processes
– Combine sales, inventory, and purchasing data
– Analyze experiments conducted by different labs
• OLAP often makes use of historical data
– Identify long-term patterns
– Notice changes in behavior over time
• Terminology, schemas vary across data sources
– Integrating data from disparate sources is a major challenge
Data Warehouses
• Doing OLTP and OLAP in the same database
system is often impractical
– Different performance requirements
– Different data modeling requirements
– Analysis queries require data from many sources
• Solution: Build a “data warehouse”
– Copy data from various OLTP systems
– Optimize data organization, system tuning for OLAP
– Transactions aren’t slowed by big analysis queries
– Periodically refresh the data in the warehouse
Need for Data Warehousing
 Integrated, company-wide view of high-quality
information (from disparate databases)
 Separation of operational and informational systems
and data (for improved performance)

Source: adapted from Strange (1997).