
Data Warehouse and Data Mining
Prof. Dr. M. S. Memon
sulleman@quest.edu.pk

M.S. Memon
05/20/23 Department of CSE, QUEST 1
Outline
• Introduction to Data Warehouse

• Data Warehouse versus Operational Database

• OLTP vs. DW

• Applications of DW

Source: www.stonebridgegroup.com
Data Warehouse
• Purpose of the Data Warehouse
– Realize the value of the data
• Data / Information is an asset
• Data / Information can be sold
• Methods to realize the value – Reporting, Analysis, Data Mining, etc.
• Make better decisions!!!
– Turn data into Information
– Create competitive advantages
– Methods to support decision making process – DSS etc

Why data warehouse?
• Bad decisions can lead to disasters
– Data Warehousing is at the base of decision support systems
• Data warehousing is a data-driven decision-support system
• Data warehousing helps to
– Understand the information hidden within the organization’s data
• See data from different angles: product, client, time, geographical area
• Get a glimpse of the future.
Why data warehouse?
• DBMS Approach
– List of all items that were sold last month?
– List of all items purchased by Naeem?
– The total sales of the last month grouped by branch?
– How many sales transactions occurred during the month of January?
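
The questions above are plain lookups and aggregates over current data. A minimal sketch with Python's built-in sqlite3 module follows; the `sales` table, its columns and the sample rows are assumptions made up for illustration and are not part of the lecture.

```python
import sqlite3

# Hypothetical operational table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item TEXT, customer TEXT, branch TEXT, "
    "sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("pen", "Naeem", "Nawabshah", "2023-01-05", 2.0),
        ("book", "Naeem", "Karachi", "2023-01-17", 10.0),
        ("bag", "Ali", "Karachi", "2023-02-02", 25.0),
    ],
)

# List of all items purchased by Naeem.
items_naeem = [
    row[0]
    for row in conn.execute(
        "SELECT item FROM sales WHERE customer = ? ORDER BY item", ("Naeem",)
    )
]

# How many sales transactions occurred during January?
jan_count = conn.execute(
    "SELECT COUNT(*) FROM sales "
    "WHERE sale_date BETWEEN '2023-01-01' AND '2023-01-31'"
).fetchone()[0]

# Total sales of the month grouped by branch.
by_branch = dict(
    conn.execute(
        "SELECT branch, SUM(amount) FROM sales "
        "WHERE sale_date LIKE '2023-01-%' GROUP BY branch"
    )
)
```

Each query touches a few rows of current data and is known in advance, which is the typical workload of an operational DBMS.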

Why data warehouse?
• Intelligent Enterprise
– Which items sell together? Which items to stock?
– Where and how to place the items? What discounts to offer?
– How best to target customers to increase sales at a branch?
– Which customers are most likely to respond to the next promotional campaign, and why?
Why data warehouse?
• Businesses want much more …
– What happened?
– Why did it happen?
– What will happen?
– What is happening?
– What do you want to happen?

What is Data warehouse?
• Basically a very large database…
– Not all very large databases are data warehouses, but all data warehouses are pretty large databases
– Nowadays a warehouse is considered to start at around 800 GB and goes up to several TB
– It spans over several servers and needs an impressive amount of computing power

What is Data warehouse?
• More specifically, a collective data repository
– Containing snapshots of the operational data (history)
– Obtained through data cleansing and ETL (Extract-Transform-Load)
– Useful for analytics

What is Data warehouse?
• Compared to other solutions it…
– Is suitable for tactical/strategic focus
– Implies a small number of transactions
– Implies large transactions spanning over a long period of time

Definition
• Ralph Kimball: “a copy of transaction data
specifically structured for query and analysis”

Data Warehouse (definitions)
• Used for decision making, Duplicates existing
data, Combination of hardware, specialized
software and data – Dyche
• A copy of transaction data specifically structured
for query and analysis – Kimball
• A single, complete and consistent store of data
obtained from a variety of different sources made
available to end users in a way that can be
understood and used in business context – Barry
Devlin
Data Warehouse (definitions)
• A data warehouse is a database where data is collected for the purpose of being analyzed
• A data warehouse is used to help people make better decisions
• A data warehouse is defined by the use to which it is put, not its underlying architecture
Attributes of DWH
• Bill Inmon (father of data warehousing, in 1993):
A Data Warehouse is a:
• subject oriented
• integrated
• non-volatile
• time-variant
collection of data in support of management’s decisions

Data Warehouse
• Subject oriented: Data is arranged by subject
area rather than by application. Data is
organized so that all the data elements relating
to the same real-world event or object are
linked together

– Typical subject areas in DWs are Customer, Product, Order, Claim, Account,…
Data Warehouse
• Subject oriented:
– Example: customer as subject in a DW
• DW is organized in this case by the customer
• It may consist of 10, 100 or more physical tables, all
related

Data Warehouse
• Integrated: Data is collected from multiple, diverse operational systems of an organization and stored in one consistent form
– E.g. consistent gender codes, measurement units and keys across sources
Data Warehouse
• Non-volatile: Data in the data warehouse is never
over-written or deleted - once committed, the
data is static, read-only, and retained for future
reporting. Data is loaded, but not updated
– When subsequent changes occur, a new snapshot record
is written.
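
A minimal sketch of the snapshot idea, assuming a hypothetical in-memory history list in place of a real warehouse table (all names and sample values are illustrative): a change never overwrites a row, it appends a new dated snapshot, and older states stay queryable.

```python
from datetime import date

# Hypothetical append-only store: one row per snapshot of a customer.
customer_history = []

def record_snapshot(history, customer_id, city, as_of):
    """Append a new snapshot; existing rows are never updated or deleted."""
    history.append({"customer_id": customer_id, "city": city, "as_of": as_of})

def city_as_of(history, customer_id, when):
    """Return the city valid at `when` (latest snapshot not after it), or None."""
    rows = [
        r for r in history
        if r["customer_id"] == customer_id and r["as_of"] <= when
    ]
    return max(rows, key=lambda r: r["as_of"])["city"] if rows else None

record_snapshot(customer_history, 1, "Karachi", date(2021, 1, 1))
# The customer moves: a second snapshot is written, the first one is kept.
record_snapshot(customer_history, 1, "Nawabshah", date(2023, 3, 1))
```

Because every snapshot carries its own date, the same structure also supports the time-variant property: a report can ask for the state at any past date.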

Data Warehouse
• Time-variant: The changes to the data in the
data warehouse are tracked and recorded so
that reports can be produced showing changes
over time.
– Different environments have different time horizons associated with them
• While for operational systems a 60-to-90 day time horizon is normal, a data warehouse has a 5-to-10 year horizon
General Definition
• More generally, a DW is a
– Repository of an organization’s electronically stored data
– Designed to facilitate reporting and analysis
General Definition
A complete repository of historical corporate data
extracted from transaction systems that is available
for ad-hoc access by knowledge workers
• Transaction Systems
– Management Information System (MIS)
• Ad-hoc access
– Does not have a certain access pattern
– Queries not known in advance
– Difficult to write SQL in advance
• Knowledge workers
– Typically NOT IT literate (Executives, Analysts, Managers)
Data Warehousing
• A paradigm specifically designed for
strategic business information or decision
making

• Data warehousing is a data-driven decision-support system
Typical Features
• DW typically…
– Reside on computers dedicated to this function
– Run on DBMS such as Oracle, IBM DB2, Teradata or
Microsoft SQL Server
– Retain data for long periods of time
– Consolidate data obtained from a variety of sources
– Are built around their own carefully designed data
model

What can be warehoused?
• Customer records
• Customer purchases
• Click stream, web traffic
• Product records
• Product purchase records
• Inventory movement

How does it work?
[Figure: request cycle between business users and IT people. Labels: business user needs info; user requests; IT people do system analysis and design; IT people create reports; IT people send reports to business user; business user may get answers; answers result in more questions.]
Data Warehouse vs. Operational Database

Data Warehouse        Operational Database
• Subject oriented    • Application oriented
• Integrated          • Multiple diverse sources
• Non-volatile        • Updateable
• Time-variant        • Real-time, current
On Line Transaction Processing
• OLTP (OnLine Transaction Processing):
– Also known under the name of operational data, it
represents day-to-day operational business activities:
• Purchasing, sales, production distribution, …
– Typically for data entry and retrieval transaction
processing
– Reflects only the current state of the data

On Line Analytical Processing
• OLAP (OnLine Analytical Processing):
– Represents front-end analytics based on a DW
repository
– It provides information for activities like:
• Resource planning, capital budgeting, marketing initiatives,...
– It is decision oriented

OLTP vs. DW
• Properties
Operational DB             DW
Mostly updates             Mostly reads
Many small transactions    Queries long, complex
MB-TB of data              GB-PB of data
Raw data                   Summarized data
Clerical users             Decision makers
Up-to-date data            May be slightly outdated
OLTP vs. DW
                    OLTP                         Data Warehouse
users               clerk, IT professional       knowledge worker
function            day to day operations        decision support
DB design           application-oriented         subject-oriented
data                current, up-to-date,         detailed, historical,
                    flat relational,             summarized, multidimensional,
                    isolated                     integrated, consolidated
usage               repetitive                   ad-hoc
access              read/write,                  lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction    complex query
# records accessed  tens                         millions
# users             thousands                    hundreds
DB size             100MB-GB                     100GB-TB
metric              transaction throughput       query throughput, response
OLTP vs. DW
• Consider a normalized database for a store; its tables would look like:
OLTP vs. DW
• DW for that store would start by building the
following star schema:

OLTP vs. DW
• Basic insights from comparing OLTP and DWs
– A DW is a separate (RDBMS) installation that contains
copies of data from on-line systems
• Physically separate hardware may not be absolutely necessary
if one has lots of extra computing power, but it is
recommended
– With an optimistic locking DBMS one might even be
able to get away for a while with keeping just one copy
of its data

OLTP vs. DW
• There is an essentially different pattern of
hardware utilization between on-line and
analytical processing

Applications of DW
• Typical questions which can be answered with
DW & OLAP
– How much did sales unit A earn in January?
– How much did sales unit B earn in February?
– What was their combined sales amount for the first quarter?
• Answering these questions with SQL-queries is
difficult
– Complex query formulation necessary
– Process is likely to be slow due to complex joins and
multiple scans

Applications of DW
• Why can such questions be answered better with a DW?
– Because in a DW tables are rearranged and pre-aggregated (known as computing cubes)
• The table arrangement is subject oriented, usually some star schema
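
A minimal sketch of pre-aggregation, assuming a hypothetical list of detailed sales facts (units, months and amounts are made up): the facts are summed once per (sales unit, month) cell, and later questions are answered from those cells instead of rescanning the detail rows.

```python
from collections import defaultdict

# Hypothetical detailed facts: (sales_unit, month, amount).
facts = [
    ("A", 1, 100.0), ("A", 1, 50.0), ("A", 2, 70.0),
    ("B", 1, 30.0), ("B", 2, 80.0), ("B", 3, 40.0),
]

# Pre-aggregate once: one cell per (sales_unit, month).
cube = defaultdict(float)
for unit, month, amount in facts:
    cube[(unit, month)] += amount

def total(cube, units, months):
    """Sum the pre-aggregated cells for the given units and months."""
    return sum(cube.get((u, m), 0.0) for u in units for m in months)

a_january = total(cube, ["A"], [1])              # unit A in January
b_february = total(cube, ["B"], [2])             # unit B in February
combined_q1 = total(cube, ["A", "B"], [1, 2, 3])  # both units, first quarter
```

A real warehouse would store such cells in summary tables or an OLAP engine; the dictionary here only illustrates the principle.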

Applications of DW
• A DW is the base repository for front-end analytics
– OLAP
– KDD
– Data visualization
– Reporting

KDD (Knowledge Discovery in Databases): a data mining process
Applications of DW
• OLAP is a form of information processing and thus
needs to provide timely, accurate and
understandable information
– timely is however a relative term:
• In OLTP one expects an update to go through in a matter of
seconds
• In OLAP the time to answer a query can take minutes, hours
or even longer
• There are many flavors of OLAP
– ROLAP, DOLAP, MOLAP, WOLAP, HOLAP,…

Applications of DW
– Data mining might return the following set of rules for
customers spending more than €100:
• IF AGE > 35 AND CAR = ‘MINIVAN’ THEN TOTAL SPENT > €100
• IF SEX = ‘M’ AND ZIP = 38106 THEN TOTAL SPENT > €100
– It answers questions like
• Which products or customers are more profitable
• Which outlets have sold the least this year
– In consequence it motivates decisions like
• Which products should have their production increased
• Which customers should be targeted for special promotions
• Which outlets should be closed
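
As a sketch of how the two mined rules above could be used for targeting, assuming hypothetical customer records whose field names mirror the rule attributes (the records themselves are made up):

```python
# Hypothetical customer records; field names mirror the rule attributes.
customers = [
    {"id": 1, "age": 42, "car": "MINIVAN", "sex": "F", "zip": 38017},
    {"id": 2, "age": 29, "car": "SEDAN", "sex": "M", "zip": 38106},
    {"id": 3, "age": 31, "car": "MINIVAN", "sex": "F", "zip": 38106},
]

# The two rules from the slide, each predicting TOTAL SPENT > 100.
rules = [
    lambda c: c["age"] > 35 and c["car"] == "MINIVAN",
    lambda c: c["sex"] == "M" and c["zip"] == 38106,
]

def likely_big_spenders(customers, rules):
    """Return ids of customers matched by at least one rule."""
    return [c["id"] for c in customers if any(rule(c) for rule in rules)]

targets = likely_big_spenders(customers, rules)
```

Customers 1 and 2 each satisfy one rule, while customer 3 satisfies neither, so only the first two would be selected for a promotion.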
DW User
• Users of DW are called DSS analysts and usually
are business persons
– Their primary job is to define and discover information
used in corporate decision-making
– The way they think
• “Give me what I say I want, and then I can tell you what I really
want”
• They work in explorative manner

DW User
– Typical explorative line of work
• “Ah! Now that I see what the possibilities are, I can tell what I
really want to see. But until I know what the possibilities are, I
cannot describe exactly what I want...”

– This usage has a profound effect on the way a DW is developed
• The classical system development life cycle assumes that the
requirements are known at the start of design
• The DSS analyst starts with existing requirements, but
factoring in new requirements is almost impossible

Lifecycle of Data warehouse

Outline
• Lifecycle of DW

• Classical SDLC vs. DW SDLC

• Operating DW

Lifecycle of DW

DW System Development Life Cycle (SDLC)


• Design
– End-user interview cycles
– Source system cataloging
– Definition of key performance indicators
– Mapping of decision-making processes underlying
information needs
– Logical and physical schema design

Lifecycle of DW
• Prototype
– Objective is to constrain and in some cases reframe
end-user requirements
• Deployment
– Development of documentation
– Training
– Operations and management processes
• Operation
– Day-to-day maintenance of the DW needs good management of the ongoing Extraction, Transformation and Loading (ETL) process
Lifecycle of DW
• Enhancement needs the modification of
– HW - physical components
– Operations and management processes
– Logical schema designs

Lifecycle of DW

• Classical SDLC vs. DW SDLC

• DW SDLC is almost the opposite of classical SDLC
Lifecycle of DW
• Classical SDLC vs. DW SDLC

• Because it is the opposite of SDLC, DW SDLC is also called CLDS
Lifecycle of DW
• CLDS is a data driven development life cycle
• It starts with data
– Once data is at hand it is integrated and tested against
bias
– Programs are written against the data and the results
are analyzed and finally the requirements of the
system are understood
– Once requirements are understood, adjustments are
made to the design and the cycle starts all over
• “spiral development methodology”
Operating a DW
• In Operating a DW the following phases can be
identified
– Monitoring
– Extraction
– Transforming
– Loading
– Analyzing

Operating a DW: Monitoring
• Monitoring
– Surveillance of the data sources
– Identification of data modifications which are relevant to the DW
– Monitoring plays an important role in the whole process: it decides which data the next steps are applied to
• Monitoring techniques
– Active mechanisms - Event Condition Action (ECA)
rules:

Operating a DW: Monitoring
• Monitoring techniques
– Replication mechanisms
• Snapshot:
– Local copy of data, similar to a View
– Used by Oracle 9i
• Data replication
– Replicates and maintains data in destination tables through data
propagation processes
– Used by IBM

Operating a DW: Monitoring
• Monitoring techniques
– Protocol based mechanisms
• Since DBMS write protocol data for transaction management,
the protocol can be used also for monitoring
• Difficult due to the fact that the protocol format is proprietary and
subject to change
– Application managed mechanisms
• Hard to implement for legacy systems
• Based on time stamping or data comparison
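
The two application-managed techniques named above can be sketched as follows. This is an illustration under assumptions: rows are plain dictionaries with a hypothetical `modified_at` field, and a source table is modelled as a key-to-value mapping.

```python
def changed_since(rows, last_run):
    """Time stamping: keep rows modified after the previous extraction."""
    return [r for r in rows if r["modified_at"] > last_run]

def diff_snapshots(old, new):
    """Data comparison: compare two snapshots (key -> value) of a source.

    Returns the sorted keys that were inserted, updated and deleted.
    """
    inserted = sorted(k for k in new if k not in old)
    updated = sorted(k for k in new if k in old and new[k] != old[k])
    deleted = sorted(k for k in old if k not in new)
    return inserted, updated, deleted
```

Time stamping only works if the source maintains a modification time, which is one reason the slide calls this approach hard for legacy systems; snapshot comparison needs no such column but requires keeping the previous copy.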

Operating a DW: Extraction
• Extraction
– Reads the data which was selected throughout the
monitoring phase and inserts it in the data structures of
the workplace
– Due to large data volume, compression can be used
– The time-point for performing extraction can be:
• Periodical:
– Weather or stock market information can be updated several times a day, while product specifications can be updated over a longer period of time
• On request:
– For example when a new item is added to a product group

Operating a DW: Extraction
• Extraction
– The time-point for performing extraction can be:
• Event driven:
– Event driven extraction can be helpful in scenarios where a point in time, or the number of modifications exceeding a specified threshold, triggers the extraction. For example each night at 03:00, or each time 50 new modifications have taken place, an extraction is performed
• Immediate:
– In some special cases like the stock market it can be necessary
that the changes propagate immediately to the warehouse
– The extraction largely depends on the hardware and the software used for the DW and the data source
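
The threshold variant of event-driven extraction described above (extract after every 50 modifications) can be sketched as follows. The class name and the counter standing in for the real extraction job are hypothetical.

```python
class ExtractionTrigger:
    """Counts source modifications and fires an extraction at a threshold."""

    def __init__(self, threshold=50):
        self.threshold = threshold
        self.pending = 0      # modifications seen since the last extraction
        self.extractions = 0  # how many extractions have been triggered

    def on_modification(self):
        """Register one modification; extract once the threshold is reached."""
        self.pending += 1
        if self.pending >= self.threshold:
            self.extract()

    def extract(self):
        # Placeholder for the real extraction job (assumption).
        self.extractions += 1
        self.pending = 0
```

The time-based variant (each night at 03:00) would normally be left to a scheduler instead of application code.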
Operating a DW: Transforming
• Transforming
– Implies adapting data, schema as well as data quality
to the application requirements
– Data integration:
• Transformation into de-normalized data structures
• Handling of key attributes
• Adaptation of different types of the same data
• Conversion of encoding:
– “Buy”, “Sell” → 1, 2 vs. B, S → 1, 2
• Normalization:
– “Michael Schumacher” → “Michael, Schumacher” vs. “Schumacher Michael” → “Michael, Schumacher”
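
A minimal sketch of the encoding conversion and name normalization shown above. The mapping table and function names are illustrative, and the name handling assumes simple two-part names whose order is known per source system.

```python
# Both source encodings map to the same warehouse codes.
ACTION_CODES = {"Buy": 1, "Sell": 2, "B": 1, "S": 2}

def encode_action(value):
    """Convert a source-specific action label to the warehouse code."""
    return ACTION_CODES[value.strip()]

def normalize_name(raw, surname_first=False):
    """Return the name as 'First, Last'.

    Assumes exactly two name parts; `surname_first` states the order
    used by the source system.
    """
    first, last = raw.split()
    if surname_first:
        first, last = last, first
    return f"{first}, {last}"
```

Real names need more care (middle names, multi-word surnames), so a production transformation would usually rely on per-source parsing rules.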
Operating a DW: Transforming
• Transforming
– Data integration:
• Date handling:
– “MM-DD-YYYY” → “MM.DD.YYYY”
• Measurement units and scaling:
– 10 inch → 25.4 cm
– 30 mph → 48.28 km/h
• Save calculated values
– Price_incl_VAT = Price_excl_VAT * 1.19
• Aggregation
– Daily sums can be added into weekly ones
– Different levels of granularity can be used
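
The transformations on this slide can be sketched with the standard library alone. The function names are hypothetical; the 19% VAT rate is taken from the slide's factor of 1.19, and weekly aggregation is assumed to mean ISO calendar weeks.

```python
from collections import defaultdict
from datetime import date, datetime

def reformat_date(text):
    """Date handling: 'MM-DD-YYYY' -> 'MM.DD.YYYY'."""
    return datetime.strptime(text, "%m-%d-%Y").strftime("%m.%d.%Y")

def inch_to_cm(inches):
    return inches * 2.54

def mph_to_kmh(mph):
    return mph * 1.609344

def price_incl_vat(price_excl_vat, rate=0.19):
    """Calculated value stored with the record, rounded to cents."""
    return round(price_excl_vat * (1 + rate), 2)

def weekly_sums(daily):
    """Aggregation: roll daily sums (date -> amount) up to ISO weeks."""
    weekly = defaultdict(float)
    for day, amount in daily.items():
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        weekly[(iso[0], iso[1])] += amount
    return dict(weekly)
```

Storing the weekly figures next to the daily ones is what gives the warehouse several levels of granularity.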

Operating a DW: Transforming
• Transforming
– Data cleaning:
• Consistency check
– E.g. flag records where Delivery_date < Order_date
• Completeness
– Management of missing values as well as NULL values
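
Both cleaning checks can be sketched as one validation function. The record layout and field names are assumptions for illustration, and a missing value is modelled as an absent key or `None`.

```python
from datetime import date

REQUIRED_FIELDS = ("order_id", "order_date", "delivery_date")

def find_problems(order):
    """Return a list of data-quality problems found in one order record."""
    problems = []
    # Completeness: a required field is absent or None (NULL).
    for field in REQUIRED_FIELDS:
        if order.get(field) is None:
            problems.append(f"missing {field}")
    # Consistency: a delivery cannot precede its order.
    ordered = order.get("order_date")
    delivered = order.get("delivery_date")
    if ordered is not None and delivered is not None and delivered < ordered:
        problems.append("delivery_date before order_date")
    return problems

good = {"order_id": 1, "order_date": date(2023, 5, 1), "delivery_date": date(2023, 5, 4)}
bad = {"order_id": 2, "order_date": date(2023, 5, 10), "delivery_date": date(2023, 5, 2)}
incomplete = {"order_id": 3, "order_date": date(2023, 5, 10), "delivery_date": None}
```

Whether a flagged record is rejected, repaired or loaded with a marker is a policy decision that this sketch leaves open.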

Operating a DW: Loading
• Loading
– Loading usually takes place during weekends or nights
when the system is not under user stress
– Split between initial load to initialize the DW and the
periodical load to keep the DW updated
– Initial loading
• Implies big volumes of data and for this reason a bulk loader is
used
– Usually performed by partitioning, parallelization and incremental updating
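
The split between initial and periodic load can be sketched with sqlite3. The `fact_sales` table and the rule "load only rows with a higher `sale_id` than already present" are assumptions for illustration; a real bulk loader would bypass row-by-row inserts entirely.

```python
import sqlite3

def initial_load(conn, rows):
    """Initial load: create the fact table and insert all rows in one batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(sale_id INTEGER PRIMARY KEY, amount REAL)"
    )
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)

def incremental_load(conn, rows):
    """Periodic load: insert only rows newer than what is already loaded."""
    last = conn.execute(
        "SELECT COALESCE(MAX(sale_id), 0) FROM fact_sales"
    ).fetchone()[0]
    new_rows = [r for r in rows if r[0] > last]
    with conn:
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", new_rows)
    return len(new_rows)
```

The incremental step returns how many rows were actually added, which is a convenient figure to log for each nightly run.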

Operating a DW: Analyzing
• Analyze
– Data access
• Useful for extracting goal oriented information:
– How many iPhone 3G units were sold in the Braunschweig stores of T-Mobile in the last 3 calendar weeks of 2008?
– Although it is a common OLTP query, it might be too complex for the operational environment to handle
– OLAP
• Often wrongly equated with the DW because it is used to analyze the data contained in the DW
• Used to answer requests like:
– In which district does a product group register the highest profit?
– How did the profit change in comparison to the previous month?
Operating a DW: Analyzing
• Analyze
– OLAP
• Mostly known as organized on a multidimensional data model
• Common analysis operations are:
– Pivoting/Rotation
– Roll-up, Drill-down and Drill-across
– Slice and Dice
– Data mining
• Useful for identifying hidden patterns
• Refers to two separate processes:
– KDD (Knowledge Discovery in Databases)
– Prediction
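
Two of the OLAP operations listed above, roll-up and slice, can be sketched on a tiny multidimensional data set. The cube is modelled as a dictionary keyed by (district, product group, month); the dimensions and profit figures are made up for illustration.

```python
from collections import defaultdict

# Hypothetical cube cells: (district, product_group, month) -> profit.
cells = {
    ("North", "Phones", 1): 10.0,
    ("North", "Phones", 2): 12.0,
    ("North", "Laptops", 1): 7.0,
    ("South", "Phones", 1): 4.0,
    ("South", "Laptops", 2): 9.0,
}

def roll_up(cells, keep):
    """Roll-up: aggregate away every dimension whose index is not in `keep`."""
    out = defaultdict(float)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

def slice_cube(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    return {k: v for k, v in cells.items() if k[dim] == value}

by_district = roll_up(cells, keep=[0])        # profit per district
january = slice_cube(cells, dim=2, value=1)   # only month 1
```

Drill-down is the reverse of roll-up (keeping more dimensions), and dice is a slice on several dimensions at once.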
Operating a DW: Analyzing
• Analyze
– Data mining
• Useful for answering questions like:
– How did the sales of this product group evolve?
• Methods and procedures for data mining
– Clustering, Classification, Regression, Association rule learning
