
Introduction to Data Warehousing and Business Intelligence

Original slides were written by Torben Bach Pedersen

Aalborg University 2007 - DWML course

Overview
• Why Business Intelligence?
• Data analysis problems
• Data Warehouse (DW) introduction
• A tour of the coming DW lectures
• DW Applications
• Loosely covers [Jarke et al.] chapter 1

What is Business Intelligence (BI)?
• BI is different from Artificial Intelligence (AI)
  - AI systems make decisions for the users
  - BI systems help the users make the right decisions, based on available data
• Combination of technologies
  - Data Warehousing (DW)
  - On-Line Analytical Processing (OLAP)
  - Data Mining (DM)
  - Data Visualization (VIS)
  - Decision Analysis (what-if)
  - Customer Relationship Management (CRM)

BI Is Important
• Worldwide BI revenue in 2005 = US$ 5.7 billion
  - 10% growth each year
• The Web makes BI more necessary
  - Customers do not appear “physically” in the store
  - Customers can change to other stores more easily
• Thus:
  - Know your customers using data and BI!
  - Utilize Web logs, analyze customer behavior in more detail than before (e.g., what was not bought?)
  - Combine web data with traditional customer data
Data Analysis Problems
• The same data found in many different systems
  - Example: customer data across different departments
  - The same concept is defined differently
• Heterogeneous sources
  - Relational DBS, On-Line Transaction Processing (OLTP)
  - Unstructured data in files (e.g., MS Excel) and documents (e.g., MS Word)
• Data are suited for operational systems
  - Accounting, billing, etc.
  - Do not support analysis across business functions
• Data quality is bad
  - Missing data, imprecise data, different use of systems
• Data are “volatile”
  - Data deleted in operational systems (6 months)
  - Data change over time – no historical information

Data Warehousing
• Solution: new analysis environment (DW) where data are
  - Subject oriented (versus function oriented)
  - Integrated (logically and physically)
  - Time variant (data can always be related to time)
  - Stable (data not deleted, several versions)
  - Supporting management decisions (different organization)
• A good DW is a prerequisite for successful BI
• “Getting multidimensional data into the DW”
• Data from the operational systems are
  - Extracted
  - Cleansed
  - Transformed
  - Aggregated?
  - Loaded into DW

DW: Purpose and Definition
• DW is a store of information organized in a unified data model
• Data collected from a number of different sources
  - Finance, billing, web logs, personnel, …
• The purpose of a data warehouse (DW) is to support decision making
• Easy to perform advanced analyses
  - Ad-hoc analyses and reports
  - Data mining: discovery of hidden patterns and trends

DW Architecture – Data as Materialized Views
[Figure: existing databases and systems (OLTP; Appl./DB pairs) feed, via transformations (Trans.), the (Global) Data Warehouse (DW) and the (Local) Data Marts (DM); new databases and systems (OLAP) on top of these provide OLAP, data mining, and visualization.]
Analogy: suppliers ↔ supermarket ↔ customers
Quick review of normalized database

Customer ID   Product   Category   Price   Date
3301          Beer      Beverage   6.00    02-02-2007
3301          Rice      Cereal     4.00    02-02-2007
3302          Beer      Beverage   6.00    05-02-2007
3303          Wheat     Cereal     5.00    07-02-2007

Customer ID   ProductID   Date
3301          013         02-02-2007
3301          052         02-02-2007
3302          013         05-02-2007
3303          067         07-02-2007

ProductID   Product   Category   Price
013         Beer      Beverage   6.00
052         Rice      Cereal     4.00
067         Wheat     Cereal     5.00

• Normalized database avoids
  - Redundant data
  - Modification anomalies
• How to get the original table? (join them)
• No redundancy in OLTP, controlled redundancy in OLAP

OLTP vs. OLAP

                         OLTP                      OLAP
Target                   operational needs         business analysis
Data                     small, operational data   large, historical data
Model                    normalized                denormalized/multidimensional
Query language           SQL                       not unified
Queries                  small                     large
Updates                  frequent and small        infrequent and batch
Transactional recovery   necessary                 not necessary
Optimized for            update operations         query operations
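The "join them" answer from the normalized database review can be sketched as follows. This is a minimal illustration using Python's sqlite3 module and the rows from the slide. The table names Purchases and Products are assumptions, since the slide does not name its tables.

```python
import sqlite3

# In-memory database holding the two normalized tables from the slide.
# Table names (Purchases, Products) are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (
    ProductID TEXT PRIMARY KEY,
    Product   TEXT,
    Category  TEXT,
    Price     REAL
);
CREATE TABLE Purchases (
    CustomerID INTEGER,
    ProductID  TEXT REFERENCES Products(ProductID),
    Date       TEXT
);
INSERT INTO Products VALUES
    ('013', 'Beer',  'Beverage', 6.00),
    ('052', 'Rice',  'Cereal',   4.00),
    ('067', 'Wheat', 'Cereal',   5.00);
INSERT INTO Purchases VALUES
    (3301, '013', '02-02-2007'),
    (3301, '052', '02-02-2007'),
    (3302, '013', '05-02-2007'),
    (3303, '067', '07-02-2007');
""")

# Joining on ProductID reconstructs the original wide table.
original = conn.execute("""
SELECT pu.CustomerID, pr.Product, pr.Category, pr.Price, pu.Date
FROM Purchases pu
JOIN Products pr ON pr.ProductID = pu.ProductID
ORDER BY pu.rowid
""").fetchall()

for row in original:
    print(row)
```

Each product's category and price are stored once in Products, so a price change touches one row; the join pays for that at query time.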

Queries hard or infeasible for OLTP
• Business analysis
  - In the past five years, which product is the most profitable?
  - On which public holiday do we have the largest sales?
  - Do the sales of dairy products increase over time?
  - Is there any pattern (correlation) between the sales of beers and the sales of diapers?
  - ……

Function- vs. Subject Orientation
[Figure: function-oriented systems (Appl./DB pairs) feed, via transformations (Trans.), the DW (all subjects, integrated); the DW feeds subject-oriented systems, i.e. data marts (DM, selected subjects) used by D-Appl. applications. Label: Bus architecture.]
n x m versus n + m
[Figure: source systems (Appl./DB) feeding data marts (DM) and their D-Appl. applications, shown once with separate transformations (Trans.) between sources and marts and once through a shared DW. Caption: inflexible, expensive.]

Top-down vs. Bottom-up
[Figure: source systems (Appl./DB), transformations (Trans.), DW, data marts (DM), and D-Appl. applications.]
Top-down:
1. Design of DW
2. Design of DMs
In-between:
1. Design of DW for DM1
2. Design of DM2 and integration with DW
3. Design of DM3 and integration with DW
4. ...
Bottom-up:
1. Design of DMs
2. Maybe integration of DMs in DW
3. Maybe no DW

Multidimensional database design
• Motivation: Why not use ER model?
• Cubes: Dimensions, Facts, Measures
• OLAP queries
• Advanced multidimensional modeling
  - Mainly handling changes in dimensions
• MS SQL server and Analysis Services

Cube Example
• Text-based results difficult for managers to understand
• Why Cube?
  - Good for visualization
  - Multidimensional, intuitive
  - Support OLAP operations
[Figure: 3-D bar chart of Sales (axis 0 to 350) by Year (2000, 2001) and City (Aalborg, Copenhagen, Total).]
On-Line Analytical Processing (OLAP)
• On-Line Analytical Processing
  - Interactive analysis
  - Explorative discovery
  - Fast response times required
• OLAP operations
  - Aggregation, e.g., SUM
  - Starting level, (Year, City)
  - Roll Up: Less detail
  - Drill Down: More detail
  - Slice/Dice: Selection, Year=2000
[Figure: example cube with sales values shown at several aggregation levels, including an "All Time" level.]

Performance Optimization
• Performance optimization
  - Fine tune performance for important queries
  - Aggregates, indexing, other optimizations (environment, partitioning)
• Using aggregates
  - How can aggregates improve performance?
• Choosing aggregates
  - Which aggregates should we materialize?
• Maintaining views
  - How do we keep the (aggregate) views up to date?
• Bitmapped indices
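The OLAP operations listed on the OLAP slide (aggregation with SUM, roll up, drill down, slice) can be illustrated with a small sketch in plain Python. The fact rows and their values are hypothetical and are not taken from the slide's figure; the helper name aggregate is also an assumption.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, city, sales). Values are illustrative only.
facts = [
    (2000, "Aalborg", 20),
    (2000, "Copenhagen", 30),
    (2001, "Aalborg", 25),
    (2001, "Copenhagen", 35),
]

# Position of each dimension inside a fact row.
DIMENSIONS = {"year": 0, "city": 1}


def aggregate(rows, keep):
    """SUM the sales measure, grouped by the dimensions listed in keep."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMENSIONS[d]] for d in keep)
        totals[key] += row[2]
    return dict(totals)


# Starting level (Year, City).
by_year_city = aggregate(facts, ["year", "city"])

# Roll up: drop the City dimension, giving less detail.
by_year = aggregate(facts, ["year"])

# Roll up further to a single grand total over all years and cities.
all_time = aggregate(facts, [])

# Slice: select Year=2000, then aggregate the remaining dimension.
year_2000 = [row for row in facts if row[0] == 2000]
by_city_2000 = aggregate(year_2000, ["city"])

print(by_year_city)
print(by_year)
print(all_time)
print(by_city_2000)
```

Drill down is the reverse of roll up: going from by_year back to by_year_city requires the more detailed rows, which is why the finest level needed has to be stored.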

Materialization Example
• Imagine 1 billion sales rows, 1000 products, 100 locations
• CREATE VIEW TotalSales (pid,locid,total) AS
  SELECT s.pid,s.locid,SUM(s.sales)
  FROM Sales s
  GROUP BY s.pid,s.locid
• The materialized view has at most 100,000 rows (1000 products x 100 locations)
• Rewrite the query to use the view
  - SELECT p.category,SUM(s.sales) FROM Products p, Sales s
    WHERE p.pid=s.pid GROUP BY p.category
  - Rewritten to
  - SELECT p.category,SUM(t.total) FROM Products p, TotalSales t
    WHERE p.pid=t.pid GROUP BY p.category
  - Query becomes 10,000 times faster!

Extract, Transform, Load (ETL)
• “Getting multidimensional data into the DW”
• Extract
• Transformations / cleansing
• Load
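The query rewrite from the materialization example can be tried on a toy data set. The sketch below uses Python's sqlite3 module with a handful of made-up rows. SQLite has no materialized views, so an ordinary table created with CREATE TABLE ... AS stands in for the materialized TotalSales view; that substitution is an assumption of this sketch, not something the slides prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (pid INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE Sales (pid INTEGER, locid INTEGER, sales REAL);

-- Made-up sample data, for illustration only.
INSERT INTO Products VALUES (1, 'Beverage'), (2, 'Cereal'), (3, 'Cereal');
INSERT INTO Sales VALUES
    (1, 10, 6.0), (1, 10, 6.0), (1, 20, 6.0),
    (2, 10, 4.0), (3, 20, 5.0), (3, 20, 5.0);

-- Stand-in for the materialized view: one row per (pid, locid).
CREATE TABLE TotalSales AS
    SELECT s.pid AS pid, s.locid AS locid, SUM(s.sales) AS total
    FROM Sales s
    GROUP BY s.pid, s.locid;
""")

# Original query against the detailed Sales rows.
direct = conn.execute("""
SELECT p.category, SUM(s.sales)
FROM Products p, Sales s
WHERE p.pid = s.pid
GROUP BY p.category
ORDER BY p.category
""").fetchall()

# Rewritten query against the pre-aggregated rows.
rewritten = conn.execute("""
SELECT p.category, SUM(t.total)
FROM Products p, TotalSales t
WHERE p.pid = t.pid
GROUP BY p.category
ORDER BY p.category
""").fetchall()

view_rows = conn.execute("SELECT COUNT(*) FROM TotalSales").fetchone()[0]

# Sizes from the slide: the view has at most products x locations rows.
max_view_rows = 1000 * 100
scan_ratio = 1_000_000_000 // max_view_rows

print(direct, rewritten, view_rows, scan_ratio)
```

Both queries return the same totals because SUM can be computed from partial sums. The 10,000 figure is the ratio of rows scanned (1 billion versus 100,000), which matches the speedup only if query cost is roughly proportional to the number of rows read.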
Data’s Way To The DW
• Extraction
  - Extract from many heterogeneous systems
• Staging area
  - Large, sequential bulk operations => flat files best?
• Cleansing
  - Data checked for missing parts and erroneous values
  - Default values provided and out-of-range values marked
• Transformation
  - Data transformed to decision-oriented format
  - Data from several sources merged, optimize for querying
• Aggregation?
  - Are individual business transactions needed in the DW?
• Loading into DW
  - Large bulk loads rather than SQL INSERTs
  - Fast indexing (and pre-aggregation) required

DW Applications: Visualization
• Graphical presentation of complex result
• Color, size, and form help to give a better overview
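The cleansing step in "Data's Way To The DW" (default values provided, out-of-range values marked) could look roughly like the sketch below. The field names, the default value, and the accepted price range are all hypothetical choices for illustration; a real ETL tool would take such rules from configuration.

```python
# Hypothetical extracted rows; None marks a missing value.
raw_rows = [
    {"customer": 3301, "product": "Beer", "price": 6.00},
    {"customer": 3302, "product": None, "price": 4.00},
    {"customer": 3303, "product": "Wheat", "price": -5.00},
]

# Assumed cleansing rules for this sketch.
DEFAULT_PRODUCT = "UNKNOWN"
PRICE_RANGE = (0.0, 1000.0)


def cleanse(row):
    """Return a cleaned copy: fill in defaults and flag out-of-range prices."""
    cleaned = dict(row)
    if cleaned["product"] is None:
        cleaned["product"] = DEFAULT_PRODUCT
    low, high = PRICE_RANGE
    # The suspicious value is marked, not deleted, so it can be inspected later.
    cleaned["price_out_of_range"] = not (low <= cleaned["price"] <= high)
    return cleaned


cleaned_rows = [cleanse(row) for row in raw_rows]

for row in cleaned_rows:
    print(row)
```

Marking rather than dropping bad values keeps the load complete and leaves the decision about how to treat them to a later step.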

DW Applications: Data Mining
• Data mining is automatic knowledge discovery
• Roots in AI and statistics
• Classification
  - Partition data into pre-defined classes
• Prediction
  - Predict/estimate unknown value based on similar cases
• Clustering
  - Partition data into groups so that the similarity within individual groups is greatest and the similarity between groups is smallest
• Association rule
  - Find associations/dependencies between data that occur together
  - Rules: A -> B (c%,s%): if A occurs, B occurs with confidence c and support s
• Important to choose the granularity for mining
  - No useful results at too small granularity (shirt brand,..)

Data Mining Examples
• Wal-Mart: USA’s largest supermarket chain
  - Has DW with all ticket item sales for the last 5 years (huge!)
  - Use DW and mining intensively to gain business advantages
  - Analysis of association within sales tickets
    ◆ Discovery: Beer and diapers on the same ticket
    ◆ Men buy diapers, and must “just have a beer”
    ◆ Put the expensive beers next to the diapers
  - Wal-Mart's suppliers use the DW to optimize delivery
    ◆ The supplier puts the product on the shelf
    ◆ The supplier only gets paid when the product is sold
• Web log mining
  - What is the association between time of day and requests?
  - What user groups use my site?
  - How many requests does my site get in a month? (Yahoo)
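The rule notation A -> B (c%,s%) from the data mining slide can be made concrete with a short sketch. It assumes the usual definitions, which the slide does not spell out: support is the fraction of all tickets that contain both A and B, and confidence is the fraction of tickets containing A that also contain B. The tickets below are made up.

```python
# Made-up sales tickets, each a set of purchased items.
tickets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"diapers", "milk"},
    {"milk", "bread"},
]


def rule_stats(tickets, a, b):
    """Return (confidence, support) of the rule a -> b over the tickets."""
    n = len(tickets)
    with_a = sum(1 for t in tickets if a <= t)           # tickets containing A
    with_both = sum(1 for t in tickets if (a | b) <= t)  # tickets containing A and B
    support = with_both / n
    confidence = with_both / with_a if with_a else 0.0
    return confidence, support


confidence, support = rule_stats(tickets, {"diapers"}, {"beer"})
print(f"diapers -> beer: confidence {confidence:.0%}, support {support:.0%}")
```

Here diapers appear on three tickets and beer accompanies them on two, so the rule holds with confidence 2/3 and support 2/5.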
Common DW Issues
• Metadata management
  - Need to understand data = metadata needed
  - Greater need than in OLTP applications as “raw” data is used
  - Need to know about:
    ◆ Data definitions, dataflow, transformations, versions, usage, security
• DW project management
  - DW projects are large and different from ordinary SW projects
    ◆ 12-36 months and US$ 1+ million per project
    ◆ Data marts are smaller and “safer” (bottom up approach)
  - Reasons for failure
    ◆ Lack of proper design methodologies
    ◆ High HW+SW cost (not so much anymore)
    ◆ Deployment problems (lack of training)
    ◆ Organizational change is hard… (new processes, data ownership,..)
    ◆ Ethical issues (security, privacy,…)

Summary
• Why Business Intelligence?
• Data analysis problems
• Data Warehouse (DW) introduction
• Analysis technologies that use the DW
  - OLAP
  - Data mining
  - Visualization
• BI can provide many advantages to your organization
  - A good DW is a prerequisite for BI
  - But, a DW is a means rather than a goal…it is only when it is heavily used that success is achieved

DWML Mini Project and Exam
• Performed in groups of ~4 persons
• Documented in report of 20 pages
• Deadline: April 20
  - But every part should be done when indicated on home page
• Basis for discussion at the oral exam (20 mins per person)
  - Maximum 4 persons at a time in exam
• Exam also covers literature
  - Not just mini project
  - Questions in theoretical background, too

DWML Software
• Groups to be formed today!
  - Inform MLY about the groups at 16.00
• MS software via MSDNAA
  - Talk to msdnaa@cs.aau.dk about accounts
• DW software
  - MS SQL Server 2005 RDBMS
  - MS Analysis Services, Integration Services, Reporting Services
  - Read the mini-project webpage (part 1c) for installation details
• Data mining software
  - Presented by Thomas D. Nielsen
