
Data Warehouse

CS 408
Concepts and Architectures
Database vs Data warehouse
• A database is any collection of data organized for storage,
accessibility, and retrieval.

• A data warehouse is a type of database that integrates
copies of transaction data from disparate source systems
and provisions them for analytical use.

Database vs Data warehouse
• A database is a collection of related data which
represents some elements of the real world. It is designed
to be built and populated with data for a specific task. It is
also a building block of your data solution.
• A data warehouse is an information system that stores
historical and cumulative data from single or multiple
sources. It is designed to analyze, report on, and integrate
transaction data from different sources.
• A data warehouse eases an organization's analysis and
reporting processes. It also serves as a single version of
the truth for the organization's decision making and
forecasting.
Business intelligence
• Business intelligence is the delivery of accurate, useful
information to the appropriate decision makers within the
necessary timeframe to support effective decision-making.

Data warehouse
• A data warehouse is a system that retrieves and
consolidates data periodically from the source systems
into a dimensional or normalized data store. It usually
keeps years of history and is queried for business
intelligence or other analytical activities. It is typically
updated in batches, not every time a transaction happens
in the source system.

Data Mart
• A data mart is a subset of a data warehouse. It is defined
as a body of historical data in an electronic repository that
does not participate in the daily operations of the organization.
Instead, this data is used to create business intelligence.
The data in a data mart usually applies to a specific
area of the organization.

Fact Table
• A fact table is the primary table in a dimensional model
where the numerical performance measurements of the
business are stored. We try to store the measurement
data resulting from a business process in a single data
mart.

• The most common transformations to fact data include
transformation of null values, pivoting or unpivoting the
data, and precomputing derived calculations.
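The three fact-data transformations named above can be illustrated with a minimal sketch. The source rows, column names, and the rule of replacing a missing measure with 0 are assumptions made for illustration; a real warehouse would choose its own null-handling policy.

```python
from typing import Optional

# Hypothetical wide-format source rows: one column per month, with a gap.
source_rows = [
    {"store": "S1", "jan_units": 10, "feb_units": None, "unit_price": 2.5},
    {"store": "S2", "jan_units": 4, "feb_units": 6, "unit_price": 3.0},
]


def null_to_zero(value: Optional[int]) -> int:
    """Replace a missing measure with 0 (an assumed policy, not a universal rule)."""
    return 0 if value is None else value


def to_fact_rows(rows):
    """Unpivot the month columns into one fact row per (store, month)."""
    facts = []
    for row in rows:
        for month in ("jan", "feb"):
            units = null_to_zero(row[f"{month}_units"])
            facts.append({
                "store": row["store"],
                "month": month,
                "units": units,
                # Precomputed derived calculation: units times unit price.
                "sales_dollar": units * row["unit_price"],
            })
    return facts


fact_rows = to_fact_rows(source_rows)
for fact in fact_rows:
    print(fact)
```

Each source row becomes two fact rows, so queries can filter and group by month without knowing the original column layout.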

Dimension Table
• A dimension table is an integral companion to a fact table.
The dimension tables contain the textual descriptors of
the business. In a well-designed dimensional model,
dimension tables have many columns or attributes. These
attributes describe the rows in the dimension table.
Dimension tables tend to be relatively shallow in terms of
the number of rows (often far fewer than 1 million rows)
but are wide with many large columns. Dimension tables
are the entry points into the fact table. The dimensions
implement the user interface to the data warehouse.
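How a dimension table acts as the entry point into a fact table can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows below are hypothetical; the pattern shown is that dimension attributes group the query while fact measures are aggregated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim (
    store_id INTEGER PRIMARY KEY,
    store_city TEXT,
    store_state TEXT
);
CREATE TABLE sales_fact (
    sales_no INTEGER PRIMARY KEY,
    store_id INTEGER REFERENCES store_dim(store_id),
    sales_units INTEGER,
    sales_dollar REAL
);
INSERT INTO store_dim VALUES
    (1, 'Denver', 'CO'), (2, 'Boulder', 'CO'), (3, 'Seattle', 'WA');
INSERT INTO sales_fact VALUES
    (1, 1, 2, 20.0), (2, 1, 1, 10.0), (3, 2, 5, 50.0), (4, 3, 3, 30.0);
""")

# The textual dimension attribute (store_state) groups the result;
# the numeric fact measures are summed.
totals_by_state = conn.execute("""
    SELECT d.store_state, SUM(f.sales_units), SUM(f.sales_dollar)
    FROM sales_fact AS f
    JOIN store_dim AS d ON d.store_id = f.store_id
    GROUP BY d.store_state
    ORDER BY d.store_state
""").fetchall()
conn.close()

print(totals_by_state)
```

The fact table stays narrow (keys and measures), while descriptive columns live in the dimension table.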

OLAP DB
• An online analytical processing (OLAP) database is a
technology for storing, managing, and querying data
specifically designed to support business intelligence
uses.

ETL
• An Extract, Transform, and Load (ETL) system is a set
of processes that clean, transform, combine, de-duplicate,
archive, conform, and structure data for use in the data
warehouse.
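A few of the ETL steps listed above (clean, conform, combine, de-duplicate) can be sketched in plain Python. The two source extracts, the field names, and the rules (reject rows with a missing state, keep the first row per customer key) are assumptions for illustration only.

```python
# Hypothetical extracts from two source systems with inconsistent formats.
crm_rows = [
    {"cust_id": "C1", "name": " Ada Lovelace ", "state": "co"},
    {"cust_id": "C2", "name": "Alan Turing", "state": "WA"},
]
billing_rows = [
    {"cust_id": "C1", "name": "Ada Lovelace", "state": "CO"},  # duplicate of CRM C1
    {"cust_id": "C3", "name": "Grace Hopper", "state": None},  # fails the cleaning rule
]


def transform(row):
    """Clean and conform one row: trim names, uppercase state codes."""
    if row["state"] is None:
        return None  # rejected; a real system would log it or route it for repair
    return {
        "cust_id": row["cust_id"],
        "name": row["name"].strip(),
        "state": row["state"].upper(),
    }


def run_etl(*sources):
    """Combine the sources, transform each row, keep the first row per customer key."""
    warehouse = {}
    rejected = []
    for source in sources:
        for row in source:
            clean = transform(row)
            if clean is None:
                rejected.append(row)
            elif clean["cust_id"] not in warehouse:  # de-duplicate on cust_id
                warehouse[clean["cust_id"]] = clean
    return list(warehouse.values()), rejected


loaded, rejected = run_etl(crm_rows, billing_rows)
print(loaded)
print(rejected)
```

Dedicated data integration tools add scheduling, logging, and recovery around steps like these.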

PivotTable
• A PivotTable is a powerful tool to calculate, summarize,
and analyze data that lets you see comparisons, patterns,
and trends in your data.

• A PivotTable is an interactive way to quickly
summarize large amounts of data. You can use a
PivotTable to analyze numerical data in detail and
answer unanticipated questions about your data. A
PivotTable is especially designed for querying large
amounts of data in many user-friendly ways.
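The summarization a PivotTable performs can be approximated in a few lines of Python. This is only a sketch of the idea (row labels, column labels, summed values) with made-up sales records; it is not how any spreadsheet tool implements the feature.

```python
from collections import defaultdict

# Hypothetical sales records: (region, quarter, amount).
sales = [
    ("East", "Q1", 100), ("East", "Q2", 150),
    ("West", "Q1", 80), ("West", "Q1", 20), ("West", "Q2", 90),
]


def pivot_sum(records):
    """Return {row_label: {column_label: summed value}} for (row, column, value) records."""
    table = defaultdict(lambda: defaultdict(int))
    for row_label, col_label, value in records:
        table[row_label][col_label] += value
    return {row: dict(cols) for row, cols in table.items()}


pivot = pivot_sum(sales)
for region in sorted(pivot):
    print(region, pivot[region])
```

Swapping the row and column positions in the input tuples gives the transposed view, which is the kind of rearrangement a PivotTable makes interactive.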

Data Warehousing for Business Intelligence
• Database management essentials
• Data warehouse concepts, design and data integration
• Relational database support for data warehouses
• Business intelligence concepts, tools, and applications
• Design and build a data warehouse for business
intelligence implementation
Targeted Learners

• University students
• IT professionals
• Project managers
• Business analysts

Broad Course Objectives
• Establish an initial foundation of data warehouse background for
business intelligence careers
• Gain conceptual background about business architectures,
management practices, and data warehouse development
methodologies
• Create data warehouse designs, data integration workflows, and
pivot table operations
• Reflect on business architecture selection, data warehouse design
methodologies, and data integration goals and constraints

Prerequisite
• Introductory database course
• Background about relational databases, query
formulation, data modeling, and normalization
• Basic knowledge of Algorithms
• Basic Data Structure Concepts

Course Topics
[Figure: data warehouse architecture. Operational databases and an external data source pass through extraction and transformation processes (with a staging area) into the data warehouse server, which holds detailed and summarized data and the EDM; data marts in the data mart tier serve user departments.]

• Architectures
• Pivot Table Manipulation
• Schema Design
• Data Integration


Course Flow
1.
• Motivation and characteristics
• Architectures
• Project characteristics and maturity model
• Employment opportunities

2.
• Multi-dimensional data model
• Microsoft MDX language
• Pivot table tool practice

3.
• Schema patterns and summarizability problems
• Schema integration practice
• Enterprise data warehouse development

4.
• Data integration process concepts
• Change data characteristics
• Data integration techniques

5.
• Architectures and features of data integration tools
• Overview of Talend and Pentaho tools
• Practice with Pentaho Data Integration
Tools

Decision Making Hierarchy

Decision making hierarchy   Typical decisions
Top (strategic)             Identify new markets, choose store locations
Middle (tactical)           Choose suppliers, forecast sales
Lower (operational)         Resolve order delays, schedule employees
Technology and Deployment Limitations

[Figure: three limitations motivating data warehouse technology and deployments: lack of integration, missing DBMS features, and performance limitations.]
Technology and Deployment Limitations
* Performance limitation
- Performance problems when a single database serves both transaction
processing and business intelligence decision making
- Never solved. Use a separate database

* Lack of integration
- Lack of integration with transaction databases and external data
sources
- Add value: integrate, standardize, clean, and summarize both internal
and external data sources

Technology and Deployment Limitations
* Missing features for summary data
- Storage and optimization techniques for summary queries
- Data modeling approaches
- Support for precomputed query results
- Support for different business analyst query tools
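Support for precomputed query results can be illustrated with a small sketch. SQLite has no materialized views, so an ordinary table built with CREATE TABLE ... AS SELECT stands in for one here; the table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_detail (sale_year INTEGER, sale_month INTEGER, sales_dollar REAL);
INSERT INTO sales_detail VALUES
    (2018, 1, 100.0), (2018, 1, 50.0), (2018, 2, 75.0), (2019, 1, 40.0);
-- Precompute the monthly totals once, for example after each batch load.
CREATE TABLE sales_by_month AS
    SELECT sale_year, sale_month, SUM(sales_dollar) AS total_dollar
    FROM sales_detail
    GROUP BY sale_year, sale_month;
""")

# A summary query reads the small precomputed table instead of the detail rows.
monthly_2018 = conn.execute(
    "SELECT sale_month, total_dollar FROM sales_by_month "
    "WHERE sale_year = 2018 ORDER BY sale_month"
).fetchall()
conn.close()

print(monthly_2018)
```

The trade-off is that the summary table must be refreshed whenever the detail data changes, which fits a warehouse that is loaded in batches.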

Data Warehouse Characteristics
• Essential part of infrastructure for business intelligence
• Logically centralized repository for decision making
– Populated from operational databases and external data sources
– Integrated and transformed data
– Optimized for reporting and periodic integration

Comparison of Processing Environments
Transaction processing
• Primary data from transactions
• Daily operations and short-term decisions

Business intelligence processing
• Transformed secondary data
• Medium and long-term decisions

Data Comparison
Characteristic        Operational Database      Data Warehouse
Currency              Current                   Historical
Detail level          Individual transactions   Individual and summary
Orientation           Process                   Subject
Records per request   Few                       Thousands
Normalization level   Mostly normalized         Normalization relaxed (not important)
Update level          Highly volatile           Mostly refreshed / fetched (non-volatile)
Data model            Relational                Relational (star schemas) and multidimensional (data cubes)

* A star schema is a convention for organizing the data into dimension tables, fact tables,
and materialized views. All data is stored in columns, and metadata is needed to identify the
columns that function as multidimensional objects.
Schema Comparison
Operational database (ER diagram)
• Employee: EmpNo, EmpFirstName, EmpLastName, ...
• Customer: CustNo, CustFirstName, CustLastName, ...
• Order: OrdNo, OrdDate, ...
• Product: ProdNo, ProdName, ProdQOH, ...
• Relationships: Manages, Takes, Places, Contains (Qty)

Data warehouse (star schema)
• Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
• Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
• Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
• Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
• TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
• Relationships: StoreSales, ItemSales, CustSales, TimeSales
Challenges in Data Warehouse Projects

• Substantial coordination across organizational units
• Uncertain data quality in data sources
• Difficult to scale data warehouse

Intangible Benefits
• Includes:
– Brand Recognition
– Employee expertise
– Management skills
• Not easily quantified but important for an organization’s
success
• May also include increased data quality
– Fewer missing values
– More matched entities
– More data availability
– Higher levels of compliance with data standards

Intangible Benefits
• Intangible benefits may become tangible over
time, e.g.:
– Increased revenue and reduced expenses
– A data warehouse may enable reduced losses due to
improved fraud detection.
– Improved customer attention through target marketing
– Reduction of inventory carrying costs through
improved demand forecasting

Learning Curve for Skills

Learning Curve for Production

[Figure: Learning Curve for Production, plotting Effort (y-axis, 1 to 21) against Units (x-axis, 0 to 11).]

Maturity Relationships
[Figure, two panels:
Business Value Learning Curve: business value (y-axis) against time (x-axis).
Data Transformation Learning Curve: transformation cost (y-axis) against time (x-axis), annotated "To resolve data quality problems".
Caption fragment: "Between a data warehouse is deployed".]
Project Relationships

[Figure, two panels:
Potential Value: business value (y-axis) against scope (x-axis), measured by the number of data sources used.
Project Risk: risk (y-axis; time to completion, uncertain completion) against scope (x-axis), measured by the number of organizational units served by the DWH.]

• Do not build a DWH with a large scope in a single project


Important Issues
• Involves: DWH architecture, scope, and integration level
• Architecture
– Refers to “organizational components to support specified goals”
– For data warehouses, business goals drive technology choices
– That means the architecture of a DWH is more an organizational issue
than a technology issue
• Data warehouse scope
– Refers to the “breadth of an organization supported by a DWH”, measured by
the number of data sources used and the number of organizational units
providing inputs to or using the DWH
• Integration level
– Refers to “several quality indicators across data sources”, involving
completeness, consistency, conformity, and duplication across data
sources
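Three of the quality indicators named under integration level (completeness, conformity, duplication) can be computed with a short sketch. The rows, the field names, and the "two uppercase letters" standard for state codes are assumptions for illustration; consistency checks are omitted because they depend on rules specific to each pair of sources.

```python
# Hypothetical customer rows pulled from two sources.
rows = [
    {"cust_id": "C1", "state": "CO", "zip": "80202"},
    {"cust_id": "C2", "state": "Colorado", "zip": "80301"},  # nonconforming state code
    {"cust_id": "C3", "state": "WA", "zip": None},           # missing value
    {"cust_id": "C1", "state": "CO", "zip": "80202"},        # duplicate key
]


def completeness(rows, field):
    """Fraction of rows with a non-missing value for the field."""
    return sum(r[field] is not None for r in rows) / len(rows)


def conformity(rows, field, is_valid):
    """Fraction of non-missing values that satisfy the assumed standard."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(is_valid(v) for v in values) / len(values)


def duplication(rows, key):
    """Fraction of rows whose key already appeared in an earlier row."""
    seen, repeats = set(), 0
    for r in rows:
        if r[key] in seen:
            repeats += 1
        seen.add(r[key])
    return repeats / len(rows)


def two_letter_upper(code):
    """Assumed standard: a state code is exactly two uppercase letters."""
    return len(code) == 2 and code.isalpha() and code.isupper()


zip_completeness = completeness(rows, "zip")
state_conformity = conformity(rows, "state", two_letter_upper)
id_duplication = duplication(rows, "cust_id")
print(zip_completeness, state_conformity, id_duplication)
```

Tracking such ratios per data source gives a rough, comparable measure of how much integration work a source still needs.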
Architecture Choices

Top Down
• Enterprise data warehouse
• Higher integration levels
• Logically centralized
• Larger project scope

Bottom Up
• Independent data marts
• Lower integration levels
• Logically decentralized
• Smaller project scope

Top-Down Architecture
[Figure: top-down architecture. Operational databases and an external data source pass through extraction and transformation processes (with a staging area) into the data warehouse server, which holds detailed and summarized data and the EDM; data marts in the data mart tier serve user departments.]
Bottom-up Architecture
[Figure: bottom-up architecture. Operational databases and an external data source feed a transformation process that loads data marts in the data mart tier, which serve user departments.]
Maturity Model Stages

[Figure: Data Warehouse Maturity Model. © Eckerson 2007]
Maturity Model Insights

• Stages provide a framework to view an organization’s
progress
• Guidance for investment decisions
• Difficulty moving between stages
– Infant to child stages because of investment level
– Teenager to adult because of strategic importance of data
warehouse

Advantages of Business Intelligence
• To gain competitive advantage
• To shift from product focus to customer focus
• To identify new markets
• To focus more on profitable customers
• To improve retention of customers
• To reduce inventory costs

Traditional Applications

Industry             Key Applications
Airline              Yield management, route assessment
Telecommunications   Customer retention, network design
Insurance            Risk assessment, product design, fraud detection
Retail               Target marketing, supply-chain management

Data Mining
• Discover significant, implicit patterns
– Target promotions
– Change mix and collocation of items
• Requires large volumes of transaction data including
sensor data and social media interactions
• Important tools for business intelligence
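One simple form of pattern discovery behind decisions such as item collocation is counting how often pairs of items are bought together. The sketch below uses made-up market baskets and plain pair counting; it is an illustration of the idea, not a full association-rule mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (one set of items per transaction).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def pair_support(baskets):
    """Return {item pair: fraction of baskets containing both items}."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(baskets)
top_pair, top_support = max(support.items(), key=lambda kv: kv[1])
print(top_pair, top_support)
```

Pairs with high support are candidates for placing items near each other or promoting them together.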

Market Shares and Trends
• Major vendors: Teradata, Oracle, IBM, Microsoft, SAP
• Large projected market growth
• Trends
– Real time load and analysis
– Increased storage and analysis of social interactions
– Increased usage of cloud services and appliances

Cloud Influence

[Figure: several server and database pairs.]

• Reduces the local expertise needed to procure technology and manage a data
warehouse
• Economies of scale
• Improved scalability
• Higher variable costs but lower fixed costs
Cloud Service Models

[Figure: cloud service models. Layers: Application (SaaS), Development Platform (PaaS), and Infrastructure (IaaS), with labels for the user organization and the cloud vendor.]

Employment Opportunities

DW Analyst
• Recommend technology solutions
• Define user interfaces
• Collaborate with business analysts and DW managers

DW Manager
• Design, develop, and maintain data warehouses
• Ensure conformance to enterprise standards
• Develop and implement data integration procedures

BI Analyst
• Develop data analysis and reporting solutions
• Mine and analyze data from multiple sources
• Communicate results to management
• Prepare data (reduction and missing values)

Data Analyst
• Document data elements
• Use reporting tools
• Collaborate with business analysts and data architects
• Develop data extraction procedures

Skill-Position Mapping
Position: DW Manager, DW Analyst, BI Analyst
Competency
Communication ▄ █ █
Data cube tools ▄ █ █
Dashboards ▄ █
Data mining ▄ █
Data integration tools █ █
DW schema design █ ▄
Performance analysis █
Quantitative modeling █
SQL extensions █ █ ▄
Salary Trends (USA)

Job Title      2017                  2018                  % Change
DB manager     $101,750 – $140,750   $107,750 – $149,000   5.9%
DB developer   $80,500 – $128,250    $92,000 – $134,500    5.5%
Data analyst   $64,250 – $96,000     $67,750 – $101,000    5.3%
DW manager     $108,750 – $145,750   $115,250 – $154,250   5.9%
DW analyst     $93,500 – $126,500    $99,000 – $133,750    5.8%
BI analyst     $94,250 – $132,500    $101,250 – $142,250   7.4%

Robert Half Salary Survey


Salary Trends (Europe)

Job Title          Country     2017                 2018
DBA                Germany     €40,000 – €55,000    €40,000 – €60,000
Business Analyst   Germany     €55,000 – €85,000    €55,000 – €85,000
DBA                London      £55,000 – £85,000    £55,000 – £80,000
DBA                France      €50,000 – €90,000    €50,000 – €70,000
DBA                Australia   $75,000 – $125,000   $75,000 – $125,000
Business analyst   Australia   $80,000 – $120,000   $80,000 – $120,000

Robert Half Salary Survey
