You are on page 1of 44

SCCS 453

Data Warehousing and Data Mining


Lecture 2
Data Warehouse Architecture

Songsri Tangsripairoj, Ph.D.


ccsts@mahidol.ac.th

Department of Computer Science


Faculty of Science, Mahidol University

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 1


Semester 2, Year 2006
Topics
 Data Warehouse Environment
 Characteristics of a Data Warehouse
 Data Warehouse Architecture
 Data Sources
 Data Warehouse & Data Mart
 Back-End Tools and Utilities
 Metadata Repository
 Front-End Tools

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 2


Semester 2, Year 2006
Data Warehouse Environment …
 A data warehouse is a collection of integrated databases
designed to support a DSS.
 An operational data store (ODS) stores data for a
specific application. It feeds the data warehouse a
stream of desired raw data.
 A data mart is a lower-cost, scaled-down version of a
data warehouse, usually designed to support a small
group of users (rather than the entire firm).
 The metadata is information that is kept about the
warehouse.

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 3


Semester 2, Year 2006
Organizational Data Flow and
Data Storage Components

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 4


Semester 2, Year 2006
… Data Warehouse Environment
 The organization’s legacy systems and data stores
provide data to the data warehouse or mart.
 During the transfer of data from the various sources,
cleansing or transformation may occur, so the data in
the DW is more uniform.
 Simultaneously, metadata is recorded.
 Finally, the DW or mart may be used to create one or
more “personal” warehouses.

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 5


Semester 2, Year 2006
Data Warehouse Defined
 Defined in many different ways, but not rigorously.
 A decision support database that is maintained separately from
the organization’s operational database
 Support information processing by providing a solid platform of
consolidated, historical data for analysis.

 “A data warehouse is a subject-oriented, integrated, time-


variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
 Data warehousing:
 The process of constructing and using data warehouses

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 6


Semester 2, Year 2006
Data Warehouse—Subject-Oriented
 DW is organized around major subjects of the enterprise
(e.g. customers, supplier, products, sales) rather than
major application areas (e.g. customer invoicing, stock
control, product sales)
 Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
 Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 7


Semester 2, Year 2006
Subject-Oriented Data
Operational Applications Data Warehouse Subjects

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 8


Semester 2, Year 2006
Data Warehouse—Integrated
 Constructed by integrating multiple, heterogeneous data
sources, which often include data that is inconsistent.
 relational databases, flat files, on-line transaction records

 Data cleaning and data integration techniques are applied.


 Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
 Naming convention
 Gender – M or F, 1 or 2, or 0 or 1
 Attribute measures
 Employee’s term of employment – Days, Weeks, Months, or Years
 Temperature – Fahrenheit, Celsius, or Kelvin scales
 When data is moved to the warehouse, it is converted.
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 9
Semester 2, Year 2006
Integrated Data

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 10


Semester 2, Year 2006
The Data Warehouse is Integrated
Data inconsistencies are removed; data from diverse
operational applications is integrated.
Data from Applications

Savings Data Warehouse Subjects


Account

Checking Subject
Account = Account

Loans
Account
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 11
Semester 2, Year 2006
Data Warehouse—Time-variant
 Data in the warehouse is only accurate and valid at
some point in time or over some time interval
 The time horizon for the data warehouse is significantly
longer than that of operational systems
 Operational database: current value data
 Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not contain “time
element”
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 12
Semester 2, Year 2006
Data Warehouse—Nonvolatile
 Data in the warehouse is not updated in real-time but is
refreshed from operational systems on a regular basis
 New data is always added as a supplement to the
database, rather than a replacement
 Requires only two operations in data accessing:
 initial loading of data and access of data

Read / Add / Change / Remove Read

OLTP Databases Load/Refresh Data Warehouse


SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 13
Semester 2, Year 2006
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 14
Semester 2, Year 2006
Data Warehouse vs. Operational
DBMS
 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 15
Semester 2, Year 2006
OLTP vs. OLAP
Feature OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 16


Semester 2, Year 2006
Why Separate Data Warehouse?
 High performance for both systems
 DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
 Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
 Different functions and different data:
 missing data: Decision support requires historical data which
operational DBs do not typically maintain
 data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
 data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 17
Semester 2, Year 2006
Data Warehouse Architecture
Monitor
& OLAP Server
Other Metadata
sources Integrator

Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining

Data Marts

Data Sources Data Storage OLAP Engine Front-End Tools


SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 18
Semester 2, Year 2006
Data Sources
(System-oriented point-of-view)
Other  Mainframe first generation hierarchical and
sources network databases.
 Departmental proprietary file systems (e.g.
Operational VSAM, RMS) and relational DBMSs (e.g.
DBs Informix, Oracle).
 Private workstations and servers.
 External systems such as the Internet,
commercially available databases, or
databases associated with an
Data Sources organization’s suppliers or customers
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 19
Semester 2, Year 2006
Data Sources
(Data-oriented point-of-view)
Other  Production Data
sources  E.g., from various operational systems.
 Internal Data
Operational  E.g., from “private” spreadsheets, documents,
DBs customer profiles, and departmental databases
 Archived Data
 E.g., from a separate archival database, flat files
on disk storage, tape cartridges, etc.
 External Data
 E.g., market share data of competitors produced
by external agencies
Data Sources
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 20
Semester 2, Year 2006
Data Warehouse
 Many organizations want to implement an integrated
enterprise warehouse that collects information about all
subjects (e.g., customers, products, sales, etc.)
 Building an enterprise warehouse is a long and complex
process and requires massive investment of time and
resources
 Alternatives: Building of
 a virtual warehouse : Provide views of operational databases that
are materialized for efficient access
 a set of data marts : Departmental subsets focused on selected
subjects (e.g., marketing data mart may include customer, product
and sales information)

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 21


Semester 2, Year 2006
Data Warehouse & Data Mart
“The single most important issue facing the IT
manager this year is whether to build the data
warehouse first or the data mart first”
[Inmon 1998]
Data
Warehouse

Data Marts
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 22
Semester 2, Year 2006
Data Warehouse & Data Mart
 Before building a data warehouse, consider the
following issues:
 Top-down or Bottom-up approach?
 Enterprise-wide or Departmental-wide?
 Data warehouse or Data mart?
 Build pilot or Go with a full-fledged implementation?
 Dependent or Independent data marts?

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 23


Semester 2, Year 2006
Top-Down Approach
(Overall data warehouse feeding dependent data marts.)

Advantages Disadvantages
 A truly corporate effort, an  Takes longer to build even
enterprise view of data with iterative method
 Inherently architected (e.g.,  High exposure / risk to
not a union of disparate data failure
marts)
 Needs high level of cross-
 Single, central storage of data functional skills
about the content  High outlay without proof of
 Centralized rules and control concept
 May see quick results if
implemented with iterations

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 24


Semester 2, Year 2006
Bottom-Up Approach
(Several departmental or local data marts combining into
a data warehouse.)
Advantages Disadvantages
 Faster and easier  Each data mart has its
implementation of manageable own narrow view of data
pieces  Permeates redundant
 Favorable return on data in every data mart
investment and proof of  Perpetuates inconsistent
concept and irreconcilable data
 Less risk of failure  Proliferates
 Inherently incremental; can unmanageable interfaces
schedule important data marts
first
 Allows project team to learn
and grow
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 25
Semester 2, Year 2006
Practical Approach [Ralph Kimball]

 Plan and define requirements at the overall corporate


level
 Create a surrounding architecture for a complete
warehouse
 Conform and standardize the data content for each
supermart (carefully architected data marts)
 Implement the data warehouse as a series of
supermarts, one at a time

Top-Down Design, Bottom-Up Implement


SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 26
Semester 2, Year 2006
Data Warehouse & Data Mart
 Corporate/Enterprise-wide  Departmental
 Union of all data marts  A single business process
 Data received from staging  Star-join (facts &
area dimensions)
 Queries on presentation  Technology optimal for
resource data access and analysis
 Structure for corporate  Structure to suit the
view of data departmental view of data
 Organized on ER model

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 27


Semester 2, Year 2006
Back-End Tools and Utilities
 Data Extraction
 Data Transformation
 Data Loading
 Data Refreshing
 Data Purging

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 28


Semester 2, Year 2006
Data Extraction
 Appropriate technique is needed for each type of data
source (relational database, legacy network and
hierarchical data models, flat files, spreadsheets,
documents)
 Data extraction from foreign sources is usually
implemented via gateways and standard interfaces (e.g.,
ODBC, Sybase Enterprise Connect, Information Builders
EDA/SQL, etc.)
 Out-side tools may entail high initial costs
 In-house programs may need ongoing costs for
development and maintenance
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 29
Semester 2, Year 2006
Data Transformation
 Cleaning
 Standardization
 Merging
 Summarization

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 30


Semester 2, Year 2006
Data Transformation: Cleaning
 Collection of misspelling
 Resolution of conflicts (e.g., state code and zip
code)
 Filling in missing information when possible
(e.g., default value)
 Elimination of duplicates

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 31


Semester 2, Year 2006
Data Transformation: Standardization
 To minimize errors and anomalies in the large
volumes of data from multiple sources
 Structural standardization
 Data types
 Field lengths
 Integrity constraints
 Value assignments
 Missing entries
 Semantic standardization
 Synonyms
 Homonyms
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 32
Semester 2, Year 2006
Data Transformation: Merging
 Combine pieces of data from different sources
 Purge source data that is not useful
 Separate out source record into new combinations
 Sorting may be needed
 Key handling
 Keys in operational systems have built-in meaning
 Keys in data warehouse systems can not have built-in
meaning but surrogate keys which are derived from the
source system primary keys

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 33


Semester 2, Year 2006
Data Transformation: Summarization
 In operational systems, Point-of-Sale keeps the
unit sales and revenue amount by individual
transactions at each check-out counter at each
store
 In data warehouse systems, summary totals of
the sale units and revenue may be needed

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 34


Semester 2, Year 2006
Data Loading
 Materialize views and store them in the warehouse
 Additional preprocessing may be required:
 Checking integrity constraints
 Sorting
 Building derived information by using summarization,
aggregation and other computation
 Building indices and other access paths
 Partitioning to multiple target storage areas

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 35


Semester 2, Year 2006
Data Loading
 Much larger volumes of loaded data (compare
to loading data into operational databases)

 Sequential loading a terabyte of data can takes


weeks or months!

 Pipelined and partitioned parallelism are


typically exploited

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 36


Semester 2, Year 2006
Data Loading
 Full loading may take too long even though
parallelism is used

 Incremental loading
 Better in timing
 But harder to manage

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 37


Semester 2, Year 2006
Data Refreshing and Purging
 Propagate updates on source data to correspondingly
update the base and derived data stored in the
warehouse
 Purge data that is too old from the warehouse
 Policies (when to refresh and how to refresh)
 Simultaneously or periodically
 Depend on user needs, traffic, characteristics of data sources
 Techniques
 Whole-source update for legacy sources
 Incremental update using a replication server

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 38


Semester 2, Year 2006
Metadata in a Data Warehouse
 Data about the data
 Table of contents for the data
 Catalog for the data
 Data warehouse roadmap
 Data warehouse atlas
 Data warehouse directory
 Glue that holds the data warehouse contents
together

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 39


Semester 2, Year 2006
Metadata Repository
 Store and maintain system catalogs

 System catalogs describe


information that is being stored in
the warehouse
Monitor
Metadata &
 Size and complexity of system Integrator
catalogs depend on
 the size and complexity of the
warehouse itself, and
 the amount of administrative information Data
that must be maintained Warehouse

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 40


Semester 2, Year 2006
Metadata Repository
 Description of the structure of the data warehouse
 schema, view, dimensions, hierarchies, derived data definition, data
mart locations and contents
 Operational meta-data
 data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
 The algorithms used for summarization
 The mapping from operational environment to the data
warehouse
 Data related to system performance
 warehouse schema, view and derived data definitions
 Business data
 business terms and definitions, ownership of data, charging policies
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 41
Semester 2, Year 2006
General Metadata Issues
 What tables, attributes and keys does the DW contain?
 Where did each set of data come from?
 What transformations were applied in loading the data?
 How have the metadata changed over time?
 What aliases exists and how are they related to each
other?
 What are the cross-references between technical and
business terms?
 How often do the data get reloaded?
 How many data elements are in the DW?

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 42


Semester 2, Year 2006
Metadata Element for Customer
Entity Name: Customer
Alias Name: Account, Client
Definition: A person or an organization that purchases goods or services
from the company.
Remarks: Customer entity includes regular, current, and past customers.
Source System: Finished Goods Orders, Maintenance Contracts, Online Sales.
Create Date: January 15, 1999
Last Update Date: January 21, 2001
Update Cycle: Weekly
Last Full Refresh Date: December 29, 2000
Full Refresh Date: Every six months
Data Quality Reviewed: January 25, 2001
Last Deduplication: January 10, 2001
Planned Archival: Every six months
Responsible User: Jane Brown
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 43
Semester 2, Year 2006
Front-End Tools
 Visualization/Analysis/Reporting tools
 Allow users to query, analyze and summarize
data and present it in a visual way
Analysis
 OLAP tools Query
 Provide a set of operators that users can used to Reports
create views from multidimensional data Data mining

 Data mining tools


 Provide a mechanism for users to analyze and
identify patterns

Front-End Tools
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 44
Semester 2, Year 2006

You might also like