DWM1

CHAPTER 1
INTRODUCTION TO DATA
WAREHOUSING
Chapter 9 Copyright © 2014 Pearson Education, Inc.

DEFINITIONS
• Data Warehouse
  A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
  • Subject-oriented: e.g. customers, patients, students, products
  • Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
  • Time-variant: can study trends and changes
  • Non-updatable: read-only, periodically refreshed
• Data Mart
  A data warehouse that is limited in scope
COMPARISON CHART OF DATABASE TYPES
Data warehouse                                                     | Operational system
-------------------------------------------------------------------+----------------------------------------------------------------
Subject oriented                                                   | Transaction oriented
Large (hundreds of GB up to several TB)                            | Small (MB up to several GB)
Historic data                                                      | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates                                                      | Continuous updates
Usually very complex queries                                       | Simple to complex queries
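
The normalized versus de-normalized contrast in the chart can be illustrated with a small sketch. It assumes Python's built-in sqlite3 module and uses hypothetical table names (customer, orders, sales_wide) that do not come from the slides. The same question is answered once with a join over normalized tables and once against a single wide table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Operational style: normalized (many tables, few columns per table).
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    amount REAL
);
INSERT INTO customer VALUES (1, 'Ann', 'Boston'), (2, 'Bob', 'Austin');
INSERT INTO orders VALUES (10, 1, 40.0), (11, 1, 10.0), (12, 2, 25.0);

-- Warehouse style: one wide, de-normalized table filled in a batch.
CREATE TABLE sales_wide AS
    SELECT o.order_id, c.name, c.city, o.amount
    FROM orders o JOIN customer c USING (customer_id);
""")

# Normalized: the query has to join the tables back together.
normalized = db.execute(
    "SELECT c.city, SUM(o.amount) FROM orders o "
    "JOIN customer c USING (customer_id) "
    "GROUP BY c.city ORDER BY c.city"
).fetchall()

# De-normalized: the same question needs no join.
denormalized = db.execute(
    "SELECT city, SUM(amount) FROM sales_wide GROUP BY city ORDER BY city"
).fetchall()

print(normalized)
print(denormalized)
```

Both queries return the same totals per city; the wide table trades redundant storage for simpler analytical queries.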

NEED FOR DATA WAREHOUSING
• A data warehouse gives business users quick access to critical data from several sources, all in one place.
• A data warehouse provides consistent information on various cross-functional activities. It also supports ad-hoc reporting and querying.
• A data warehouse integrates many sources of data, which reduces stress on the production system.
• A data warehouse helps to reduce the total turnaround time for analysis and reporting.
• Restructuring and integration make the data easier to use for reporting and analysis.
• Because users can access critical data from a number of sources in a single place, they save the time otherwise spent retrieving data from multiple sources.
• A data warehouse stores a large amount of historical data. This helps users analyze different time periods and trends to make future predictions.

MULTI-TIER ARCHITECTURE OF DATA WAREHOUSING
• Single-tier architecture
• Two-tier architecture
• Three-tier architecture

Single-tier architecture
The objective of a single layer is to minimize the amount of data stored.
The goal is to remove data redundancy. This architecture is not frequently used in practice.
Two-tier architecture of DW
Two-layer architecture physically separates the available sources from the data warehouse.
This architecture is not expandable and does not support a large number of end-users.
Three-tier Architecture of DW

Three-Tier Data Warehouse Architecture
This is the most widely used Architecture of Data Warehouse.

It consists of the Top, Middle and Bottom Tier.
1.Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
2.Middle Tier: The middle tier is an OLAP server, implemented using either the ROLAP or the MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end-user and the database.
3.Top Tier: The top tier is the front-end client layer. It comprises the tools and APIs used to connect to the data warehouse and get data out of it, such as query tools, reporting tools, managed query tools, analysis tools, and data mining tools.
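
The three tiers above can be sketched in a few lines. This is a minimal illustration, not a real OLAP server: it assumes Python's built-in sqlite3 module as the relational bottom tier, and the sales table and the function names (build_bottom_tier, total_by, print_report) are hypothetical.

```python
import sqlite3

# Bottom tier: a relational database holding cleansed, loaded data.
def build_bottom_tier():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("North", 2013, 100.0), ("North", 2014, 150.0),
         ("South", 2013, 80.0), ("South", 2014, 120.0)],
    )
    return conn

# Middle tier: mediates between clients and the database and exposes an
# abstracted view. Aggregation is pushed down as SQL, in the ROLAP style.
ALLOWED_DIMENSIONS = {"region", "year"}

def total_by(conn, dimension):
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    rows = conn.execute(
        f"SELECT {dimension}, SUM(amount) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    ).fetchall()
    return dict(rows)

# Top tier: a front-end reporting tool that only talks to the middle tier.
def print_report(conn, dimension):
    for key, total in total_by(conn, dimension).items():
        print(f"{dimension}={key}: {total:.2f}")

conn = build_bottom_tier()
print_report(conn, "region")
```

The report code never issues SQL itself, which mirrors the mediator role of the middle tier.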
Data Warehouse Models
From the architecture point of view, there are three warehouse models-
Enterprise Warehouse:-
•An enterprise warehouse collects all of the information about subjects spanning the entire organization.
•It provides corporate-wide data integration, typically from one or more operational systems or external information providers, and is cross-functional in scope.
•It usually contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
•An enterprise data warehouse may be implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.

Data Mart:-
•A data mart contains a subset of corporate-wide data that is important to
a specific group of users.
•The scope is limited to specific selected subjects.
•For example, a marketing data mart may limit its topics to customers,
goods, and sales.
•The data contained in data marts is summarized. Data marts are usually implemented on low-cost departmental servers that are Unix/Linux or Windows-based.
•The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years. However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.

Virtual Warehouse:-
•A virtual warehouse is a group of views on an operational database.
•For efficient query processing, only some of the possible summary views may be materialized (physically stored).
•Creating a virtual warehouse is easy, but requires additional capacity on operational
database servers.
•A data warehouse architecture defines the arrangement of the data in different
databases. As the data must be organized and cleansed to be valuable, a modern data
warehouse structure centers on identifying the most effective technique of extracting
information from raw data in the staging area and converting it into a simple
consumable warehousing structure using a dimensional model that delivers valuable
business intelligence.
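
The virtual warehouse idea (views over an operational database, with only some summaries physically stored) can be sketched as follows. The sketch assumes Python's built-in sqlite3 module; the orders table and the names v_sales_by_customer and m_sales_by_customer are hypothetical. SQLite has no materialized views, so the stored summary is emulated with CREATE TABLE ... AS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Operational table.
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'Ann', 40.0), (2, 'Bob', 25.0), (3, 'Ann', 10.0);

-- Virtual summary: recomputed from the operational data on every query.
CREATE VIEW v_sales_by_customer AS
    SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;

-- Physical summary: computed once and stored, so it can go stale.
CREATE TABLE m_sales_by_customer AS SELECT * FROM v_sales_by_customer;
""")

def totals(relation):
    # relation is one of the two hypothetical summaries created above
    query = f"SELECT customer, total FROM {relation} ORDER BY customer"
    return dict(conn.execute(query))

# A new operational row arrives after the summaries were defined.
conn.execute("INSERT INTO orders VALUES (4, 'Bob', 5.0)")

view_totals = totals("v_sales_by_customer")    # reflects the new row
stored_totals = totals("m_sales_by_customer")  # still the old snapshot
print(view_totals, stored_totals)
```

The view is always current but puts query load on the operational server each time it is read; the stored summary is cheap to read but must be refreshed.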
Extraction, Transformation And Loading
1.Extraction:
The first step of the ETL process is extraction.
1.In this step, data is extracted from various source systems, which can be in various formats such as relational databases, NoSQL, XML, and flat files, into the staging area.
2.It is important to extract the data from the various source systems and store it in the staging area first, not directly in the data warehouse, because the extracted data is in various formats and may also be corrupted.
3.Hence, loading it directly into the data warehouse may damage the warehouse, and rollback will be much more difficult. Therefore, this is one of the most important steps of the ETL process.
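
A minimal sketch of the extraction step, assuming only the Python standard library. The two in-memory sources (a CSV flat file and an XML document), their field names, and the function names are hypothetical. Records are copied into a staging list as-is, tagged with their origin, and nothing is written to the warehouse yet.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical source data in two different formats.
CSV_SOURCE = "id,name,country\n1,Ann,U.S.A\n2,Bob,\n"
XML_SOURCE = "<customers><customer id='3' name='Cy' country='America'/></customers>"

def extract_csv(text):
    # Each CSV row becomes a dict; "_source" records where it came from.
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row, _source="csv") for row in reader]

def extract_xml(text):
    # Each <customer> element's attributes become one record.
    root = ET.fromstring(text)
    return [dict(node.attrib, _source="xml") for node in root.iter("customer")]

# Staging area: raw, untransformed records from every source.
staging = extract_csv(CSV_SOURCE) + extract_xml(XML_SOURCE)
for record in staging:
    print(record)
```

Note that the second CSV record arrives with an empty country. Keeping such rows in staging lets a later step repair or reject them before they reach the warehouse.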

2.Transformation:
The second step of the ETL process is transformation.
1.In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format.
2.It may involve the following processes/tasks:
•Filtering – loading only certain attributes into the data warehouse.
•Cleaning – filling in NULL values with default values, mapping U.S.A, United States, and America to USA, etc.
•Joining – joining multiple attributes into one.
•Splitting – splitting a single attribute into multiple attributes.
•Sorting – sorting tuples on the basis of some attribute (generally the key attribute).
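
The five tasks listed above can be sketched in one small function. This is an illustration under assumptions: the record layout (id, full_name, street, city, country, notes), the default value "UNKNOWN", and the function name transform are hypothetical.

```python
# Cleaning rule from the slide: several spellings map to one standard value.
COUNTRY_ALIASES = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(records):
    out = []
    for rec in records:
        # Cleaning: replace NULL/empty values with a default, unify spellings.
        country = rec.get("country") or "UNKNOWN"
        country = COUNTRY_ALIASES.get(country, country)
        # Splitting: one attribute becomes several.
        first, _, last = rec["full_name"].partition(" ")
        # Joining: several attributes become one.
        address = f'{rec["street"]}, {rec["city"]}'
        # Filtering: only the selected attributes are kept ("notes" is dropped).
        out.append({
            "id": rec["id"],
            "first_name": first,
            "last_name": last,
            "address": address,
            "country": country,
        })
    # Sorting: order tuples by the key attribute.
    return sorted(out, key=lambda r: r["id"])

raw = [
    {"id": 2, "full_name": "Bob Ray", "street": "5 Elm St", "city": "Austin",
     "country": None, "notes": "x"},
    {"id": 1, "full_name": "Ann Lee", "street": "1 Oak Ave", "city": "Boston",
     "country": "United States", "notes": "y"},
]
clean = transform(raw)
for row in clean:
    print(row)
```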

3.Loading
The third and final step of the ETL process is loading.
1.In this step, the transformed data is finally loaded into the data
warehouse.
2.Sometimes data is loaded into the data warehouse very frequently, and sometimes loading is done at longer but regular intervals.
3.The rate and period of loading depend solely on the requirements and vary from system to system.
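
A minimal sketch of a batch load, assuming Python's built-in sqlite3 module stands in for the warehouse. The dim_customer table and the load_batch helper are hypothetical. Each batch is loaded inside one transaction, so a batch containing a bad row is rolled back as a whole rather than leaving the warehouse half-loaded.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE dim_customer ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT NOT NULL)"
)

def load_batch(conn, rows):
    """Load one batch atomically; return True if it was committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO dim_customer VALUES (:id, :name, :country)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False

def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]

first_ok = load_batch(warehouse, [
    {"id": 1, "name": "Ann", "country": "USA"},
    {"id": 2, "name": "Bob", "country": "UNKNOWN"},
])
# The second batch reuses id 1, so the whole batch is rejected.
second_ok = load_batch(warehouse, [
    {"id": 3, "name": "Cy", "country": "USA"},
    {"id": 1, "name": "Duplicate", "country": "USA"},
])
print(first_ok, second_ok, row_count(warehouse))
```

How often load_batch would be called (every few minutes, nightly, weekly) is a scheduling decision outside this sketch and depends on the requirements.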

Metadata Repository
What is Metadata?
Metadata is simply defined as data about data.
The data that is used to represent other data is known as metadata.
For example, the index of a book serves as metadata for the contents of the book.
In other words, we can say that metadata is the summarized data that leads us to
detailed data.
In terms of data warehouse, we can define metadata as follows.
Categories of Metadata
Metadata can be broadly categorized into three categories −
•Business Metadata − It has the data ownership information, business definition, and
changing policies.
•Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
•Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of data migrated and transformation applied on it.
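
The three metadata categories can be pictured as one record per warehouse table. The sketch below is only an illustration: the TableMetadata class, its field names, and the sample values are hypothetical and do not describe any particular metadata repository product.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # Business metadata: ownership and business definition.
    owner: str
    business_definition: str
    # Technical metadata: names, data types, key structure.
    table_name: str
    columns: dict          # column name -> data type
    primary_key: str
    # Operational metadata: currency and lineage.
    currency: str = "active"   # one of: active, archived, purged
    lineage: list = field(default_factory=list)

    def record_step(self, description):
        # Append one migration or transformation step to the lineage.
        self.lineage.append(description)

meta = TableMetadata(
    owner="Sales department",
    business_definition="One row per customer",
    table_name="dim_customer",
    columns={"id": "INTEGER", "name": "TEXT", "country": "TEXT"},
    primary_key="id",
)
meta.record_step("extracted from CRM export")
meta.record_step("country names standardized to USA")
print(meta.currency, meta.lineage)
```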
BENEFITS OF A DATA WAREHOUSE
• Delivers enhanced business intelligence
• Saves time
• Enhances data quality and consistency
• Provides competitive advantage
• Improves the decision-making process
• Better enterprise intelligence
• Better retrieval of data
• Improved control of data
