
LECTURE 2: Data Warehousing

Dr. John O. Oredo, PhD, University of Nairobi

Data Warehousing Concepts

-A data warehouse (DW) is a pool of data
produced to support decision making.
-It is also a repository of current and historical
data of potential interest to managers
throughout the organization.
-Data are usually structured to be available in a
form ready for analytical processing activities
(i.e., online analytical processing [OLAP], data
mining, querying, and reporting).


-A data warehouse is a subject-oriented,
integrated, time-variant, non-volatile collection
of data in support of management’s
decision-making process.



Characteristics of Data Warehousing

▪ Subject oriented.
-Data are organized by detailed subject, such
as sales, products, or customers, containing
only information relevant for decision support.
-Subject orientation enables users to determine
not only how their business is performing, but
why.
-A data warehouse differs from an operational
database in that most operational databases
have a product orientation and are tuned to
handle transactions that update the database.
-Subject orientation provides a more
comprehensive view of the organization.
▪ Integrated.
-Data warehouses must place data from
different sources into a consistent format. To do
so, they must deal with naming conflicts and
discrepancies among units of measure.
-A data warehouse is presumed to be totally
integrated.


▪ Time variant (time series).


-A warehouse maintains historical data.
-The data do not necessarily provide current
status (except in real-time systems).
-Historical data make it possible to detect trends,
deviations, and long-term relationships for
forecasting and comparisons, leading to
decision making.
-Every data warehouse has a temporal quality.
Data for analysis from multiple sources contain
multiple time points (e.g., daily, weekly, monthly
views).

▪ Non-volatile.
-After data are entered into a data warehouse,
users cannot change or update the data.
-Obsolete data are discarded, and changes are
recorded as new data.
▪ Web based.
-Warehouses are typically designed to provide
an efficient computing environment for Web-
based applications.


▪ Relational/multidimensional.
-A data warehouse uses either a relational
structure or a multidimensional structure.
▪ Client/server.
-A data warehouse uses the client/server
architecture to provide easy access for end
users.
▪ Real time.
-Newer data warehouses provide real-time, or
active, data-access and analysis capabilities.

▪ Include metadata.
-A data warehouse contains metadata (data
about data) about how the data are organized
and how to effectively use them.
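
As a rough illustration of the integrated, time-variant, and non-volatile characteristics above, the Python sketch below maps two hypothetical source records with conflicting field names and units into one consistent format, then records a change as a new dated row instead of updating an old one. The field names, the `to_warehouse_row` helper, and the choice of kilograms as the standard unit are assumptions made for this example only.

```python
from datetime import date

# Hypothetical source records: two systems name the same facts differently
# and use different units of measure (pounds vs. kilograms).
source_a = {"cust_id": 7, "weight_lb": 22.0, "as_of": date(2024, 1, 31)}
source_b = {"customer": 7, "weight_kg": 12.5, "as_of": date(2024, 2, 29)}

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms


def to_warehouse_row(record):
    """Map a source record to one consistent format (integration)."""
    customer_id = record.get("cust_id", record.get("customer"))
    if "weight_kg" in record:
        weight_kg = record["weight_kg"]
    else:
        weight_kg = record["weight_lb"] * LB_TO_KG
    return {
        "customer_id": customer_id,
        "weight_kg": round(weight_kg, 3),
        "as_of": record["as_of"],
    }


# Non-volatile and time variant: rows are only appended, each with its date.
warehouse = []
warehouse.append(to_warehouse_row(source_a))
warehouse.append(to_warehouse_row(source_b))

# A change is recorded as new data instead of overwriting an existing row.
warehouse.append(
    {"customer_id": 7, "weight_kg": 13.0, "as_of": date(2024, 3, 31)}
)

history = [row["weight_kg"] for row in warehouse if row["customer_id"] == 7]
print(history)
```

Because nothing is overwritten, the full history for customer 7 remains available for trend analysis.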



Data Marts

-Whereas a data warehouse combines
databases across an entire enterprise, a data
mart is usually smaller and focuses on a
particular subject or department.
-A data mart is a subset of a data warehouse,
typically consisting of a single subject area
(e.g., marketing, operations).
-A data mart can be either dependent or
independent.
-A dependent data mart is a subset that is
created directly from the data warehouse.
-Dependent data marts support the concept of a
single enterprise-wide data model, but the data
warehouse must be constructed first.
-A dependent data mart ensures that the end
user is viewing the same version of the data
that is accessed by all other data warehouse
users.
-The high cost of data warehouses limits their
use to large companies.
-An independent data mart is a small
warehouse designed for a strategic business

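
One way to picture a dependent data mart is as a subject-specific subset derived directly from warehouse tables. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `warehouse_sales` table and a `marketing_mart` view. The table, column, and department names are assumptions for illustration, and a view is only one possible way to build such a mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_sales ("
    "department TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", "campaign_a", 120.0),
        ("operations", "logistics", 300.0),
        ("marketing", "campaign_b", 80.0),
    ],
)

# The dependent mart is defined on top of the warehouse, so its users see
# the same version of the data as every other warehouse user.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT product, amount FROM warehouse_sales "
    "WHERE department = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT product, amount FROM marketing_mart ORDER BY product"
).fetchall()
print(mart_rows)
```

The warehouse table has to exist before the mart can be defined, which mirrors the point that the data warehouse must be constructed first.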
Operational Data Stores

-An operational data store (ODS) provides a
fairly recent form of customer information file
(CIF).
-This type of database is often used as an
interim staging area for a data warehouse.
-Unlike the static contents of a data warehouse,
the contents of an ODS are updated
throughout the course of business operations.
-An ODS is used for short-term decisions



Enterprise Data Warehouse (EDW)

-An enterprise data warehouse (EDW) is a
large-scale data warehouse that is used across the
enterprise for decision support.
-The large-scale nature provides integration of
data from many sources into a standard format
for effective BI and decision support applications.
-An EDW provides data for many types of DSS,
including CRM, SCM, business performance
management (BPM), business activity monitoring
(BAM), product life-cycle management (PLM),
and KMS.

Data Warehousing Process

-The following are the major components of the
data warehousing process:
▪ Data sources.
-Data are sourced from multiple independent
operational legacy systems and possibly from
external data providers.
-Data may also come from an OLTP or ERP
system.
-Web data in the form of Web logs may also feed
a data warehouse.

▪ Data extraction and transformation.
-Data are extracted and properly transformed
using custom-written or commercial software
called ETL.
▪ Data loading.
-Data are loaded into a staging area, where they
are transformed and cleansed.
-The data are then ready to load into the data
warehouse and/or data marts.


▪ Comprehensive database.
-Essentially, this is the EDW to support all
decision analysis by providing relevant
summarized and detailed information originating
from many different sources.
▪ Metadata.
-Metadata are maintained so that they can be
assessed by IT personnel and users.
-Metadata include software about data and rules
for organizing data summaries that are easy to
index and search, especially with Web tools.
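
The extraction, transformation, and loading steps described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any commercial ETL tool: the inline CSV stands in for an operational source, the cleansing rule (skip rows with no amount) is invented for the example, and the `extract`, `transform`, and `load` function names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical extract from an operational system (inline CSV for brevity).
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,Carol,7.25
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform and cleanse: trim names, normalize case, drop bad rows."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # assumed cleansing rule: skip rows with no amount
        cleaned.append(
            (
                int(row["order_id"]),
                row["customer"].strip().title(),
                float(row["amount"]),
            )
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleansed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Here the list returned by `transform` plays the role of the staging area: the data are cleansed there before anything reaches the warehouse table.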


Data Warehousing Architectures

-These architectures are normally known as
client/server or n-tier architectures.
-In n-tiered architectures the data warehouse is
divided into three parts:
1. The data warehouse itself, which contains the
data and associated software
2. Data acquisition (back-end) software, which
extracts data from legacy systems and external
sources, consolidates and summarizes them, and
loads them into the data warehouse
3. Client (front-end) software, which allows users
to access and analyze data from the warehouse
(DSS/BI/business analytics [BA] engine)
-In a two-tier architecture, the DSS engine
physically runs on the same hardware
platform as the data warehouse.
-Therefore, it is more economical than a three-tier
structure.
-The two-tier architecture can have performance
problems for large data warehouses that work
with data-intensive applications.

-In a three-tier architecture, operational systems
contain the data and the software for data
acquisition in one tier (i.e., the server), the data
warehouse is another tier, and the third tier
includes the DSS/BI/BA engine (i.e., the
application server) and the client.
-The advantage of the three-tier architecture is its
separation of the functions of the data
warehouse, which eliminates resource constraints
and makes it possible to easily create data marts.


Data Warehousing Schemas

a) Star Schema
-A star schema contains a central fact table
surrounded by and connected to several
dimension tables.
-The fact table contains a large number of rows
that correspond to observed facts and external
links (i.e., foreign keys).
-A fact table contains the descriptive attributes
needed to perform decision analysis and query
reporting, and foreign keys are used to link to
dimension tables.
-The dimension tables contain classification and
aggregation information about the central fact
rows.
-Dimension tables contain attributes that describe
the data contained within the fact table.
-Dimension tables have a one-to-many
relationship with rows in the central fact table.
-The star schema is designed to provide fast
query-response time, simplicity, and ease of
maintenance for read-only database structures.
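
A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions chosen for illustration. The query shows the single join per dimension that links the central fact table to its dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 10, 5, 10.0), (1, 11, 3, 6.0), (2, 11, 1, 25.0);
""")

# One join per dimension links the fact rows to their describing attributes.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(rows)
```

Each dimension row (one product, one date) can match many fact rows, which is the one-to-many relationship noted above.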

b) Snowflake Schema
-The snowflake schema is a logical arrangement
of tables in a multidimensional database in such a
way that the entity-relationship diagram
resembles a snowflake in shape.
-Related to the star schema, the snowflake
schema is represented by centralized fact tables
(usually only one) that are connected to multiple
dimensions.
-In the snowflake schema dimension tables are
normalized into multiple related tables.
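
To make the normalization concrete, the following `sqlite3` sketch splits a hypothetical product dimension so that the category lives in its own table. All table names, column names, and sample values are assumptions for the example. Reaching the category from the fact table now takes two joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized: category moves to its own table.
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category TEXT
);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue REAL
);
INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Home');
INSERT INTO dim_product VALUES (1, 'Pen', 1), (2, 'Pencil', 1), (3, 'Lamp', 2);
INSERT INTO fact_sales VALUES (1, 10.0), (2, 4.0), (3, 25.0), (1, 6.0);
""")

# The category is two joins away from the fact table, not one.
rows = conn.execute("""
    SELECT c.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
print(rows)
```

The category name is stored once per category rather than once per product, at the cost of the extra join.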

Star Vs Snowflake Schema: Key Differences

Star Schema | Snowflake Schema
--- | ---
Hierarchies for the dimensions are stored in the dimension table. | Hierarchies are divided into separate tables.
It contains a fact table surrounded by dimension tables. | One fact table is surrounded by dimension tables, which are in turn surrounded by further dimension tables.
Only a single join creates the relationship between the fact table and any dimension table. | Many joins are required to fetch the data.
Simple DB design. | Very complex DB design.
Denormalized data structure; queries also run faster. | Normalized data structure.
High level of data redundancy. | Very low level of data redundancy.
A single dimension table contains aggregated data. | Data are split into different dimension tables.
Cube processing is faster. | Cube processing might be slow because of the complex joins.
Offers higher-performing queries using Star Join Query Optimization. Tables may be connected with multiple dimensions. | Represented by a centralized fact table that is unlikely to be connected with multiple dimensions.

OLTP Vs OLAP
-OLTP (online transaction processing system) is
a term used for a transaction system, which is
primarily responsible for capturing and storing
data related to day-to-day business functions
such as ERP, CRM, SCM, point of sale, and so
forth.
-The OLTP system addresses a critical business need,
automating daily business transactions and
running real-time reports and routine analyses.
-But these systems are not designed for ad hoc
analysis and complex queries that deal with a
number of data items.
-OLAP (online analytical processing), on the other
hand, is designed to address this need by
providing ad hoc analysis of organizational data
much more effectively and efficiently.
-OLAP and OLTP rely heavily on each other:
-The main operational structure in OLAP is based
on a concept called cube.
-A cube in OLAP is a multidimensional data
structure (actual or virtual) that allows fast
analysis of data.
-It can also be defined as the capability of
efficiently manipulating and analyzing data from
multiple perspectives.
-The arrangement of data into cubes aims to
overcome a limitation of relational databases.
-Relational databases are not well suited for
near-instantaneous analysis of large amounts of data.
-Instead, they are better suited for manipulating
records (adding, deleting, and updating data) that
represent a series of transactions.
-Although many report-writing tools exist for
relational databases, these tools are slow when a
multidimensional query that encompasses many
database tables needs to be executed.
-Using OLAP, an analyst can navigate through
the database and screen for a particular subset
of the data (and its progression over time) by
changing the data’s orientations and defining
analytical calculations.

-These types of user-initiated navigation of data
through the specification of slices (via rotations)
and drill down/up (via aggregation and
disaggregation) are sometimes called “slice and
dice.”
-Commonly used OLAP operations include slice
and dice, drill down, roll up, and pivot.
▪ Slice. A slice is a subset of a multidimensional
array (usually a two-dimensional representation)
corresponding to a single value set for one (or
more) of the dimensions not in the subset.
▪ Dice. The dice operation is a slice on more than
two dimensions of a data cube.
▪ Drill down/up. Drilling down or up is a specific
OLAP technique whereby the user navigates
among levels of data ranging from the most
summarized (up) to the most detailed (down).
▪ Roll-up. A roll-up involves computing all of the
data relationships for one or more dimensions.
To do this, a computational relationship or
formula might be defined.
▪ Pivot. A pivot is a means of changing the dimensional
orientation of a report or ad hoc query-page display.
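
The operations above can be tried on a toy cube in plain Python. This is only a sketch: the cube is a dictionary keyed by hypothetical (region, product, quarter) coordinates, the function names are invented for the example, and roll-up is assumed to mean summation. Real OLAP engines use specialized storage and support other aggregates.

```python
from collections import defaultdict

# A tiny hypothetical cube: (region, product, quarter) -> sales.
cube = {
    ("East", "Pen", "Q1"): 10,
    ("East", "Pen", "Q2"): 12,
    ("East", "Lamp", "Q1"): 5,
    ("West", "Pen", "Q1"): 7,
    ("West", "Lamp", "Q2"): 9,
}
DIMS = ("region", "product", "quarter")


def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value and drop it."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, allowed):
    """Dice: keep cells whose coordinates fall in the allowed value sets."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def roll_up(cube, keep):
    """Roll up: sum over every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)


def pivot(table):
    """Pivot: swap the two axes of a two-dimensional view."""
    return {(b, a): v for (a, b), v in table.items()}


q1 = slice_cube(cube, "quarter", "Q1")
east_pens = dice_cube(cube, {"region": {"East"}, "product": {"Pen"}})
by_region = roll_up(cube, ["region"])
q1_pivoted = pivot(q1)
print(q1, east_pens, by_region, q1_pivoted, sep="\n")
```

Drilling down is the reverse of rolling up: calling `roll_up(cube, ["region", "product"])` instead of `roll_up(cube, ["region"])` moves from the summarized view to a more detailed one.
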
OLTP Vs OLAP in a nutshell

Basis for Comparison | OLTP | OLAP
--- | --- | ---
Basic | It is an online transactional system and manages database modification. | It is an online data retrieving and data analysis system.
Focus | Inserting, updating, and deleting information in the database. | Extracting data for analysis that helps in decision making.
Data | OLTP and its transactions are the original source of data. | Data from different OLTP databases become the source of data for OLAP.
Transaction | OLTP has short transactions. | OLAP has long transactions.
Time | The processing time of a transaction is comparatively less in OLTP. | The processing time of a transaction is comparatively more in OLAP.
Queries | Simpler queries. | Complex queries.
Normalization | Tables in an OLTP database are normalized (3NF). | Tables in an OLAP database are not normalized.
Integrity | An OLTP database must maintain data integrity constraints. | An OLAP database does not get frequently modified. Hence, data
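
The contrast between the two kinds of workload can be sketched with `sqlite3`. The `stock` and `orders` tables and the `place_order` helper are hypothetical, and a single in-memory database stands in for what would normally be separate operational and analytical systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (product TEXT PRIMARY KEY, on_hand INTEGER)")
conn.execute("CREATE TABLE orders (product TEXT, qty INTEGER, month TEXT)")
conn.execute("INSERT INTO stock VALUES ('Pen', 100)")
conn.commit()


def place_order(product, qty, month):
    """OLTP-style work: a short transaction that modifies a few rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (product, qty, month)
        )
        conn.execute(
            "UPDATE stock SET on_hand = on_hand - ? WHERE product = ?",
            (qty, product),
        )


place_order("Pen", 5, "2024-01")
place_order("Pen", 7, "2024-01")
place_order("Pen", 3, "2024-02")

# OLAP-style work: a read-only query that summarizes many rows at once.
monthly = conn.execute(
    "SELECT month, SUM(qty) FROM orders GROUP BY month ORDER BY month"
).fetchall()
on_hand = conn.execute(
    "SELECT on_hand FROM stock WHERE product = 'Pen'"
).fetchone()[0]
print(monthly, on_hand)
```

Each call to `place_order` touches two rows and finishes quickly, while the monthly summary reads every order row without changing any of them.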
