
Data Warehouse

Business Intelligence
Combination of technologies like
Data Warehousing (DW)
On-Line Analytical Processing (OLAP)
Data Mining (DM)
Data Visualization (VIS)
Decision Analysis (what-if)
Customer Relationship Management (CRM)
Operational Data
Presents a dynamic view of the business
Must be kept up-to-date and current at all times
Updated by transactions entered by data-entry operators or specially trained end users
Is maintained in detail
Utilization is predictable. Systems can be optimized for projected workloads
High volume of transactions, each of which affects a small portion of the data
Users do not need to understand data structures
Functional orientation
Analytical Data
Presents a static view of the business
End-user access is usually read-only
More concerned with summary information
Usage is unpredictable in terms of depth of information needed by the user
Smaller number of queries, each of which may access large amounts of data
Users need to understand the structure of the data (and business rules) to draw
meaningful conclusions from the data
Subject orientation
Database
Broadly classified into
1. OLTP (Online Transactional Processing) DB
2. OLAP (Online Analytical Processing) DB
OLAP

Slicing and dicing of data is called On-Line Analytical Processing (OLAP).

OLAP serves the needs of data warehousing, whereas OLTP serves day-to-day transaction processing.
OLAP systems allow ad hoc processing and support access to data over time
periods.
OLAP systems hold the aggregated, transformed, integrated and historical
collection of OLTP data from one or more systems.

Typical OLAP operations:


1. Roll up (drill up)
   - summarize data by climbing up a hierarchy or by dimension reduction.
2. Drill down (roll down)
   - from higher-level summary to lower-level summary or detailed data, or
   - introducing new dimensions.
3. Slice and dice
   - project and select.
4. Pivot (rotate)
   - reorient the cube for visualization, e.g., turn a 3D cube into a series of 2D planes.
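For illustration only, the operations above can be approximated in SQL against a hypothetical star schema; sales_fact, time_dim and product_dim below are assumed names, not taken from these notes.

-- Assumed tables: sales_fact(time_key, product_key, store_key, sales_amount),
-- time_dim(time_key, month, quarter, year), product_dim(product_key, product_name, category).

-- Roll up: climb the time hierarchy from month to quarter.
SELECT t.year, t.quarter, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f JOIN time_dim t ON f.time_key = t.time_key
GROUP  BY t.year, t.quarter;

-- Drill down: go back to the more detailed month level.
SELECT t.year, t.quarter, t.month, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f JOIN time_dim t ON f.time_key = t.time_key
GROUP  BY t.year, t.quarter, t.month;

-- Slice: fix one dimension member; dice: restrict several dimensions at once.
SELECT p.category, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   time_dim    t ON f.time_key    = t.time_key
JOIN   product_dim p ON f.product_key = p.product_key
WHERE  t.year = 2005                            -- slice on time
AND    p.category IN ('Books', 'Music')         -- dice on product
GROUP  BY p.category;

-- Pivot (rotate): months become columns via conditional aggregation.
SELECT p.category,
       SUM(CASE WHEN t.month = 1 THEN f.sales_amount ELSE 0 END) AS jan_sales,
       SUM(CASE WHEN t.month = 2 THEN f.sales_amount ELSE 0 END) AS feb_sales
FROM   sales_fact f
JOIN   time_dim    t ON f.time_key    = t.time_key
JOIN   product_dim p ON f.product_key = p.product_key
GROUP  BY p.category;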

OLAP vs OLTP

Slno  OLTP                                       OLAP
1.    Transaction oriented                       Decision oriented (reports)
2.    Complex data model (fully normalized)      Simple data model (multidimensional / de-normalized)
3.    Smaller data volume (little historical     Larger data volumes (collection of historical data)
      data)
4.    Many, small queries                        Fewer, but bigger queries
5.    Frequent updates                           Frequent reads, infrequent updates (daily)
6.    Huge number of users (clerks)              Only a few users (management personnel)

Objective of Data Warehouse


The primary purpose of a data warehouse is to provide easy access to specially
prepared data that can be used with decision support applications, such as management
reporting, queries, decision support systems, and executive information systems.
Decision Support
A Decision Support System (DSS) is a system that provides managers with the
information they need to make decisions. These systems empower
employees at all levels, providing them access to business and financial information that
directly impacts their productivity and quality of work.
Executive information systems
An Executive information system (EIS) is a concise snapshot of how the company
is doing today. Consider it as an electronic executive briefing. EIS allows greater
flexibility in slicing-and-dicing data, i.e., it allows exploration of data through multiple
dimensions or views.

Why Datawarehouse?
By centralizing data
1. Queries can be answered locally without accessing the original information
sources. Thus, high query performance can be obtained for the complex aggregation
queries that are needed for in-depth analysis, decision support and data mining (a
way of extracting relevant data from a vast database).
2. On-Line Analytical Processing (OLAP) is decoupled (separated) as much as
possible from On-Line Transaction Processing (OLTP), making information
accessible to decision makers while avoiding interference of OLAP with local
processing at the operational sources.
Data warehouse
A decision support database that is maintained separately from the organization's
operational databases.
A Data Warehouse is an enterprise-wide collection of
Subject oriented
Integrated
Time variant
Non-volatile
data in support of management's decision-making process.
- W. H. Inmon, 1993
*Subject Oriented - A data warehouse focuses on high-level business entities like
sales, marketing, etc.
*Integrated - Data in the warehouse is obtained from multiple sources and kept in a
consistent format.
*Time-Varying - Every data component in the data warehouse is associated with some
point of time, such as a week, month, quarter or year.
*Non-volatile - The DW stores historical data. Data does not change once it gets into the
warehouse; it is only loaded/refreshed.
Data from the operational systems are
Extracted
Cleansed
Transformed
1. case conversion,
2. data trimming,
3. concatenation,
4. datatype conversion
Aggregated
Loaded into DW

The warehouse is periodically refreshed to reflect updates at the sources, and older
data is purged from the warehouse onto slower archival storage.

Use of DWH
Ad-hoc analyses and reports
Data mining: identification of trends
Management Information Systems
Designing a database for a Data Warehouse
1. Define User requirements, considering different views of users from different
departments.
2. Identify data integrity, synchronization and security issues/bottlenecks.
3. Identify technology, performance, availability & utilization requirements.
4. Review normalized view of relational data to identify entities.
5. Identify dimensions.
6. Create and organize hierarchies of dimensions.
7. Identify attributes of dimensions.
8. Identify fact table(s).
9. Create data repository (metadata).
10. Add calculations.
Datamart
A data mart is a subset of the data warehouse, designed for a particular line of
business, such as sales, marketing, or finance.
In a dependent data mart, data can be derived from an enterprise-wide data
warehouse.
In an independent data mart, data can be collected directly from sources
May be structured for specific access tools
Datamart is the data warehouse you really use
Why Datamart?
1. Data warehouse projects are very expensive and time-consuming.
2. The success rate of DWH projects is low.
To avoid a single point of failure, department-wise needs are identified first and a
data mart is built. If it succeeds, data marts are built for the other departments and
all data marts are integrated into a data warehouse.
Advantages
Improve data access performance
Simplify end-user data structures
Facilitate ad hoc reporting
Slno  Data warehouse                               Data mart
1.    Operates at an enterprise level and          Used by a specific business department and
      contains all data used for reporting         focused on a specific subject (business area).
      and analysis                                 A data mart is a subset of the DWH.

DWH ARCHITECTURE
Data warehouse architecture is a way of representing the overall structure of data,
communication, processing and presentation that is planned for end-user computing
within the enterprise. The architecture has the following main parts:
Operational data base
Information access layer
Data Access layer
Data dictionary (metadata) layer
Process management layer
Application messaging layer
Processing (Data Warehouse) layer
Data Staging layer

Operational data is the information related to the day-to-day functioning of an
organization. An operational database stores business transactions critical to the
functioning of the organization.
Information access layer is the layer that the end-user deals with directly.
Examples of these are ad-hoc query tools like Business Objects, Power Play and
Impromptu.

Data access layer is the data interchange layer. This layer provides interface
between operational data bases and information access layers. The common data
language used is SQL. A familiar example of a data access layer is ODBC.
Metadata layer holds a repository of Metadata information. Metadata is defined
as data about data, resulting in an intelligent, efficient way to manage data. Metadata
provides the structure and content of the data warehouse, source and mapping
information, transformation / integration description and business rules. It is essential for
quality improvement in a Data Warehouse.
Process management layer is involved in scheduling the various tasks that must
be executed to build and maintain the data warehouse and data repository. It also helps to
keep the Data Warehouse up-to-date.
Application messaging layer transports information around the enterprise's
computing network. It also acts as middleware and isolates applications from the exact
data format on either end.
Processing (data warehouse) layer is the logical view of the informational data. It
also performs the summarization, loading and processing of data from operational
databases.
Data staging layer manages data replication across servers. It also manages data
transformation.

ETL
1. ETL means Extraction, transformation, and loading.
2. ETL refers to the methods involved in accessing and manipulating source data
and loading it into a target database.

ETL Process
ETL is a process that involves the following tasks:
extracting data from source operational or archive systems which are the primary
source of data for the data warehouse
transforming the data - which may involve cleaning, filtering, validating and
applying business rules
loading the data into a data warehouse or any other database or application that
houses data
Transform
1. Denormalize data
2. Data cleaning.
3. Case conversion
4. Data trimming
5. String concatenation
6. Datatype conversion
7. Decoding
8. Calculation
9. Data correction.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data,
typically as part of the ETL process.
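A minimal sketch of such transformation and cleansing rules in SQL, assuming hypothetical staging and target tables stg_customer and dim_customer (names and columns are assumptions; string concatenation syntax varies by database):

-- Illustrative only: stg_customer and dim_customer are assumed staging/target tables.
INSERT INTO dim_customer (customer_code, full_name, city, birth_date, gender, annual_income)
SELECT
    TRIM(cust_code),                                   -- data trimming
    UPPER(first_name) || ' ' || UPPER(last_name),      -- case conversion + concatenation
    COALESCE(TRIM(city), 'UNKNOWN'),                   -- cleansing: fix missing values
    CAST(birth_date_text AS DATE),                     -- datatype conversion
    CASE gender_flag WHEN 'M' THEN 'Male'
                     WHEN 'F' THEN 'Female'
                     ELSE 'Unknown' END,               -- decoding
    monthly_income * 12                                -- calculation
FROM stg_customer;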
Data Staging Area
1. Most complex part in the architecture.
2. A place where data is processed before entering the warehouse
3. It involves...
Extraction (E)
Transformation (T)
Load (L)
Indexing
Popular ETL Tools

Tool Name                                    Company Name
Informatica                                  Informatica Corporation
DT/Studio                                    Embarcadero Technologies
DataStage                                    IBM
Ab Initio                                    Ab Initio Software Corporation
Data Junction                                Pervasive Software
Oracle Warehouse Builder                     Oracle Corporation
Microsoft SQL Server Integration Services    Microsoft
TransformOnDemand                            Solonde
Transformation Manager                       ETL Solutions

Dimensional Modeling
Means storing data in fact and dimension tables.
Here data is fully denormalized
Dimension table
1. Dimension table gives the descriptive attributes of a business.
2. They are fully denormalized
3. It has a primary key
4. Data is arranged in a hierarchical manner (product to category; month to year), which
supports drill-down and drill-up analysis
5. Has fewer records
6. Has a rich number of descriptive columns
7. Heavily indexed
8. Dimension tables are sometimes called lookup or reference tables.
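As a sketch only (the product_dim table and its columns are assumed for illustration, not taken from these notes), a denormalized dimension table with an embedded hierarchy might look like:

-- Hypothetical product dimension: denormalized, with the hierarchy (product -> category -> department)
-- folded into one table so drill-down / drill-up can be done by grouping on its columns.
CREATE TABLE product_dim (
    product_key    INTEGER      PRIMARY KEY,   -- surrogate key
    product_id     VARCHAR(20),                -- natural/business key from the source system
    product_name   VARCHAR(100),
    brand_name     VARCHAR(50),
    category_name  VARCHAR(50),                -- hierarchy level above product
    department     VARCHAR(50)                 -- hierarchy level above category
);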
Types of Dimensions
1. Normal Dimension
2. Conformed Dimension
3. Junk Dimension
4. Degenerate Dimension
5. Role Playing Dimension
Conformed Dimension
A dimension table used by more than one fact table is called a conformed dimension
(a dimension that is linked to multiple fact tables).
(Figure: conformed dimensions such as D1, D2 and D3 shared by the fact tables FT1, FT2 and FT3.)

Advantages:
1. Avoids unnecessary space
2. Reduces time
3. Allows drill-across between fact tables
Junk Dimension
A junk dimension is an abstract dimension that reduces the number of foreign keys in the
fact table. This is achieved by combining two or more dimensions into a single dimension.

Degenerate Dimension
A key value in the fact table that has no dimension table with descriptive attributes,
i.e., a column that is neither a foreign key nor a numerical measure and is used for grouping purposes.
Ex: Invoice Number, Ticket Number

Role Playing Dimension

A single physical dimension table plays different roles with the help of views.
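For example (a common illustration, not from these notes; date_dim and the view names are assumed), a single date dimension can play the order-date and ship-date roles through two views:

-- One physical date_dim playing two roles through views.
CREATE VIEW order_date_dim AS
    SELECT date_key AS order_date_key, full_date AS order_date, month, quarter, year
    FROM   date_dim;

CREATE VIEW ship_date_dim AS
    SELECT date_key AS ship_date_key, full_date AS ship_date, month, quarter, year
    FROM   date_dim;
-- The fact table then joins to order_date_dim and ship_date_dim through two different foreign keys.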

Fact Table
1. The centralized table in a star schema is called the FACT table
2. A fact table typically has two types of columns:
Numerical measures and
Foreign keys to dimension tables.
3. The primary key of a fact table is usually a composite key that is made up of all of
its foreign keys
4. Fact tables store different types of measures like
additive,
non additive and
semi additive measures
5. A fact table might contain either detail level facts or facts that have been
aggregated
6. A fact table usually contains facts with the same level of aggregation.
7. Has millions of records
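A minimal sketch of such a fact table, assuming the hypothetical dimension tables time_dim, product_dim and store_dim already exist (all names are illustrative, not taken from these notes):

-- Hypothetical sales fact table: foreign keys to dimensions, a composite primary key
-- built from those foreign keys, numeric measures, and a degenerate dimension (invoice number).
CREATE TABLE sales_fact (
    time_key       INTEGER NOT NULL REFERENCES time_dim(time_key),
    product_key    INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key      INTEGER NOT NULL REFERENCES store_dim(store_key),
    invoice_number VARCHAR(20),                -- degenerate dimension: no dimension table of its own
    sales_amount   DECIMAL(12,2),              -- additive measure
    quantity_sold  INTEGER,                    -- additive measure
    PRIMARY KEY (time_key, product_key, store_key)
);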
Measure Types

Additive - Measures that can be summarized across all dimensions.


o Ex: sales
Non Additive - Measures that cannot be summarized across all dimensions.
o Ex: averages
Semi Additive - Measures that can be summarized across few dimensions and not
with others.
o Ex: inventory levels
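For illustration, assuming a sales_fact with an additive sales_amount and a hypothetical inventory_fact(date_key, product_key, quantity_on_hand) with a semi-additive measure:

-- Additive measure: sales_amount can be summed across every dimension.
SELECT product_key, SUM(sales_amount) AS total_sales
FROM   sales_fact
GROUP  BY product_key;

-- Semi-additive measure: quantity_on_hand can be summed across products for one day,
-- but summing it across days double-counts stock; average (or take the last value) over time instead.
SELECT product_key, AVG(quantity_on_hand) AS avg_stock_level
FROM   inventory_fact
GROUP  BY product_key;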

Factless Fact
A fact table that contains no measures or facts is called a Factless Fact table.
Slowly Changing Dimensions
1. Dimensions that change over time are called Slowly Changing
Dimensions
2. Slowly Changing Dimensions are often categorized into three types
namely
Type1,
Type2 and
Type3

Type 1 SCD :
Used if history is not required
Overwriting the old values.
Product Price in 2004:
Product ID (PK)   Year   Product Name   Product Price
1                 2004   Product1       $150

Product Price in 2005 (the old row is overwritten):
Product ID (PK)   Year   Product Name   Product Price
1                 2005   Product1       $250
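A minimal sketch of a Type 1 change in SQL, assuming a product_dim table whose columns follow the table above (all names are assumptions):

-- Type 1: overwrite in place, no history kept.
UPDATE product_dim
SET    product_year = 2005, product_price = 250.00
WHERE  product_id = 1;          -- the $150 value from 2004 is lost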
Type 2 SCD:
If history and current value needed
Creating an additional record (a new record with the new values and a new
surrogate key)
Mostly preferred in dimensional modeling
Product table:
Product    Effective             Year   Product    Product   Expiry
ID (PK)    DateTime (PK)                Name       Price     DateTime
1          01-01-2004 12.00AM    2004   Product1   $150      12-31-2004 11.59PM
1          01-01-2005 12.00AM    2005   Product1   $250      (current)
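A sketch of the Type 2 load in SQL, assuming the product_dim columns shown above plus a surrogate key column product_key (all names are assumptions):

-- Type 2: expire the current row, then insert a new row with a new surrogate key.
UPDATE product_dim
SET    expiry_datetime = TIMESTAMP '2004-12-31 23:59:00'
WHERE  product_id = 1 AND expiry_datetime IS NULL;

INSERT INTO product_dim
      (product_key, product_id, effective_datetime, product_year, product_name, product_price, expiry_datetime)
VALUES (2, 1, TIMESTAMP '2005-01-01 00:00:00', 2005, 'Product1', 250.00, NULL);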

Type 3 SCD:
Used when changes are infrequent
Only one previous level of history is kept
Creating new fields.
Product Price in 2005:
Product ID (PK)   Current Year   Product Name   Current Product Price   Old Product Price   Old Year
1                 2005           Product1       $250                    $150                2004
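A sketch of a Type 3 change in SQL, assuming the product_dim columns shown above (names are assumptions):

-- Type 3: move the current value into the "old" columns, then store the new value;
-- only one previous level of history is retained.
UPDATE product_dim
SET    old_product_price     = current_product_price,
       old_year              = current_year,
       current_product_price = 250.00,
       current_year          = 2005
WHERE  product_id = 1;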
Surrogate keys
Surrogate keys are always numeric and unique at the table level, which makes it
easy to distinguish and track values that change over time.
Surrogate keys are integers that are assigned sequentially as needed to populate a
dimension.
Surrogate keys merely serve to join dimensional tables to the fact table.
Surrogate keys are beneficial for the following reasons:
1. Reduces space used by the fact table
2. Faster retrieval of data (since retrieval on alphanumeric keys is costlier than
on numeric keys)
3. Maintaining an index is easier with a numeric key.
4. Makes it easier to maintain slowly changing dimensions.
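A sketch of declaring a sequentially assigned surrogate key (customer_dim and its columns are hypothetical; identity/auto-increment syntax varies by database):

CREATE TABLE customer_dim (
    customer_key  INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    customer_id   VARCHAR(20),    -- natural key from the source system
    customer_name VARCHAR(100),
    city          VARCHAR(50)
);
-- The fact table stores only the small integer customer_key, not the natural key.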
Data warehouse Design
The data warehouse design essentially consists of four steps, which are as
follows:
1. Identifying facts and dimensions
2. Designing fact tables
3. Designing dimension tables
4. Designing database schemas
Types of database schemas
There are three main types of database schemas:
1. Star Schema,
2. Snowflake Schema and
3. Starflake schema.
Star Schema
1. It is the simplest form of data warehouse schema that contains one or more
dimensions and fact tables

2. It is called a star schema because the entity-relationship diagram between


dimensions and fact tables resembles a star where one fact table is connected to
multiple dimensions
3. The center of the star schema consists of a large fact table and it points towards
the dimension tables
4. Fact table = highly normalized; Dimension table = highly denormalized.
5. It can be very effective to treat fact data as primarily read-only data, and
dimensional data as data that will change over a period of time
Advantages:
Star schema is easy to define.
It reduces the number of physical joins.
Provides very simple metadata.
Drawbacks:
Summary data in fact tables (such as sales amount by region, district-wise, or year-wise)
yields poor performance for summary levels and for huge dimension tables.

Steps in designing Star Schema


1. Identify a business process for analysis (like sales).
2. Identify measures or facts (sales dollar).
3. Identify dimensions for facts (product dimension, location dimension, time
dimension, organization dimension).

4. List the columns that describe each dimension. (Region name, branch name,
employee name).
5. Determine the lowest level of summary in a fact table (sales dollar).
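Following these steps, a typical star-join query might look like the sketch below (sales_fact, product_dim, location_dim and time_dim are assumed names, not taken from these notes):

-- Sales dollar by product, region and month across the star schema.
SELECT p.product_name, l.region_name, t.year, t.month,
       SUM(f.sales_dollar) AS total_sales
FROM   sales_fact f
JOIN   product_dim  p ON f.product_key  = p.product_key
JOIN   location_dim l ON f.location_key = l.location_key
JOIN   time_dim     t ON f.time_key     = t.time_key
GROUP  BY p.product_name, l.region_name, t.year, t.month;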
Fact constellation:
Dimension tables may, in turn, have their own dimension tables. In this case, the
Store dimension will contain District ids and Region ids, which reference the
district and region dimensions of the Store dimension, respectively. This schema is
called a Fact Constellation Schema.
Snowflake schema
1. A snowflake schema is a star schema structure normalized through the use of
outrigger tables, i.e., dimension table hierarchies are broken into simpler tables.
2. It represents dimensional hierarchies directly by normalizing the dimension tables,
i.e., all dimensional information is stored in third normal form.
3. This implies dividing the dimension tables into more tables, thus avoiding
dependencies between non-key attributes.
Advantages:
Snowflake schema provides best performance when queries involve aggregation.
Disadvantages:
Maintenance is complicated.
Increase in the number of tables.
More joins will be needed
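As an illustration of snowflaking (names are assumed, not from these notes), the product hierarchy from the earlier denormalized dimension could be split into normalized outrigger tables:

-- The category level is moved into its own table; the product dimension keeps only a foreign key.
CREATE TABLE category_dim (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50),
    department    VARCHAR(50)
);

CREATE TABLE product_dim_sf (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_key  INTEGER REFERENCES category_dim(category_key)  -- extra join needed at query time
);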

Starflake Schema
1. A combination of the denormalized star and the normalized snowflake schemas.

Star Schema vs Snowflake Schema

Slno  Star Schema                                      Snowflake Schema
1.    Dimension table will not have any parent table   Dimension table will have one or more parent tables
2.    Hierarchies for the dimensions are stored in     Hierarchies are broken into separate tables
      the dimension table itself                       in the snowflake schema

Granularity
Granularity means the level of detail of the data stored in the fact table.
Types of Granularity
1. Transactional Level Granularity
2. Periodic Snapshot Granularity
Transactional Level Granularity
Mostly used
Each and every transaction stored in fact table
Drill down and drill up analysis can be done
Disadvantage
1. Size increases.
Periodic Snapshot Granularity
Data summarized over a period is stored in the fact table
Adv : Faster retrieval (less records)
Disadv : Detail information not available
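A sketch of deriving a periodic snapshot from a transaction-grain fact (monthly_sales_snapshot, sales_fact and time_dim are assumed names):

-- Monthly snapshot: fewer rows and faster summary queries,
-- but the individual transactions are no longer visible at this grain.
INSERT INTO monthly_sales_snapshot (sales_year, sales_month, product_key, total_sales, total_quantity)
SELECT t.year, t.month, f.product_key,
       SUM(f.sales_amount), SUM(f.quantity_sold)
FROM   sales_fact f
JOIN   time_dim t ON f.time_key = t.time_key
GROUP  BY t.year, t.month, f.product_key;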

FAQ
Hierarchy
1. Hierarchies are logical structures that use ordered levels as a means of organizing
data.
2. A hierarchy can be used to define data aggregation.
Example
country > state > city > zip
In a time dimension, a hierarchy might be used to aggregate data from the Month
level to the Quarter level, and from the Quarter level to the Year level.
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that
represents data at the Month, Quarter, and Year levels.
Operational Data Store

In recent times, OLAP functionality has been built into OLTP systems; the result is
called an ODS (operational data store).
An ODS is a physical set of tables sitting between the operational systems and the data
warehouse, or a specially administered hot partition of the data warehouse itself.
The main purpose of an ODS is to provide immediate reporting of operational results
when neither the operational system nor the regular data warehouse can provide
satisfactory access.
Since an ODS is necessarily an extract of the operational data, it may also play the
role of a source for the data warehouse.

Data Staging Area


1. A storage area where source data is cleaned, transformed, combined, de-duplicated
and prepared for use in the data warehouse.
2. The data staging area is everything between the source systems and the data
presentation server.
3. No querying should be done in the data staging area because the data staging area
normally is not set up to handle fine-grained security, indexing or aggregation for
performance.
Data Warehouse Bus Matrix
1. The matrix helps prioritize which dimensions should be tackled first for
conformity given their prominent roles.
2. The matrix allows us to communicate effectively within and across data mart
teams.
3. The columns of the matrix represent the common dimensions.
4. The rows identify the organization's business processes.
Degenerate Dimension
Operational control numbers such as invoice numbers, order numbers and bill-of-lading
numbers look like dimension keys in a fact table but do not join to any actual dimension
table. They would give rise to empty dimensions, hence we refer to them as Degenerate
Dimensions (DD).
