Data Warehousing Concepts

Version 1.0
January 2019
CitiusTech has prepared the content contained in this document based on information and knowledge that it reasonably believes to be
reliable. Any recipient may rely on the contents of this document at its own risk and CitiusTech shall not be responsible for any error
and/or omission in the preparation of this document. The use of any third party reference should not be regarded as an indication of
an endorsement, an affiliation or the existence of any other kind of relationship between CitiusTech and such third party
Agenda
▪ Closer Look at Data warehouse and its Key Elements

▪ Data Warehousing SCHEMAS
▪ Data Warehousing Objects
▪ Types of Dimensions
▪ Kimball Vs Inmon
▪ Introducing Business Intelligence (BI)
▪ OLAP/ MOLAP/ ROLAP
▪ Other Important Key Elements
Data Warehouse
▪ A data warehouse is a relational database that is designed for query and analysis rather than for
transaction processing. It usually contains historical data derived from transaction data.
▪ A data warehouse environment includes an extraction, transformation, and loading (ETL)
solution, online analytical processing (OLAP), data mining capabilities, client analysis tools, and
other applications that manage the process of gathering data and delivering it to business users.
▪ It is a series of processes, procedures, and tools (h/w and s/w) that help the enterprise
understand more about itself, its products, its customers and the market it services
Facts!
▪ A Data Warehouse is NOT a specific technology.
▪ It is NOT possible to purchase a Data Warehouse, but it is possible to build one.
Why Data Warehousing?
▪ Need for intelligent information in a competitive market
• Who are the potential customers?
• What are the region-wise preferences?
• Which products are sold the most?
• What are the competitor products?
• What will be the impact on revenue?
• What are the projected sales?
• What are the results of the promotion schemes introduced?
• What if you sell a larger quantity of a particular product?
Defining Data Warehouse
▪ William Inmon – “Data Warehouse is a subject-oriented, integrated, nonvolatile and time-variant
collection of data in support of management’s decisions.”
▪ Ralph Kimball – “A data warehouse is a copy of transaction data specifically structured for query
and analysis”.
▪ A data warehouse is oriented for data consumption as opposed to online transaction processing
systems. It’s therefore designed for better analytic performance. It’s aimed to be the organization’s
“single version of truth”. Two key aspects of data warehouse are consistency and data history.
• Consistency is guaranteed by providing a single view of the information regardless of the data
source.
• Historical data is stored in order to analyze change over time.
Subject Oriented
▪ The data in the data warehouse is organized around the major subjects of the enterprise
(i.e. the high-level entities).
▪ The orientation around the major subject areas causes the data warehouse design to
be data driven.
▪ The operational systems are designed around applications and functions (for example,
loans, savings, and credit cards in the case of a bank), whereas a Data Warehouse is
designed around subjects such as Customer, Product, Vendor, etc.
[Figure: Operational Systems (organized by processes or tasks) versus Data Warehouse
(organized by subject: Customer, Supplier, Product)]
Time Variant
▪ Data is stored as a series of snapshots or views which record how it is collected across time.
[Figure: a Data Warehouse record made up of Time and Data, with Time as part of the Key]
▪ It helps in business trend analysis
▪ In contrast to an OLTP environment, data warehouses focus on change over time; that is what we
mean by time variant.
Integrated
▪ Data is stored once in a single integrated location.
[Figure: Customer data stored in several databases (Auto Policy Processing System, Fire Policy
Processing System, FACTS/LIFE Commercial and Accounting Applications) is integrated into a single
Data Warehouse database organized by Subject = Customer]
▪ It is closely related to subject orientation
▪ Data from disparate sources needs to be put in a consistent format
▪ Resolves problems such as naming conflicts and inconsistencies
Non-Volatile
▪ Existing data in the warehouse is not overwritten or updated
[Figure: External sources and production databases feed the Data Warehouse. Production
applications Update, Insert, and Delete in the production database; the Data Warehouse
environment is Load and Read-Only]
▪ This is logical because the purpose of a data warehouse is to enable you to analyze what has
occurred
Data Granularity
▪ It refers to the level of detail
▪ It is inversely proportional to the amount of data stored
▪ Data is summarized at different levels
▪ Many Data warehouses have at least two levels of granularity
▪ Summarized data is stored
▪ It reduces storage costs
▪ It reduces CPU usage
▪ It increases performance since a smaller number of records has to be processed
▪ Design is around traditional high-level reporting needs
▪ Tradeoff with volume of data to be stored and detailed usage of data
OLTP vs. Data Warehouse (1/2)
▪ So, what’s different between OLTP and Data Warehouse?

• OLTP systems are tuned for known transactions and workloads while workload is not known
in a data warehouse
• Special data organization, access methods and implementation methods are needed to
support data warehouse queries (typically multidimensional queries)
• For example, average amount spent on phone calls between 9AM-5PM in Vijayawada during
the month of February, 2007
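The example query above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `call_fact` table, its columns, and its rows are assumptions made up for the sketch, not part of any real system.

```python
import sqlite3

# Hypothetical call-detail table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call_fact (city TEXT, call_start TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO call_fact VALUES (?, ?, ?)",
    [
        ("Vijayawada", "2007-02-03 10:15:00", 12.0),
        ("Vijayawada", "2007-02-10 16:40:00", 8.0),
        ("Vijayawada", "2007-02-10 20:05:00", 30.0),  # outside 9AM-5PM
        ("Vijayawada", "2007-03-01 11:00:00", 50.0),  # outside February
        ("Hyderabad", "2007-02-05 09:30:00", 99.0),   # different city
    ],
)


def average_daytime_spend(city, month):
    """Average call amount for a city and 'YYYY-MM' month, calls starting 09:00-16:59."""
    row = conn.execute(
        """
        SELECT AVG(amount)
        FROM call_fact
        WHERE city = ?
          AND strftime('%Y-%m', call_start) = ?
          AND CAST(strftime('%H', call_start) AS INTEGER) BETWEEN 9 AND 16
        """,
        (city, month),
    ).fetchone()
    return row[0]


avg_feb = average_daytime_spend("Vijayawada", "2007-02")
```

Each condition in the WHERE clause constrains one dimension (place, month, time of day), which is what makes the query multidimensional.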
OLTP vs. Data Warehouse (2/2)
OLTP                                        DW
• Application Oriented                      • Subject Oriented
• Used to run business                      • Used to analyze business
• Detailed data                             • Summarized and refined data
• Data is current & up-to-date              • Snapshot data
• Isolated data                             • Integrated data
• Repetitive access                         • Ad hoc access
• Clerical user                             • Knowledge user (Manager)
• Performance sensitive                     • Performance relaxed
• Few records accessed at a time (tens)     • Large volumes accessed at a time (millions)
• Read/Update access                        • Mostly Read (Batch Update)
• No data redundancy                        • Redundancy present
• Database Size 100 MB - 100 GB             • Database Size 100 GB - few terabytes
▪ OLTP systems are used to “run” a business
▪ The Data Warehouse helps to “optimize” the business
Complete Warehouse Solution Architecture
[Figure: Complete warehouse solution architecture. Data sources (legacy data, operational data,
external data sources) pass through Extract / Transform / Load into the organizationally
structured Enterprise Data Warehouse, with Metadata, which feeds departmentally structured data
marts (Sales, Inventory, Purchase). Stages shown: Data Sources, Data Management, Access;
Data, Information, Knowledge; Asset Assembly (and Management), Asset Exploitation]
Introduction To Data Marts
What is a Data Mart?
▪ From the Data Warehouse, atomic data flows to various departments for their customized needs.
If this data is periodically extracted from data warehouse and loaded into a local database, it
becomes a data mart.
▪ The data in Data Mart has a different level of granularity than that of Data Warehouse. Since the
data in Data Marts is highly customized and lightly summarized, the departments can do whatever
they want without worrying about resource utilization.
▪ Also, the departments can use the analytical software they find convenient. The cost of processing
becomes very low.
Data Mart Overview
[Figure: The Data Warehouse feeds the Data Marts (DM Sales, DM Marketing, DM HR, DM Finance).
User groups shown: Sales Representatives and Analysts; Human Resources; Financial Analysts,
Strategic Planners, and Executives]
▪ Data Marts satisfy 80% of the local end-users’ requests
From DW To Data Marts
[Figure: Pyramid rising from Data (base) to Information (top), with three levels: Organizationally
Structured (Data Warehouse), Departmentally Structured, Individually Structured]
Operational Data Store (ODS)
▪ ODS provides an integrated view of data in operational systems. As the below figure indicates,
there is a clear separation between ODS and the data warehouse.
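[Figure: Sources A, B, and C feed the Operational Data Store and the Data Warehouse, which serve
EIS, DSS, and PC applications]
Operational Data Store
• Current or near current data
• Detailed data
• Updates allowed
Data Warehouse
• Historical data
• Summary and detail
• Non-volatile snapshots only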
Benefits of ODS
▪ Supports operational reporting needs of the organization
▪ Provides a complete view of customer relationships, the data for which might be stored in several
operational databases -- this data can include data from an organization’s internal systems, as well
as external data from third-party vendors.
▪ Operates as a store for detailed data, updated frequently and used for drill-downs from the data
warehouse which contains summary data
▪ Reduces the burden placed on other operational or data warehouse platforms by providing an
additional data store for reporting
▪ Provides more current data than in a data warehouse and more integrated than an OLTP system
▪ Feeds other operational systems in addition to the data warehouse
Data Warehousing SCHEMAS
▪ A schema is a collection of database objects, including tables, views, indexes, and synonyms.
▪ There are various ways of arranging schema objects in the schema models designed for data
warehousing. They are:
• Star Schema
• Snowflake Schema
• Galaxy Schema
Star Schema
▪ It consists of a fact table connected to a set of dimension tables. Data in the dimension
tables is de-normalized.
Snowflake Schema
▪ It is a refinement of the star schema where some dimensional hierarchy is normalized into a
set of dimension tables
Galaxy Schema
▪ Multiple fact tables share dimension tables; viewed as a collection of stars, therefore called
galaxy schema
Star Schema
▪ A Star Schema is a highly de-normalized, query-centric model where information is broken into two
groups
• Facts
• Dimensions
[Figure: star schema with Sales_Fact at the center and five dimension tables around it]
Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, plus the required data
(business metrics or measures)
Employee_Dim: EmployeeKey, EmployeeID, ...
Time_Dim: TimeKey, TheDate, ...
Branch_Dim: BranchID, Branchno, ...
Shipper_Dim: ShipperKey, ShipperID, ...
Customer_Dim: CustomerKey, CustomerID, ...
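A minimal sketch of a star schema, using Python's built-in `sqlite3` module and only two of the dimensions from the figure. The `SalesAmount` measure and all sample rows are assumptions added for illustration; the slide does not name the measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Time_Dim (TimeKey INTEGER PRIMARY KEY, TheDate TEXT);
    CREATE TABLE Customer_Dim (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT);
    -- The fact table holds foreign keys to the dimensions plus the measures.
    CREATE TABLE Sales_Fact (
        TimeKey     INTEGER REFERENCES Time_Dim(TimeKey),
        CustomerKey INTEGER REFERENCES Customer_Dim(CustomerKey),
        SalesAmount REAL  -- hypothetical measure
    );
    INSERT INTO Time_Dim VALUES (1, '2019-01-01'), (2, '2019-01-02');
    INSERT INTO Customer_Dim VALUES (10, 'C-10'), (11, 'C-11');
    INSERT INTO Sales_Fact VALUES (1, 10, 100.0), (1, 11, 40.0), (2, 10, 60.0);
    """
)

# A typical star join: one hop from the fact table to a dimension, then group.
sales_by_customer = conn.execute(
    """
    SELECT c.CustomerID, SUM(f.SalesAmount)
    FROM Sales_Fact AS f
    JOIN Customer_Dim AS c ON c.CustomerKey = f.CustomerKey
    GROUP BY c.CustomerID
    ORDER BY c.CustomerID
    """
).fetchall()
```

Because the dimension tables are de-normalized, every descriptive attribute is a single join away from the fact table.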
Snowflake Schema
[Figure 32.2: snowflake schema]
Fact Table
Sales_fact: timeID {FK}, propertyID {FK}, branchID {FK}, clientID {FK}, promotionID {FK},
staffID {FK}, ownerID {FK}, offerPrice, sellingPrice, saleCommission, saleRevenue
Dimension Tables
Branch_Dim: branchID {PK}, branchNo, branchType, city {FK}
City: city {PK}, region {FK}
Region: region {PK}, country
Galaxy Schema
▪ Multiple groups of facts linked by few common dimensions
[Figure: Fact1, Fact2, and Fact3 linked through shared dimensions (Dimension1, Dimension2,
Dimension5)]
Data Warehousing Objects (1/8)
Various objects used in Data Warehousing are:
▪ Fact Tables
▪ Dimension Tables
▪ Hierarchies
▪ Unique Identifiers
▪ Relationships
Fact Tables
▪ Represents a business process, i.e., models the business process as an artifact in the data
model
▪ Contains the measurements or metrics or facts of business processes
• "monthly sales number" in the Sales business process
• most of them are additive (sales this month), some are semi-additive (balance as of), some
are not additive (unit price)
▪ The level of detail is called the “grain” of the table
▪ Contains foreign keys for the dimension tables
Fact Types
▪ Additive facts
• Additive facts are facts that can be summed up through all of the dimensions in the fact table
▪ Semi-additive facts
• Semi-additive facts are facts that can be summed up for some of the dimensions in the fact
table
▪ Non-additive facts
• Non-additive facts are facts that cannot be summed up for any of the dimensions present in
the fact table
Additive Fact Table
Fact table columns: Date, Store, Product, Sales_Amount
▪ The purpose of this table is to record the Sales_Amount for each product in each store on a daily
basis. Sales_Amount is the fact.
▪ In this case, Sales_Amount is an additive fact, because we can sum up this fact along any of
the 3 dimensions present in the fact table – date, store, and product.
Fact table for Semi-additive and Non-additive facts
Fact table columns: Date, Account, Current_Balance, Profit_Margin
▪ The purpose of this table is to record the current balance for each account at the end of each day,
as well as the profit margin for each account for each day
▪ Current_Balance and Profit_Margin are the facts
▪ Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what’s
the total current balance for all accounts in the bank?), but it does not make sense to add them
up through time
▪ Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account
level or day level
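The semi-additive behaviour of Current_Balance can be sketched in a few lines of Python. The snapshot rows below are made-up sample data, and the function names are illustrative assumptions.

```python
# Hypothetical end-of-day snapshot rows: (date, account, current_balance).
snapshots = [
    ("2019-01-01", "A", 100.0),
    ("2019-01-01", "B", 50.0),
    ("2019-01-02", "A", 120.0),
    ("2019-01-02", "B", 50.0),
]


def total_balance_on(date):
    """Meaningful: sum Current_Balance across accounts for a single day."""
    return sum(balance for d, _, balance in snapshots if d == date)


def closing_balance(account):
    """Across time, take the latest balance instead of summing."""
    rows = [(d, balance) for d, acct, balance in snapshots if acct == account]
    return max(rows)[1]  # ISO dates sort chronologically


def misleading_sum_over_time(account):
    """Summing one account across days counts the same money repeatedly."""
    return sum(balance for _, acct, balance in snapshots if acct == account)
```

Here `total_balance_on` is a valid use of the fact, while `misleading_sum_over_time` shows why the same fact should not be added along the date dimension.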
Types of fact tables
▪ Based on the above classifications, there are two types of fact tables
• Cumulative: This type of fact table describes what has happened over a period of time.
For example, this fact table may describe the total sales by product, by store, by day.
The facts for this type of fact table are mostly additive.
Example columns: Date, Store, Product, Sales_Amount
• Snapshot: This type of fact table describes the state of things at a particular instant of
time, and usually includes more semi-additive and non-additive facts.
Example columns: Date, Account, Current_Balance, Profit_Margin
Dimension Tables
▪ Defines business in terms already familiar to users
▪ Wide rows with lots of descriptive text
▪ Small tables (about a million rows)
▪ Joined to fact table by a foreign key
▪ Heavily indexed
▪ Typical dimensions
• Time periods, geographic regions (markets, cities), products, customers, salesperson, etc.
Dimension Table Types
• Slowly Changing Dimensions
• Junk Dimensions
• Degenerate Dimensions
• Conformed Dimensions
Slowly Changing Dimensions: (SCD)
▪ Various data elements in the dimension undergo changes (for example, changes in attributes,
hierarchical structures) which need to be captured for analysis
▪ SCD problem is a common one particular to data warehousing
▪ In a nutshell, this applies to cases where the attribute for a record varies over time
▪ Example:
Customer key Name State
1001 Christina Illinois
▪ Christina is a customer who first lived in Chicago, Illinois. At a later date, she moved to Los
Angeles, California. Now, how to modify the table to reflect this change?
▪ This is a “Slowly Changing Dimension problem”
Types of SCD
▪ There are in general 5 ways to solve this type of problem, and they are categorized as follows:
• Type 1: New record replaces the original record; no trace of the old record
• Type 2: A new record is added to the customer dimension table
• Type 3: The original record is modified to reflect the change
• Type 4: Maintains a separate history table to track changes
• Type 6: Combination of Type 1, Type 2, and Type 3
Type 1
▪ New record replaces the original record; no trace of the old record
Customer key  Name       State
1001          Christina  Illinois
▪ After Christina moves from Illinois to California, the new information replaces the old record and
we have the following table:
Customer key  Name       State
1001          Christina  California
Advantages
▪ This is the easiest way to handle the Slowly Changing Dimension, since there is no need to keep
track of the old information.
Disadvantages
▪ All the history is lost. By applying this methodology, it is not possible to track back in
history. For example, in the above case, the company would not be able to know that Christina
lived in Illinois before.
Type 2
▪ In type 2 SCD, a new record is added to the table to represent the new information. Therefore,
both the original and the new record will be present
Customer key  Name       State
1001          Christina  Illinois
1005          Christina  California
▪ After Christina moves from Illinois to California, we add the new information as a new row into
the table
Advantages
▪ This allows us to accurately keep all the historical information
Disadvantages
▪ This will cause the size of the table to grow fast; where the number of rows for the table is
very high, storage and performance can become a concern
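The difference between Type 1 and Type 2 can be sketched in plain Python, using the Christina example. The row layout (a list of dictionaries) and the function names are assumptions made for illustration; a real warehouse would apply the same logic in its ETL tool or in SQL.

```python
def apply_type1(rows, customer_key, new_state):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    for row in rows:
        if row["customer_key"] == customer_key:
            row["state"] = new_state
    return rows


def apply_type2(rows, name, new_state, new_key):
    """Type 2: keep the old row and append a new row with a new key."""
    rows.append({"customer_key": new_key, "name": name, "state": new_state})
    return rows


def original_rows():
    return [{"customer_key": 1001, "name": "Christina", "state": "Illinois"}]


type1_rows = apply_type1(original_rows(), 1001, "California")
type2_rows = apply_type2(original_rows(), "Christina", "California", 1005)
```

After Type 1 there is still one row and Illinois is gone; after Type 2 there are two rows, so facts recorded before the move can still point at the Illinois row.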
Type 3
▪ In type 3 SCD, there will be two columns to indicate the particular attribute of interest, one
indicating the original value, and one indicating the current value. There will also be a column that
indicates when the current value becomes active.
Customer key  Name       Original State  Current State  Effective Date
1001          Christina  Illinois        California     15-Jan-03
▪ After Christina moves from Illinois to California, the original information gets updated, and we
have the above table (assuming the effective date of change is January 15, 2003)
Advantages
▪ This does not increase the size of the table since new information is updated
▪ This allows us to keep some part of history
Disadvantages
▪ Type 3 will not be able to keep all the history where an attribute is changed more than once.
For example, if Christina later moves from California to Texas on December 15, 2003, the
California information is lost
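A minimal sketch of Type 3 processing in Python, showing why an intermediate value disappears after a second change. The dictionary layout and the `apply_type3` name are assumptions for illustration.

```python
from datetime import date


def apply_type3(row, new_state, effective):
    """Type 3: keep the original value, overwrite only the current value."""
    updated = dict(row)
    updated["current_state"] = new_state
    updated["effective_date"] = effective
    return updated


row = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",
    "effective_date": None,
}

after_first_move = apply_type3(row, "California", date(2003, 1, 15))
after_second_move = apply_type3(after_first_move, "Texas", date(2003, 12, 15))
```

After the second move the row holds Illinois (original) and Texas (current); California is no longer recorded anywhere in it.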
Type 4
▪ The Type 4 method is usually referred to as using "history tables", where one table keeps the
current data, and an additional table is used to keep a record of some or all changes. Both the
surrogate keys are referenced in the Fact table to enhance query performance.
▪ In the example below, the original table name is Supplier and the history table name is
SupplierHistory
Supplier
Supplierkey  SupplierCode  SupplierName   SupplierState
1001         S123          ABC Suppliers  IL
SupplierHistory
Supplierkey  SupplierCode  SupplierName   SupplierState  Effective Date
1001         S123          A&B Suppliers  CA             15-Jan-03
1002         S123          ABC Suppliers  IL             22-Dec-04
Type 6 (1/3)
▪ The Type 6 method combines the approaches of types 1, 2, and 3 (1 + 2 + 3 = 6). This is also
called Hybrid SCD.
▪ Example
Supplier
Supplier key  Row Key  Supplier Code  Supplier Name  Current State  Historical State  Start Date   End Date     Current Flag
1001          1        ABC            ABC Suppliers  CA             CA                01 Jan 2001  31 Dec 9999  Y
▪ The Current State and the Historical State are the same. The optional Current Flag attribute
indicates that this is the current or most recent record for this supplier.
▪ When ABC Suppliers company moves to Illinois, we add a new record as in Type 2 processing;
however, a row key is included to ensure we have a unique key for each row
Type 6 (2/3)
Supplier
Supplier key  Row Key  Supplier Code  Supplier Name  Current State  Historical State  Start Date   End Date     Current Flag
1001          1        ABC            ABC Suppliers  CA             CA                01 Jan 2001  2004         N
1001          2        ABC            ABC Suppliers  IL             CA                23 Dec 2004  31 Dec 9999  Y
▪ We overwrite the Current Flag information in the first record (Row Key = 1) with the new
information, as in Type 1 processing. We create a new record to track the changes, as in Type 2
processing. And, we store the history in a second State column (Historical State), which
incorporates Type 3 processing.
▪ For example, if the supplier were to relocate again, we would add another record to the Supplier
dimension, and we would overwrite the contents of the Current State column
Type 6 (3/3)
SupplierHistory
Supplier key  Row Key  Supplier Code  Supplier Name  Current State  Historical State  Start Date   End Date     Current Flag
1001          1        ABC            ABC Suppliers  CA             CA                01 Jan 2001  2004         N
1001          2        ABC            ABC Suppliers  IL             CA                23 Dec 2004  31 Dec 2009  N
1001          3        ABC            ABC Suppliers  NY             IL                2010         2009         Y
Degenerate Dimension (1/2)
▪ A Degenerate Dimension is a dimension which has only a single attribute
▪ This dimension is typically represented as a single field in a fact table
▪ The data items that are not facts and data items that do not fit into the existing dimensions are
termed as Degenerate Dimensions
▪ Degenerate Dimensions are the fastest way to group similar transactions
▪ Degenerate Dimensions are used when fact tables represent transactional data
▪ They can be used as primary key for the fact table but they cannot act as foreign keys
Degenerate Dimension (2/2)
▪ A Degenerate Dimension is a dimension key without a corresponding dimension table.
▪ Example:
• In the Point Of Sale Transaction Fact table, we have Date Key (FK), Product Key (FK), Store Key
(FK), Promotion Key (FK), and POS Transaction Number
• Date Dimension corresponds to Date Key; Product Dimension corresponds to Product Key. In a
traditional parent-child database, POS Transaction Number would be the key to the
transaction header record that contains all the info valid for the transaction as a whole, such
as the transaction date and store identifier. But in this dimensional model, we have already
extracted this info into other dimensions. Therefore, POS Transaction Number looks like a
dimension key in the fact table but does not have a corresponding dimension table.
• Therefore, POS Transaction Number is a degenerate dimension
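A small sketch, using Python's built-in `sqlite3` module, of how a degenerate dimension is used to group the line items of one transaction. The table name `pos_fact`, the `sales_amount` measure, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pos_fact (
        date_key               INTEGER,
        product_key            INTEGER,
        store_key              INTEGER,
        promotion_key          INTEGER,
        pos_transaction_number TEXT,  -- degenerate dimension: no dimension table
        sales_amount           REAL   -- hypothetical measure
    );
    INSERT INTO pos_fact VALUES
        (20190101, 1, 7, 0, 'T-001', 5.0),
        (20190101, 2, 7, 0, 'T-001', 3.5),
        (20190101, 1, 7, 0, 'T-002', 5.0);
    """
)

# Group line items by the transaction number stored directly in the fact table.
basket_totals = conn.execute(
    """
    SELECT pos_transaction_number, COUNT(*), SUM(sales_amount)
    FROM pos_fact
    GROUP BY pos_transaction_number
    ORDER BY pos_transaction_number
    """
).fetchall()
```

No join is needed for the grouping, since the transaction number lives in the fact table itself.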
Conformed Dimensions
▪ A Conformed Dimension is a dimension which is fixed and reusable: it is defined once and shared,
with the same meaning, by multiple fact tables or data marts.
▪ It is also called a fixed dimension. Its values do not keep changing over time.
▪ For example, if the name of the city is changed from Bombay to Mumbai, the name will not
change from time to time; once the change is done, the change is permanent. These types of
dimensions are called conformed or fixed dimensions.
Junk Dimensions
▪ A dimension where one can store random transactional codes, flags and text attributes that are
not related to other dimensions and which provides a simple way for users to easily find those
unrelated attributes.
▪ Example: Marital Status: (Yes or No) Gender: (M or F) etc.
Additional Data Warehousing Objects
Hierarchies
▪ Hierarchies are logical structures that use ordered levels as a means of organizing
data. A hierarchy can be used to define data aggregation. For example, in a time
dimension, a hierarchy might aggregate data from the month level to the quarter
level to the year level. A level represents a position in a hierarchy.
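The month to quarter to year roll-up described above can be sketched in Python. The monthly figures are made-up sample data and the function names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical monthly sales keyed by (year, month).
monthly_sales = {
    (1998, 1): 10,
    (1998, 2): 20,
    (1998, 3): 30,
    (1998, 4): 40,
    (1998, 12): 5,
}


def quarter_of(month):
    """Map a month number (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1


def roll_up_to_quarter(monthly):
    """Aggregate from the month level to the quarter level."""
    totals = defaultdict(int)
    for (year, month), amount in monthly.items():
        totals[(year, quarter_of(month))] += amount
    return dict(totals)


def roll_up_to_year(quarterly):
    """Aggregate from the quarter level to the year level."""
    totals = defaultdict(int)
    for (year, _quarter), amount in quarterly.items():
        totals[year] += amount
    return dict(totals)


quarterly_sales = roll_up_to_quarter(monthly_sales)
yearly_sales = roll_up_to_year(quarterly_sales)
```

Each function moves the data one level up the hierarchy, so the year total is derived from the quarter totals rather than recomputed from the months.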
Unique Identifiers
▪ Unique identifiers are specified for one distinct record in a dimension table.
Artificial unique identifiers are often used to avoid the potential problem of
changing unique identifiers. Unique identifiers are represented with the # character.
For example, #customer_ID
Relationships
▪ Relationships guarantee business integrity. Designing a relationship between the

sales information in the fact table and the dimension tables products and customers
enforces the business rules in databases. Here the information of sales is related to
products and customers
Ralph Kimball Vs. Bill Inmon
Ralph Kimball's paradigm
▪ Data warehouse is the conglomerate of all data marts within the enterprise. Information is
always stored in the dimensional model.
Bill Inmon's paradigm
▪ Data warehouse is one part of the overall business intelligence system. An enterprise has
one data warehouse, and data marts source their information from the data warehouse. In the data
warehouse, information is stored in 3rd normal form.
Basic Design Approaches of Data Warehouse
▪ There are two major types of approaches to building or designing the Data Warehouse.
• The Top-Down Approach
• The Bottom-Up Approach
The Top Down Approach (1/2)
▪ The Dependent Data Mart structure or Hub & Spoke: The Top-Down Approach
• Inmon advocated a “dependent data mart structure”
• The data flow in the top down OLAP environment begins with data extraction from the
operational data sources. This data is loaded into the staging area and validated and
consolidated for ensuring a level of accuracy and then transferred to the Operational Data
Store (ODS).
• Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area
for aggregation, summarization and then extracted and loaded into the Data Warehouse.
• Once the Data Warehouse aggregation and summarization processes are complete, the data
mart refresh cycles will extract the data from the Data Warehouse into the staging area and
perform a new set of transformations on them. This will help organize the data in particular
structures required by data marts. Then the data marts can be loaded with the data and the
OLAP environment becomes available to the users.
Inmon Approach
▪ The data marts are treated as sub sets of the
data warehouse. Each data mart is built for an
individual department and is optimized for
analysis needs of the particular department
for which it is created.
The Bottom – Up Approach (1/3)
▪ The Data Warehouse Bus Structure: The Bottom-Up Approach
• Ralph Kimball designed the data warehouse with the data marts connected to it with a bus
structure
• The bus structure contained all the common elements that are used by data marts such as
conformed dimensions, measures, etc. defined for the enterprise as a whole
• This architecture makes the data warehouse more of a virtual reality than a physical reality
• All data marts could be located in one server or could be located on different servers across
the enterprise while the data warehouse would be a virtual entity being nothing more than a
sum total of all the data marts
• In this context, even the cubes constructed by using OLAP tools could be considered as data
marts
Kimball Approach
▪ The bottom-up approach reverses the
positions of the Data Warehouse and the data
marts. Data marts are directly loaded with
the data from the operational systems
through the staging area.
▪ The data flow in the bottom-up approach
starts with extraction of data from
operational databases into the staging area
where it is processed and consolidated and
then loaded into the ODS.
▪ The data in the ODS is appended to or
replaced by the fresh data being loaded. After
the ODS is refreshed, the current data is once
again extracted into the staging area and
processed to fit into the data mart structure.
The data from the data mart is then extracted
into the staging area, aggregated,
summarized and so on and loaded into the
Data Warehouse and made available to the
end user for analysis.
What is Business Intelligence (BI)?
Business Intelligence
▪ Business Intelligence is a generalized term applied to a broad category of applications and
technologies for gathering, storing, analyzing, and providing access to data to help enterprise
users make better business decisions
▪ Business Intelligence applications include the activities of decision support systems, query and
reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining
▪ An alternative way of describing BI is: the technology required to turn raw data into information
to support decision-making within corporations and business processes
Case Study
▪ Sales staff at a food mall sell different products to customers, e.g. Cheese Box
▪ All the customer transactions are maintained in the system
▪ The director of the food mall wants to create a report on how many Cheese Boxes he should
purchase daily, monthly, quarterly, and yearly
▪ Scanning all transactions for this report becomes very time consuming
▪ BI tools help him take quick decisions, as these reports are created quickly and presented in
a proper format
BI & Its Benefits
Why BI?
▪ BI technologies help bring decision-makers the data in a form they can quickly digest and apply to
their decision making
▪ BI turns data into information for managers and executives and in general, people making
decisions in a company
▪ Companies want to use technology tactically to make their operations more effective and more
efficient - Business intelligence can be the catalyst for that efficiency and effectiveness
Benefits of BI
▪ The benefits of a well-planned BI implementation are going to be closely tied to the business
objectives driving the project.
• Identify trends and anomalies in business operations more quickly, allowing for more accurate
and timelier decisions
• Deliver actionable insight and information to the right place with less effort
• Identify and operate based on a single version of the truth, allowing all analysis to be
completed on a core foundation with confidence
Business Intelligence Architecture
OLAP/ MOLAP/ ROLAP (1/5)
OLAP
▪ OLAP stands for On-Line Analytical Processing. It is an approach to quickly answer
multi-dimensional analytical queries
▪ OLAP can be broadly divided into two camps: Multidimensional OLAP (MOLAP) and
Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and
ROLAP
Querying the Cube
▪ Example question: What sales did we expect to achieve in North America for CY 2004 Q1?
[Figure: cube with quarters (Q1 to Q4) on one axis, Sales Territory (N/A, Pacific, Europe,
North America) on another, and a Measures dimension]
MOLAP
▪ This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary formats
Advantages
▪ Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for
slicing and dicing operations
▪ Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but also quickly returnable
Disadvantages
▪ Limited in the amount of data it can handle
▪ Requires additional investment: Cube technologies are often proprietary and do not already
exist in the organization. Therefore, to adopt MOLAP technology, additional investments in human
and capital resources are needed
ROLAP
▪ This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of
slicing and dicing is equivalent to adding a “where" clause in the SQL statement.
Advantages
▪ Can handle large amounts of data
▪ Often, the relational database already comes with a host of functionalities. ROLAP
technologies, since they sit on top of the relational database, can therefore leverage these
functionalities.
Disadvantages
▪ Performance can be slow
▪ Limited by SQL functionalities: ROLAP technology mainly relies on generating SQL statements to
query the relational database, and SQL statements do not fit all needs.
HOLAP
▪ HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. HOLAP leverages cube
technology for faster performance. When detail information is needed, HOLAP can “drill
through" from the cube into the underlying relational data.
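The ROLAP idea that each slice or dice adds a WHERE condition can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns, the sample rows, and the `total_sales` helper are assumptions for illustration, not the behaviour of any particular ROLAP product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (territory TEXT, quarter TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North America', 'Q1', 100.0),
        ('North America', 'Q2', 150.0),
        ('Europe',        'Q1',  80.0),
        ('Pacific',       'Q1',  20.0);
    """
)

ALLOWED_DIMENSIONS = {"territory", "quarter"}  # guard against arbitrary column names


def total_sales(**filters):
    """Sum sales, adding one WHERE condition per filtered dimension."""
    unknown = set(filters) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
    sql = "SELECT COALESCE(SUM(amount), 0) FROM sales"
    if filters:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
    return conn.execute(sql, tuple(filters.values())).fetchone()[0]


all_sales = total_sales()
q1_slice = total_sales(quarter="Q1")
q1_north_america = total_sales(territory="North America", quarter="Q1")
```

Fixing one dimension is a slice; fixing two or more narrows the result further, and the generated SQL grows by one condition each time.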
What is Drill Down?
▪ Drill Down explains the data from summary to details.
▪ It is nothing more than adding row headers from the dimension tables
▪ An explicit hierarchy is not needed to support drill down
True Meaning of Drill Down
Brand   Package Size  Sales
Brawny  2-Pack        $50
Brawny  3-Pack        $110
Brawny  6-Pack        $75

Brand   Package Size  Color  Sales
Brawny  2-Pack        White  $8
Brawny  2-Pack        Brown  $5
Brawny  2-Pack        Green  $37
Brawny  6-Pack        Pink   $12
Brawny  6-Pack        Brown  $4
Drill down really means “show me more detail.”
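Drill down as "adding a row header" can be sketched in Python. The rows are the ones visible in the detailed table above, which appears to show only part of the data, so only the 2-Pack total can be checked against the summary table. The `summarize` helper is an illustrative assumption.

```python
from collections import defaultdict

# Rows from the detailed table: (brand, package_size, color, sales).
rows = [
    ("Brawny", "2-Pack", "White", 8),
    ("Brawny", "2-Pack", "Brown", 5),
    ("Brawny", "2-Pack", "Green", 37),
    ("Brawny", "6-Pack", "Pink", 12),
    ("Brawny", "6-Pack", "Brown", 4),
]
COLUMNS = ("brand", "package_size", "color")


def summarize(rows, row_headers):
    """Total sales grouped by the chosen row headers."""
    positions = [COLUMNS.index(header) for header in row_headers]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in positions)] += row[3]
    return dict(totals)


summary = summarize(rows, ("brand", "package_size"))
# Drilling down: the same call with one more row header.
detail = summarize(rows, ("brand", "package_size", "color"))
```

The only change between the two calls is the extra `color` header, which is why no explicit hierarchy is required for drill down.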
Balanced Hierarchies
[Figure: balanced hierarchy]
1998
  Qtr1: Jan, Feb, Mar
  Qtr2: Apr, May, Jun
  Qtr3: Jul, Aug, Sep
  Qtr4: Oct, Nov, Dec
Ragged Hierarchies
[Figure: ragged hierarchy]
North America
  USA
    North West
      Oregon
      Washington
    California
  Canada
    Brit Columbia
  Mexico
    Dist Federal
    Zacatecas
Multiple Hierarchies
▪ A typical dimension contains one or more natural hierarchies
▪ Any attribute whether or not belonging to a hierarchy can be freely used while drilling up or
down
Why Aggregates?
▪ Data warehousing fact tables track lowest-level data
▪ Capability
• Group into most interesting natural clumps
• Pull out the groups and sub-groups
• Relax constraints on one or more of the dimensions
▪ This results in a vast number of records being pulled into the query
▪ Why not pre-store the summary?
What are Aggregates?
▪ An aggregate is a fact table record representing a summarization of base level fact table records.
▪ Category aggregate Sales Fact
• Time Key
• Category Key
• Store Key
• Promo Key
• Quantity - sold
• Dollars - sold
• Dollars - cost
• Customer - count
▪ Category Dimension
• Category Key
• Category
• Department
▪ Examples
• Category totals by store by day
• Category totals by month totals by store
▪ Each aggregate fact table record is always associated with one or more aggregate dimension
table records.
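A sketch of pre-storing a summary, using Python's built-in `sqlite3` module: a category-level aggregate fact table is built once from the base-level fact table ("category totals by store by day"). Table names, columns, and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category_key INTEGER);
    CREATE TABLE sales_fact (
        time_key INTEGER, product_key INTEGER, store_key INTEGER,
        quantity_sold INTEGER, dollars_sold REAL
    );
    INSERT INTO product_dim VALUES (1, 100), (2, 100), (3, 200);
    INSERT INTO sales_fact VALUES
        (20190101, 1, 7, 2, 10.0),
        (20190101, 2, 7, 1,  4.0),
        (20190101, 3, 7, 5, 25.0),
        (20190102, 1, 7, 3, 15.0);

    -- Pre-stored summary: category totals by store by day.
    CREATE TABLE category_sales_fact AS
    SELECT f.time_key, p.category_key, f.store_key,
           SUM(f.quantity_sold) AS quantity_sold,
           SUM(f.dollars_sold)  AS dollars_sold
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY f.time_key, p.category_key, f.store_key;
    """
)

category_totals = conn.execute(
    """
    SELECT time_key, category_key, store_key, quantity_sold, dollars_sold
    FROM category_sales_fact
    ORDER BY time_key, category_key
    """
).fetchall()
```

Queries at the category level can then read the smaller aggregate table instead of grouping the base-level rows every time.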
Benefit of Aggregates
▪ The use of pre-stored summaries is the single most effective tool the data warehouse designer
has to control performance
▪ Guaranteed to provide correct summary value, as it is risky to leave the grouping to the user
▪ Aggregations provide a home for planning data
Surrogate Key versus Natural Key
▪ In a data warehousing environment we cannot “afford” to use the natural keys, as they are
“expensive” in terms of the space they occupy.
▪ Natural keys are also known as Production keys, Smart keys, Intelligent keys. They are called so
because they have some information embedded in them about the record they represent.
▪ Surrogate keys also have various aliases like Meaningless keys, Integer keys, Non-natural keys,
Artificial keys. Surrogate keys are integers that are assigned sequentially to a column as needed
to populate a dimension.
▪ Advantages of Surrogate keys:
• Buffer the data warehouse from operational changes
• Save space
• Provide performance advantages
• Enable efficient handling of changes to dimension tables
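A minimal sketch of sequential surrogate key assignment is shown below. The class and method names are assumptions made for this example, and the natural keys ("S123", "S456") are placeholder values; the point is only that the warehouse hands out its own integers and can issue a fresh one when a dimension record changes.

```python
from itertools import count


class DimensionKeyMap:
    """Maps natural (production) keys to sequential integer surrogate keys."""

    def __init__(self):
        self._next_key = count(1)  # surrogate keys start at 1 in this sketch
        self._current = {}         # natural key -> current surrogate key

    def lookup_or_assign(self, natural_key):
        """Return the current surrogate key, assigning one on first sight."""
        if natural_key not in self._current:
            self._current[natural_key] = next(self._next_key)
        return self._current[natural_key]

    def assign_new_version(self, natural_key):
        """Issue a new surrogate key when the dimension record changes."""
        self._current[natural_key] = next(self._next_key)
        return self._current[natural_key]


key_map = DimensionKeyMap()
first = key_map.lookup_or_assign("S123")      # new natural key
second = key_map.lookup_or_assign("S456")     # another new natural key
repeat = key_map.lookup_or_assign("S123")     # already known, same key back
changed = key_map.assign_new_version("S123")  # record changed, new key

print(first, second, repeat, changed)
```

Because fact rows store only the small integer, a change in the operational system's key format does not ripple into the fact table.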
74
Why Helper Tables?
▪ Multi-valued dimensions are normally illegal in a dimensional design. We usually insist that when
the grain of a fact table is declared, the only legal dimensions that can be attached to that fact
table are those that take on a single value for that grain.
▪ The preferred way to handle this multi-valued dimension is with a "helper table," as shown in
Figure 1, where it is called the Account to Customer Map.
▪ The helper table creates a many-to-many link between the fact table and the dimension table. As
purist star-join dimensional designers, we like to avoid such situations, but there are a few
compelling circumstances where the physical world demands modeling a many-to-many
relationship between a fact table and a dimension table.
▪ Helper tables are also used to handle dimensions with complex hierarchies.
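The sketch below illustrates, under stated assumptions, how an Account to Customer Map could sit between an account-grain fact table and a customer dimension. The data, the function name, and in particular the weighting factor column are assumptions for this example; the slides do not say how the map allocates an account's facts among its customers.

```python
# Hypothetical fact rows at account grain: account key -> balance.
ACCOUNT_BALANCES = {1: 1000.0, 2: 500.0}

# Hypothetical helper table: (account_key, customer_key, weighting_factor).
# Account 1 is held jointly by two customers, so it appears twice.
ACCOUNT_TO_CUSTOMER_MAP = [
    (1, "C1", 0.5),
    (1, "C2", 0.5),
    (2, "C2", 1.0),
]


def balance_by_customer(balances, helper_rows):
    """Allocate each account balance to its customers via the helper table."""
    totals = {}
    for account_key, customer_key, weight in helper_rows:
        share = balances[account_key] * weight
        totals[customer_key] = totals.get(customer_key, 0.0) + share
    return totals


BY_CUSTOMER = balance_by_customer(ACCOUNT_BALANCES, ACCOUNT_TO_CUSTOMER_MAP)
print(BY_CUSTOMER)
```

With weights that sum to 1.0 per account, the customer-level totals add back up to the account-level total, which avoids double counting the joint account.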
75
Questions (1/11)
1. Which of the following statements is true?
a. A data warehouse is useful to all organizations that currently use OLTPs
b. A data warehouse is valuable only if the organization has an interest in analyzing historical
data
c. A data warehouse is valuable to those organizations that need to keep an audit trail of their
activities
d. A data warehouse is necessary to all those organizations that are using relational OLTPs
2. A data warehouse needs to be

a. Subject-orientated
b. Non-volatile
c. Capable of integrating data from a wide variety of sources
d. Time-variant
e. Capable of informing decisions
76
Questions (2/11)
3. Analytical processing is
a. The act of exporting data into a spreadsheet for analysis
b. The act of summarizing data on a regular basis (for example, month end summaries)
c. The act of using software to analyze highly consolidated data, often to view the changes over
time
d. The act of using a relational database to produce reports giving data summaries on a regular
basis (for example, monthly)
4. Data in a data warehouse

a. Subject-orientated
b. Non-volatile
c. Capable of integrating data from a wide variety of sources
d. Time-variant
e. Capable of informing decisions
77
Questions (3/11)
5. Transaction processing is
a. The act of processing, recording, and storing individual transactions in a database
b. The act of analyzing transactions on a regular basis (for example, monthly)
c. The act of processing individual transactions
d. The act of analyzing each transaction to verify that it is valid
6. A Network database
a. provides many to many links between physical records
b. provides many to many links based on the data
c. is only accessible over the internet
d. is only accessible over a network
78
Questions (4/11)
7. Which of the following is associated with a data warehouse?
a. A flat file
b. A star schema
c. A hierarchical and/or network structure
d. A relation
8. A hierarchical database is
a. a structure where one parent (record) can have many children (records) and each child can
have many parents
b. a database that is concerned wholly with the manipulation of physical records
c. a database that is concerned wholly with the manipulation of data
d. a tree structure where one parent (record) can have many children (records) and each child
can have only one parent
79
Questions (5/11)
9. Which of the following statements is true?
a. Adding data for the sake of it may well degrade the effectiveness of data warehousing
analysis
b. A data warehouse automatically makes a copy of every transaction recorded in an OLTP
system
c. A data warehouse is a relatively straightforward thing to set up
d. The more data a data warehouse has, the better it is

10. Which of the following statements is true?
a. A fact table describes the transactions stored in a DWH
b. The fact table of a data warehouse is the main store of descriptions of the transactions
stored in a DWH
c. A fact table describes the granularity of data held in a DWH
d. The fact table of a data warehouse is the main store of all of the recorded transactions over
time
80
Questions (6/11)
11. OLAP stands for
a. On Line Analytic Processing
b. On Line Analytical Processing
c. On Line Analysis Processing
d. On Line Analytical Process
12. The main organizational justification for implementing a data warehouse is to provide
a. Large scale transaction processing
b. Storage for large volumes of data
c. Cheaper ways of handling transactions
d. Decision support
81
Questions (7/11)
13. Dimensionality refers to
a. The number of dimension tables that exist in a star schema
b. The data that describes the transactions in the fact table
c. The level of detail that is held in the Data Warehouse
d. The level of detail of data that is held in the fact table
14. OLTP stands for

a. On Line Transaction Processing
b. On Line Terminal Processing
c. On Line Transact Process
d. On Line Transaction Process
82
Questions (8/11)
15. A relational database differs from both a hierarchical database and a network database because
a. A relational database links (joins) tables via data and the other models use physical links
b. A relational database needs a Windows environment
c. A relational database can handle large data sets whereas the other two models cannot
d. A relational database links (joins) tables via physical links and the other models use data links
16. Network databases are

a. Less flexible than hierarchical databases
b. More inflexible in producing reports than a hierarchical database
c. Simpler than hierarchical databases
d. More flexible than hierarchical databases at the cost of duplicate data
83
Questions (9/11)
17. This diagram represents a

a. A Star Schema
b. A Hyper Schema
c. A Relational Schema
d. A Hyper Star
84
Questions (10/11)
18. A data warehouse
a. Takes regular copies of transaction data
b. Takes regular copies of transaction data and stores it in a way that is optimized for query and
reporting
c. Must import data from transactional systems whenever significant changes occur in the
transactional data
d. Has to work on live transactional data to provide up-to-date and valid results
19. Granularity refers to

a. The number of fact tables in a data warehouse
b. The level of detail of the data descriptions held in a data warehouse
c. The level of detail of the data stored in a data warehouse
d. The number of dimensions in a data warehouse
85
Questions (11/11)
20. Hierarchical databases are good at
a. Processing a relatively small set of pre-defined queries from a large data set
b. Structuring data
c. Analyzing highly consolidated data over time
d. Enabling the user to develop ad hoc queries
86
Thank You
CitiusTech Contacts
Email: Ct-univerct@citiustech.com
www.citiustech.com
87
