
St. Vincent Pallotti College of Engineering & Technology

Data Warehousing and Mining (BEIT701T)
7th Sem B.E. (IT)
Presented By

SAMIR SIDDIQUI
CR(FINAL YEAR)
Department of Information Technology
Types of data
Operational data (OLTP application)
• Data that ‘works’.
• Frequent updates and queries
• Normalized (standardized) for efficient search
and updates
• Fragmented and local reference
• Point queries: queries accessing individual
tuples.
Cont…
Historical data (OLAP application)
• Data that ‘tells’.
• Very infrequent updates
• Analytical queries that require huge amounts
of aggregation.
• Integrated data set with global relevance
• Performance issues mainly in query response
time (not in updates)
e.g. of OLTP Queries
• What is the salary of Mr. X?
• What is the address and phone no. of the
person in charge of the supplies department?
e.g. of OLAP Queries
• How is the employee attrition scene changing
over the years across the company?
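The contrast between the two kinds of query can be sketched in code. This is a minimal illustration only: the `employee` table, its columns and its rows are assumptions invented for the example, held in an in-memory SQLite database.

```python
import sqlite3

# Hypothetical employee table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (name TEXT PRIMARY KEY, salary REAL, "
    "dept TEXT, left_year INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        ("X", 50000.0, "Supplies", None),
        ("Y", 42000.0, "Supplies", 2020),
        ("Z", 61000.0, "Sales", 2021),
        ("W", 38000.0, "Sales", 2021),
    ],
)


def oltp_salary(name):
    """OLTP-style point query: reads a single tuple through its key."""
    row = conn.execute(
        "SELECT salary FROM employee WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def olap_attrition_by_year():
    """OLAP-style query: scans many tuples and aggregates them."""
    rows = conn.execute(
        "SELECT left_year, COUNT(*) FROM employee "
        "WHERE left_year IS NOT NULL GROUP BY left_year ORDER BY left_year"
    ).fetchall()
    return dict(rows)


print(oltp_salary("X"))
print(olap_attrition_by_year())
```

The point query touches one row, while the attrition query has to read every row that records a departure before it can report a trend.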
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
December 22, 2022 Data Mining: Concepts and Techniques 5
OLTP vs. OLAP
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day to day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical, summarized,
                    detailed, flat relational,    multidimensional,
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc
access              read/write,                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response

Examples of OLAP applications in
various functional areas

7 © Pearson Education Limited 1995, 2005


OLAP Applications
• Although OLAP applications are found in
widely divergent functional areas, they all
have the following key features:
– multi-dimensional views of data
– support for complex calculations
– time intelligence

OLAP Applications - multi-dimensional views of data
• Core requirement of building a ‘realistic’
business model.
• Provides basis for analytical processing
through flexible access to corporate data.
• The underlying database design that
provides the multi-dimensional view of data
should treat all dimensions equally.
OLAP Applications - support for
complex calculations
• Must provide a range of powerful
computational methods such as those required
by sales forecasting, which uses trend
algorithms such as moving averages and
percentage growth.
• Mechanisms for implementing computational
methods should be clear and non-procedural.
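The two trend calculations named above can be sketched as follows. This is a minimal illustration, assuming a simple trailing-window moving average and period-over-period growth; the function names and the sales figures are assumptions made for the example.

```python
def moving_average(values, window):
    """Simple moving average over a fixed-size trailing window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def percentage_growth(values):
    """Growth from each period to the next, in percent.

    Returns None for a step whose starting value is zero.
    """
    growth = []
    for prev, curr in zip(values, values[1:]):
        growth.append(None if prev == 0 else (curr - prev) * 100.0 / prev)
    return growth


monthly_sales = [100.0, 120.0, 90.0, 150.0]  # illustrative figures
print(moving_average(monthly_sales, 2))
print(percentage_growth(monthly_sales))
```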

OLAP Benefits
• Increased productivity of end-users.
• Reduced backlog of applications
development for IT staff.
• Retention of organizational control over the
integrity of corporate data.
• Improved potential revenue and profitability.

From Tables and Spreadsheets
to Data Cubes
• A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
– Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
– Fact table contains measures (such as dollars_sold) and keys to each
of the related dimension tables

Han: Data Cubes 12


Data Cube Terminology
• A data cube supports viewing/modelling of a variable (or a set of
variables) of interest. Measures are used to report the values
of the particular variable with respect to a given set of
dimensions.
• A fact table stores measures as well as keys representing
relationships to various dimensions.
• Dimensions are perspectives with respect to which an
organization wants to keep record.
• A star schema defines a fact table and its associated
dimensions.
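The terminology above can be made concrete with a small sketch. It assumes fact rows stored as dictionaries with one key per dimension and a single measure, `dollars_sold`; the rows and the helper name `cuboid` are illustrative assumptions.

```python
from collections import defaultdict

# Each fact row holds one key per dimension plus a measure (dollars_sold).
# All rows are illustrative.
facts = [
    {"item": "TV", "time": "Q1", "location": "Toronto", "dollars_sold": 400.0},
    {"item": "TV", "time": "Q2", "location": "Toronto", "dollars_sold": 350.0},
    {"item": "PC", "time": "Q1", "location": "Vancouver", "dollars_sold": 900.0},
    {"item": "PC", "time": "Q1", "location": "Toronto", "dollars_sold": 700.0},
]


def cuboid(rows, dimensions, measure="dollars_sold"):
    """Sum the measure for every combination of the chosen dimensions."""
    cells = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cells[key] += row[measure]
    return dict(cells)


print(cuboid(facts, ["item", "time"]))  # measure by item and time
print(cuboid(facts, []))                # no dimensions: the grand total
```

Choosing a different list of dimensions gives a different view of the same measure, which is the sense in which a cube is viewed "with respect to a given set of dimensions".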

Difference between fact data and
dimension data
Sr. No.  Fact Data                      Dimension Data
1        Millions or billions of rows   A few million rows
2        Multiple foreign keys          One primary key
3        Numeric                        Textual description
4        Don’t change                   Frequently modified
Conceptual Modeling
of Data Warehouses
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of
dimension tables
– Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to snowflake
– Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called galaxy
schema or fact constellation

Warehouse Database Schema
• ER design techniques not appropriate
• Design should reflect multidimensional view
– Star Schema
– Snowflake Schema
– Fact Constellation Schema
Example of a Star Schema
[Figure: star schema with a central fact table linked to six dimension tables]
Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Order: Order No, Order Date
Customer: Customer No, Customer Name, Customer Address, City
Salesperson: SalespersonID, SalespersonName, City, Quota
Product: ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
Date: DateKey, Date
City: CityName, State, Country
Star Schema
• A single fact table and a single table for each dimension
• Every fact points to one tuple in each of the dimensions and
has additional attributes
• Does not capture hierarchies directly
• Generated keys are used for performance and maintenance
reasons
• Fact constellation: Multiple fact tables that share many
dimension tables
• Example: Projected expense and the actual expense may share
dimensional tables
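A star schema along these lines can be sketched with an in-memory SQLite database. This is a reduced illustration rather than the schema in the figure: it keeps only two dimensions, and the table names, column names, generated keys and rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one table per dimension, keyed by a generated key.
    CREATE TABLE product (prod_key INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
    CREATE TABLE city (city_key INTEGER PRIMARY KEY, city_name TEXT, state TEXT);
    -- Fact table: one foreign key per dimension plus numeric measures.
    CREATE TABLE sales_fact (
        prod_key INTEGER REFERENCES product(prod_key),
        city_key INTEGER REFERENCES city(city_key),
        quantity INTEGER,
        total_price REAL
    );
    INSERT INTO product VALUES (1, 'TV', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO city VALUES (1, 'Nagpur', 'Maharashtra'), (2, 'Pune', 'Maharashtra');
    INSERT INTO sales_fact VALUES (1, 1, 2, 800.0), (1, 2, 1, 400.0), (2, 1, 3, 300.0);
    """
)


def revenue_by(column):
    """Join the fact table to its dimensions and total one measure."""
    allowed = {"product.category", "city.city_name", "city.state"}
    if column not in allowed:
        raise ValueError("unknown grouping column")
    sql = (
        f"SELECT {column}, SUM(f.total_price) "
        "FROM sales_fact AS f "
        "JOIN product ON product.prod_key = f.prod_key "
        "JOIN city ON city.city_key = f.city_key "
        f"GROUP BY {column} ORDER BY {column}"
    )
    return conn.execute(sql).fetchall()


print(revenue_by("product.category"))
print(revenue_by("city.city_name"))
```

Every fact row points to exactly one row in each dimension, so any dimension attribute can be used for grouping with a single join per dimension.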
Example of a Snowflake Schema
[Figure: snowflake schema with the fact table at the centre and the Product, Date and City dimensions normalized into further tables]
Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Order: Order No, Order Date
Customer: Customer No, Customer Name, Customer Address, City
Salesperson: SalespersonID, SalespersonName, City, Quota
Product: ProductNO, ProdName, ProdDescr, Category, UnitPrice
Category: CategoryName, CategoryDescr
Date: DateKey, Date, Month
Month: Month, Year
Year: Year
City: CityName, State
State: StateName, Country
Snowflake Schema
• Represents dimensional hierarchy directly by
normalizing the dimension tables
• Easy to maintain
• Saves storage, but it is alleged to reduce the
effectiveness of browsing (Kimball)
• Galaxy schema: multiple fact tables with shared
dimension categories
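The effect of normalizing a dimension can be sketched as follows. The example assumes a hypothetical product dimension whose category attributes have been moved into their own table; all table names, keys and rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The category level of the product hierarchy gets its own table.
    CREATE TABLE category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE product (
        prod_key INTEGER PRIMARY KEY,
        prod_name TEXT,
        category_key INTEGER REFERENCES category(category_key)
    );
    CREATE TABLE sales_fact (
        prod_key INTEGER REFERENCES product(prod_key),
        total_price REAL
    );
    INSERT INTO category VALUES (1, 'Electronics'), (2, 'Furniture');
    INSERT INTO product VALUES (1, 'TV', 1), (2, 'PC', 1), (3, 'Desk', 2);
    INSERT INTO sales_fact VALUES (1, 800.0), (2, 1200.0), (3, 300.0);
    """
)


def revenue_by_category():
    """Grouping by category now needs two joins: fact -> product -> category."""
    return conn.execute(
        "SELECT c.category_name, SUM(f.total_price) "
        "FROM sales_fact AS f "
        "JOIN product AS p ON p.prod_key = f.prod_key "
        "JOIN category AS c ON c.category_key = p.category_key "
        "GROUP BY c.category_name ORDER BY c.category_name"
    ).fetchall()


print(revenue_by_category())
```

Each category name is stored once instead of once per product, which is where the storage saving comes from; the price is the extra join on every query that browses by category.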
A Concept Hierarchy: Dimension (location)

all all

region Europe ... North_America

country Germany ... Spain Canada ... Mexico

city Frankfurt ... Vancouver ... Toronto

office L. Chan ... M. Wind
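A concept hierarchy like this one can be represented as child-to-parent links. The sketch below covers only the members named on the slide that can be placed with certainty; the office level is omitted, and the names `PARENT` and `ancestors` are assumptions made for the example.

```python
# Child -> parent links for part of the location hierarchy.
PARENT = {
    "Frankfurt": "Germany",
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Germany": "Europe",
    "Spain": "Europe",
    "Canada": "North_America",
    "Mexico": "North_America",
    "Europe": "all",
    "North_America": "all",
}


def ancestors(member):
    """Return the path from a member up to the root of the hierarchy."""
    path = [member]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


print(ancestors("Toronto"))
```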

Multidimensional Data
• Sales volume as a function of product,
month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry -> Category -> Product
Location: Region -> Country -> City -> Office
Time: Year -> Quarter -> Month / Week -> Day
[Figure: cube with axes Product, Region and Month]
A Sample Data Cube
[Figure: three-dimensional data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A, Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]


Browsing a Data Cube

• Visualization
• OLAP capabilities
• Interactive manipulation
Representation of Multi-dimensional Data
• Example of two-dimensional query.
– ‘What is the total revenue generated by property sales in
each city, in each quarter of 2004?’
• Choice of representation is based on types of
queries end-user may ask.
• Compare representation - three-field relational
table versus two-dimensional matrix.
Multi-dimensional Data as Three-field table versus
Two-dimensional Matrix
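The two representations can be compared in a short sketch. The original figure is not available here, so the cities, quarters and revenue figures below are illustrative assumptions, as is the helper name `to_matrix`.

```python
# Three-field relational representation: (city, quarter, revenue).
# Figures are illustrative, not taken from the original figure.
three_field_rows = [
    ("Glasgow", "Q1", 10),
    ("Glasgow", "Q2", 20),
    ("London", "Q1", 30),
    ("London", "Q2", 40),
]


def to_matrix(rows):
    """Rearrange three-field rows into a city-by-quarter matrix."""
    cities = sorted({city for city, _, _ in rows})
    quarters = sorted({quarter for _, quarter, _ in rows})
    matrix = [[0] * len(quarters) for _ in cities]
    for city, quarter, revenue in rows:
        matrix[cities.index(city)][quarters.index(quarter)] += revenue
    return cities, quarters, matrix


cities, quarters, matrix = to_matrix(three_field_rows)
print(quarters)
for city, row in zip(cities, matrix):
    print(city, row)
```

Both forms hold the same values. The matrix makes it cheap to read across a city or down a quarter, while the table needs a scan or an index for either.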

Representation of Multi-dimensional Data
• Example of three-dimensional query.
– ‘What is the total revenue generated by property sales
for each type of property (Flat or House) in each city, in
each quarter of 2004?’
• Compare representation - four-field
relational table versus three-dimensional
cube.


Multi-dimensional Data as Four-field Table versus
Three-dimensional Cube
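Adding the property type as a third dimension turns the matrix into a cube. The sketch below assumes four-field rows of the form (property type, city, quarter, revenue); the data and the helper name `to_cube` are illustrative, since the original figure is not available.

```python
# Four-field relational representation; figures are illustrative.
four_field_rows = [
    ("Flat", "Glasgow", "Q1", 5),
    ("House", "Glasgow", "Q1", 7),
    ("Flat", "London", "Q2", 11),
]


def to_cube(rows):
    """Turn (property_type, city, quarter, revenue) rows into a 3-D array."""
    # One sorted member list per axis, and a member -> position lookup.
    axes = [sorted({row[i] for row in rows}) for i in range(3)]
    index = [{member: pos for pos, member in enumerate(axis)} for axis in axes]
    cube = [[[0 for _ in axes[2]] for _ in axes[1]] for _ in axes[0]]
    for ptype, city, quarter, revenue in rows:
        cube[index[0][ptype]][index[1][city]][index[2][quarter]] += revenue
    return axes, cube


axes, cube = to_cube(four_field_rows)
print(axes)
print(cube)
```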

Representation of Multi-dimensional Data
• Cube represents data as cells in an array.
• Relational table only represents multi-dimensional
data in two dimensions.


Representation of Multi-dimensional Data
• Use multi-dimensional structures to store
data and relationships between data.
• Multi-dimensional structures are best
visualized as cubes of data, and cubes within
cubes of data. Each side of a cube is a
dimension.
• A cube can be expanded to include other
dimensions.


Representation of Multi-dimensional Data
• Pre-aggregation is valuable, as typical
dimensions are hierarchical in nature.
– (e.g. Time dimension hierarchy - years, quarters,
months, weeks, and days)
• Predefined hierarchy allows logical
pre-aggregation and, conversely, allows for a
logical ‘drill-down’.
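Pre-aggregation along a time hierarchy can be sketched as follows. The example assumes daily sales keyed by date and computes the month, quarter and year totals once, ahead of any query; the data and the function name `pre_aggregate` are illustrative, and the week level is left out for brevity.

```python
from collections import defaultdict
from datetime import date

# Illustrative daily sales figures.
daily_sales = {
    date(2004, 1, 15): 100,
    date(2004, 2, 3): 50,
    date(2004, 4, 20): 70,
    date(2004, 11, 9): 30,
}


def pre_aggregate(daily):
    """Compute month, quarter and year totals in one pass over the days."""
    month, quarter, year = defaultdict(int), defaultdict(int), defaultdict(int)
    for day, amount in daily.items():
        month[(day.year, day.month)] += amount
        quarter[(day.year, (day.month - 1) // 3 + 1)] += amount
        year[day.year] += amount
    return dict(month), dict(quarter), dict(year)


by_month, by_quarter, by_year = pre_aggregate(daily_sales)
print(by_year)
print(by_quarter)
print(by_month)
```

A query at the year level reads one stored total, and a drill-down simply moves to the quarter or month totals below it.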

Categories of OLAP Tools
• OLAP tools are categorized according to the
architecture used to store and process
multi-dimensional data.

• There are four main categories:
– Multi-dimensional OLAP (MOLAP)
– Relational OLAP (ROLAP)
– Hybrid OLAP (HOLAP)
– Desktop OLAP (DOLAP)
Multi-dimensional OLAP (MOLAP)
• Use specialized data structures and multi-
dimensional Database Management Systems
(MDDBMSs) to organize, navigate, and
analyze data.

• Data is typically aggregated and stored
according to predicted usage to enhance
query performance.
Multi-dimensional OLAP (MOLAP)
• Use array technology and efficient storage
techniques that minimize the disk space
requirements through sparse data
management.

• Provides excellent performance when data is
used as designed, and the focus is on data for
a specific decision-support application.
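Sparse data management can be illustrated with a cube that stores only its non-empty cells. This is a simplified sketch under that single assumption; real MOLAP servers use their own, more elaborate storage and compression schemes, and the class name `SparseCube` is hypothetical.

```python
class SparseCube:
    """Stores only non-empty cells, keyed by their coordinates."""

    def __init__(self, shape):
        self.shape = shape
        self.cells = {}

    def set(self, coords, value):
        if len(coords) != len(self.shape) or any(
            not 0 <= c < n for c, n in zip(coords, self.shape)
        ):
            raise IndexError("coordinates outside the cube")
        if value == 0:
            self.cells.pop(coords, None)  # empty cells take no space
        else:
            self.cells[coords] = value

    def get(self, coords):
        return self.cells.get(coords, 0)

    def density(self):
        """Fraction of the cube's cells that actually hold a value."""
        total = 1
        for n in self.shape:
            total *= n
        return len(self.cells) / total


cube = SparseCube((100, 50, 12))  # e.g. products x cities x months
cube.set((3, 7, 0), 250.0)
cube.set((99, 49, 11), 10.0)
print(len(cube.cells), cube.density())
```

A dense array of this shape would reserve 60,000 cells, whereas the sparse form holds two entries.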

Multi-dimensional OLAP (MOLAP)
• Traditionally, require a tight coupling with
the application layer and presentation layer.

• Recent trends segregate the OLAP from the
data structures through the use of published
application programming interfaces (APIs).
Typical Architecture for MOLAP
Tools

MOLAP Tools - Development
Issues
• Underlying data structures are limited in
their ability to support multiple subject areas
and to provide access to detailed data.

• Navigation and analysis of data is limited
because the data is designed according to
previously determined requirements.
MOLAP Tools - Development
Issues
• MOLAP products require a different set of
skills and tools to build and maintain the
database, thus increasing the cost and
complexity of support.

Relational OLAP (ROLAP)
• Fastest-growing style of OLAP technology
due to requirements to analyze ever-
increasing amounts of data and the
realization that users cannot store all the
data they require in MOLAP databases.

Relational OLAP (ROLAP)
• Supports RDBMS products using a metadata
layer - avoids need to create a static multi-
dimensional data structure - facilitates the
creation of multiple multi-dimensional views
of the two-dimensional relation.
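The idea of a metadata layer over a relational table can be sketched as follows. This is an illustrative assumption about how such a layer might work, not a description of any particular ROLAP product: `METADATA`, `build_query`, the `sales` table and its rows are all hypothetical.

```python
import sqlite3

# Hypothetical metadata layer: maps logical dimension and measure names
# to columns of a flat relational table. All names are illustrative.
METADATA = {
    "table": "sales",
    "dimensions": {"city": "city", "quarter": "quarter", "type": "prop_type"},
    "measures": {"revenue": "SUM(revenue)"},
}


def build_query(dimensions, measure):
    """Generate the SQL for one multi-dimensional view of the relation."""
    cols = [METADATA["dimensions"][d] for d in dimensions]
    expr = METADATA["measures"][measure]
    select = ", ".join(cols + [f"{expr} AS {measure}"])
    sql = f"SELECT {select} FROM {METADATA['table']}"
    if cols:
        group = ", ".join(cols)
        sql += f" GROUP BY {group} ORDER BY {group}"
    return sql


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (city TEXT, quarter TEXT, prop_type TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Glasgow", "Q1", "Flat", 10.0),
        ("Glasgow", "Q1", "House", 20.0),
        ("London", "Q1", "Flat", 30.0),
        ("London", "Q2", "Flat", 40.0),
    ],
)

by_city = conn.execute(build_query(["city"], "revenue")).fetchall()
by_city_quarter = conn.execute(build_query(["city", "quarter"], "revenue")).fetchall()
print(by_city)
print(by_city_quarter)
```

No multi-dimensional structure is stored. Each view is produced on demand by a different generated query over the same two-dimensional relation.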

Relational OLAP (ROLAP)
• To improve performance, some products use
SQL engines to support the complexity of
multi-dimensional analysis, while others
recommend, or require, the use of highly
denormalized database designs such as the
star schema.

Typical Architecture for ROLAP
Tools

ROLAP Tools - Development Issues
• Performance problems associated with the
processing of complex queries that require
multiple passes through the relational data.

• Middleware to facilitate the development of
multi-dimensional applications. (Software
that converts the two-dimensional relation
into a multi-dimensional structure).
ROLAP Tools - Development Issues
• Development of an option to create
persistent, multi-dimensional structures with
facilities to assist in the administration of
these structures.

Hybrid OLAP (HOLAP)
• Provide limited analysis capability, either
directly against RDBMS products, or by using
an intermediate MOLAP server.

• Deliver selected data directly from the DBMS
or via a MOLAP server to the desktop (or local
server) in the form of a datacube, where it is
stored, analyzed, and maintained locally.
Hybrid OLAP (HOLAP)
• Promoted as being relatively simple to install
and administer with reduced cost and
maintenance.

Typical Architecture for HOLAP
Tools

HOLAP Tools - Development Issues
• Architecture results in significant data redundancy
and may cause problems for networks that
support many users.
• Ability of each user to build a custom datacube
may cause a lack of data consistency among users.
• Only a limited amount of data can be efficiently
maintained.
Desktop OLAP (DOLAP)
• Store the OLAP data in client-based files and
support multi-dimensional processing using a
client multi-dimensional engine.

• Requires that relatively small extracts of data
are held on client machines. They may be
distributed in advance, or created on
demand (possibly through the Web).
Desktop OLAP (DOLAP)
• As with multi-dimensional databases on the
server, OLAP data may be held on disk or in
RAM; however, some DOLAP products allow
only read access.
• Most vendors of DOLAP exploit the power of
the desktop PC to perform some, if not most,
multi-dimensional calculations.


Desktop OLAP (DOLAP)
• The administration of a DOLAP database is
typically performed by a central server or
processing routine that prepares data cubes
or sets of data for each user.

• Once the basic processing is done, each user
can then access their portion of the data.


Typical Architecture for DOLAP
Tools

DOLAP Tools - Development
Issues
• Provision of appropriate security controls to
support all parts of the DOLAP environment.
Since the data is physically extracted from the
system, security is generally implemented by
limiting the information compiled into each
cube.
• Once each cube is uploaded to the user's
desktop, all additional meta data becomes the
property of the local user.
DOLAP Tools - Development
Issues
• Reduction in the effort involved in deploying
and maintaining the DOLAP tools. Some DOLAP
vendors now provide a range of alternative
ways of deploying OLAP data such as through
e-mail, the Web or using traditional
client/server architecture.

• Current trends are towards thin client
machines.
OLAP Extensions to SQL
• Advantages of SQL include that it is easy to
learn, non-procedural, free-format, DBMS-
independent, and that it is a recognized
international standard.
• However, a major limitation of SQL is the inability
to answer routinely asked business queries such
as computing the percentage change in values
between this month and a year ago or to
compute moving averages, cumulative sums,
and other statistical functions.

OLAP Extensions to SQL
• In response, ANSI adopted a set of OLAP
functions as an extension to SQL to enable
these calculations as well as many others that
used to be impractical or even impossible
within SQL.
• IBM and Oracle jointly proposed these
extensions early in 1999 and they form
part of the SQL standard as of SQL:2008.
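Window functions are one part of these extensions. The sketch below runs a cumulative sum and a three-month moving average through Python's built-in `sqlite3` module, assuming the bundled SQLite is version 3.25 or later (the first release with window function support). The `monthly_sales` table and its rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2004-01", 100.0), ("2004-02", 120.0), ("2004-03", 90.0), ("2004-04", 150.0)],
)

# Cumulative sum and 3-month moving average expressed directly in SQL.
rows = conn.execute(
    """
    SELECT month,
           SUM(amount) OVER (ORDER BY month) AS cumulative_sum,
           AVG(amount) OVER (ORDER BY month
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3
    FROM monthly_sales
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)
```

Without the `OVER` clause, each of these results would need a self-join or procedural code outside the query.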
Comparing the use of MOLAP,
HOLAP and ROLAP 
The type of storage medium affects cube processing
time, cube storage and cube browsing speed. Some of
the factors that affect MOLAP storage are:
• Cube browsing is fastest when using MOLAP. This
is so even in cases where no aggregations have been
done. The data is stored in a compressed
multidimensional format and can be accessed more quickly
than in the relational database. Browsing is very slow
in ROLAP, and about the same in HOLAP. Processing time is
slower in ROLAP, especially at higher levels of
aggregation.
Cont…
• MOLAP storage takes up more space than
HOLAP as data is copied and at very low levels
of aggregation it takes up more room than
ROLAP.
• ROLAP takes almost no storage space as data
is not duplicated. However, ROLAP
aggregations take up more space than MOLAP
or HOLAP aggregations.
Cont…
• All data is stored in the cube in MOLAP and
data can be viewed even when the original
data source is not available. In ROLAP data
cannot be viewed unless connected to the
data source.  
Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or detailed data,
or introducing new dimensions
• Slice and dice:
– project and select
• Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
– drill across: involving (across) more than one fact table
– …
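The operations listed above can be sketched on a small cube held as a dictionary of cells. This is a minimal illustration: the cell values, the city-to-country mapping and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Cells of a small sales cube keyed by (city, quarter, item); illustrative data.
cube = {
    ("Toronto", "Q1", "TV"): 400,
    ("Toronto", "Q2", "TV"): 350,
    ("Toronto", "Q1", "PC"): 700,
    ("Vancouver", "Q1", "PC"): 900,
    ("Frankfurt", "Q2", "TV"): 250,
}
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Frankfurt": "Germany"}


def roll_up_city_to_country(cells):
    """Roll up: climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), value in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += value
    return dict(out)


def drop_dimension(cells, axis):
    """Roll up by dimension reduction: aggregate one axis away."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


def slice_(cells, axis, member):
    """Slice: fix one dimension to a single member."""
    return {k: v for k, v in cells.items() if k[axis] == member}


def dice(cells, allowed):
    """Dice: keep cells whose members are in the allowed set on each listed axis."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[axis] in members for axis, members in allowed.items())
    }


def pivot(cells, order):
    """Pivot: reorder the axes without changing any value."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}


print(roll_up_city_to_country(cube))
print(slice_(cube, 1, "Q1"))
print(dice(cube, {0: {"Toronto", "Vancouver"}, 2: {"PC"}}))
```

Drill-down is the reverse of roll-up: it returns from the country-level cells to the city-level cells, which is only possible because the detailed cube is still kept.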
Summary
• Data warehouse
– A subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making
process
• A multi-dimensional model of a data warehouse
– Star schema, snowflake schema, fact constellations
– A data cube allows measures to be viewed with respect to a given set
of dimensions
• OLAP operations: drilling, rolling, slicing, dicing and
pivoting

Question Bank
Q1. Differentiate between OLTP and OLAP. [6M][W17], [6M][S18], [6M][S17], [6M][S16], [6M][W16],
[6M][S19]
Q2. Describe the STAR and SNOWFLAKE schemas with examples. [8M][W17], [8M][W16], [8M][W16],
[8M][S19]
Q3. What is OLAP? Discuss the basic operations of OLAP with examples. [8M][W17], [8M][S17], [8M][S19]
Q4. Write short notes on the following:
a. ROLAP, b. MOLAP, c. HOLAP [6M][W17], [6M][S17], [6M][S19]
Q5. Explain the types of OLAP servers. [8M][S18]
Q6. Write short notes on: [14M][S18]
a. Fact table, b. Dimension table, c. Fact constellation
Q7. Describe the star schema. State its advantages and disadvantages. [7M][S16]
Q8. What are the different types of OLAP models? [6M][S16]
Q9. Define data cube and explain OLAP operations on data cubes. [7M][S16], [6M][W16]
Q10. What is the need for multidimensional analysis? [4M][W16]
Q11. Write any six characteristics of OLAP. [4M][W16]
Q12. Describe the snowflake schema with a neat sketch.
Q13. Differentiate between OLTP and DW.
Q14. Write short notes on:
a. Drill up, b. Drill down, c. Slice, d. Dice, e. Pivot
Q15. Show all important aspects of the star schema and snowflake schema.
Q16. Explain fact constellation in brief. What is the purpose of the multidimensional data model?
