DM Chapter 2

Introduction to Data Mining & Machine Learning (INSY3051)
Chapter 2: Data Warehousing and OLAP Technology for Data Mining
Compiled by: Abinet T. 04/11/2024
Contents
OLAP technology, attribute-oriented induction
What is a data warehouse?
A multidimensional data model
Data cube computation
Data warehouse architecture
Data warehouse implementation

What Is a Data Warehouse?
▶ A data warehouse has been defined in many different ways, but no single definition is rigorous.
▶ It generalizes and consolidates data in multidimensional space.
▶ It provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
▶ It is a database that is maintained separately from an organization’s operational databases.
A short and more comprehensive definition is given by Inmon:
▶ “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process”.
▶ Subject-oriented: a data warehouse is organized around major subjects, such as customer, supplier, product, and sales. It focuses on the modeling and analysis of data for decision makers.
▶ Integrated: a data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records.
▶ Time-variant: data are stored to provide information from a historical perspective (e.g., the past 5–10 years).
▶ Nonvolatile: a data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Because of this separation, it usually needs only two data operations: initial loading and access.
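The integrated, time-variant, and nonvolatile properties can be illustrated with a minimal sketch. Everything in it is hypothetical: the two source layouts (`crm_rows`, `pos_rows`), the field names, and the `Warehouse` class are assumptions made for illustration, not part of any real system.

```python
from datetime import date

# Hypothetical rows from two operational sources with inconsistent conventions.
crm_rows = [{"cust": "alice", "amount_usd": 120.0, "day": "2024-01-05"}]
pos_rows = [{"customer_name": "BOB", "cents": 4550, "date": (2024, 1, 6)}]

def from_crm(row):
    """Convert a CRM row to the unified warehouse layout."""
    return {
        "customer": row["cust"].title(),
        "sales_usd": row["amount_usd"],
        "date": date.fromisoformat(row["day"]),
    }

def from_pos(row):
    """Convert a point-of-sale row (cents, tuple dates) to the same layout."""
    return {
        "customer": row["customer_name"].title(),
        "sales_usd": row["cents"] / 100,
        "date": date(*row["date"]),
    }

class Warehouse:
    """Nonvolatile store: rows can be loaded and read, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        self._rows.extend(rows)

    def query(self, year):
        # Time-variant: every row carries a date, so history can be queried.
        return [r for r in self._rows if r["date"].year == year]

warehouse = Warehouse()
warehouse.load(from_crm(r) for r in crm_rows)   # integration step, source 1
warehouse.load(from_pos(r) for r in pos_rows)   # integration step, source 2
rows_2024 = warehouse.query(2024)
```

The conversion functions stand in for the cleaning and integration step; the class deliberately exposes only `load` and `query` to mirror the two operations a warehouse needs.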

Differences between Operational Database Systems and Data Warehouses
▶ The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems.
▶ They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
▶ On-line analytical processing (OLAP) systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making.
▶ Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users.

Comparison between OLTP and OLAP systems

Feature         OLTP                                OLAP
Characteristic  operational processing              informational processing
Orientation     transaction                         analysis
User            clerk, DBA, database professional   knowledge worker (e.g., manager, executive, analyst)
DB design       ER based, application-oriented      star/snowflake, subject-oriented
Data            current; guaranteed up-to-date      historical; accuracy maintained over time
View            detailed, flat relational           summarized, multidimensional
Access          read/write                          mostly read
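The contrast in access patterns can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns, and the values are hypothetical, and a single in-memory database stands in for both kinds of system purely for illustration; in practice the warehouse is kept separate from the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, quarter TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (item, quarter, amount) VALUES (?, ?, ?)",
    [("TV", "Q1", 400.0), ("TV", "Q2", 450.0),
     ("phone", "Q1", 300.0), ("phone", "Q2", 350.0)],
)

# OLTP-style access: a short read/write transaction on one current record.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 50 WHERE id = ?", (1,))
updated = conn.execute("SELECT amount FROM sales WHERE id = 1").fetchone()[0]

# OLAP-style access: a read-only query that summarizes many records.
totals = dict(conn.execute("SELECT item, SUM(amount) FROM sales GROUP BY item"))
```

The first statement touches a single detailed row, while the second scans the whole table and returns a summary, which matches the "View" and "Access" rows of the comparison.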

A Multidimensional Data Model
▶ Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube.
▶ A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
– Dimensions are the perspectives or entities with respect to which an organization wants to keep records (e.g., a store’s sales with respect to the dimensions time, item, branch, and location).
– Facts are numerical measures (e.g., sales amount in dollars, number of units sold, amount budgeted).
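One simple way to picture dimensions and facts is a dictionary whose keys are dimension values and whose values are the measure. The sketch below is an assumption-laden illustration: the fact records and the helper `view_2d` are hypothetical, and the numbers are toy values.

```python
from collections import defaultdict

# Hypothetical fact records: (time, item, location, dollars_sold).
facts = [
    ("Q1", "TV", "Vancouver", 10),
    ("Q1", "phone", "Vancouver", 20),
    ("Q2", "TV", "Toronto", 30),
    ("Q2", "phone", "Toronto", 40),
    ("Q1", "TV", "Vancouver", 5),   # a second sale in the same cell
]

# The cube maps a coordinate (one value per dimension) to the measure.
cube = defaultdict(int)
for time, item, location, dollars in facts:
    cube[(time, item, location)] += dollars

def view_2d(cube, location):
    """Return a 2-D (time, item) view of the cube for one fixed location."""
    return {(t, i): v for (t, i, loc), v in cube.items() if loc == location}

vancouver = view_2d(cube, "Vancouver")
```

Fixing the location and keeping time and item yields a 2-D table of the kind usually shown for a single city.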

[Table: a 2-D view of sales data according to the dimensions time and item, where the sales are from branches located in the city of Vancouver. The measure displayed is dollars sold (in thousands).]

Although we usually think of cubes as 3-D
geometric structures, in data warehousing the data
cube is n-dimensional.

[Table: a 3-D view of sales data for AllElectronics, according to the dimensions time, item, and location. The measure displayed is dollars sold (in thousands).]
[Figure: a 3-D data cube representation of the data in the table above, according to the dimensions time, item, and location. The measure displayed is dollars sold (in thousands).]

Thus:
▶ In the data warehousing literature, the n-dimensional (n-D) cube that holds the lowest level of summarization is called the base cuboid.
– The base cuboid keeps every dimension, so each cell is identified by a value for all n dimensions.
▶ The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
– It shows the most summarized information, aggregated over all dimensions (e.g., total sales).
▶ The lattice of cuboids forms a data cube.

[Figure: Cube: A Lattice of Cuboids]
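The lattice of cuboids can be enumerated directly: each cuboid keeps some subset of the dimensions and aggregates over the rest. The sketch below assumes no concept hierarchies inside the dimensions, in which case n dimensions give 2^n cuboids. The fact records and the helper `cuboid` are hypothetical.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location")

# Hypothetical fact records: one value per dimension, then the measure.
FACTS = [
    ("Q1", "TV", "Vancouver", 10),
    ("Q1", "phone", "Vancouver", 20),
    ("Q2", "TV", "Toronto", 30),
    ("Q2", "phone", "Toronto", 40),
]

def cuboid(facts, keep):
    """Aggregate the measure over every dimension that is not in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    cells = {}
    for *coords, measure in facts:
        key = tuple(coords[p] for p in positions)
        cells[key] = cells.get(key, 0) + measure
    return cells

# One cuboid per subset of dimensions, from 3-D down to 0-D.
lattice = {
    keep: cuboid(FACTS, keep)
    for k in range(len(DIMENSIONS), -1, -1)
    for keep in combinations(DIMENSIONS, k)
}

base_cuboid = lattice[DIMENSIONS]   # lowest level of summarization
apex_cuboid = lattice[()]           # highest level: a single grand total
```

With three dimensions the lattice holds eight cuboids: one 3-D, three 2-D, three 1-D, and the 0-D apex.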

Conceptual Modeling of Data Warehouses
▶ Modeling data warehouses: dimensions & measures
– Star schema: a fact table in the middle connected to a set of dimension tables.
– Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
– Fact constellation: multiple fact tables share dimension tables. It can be viewed as a collection of stars and is therefore also called a galaxy schema.

[Figure: Example of Star Schema]
[Figure: Example of Snowflake Schema]
[Figure: Example of Fact Constellation]
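A star schema can be sketched as one fact table with foreign keys into its dimension tables. The table names, columns, and rows below are hypothetical and chosen only for illustration; Python's built-in `sqlite3` module is used so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per member, with descriptive attributes.
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Fact table in the middle: foreign keys plus numerical measures.
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold INTEGER
);

INSERT INTO dim_time VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO dim_item VALUES (1, 'TV', 'home entertainment'), (2, 'phone', 'phone');
INSERT INTO dim_location VALUES (1, 'Vancouver', 'Canada'), (2, 'Toronto', 'Canada');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 100.0, 2), (1, 2, 1, 50.0, 1),
    (2, 1, 2, 80.0, 1),  (2, 2, 2, 60.0, 2);
""")

# A typical query joins the fact table to a dimension and aggregates a measure.
by_city = conn.execute("""
    SELECT l.city, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_location AS l ON f.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```

A snowflake variant of this sketch would normalize `dim_location` further, for example by moving `country` into its own table referenced by a key. A fact constellation would add a second fact table that reuses the same dimension tables.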

Typical OLAP Operations
▶ In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies.
▶ This organization gives users the flexibility to view data and conduct BA/DM investigations from different perspectives.
▶ Several OLAP data cube operations exist to materialize these views. The basic ones are:
– Roll-up (drill-up) and drill-down (roll-down)
– Slice and dice

OLAP Operations: Roll-up and Drill-down
▶ Attribute values often have a hierarchical structure.
– Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state (province, etc.), and city.
– Products can be divided into various categories, such as clothing, electronics, and furniture.
▶ Note that these categories often nest and form a tree or lattice.
– A year contains months, which contain days.
– A country contains states, which contain cities.
▶ This hierarchical structure gives rise to the roll-up and drill-down operations.
▶ For sales data, we can aggregate (roll up) the sales across all the dates in a month.
▶ Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
▶ Likewise, we can drill down or roll up on the location or product ID attributes.
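Roll-up and drill-down on the time dimension can be sketched as follows. The daily sales figures and the function names `roll_up` and `drill_down` are hypothetical. Note that drilling down is only possible because the detailed daily data are still stored; monthly totals alone cannot be split back into days.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales totals (the most detailed level kept).
daily_sales = {
    date(2024, 1, 5): 100,
    date(2024, 1, 20): 150,
    date(2024, 2, 3): 200,
    date(2024, 2, 14): 50,
}

def roll_up(daily, level):
    """Aggregate daily totals up the time hierarchy to 'month' or 'year'."""
    key_of = {
        "month": lambda d: (d.year, d.month),
        "year": lambda d: (d.year,),
    }[level]
    totals = defaultdict(int)
    for day, amount in daily.items():
        totals[key_of(day)] += amount
    return dict(totals)

def drill_down(daily, year, month):
    """Split one monthly total back into its daily totals."""
    return {d: v for d, v in daily.items() if (d.year, d.month) == (year, month)}

monthly = roll_up(daily_sales, "month")
yearly = roll_up(daily_sales, "year")
february_days = drill_down(daily_sales, 2024, 2)
```

The same pattern applies to the location hierarchy (city, state, country) by swapping the key function.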

OLAP Operations: Slicing and Dicing
▶ Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.
▶ Dicing involves selecting a subset of cells by specifying a range of attribute values.
– This is equivalent to defining a subarray from the complete array.
▶ In practice, both operations can also be accompanied by aggregation over some dimensions.
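Following the definitions above, slicing and dicing can be sketched as two filters over a cube stored as a dictionary. The cube contents and the function names `slice_cube` and `dice_cube` are hypothetical; a "range" of values is represented here as a set of allowed values per dimension.

```python
DIMS = ("time", "item", "location")

# Hypothetical cube: (time, item, location) -> dollars sold.
cube = {
    ("Q1", "TV", "Vancouver"): 10,
    ("Q1", "phone", "Vancouver"): 20,
    ("Q2", "TV", "Vancouver"): 30,
    ("Q2", "phone", "Toronto"): 40,
    ("Q3", "TV", "Toronto"): 50,
}

def slice_cube(cube, **fixed):
    """Keep cells where each named dimension equals one specific value."""
    return {
        key: value for key, value in cube.items()
        if all(key[DIMS.index(dim)] == wanted for dim, wanted in fixed.items())
    }

def dice_cube(cube, **allowed):
    """Keep cells where each named dimension falls within a set of values."""
    return {
        key: value for key, value in cube.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }

vancouver_slice = slice_cube(cube, location="Vancouver")
first_half_tv = dice_cube(cube, time={"Q1", "Q2"}, item={"TV"})

# Either result can then be aggregated over the remaining dimensions.
vancouver_total = sum(vancouver_slice.values())
```

The dice returns a subarray of the full cube, and the final line shows the optional aggregation step mentioned in the last bullet.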

Design of a Data Warehouse: A Business Analysis Framework
▶ The basic steps in the design process of a data warehouse mainly involve business analysis.
▶ This means answering the question “What can business analysts gain from having a data warehouse?”
– It may provide a competitive advantage by presenting relevant information.
– It may enhance business productivity, because it makes it possible to quickly and efficiently gather information that accurately describes the organization.
– It may facilitate customer relationship management by providing a consistent view of customers and items across all lines of business, all departments, and all markets.
▶ Four views should be considered regarding the design of a data warehouse within a business analysis framework:
– Top-down view: allows selection of the relevant information (subjects) necessary for the data warehouse.
– Data source view: exposes the information being captured, stored, and managed by operational systems.
– Data warehouse view: looks at the warehouse from the perspective of fact tables and dimension tables.
– Business query view: looks at the data in the warehouse from the viewpoint of the end user.

Data Warehouse Design Process
▶ A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.
▶ Top-down: starts with overall design and planning.
– Requires a large investment and commitment; appropriate when the technology is mature and well known.
▶ Bottom-up: starts with experiments and prototypes.
– Appropriate in the early stage of business modeling and technology development. It enables the business to move forward at considerably less expense and to evaluate the benefits of the technology before making a significant commitment.
▶ From a software engineering point of view, the process can follow:
– Waterfall
– Spiral/Agile

Architectural Representation
▶ Data warehouses often adopt a three-tier architecture.
▶ Warehouse database server (the bottom tier)
– Almost always a relational DBMS, rarely flat files.
– Back-end tools and utilities feed data into the bottom tier from operational databases and other external sources.
– These tools and utilities perform data extraction, cleaning, and transformation, as well as load and refresh functions to update the warehouse.
▶ OLAP server (the middle tier)
– Implemented as either relational OLAP (ROLAP) or multidimensional OLAP (MOLAP).
– ROLAP: an extended relational DBMS that maps operations on multidimensional data to standard relational operators.
– MOLAP: a special-purpose server that directly implements multidimensional data and operations.
▶ Clients (the top tier)
– Query and reporting tools, analysis tools, and data mining tools.
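The ROLAP idea, mapping a multidimensional operation onto standard relational operators, can be sketched as a function that turns "aggregate over every dimension except these" into a SQL `GROUP BY`. This is an illustration only: the `sales` table, its columns, and the function `rolap_aggregate` are hypothetical, and a real ROLAP server does far more (optimization, materialized summaries, and so on).

```python
import sqlite3

# Column names a caller may group by; checked before building any SQL text.
ALLOWED_DIMENSIONS = {"quarter", "item", "city"}

def rolap_aggregate(conn, group_by):
    """Sum the measure, keeping only the dimensions listed in `group_by`."""
    unknown = set(group_by) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    if not group_by:
        # No dimensions kept: the apex cuboid, i.e. one grand total.
        sql = "SELECT SUM(dollars_sold) FROM sales"
    else:
        cols = ", ".join(group_by)
        sql = (f"SELECT {cols}, SUM(dollars_sold) FROM sales "
               f"GROUP BY {cols} ORDER BY {cols}")
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, item TEXT, city TEXT, dollars_sold REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("Q1", "TV", "Vancouver", 100.0), ("Q1", "phone", "Toronto", 50.0),
     ("Q2", "TV", "Vancouver", 80.0), ("Q2", "phone", "Toronto", 60.0)],
)

by_city = rolap_aggregate(conn, ["city"])
grand_total = rolap_aggregate(conn, [])
```

A MOLAP server would instead store the cells in an array-like structure and answer the same request without generating SQL.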

[Figure: The Complete Data Warehouse System]

Three Data Warehouse Models: Implementation Perspective
▶ From the implementation point of view, there are three data warehouse models.
▶ Enterprise warehouse: collects all of the information about subjects that span the entire organization (customers, products, sales, assets, personnel).
– Requires extensive business modeling and may take years to design and build.
▶ Data mart: a subset of corporate-wide data that is of value to a specific group of users.
– Its scope is confined to specific, selected subjects. For example, a marketing data mart may confine its subjects to customer, product, and sales.
▶ Virtual warehouse: a set of views over operational databases.
– Only some of the possible summary views may be materialized.
– Easy to build, but requires excess capacity on operational database servers.
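A virtual warehouse can be sketched as a summary view defined directly over an operational table. The `orders` table, the `sales_by_customer` view, and the data are hypothetical. Because the view is not materialized, every query against it is recomputed on the operational database, which is why this model needs spare capacity there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical operational table.
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT,
    product TEXT,
    amount REAL
);
INSERT INTO orders (customer, product, amount) VALUES
    ('Alice', 'TV', 400.0), ('Bob', 'phone', 300.0), ('Alice', 'phone', 250.0);

-- The "warehouse" is only a view: no data is copied or stored separately.
CREATE VIEW sales_by_customer AS
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer;
""")

query = "SELECT customer, total FROM sales_by_customer ORDER BY customer"
before = conn.execute(query).fetchall()

# A new operational transaction is visible through the view immediately.
conn.execute("INSERT INTO orders (customer, product, amount) VALUES ('Bob', 'TV', 100.0)")
after = conn.execute(query).fetchall()
```

This also shows the trade-off against the nonvolatile, physically separate warehouse: the view always reflects current operational data rather than a stable historical copy.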

▶ Describe three possible conceptual data models for a data warehouse.
▶ Explain slicing and roll-up as OLAP operations.
▶ Enumerate at least five differences between OLAP and OLTP.
▶ Describe how a dimensional model (DM) differs from an Entity–Relationship (ER) model.
▶ Present a diagrammatic representation of a typical star schema.

Thank you