
DATA MINING AND WAREHOUSING- 18UCAE51

UNIT I DATA WAREHOUSING


Introduction
⚫ The term "data warehouse" was coined by Bill Inmon in 1990.
⚫ A data warehouse is a collection of data marts representing historical data from different
operations in the company. This data is stored in a structure optimized for querying and
data analysis.
Definition
⚫ “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision making process”.
⚫ The four keywords, subject-oriented, integrated, time-variant, and nonvolatile,
distinguish data warehouses from other data repository systems, such as relational
database systems, transaction processing systems, and file systems.

Data Warehouse Architecture
⚫ A data warehouse (DW, DWH), or enterprise data warehouse (EDW), is a database used for
reporting and data analysis.
⚫ Integrating data from one or more disparate sources creates a central repository of
data, a data warehouse (DW).
⚫ Data warehouses store current and historical data and are used for creating trending
reports for senior management, such as annual and quarterly comparisons.
⚫ The data stored in the warehouse is uploaded from the operational systems.

Three-Tier Architecture

⚫ The data warehouse is derived in two phases:

⚫ Reconciliation

⚫ Derivation

Tier 1: Data warehouse server

Tier 2: OLAP engine

Tier 3: Client

CS Department MTNC

i) ETL Process

⚫ The typical extract transform load (ETL)-based data warehouse uses staging,
data integration, and access layers to house its key functions. The staging layer
or staging database stores raw data extracted from each of the disparate source
data systems.

⚫ ETL Tools

⚫ Extraction of Data

⚫ Transportation of Data

⚫ Loading of Data
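
The flow from source systems through a staging layer into the warehouse can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real ETL tool: the source records, the `staging` list and the function names `extract`, `transform` and `load` are all hypothetical.

```python
# Hypothetical raw records from two disparate source systems (illustrative data).
SOURCE_A = [{"cust": " alice ", "amount": "120.50"}]
SOURCE_B = [{"cust": "BOB", "amount": "80"}]


def extract(*sources):
    """Copy raw records from every source into a staging list, unchanged."""
    staging = []
    for source in sources:
        staging.extend(dict(record) for record in source)
    return staging


def transform(staging):
    """Integrate staged records into one consistent warehouse format."""
    return [
        {"customer": r["cust"].strip().title(), "amount": float(r["amount"])}
        for r in staging
    ]


def load(rows, warehouse):
    """Append the transformed rows to the warehouse table (a plain list here)."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(extract(SOURCE_A, SOURCE_B)), [])
```

Keeping the staging copy untouched means a failed transformation can be rerun without going back to the source systems.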

ii) Meta Data

⚫ Metadata is data about data.

⚫ Metadata is a road map to the data warehouse.


⚫ Metadata in a data warehouse defines the warehouse objects.

⚫ The metadata acts as a directory. This directory helps the decision support system
to locate the contents of the data warehouse.
⚫ Additional metadata are created and captured for timestamping any extracted data, the
source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.

iii) DW Server

The OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
functions on multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional information and operations.
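
To make the ROLAP idea concrete, the sketch below expresses a multidimensional request as an ordinary relational query. It uses Python's built-in `sqlite3` module as a stand-in relational DBMS; the `sales` table, its columns and its rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, quarter TEXT, dollars_sold REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Toronto", "Q1", 100.0),
        ("Toronto", "Q2", 150.0),
        ("Vancouver", "Q1", 200.0),
    ],
)

# Multidimensional request: "total dollars_sold per city" (time aggregated away),
# expressed as a standard relational GROUP BY.
totals_by_city = dict(
    conn.execute(
        "SELECT city, SUM(dollars_sold) FROM sales GROUP BY city"
    ).fetchall()
)
conn.close()
```

A MOLAP server would instead keep such totals in a multidimensional array and look them up directly.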

iv) Operational Data Store

The ODS is subject-oriented. It is organized around the significant information subjects
of an enterprise. In a university, the subjects may be students, lecturers and courses,
while in a company the subjects might be users, salespersons and products.

The ODS is integrated. That is, it is a group of subject-oriented records from a variety
of systems that provides an enterprise-wide view of the information.

The ODS is current-valued. That is, an ODS is up-to-date and follows the current status
of the data. An ODS does not contain historical information. Since the OLTP system
data is changing all the time, data from the underlying sources refreshes the ODS as
regularly and frequently as possible.
v) Data Mart

A data mart is focused on a single functional area of an organization and contains
a subset of the data stored in a data warehouse. A data mart is a condensed version
of the data warehouse and is designed for use by a specific department, unit or set of
users in an organization, e.g., marketing, sales, HR or finance.


A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data mart
may confine its subjects to customer, item, and sales. The data contained in data marts tends to be
summarized.
Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more likely to be
measured in weeks rather than months or years. However, it may involve complex integration in
the long run if its design and planning were not enterprise-wide.
vi) Data Mining
Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classifications and predictions, and finally presenting
the mining results using visualization tools.

vii) On-Line Analytical Processing

On-Line Analytical Processing (OLAP) is a category of software technology that
enables analysts, managers, and executives to gain insight into information through fast,
consistent, interactive access to a wide variety of possible views of data that has been
transformed from raw information to reflect the real dimensionality of the enterprise as
understood by the clients. OLAP consists of three basic analytical operations:
⚫ Consolidation (Roll-Up)
⚫ Drill-Down
⚫ Slicing and Dicing

viii) Virtual Data Warehouse

A virtual warehouse is a set of views over operational databases. For efficient
query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on the operational
database servers.
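
A view over an operational table can be sketched with Python's built-in `sqlite3` module. The `orders` table, the `qty_by_product` view and the sample rows are hypothetical; the point is only that the summary is computed from the operational data at query time rather than copied into a separate store.

```python
import sqlite3

# Stand-in for an operational (OLTP) database; schema and rows are illustrative.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (product TEXT, qty INTEGER)")
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?)", [("pen", 3), ("pen", 2), ("ink", 1)]
)

# The "virtual warehouse": a summary view defined over the operational table.
# Nothing is copied; the query runs against the operational data each time.
oltp.execute(
    "CREATE VIEW qty_by_product AS "
    "SELECT product, SUM(qty) AS total_qty FROM orders GROUP BY product"
)

before = dict(oltp.execute("SELECT product, total_qty FROM qty_by_product").fetchall())

# A new operational row is visible through the view immediately.
oltp.execute("INSERT INTO orders VALUES ('ink', 4)")
after = dict(oltp.execute("SELECT product, total_qty FROM qty_by_product").fetchall())
oltp.close()
```

Because every query on the view is executed by the operational server, the analytical load lands on that server, which is why spare capacity is needed there.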

Data Warehouse Back-End Process


Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format


Load: sort, summarize, consolidate, compute views, check integrity, and build indices and
partitions
Refresh: propagate the updates from the data sources to the warehouse

Dimensional Modelling
• A data warehouse is based on a multidimensional data model, which
views data in the form of a data cube.

• A data cube, such as sales, allows data to be modeled and viewed
in multiple dimensions.

– Dimension tables, such as item (item_name, brand, type) or
time (day, week, month, quarter, year)

– Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables

• In data warehousing literature, an n-D base cube is called a base
cuboid. The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
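
The lattice of cuboids can be enumerated directly: each cuboid corresponds to one subset of the dimensions. The sketch below assumes three illustrative dimension names and ignores hierarchy levels inside a dimension, so an n-dimensional cube has 2^n cuboids.

```python
from itertools import combinations

dimensions = ("time", "item", "location")  # illustrative dimension names

# One cuboid per subset of dimensions, grouped by how many dimensions it keeps.
lattice = {
    k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)
}

apex_cuboid = lattice[0][0]                # (): 0-D, everything summarized
base_cuboid = lattice[len(dimensions)][0]  # all n dimensions kept
total_cuboids = sum(len(level) for level in lattice.values())
```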
• Data Warehouse models
Multidimensional modeling is a technique for structuring data around
business concepts. ER models describe “entities” and “relationships”.
Multidimensional models describe “measures” and “dimensions”. A multidimensional
model describes the following:
⚫ i) Facts
Facts represent the subjects on which a data warehouse is built, which can be
examined and analysed to derive business intelligence.
⚫ ii) Dimensions
Dimensions are organized into hierarchies.
E.g., Time dimension: days → weeks → quarters
E.g., Product dimension: product → product line → brand
⚫ iii) Data Cubes

Data is grouped or combined in multidimensional matrices called data cubes. The
data cube method has a few alternative names or variants, such as "multidimensional
databases," "materialized views," and "OLAP (On-Line Analytical Processing)."

⚫ iv) Dimension Hierarchy

Categorisation of Hierarchies
A simple hierarchy has a tree structure generated for the instances with one-to-many
parent-child relationships. A simple hierarchy is symmetric if there exists a single path from
the bottom-level members to the top and all levels are mandatory. It is fully summarizable, and
the aggregation of measures along the levels is straightforward. A simple hierarchy is
asymmetric when not all levels are mandatory. There may be paths not covering all levels, or
there may be parent levels without children.
Example: A Concept Hierarchy: Dimension (location)

1. 5 Aggregate Function
• Data cubes facilitate the answering of data mining queries as they allow
the computation of aggregate data at multiple levels of granularity
• Aggregate functions return a single result row based on groups of rows, rather
than on single rows.
• Aggregate functions can appear in select lists and in ORDER BY and HAVING clauses.
● Distributive Function
Distributive: there is a function G() such that

\[ F(\{X_{i,j}\}) = G(\{F(\{X_{i,j} \mid i = 1, \dots, I\}) \mid j = 1, \dots, n\}) \]

● Examples: COUNT(), MIN(), MAX(), SUM()

● G = SUM() for COUNT(); for MIN(), MAX() and SUM(), G = F
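
A small check of the distributive property, using made-up values and an arbitrary split into partitions: each function is applied per partition and the partial results are then combined with G.

```python
data = [4, 8, 15, 16, 23, 42]                 # illustrative measure values
partitions = [data[:2], data[2:5], data[5:]]  # any split into subsets works

# SUM, MIN, MAX: apply F to each partition, then apply the same F again (G = F).
sum_ok = sum(sum(p) for p in partitions) == sum(data)
min_ok = min(min(p) for p in partitions) == min(data)
max_ok = max(max(p) for p in partitions) == max(data)

# COUNT: apply COUNT to each partition, then combine with G = SUM.
count_ok = sum(len(p) for p in partitions) == len(data)
```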


● Algebraic Function
Algebraic: there is an M-tuple valued function G() and a function H() such that

\[ F(\{X_{i,j}\}) = H(\{G(\{X_{i,j} \mid i = 1, \dots, I\}) \mid j = 1, \dots, n\}) \]

● Examples: AVG(), standard deviation, MaxN(), MinN()

● For AVG(), G() records the sum and count of each subset; H() adds these
two components and divides to produce the global average.
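
The AVG() case can be sketched as follows, with `g` and `h` standing in for G() and H(). The data and the partitioning are illustrative assumptions.

```python
partitions = [[4, 8], [15, 16, 23], [42]]  # illustrative data, already partitioned


def g(values):
    """G(): summarize one partition as the 2-tuple (sum, count)."""
    return (sum(values), len(values))


def h(sub_aggregates):
    """H(): add the sums and the counts, then divide for the global average."""
    total = sum(s for s, _ in sub_aggregates)
    count = sum(c for _, c in sub_aggregates)
    return total / count


global_avg = h([g(p) for p in partitions])

# Averaging the per-partition averages gives a different, wrong answer here.
avg_of_avgs = sum(sum(p) / len(p) for p in partitions) / len(partitions)
```

The 2-tuple is what makes AVG() algebraic with M = 2: a fixed-size summary per partition is enough to recover the exact global result.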

● Holistic Function
There is no constant bound on the size of the storage needed to describe
a sub-aggregate. That is, there is no constant M such that an M-tuple characterizes
the computation \( F(\{X_{i,j} \mid i = 1, \dots, I\}) \).
Examples: Median(), MostFrequent() (also called Mode()), and Rank()
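
The following sketch illustrates, rather than proves, why Median() resists a fixed-size summary: two made-up datasets have identical per-partition medians but different global medians, so one value per partition cannot determine the result.

```python
from statistics import median

# Two illustrative datasets, each split into two partitions.
a = [[1, 2, 3], [4, 5, 100]]
b = [[1, 2, 3], [0, 5, 6]]

sub_medians_a = [median(p) for p in a]  # one value per partition
sub_medians_b = [median(p) for p in b]

global_median_a = median(a[0] + a[1])
global_median_b = median(b[0] + b[1])
```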
1. 5. 6 SELF MAINTAINABLE
An aggregate function is self-maintainable if a new value of the function can be computed
solely from the old values of the aggregate function and the changes to the source data.
COUNT and SUM are self-maintainable with respect to insertions and deletions. MAX
and MIN are self-maintainable with respect to insertions, but not self-maintainable with respect
to deletions. AVG is not self-maintainable in itself, but it can be computed by SUM and
COUNT that are themselves self-maintainable with respect to insertions and deletions.

For an aggregate function to be self-maintainable, a necessary condition is that the function must
be distributive. In fact, all distributive aggregate functions are self-maintainable with respect to
insertions. However, not all distributive aggregate functions are self-maintainable with respect to
deletion. For example, MIN and MAX are not self-maintainable with respect to deletion.
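
The difference between insertions and deletions can be traced on a tiny example. The values are made up; the sketch maintains SUM, COUNT and MAX incrementally and shows where MAX has to fall back to the source data.

```python
values = [7, 3, 9]  # illustrative source data
total, count, maximum = sum(values), len(values), max(values)

# Insertion of 5: every aggregate is updated from its old value and the change.
values.append(5)
total, count, maximum = total + 5, count + 1, max(maximum, 5)

# Deletion of 9: SUM and COUNT are still updated from old value and change alone.
values.remove(9)
total, count = total - 9, count - 1

# MAX is not: the deleted value was the maximum, so the old MAX (9) and the
# change do not say what the new maximum is. The source must be rescanned.
maximum = max(values)

# AVG is derived from the two self-maintainable aggregates.
average = total / count
```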

1. 5. 7 TYPES OF ADDITIVITY

Additive measures are the most widely used kind of measure. Fully additive facts can be
meaningfully added across all dimensions. For example, Quantity Ordered in the Order Item
entity can be added across dates, products or customers to get the total sales volume for a
particular day, product or customer. Semi-additive facts can be meaningfully added across some
dimensions but not others (usually time). For example, Quantity On Hand from the Stock
Level entity can be added across products and warehouses (to get the total quantity on hand for a
particular product or warehouse) but not across time, as this would lead to double counting of
stock.
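
The stock example can be worked through with a few hypothetical snapshot rows. The row layout and the choice of "latest snapshot" as the summary over time are assumptions for illustration; other non-additive summaries, such as an average over time, are also possible.

```python
# Illustrative stock-level facts in date order:
# (day, product, warehouse, quantity_on_hand).
stock = [
    ("Mon", "pen", "W1", 10),
    ("Mon", "ink", "W1", 4),
    ("Tue", "pen", "W1", 10),  # the same 10 pens are still on the shelf
    ("Tue", "ink", "W1", 4),
]

# Adding across products for one day is meaningful.
on_hand_monday = sum(q for day, _, _, q in stock if day == "Mon")

# Adding across days is not: it counts the same pens once per day.
naive_pens_over_time = sum(q for _, product, _, q in stock if product == "pen")

# Across time, a summary such as the latest snapshot is used instead.
latest_pens = [q for _, product, _, q in stock if product == "pen"][-1]
```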

1. 6 Summarisability
SUMMARISABILITY

Summarisability in a multidimensional model refers to the correct computation of aggregate
values at a coarser level from the aggregate values at a finer level of detail. Typically, data is
aggregated along multiple hierarchies, summarizing data along multiple dimensions. For
example, a summary may show the total sales in the year 2010 at all branch locations in
Jharkhand. If the summarisability condition is violated, the data analysis tools will not derive
the correct result, which can lead to an erroneous decision.

A member mu of a dimension is summarisable from a set of members {ml1, ..., mlk} of the
same dimension if, for every distributive aggregate function, the facts associated with mu can be
computed from the facts associated with {ml1, ..., mlk}. The summarisability condition for the
table holds for a specific aggregate function, but not for all distributive aggregate functions.

The concept of summarisability is the core of multidimensional modeling. There are three
necessary conditions for summarisability, namely disjointness, completeness and type
compatibility, which every dimension hierarchy must fulfil.

DISJOINTNESS


The disjointness condition states that the members {ml1, ..., mlk} must form mutually
disjoint subsets over individuals/objects. In the above example, the BA students enrolled in
different departments do not form a disjoint set of data groups over departments. An inaccurate
summarization can result if summaries from different paths of the same hierarchy are merged.
Specifically, data cannot be merged among members that have overlapping data instances.

For example, it is incorrect to merge total sales from an area code with total sales in a city
that the area code serves. This is because data instances would be rolled up into both
categories, and measures would be added twice if the summaries were merged.
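
The double counting can be reproduced with a handful of made-up sales rows in which an area code and a city cover the same sales.

```python
# Illustrative sales facts: each sale has an area code and a city.
sales = [
    {"area_code": "416", "city": "Toronto", "amount": 100},
    {"area_code": "416", "city": "Toronto", "amount": 50},
    {"area_code": "905", "city": "Mississauga", "amount": 70},
]

true_total = sum(s["amount"] for s in sales)

# Two summaries along different roll-up paths of the location hierarchy.
sales_in_416 = sum(s["amount"] for s in sales if s["area_code"] == "416")
sales_in_toronto = sum(s["amount"] for s in sales if s["city"] == "Toronto")

# The two groups overlap (they are the same rows), so merging them double counts.
merged = sales_in_416 + sales_in_toronto
```

Here the merged figure exceeds the total of all sales, which is the symptom of a violated disjointness condition.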

COMPLETENESS

Completeness in hierarchies means that all members of {ml1, ..., mlk} belong to one higher-class object
mu, which consists of those members only. Given that disjointness is satisfied, it is necessary to test
whether the categorization of individuals/objects into groups is complete.

Type Compatibility
Type compatibility becomes relevant when summarization is undertaken: the aggregate
function applied must be compatible with the type of the measure and with the dimension
being summarized. In particular, summarization is performed quite differently along the
temporal dimension than along other dimensions.

OLAP stands for On-Line Analytical Processing. OLAP is a category of
software technology that enables analysts, managers, and executives to gain insight
into information through fast, consistent, interactive access to a wide variety of possible
views of data that has been transformed from raw information to reflect the real
dimensionality of the enterprise as understood by the clients.
OLAP Types
There are three main types of OLAP servers, as follows:


i) Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in
the market. This method allows multiple multidimensional views of two-dimensional relational
tables to be created, avoiding the need to structure records around the desired view. ROLAP
servers are intermediate servers that stand between a relational back-end server and user
front-end tools.

They use a relational or extended-relational DBMS to store and manage warehouse data, and
OLAP middleware to provide the missing pieces.

ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.

ROLAP technology tends to have higher scalability than MOLAP technology.

ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.

ii) Multidimensional OLAP (MOLAP) primarily reads precompiled data. A MOLAP structure has limited
capabilities to dynamically create aggregations or to evaluate results that have not been
pre-calculated and stored. Applications requiring iterative and comprehensive time-series analysis of
trends are well suited to MOLAP technology (e.g., financial analysis and budgeting). A
MOLAP cube is built for fast information retrieval and is optimal for slicing and dicing
operations.

Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship
Server, Sinper's TM/1, Planning Science's Gentium and Kenan Technology's Multiway.

iii) Hybrid OLAP (HOLAP) Server

HOLAP incorporates the best features of MOLAP and ROLAP into a single
architecture. HOLAP systems store larger quantities of detailed data in the relational
tables, while the aggregations are stored in the pre-calculated cubes. HOLAP can also drill
through from the cube down to the relational tables for detailed data. Microsoft SQL
Server 2000 provides a hybrid OLAP server.

OLAP Operations
⚫ A number of operations may be applied to data cubes. The common ones are:
⚫ → roll-up
⚫ → drill-down
⚫ → pivot or rotate
⚫ → slice & dice
i) ROLL UP
This is like zooming out on the data cube.
This is required when the user needs further abstraction or less detail.
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the
level of city to the level of country.
The data is then grouped into countries rather than cities.
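
A roll-up from city to country can be sketched as a sum over a lookup table. The sales figures and the `country_of` mapping are illustrative assumptions.

```python
from collections import defaultdict

# Illustrative city-level cells of a sales cube and a city -> country mapping.
sales_by_city = {"Toronto": 395, "Vancouver": 605, "Chicago": 440, "New York": 560}
country_of = {
    "Toronto": "Canada",
    "Vancouver": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}

# Roll-up: ascend the location hierarchy by summing cities into their country.
sales_by_country = defaultdict(int)
for city, dollars in sales_by_city.items():
    sales_by_country[country_of[city]] += dollars
```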


ii) DRILL DOWN


⚫ This is like zooming in on the data and is therefore the reverse of roll-up.
⚫ This is an appropriate operation when the user needs further details or when the
user wants to partition more finely or wants to focus on some particular values of
certain dimensions.
⚫ This adds more details to the data.
⚫ Initially the concept hierarchy was "day < month < quarter < year."
⚫ On drill-down, the time dimension is descended from the level of quarter to the level of
month.
⚫ Drill-down can also be performed by adding one or more dimensions to the data cube.
iii) SLICE & DICE
⚫ These are operations for browsing the data in the cube. The terms refer to the ability
to look at information from different viewpoints.
⚫ A slice is a subset of the cube corresponding to a single value for one or more members
of the dimensions.
⚫ For example, a slice operation is performed on the dimension time using the criterion time = "Q1".
⚫ The dice operation is similar to slice, but dicing does not reduce the number
of dimensions.
⚫ A dice is obtained by performing a selection on two or more dimensions.
⚫ For example, a dice operation on the cube may use the following selection criteria, which involve
three dimensions:
(location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and
(item = "Mobile" or "Modem").
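
Slice and dice can be sketched on a cube stored as a dictionary keyed by (location, time, item). The cell values are made up for illustration.

```python
# Illustrative cube cells keyed by (location, time, item) with a sales measure.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q2", "Modem"): 825,
    ("Vancouver", "Q1", "Modem"): 14,
    ("Vancouver", "Q3", "Mobile"): 400,
    ("Chicago", "Q1", "Mobile"): 854,
}

# Slice: fix time = "Q1" and drop that dimension, leaving a 2-D (location, item) view.
slice_q1 = {(loc, item): v for (loc, t, item), v in cube.items() if t == "Q1"}

# Dice: select on all three dimensions; the result is a smaller 3-D sub-cube.
dice = {
    key: v
    for key, v in cube.items()
    if key[0] in {"Toronto", "Vancouver"}
    and key[1] in {"Q1", "Q2"}
    and key[2] in {"Mobile", "Modem"}
}
```

Note that the slice keys have two components while the dice keys still have three, which is the difference in dimensionality between the two operations.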
iv) Pivot
The pivot operation is also called rotation. Pivot is a visualization operation which
rotates the data axes in view to provide an alternative presentation of the data. It may involve
swapping the rows and columns or moving one of the row dimensions into the column
dimensions.
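
Swapping rows and columns of a 2-D view can be sketched with nested dictionaries. The view and its values are illustrative assumptions.

```python
# Illustrative 2-D view: rows are locations, columns are items.
view = {
    "Toronto": {"Mobile": 605, "Modem": 825},
    "Vancouver": {"Mobile": 400, "Modem": 14},
}

# Pivot: swap the axes so rows are items and columns are locations.
pivoted = {}
for location, row in view.items():
    for item, dollars in row.items():
        pivoted.setdefault(item, {})[location] = dollars
```

The cell values are unchanged; only the presentation of the axes differs.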


PART B & C
1 Explain the Data Warehouse Architecture.
2 Discuss about Data Warehouse models
3 Discuss about Aggregate Functions
4 Explain the OLAP types with example.
5 Describe the OLAP operations with example.
