Data Warehouse Architecture
⚫ A data warehouse (DW or DWH), also called an enterprise data warehouse (EDW), is a
database used for reporting and data analysis.
⚫ Integrating data from one or more disparate sources creates a central repository of
data, the data warehouse.
⚫ Data warehouses store current and historical data and are used to create trend
reports for senior management, such as annual and quarterly comparisons.
⚫ The data stored in the warehouse is uploaded from the operational systems.
Three Tier Architecture
⚫ Reconciliation
⚫ Derivation
Tier 1: Data warehouse server
Tier 2: OLAP engine
Tier 3: Client
CS Department MTNC
DATA MINING AND WAREHOUSING- 18UCAE51
i) ETL Process
⚫ A typical extract, transform, load (ETL) based data warehouse uses staging,
data integration, and access layers to house its key functions. The staging layer
or staging database stores raw data extracted from each of the disparate source
systems.
⚫ ETL Tools
⚫ Extraction of Data
⚫ Transportation of Data
⚫ Loading of Data
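The staging, integration, and access layers described above can be sketched as three small functions. This is only an illustrative sketch: the two source lists, the `sales` table, and the function names `extract`, `transform`, and `load` are assumptions made for the example, not part of any particular ETL tool.

```python
import sqlite3

# Hypothetical raw rows extracted from two disparate source systems.
SOURCE_A = [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "4.00"}]
SOURCE_B = [{"id": 3, "amount": " 7.25 "}]


def extract():
    """Staging layer: collect raw rows from every source, values unchanged."""
    return [dict(row, source=name)
            for name, rows in (("A", SOURCE_A), ("B", SOURCE_B))
            for row in rows]


def transform(staged):
    """Integration layer: clean types so rows from all sources agree."""
    return [(r["id"], float(r["amount"].strip()), r["source"]) for r in staged]


def load(rows):
    """Access layer: load the cleaned rows into a queryable table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, source TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


warehouse = load(transform(extract()))
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Keeping the raw strings in the staging step and converting them only in `transform` mirrors the idea that the staging layer holds data exactly as extracted.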
⚫ Metadata is data about data and serves as a road map to the data warehouse. Metadata in
a data warehouse defines the warehouse objects.
⚫ The metadata acts as a directory. This directory helps the decision support system
locate the contents of the data warehouse.
⚫ Additional metadata is created and captured for timestamping any extracted data, recording the
source of the extracted data, and identifying missing fields that have been added by data cleaning or
integration processes.
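One possible way to capture the additional metadata mentioned above (timestamp, source, and fields filled in during cleaning) is sketched here. The class `ExtractMetadata`, the function `clean`, and the field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ExtractMetadata:
    """Metadata recorded for one extracted record (hypothetical layout)."""
    source: str
    extracted_at: datetime
    filled_fields: list = field(default_factory=list)


def clean(record, source, defaults):
    """Fill missing fields from `defaults` and note which ones were added."""
    meta = ExtractMetadata(source, datetime.now(timezone.utc))
    cleaned = dict(record)
    for name, default in defaults.items():
        if cleaned.get(name) is None:
            cleaned[name] = default
            meta.filled_fields.append(name)
    return cleaned, meta
```

A call such as `clean({"id": 1, "region": None}, "crm", {"region": "UNKNOWN"})` returns the repaired record together with a metadata object naming `region` as a field added by cleaning.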
iii) DW Server
The operational data store (ODS) is integrated. That is, it is a collection of subject-oriented records from a variety
of systems that provides an enterprise-wide view of the information.
The ODS is current-valued. That is, an ODS is up to date and reflects the current status
of the data. An ODS does not contain historical information. Since the OLTP system
data is changing all the time, data from the underlying sources refreshes the ODS as frequently
as possible.
v) Data mart
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and
partitions.
Refresh: propagate the updates from the data sources to the warehouse.
Dimensional Modelling
• A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
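A data cube can be pictured as a mapping from one coordinate per dimension to a measure. The sketch below assumes three hypothetical dimensions (time, item, location) and a single measure (units sold); the rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, item, location, units_sold).
facts = [
    ("Q1", "phone", "Chennai", 5),
    ("Q1", "phone", "Delhi", 3),
    ("Q2", "phone", "Chennai", 4),
    ("Q1", "laptop", "Chennai", 2),
]

# The cube maps each (time, item, location) cell to its measure.
cube = defaultdict(int)
for time, item, location, units in facts:
    cube[(time, item, location)] += units


def cell(time, item, location):
    """Look up one cell of the cube; empty cells count as zero."""
    return cube.get((time, item, location), 0)
```

Each dimension contributes one coordinate of the key, so adding a dimension means adding one more element to the tuple.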
iv) Dimension Hierarchy
Categorisation of Hierarchies
A simple hierarchy has a tree structure whose instances have one-to-many parent-child
relationships. A simple hierarchy is symmetric if there is a single path from the bottom-level
members to the top and all levels are mandatory. It is fully summarizable, and the aggregation of
measures along the levels is straightforward. A simple hierarchy is asymmetric when not all
levels are mandatory: there may be paths that do not cover all levels, or there may be parent levels
without children.
Example: A Concept Hierarchy: Dimension (location)
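A simple hierarchy for a location dimension can be sketched as a child-to-parent map. The members and the helper names `path_to_top` and `is_symmetric` are assumptions for illustration; the symmetry check here only compares path lengths, which is a simplification of "all levels are mandatory".

```python
# Hypothetical members of a location hierarchy: child -> parent.
PARENT = {
    "Chennai": "Tamil Nadu",
    "Madurai": "Tamil Nadu",
    "Tamil Nadu": "India",
    "Toronto": "Ontario",
    "Ontario": "Canada",
}


def path_to_top(member):
    """Return the members from `member` up to the top of the hierarchy."""
    path = [member]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def is_symmetric(leaves):
    """Simplified check: every leaf path passes through the same number of levels."""
    return len({len(path_to_top(leaf)) for leaf in leaves}) == 1
```

Because every child has exactly one parent, each bottom-level member has a single path to the top, which is what makes aggregation along the levels straightforward.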
1.5 Aggregate Function
• Data cubes facilitate the answering of data mining queries, as they allow
the computation of aggregate data at multiple levels of granularity.
• Aggregate functions return a single result row based on groups of rows, rather
than on single rows.
• Aggregate functions can appear in select lists
and in ORDER BY and HAVING clauses.
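The points above can be tried out with Python's built-in `sqlite3` module. The `sales` table and its rows are hypothetical; the query uses aggregate functions in the select list, in HAVING, and in ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Chennai", 10), ("Chennai", 30), ("Delhi", 5), ("Delhi", 5), ("Pune", 50)],
)

# One result row per group; HAVING filters groups, ORDER BY sorts them,
# and both clauses may use aggregate functions.
rows = conn.execute(
    """
    SELECT city, SUM(amount) AS total, COUNT(*) AS n
    FROM sales
    GROUP BY city
    HAVING SUM(amount) >= 20
    ORDER BY SUM(amount) DESC
    """
).fetchall()
```

Five input rows collapse to one row per city, and the Delhi group is dropped by the HAVING condition because its total is below 20.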
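● Distributive function
An aggregate function F() is distributive if there is a function G() such that
$F(\{X_{i,j}\}) = G(\{F(\{X_{i,j} \mid i = 1, \dots, I\}) \mid j = 1, \dots, n\})$
Examples: COUNT(), SUM(), MIN(), and MAX().
● Algebraic function
An aggregate function F() is algebraic if there is an M-tuple valued function G()
and a function H() such that
$F(\{X_{i,j}\}) = H(\{G(\{X_{i,j} \mid i = 1, \dots, I\}) \mid j = 1, \dots, n\})$
For example, AVG is algebraic: G() returns the pair (SUM, COUNT) for each group, and H()
divides the combined sum by the combined count.
● Holistic function
An aggregate function F() is holistic if there is no constant bound on the size of the storage
needed to describe a sub-aggregate. That is, there is no constant M such that an M-tuple characterizes
the computation $F(\{X_{i,j} \mid i = 1, \dots, I\})$.
Examples: Median(), MostFrequent() (also called Mode()), and Rank().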
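The difference between the three classes can be seen by computing an aggregate from per-partition results instead of from the full data. The partitions below are made-up numbers; the sketch uses SUM as a distributive function, AVG as an algebraic one, and the median as a holistic one.

```python
from statistics import median

# Hypothetical data split into n = 3 partitions (the j index in the text).
partitions = [[4, 8, 1], [7, 2], [9, 3, 5, 6]]
everything = [x for part in partitions for x in part]

# Distributive: SUM of the partition SUMs equals SUM of all the data.
sum_from_parts = sum(sum(part) for part in partitions)

# Algebraic: AVG needs a 2-tuple (sum, count) per partition (the G step),
# followed by one combining function (the H step).
pairs = [(sum(part), len(part)) for part in partitions]
avg_from_parts = sum(s for s, _ in pairs) / sum(c for _, c in pairs)

# Holistic: the median of the partition medians is in general not the
# median of all the data, so a fixed-size summary per partition is not enough.
median_of_medians = median(median(part) for part in partitions)
true_median = median(everything)
```

For this data the partition medians are 4, 4.5, and 5.5, whose median is 4.5, while the median of all nine values is 5.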
1.5.6 SELF MAINTAINABLE
An aggregate function is self-maintainable if a new value of the function can be computed
solely from the old values of the aggregate function and the changes to the source data.
COUNT and SUM are self-maintainable with respect to insertions and deletions. MAX
and MIN are self-maintainable with respect to insertions, but not with respect
to deletions. AVG is not self-maintainable by itself, but it can be computed from SUM and
COUNT, which are themselves self-maintainable with respect to insertions and deletions.
For an aggregate function to be self-maintainable, a necessary condition is that the function
be distributive. In fact, all distributive aggregate functions are self-maintainable with respect to
insertions. However, not all distributive aggregate functions are self-maintainable with respect to
deletions. For example, MIN and MAX are not self-maintainable with respect to deletions.
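The behaviour described above can be sketched with a small class that maintains COUNT, SUM, and MAX incrementally. The class name `RunningStats` and the `rescans` counter are assumptions added for illustration; the counter shows when MAX has to go back to the source data.

```python
class RunningStats:
    """Maintains COUNT, SUM and MAX under insertions and deletions."""

    def __init__(self):
        self.values = []      # source data, needed only to repair MAX
        self.count = 0
        self.total = 0
        self.maximum = None
        self.rescans = 0      # how often MAX had to re-read the source

    def insert(self, x):
        self.values.append(x)
        self.count += 1       # new COUNT from old COUNT and the change
        self.total += x       # new SUM from old SUM and the change
        if self.maximum is None or x > self.maximum:
            self.maximum = x  # new MAX from old MAX and the change

    def delete(self, x):
        self.values.remove(x)
        self.count -= 1
        self.total -= x
        if x == self.maximum:
            # The old MAX and the deleted value do not reveal the new MAX.
            self.maximum = max(self.values) if self.values else None
            self.rescans += 1

    def average(self):
        """AVG derived from the self-maintainable SUM and COUNT."""
        return self.total / self.count if self.count else None
```

Deleting a value below the current maximum leaves MAX untouched, but deleting the maximum itself forces a scan of the remaining source values, which is why MAX is not self-maintainable with respect to deletions.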
1.5.7 TYPES OF ADDITIVITY
Additive measures are popular and widely used. Fully additive facts can be
meaningfully added across all dimensions. For example, Quantity Ordered in the Order Item
entity can be added across dates, products or customers to get the total sales volume for a
particular day, product or customer. Semi-additive facts can be meaningfully added across some
dimensions but not others (usually time). For example, Quantity On Hand from the Stock
Level entity can be added across products and warehouses (to get the total quantity on hand for a
particular product or warehouse) but not across time, as this would lead to double counting of
stock.
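The Quantity On Hand example can be made concrete with a few made-up stock snapshots. Taking the latest snapshot as the value across time is one common choice and is an assumption here, not something the text prescribes.

```python
# Hypothetical stock-level snapshots: (date, product, quantity_on_hand).
stock = [
    ("2024-01-01", "pen", 100),
    ("2024-01-01", "ink", 40),
    ("2024-01-02", "pen", 90),
    ("2024-01-02", "ink", 40),
]

# Adding across products for one date is meaningful.
on_hand_jan_2 = sum(q for d, _, q in stock if d == "2024-01-02")

# Adding across dates counts the same stock more than once ...
misleading_pen_total = sum(q for _, p, q in stock if p == "pen")

# ... so across time a different rule is used, here the latest snapshot.
latest_date = max(d for d, _, _ in stock)
pen_on_hand = next(q for d, p, q in stock if d == latest_date and p == "pen")
```

The sum over dates gives 190 pens although there were never more than 100 in stock, which is the double counting the text warns about.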
1.6 Summarisability
The concept of summarisability is the core of multidimensional modeling. There are three
necessary conditions for summarisability, namely disjointness, completeness and type
compatibility, which every dimension hierarchy must fulfil.
DISJOINTNESS
The disjointness condition states that the members {ml1, ..., mlk} must form mutually
disjoint subsets over individuals/objects. In the above example, the BA students enrolled in
different departments do not form disjoint groups over departments. An inaccurate
summarization can result if summaries from different paths of the same hierarchy are merged.
Specifically, data cannot be merged among members that have overlapping data instances.
For example, it is incorrect to merge total sales from an area code with total sales in a city
that the area code serves. This is because data instances would be rolled up into both
categories, and measures would be added twice if the summaries were merged.
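The area code example can be checked numerically. The rows, the city names, and the helper `are_disjoint` are hypothetical; the sketch shows how merging overlapping summaries inflates the total.

```python
# Hypothetical sales rows, each tagged with both a city and an area code.
sales = [
    {"city": "A-town", "area_code": "044", "amount": 100},
    {"city": "A-town", "area_code": "044", "amount": 50},
    {"city": "B-town", "area_code": "044", "amount": 30},
]


def total_by(key, value):
    """Sum the amount over all rows whose `key` equals `value`."""
    return sum(row["amount"] for row in sales if row[key] == value)


city_total = total_by("city", "A-town")
area_total = total_by("area_code", "044")

# Merging the two summaries adds the A-town rows twice.
merged = city_total + area_total
true_total = sum(row["amount"] for row in sales)


def are_disjoint(groups):
    """True when no instance belongs to more than one group."""
    seen = set()
    for group in groups:
        if seen & group:
            return False
        seen |= group
    return True
```

Here `are_disjoint` takes groups as sets of row identifiers; the city group and the area code group share rows, so their summaries must not be merged.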
COMPLETENESS
Completeness in hierarchies means that all members of {ml1, ..., mlk} belong to one higher-class object
mu, which consists of those members only. Given that disjointness is satisfied, it is necessary to test
whether the categorization of individuals/objects into groups is complete.
TYPE COMPATIBILITY
Type compatibility becomes relevant when summarization is undertaken.
Summarization is performed quite differently with respect to the temporal function.
OLAP stands for On-Line Analytical Processing. OLAP is a category of
software technology that enables analysts, managers, and executives to gain insight
into information through fast, consistent, interactive access to a wide variety of possible
views of data that has been transformed from raw information to reflect the real
dimensionality of the enterprise as understood by its users.
OLAP Types
There are three main types of OLAP servers:
i) Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in
the market. This method allows multiple multidimensional views of two-dimensional relational
tables to be created, avoiding the need to structure records around the desired view. ROLAP servers
are intermediate servers that stand between a relational back-end server and user front-end tools.
They use a relational or extended-relational DBMS to store and manage warehouse data, and
OLAP middleware to provide the missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.
ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.
ii) Multidimensional OLAP (MOLAP) primarily reads precompiled data. A MOLAP structure has limited
capabilities to dynamically create aggregations or to evaluate results that have not been
pre-calculated and stored. Applications requiring iterative and comprehensive time-series analysis of
trends are well suited to MOLAP technology (e.g., financial analysis and budgeting). A
MOLAP cube is built for fast information retrieval and is optimal for slicing and dicing
operations.
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship
Server, Sinper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.
iii) Hybrid OLAP (HOLAP) incorporates the best features of MOLAP and ROLAP into a single
architecture. HOLAP systems store larger quantities of detailed data in the relational
tables, while the aggregations are stored in the pre-calculated cubes. HOLAP can also drill
through from the cube down to the relational tables for detailed data. Microsoft SQL
Server 2000 provides a hybrid OLAP server.
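The three server types differ mainly in where an aggregate comes from. The toy sketch below is not how any real OLAP server is implemented; the rows and the function names `rolap_total`, `build_cube`, and `holap_query` are assumptions that only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical detail rows as a relational table: (city, quarter, amount).
detail_rows = [
    ("Chennai", "Q1", 10), ("Chennai", "Q1", 15),
    ("Chennai", "Q2", 20), ("Delhi", "Q1", 5),
]


def rolap_total(city, quarter):
    """ROLAP style: aggregate from the relational rows at query time."""
    return sum(a for c, q, a in detail_rows if (c, q) == (city, quarter))


def build_cube(rows):
    """MOLAP style: pre-calculate every (city, quarter) aggregate once."""
    cube = defaultdict(int)
    for c, q, a in rows:
        cube[(c, q)] += a
    return dict(cube)


cube = build_cube(detail_rows)


def holap_query(city, quarter, drill_through=False):
    """HOLAP style: answer from the cube, drill through to rows on demand."""
    if drill_through:
        return [r for r in detail_rows if r[:2] == (city, quarter)]
    return cube.get((city, quarter), 0)
```

The cube lookup avoids scanning the detail rows, while the drill-through path returns the underlying relational rows when more detail is requested.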
OLAP Operations
⚫ A number of operations may be applied to data cubes. The common ones are:
⚫ → roll-up
⚫ → drill-down
⚫ → pivot or rotate
⚫ → slice & dice
i) ROLL UP
This is like zooming out on the data cube.
This is required when the user needs further abstraction or less detail.
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the
level of city to the level of country.
The data is then grouped into countries rather than cities.
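A roll-up from city to country can be sketched as a regrouping of city-level totals. The sales figures are made up, and `roll_up` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical city-level totals and a city -> country mapping.
sales_by_city = {"Chennai": 40, "Delhi": 10, "Toronto": 25, "Vancouver": 5}
country_of = {"Chennai": "India", "Delhi": "India",
              "Toronto": "Canada", "Vancouver": "Canada"}


def roll_up(totals, parent_of):
    """Aggregate a measure by moving each member up to its parent level."""
    rolled = defaultdict(int)
    for member, value in totals.items():
        rolled[parent_of[member]] += value
    return dict(rolled)


sales_by_country = roll_up(sales_by_city, country_of)
```

Four city cells become two country cells, which is the "zooming out" effect: less detail, same overall total.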
PART B & C
1 Explain the Data Warehouse Architecture.
2 Discuss Data Warehouse models.
3 Discuss Aggregate Functions.
4 Explain the OLAP types with example.
5 Describe the OLAP operations with example.