
Data Warehousing & Data Mining
Code: ITM 402
Dr. Dinesh Gomber
What is a Data Warehouse?
Inmon's definition
A data warehouse is a
- subject-oriented (high level),
- integrated,
- time-variant,
- nonvolatile
collection of data in support of management's decision-making process.
Subject-oriented
A data warehouse is organized around subjects such as sales, product, and customer.
It focuses on modeling and analysis of data for decision makers.
It excludes data that is not useful in the decision support process.
Integration
A data warehouse is constructed by integrating multiple heterogeneous sources.
Data preprocessing is applied to ensure consistency.
[Figure: RDBMS, legacy system, and flat file sources pass through data processing and data transformation into the data warehouse.]

Integration
In terms of data:
encoding structures
measurement of attributes
physical attributes of data
remarks
naming conventions
data type formats
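
As a rough illustration of the integration step, the sketch below maps two source records onto one warehouse layout. Everything in it is an assumption made for the example: the field names, the gender codes, and the inch/centimetre units are hypothetical and do not come from the slides.

```python
# Hypothetical records from two source systems (all names and codes are assumed).
SOURCE_A = {"cust_name": "Lee", "gender": "m", "height_cm": 170.0}
SOURCE_B = {"CustomerName": "Ramos", "sex": 0, "height_in": 65.0}

GENDER_CODES = {"m": "M", "f": "F", 1: "M", 0: "F"}  # assumed encodings
CM_PER_INCH = 2.54


def integrate_a(rec):
    """Map source A's naming and encoding onto the warehouse format."""
    return {"name": rec["cust_name"],
            "gender": GENDER_CODES[rec["gender"]],
            "height_cm": float(rec["height_cm"])}


def integrate_b(rec):
    """Same target format; source B uses numeric codes and inches."""
    return {"name": rec["CustomerName"],
            "gender": GENDER_CODES[rec["sex"]],
            "height_cm": round(rec["height_in"] * CM_PER_INCH, 1)}


warehouse_rows = [integrate_a(SOURCE_A), integrate_b(SOURCE_B)]
```

After this step both rows share one naming convention, one encoding structure, and one unit of measurement.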


Time-variant
Provides information from historical
perspective e.g. past 5-10 years
Every key structure contains either
implicitly or explicitly an element of
time
Nonvolatile
Data once recorded cannot be updated.
Data warehouse requires two operations
in data accessing
Initial loading of data
Access of data

[Figure: data is loaded into the warehouse and then accessed.]
How does it differ from a database?
OLTP vs. OLAP
OLTP: On Line Transaction Processing
Describes processing at operational sites.

OLAP: On Line Analytical Processing


Describes processing at warehouse

Warehouse is a Specialized DB
Standard DB (OLTP) | Warehouse (OLAP)
Mostly updates | Mostly reads
Many small transactions | Queries are long and complex
MB - GB of data | GB - TB of data
Current snapshot | History
Index/hash on primary key | Lots of scans
Raw data | Summarized, reconciled data
Thousands of users (e.g., clerical users) | Hundreds of users (e.g., decision-makers, analysts)
Operational v/s Information System
Features | Operational | Information
Characteristics | Operational processing | Informational processing
Orientation | Transaction | Analysis
User | Clerk, DBA, database professional | Knowledge workers
Function | Day-to-day operation | Decision support
Data | Current | Historical
View | Detailed, flat relational | Summarized, multidimensional
DB design | Application oriented | Subject oriented
Unit of work | Short, simple transaction | Complex query
Access | Read/write | Mostly read
Operational v/s Information System (continued)
Features | Operational | Information
Focus | Data in | Information out
Number of records accessed | Tens | Millions
Number of users | Thousands | Hundreds
DB size | 100 MB to GB | 100 GB to TB
Priority | High performance, high availability | High flexibility, end-user autonomy
Metric | Transaction throughput | Query throughput
Data Warehouse
Definitions
A data warehouse is a type of computer
based information system developed to
provide an organization with business
intelligence to support decision making and
to monitor the operations in a company.

It integrates data from many different sources and makes it available to end users in a form they can understand and use in a business context, in a timely manner.
Primary difference between
a database and data
warehouse
Database stores information for a
single application, whereas a Data
warehouse stores information from
multiple databases, or multiple
applications, and external
information such as industry
information
The Database Approach
Database approach: data organised as
entities
Entity: object that has data
People
Events
Products
Character: smallest piece of data
Field: single piece of information about entity
Record: collection of fields
The Database Approach
(continued)
File: collection of related records
Database management system
(DBMS): program used to build
databases
Populates with data
Manipulates data
Query: message requesting access
to data
The Database Approach
(continued)
Database has security issues
Database administrator (DBA):
limits user access to database
Requires users to enter codes
DBMS bundled with fourth-generation
languages
Database Models
Database model: general logical
structure
How records stored in database
Records linked differently in different
models
Models constantly changing
The Relational Model

Relational model: consists of tables
Based on relational algebra
Tuple: record
Attribute: field
Relation: table
Key: identifier field
Used to retrieve records
The Relational Model
(continued)
Primary key: unique key
Uniquely identifies record
Required in table
Composite key: combination of fields
Serves as primary key
Foreign key: shared field
Links tables
Join table: composite of tables
The Relational Model
(continued)
Table relationships with other tables
One-to-many relationship: one
item in table linked to many items in
other table
Many-to-many relationship: many
items in table linked to many items
of other table
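
The key and relationship terms above can be sketched with Python's built-in sqlite3 module. The customer and purchase tables, their columns, and the sample rows are hypothetical names chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,                   -- primary key
        name    TEXT NOT NULL
    );
    CREATE TABLE purchase (
        cust_id INTEGER REFERENCES customer(cust_id),  -- foreign key
        item    TEXT,
        PRIMARY KEY (cust_id, item)                    -- composite key
    );
    INSERT INTO customer VALUES (1, 'Anderson'), (2, 'Green');
    INSERT INTO purchase VALUES (1, 'pen'), (1, 'ink'), (2, 'pad');
""")

# One-to-many: one customer row joins to many purchase rows via the shared field.
rows = conn.execute("""
    SELECT c.name, COUNT(*) FROM customer c
    JOIN purchase p ON p.cust_id = c.cust_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
```

Here `rows` holds one (name, purchase count) pair per customer, produced by following the foreign key from purchase back to customer.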
The Object-Oriented Model
Object-Oriented model: uses object-
oriented approach
Encapsulation: combined storage of data
and relevant procedures
Allows object to be planted in different
data sets
Inheritance: creates new object by
replicating characteristics of existing (parent)
object
Structured Query Language

Structured Query Language (SQL): language of choice for DBMSs
Advantages
Standardised language
Used in many host languages
Portable
The Schema and Metadata

Schema: plan
Describes structure of database
Names and sizes of fields
Identifies primary keys
Data dictionary: repository of
information about data
The Schema and Metadata
(continued)
Metadata: data about data
Source of data
Tables related to data
Field information
Usage of data
Population rules
Data Warehouse
Fundamentals
Extraction, transformation, and loading (ETL): a process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions, and loads the information into a data warehouse.

Data mart: contains a subset of data warehouse information. The ETL process also gathers data from the data warehouse and passes it to the data marts.
Data Warehouse
Fundamentals
Components of a Data Warehouse

Metadata means data about data


Multidimensional
Analysis
and Data Mining
Databases contain information in a series of two-dimensional tables.

In a data warehouse and data mart, information is multidimensional: it contains layers of columns and rows.
Scenario 1

ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai, and Bangalore. The Sales Manager wants a quarterly sales report. Each branch has a separate operational system.
Scenario 1: ABC Pvt Ltd.
[Figure: the Mumbai, Delhi, Chennai, and Bangalore branch systems each feed the Sales Manager, who needs sales per item type per branch for the first quarter.]
ETL (Extract, Transform, Load)
Improve the quality of data before loading it into the warehouse.
Perform data cleaning and transformation before loading the data.
Then load the data into the data warehouse.
Use query analysis tools to support ad hoc queries.
Solution 1: ABC Pvt Ltd.
Extract sales information from each database.
Store the information in a common repository at a single site.
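
A minimal sketch of this solution, assuming each branch system exports records with its own field names. The branch extracts, field names, item types, and amounts below are all invented for illustration; only two branches are shown to keep it short.

```python
from collections import defaultdict

# Hypothetical extracts; each branch system uses its own field names.
BRANCH_EXTRACTS = {
    "Mumbai": [{"item_type": "Pens", "amt": 100, "month": 1},
               {"item_type": "pens ", "amt": 50, "month": 2}],
    "Delhi": [{"type": "PENS", "amount": 70, "month": 3},
              {"type": "Paper", "amount": 40, "month": 5}],  # second quarter
}


def transform(branch, rec):
    """Clean one extracted record into the common warehouse layout."""
    item = (rec.get("item_type") or rec.get("type")).strip().lower()
    amount = rec.get("amt", rec.get("amount"))
    quarter = (rec["month"] - 1) // 3 + 1
    return {"branch": branch, "item_type": item,
            "quarter": quarter, "amount": amount}


# Load: one common repository at a single site.
warehouse = [transform(b, r) for b, recs in BRANCH_EXTRACTS.items() for r in recs]

# Report: sales per item type per branch for the first quarter.
q1_sales = defaultdict(int)
for row in warehouse:
    if row["quarter"] == 1:
        q1_sales[(row["branch"], row["item_type"])] += row["amount"]
```

The cleaning step is what lets "Pens", "pens " and "PENS" land in the same report row once the data sits in one repository.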
Solution 1: ABC Pvt Ltd.
[Figure: the Mumbai, Delhi, Chennai, and Bangalore systems load into a data warehouse; query and analysis tools produce the report for the Sales Manager.]
Data Warehousing Architecture
[Figure: data sources (operational DBs, external sources) are extracted, transformed, loaded, and refreshed into the data warehouse and data marts, which hold reconciled data alongside a metadata repository and monitoring & administration; OLAP servers serve the tools (analysis, query/reporting, data mining).]
Data Warehouse Architecture
Data Warehouse server
almost always a relational DBMS, rarely flat files
OLAP servers
to support and operate on multi-
dimensional data structures
Clients
Query and reporting tools
Analysis tools
Data mining tools
OLTP
OLTP (On-line Transaction Processing)
Operational data
To control and run fundamental business
tasks
Transactions: INSERT, UPDATE, DELETE.
Detailed and current data
Relatively standardized and simple queries
Highly normalized (3NF) with many tables
OLAP
OLAP (On-line Analytical Processing)
Low volume of transactions.
To help with planning, problem solving,
and decision support
Often very complex queries involving aggregations
Typically de-normalized with fewer tables;
use of star and/or snowflake schemas.
Historical data, stored in multi-
dimensional schemas (usually star
schema).
OLTP (On-line Transaction
Processing) vs. OLAP (On-
line Analytical
Processing)
We can divide IT systems into
transactional (OLTP) and analytical
(OLAP). In general we can assume
that OLTP systems provide source
data to data warehouses, whereas
OLAP systems help to analyze it.
Data Mart
Introduction

OLAP (Online Analytical Processing) designates a category of applications and technologies that allow the collection, storage, manipulation and reproduction of multidimensional data, with the goal of analysis.
OLTP vs. OLAP
OLTP is highly normalized with many tables (RDBMS).
OLAP is typically de-normalized with fewer tables (use of star and/or snowflake schemas).
Difference between OLTP and OLAP
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is on very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database holds detailed and current data, and the schema used to store transactional databases is the entity model (usually 3NF).

- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually star schema).
Three types of data warehouse data structure
ROLAP (e.g., star schema)
MOLAP (cubes, slicing, dicing)
Hybrid (HOLAP)
OLAP Models
Relational (ROLAP): uses a relational star schema
Multidimensional (MOLAP): uses data cubes
ROLAP
Multi-Dimensional
OLAP Servers
Predefined hierarchy allows logical
pre-aggregation and, conversely,
allows for a logical drill-down.

Supports common analytical operations
Consolidation.
Drill-down.
Slicing and dicing.
Multi-Dimensional OLAP Servers
Roll-up (consolidation) - aggregation of data, from simple roll-ups to complex expressions involving inter-related data, for example monthly data to quarterly data.

Drill-down - the reverse of consolidation; displays the detailed data that comprises the consolidated data, for example quarterly data to monthly data.

Slicing and dicing (also called pivoting) - the ability to look at the data from different viewpoints.
Slicing
Multi-Dimensional OLAP
servers
Can store data in a compressed form by
dynamically selecting physical storage
organizations and compression techniques
that maximize space utilization.

Dense data (i.e., data that exists for a high percentage of cells) can be stored separately from sparse data (i.e., a significant percentage of cells are empty).
ON-LINE ANALYTICAL
PROCESSING
Demand for OLAP
To develop DM, three approaches
In all approaches, Data Marts
rest on Dimensional Model
Data Marts are sufficient for
basic data analysis
Users need to go beyond such
basic analysis

Demand for OLAP

Need for Multidimensional Analysis
Fast Access & Powerful
Calculations
Limitations of other analysis
methods like:
SQL
Spreadsheets
Report Writers

Demand for OLAP
Traditional tools of report writers,
query products, spreadsheets, &
language interfaces do not match
the user expectations as far as
performing multidimensional
analysis with complex calculations
is concerned.
Tools used with OLTP and basic DW
environments do not match up to
the task

OLAP is the Answer!
OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
Why is OLAP useful?
Facilitates multidimensional data
analysis by pre-computing aggregates
across many sets of dimensions
Provides for:
Greater speed and responsiveness
Improved user interactivity
Data Warehouses
A data warehouse is based on a
multidimensional data model which
views data in the form of a data cube
A data cube allows data to be modeled
and viewed in multiple dimensions

CUBE

Fact table view (sale):
prodId | storeId | date | amt
p1 | c1 | 1 | 12
p2 | c1 | 1 | 11
p1 | c3 | 1 | 50
p2 | c2 | 1 | 8
p1 | c1 | 2 | 44
p1 | c2 | 2 | 4

Multi-dimensional cube:
day 1 | c1 | c2 | c3
p1 | 12 | - | 50
p2 | 11 | 8 | -

day 2 | c1 | c2 | c3
p1 | 44 | 4 | -
p2 | - | - | -

dimensions = 3
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale:
prodId | storeId | date | amt
p1 | c1 | 1 | 12
p2 | c1 | 1 | 11
p1 | c3 | 1 | 50
p2 | c2 | 1 | 8
p1 | c1 | 2 | 44
p1 | c2 | 2 | 4

Result (rows with date = 1): 81
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale:
prodId | storeId | date | amt
p1 | c1 | 1 | 12
p2 | c1 | 1 | 11
p1 | c3 | 1 | 50
p2 | c2 | 1 | 8
p1 | c1 | 2 | 44
p1 | c2 | 2 | 4

ans:
date | sum
1 | 81
2 | 48
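
The two aggregate queries can be checked against the slide's fact table with Python's built-in sqlite3 module. This is only a sketch: the column types are assumed, and an ORDER BY is added so the grouped result comes back in a fixed order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()
```

`day1_total` is 81 and `by_day` is `[(1, 81), (2, 48)]`, matching the answers on the slides.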
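Cube Aggregation
Example: computing sums

day 1 | c1 | c2 | c3
p1 | 12 | - | 50
p2 | 11 | 8 | -

day 2 | c1 | c2 | c3
p1 | 44 | 4 | -
p2 | - | - | -

Sum over days:
product | c1 | c2 | c3
p1 | 56 | 4 | 50
p2 | 11 | 8 | -

Sum over days and products:
c1 | c2 | c3
67 | 12 | 50

Sum over days and stores:
p1 | 110
p2 | 19

Sum over all dimensions: 129

Rollup moves from the detailed cells toward these sums; drill-down moves back toward the detail.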
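
The sums in this example can be reproduced with a small roll-up helper over the six fact rows. The function name `rollup` and the tuple layout are choices made for this sketch, not part of the slides.

```python
from collections import defaultdict

# (prodId, storeId, date, amt) rows from the slide's fact table.
SALES = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
         ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]
DIMS = ("prodId", "storeId", "date")


def rollup(keep):
    """Sum amt over every dimension that is not named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for row in SALES:
        out[tuple(row[i] for i in idx)] += row[3]
    return dict(out)


by_prod_store = rollup(("prodId", "storeId"))  # sum over days
by_store = rollup(("storeId",))                # sum over days and products
by_prod = rollup(("prodId",))                  # sum over days and stores
grand_total = rollup(())                       # sum over all dimensions
```

Each call keeps fewer dimensions than the last, which is the roll-up direction; reading the calls in reverse order is the drill-down direction.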
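Aggregation Using Hierarchies
Hierarchy: customer -> region -> country

day 1 | c1 | c2 | c3
p1 | 12 | - | 50
p2 | 11 | 8 | -

day 2 | c1 | c2 | c3
p1 | 44 | 4 | -
p2 | - | - | -

Aggregated by region:
product | region A | region B
p1 | 56 | 54
p2 | 11 | 8
(customer c1 in Region A; customers c2, c3 in Region B)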
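
A short sketch of aggregating along the customer -> region hierarchy, using the region assignment stated on the slide. The dictionary name `REGION_OF` is an assumption for the example.

```python
from collections import defaultdict

# (prodId, customer, date, amt) rows from the slide's fact table.
SALES = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
         ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

# Hierarchy level mapping: customer c1 in Region A; c2 and c3 in Region B.
REGION_OF = {"c1": "A", "c2": "B", "c3": "B"}

by_prod_region = defaultdict(int)
for prod, cust, _day, amt in SALES:
    by_prod_region[(prod, REGION_OF[cust])] += amt
```

Replacing each customer by its parent in the hierarchy before summing is all that the aggregation by region requires.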
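Pivoting

Fact table view (sale):
prodId | storeId | date | amt
p1 | c1 | 1 | 12
p2 | c1 | 1 | 11
p1 | c3 | 1 | 50
p2 | c2 | 1 | 8
p1 | c1 | 2 | 44
p1 | c2 | 2 | 4

Multi-dimensional cube:
day 1 | c1 | c2 | c3
p1 | 12 | - | 50
p2 | 11 | 8 | -

day 2 | c1 | c2 | c3
p1 | 44 | 4 | -
p2 | - | - | -

Pivoted view (summed over days):
product | c1 | c2 | c3
p1 | 56 | 4 | 50
p2 | 11 | 8 | -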
OLAP Operations
Roll-Up
Drill-Down
Slice & Dice
Pivot
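
As a sketch of how slice, dice, and pivot differ, the snippet below applies each to the same six fact rows used in the cube slides. The particular values selected (day 1, product p1, stores c1 and c2) are arbitrary choices for illustration.

```python
# (prodId, storeId, date, amt) rows from the fact table used in the cube slides.
SALES = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
         ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

# Slice: fix one dimension to a single value (date = 1).
slice_day1 = [r for r in SALES if r[2] == 1]

# Dice: restrict several dimensions to subsets, giving a sub-cube.
dice = [r for r in SALES if r[0] in {"p1"} and r[1] in {"c1", "c2"}]

# Pivot: same day-1 cells, viewed with stores as rows and products as columns.
pivot = {}
for prod, store, _day, amt in slice_day1:
    pivot.setdefault(store, {})[prod] = amt
```

Slice and dice reduce the set of cells; pivot keeps the cells and only changes which dimension is read along rows and which along columns.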

[Figures: OLAP Operations; Slicing; Dicing (Sub-cube); Roll-Up; Drill-Down]
Other OLAP
Operations

o Moving Averages
o Growth Rates
o Depreciation
o Currency Conversion
o Statistical Functions
o Top N or Bottom N queries

Conceptual vs. Actual
The cube is a logical way of
visualizing the data in an OLAP setting
Not how the data is actually
represented on disk
Two ways of storing data:
ROLAP: Relational OLAP
MOLAP: Multidimensional OLAP

OLAP & CUBE
Construction of the data cube
is key to the operation of OLAP
The computation process
creates a set of aggregates on
the various dimensions of the
data
The CUBE operator

Approaches to OLAP
Servers

It is all about which DBMS you choose to store your data warehouse data
ROLAP
MOLAP
BOTH - HOLAP
Approaches to OLAP
Servers
Three possibilities for OLAP servers
(1) Relational OLAP (ROLAP)
Relational and specialized relational DBMS to
store and manage warehouse data
OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
Array-based storage structures
Direct access to array data structures
(3) Hybrid OLAP (HOLAP)
Storing detailed data in RDBMS
Storing aggregated data in MDBMS
User access via MOLAP tools

ROLAP
Special schema design: star, snowflake

Proven technology (relational model, DBMS); tends to outperform specialized MDDB, especially on large data sets

Products
IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
ROLAP
Defines complex, multi-dimensional data with
simple model
Reduces the number of joins a query has to
process
Allows the data warehouse to evolve with
relatively low maintenance
Can contain both detailed and summarized data.
ROLAP is based on familiar, proven, and already
selected technologies.
BUT: SQL is awkward for multidimensional manipulation and calculations.
MOLAP

MDDB: a special-purpose data model
Facts stored in multi-dimensional arrays
Dimensions used to index array
Sometimes on top of relational DB
Products
Pilot, Arbor Essbase, Gentia
OLAP Needs
User Needs
Multidimensional view
Excellent Performance
Analytical Flexibility
Real-Time Data Access
High Data Capacity

OLAP Needs: User Needs
Excellent Performance
For example, suppose you have a Sales indicator with six dimensions: Representatives, Products, Customers, Regions, Months, and Years.
MOLAP tools will store a given aggregate, such as the November 1997 government sales of product A504 by representative 1040 in New York, in one cell of the MDDB.
In contrast, ROLAP tools consume 600% more space, because they require a record of seven values (six foreign keys and the actual aggregate) in a relational summary table.
OLAP Needs: User Needs
Excellent Performance
RDBMSs must use several summary tables to store the aggregates
that a MOLAP could store in just one cube. For example, consider a Sales
indicator with three dimensions: Months, Regions, and Products. The
indicator cube will contain seven sets of aggregates:
Sales by month
Sales by product
Sales by region
Sales by month and product
Sales by month and region
Sales by product and region
Sales by product, month, and region
To store these aggregates in an RDBMS, you'd have to create seven summary tables, one for each aggregate set.
HOW MANY SUMMARY TABLES FOR 6 DIMENSIONS?
(Separate fact table and shrunken dimension table approach for storing aggregates)
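
One way to work out the count: each non-empty subset of the dimensions is one aggregate set, so n dimensions give 2^n - 1 sets. The sketch below enumerates them, assuming (as the three-dimension example does) a single level per dimension and no hierarchy levels within a dimension.

```python
from itertools import combinations


def aggregate_sets(dims):
    """Return every non-empty combination of dimensions to aggregate by."""
    return [combo
            for k in range(1, len(dims) + 1)
            for combo in combinations(dims, k)]


three = aggregate_sets(["Months", "Regions", "Products"])
six = aggregate_sets(["Representatives", "Products", "Customers",
                      "Regions", "Months", "Years"])
```

`len(three)` is 7, matching the list above, and `len(six)` is 63, so under these assumptions six dimensions would need 63 summary tables.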
OLAP Needs: User Needs
Analytical Flexibility

Both ROLAP & MOLAP tools offer comparable performance for
Comparative Analysis
Roll-up and Drill-down
Slicing & Dicing
OLAP Needs: User Needs
Real-Time Data Access
MOLAP tools load data into the multidimensional cubes.
Consequently, the data being accessed is only as recent
as the last load.
Some applications require real-time data access
Process of continually refreshing the data attaches higher
costs to operating a MOLAP system
Some MOLAP tools offer reach-through functionality to
access volatile data stored outside the MDDB
Unfortunately, users must be aware of the underlying
database structure
Relational data access is too complex for the typical user

OLAP Needs: User Needs
Real-Time Data Access
ROLAP tools maintain a constant link to the
operational RDBMS, which provides users with
up-to-the-minute, accurate data
(Real-Time Data Warehousing)
Industries & organizations with highly volatile
data particularly benefit from this access to
live, operational data.

OLAP Needs: User Needs
High Capacity Data
MOLAP products are limited by the size of the
cube defined by the multidimensional view.
When dimension elements are predefined, the scope of available data is limited at the outset.
ROLAP tools circumvent this barrier. Dynamic
dimensions are not stored in the predefined
multidimensional model, but fetched at run
time from the RDBMS.

OLAP Needs: User Needs
Easy Development
MOLAP development is straightforward: it requires no fine tuning and creates its own aggregates.
ROLAP tools, on the other hand, require a specific schema for the relational database.
Skilled DBAs must provide the appropriate schema (star or snowflake schema), tune the database, and create the appropriate summary tables.
However, many ROLAP tools are metadata-driven, which means the multidimensional view is generated and maintained more easily.
Hybrid OLAP - HOLAP
o Best of both worlds
o Storing detailed data in RDBMS
o Storing aggregated data in MDBMS
o User access via MOLAP tools
HOLAP
[Figure: on the client, a multidimensional viewer reads multidimensional data, derived data, and metadata from the MDBMS server; the MDBMS server reaches through to the RDBMS server via SQL; a relational viewer reads the RDBMS server directly via SQL.]
ROLAP, MOLAP, or HOLAP
IF
A. You require write access
B. Your data is under 50 GB
C. Your timetable to implement is 60-90 days
D. Lowest level already aggregated
E. Data access on aggregated level
F. You're developing a general-purpose application for inventory movement or assets management
THEN
Consider an MDD/MOLAP solution for your data mart

IF
A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN
Consider an RDBMS/ROLAP solution for your data mart.

IF
A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN
Consider an HOLAP for your data mart

Conclusions
ROLAP: RDBMS -> star/snowflake schema
MOLAP: MDDB -> Cube structures
ROLAP or MOLAP: Data models used play major role in
performance differences
MOLAP: for summarized and relatively smaller volumes of data (100 GB)
ROLAP: for detailed and larger volumes of data
Both storage methods have strengths and weaknesses
The choice is requirement specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.
HOLAP is emerging as the OLAP server of choice
Different forms of
OLAP
Three ways of storing data:
Multidimensional OLAP (MOLAP)
Best Query Performance
Relational OLAP (ROLAP)
Ideal for large databases
Hybrid OLAP (HOLAP)
Best of both worlds!
Relational Database Model

Row | Attribute 1: Name | Attribute 2: Age | Attribute 3: Gender | Attribute 4: Emp No.
Row 1 | Anderson | 31 | F | 1001
Row 2 | Green | 42 | M | 1007
Row 3 | Lee | 22 | M | 1010
Row 4 | Ramos | 32 | F | 1020

The table above illustrates the employee relation.


Multidimensional Database Model
[Figure: a SALES cube with dimensions Customer, Store, Time, and Product, and a FINANCE cube with dimensions Store, Time, and GL_Line; examples with two dimensions and three dimensions.]
The data is found at the intersection of dimensions.
Specialised Multidimensional Tool
Benefits:
Quick access to very large volumes of data
Extensive and comprehensive libraries of complex analysis functions
Strong modeling and forecasting capabilities
Can access multidimensional and relational database structures
Caters for calculated fields
Disadvantages:
Difficulty of changing model
Lack of support for very large volumes of data
May require significant processing power
MOLAP Server
The application layer stores data in a multidimensional structure
The presentation layer provides the multidimensional view
Efficient storage and processing
Complexity hidden from the user
Analysis using preaggregated summaries and precalculated measures
[Figure: DSS client, MOLAP engine (application layer), warehouse.]
ROLAP Server
The warehouse stores atomic data.
The application layer generates SQL for the three-dimensional view.
The presentation layer provides the multidimensional view.
[Figure: DSS client, ROLAP engine (application layer), multiple SQL, warehouse server.]
Choosing a Reporting Architecture
Business needs
Potential for growth
Interface
Enterprise architecture
Network architecture
Speed of access
Openness
[Figure: query performance (OK to good) plotted against analysis (simple to complex), positioning MOLAP and ROLAP.]
Modeling
Warehouses differ from operational
structures:
Analytical requirements
Subject orientation
Data must map to subject oriented
information:
Identify business subjects
Define relationships between subjects
Name the attributes of each subject
Modeling is iterative
Modeling tools are available
Creating the Dimensional Model
Identify fact tables
Translate business measures into
fact tables
Analyze source system information
for additional measures
Identify base and derived measures
Document additivity of measures
Identify dimension tables
Link fact tables to the dimension
tables
Create views for users
Dimension Tables
Dimension tables have the following
characteristics:
Contain textual information that
represents the attributes of the
business
Contain relatively static data
Are joined to a fact table through a foreign key reference
[Figure: Product, Channel, Customer, and Time dimensions around a central table of facts (units, price).]
Fact Tables
Fact tables have the following
characteristics:
Contain numeric measures (metrics) of
the business
May contain summarized (aggregated)
data
May contain date-stamped data
Have key value that is typically a
concatenated key composed of the
primary keys of the dimensions
Joined to dimension tables through
foreign keys that reference primary keys
in the dimension tables
Dimensional Model (Star Schema)
[Figure: a central fact table of facts (units, price) surrounded by the Product, Channel, Customer, and Time dimension tables.]
Star Schema Model
Central fact table
Radiating dimensions
Denormalized model

Product Table: Product_id, Product_desc, ...
Store Table: Store_id, District_id, ...
Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_dollars, Sales_units, ...
Item Table: Item_id, Item_desc, ...
Time Table: Day_id, Month_id, Period_id, Year_id
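
A minimal sketch of this star schema using Python's built-in sqlite3 module. The column names follow the slide, but the Item table is left out for brevity, the time dimension is named `time_dim`, and all sample rows and data types are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_desc TEXT);
    CREATE TABLE store    (store_id INTEGER PRIMARY KEY, district_id INTEGER);
    CREATE TABLE time_dim (day_id INTEGER PRIMARY KEY, month_id INTEGER,
                           period_id INTEGER, year_id INTEGER);
    -- Central fact table: foreign keys to each dimension plus the measures.
    CREATE TABLE sales_fact (
        product_id    INTEGER REFERENCES product(product_id),
        store_id      INTEGER REFERENCES store(store_id),
        day_id        INTEGER REFERENCES time_dim(day_id),
        sales_dollars REAL,
        sales_units   INTEGER
    );
    INSERT INTO product  VALUES (1, 'Pen'), (2, 'Pad');
    INSERT INTO store    VALUES (10, 100), (11, 200);
    INSERT INTO time_dim VALUES (1, 1, 1, 2024), (2, 1, 1, 2024), (3, 2, 1, 2024);
    INSERT INTO sales_fact VALUES
        (1, 10, 1, 5.0, 5), (1, 11, 2, 3.0, 3),
        (2, 10, 3, 8.0, 2), (1, 10, 3, 2.0, 2);
""")

# A typical star join: the fact table joined to two dimensions, then grouped.
sales_by_product_month = conn.execute("""
    SELECT p.product_desc, t.month_id, SUM(f.sales_dollars)
    FROM sales_fact f
    JOIN product  p ON p.product_id = f.product_id
    JOIN time_dim t ON t.day_id = f.day_id
    GROUP BY p.product_desc, t.month_id
    ORDER BY p.product_desc, t.month_id
""").fetchall()
```

Every dimension is one join away from the fact table, which is why queries against a star schema stay short.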
Star Schema Model

Easy for users to understand


Fast response to queries
Simple metadata
Supported by many front end
tools
Less robust to change
Slower to build
Does not support history
Snowflake Schema Model
Product Table: Product_id, Product_desc
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Sales Fact Table: Item_id, Store_id, Sales_dollars, Sales_units
Time Table: Week_id, Period_id, Year_id
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
Snowflake Schema Model

Direct use by some tools


More flexible to change
Provides for speedier data
loading
May become large and
unmanageable
Degrades query performance
More complex metadata
Using Summary Data
Phase 3: Modeling summaries

Provides fast access to precomputed data
Reduces use of I/O, CPU, and
memory
Is distilled from source systems and
precalculated summaries
Usually exists in summary fact
tables
Architecture of Data Warehouse
Architecture of a Data
Warehouse with a Staging Area
Architecture of a Data
Warehouse with a Staging Area
and Data Marts
Incorrect Data in the Data Warehouse
The architect needs to know what to do about incorrect data in the data warehouse.
The first assumption is that incorrect data arrives in the data warehouse on an exception basis.
If data is being entered incorrectly into the data warehouse on a wholesale basis, then it is the duty of the architect to find the offending process and make adjustments.
How to correct
To correct the offending data, an architect can do one of three things.
Example: suppose on July 1 an entry of Rs 500 is made in the operational system, on July 2 a snapshot is taken into the data warehouse, and on July 15 it is discovered that the July 1 entry should have been 250 rather than 500.
Then
Choice 1. Go back to the July 2 snapshot and update it to 250 instead of 500. This can create problems if any report was produced between July 2 and July 15.
How to correct
Choice 2.
Enter offsetting entries, i.e. make two entries: first debit 500, then credit 250. Sometimes this can also create problems.
Choice 3.
Reset the account to the proper value. This does not correct the original error.
Depending on the situation, you can choose any of these.
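
The three choices can be compared on the Rs 500 versus Rs 250 example. The sketch below is one possible reading, with assumed representations: choices 1 and 2 are modelled as dated entries whose sum is the balance, and choice 3 as dated balance snapshots.

```python
# The warehouse as first loaded: the 2 July snapshot of the wrong entry.
original = [("07-02", 500)]

# Choice 1: rewrite the 2 July entry; reports run before 15 July no longer match.
choice1 = [("07-02", 250)]

# Choice 2: keep history and append offsetting entries on 15 July.
choice2 = original + [("07-15", -500), ("07-15", 250)]


def balance(entries):
    """Net value of a list of (date, amount) entries."""
    return sum(amount for _date, amount in entries)


# Choice 3: record a new balance snapshot with the proper value.
choice3 = {"07-02": 500, "07-15": 250}
current_balance = choice3[max(choice3)]  # the latest snapshot wins
```

All three end at 250, but only choice 1 changes what a 2 July report would show; under choice 3 the 2 July snapshot still reads 500, which is the sense in which the error itself is not corrected.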
Structuring Data in Data Warehouse
The simplest and most common data structures found in the data warehouse are:

1. Simple cumulative structure, i.e. daily transactions reported from the operational environment.
Example: Jan 1, Jan 2, Jan 3 data

2. Rolling summary data
After accumulation, the daily data is summarized into data warehouse records.
Example: week 1 data, week 2 data, month 1 data, month 2 data.
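
A small sketch of turning cumulative daily data into rolling summaries. The daily amounts (one unit per day for January and February 2023) and the use of ISO week numbers are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical cumulative structure: one total per day, 1 Jan to 28 Feb 2023.
daily = {date(2023, 1, 1) + timedelta(days=i): 1 for i in range(59)}

weekly = defaultdict(int)
monthly = defaultdict(int)
for day, amt in daily.items():
    weekly[tuple(day.isocalendar())[:2]] += amt  # key: (ISO year, ISO week)
    monthly[(day.year, day.month)] += amt        # key: (year, month)
```

The monthly summary holds 31 for January and 28 for February; the weekly and monthly records carry the same total as the daily data, just at a coarser level.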
Reporting and the architected environment
Once the data warehouse has been constructed, all reporting and informational processing is done from there.

1. Operational reporting for the clerical level:
It focuses on the line item (detailed information).
Example: a cashier has to check the whole day's transactions in the evening for a balance check.

2. Data warehouse reporting for the management level:
It focuses on summary information.
Example: a bank vice president has to decide how many ATMs to place in a particular city, so he does not need one day's transactions; he needs a one-month or one-year summary of data to make the decision.
Purging Warehouse Data
Data purging means deleting data from the DW.
Data does not just pour into a data warehouse; it has its own life cycle within the data warehouse.
Purging does not always mean the data is fully removed; it may be rolled up to a higher level of summary, where the detail is lost.
Granularity
Refers to the level of detail of the data
Dual levels of granularity:
1. Low level of granularity (more detail)
2. High level of granularity (less detail, i.e. summary)

Most data in a data warehouse is at the high level, but it also holds low-level detail for atomic queries.
Data Granularity in
Database
Data Granularity
A significant difference between an
operational system and a data
warehouse is the granularity of the
data stored.
An operational system typically stores
data at the lowest level of granularity:
the maximum level of detail.
Granularity in Data
Warehouse
However, because the data warehouse contains
data representing a long period in time, simply
storing all detail data from an operational
system can result in an overworked system that
takes too long to query.
A data warehouse typically stores data in
different levels of granularity or summarization,
depending on the data requirements of the
business. If an enterprise needs data to assist
strategic planning, then only highly
summarized data is required.
Granularity in Data
Warehouse
The lower the level of granularity of
data required by the enterprise, the
higher the number of resources
(specifically data storage) required to
build the data warehouse. The
different levels of summarization in
order of increasing granularity are:
Current operational data
Historical operational data
Granularity in Data
Warehouse
Aggregated data
Metadata

Current and historical operational data are taken, unmodified, directly from operational systems. Historical data is operational-level data no longer queried on a regular basis, and is often archived onto secondary storage.
So to know what will happen in the future, you need a technique called data mining.
Book you can refer to
Data Mining: Concepts and Techniques
Authors: Jiawei Han and Micheline Kamber
Publisher: Morgan Kaufmann Publishers
