
Data Warehousing and Mining

By
Mrs. Chhaya S. Pawar
BE-A / TE-B
2018-19

CHPT 1:INTRODUCTION TO DATA WAREHOUSING


CHAPTER 1
INTRODUCTION TO DATA WAREHOUSING

INCREASING DEMAND FOR STRATEGIC
INFORMATION
• Strategic information is not needed to run the day-to-day business
• It is critical for the survival of the corporation in a highly competitive world
• Critical business decisions depend upon the availability of strategic information
• It is needed for making strategies and setting goals and objectives
• For example: retain current customers, increase sales in the north-west region, achieve a better product launch next year
• Examples: Jio, a flat 50% sale in a mall
• An organization with a data warehouse can gain an advantage in product development, marketing, pricing strategy, production time, historical analysis, forecasting and customer satisfaction. However, data warehouses can also be very expensive to design and implement

CHARACTERISTICS OF STRATEGIC
INFORMATION
• Integrated: overall view
• Data Integrity: Accurate data
• Accessible: easily accessible by users
• Timely: info must be available in stipulated time

THE INFORMATION CRISIS
• Organizations hold huge amounts of data, almost doubling every year
• This data is generated by their day-to-day operations
• Existing information systems are unable to handle it
• BUT
• The information crisis arises because the data that is useful for making strategic decisions is not available
• Reasons: data is held in different formats, on different platforms and in different data structures
• For decision making, data must be available in a format that enables executives and managers to analyse trends

NEED OF DATA WAREHOUSING
It all starts with a decision that needs to be made, or a question that needs answering.
For example:
• A marketing manager might want to understand where to invest their online
advertising dollars
• A call centre manager might want to know the optimum number of staff to hire for
their call centre, or
• A sales manager might want to identify the customers that deliver the most profit – so
they can find more of these customers.
• We want to make an informed, and objective, data driven decision, and we need
some data to answer these questions.

Challenges
• Not structured for reporting
• Hard to access
• Not integrated with all the other data to give a complete picture
This is where data warehousing comes in. Data warehousing allows you to:
• Extract data from your organizational systems
• Load it into a centralized location
• Transform and integrate the data into a format optimized for analytics
• The data warehouse can be used as a source for your data visualisation tool to
provide reports & dashboards, for advanced analytics, and for a variety of other
purposes.
• The established enterprise data warehousing tools vendors, Teradata,
IBM/Netezza, Oracle/Exadata, Microsoft, and SAP/Sybase have released
new appliance product families.

MAJOR OPERATIONS DONE BY DW TOOL

• Batch and near real-time loads to integrate data from multiple sources
(internal and external)
• Basic reporting with no drill-down/ drill-across
• Online analytical processing (OLAP)
• Predictive analytics
• Operational business intelligence

EXAMPLE: CLINICAL DATA REPOSITORY VS
CLINICAL DATA WAREHOUSE
• A repository is not an adequate solution, as it is designed for patient care and not for analysis
• It cannot integrate with non-clinical data stores
• It cannot record patient satisfaction scores
• Reporting tool is not standardized
• Tools are not standardized
• Data is not always secure
Hence it is inadequate for quality and cost improvement purposes

INABILITY OF PAST DECISION SUPPORT SYSTEMS
• Cycle of request and report:
user needs info -> requests report from IT -> IT creates ad hoc queries -> IT sends requested reports -> user hopes to find the right answer
• Too many ad hoc requests, with a variety of reports required
• Applications, formats and platforms are all different
• Required reports keep changing, e.g. in format
• Dependence on IT
• IT overloaded with multiple cycles
• Hence IT is not able to provide a flexible environment for conducting analysis

PRESENCE OF BETTER TECHNOLOGY

• Power of microprocessors doubling every two years


• Processing speed
• reducing prices of storage
• Increase in network bandwidth
• Heterogeneous hardware and software
• Legacy systems with new applications

OPERATIONAL VS DECISION SUPPORT SYSTEM
Attribute | Operational Systems | Decision Support Systems
Data content | Current | Archived, summarised, derived
Data structure | Optimised for transactions | Optimised for complex queries
Access frequency | High | Medium to low
Access type | Read, update, delete | Read
Usage | Predictable, repetitive | Ad hoc, random
Response time | Sub-second | Several seconds to minutes
Number of users | Large | Relatively small
Characteristics | Operational processing | Informational processing
Orientation | Transaction | Analysis
Users | Clerk, DBA, database professional | Executives, managers
Function | Day-to-day operations | Decision support, long-term requirements
Database design | ER based, application oriented | Star/snowflake, subject oriented
Summarization | Highly detailed | Summarised, consolidated
View | Detailed | Summarised
Unit of work | Short, simple transaction | Complex queries
Records accessed | Tens | Millions
Database size | 100 MB to GB | 100 GB to TB
Priority | High performance, high availability | High flexibility, end-user autonomy
Indexes | Few | Many
Joins | Many | Some
Duplicated data | Normalised DBMS | Denormalised DBMS
Derived data and aggregations | Rare | Common

WHAT DATA WAREHOUSE CAN DO..

• Immediate information delivery


• Integration of data from within and outside the organization
• Provides insight into the future
• Enables the users to look at the same data in different ways
• Freedom from dependency on IT

WHAT DATA WAREHOUSE CANNOT DO…
• It cannot create data on its own
• If the data is dirty, the DW will not be able to give correct results until the data
is first cleaned

APPLICATIONS OF DATA WAREHOUSE
• Retail: customer loyalty, targeted marketing
• Financial, banking: risk management, fraud detection
• Airlines: route profitability, promotional schemes
• Manufacturing: cost reduction, resource management
• Government: development, manpower planning, cost control
• Insurance companies, healthcare, travel, inventory, telecommunications, ...
• Health services: population at risk
• Insurance services
• Data service providers: value-added services
• Utilities: power usage analysis

TOP 5 DATA WAREHOUSES IN THE MARKET TODAY

BENEFITS OF DATA WAREHOUSING

Tangible Benefits
• Better decisions in terms of cost and quality
• Enhanced asset liability management
• Cost of product introduction lowers with targeted marketing
With annual sales of 200 million dollars, even a 1% improvement in sales will
bring 2 million dollars of additional revenue

Intangible Benefits
• Improved productivity by keeping data at one place
• Enhanced customer relations by knowing individual customer
• CRM improved with customization
• Enable the reengineering of business processes

SO KEY AREAS ARE..

• Customer experience
• Risk mitigation
• Finance transformation
• Product innovation
• Asset optimization
• Operational excellence
(in the words of Teradata)

COST INCURRED IN DEPLOYING DATA
WAREHOUSE
Recurring cost
• h/w maintenance
• s/w maintenance
• Middleware technology
• Data refreshing
• Integration of data
• Data transformation
• Maintenance of data model
• Data archival
One-time cost
• Hard disk
• CPU
• Network hardware, s/w
• DBMS s/w
• Middleware h/w, s/w
• Integration of data
• Data transformation
• DB design
• Data model definition
• Network related issues
• Data dictionary
- Memory cost
- Usefulness
- Data management

DATA WAREHOUSE DEFINED

Data warehouses are a special type of database, built specifically for
getting information OUT rather than putting data IN.
Definition: According to W. H. Inmon (1992), considered to be the father of data
warehousing:
A data warehouse is a subject-oriented, integrated, non-volatile, time-variant
collection of data in support of management's decisions.

FEATURES OF DATA WAREHOUSE

• Subject oriented
• Integrated
• Non volatile
• Time variant

SUBJECT ORIENTED

• Data is stored by business subjects rather than by application
• When a subject occurs in many places, it is difficult for the decision maker, who wants the complete picture
• Provides a simple and concise view around subjects, excluding data that is not useful in the decision support process
• Subject orientation gives a complete picture
• Data comes from diverse sources
• This raises data integration issues, such as differences in description, encoding, units etc.
• Hence data cleansing and data transformation are a must

INTEGRATED DATA

• Data comes from diverse applications
• Source of data is invisible to the decision maker
• Two tasks: data cleansing, data transformation
• Data cleansing: fixes errors arising from data entry, metadata and problems with applications
• Data transformation: brings data into one consistent format, e.g. description,
encoding, units, coding

NON VOLATILE DATA

• Business transactions do not update the DW; they update the operational systems
• Operational systems are volatile, whereas data in the DW remains unchanged once written
• If any value in a record changes, a new record is added

TIME VARIANT DATA
• In a data warehouse, the decision maker can view data across time at
whichever level of detail they want
• Operational systems contain only current data, as they process day-to-day
transactions
• This feature allows study and analysis of the past, correlation of that
information with the present scenario and, finally, forecasting of the
future
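
The two properties above can be sketched together in a few lines of Python. This is only an illustration under assumptions: the table layout, the field names (customer_id, city, effective_from) and the helper functions are hypothetical and do not come from the slides.

```python
from datetime import date

# Hypothetical warehouse table: changes are appended as new rows, never updated in place.
warehouse = []

def record_change(customer_id, city, effective_from):
    """Non-volatile: add a new version of the record, leave older rows untouched."""
    warehouse.append({"customer_id": customer_id, "city": city,
                      "effective_from": effective_from})

def city_as_of(customer_id, as_of):
    """Time variant: return the city that was valid for the customer on a given date."""
    versions = [r for r in warehouse
                if r["customer_id"] == customer_id and r["effective_from"] <= as_of]
    if not versions:
        return None
    return max(versions, key=lambda r: r["effective_from"])["city"]

record_change(101, "Pune", date(2017, 1, 1))
record_change(101, "Mumbai", date(2018, 6, 1))  # the value changed, so a new row is added

print(city_as_of(101, date(2017, 12, 31)))
print(city_as_of(101, date(2018, 12, 31)))
print(len(warehouse))
```

Because the old row is kept, the same question can be answered for any point in time, which is what makes analysis of the past possible.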

INFORMATION FLOW MECHANISM

How the huge amount of data in the source systems gets delivered to the data
warehouse users as useful pieces of information

ETL PROCESS

Steps involved in the process of transforming data into information

STEP 1: SELECT SOURCE DATA

FEATURES OF PRODUCTION DATA

• Main source
• Lot of variation, as data is collected from different platforms, operating systems, DBMSs etc.
• Disparity in data is the biggest challenge

FEATURES OF INTERNAL DATA

• Data from private files
• Includes data which is not stored on computer
• Personal spreadsheets, customer profiles or data that we keep while dealing
with customers
• Useful when the contribution from each customer is significant
• Additional complexity to process and integrate such data

FEATURES OF ARCHIVED DATA

• In every operational system, old data is periodically taken and stored in archived files
• For example, it is shifted from online disks to magnetic tape
• Stage 1: archived to an archival database, still online
• Stage 2: archived to flat files on disk storage
• Stage 3: oldest data moved to tape and kept offsite

FEATURES OF EXTERNAL DATA

• From business magazines, industry newsletters, technology reports, sales and marketing analysis reports, competitive analysis reports
• To get a broader and clearer view of the data
• The challenge here is frequency of availability, which requires constant monitoring
• Another issue: the data needs to be converted to the internal format
• Data granularity
• Data unpredictability

DATA STAGING AREA

WHY SEPARATE DATA STAGING AREA

• It is a place where all the extracted data is temporarily stored and prepared
for loading into the data warehouse
• It isolates the raw data
• DW users cannot access the staging area, which protects security and process quality
• It eases the development of a central metadata repository, which maintains
documentation of the operational systems, ETL process, DW, tools and predefined
reports

DATA PROCESSING AT DATA STAGING AREA

TRANSFORM THE EXTRACTED DATA
• Translating coded values
• Deriving new calculated value
• Merging and splitting of fields
• Aggregating summarizing
• Generating primary, foreign keys
• Applying data validation
• Resolving synonyms
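
A minimal sketch of several of the tasks listed above, applied to rows in a staging area. The lookup tables, field names and sample rows are hypothetical and chosen only for illustration.

```python
# Hypothetical lookup tables and source rows, for illustration only.
STATUS_CODES = {"A": "Active", "C": "Closed"}
CITY_SYNONYMS = {"Bombay": "Mumbai", "Calcutta": "Kolkata"}

source_rows = [
    {"src_id": "A-7", "status": "A", "city": "Bombay", "qty": 3, "unit_price": 250.0},
    {"src_id": "B-2", "status": "C", "city": "Pune", "qty": 1, "unit_price": 900.0},
]

def transform(rows):
    staged = []
    for surrogate_key, row in enumerate(rows, start=1):
        if row["qty"] <= 0:                                       # applying data validation
            raise ValueError(f"invalid quantity in {row['src_id']}")
        staged.append({
            "sale_key": surrogate_key,                            # generating a primary key
            "source_id": row["src_id"],
            "status": STATUS_CODES[row["status"]],                # translating coded values
            "city": CITY_SYNONYMS.get(row["city"], row["city"]),  # resolving synonyms
            "amount": row["qty"] * row["unit_price"],             # deriving a calculated value
        })
    return staged

staged_rows = transform(source_rows)
print(staged_rows[0])
```

A real staging job would read its code tables and synonym lists from metadata rather than hard-coding them as done here.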

LOADING TRANSFORMED DATA

• Initial loading of data: done once, moves a large amount of data
• Refresh cycle of the data warehouse: data is updated on a regular schedule (daily, weekly, monthly,
quarterly)

DELIVER INFORMATION TO USERS

METADATA
• Contains structure of data from customer and programmer perspective
• Source system
• Transformation process
• Data model
• History of data extraction

ROLE OF METADATA
• Stores data about data
• It is the key that gives users and developers a road map of the information in the DW

IMPORTANCE OF METADATA

• Communication through various applications becomes possible
• Helps users create their own reports and queries
• Gives the meaning of data elements
• Needed to build data extraction and transformation components
• Helps in initial loading

TYPES OF METADATA

• Operational metadata
• Extraction and transformational metadata
• End user metadata

DATA WAREHOUSE ARCHITECTURE

THREE TIER ARCHITECTURE

DESIGN STRATEGIES

BUILDING DATA MARTS

PRACTICAL APPROACH FOR BUILDING DATA MARTS
• Suggested by Ralph Kimball, combining both approaches
• Gather requirements from user
• Define them at enterprise level
• Establish scope of DW & its intended use
• Define and prioritize requirements & info needs of the users that DW will address
• Design the Architecture of overall Warehouse
• Define subject areas and sequence implementation
• Conform and standardize data that would be stored in DW
• Implement the DW as a series of supermarts

CHAPTER 2
DIMENSIONAL MODELLING
• Preparing the logical design of the data warehouse
• Data tables are designed, physically created and linked with each other
Data Warehouse Modelling vs Operational Database Modelling

DW modelling | Operational DB modelling
Directly accessed by users | Users do not interact directly
Must allow easy data access | Predefined queries
Data analysis | Day-to-day operations
Current as well as historic data | Current data
Subject oriented | Application oriented

DIMENSIONAL MODEL VS ER MODEL

Dimensional Model | ER Model
Simpler | Complex
Denormalized | Normalized
Captures business measures | Expresses microscopic relationships
Designed to answer queries on the overall business process to reveal trends | Well suited to answer queries at transaction level

FEATURES OF GOOD DIMENSION MODEL

• Best data access
• Query centric
• Optimised for query and analysis
• Depicts the way in which the fact table interacts with the dimension tables
• Allows equal interaction of every dimension with the fact table
• Enables drill-down and roll-up operations

DIMENSIONAL ANALYSIS

• It means collecting information requirements for the data warehouse project
• All users think in terms of business dimensions
• Users may not be able to tell exactly what they want from the data warehouse,
but they can describe how they think about the business
• Then we can gather data about these business dimensions

DIMENSION HIERARCHIES

• Traversing different hierarchical levels of business dimensions to get details at different levels
• A hierarchy is the path for drilling down or rolling up
• Facts or metrics: the numbers that measure the success of the business; the numbers that users analyse are the measurements

BUSINESS DIMENSIONS

DIMENSION HIERARCHIES

INFORMATION PACKAGE DIAGRAM
IPDs are used to record information requirements and the various insights gained during requirement gathering, in terms of
business dimensions and facts. They are helpful for further development of the DW.

FACT TABLES AND DIMENSION TABLES FORMED
FROM GIVEN IPD

THREE TYPES OF DIMENSIONAL MODEL

• Star Schema
• Snowflake Schema
• Fact Constellation Schema

STAR SCHEMA
• Fact table in the middle
• Dimension tables arranged around it
• Every dimension table is linked to the fact table through its primary key, which appears in the fact table as a foreign key
• Dimension tables are not connected to each other

HOW DOES A QUERY EXECUTE ?

Example:
How much profit in dollars did the salesperson David make on 2 January 2006
by selling trousers to Jenny at the New Delhi store?
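
One way to picture how such a query runs against a star schema is the sketch below, which uses Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made for illustration (the customer is spelled "Jenny" here); the slides do not give the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim        (time_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE product_dim     (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE customer_dim    (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE salesperson_dim (salesperson_key INTEGER PRIMARY KEY, salesperson_name TEXT);
CREATE TABLE store_dim       (store_key INTEGER PRIMARY KEY, store_city TEXT);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    salesperson_key INTEGER REFERENCES salesperson_dim(salesperson_key),
    store_key INTEGER REFERENCES store_dim(store_key),
    profit_dollars REAL
);
INSERT INTO time_dim VALUES (1, '2006-01-02'), (2, '2006-01-03');
INSERT INTO product_dim VALUES (1, 'Trousers'), (2, 'Shirt');
INSERT INTO customer_dim VALUES (1, 'Jenny');
INSERT INTO salesperson_dim VALUES (1, 'David'), (2, 'Asha');
INSERT INTO store_dim VALUES (1, 'New Delhi');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 1, 20.0), (1, 1, 1, 1, 1, 16.0),  -- the rows the question asks about
    (1, 2, 1, 1, 1, 5.0),                          -- different product
    (2, 1, 1, 1, 1, 9.0),                          -- different date
    (1, 1, 1, 2, 1, 7.0);                          -- different salesperson
""")

# Each dimension filters the fact table through its key; the measure is then summed.
query = """
SELECT SUM(f.profit_dollars)
FROM sales_fact f
JOIN time_dim t         ON f.time_key = t.time_key
JOIN product_dim p      ON f.product_key = p.product_key
JOIN customer_dim c     ON f.customer_key = c.customer_key
JOIN salesperson_dim sp ON f.salesperson_key = sp.salesperson_key
JOIN store_dim s        ON f.store_key = s.store_key
WHERE sp.salesperson_name = 'David'
  AND t.full_date = '2006-01-02'
  AND p.product_name = 'Trousers'
  AND c.customer_name = 'Jenny'
  AND s.store_city = 'New Delhi'
"""
total_profit = conn.execute(query).fetchone()[0]
print(total_profit)
```

The constraints are applied on the small dimension tables, and only the matching fact rows contribute to the sum.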

PROS AND CONS OF STAR SCHEMA
• Easy to understand
• Optimizes navigation
• Enhances query execution
• Analytical flexibility- drill down or roll up
• Easy to reconfigure
• Best for ad hoc queries
• Enable summarization

ITS DEFICITS ARE COMPROMISES THAT MAKE IT
WORK
• Narrow scope in terms of facts and dimensions
• Maintaining and adding more historical data creates problems
• Moderate performance
• Not suitable for storing detailed data

SNOWFLAKE SCHEMA

• Called so because its diagram resembles a snowflake
• Variation of the star schema in which the data is further split into additional tables
• Normalizes dimension tables to eliminate redundancy
• Dimension data is grouped into multiple tables instead of one large table
• Used when a dimension table becomes too big for a star schema to handle

PROS AND CONS OF SNOWFLAKE SCHEMA
Benefits
• Small savings on storage space
• Easy to update and maintain
Downside
• Less intuitive for end user
• Complex structure
• Navigation difficult
• Query performance degrades

AGGREGATE FACT TABLE

• Contains precalculated summaries derived from the most granular (detailed) fact table
• Designed to reduce runtime processing
• Stores data needed for multiple executions of the same query
• High performance of specific task
• Good for repetitive tasks

NEED FOR BUILDING AGGREGATE FACT TABLE
• Large size of fact table
• To speed up query execution
So,
An aggregate table has far fewer rows than the base fact table,
so queries executed against it are faster
Limitations: must be re-aggregated whenever the source data changes;
narrow applicability, hence limited interactive use
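
The idea can be sketched as follows. The base fact rows, the chosen grain (month and product) and the function name are assumptions made for this illustration only.

```python
from collections import defaultdict

# Hypothetical base fact rows at the most granular level: one row per sale.
base_fact = [
    {"date": "2018-01-05", "product": "Trousers", "store": "New Delhi", "sales": 40.0},
    {"date": "2018-01-19", "product": "Trousers", "store": "New Delhi", "sales": 60.0},
    {"date": "2018-01-07", "product": "Shirt", "store": "Mumbai", "sales": 25.0},
    {"date": "2018-02-02", "product": "Trousers", "store": "Mumbai", "sales": 30.0},
]

def build_monthly_product_aggregate(rows):
    """Precalculate sales per (month, product); the store dimension is summed away."""
    totals = defaultdict(float)
    for row in rows:
        month = row["date"][:7]          # "YYYY-MM" prefix of the ISO date
        totals[(month, row["product"])] += row["sales"]
    return dict(totals)

aggregate_fact = build_monthly_product_aggregate(base_fact)

# A monthly report now reads one precomputed row instead of scanning the base table.
print(aggregate_fact[("2018-01", "Trousers")])
print(len(aggregate_fact), "aggregate rows vs", len(base_fact), "base rows")
```

If a new sale is added to base_fact, the aggregate has to be rebuilt, which is the re-aggregation limitation noted above.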

FACT CONSTELLATION SCHEMA

• Collection of related star schemas, i.e. a family of stars
• Disadvantage: complicated design
• Reasons for creating families of star schemas:
- to support aggregate fact tables and derived dimension tables
- to support core and custom tables
- to support snapshot and transaction tables

TO SUPPORT AGGREGATE FACT TABLE AND
DERIVED DIMENSION TABLE

TO SUPPORT SNAPSHOT AND TRANSACTION TABLE

• Example: telephone company
• Snapshot table contains each customer's account balance
• Transaction table contains daily or weekend transaction details
• Example: bank
• Transaction table: the amount of each individual customer transaction
• Snapshot table: balance at the end of the day

CORE AND CUSTOM TABLE

• How to track dissimilar services, e.g. in a bank
• Here the core fact table holds the metrics that are common to all types of
accounts, and each custom fact table contains metrics specific to that line of
service

FACTLESS FACT TABLE


• Contains no measures (facts), only keys to the dimensions
• It is designed just to record the occurrence of events
• Example: student attendance in classes
• It can answer many interesting questions:
which classes are most heavily attended, which are most consistently attended, who taught the most
students
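
A small sketch of how a table without measures can still answer such questions: the answers come from counting rows. The attendance rows and key values below are hypothetical.

```python
from collections import Counter

# Each row only records that an event happened; there is no numeric measure.
# Columns: (date_key, class_key, student_key, teacher_key)
attendance_fact = [
    ("2018-07-02", "DWM", "S1", "T1"),
    ("2018-07-02", "DWM", "S2", "T1"),
    ("2018-07-03", "DWM", "S1", "T1"),
    ("2018-07-02", "AI", "S3", "T2"),
]

# Which class is most heavily attended? Count the rows per class.
attendance_per_class = Counter(class_key for _, class_key, _, _ in attendance_fact)
most_attended = attendance_per_class.most_common(1)[0][0]

# Who taught the most students? Count distinct students per teacher.
students_per_teacher = {}
for _, _, student_key, teacher_key in attendance_fact:
    students_per_teacher.setdefault(teacher_key, set()).add(student_key)
busiest_teacher = max(students_per_teacher, key=lambda t: len(students_per_teacher[t]))

print(most_attended, attendance_per_class[most_attended])
print(busiest_teacher, len(students_per_teacher[busiest_teacher]))
```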

UPDATES TO THE DIMENSION TABLE

• slowly changing dimensions

TYPE 1 CHANGES:CORRECTION OF ERRORS

TYPE 2 CHANGES: PRESERVATION OF HISTORY

TYPE 3 CHANGES: TENTATIVE SOFT REVISIONS
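
The slides give only the titles of the three change types, so the sketch below follows the commonly used interpretation: Type 1 overwrites, Type 2 adds a new row, Type 3 keeps the old value in an extra column. All function names, column names and sample rows are hypothetical.

```python
def apply_type1(row, attribute, new_value):
    """Type 1: overwrite the value in place; no history is kept."""
    row[attribute] = new_value

def apply_type2(dimension, old_row, attribute, new_value, change_date, new_key):
    """Type 2: close the current row and add a new row with a new surrogate key."""
    old_row["end_date"] = change_date
    new_row = dict(old_row, **{attribute: new_value})
    new_row.update(key=new_key, start_date=change_date, end_date=None)
    dimension.append(new_row)
    return new_row

def apply_type3(row, attribute, new_value):
    """Type 3: keep the previous value in an 'old_<attribute>' column next to the new one."""
    row["old_" + attribute] = row[attribute]
    row[attribute] = new_value

customer_dim = [{"key": 1, "name": "Jenny Smyth", "city": "Pune",
                 "start_date": "2017-01-01", "end_date": None}]
row = customer_dim[0]

apply_type1(row, "name", "Jenny Smith")            # a spelling error is corrected
current = apply_type2(customer_dim, row, "city", "Mumbai", "2018-06-01", new_key=2)

territory_row = {"salesperson": "David", "territory": "North"}
apply_type3(territory_row, "territory", "North-West")  # tentative reassignment

print(customer_dim)
print(territory_row)
```

After the Type 2 change, facts recorded before June 2018 still point at key 1 (Pune), while new facts use key 2 (Mumbai), which is how history is preserved.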

CHAPTER 3
ETL PROCESS
DATA EXTRACTION
EXTRACTING DATA FOR REFRESHING
• Immediate data extraction: occurs in real time
- Capture through transaction logs: selects all committed transactions from the log; no
extra overhead; useful only when the source systems are database applications
- Capture through database triggers: the output of the trigger program is written to a separate
file from which the data is extracted; reliable, and before and after images are available, but
adds extra overhead
- Capture in source applications: the source applications are modified to record all adds, deletes
etc. made to both source and database files; causes performance degradation
• Deferred data extraction: data capture is done at a later point in time
- Capture based on date and timestamps: records are selected based on their
timestamps; works with every type of system; good performance when the number of changed records
is small; deletions need special handling
- Capture by comparing files: the last option if nothing else works; compares two
snapshots of the data, so prior copies of the relevant data must be kept; simple, but a full
comparison becomes inefficient
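
The two deferred techniques can be sketched as below. The record layout, the last_modified field and the snapshot format (a dictionary keyed by primary key) are assumptions made for illustration.

```python
def capture_by_timestamp(rows, last_extract_time):
    """Select records modified after the previous extract (ISO 8601 strings compare correctly)."""
    return [r for r in rows if r["last_modified"] > last_extract_time]

def capture_by_comparison(previous, current):
    """Compare two snapshots keyed by primary key and classify the differences."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items() if k in previous and previous[k] != v}
    return inserts, updates, deletes

source_rows = [
    {"id": 1, "city": "Pune", "last_modified": "2018-07-01T09:00:00"},
    {"id": 2, "city": "Thane", "last_modified": "2018-07-02T14:30:00"},
]
changed = capture_by_timestamp(source_rows, "2018-07-02T00:00:00")

yesterday = {1: "Pune", 2: "Mumbai", 3: "Nashik"}
today = {1: "Pune", 2: "Thane", 4: "Nagpur"}
inserts, updates, deletes = capture_by_comparison(yesterday, today)

print([r["id"] for r in changed])
print(inserts, updates, deletes)
```

A deleted row simply disappears from the source, so the timestamp method never sees it; the snapshot comparison does detect it, at the price of keeping and scanning full copies.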

MANAGING REFERENCE TABLES

• Every organization has some reference tables
• First technique: capture a snapshot of the reference table every six months; simple
but inefficient (deletions)
• Second technique: create a snapshot of the reference table, then capture all activity against the
reference table throughout the year in a separate table, so that the reference table can
be reconstructed from it at any time

DATA TRANSFORMATION
Tasks involved in data transformation
• Format revision: data types, length of fields
• Decoding of fields: e.g. M, F
• Splitting of fields: e.g. address into flat, building, road etc.
• Merging of information: data from different tables
• Character set conversion: EBCDIC, ASCII
• Conversion of units: e.g. currency
• Date and time conversions: format
• Summarization: e.g. total sales from the most granular data
• Key restructuring
• De-duplication
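
A few of these tasks are sketched below on a single source record. The field names, the three-part address layout and the fixed exchange rate are assumptions made purely for illustration.

```python
from datetime import datetime

GENDER = {"M": "Male", "F": "Female"}
INR_PER_USD = 70.0  # assumed fixed rate, for illustration only

def transform_record(rec):
    flat, building, road = [part.strip() for part in rec["address"].split(",")]
    return {
        "customer_id": int(rec["customer_id"]),                      # format revision
        "gender": GENDER[rec["gender"]],                             # decoding of fields
        "flat": flat, "building": building, "road": road,            # splitting of fields
        "amount_usd": round(rec["amount_inr"] / INR_PER_USD, 2),     # conversion of units
        "order_date": datetime.strptime(rec["order_date"], "%d/%m/%Y")
                              .strftime("%Y-%m-%d"),                 # date conversion
    }

def deduplicate(records, key):
    """Keep the first record seen for each key value."""
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique

raw = {"customer_id": "0042", "gender": "F", "address": "12, Sai Tower, MG Road",
       "amount_inr": 3500.0, "order_date": "05/01/2018"}
clean = deduplicate([transform_record(raw), transform_record(raw)], key="customer_id")
total_sales_usd = sum(rec["amount_usd"] for rec in clean)            # summarization

print(clean[0])
print(total_sales_usd)
```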

ROLE OF DATA EXTRACTION
• Map the data from source system to DW
• Data cleaning, missing values
• Remove duplicates
• Splitting and merging of fields
• Sorting
• Conversion to appropriate types
• Aggregation ,summarization
• Consolidation , integration

DATA LOADING
• Initial load: populating the DW tables for the first time
• Incremental load: applying ongoing changes periodically
• Full refresh: the contents of one or more tables are fully erased and rewritten with
fresh data
• The DW is offline during data loading, so load it part by part
• After loading, test the loaded data to verify its correctness

DURING INITIAL LOAD..

• May take several days


• Load dimension tables first, then fact tables, then aggregate tables
• Create indexes on those tables

TECHNIQUES OF DATA LOADING

• Load: wipe out the existing data, load fresh data
• Append: append the new data, preserve the existing data
• Destructive merge: if the primary key of a new record matches an existing record, the
existing record is overwritten with the new data
• Constructive merge: if the primary key matches, preserve the existing record, add the new
record and mark the newly added record
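
The four techniques can be compared side by side in a short sketch. The table is modelled as a list of dictionaries with a "pk" field, and the "supersedes_old" marker is a hypothetical flag name; neither comes from the slides.

```python
def load(table, incoming):
    """Load: wipe out the existing rows and load fresh data."""
    table[:] = [dict(row) for row in incoming]

def append(table, incoming):
    """Append: add the incoming rows and preserve every existing row."""
    table.extend(dict(row) for row in incoming)

def destructive_merge(table, incoming):
    """Destructive merge: on a primary key match, overwrite the existing row."""
    by_key = {row["pk"]: row for row in table}
    for row in incoming:
        if row["pk"] in by_key:
            by_key[row["pk"]].update(row)
        else:
            new_row = dict(row)
            table.append(new_row)
            by_key[row["pk"]] = new_row

def constructive_merge(table, incoming):
    """Constructive merge: on a match, keep the old row and mark the added row."""
    existing_keys = {row["pk"] for row in table}
    for row in incoming:
        new_row = dict(row)
        if row["pk"] in existing_keys:
            new_row["supersedes_old"] = True
        table.append(new_row)

existing = [{"pk": 1, "city": "Pune"}, {"pk": 2, "city": "Mumbai"}]
incoming = [{"pk": 2, "city": "Thane"}, {"pk": 3, "city": "Nagpur"}]

def fresh():
    """Return a fresh copy of the existing table for each demonstration."""
    return [dict(row) for row in existing]

loaded, appended, destructive, constructive = fresh(), fresh(), fresh(), fresh()
load(loaded, incoming)
append(appended, incoming)
destructive_merge(destructive, incoming)
constructive_merge(constructive, incoming)

for name, table in [("load", loaded), ("append", appended),
                    ("destructive merge", destructive), ("constructive merge", constructive)]:
    print(name, table)
```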

WHEN TO GO FOR DATA UPDATE RATHER THAN REFRESH

• After the initial load, the DW is kept up to date using
- update: application of incremental changes
- refresh: complete reload
• Refresh is simpler than update, but takes longer
• The cost of refresh remains the same irrespective of the number of changes, but the cost of update
varies depending on the changes

DATA QUALITY… TOPMOST CHALLENGE
Need of data quality
• Boosts confidence and enhances strategic decision making
• Better customer service
• Reduces costs and risks
• Improves productivity
Poor quality of data leads to
• Bad decisions
• Lost business opportunities
• Wastage of resources
• Inconsistent data reports
• Time and efforts needed to correct data

CATEGORIES OF ERRORS
Incompleteness errors
• Missing records
• Missing fields
• Records or fields that, by design, are not recorded
Incorrectness errors
• Wrong codes
• Wrong calculations, aggregations
• Duplicate records
• Wrong information added to the source system, e.g. date format
Incomprehensibility errors
• Make data difficult to read
• Multiple fields in one field
• Unknown codes due to lack of documentation
Inconsistency errors
• Similar data from multiple systems can easily be inconsistent
• Inconsistent use of different codes
• Inconsistent meaning of a code
• Inconsistent aggregating
• Lack of referential integrity, i.e. every reference must point to a record that exists

VARIOUS SOURCES OF POLLUTION OF DATA

• System conversion
• Data Aging
• Heterogeneous system integration
• Incomplete information at data entry
• Fraud
• Lack of policies for preventing corrupt or incorrect data

ISSUES IN DATA CLEANSING

• Which data to cleanse: decided by the project team and users, weighing the cost of cleansing
against the consequences of leaving the dirty data as it is
• Where to cleanse: in the data staging area
• How to cleanse: find appropriate tools

CHAPTER 4
OLAP

NEED OF OLAP

• To minimize the on-the-fly processing needed while the user is navigating the data
• Preprocess and store all the possible combinations of measures,
dimensions and hierarchies before the user starts the analysis
• This makes the data available to the user instantaneously
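
The preprocessing step can be sketched as computing one summary table per combination of dimensions ahead of time. The fact rows, dimension names and function name below are hypothetical, and real OLAP servers use far more efficient storage than a Python dictionary.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "city", "quarter")
facts = [  # (product, city, quarter, sales)
    ("Trousers", "New Delhi", "Q1", 100),
    ("Shirt", "New Delhi", "Q1", 40),
    ("Trousers", "Mumbai", "Q2", 80),
]

def precompute_all(rows):
    """Build one summary table for every subset of the dimensions (2^n group-bys)."""
    summaries = {}
    for size in range(len(DIMS) + 1):
        for group in combinations(range(len(DIMS)), size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[i] for i in group)] += row[-1]
            summaries[tuple(DIMS[i] for i in group)] = dict(totals)
    return summaries

summaries = precompute_all(facts)

# At analysis time a question becomes a lookup instead of a scan.
print(len(summaries))
print(summaries[()][()])
print(summaries[("product",)][("Trousers",)])
print(summaries[("city", "quarter")][("New Delhi", "Q1")])
```

With n dimensions there are 2^n such summaries, which is why the storage needed grows quickly as dimensions are added.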

BASIC VIRTUES OF OLAP
• Enables analysts, executives, managers to gain useful insights in business
• To measure metrics along several dimensions
• Allows data to be viewed from different perspective
• Drill down and roll up
• Use of complex formulae and calculation
• Fast response, speed of thought
• Complements other tools such as data mining and EIS
• Presentation and visualizations
• Can be implemented on web
• Highly interactive analysis

OLTP VS OLAP

OLAP DEFINED

• Definition-
Online analytical processing (OLAP) is a category of software technology that
enables analysts, managers, executives to gain insights into the data through
fast, consistent, interactive access in a wide variety of possible views of
information that has been transformed from raw data to reflect the real
dimensionality of the enterprise as understood by the user.

CHARACTERISTICS OF OLAP
• Allows users to have multidimensional and logical view of the data in DW
• Provides interactive query and complex analysis
• Drill down and roll up operations
• Enables users to perform complex calculations and comparisons
• Displays results in variety of formats including charts and graphs

MULTIDIMENSIONAL LOGICAL DATA MODEL

• 2D: data in tables, rows and columns
• Multidimensional model: a multidimensional cube, with dimensions displayed
along different axes
• Users must be able to summarise the data stored in individual cells across any
dimension and to manipulate the data as a cube
• Challenge: too few or too many dimensions
• 3 is ideal; more than 7 cannot be understood

MULTIDIMENSIONAL DATA MODEL

USERS OF MULTIDIMENSIONAL MODEL

• Traditional analyst or end user: executes queries on the DW
• DWI (data warehouse interface): the software component used to extract
data from the DW
• Multidimensional structure administrator: responsible for construction and
maintenance of DW interfaces
• DW administrator: responsible for construction and maintenance of the DW

OLAP FUNCTIONS

DIMENSIONAL ANALYSIS

HYPERCUBES

ATTRIBUTES OF DIMENSIONS

COMBINATION OF FOUR DIMENSION IN A TABLE

DATA REPRESENTED IN A HYPERCUBE

TABLE STORING DATA ALONG FIVE DIMENSIONS

OLAP OPERATIONS IN MULTIDIMENSIONAL MODEL
Roll Up And Drill Down

ROLL UP AND DRILL DOWN EXAMPLE

SLICE AND DICE OPERATION

PIVOT OR ROTATE OPERATION
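
The operations named on these slides (roll up, drill down, slice, dice, pivot) can be sketched on a tiny cube held as a Python dictionary. The dimensions, cell values and function names are assumptions made for illustration; an OLAP tool performs the same operations on much larger, specially stored cubes.

```python
from collections import defaultdict

DIMS = ("year", "quarter", "product", "city")
cube = {  # (year, quarter, product, city) -> sales
    ("2018", "Q1", "Trousers", "New Delhi"): 100,
    ("2018", "Q1", "Shirt", "New Delhi"): 40,
    ("2018", "Q2", "Trousers", "New Delhi"): 120,
    ("2018", "Q1", "Trousers", "Mumbai"): 80,
    ("2018", "Q2", "Shirt", "Mumbai"): 60,
}

def roll_up(cells, keep):
    """Aggregate the measure over every dimension that is not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice(cells, **allowed):
    """Dice: keep the cells whose values fall in the allowed sets for several dimensions."""
    return {k: v for k, v in cells.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

def pivot(table):
    """Pivot: swap the two axes of a 2-D view keyed by (row, column)."""
    return {(c, r): v for (r, c), v in table.items()}

by_year = roll_up(cube, ["year"])                    # rolled up to the year level
by_quarter = roll_up(cube, ["year", "quarter"])      # drilled down from year to quarter
delhi = slice_(cube, "city", "New Delhi")
q1_trousers = dice(cube, quarter={"Q1"}, product={"Trousers"})
product_by_city = roll_up(cube, ["product", "city"])
city_by_product = pivot(product_by_city)

print(by_year, by_quarter)
print(len(delhi), q1_trousers)
print(city_by_product[("New Delhi", "Trousers")])
```

Drill down is shown here as rolling up to a finer level (year and quarter instead of year alone), which is the reverse direction along the same hierarchy.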

TYPES OF OLAP

• OLAP is a specialized tool that creates a multidimensional view of data for the user to analyse.
• ROLAP and MOLAP are two models of OLAP.
• Though they differ in many aspects, the most important difference
is that ROLAP provides data directly from the main data warehouse,
whereas MOLAP provides data from proprietary multidimensional databases (MDDBs).



Basis for comparison: ROLAP vs. MOLAP
• Full form
  ROLAP: Relational Online Analytical Processing.
  MOLAP: Multidimensional Online Analytical Processing.
• Storage and fetching
  ROLAP: Data is stored in and fetched from the main data warehouse.
  MOLAP: Data is stored in and fetched from proprietary MDDBs.
• Data form
  ROLAP: Data is stored as relational tables.
  MOLAP: Data is stored in large multidimensional arrays made of data cubes.
• Data volumes
  ROLAP: Large data volumes.
  MOLAP: Limited summarized data is kept in MDDBs.
• Technology
  ROLAP: Uses complex SQL queries to fetch data from the main warehouse.
  MOLAP: The MOLAP engine creates precalculated and prefabricated data cubes
  for multidimensional data views. Sparse matrix technology is used to manage
  data sparsity.
• View
  ROLAP: Creates a multidimensional view of data dynamically.
  MOLAP: Already stores a static multidimensional view of data in MDDBs.
• Access
  ROLAP: Slower access.
  MOLAP: Faster access.

ROLAP



• ROLAP is the Relational Online Analytical Processing model, in which data is stored
as in a relational database, i.e. as rows and columns in the data warehouse. The data
is nevertheless presented to the user in multidimensional form.

• Whenever the ROLAP engine in the analytical server issues a complex query, it fetches
data from the main warehouse and dynamically creates a multidimensional view of the
data for the user.
• This is where it differs from MOLAP, which already has a static multidimensional view
of the data stored in proprietary MDDBs.

• Because the multidimensional view is created dynamically, ROLAP is slower than
MOLAP. The ROLAP engine deals with large volumes of data.
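The ROLAP flow described above can be sketched with Python's built-in `sqlite3` module. This is an assumption-laden illustration, not a real ROLAP engine: an in-memory SQLite database stands in for the main warehouse, and the `sales` table, its columns, and the function `rolap_view` are hypothetical. The point is that the multidimensional view is built by a SQL query at request time.

```python
import sqlite3

# An in-memory SQLite database stands in for the main data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, region TEXT, quarter TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("TV", "North", "Q1", 100),
        ("TV", "South", "Q1", 80),
        ("TV", "North", "Q2", 120),
        ("Phone", "North", "Q1", 200),
        ("Phone", "South", "Q2", 150),
    ],
)

ALLOWED_DIMENSIONS = {"product", "region", "quarter"}

def rolap_view(conn, dimensions):
    """Build a multidimensional view on request with a GROUP BY query."""
    # Column names cannot be bound as parameters, so validate them first.
    if not dimensions or not set(dimensions) <= ALLOWED_DIMENSIONS:
        raise ValueError("unknown dimension requested")
    columns = ", ".join(dimensions)
    query = f"SELECT {columns}, SUM(amount) FROM sales GROUP BY {columns}"
    return {tuple(row[:-1]): row[-1] for row in conn.execute(query)}

product_by_quarter = rolap_view(conn, ["product", "quarter"])
```

Every call runs a fresh aggregation over the relational rows, which is why this style scales to large volumes but responds more slowly than a precalculated store.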



MOLAP



• MOLAP is the Multidimensional Online Analytical Processing model. The data used for
analysis is stored in specialized multidimensional databases (MDDBs).

• The cells, or data cubes, of these multidimensional databases carry precalculated and
prefabricated data. Proprietary software systems create this precalculated and
prefabricated data while the data is loaded into the MDDBs from the main databases.

• The MOLAP engine, which resides in the application layer, then provides the
multidimensional view of the data from the MDDBs to the user.
• Thus, when a user requests data, no time is spent calculating it and the system
responds quickly.
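The contrast with ROLAP can be sketched in Python as follows. This is a simplified illustration under stated assumptions: a plain dictionary stands in for the MDDB, the rows and the names `build_mddb` and `molap_view` are hypothetical, and real MOLAP products use proprietary storage. All aggregates are calculated once at load time, so a query is only a lookup.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "region", "quarter")

# Hypothetical rows as they might arrive from the main database.
rows = [
    ("TV", "North", "Q1", 100),
    ("TV", "South", "Q1", 80),
    ("TV", "North", "Q2", 120),
    ("Phone", "North", "Q1", 200),
    ("Phone", "South", "Q2", 150),
]

def build_mddb(rows):
    """Precalculate totals for every combination of dimensions at load time."""
    mddb = {}
    for size in range(len(DIMENSIONS) + 1):
        for axes in combinations(range(len(DIMENSIONS)), size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[a] for a in axes)] += row[-1]
            mddb[tuple(DIMENSIONS[a] for a in axes)] = dict(totals)
    return mddb

def molap_view(mddb, dimensions):
    """Answer a request by lookup only; nothing is computed at query time.

    Dimensions must be given in the same order as DIMENSIONS.
    """
    return mddb[tuple(dimensions)]

mddb = build_mddb(rows)
by_region = molap_view(mddb, ["region"])
```

With three dimensions the store already holds eight precalculated views, which hints at why MDDBs keep limited, summarized data rather than everything.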



HOLAP


• HOLAP (Hybrid Online Analytical Processing) typically stores data in both a relational
database (RDB) and a multidimensional database (MDDB) and uses whichever is best
suited to the type of processing desired.
• For data-heavy processing, the data is more efficiently stored in an RDB, while for
speculative processing, the data is more effectively stored in an MDDB.
• HOLAP users can choose to store the results of queries in the MDDB, which saves the
time and effort of looking for the same data over and over. Although this technique
improves performance, it takes a toll on storage.
• The user has to strike a balance between performance and storage demand to get
the most out of HOLAP.
• Nevertheless, because it offers the best features of both MOLAP and ROLAP, HOLAP
is increasingly preferred.
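The hybrid idea can be sketched in Python as below. Everything here is a hypothetical stand-in: a list of tuples plays the RDB, a dictionary plays the MDDB, and the class `HolapEngine` and its methods are invented for illustration. Summary questions are served from the precalculated store, detail questions go to the relational rows, and stored results trade storage for speed.

```python
# Hypothetical detail rows, standing in for the relational database (RDB).
rdb_rows = [
    ("TV", "North", 100),
    ("TV", "South", 80),
    ("Phone", "North", 200),
]

# Hypothetical precalculated summary, standing in for the MDDB.
mddb_totals = {"TV": 180, "Phone": 200}

class HolapEngine:
    def __init__(self, rdb_rows, mddb_totals):
        self.rdb_rows = rdb_rows
        self.mddb = dict(mddb_totals)
        self.cache = {}  # stored query results: faster repeats, more storage

    def total_for_product(self, product):
        """Summary question: answered from the precalculated store."""
        return self.mddb[product]

    def total_for(self, product, region):
        """Detail question: computed from the RDB once, then stored."""
        key = (product, region)
        if key not in self.cache:
            self.cache[key] = sum(
                amount for p, r, amount in self.rdb_rows if (p, r) == key
            )
        return self.cache[key]

engine = HolapEngine(rdb_rows, mddb_totals)
tv_total = engine.total_for_product("TV")
tv_north = engine.total_for("TV", "North")
```

Each stored result makes the next identical request cheaper but enlarges the store, which is the performance-versus-storage balance noted above.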
DOLAP: Desktop OLAP

• Desktop Online Analytical Processing (DOLAP) is a single-tier, desktop-based OLAP
technology.

• It can download a relatively small hypercube from a central point, usually a
data mart or data warehouse, and perform multidimensional analyses while
disconnected from the source.

• Data sets are limited to the boundaries defined by the user, with no access to granular
data.

• In general, cubes contain summarized data organized in a fixed structure of
dimensions. DOLAP is therefore ideal for well-understood, recurring analytic
questions and reporting.
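A rough Python sketch of the DOLAP workflow, under stated assumptions: the list `central_cube` stands in for the data mart, a JSON string stands in for the file downloaded to the desktop, and the function names are hypothetical. Only summarized cells inside the user's boundaries are copied, and analysis then runs on the local copy alone.

```python
import json

# Hypothetical summarized cube held at the central data mart.
central_cube = [
    {"product": "TV", "region": "North", "quarter": "Q1", "sales": 100},
    {"product": "TV", "region": "South", "quarter": "Q1", "sales": 80},
    {"product": "Phone", "region": "North", "quarter": "Q1", "sales": 200},
    {"product": "Phone", "region": "South", "quarter": "Q2", "sales": 150},
]

def download_extract(cube, **bounds):
    """Serialize only the cells inside the user-defined boundaries."""
    cells = [c for c in cube if all(c[k] == v for k, v in bounds.items())]
    return json.dumps(cells)

def analyse_offline(extract, dimension):
    """Work on the local copy only; the central source is not contacted."""
    totals = {}
    for cell in json.loads(extract):
        totals[cell[dimension]] = totals.get(cell[dimension], 0) + cell["sales"]
    return totals

local_extract = download_extract(central_cube, region="North")
north_by_product = analyse_offline(local_extract, "product")
```

Because the extract holds only the North cells, a question about the South region cannot be answered offline, which mirrors the boundary limitation above.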
