
Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse
2 Data Warehouse Architecture
3 Data Modeling for Data Warehouse
4 Overview of Data Cleansing
5 Data Extraction, Transformation, Load
6 Metadata Management
7 OLAP
8 Data Warehouse Testing

An Overview
Understanding What is a Data Warehouse


What is Data Warehouse?


Definitions of Data Warehouse
- A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. (W. H. Inmon)
- A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support. (Babcock)
- A data warehouse is a relational database: a copy of transaction data specifically structured for query and analysis. (Ralph Kimball)

In simple terms: data warehousing is the collection of data from different systems to support business decisions, analysis and reporting.


Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
- Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture


What makes a Data Warehouse


Components of Warehouse
- Source Tables: Real-time, volatile data held for transaction processing (OLTP). Sources can be any relational databases or flat files.
- ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from sources to target.
- Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Produce reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: data is cleansed, transformed, loaded and tested in the staging area, and is then loaded into the target database/warehouse. The warehouse is divided into data marts, which different users access for reporting and analysis.


Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.
E-R (Entity-Relationship) Data Model
- Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
- Relationship: An association relating entities to other entities.

Different Perspectives of Data Modeling:
- Conceptual Data Model
- Logical Data Model
- Physical Data Model

Types of Dimensional Data Models most commonly used:
- Star Schema
- Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: A category of information. For example, the time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
- Fact Table: A table that contains the measures of interest.
- Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
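
The role of surrogate keys in a slowly changing dimension can be illustrated with a minimal Python sketch. It assumes one common approach (often called a "Type 2" change, a name the slides do not use): when an attribute changes, the old row is kept and a new row with a new surrogate key is added. The class name, the `customer_key` and `is_current` columns, and the sample values are hypothetical.

```python
from itertools import count


class CustomerDimension:
    """Toy customer lookup table keyed by a generated surrogate key."""

    def __init__(self):
        self.rows = []
        self._next_key = count(1)  # surrogate keys: 1, 2, 3, ...

    def current(self, cust_id):
        """Return the current row for a business key, or None."""
        for row in reversed(self.rows):
            if row["custId"] == cust_id and row["is_current"]:
                return row
        return None

    def upsert(self, cust_id, city):
        """Insert a customer, or version the row if the city changed."""
        existing = self.current(cust_id)
        if existing is not None:
            if existing["city"] == city:
                return existing["customer_key"]  # nothing changed
            existing["is_current"] = False  # keep history, retire old row
        row = {
            "customer_key": next(self._next_key),
            "custId": cust_id,
            "city": city,
            "is_current": True,
        }
        self.rows.append(row)
        return row["customer_key"]
```

Fact rows would store `customer_key` rather than `custId`, so older facts keep pointing at the attribute values that were valid when they were loaded.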

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
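
As an illustration, the star schema above can be loaded into an in-memory SQLite database and queried by joining the fact table to one dimension. This is a minimal sketch using Python's standard `sqlite3` module; the column types and the function names are assumptions, while the rows are the ones shown on the slide.

```python
import sqlite3


def build_star_schema():
    """Create the product, store, customer and sale tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
        CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
        CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                               address TEXT, city TEXT);
        CREATE TABLE sale (
            orderId TEXT PRIMARY KEY, date TEXT,
            custId INTEGER REFERENCES customer(custId),
            prodId TEXT REFERENCES product(prodId),
            storeId TEXT REFERENCES store(storeId),
            qty INTEGER, amt INTEGER);
    """)
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                     [("p1", "bolt", 10), ("p2", "nut", 5)])
    conn.executemany("INSERT INTO store VALUES (?, ?)",
                     [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                     [(53, "joe", "10 main", "sfo"),
                      (81, "fred", "12 main", "sfo"),
                      (111, "sally", "80 willow", "la")])
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                     [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                      ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                      ("105", "3/8/97", 111, "p1", "c3", 5, 50)])
    return conn


def sales_by_store_city(conn):
    """Total quantity and amount per store city (fact joined to one dimension)."""
    return conn.execute("""
        SELECT st.city, SUM(s.qty), SUM(s.amt)
        FROM sale AS s
        JOIN store AS st ON s.storeId = st.storeId
        GROUP BY st.city
        ORDER BY st.city
    """).fetchall()
```

Each analysis question becomes one join from the fact table to the dimension of interest, which is the main appeal of the star layout.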


Snowflake Schema
Dimension Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south

Dimension Table: region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.
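
To show what the extra normalization costs at query time, here is a small sketch that loads the snowflake tables above into SQLite and walks the store > city > region chain. It uses Python's standard `sqlite3` module; the column types and function names are assumptions.

```python
import sqlite3


def build_snowflake():
    """Create the store dimension and its normalized lookup tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE city (cityId TEXT PRIMARY KEY, pop TEXT,
                           regId TEXT REFERENCES region(regId));
        CREATE TABLE sType (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
        CREATE TABLE store (storeId TEXT PRIMARY KEY,
                            cityId TEXT REFERENCES city(cityId),
                            tId TEXT REFERENCES sType(tId), mgr TEXT);
    """)
    conn.executemany("INSERT INTO region VALUES (?, ?)",
                     [("north", "cold region"), ("south", "warm region")])
    conn.executemany("INSERT INTO city VALUES (?, ?, ?)",
                     [("sfo", "1M", "north"), ("la", "5M", "south")])
    conn.executemany("INSERT INTO sType VALUES (?, ?, ?)",
                     [("t1", "small", "downtown"), ("t2", "large", "suburbs")])
    conn.executemany("INSERT INTO store VALUES (?, ?, ?, ?)",
                     [("s5", "sfo", "t1", "joe"),
                      ("s7", "sfo", "t2", "fred"),
                      ("s9", "la", "t1", "nancy")])
    return conn


def stores_per_region(conn):
    """Count stores per region; needs two joins instead of one."""
    return conn.execute("""
        SELECT r.name, COUNT(*)
        FROM store AS s
        JOIN city AS c ON s.cityId = c.cityId
        JOIN region AS r ON c.regId = r.regId
        GROUP BY r.name
        ORDER BY r.name
    """).fetchall()
```

In a star schema the region name would sit directly on the store row, so the same question would need no joins between dimension tables.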


Overview of Data Cleansing


The Need For Data Quality


- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with:
  - error detection
  - error rework
  - customer service
  - fixing customer problems


Six Steps To Data Quality


1 Understand Information Flow In Organization
  - Identify authoritative data sources
  - Interview employees & customers
2 Identify Potential Problem Areas & Assess Impact
  - Data entry points
  - Cost of bad data
3 Measure Quality Of Data
  - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
4 Clean & Load Data
  - Use data cleansing tools to clean data at the source
  - Load only clean data into the data warehouse
5 Continuous Monitoring
  - Schedule periodic cleansing of source data
6 Identify Areas of Improvement
  - Identify & correct cause of defects
  - Refine data capture mechanisms at source
  - Educate users on importance of DQ
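
The "measure quality" step can be sketched in a few lines of Python. This is only an illustration of counting missing, duplicate and out-of-range values; the function name, the field names and the sample records are hypothetical, and real projects would normally rely on the profiling tools discussed in this section.

```python
from collections import Counter


def measure_quality(records, key, required, valid_range):
    """Count missing, duplicate and out-of-range values in a list of dicts.

    key         -- field that should be unique per record
    required    -- fields that must not be None or empty
    valid_range -- (field, low, high) bounds for one numeric field
    """
    field, low, high = valid_range
    key_counts = Counter(record.get(key) for record in records)
    return {
        "missing": sum(
            1 for record in records for name in required
            if record.get(name) in (None, "")
        ),
        "duplicate": sum(n - 1 for n in key_counts.values() if n > 1),
        "out_of_range": sum(
            1 for record in records
            if record.get(field) is not None
            and not (low <= record[field] <= high)
        ),
    }


# Hypothetical sample data with one defect of each kind.
SAMPLE = [
    {"custId": 53, "name": "joe", "age": 34},
    {"custId": 81, "name": "", "age": 41},       # missing name
    {"custId": 53, "name": "joe", "age": 34},    # duplicate key
    {"custId": 111, "name": "sally", "age": 240},  # implausible age
]
```

Calling `measure_quality(SAMPLE, "custId", ["custId", "name", "age"], ("age", 0, 120))` yields one count per defect category, which can be tracked over time as part of continuous monitoring.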

Data Quality Solution


Customized Programs
- Strengths:
  - Address specific needs
  - No bulky one-time investment
- Limitations:
  - Tons of custom programs in different environments are difficult to manage
  - Minor alterations demand coding effort

Data Quality Assessment Tools
- Strength: Provide automated assessment
- Limitation: No measure of data accuracy


Data Quality Solution


Business Rule Discovery Tools
- Strengths:
  - Detect correlation in data values
  - Can detect patterns of behavior that indicate fraud
- Limitations:
  - Not all variables can be discovered
  - Some discovered rules might not be pertinent
  - There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths:
  - Usually are integrated packages with cleansing features as an add-on
- Limitations:
  - Error prevention at source is usually absent
  - The ETL tools have limited cleansing facilities

Tools In The Market


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carleton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

[Figure: ETL architecture, shown as five stages.
- Data Collection: visitors, web browsers and the Internet; external data (demographics, household, webographics, income); other OLTP systems.
- Data Extraction: scheduled extraction from web server logs and e-commerce transaction data, flat files and RDBMS.
- Data Transformation: staging area (clean, transform, match, merge), backed by a metadata repository.
- Data Loading: scheduled loading.
- Data Storage & Integration: enterprise data warehouse.]


ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata
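
The four stages above can be sketched as a tiny in-memory pipeline. This is an illustrative outline only, not how any particular ETL tool works: the field names, the status code mapping, the choice of `id` as the incremental-load marker and the sample rows are all hypothetical.

```python
from datetime import date

STATUS_CODES = {"A": "ACTIVE", "C": "CLOSED"}  # hypothetical code mapping

SOURCE_ROWS = [
    {"id": 1, "status": "A", "qty": 2, "unit_price": 10.0, "terminal": "T1"},
    {"id": 2, "status": "C", "qty": None, "unit_price": 5.0, "terminal": "T2"},
    {"id": 3, "status": "A", "qty": -4, "unit_price": 3.0, "terminal": "T1"},
]


def extract(rows, last_loaded_id):
    """Select only rows newer than the last load (selection criterion)."""
    return [dict(row) for row in rows if row["id"] > last_loaded_id]


def clean(row):
    """Drop operational-only data, supply missing values, range-check qty."""
    cleaned = {k: v for k, v in row.items() if k != "terminal"}
    if cleaned["qty"] is None:
        cleaned["qty"] = 0  # supply a default for a missing field value
    if cleaned["qty"] < 0:
        return None  # fails the range check: reject the row
    return cleaned


def transform(row, load_date):
    """Change codes, add a time attribute and calculate a derived value."""
    out = dict(row)
    out["status"] = STATUS_CODES[row["status"]]
    out["load_date"] = load_date.isoformat()
    out["amount"] = row["qty"] * row["unit_price"]
    return out


def run_etl(source, target, last_loaded_id, load_date):
    """Run one (initial or incremental) load; return (loaded, rejected)."""
    loaded = rejected = 0
    for row in extract(source, last_loaded_id):
        cleaned = clean(row)
        if cleaned is None:
            rejected += 1
            continue
        target.append(transform(cleaned, load_date))
        loaded += 1
    return loaded, rejected
```

An initial load passes `last_loaded_id=0`; later incremental runs pass the highest id already loaded, so only new rows are processed.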


Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.

To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Major components involved in ETL Processing



- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Metadata management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: The process of reading data from a database.
- Transform: The process of converting the extracted data.
- Load: The process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
- PowerCenter/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information...
- that describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- about the data being captured and loaded into the warehouse
- documented in IT tools that improve both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users (business metadata)
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third-Party Bridging Tools

Oracle Exchange
- Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik Toolbus
- Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
- Hub-and-spoke solution for enabling metadata interoperability
- Ardent is focusing on its own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
- Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

5/16/2013


OLAP: On-Line Analytical Processing


- OLAP can be defined as a technology that allows users to view aggregate data across measurements (such as Maturity Amount or Interest Rate) along with a set of related parameters called dimensions (such as Product, Organization or Customer).
- The term is used interchangeably with BI.
- A multidimensional view of data is the foundation of OLAP.
- Users: analysts, decision makers.

Distinction between OLTP and OLAP


Source of data
- OLTP System: Operational data; OLTPs are the original source of the data
- OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: To control and run fundamental business tasks
- OLAP System: Decision support

What the data reveals
- OLTP System: A snapshot of ongoing business processes
- OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
- OLTP System: Short and fast inserts and updates initiated by end users
- OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions).
- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

27 x 4 = 108 cells

MDDB

[Figure: Sales Volumes cube with axes MODEL (Mini Van, Coupe, Sedan), DEALERSHIP (Carr, Gleason, Clyde) and COLOR (Blue, Red, White).]

3 x 3 x 3 = 27 cells
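
The difference between the two layouts can be sketched in Python with plain dictionaries. This is an illustration only, not how an MDDB engine stores data; it uses the Mini Van rows from the relational table, and the function names are hypothetical.

```python
from collections import defaultdict

# Mini Van rows copied from the relational table (other models omitted).
ROWS = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3),
    ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3),
    ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]


def build_cube(rows):
    """Index volumes by model, then color, then dealer."""
    cube = defaultdict(lambda: defaultdict(dict))
    for model, color, dealer, volume in rows:
        cube[model][color][dealer] = volume
    return cube


def relational_lookup(rows, model, color, dealer):
    """Find one volume by scanning the rows, as an unindexed table would."""
    for m, c, d, volume in rows:
        if (m, c, d) == (model, color, dealer):
            return volume
    return None


CUBE = build_cube(ROWS)
```

With the cube, `CUBE["MINI VAN"]["BLUE"]["Clyde"]` addresses the cell directly by its dimension members, and each cell stores only the volume rather than repeating the model, color and dealer labels.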


Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array.
- The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table.

Storage Space
- Very low space consumption compared to a relational DB.

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example


If the members of different dimensions do not interact, a blank cell is left behind.

Employee Age (relational table)

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: the same data drawn as a LAST NAME x EMPLOYEE # array, in which only 9 of the 81 cells hold an age, shown next to the fully populated Sales Volumes array below.]

Sales Volumes (MODEL x COLOR)

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2
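
The sparsity of the employee example can be quantified with a short Python sketch. The `density` helper is a hypothetical name; it simply divides the number of populated cells by the product of the dimension sizes.

```python
EMPLOYEES = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]

SALES = {
    ("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5, ("Mini Van", "White"): 4,
    ("Coupe", "Blue"): 3, ("Coupe", "Red"): 5, ("Coupe", "White"): 5,
    ("Sedan", "Blue"): 4, ("Sedan", "Red"): 3, ("Sedan", "White"): 2,
}


def density(cells, *dimensions):
    """Fraction of cells in the full array that actually hold a value."""
    total = 1
    for members in dimensions:
        total *= len(members)
    return len(cells) / total


# Employee array: LAST NAME x EMPLOYEE #, one age per employee.
employee_cells = {(name, emp): age for name, emp, age in EMPLOYEES}
last_names = {name for name, _, _ in EMPLOYEES}
emp_numbers = {emp for _, emp, _ in EMPLOYEES}
EMPLOYEE_DENSITY = density(employee_cells, last_names, emp_numbers)

# Sales array: MODEL x COLOR, every combination has a volume.
models = {model for model, _ in SALES}
colors = {color for _, color in SALES}
SALES_DENSITY = density(SALES, models, colors)
```

The employee array is about 11% populated because each last name pairs with exactly one employee number, while the sales array is fully populated because every model is sold in every color.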


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data


Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

Sales Volumes

View #1 (rows: MODEL, columns: COLOR)

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

View #2 (rotated 90 degrees; rows: COLOR, columns: MODEL)

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.
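
For a 2-dimensional array, the rotation amounts to swapping rows and columns. A minimal Python sketch, with `rotate` as a hypothetical helper name:

```python
MODELS = ["Mini Van", "Coupe", "Sedan"]
COLORS = ["Blue", "Red", "White"]

# View #1: one row per model, one column per color.
VIEW_1 = [
    [6, 5, 4],  # Mini Van
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]


def rotate(view):
    """Swap the two axes so that rows become columns (a transpose)."""
    return [list(column) for column in zip(*view)]


# View #2: one row per color, one column per model.
VIEW_2 = rotate(VIEW_1)
```

No data is re-queried or re-sorted; the same nine cells are just read along the other axis, and rotating a second time returns the original view.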



Features of OLAP - Rotation


[Figure: six views (View #1 to View #6) of the Sales Volumes cube. Each view is obtained by rotating the cube 90 degrees so that a different arrangement of the MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde) axes faces the viewer.]

A 3-dimensional array has 6 views.
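
One way to read these counts, assuming a "view" corresponds to an ordering of the cube's axes, is that n dimensions give n! views. A short Python sketch (the `views` helper is a hypothetical name):

```python
from itertools import permutations


def views(dimensions):
    """List every ordering of the given axes; each ordering is one view."""
    return list(permutations(dimensions))


THREE_D_VIEWS = views(["MODEL", "COLOR", "DEALERSHIP"])
TWO_D_VIEWS = views(["MODEL", "COLOR"])
```

Under that assumption, two dimensions give 2 orderings and three dimensions give 6, matching the counts on the slides.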



Features of OLAP - Slicing / Filtering


An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL (Mini Van, Coupe), DEALERSHIP (Carr, Clyde) and COLOR (Normal Blue, Metal Blue).]
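
Slicing can be sketched as filtering a cube on a subset of members per dimension. The example below is an illustration only; it reuses volumes from the relational table shown earlier (with the single color Blue rather than the figure's Normal Blue and Metal Blue), and `slice_cube` is a hypothetical helper.

```python
# Cells keyed by (model, color, dealer); volumes from the earlier table.
CUBE = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Blue", "Gleason"): 3,
    ("Mini Van", "Blue", "Carr"): 2,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Mini Van", "Red", "Gleason"): 3,
    ("Mini Van", "Red", "Carr"): 1,
    ("Coupe", "Blue", "Clyde"): 3,
    ("Coupe", "Blue", "Gleason"): 3,
    ("Coupe", "Blue", "Carr"): 3,
}


def slice_cube(cube, models=None, colors=None, dealers=None):
    """Keep only cells whose members are in the requested subsets.

    A dimension left as None is not filtered.
    """
    def keep(value, allowed):
        return allowed is None or value in allowed

    return {
        key: volume
        for key, volume in cube.items()
        if keep(key[0], models) and keep(key[1], colors) and keep(key[2], dealers)
    }


BLUE_CARR_CLYDE = slice_cube(
    CUBE,
    models={"Mini Van", "Coupe"},
    colors={"Blue"},
    dealers={"Carr", "Clyde"},
)
```

The result contains only the four cells for blue Mini Vans and Coupes sold by Carr and Clyde.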

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at Region / District / Dealership level.

Moving up in a hierarchy is referred to as drill-up (roll-up); moving down is referred to as drill-down.
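
Roll-up along the organization hierarchy can be sketched as repeated aggregation through a child-to-parent mapping. The slide does not say which dealership belongs to which district, so the mapping and the sales figures below are hypothetical, as is the `roll_up` helper.

```python
# Hypothetical assignment of dealerships to districts.
DEALERSHIP_TO_DISTRICT = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
DISTRICT_TO_REGION = {
    "Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest",
}

# Hypothetical sales volumes at the lowest (dealership) level.
DEALERSHIP_SALES = {
    "Clyde": 6, "Gleason": 3, "Carr": 2, "Levi": 4, "Lucas": 1, "Bolton": 5,
}


def roll_up(values, parent_of):
    """Aggregate child-level totals into their parent level."""
    totals = {}
    for child, value in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals


DISTRICT_SALES = roll_up(DEALERSHIP_SALES, DEALERSHIP_TO_DISTRICT)
REGION_SALES = roll_up(DISTRICT_SALES, DISTRICT_TO_REGION)
```

Drill-down is the reverse navigation: starting from the Midwest total, the user expands to the district totals and then to the individual dealerships.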


OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) and Year (1999, 2000).]


OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for each quarter of Year 1999 (1st Qtr to 4th Qtr).]

Drill-down from Year to Quarter



OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for each month of the 1st Qtr of Year 1999 (January, February, March).]

Drill-down from Quarter to Month


Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic provided as separate layers

HOLAP - Hybrid OLAP
- OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop


MOLAP - MDDB storage

[Figure: MOLAP architecture. An OLAP cube with an OLAP calculation engine serves web browsers, OLAP tools and OLAP applications.]



MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  - Maximum query performance
  - Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture. A relational DW is accessed through SQL via an MDDB-to-relational mapping; an OLAP calculation engine serves web browsers, OLAP tools and OLAP applications.]

ROLAP - Features

- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier & database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability


HOLAP - Combination of RDBMS and MDDB


[Figure: HOLAP architecture. An OLAP cube and a relational DW (accessed through SQL) feed an OLAP calculation engine that serves any client: web browsers, OLAP tools and OLAP applications.]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user


Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: With good design, 3-10 times
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS then it is like ROLAP, otherwise like MOLAP.

Cost
- MOLAP: Medium: MDDB server + large disk space cost
- ROLAP: Low: only RDBMS + disk space cost
- HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence


Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions

The methodology required for testing a Data Warehouse is different from testing a typical transaction system


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge


Difference In Testing Data warehouse and Transaction System

User-triggered vs. system-triggered

In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). There are very few test cycles that cover system-triggered scenarios (such as billing or valuation).



Difference In Testing Data warehouse and Transaction System


Volume of test data
- The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to cover the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases
- In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.


Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
- In transaction systems, users/business analysts typically test the output of the system.
- In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
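
A stand-alone comparison script of the kind described above might look like the following Python sketch. It is an assumption-laden illustration: it presumes the transformation should preserve the row keys and the total of one numeric measure, and the function name, field names and sample rows are hypothetical.

```python
def reconcile(source_rows, target_rows, key, measure):
    """Compare pre-transformation rows with post-transformation rows."""
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}
    source_total = sum(row[measure] for row in source_rows)
    target_total = sum(row[measure] for row in target_rows)
    return {
        "row_count_match": len(source_rows) == len(target_rows),
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
        "measure_total_match": source_total == target_total,
    }


# Hypothetical data: order 3 was lost somewhere in the ETL run.
SOURCE = [
    {"order_id": 1, "amt": 12},
    {"order_id": 2, "amt": 11},
    {"order_id": 3, "amt": 50},
]
TARGET = [
    {"order_id": 1, "amt": 12},
    {"order_id": 2, "amt": 11},
]

REPORT = reconcile(SOURCE, TARGET, key="order_id", measure="amt")
```

In practice the two row sets would come from queries against the source system and the loaded area, and any mismatch in the report would be logged as a defect.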


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP

The testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing

70

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
Are the requirements complete?
Are the requirements singular?
Are the requirements unambiguous?
Are the requirements developable?
Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
Whether the ETLs are accessing and picking up the right data from the right source.
Whether all the data transformations are correct according to the business rules and the data warehouse is correctly populated with the transformed data.
Testing the rejected records that don't fulfil the transformation rules.
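The check on rejected records can be illustrated with a small Python sketch. The rule used here (quantity must be a non-negative integer), the field names and the sample records are assumptions made for the example, not rules stated in this deck.

```python
# Hypothetical sketch: route records that fail a transformation rule to a reject list.
from typing import Dict, List, Tuple

def split_by_rule(records: List[Dict[str, str]]) -> Tuple[List[dict], List[dict]]:
    """Assumed rule: 'qty' must parse as a non-negative integer."""
    accepted, rejected = [], []
    for rec in records:
        try:
            qty = int(rec["qty"])
            if qty < 0:
                raise ValueError("negative quantity")
            accepted.append({"order_id": rec["order_id"], "qty": qty})
        except (KeyError, ValueError) as err:
            # Keep the original record and the reason so the reject can be audited.
            rejected.append({"record": rec, "reason": str(err)})
    return accepted, rejected

raw = [
    {"order_id": "o100", "qty": "1"},
    {"order_id": "o102", "qty": "two"},   # not a number: should be rejected
    {"order_id": "o105", "qty": "-5"},    # negative: should be rejected
]
accepted, rejected = split_by_rule(raw)
# A unit test checks both sides: nothing is lost and nothing invalid is loaded.
print(len(accepted), len(rejected))
```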


Unit Testing
Unit testing the report data:

Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, trace them back and compare them with the source systems.
Derivation formulae/calculation rules should be verified.

Integration Testing
Integration testing will involve the following:

Sequence of ETL jobs in a batch.
Initial loading of records into the data warehouse.
Incremental loading of records at a later date, to verify the newly inserted or updated data.
Testing the rejected records that don't fulfil the transformation rules.
Error log generation.

Performance Testing
Performance testing should check for:
ETL processes completing within the time window.
Monitoring and measuring of data quality issues.
Refresh times for standard/complex reports.

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.

Questions


Thank You


Data Warehouse Architecture


What makes a Data Warehouse


Components of Warehouse
Source Tables: real-time, volatile data in relational databases for transaction processing (OLTP). The sources can be any relational databases or flat files.
ETL Tools: extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
Maintenance and Administration Tools: authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
Modeling Tools: used to design the data warehouse for high performance with the dimensional data modeling technique, and to map the source and target files.
Databases: target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
End-user tools for analysis and reporting: get reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture


This is a basic design: source files are loaded into a warehouse, and users query the data for different purposes.

This design has a staging area, where the data is loaded and tested after cleansing and transformation. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.

Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
Relationship: relates entities to other entities.

Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of dimensional data models most commonly used:
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
Dimension: a category of information. For example, the time dimension.
Attribute: a unique level within a dimension. For example, Month is an attribute in the Time dimension.
Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year -> Quarter -> Month -> Day.
Fact Table: a table that contains the measures of interest.
Lookup Table: provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
Surrogate Keys: used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
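How a surrogate key helps with a slowly changing dimension can be sketched in Python. This is a minimal illustration under assumptions: the `CustomerDimension` class, its fields and the "keep the old row, add a new one" policy are hypothetical choices for the example, not a design prescribed here.

```python
# Hypothetical sketch: surrogate keys for a slowly changing customer dimension.
from itertools import count

class CustomerDimension:
    def __init__(self):
        self._next_key = count(1)   # surrogate keys are plain sequence numbers
        self.rows = []              # each row: surrogate_key, natural_key, city, current

    def upsert(self, natural_key, city):
        """Return the surrogate key of the current version of this customer."""
        for row in self.rows:
            if row["natural_key"] == natural_key and row["current"]:
                if row["city"] == city:
                    return row["surrogate_key"]   # nothing changed
                row["current"] = False            # keep history, close the old version
                break
        new_row = {"surrogate_key": next(self._next_key),
                   "natural_key": natural_key, "city": city, "current": True}
        self.rows.append(new_row)
        return new_row["surrogate_key"]

dim = CustomerDimension()
k1 = dim.upsert(53, "sfo")
k2 = dim.upsert(53, "sfo")   # unchanged: same surrogate key
k3 = dim.upsert(53, "la")    # customer moved: new surrogate key, old row kept
print(k1, k2, k3, len(dim.rows))
```

Because fact rows would reference the surrogate key rather than the source system's customer id, facts recorded before the move keep pointing at the old version of the customer.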

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
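To show how a star schema is queried, the sketch below loads the slide's sample rows into an in-memory SQLite database through Python's standard `sqlite3` module and joins the fact table to two of its dimensions. The column types and the choice of SQLite are assumptions for illustration only.

```python
# Sketch: the star schema sample data in SQLite, queried by joining fact and dimensions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                       storeId TEXT, qty INTEGER, amt INTEGER);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"), (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# The fact table sits in the middle; each dimension is one join away.
rows = conn.execute("""
    SELECT store.city, product.name, SUM(sale.qty), SUM(sale.amt)
    FROM sale
    JOIN store   ON sale.storeId = store.storeId
    JOIN product ON sale.prodId  = product.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()
for row in rows:
    print(row)
```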


Snowflake Schema
Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south

Dimension Table: region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.


Overview of Data Cleansing


The Need For Data Quality


Difficulty in decision making
Time delays in operation
Organizational mistrust
Data ownership conflicts
Customer attrition
Costs associated with:
- error detection
- error rework
- customer service
- fixing customer problems

Six Steps To Data Quality


1 Understand Information Flow In Organization
- Identify authoritative data sources
- Interview employees & customers

2 Identify Potential Problem Areas & Assess Impact
- Data entry points
- Cost of bad data

3 Measure Quality Of Data
- Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

4 Clean & Load Data
- Use data cleansing tools to clean data at the source
- Load only clean data into the data warehouse

5 Continuous Monitoring
- Schedule periodic cleansing of source data

6 Identify Areas of Improvement
- Identify & correct cause of defects
- Refine data capture mechanisms at source
- Educate users on importance of DQ
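The "measure quality of data" step can be illustrated with a small profiling sketch in Python. It stands in for a business rule discovery tool only in the loosest sense: the `profile` function, the sample customers, the ages and the 0 to 120 age range are all hypothetical.

```python
# Hypothetical sketch: count missing, duplicate and out-of-range values in a data set.
from collections import Counter

def profile(rows, key, required, valid_range):
    """rows: list of dicts; key: identifying column;
    required: columns that must be filled; valid_range: (column, low, high)."""
    col, low, high = valid_range
    key_counts = Counter(r.get(key) for r in rows)
    return {
        "missing": sum(1 for r in rows for c in required if r.get(c) in (None, "")),
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        "out_of_range": sum(1 for r in rows
                            if r.get(col) is not None and not low <= r[col] <= high),
    }

customers = [
    {"custId": 53, "name": "joe", "age": 34},
    {"custId": 81, "name": "", "age": 29},          # missing name
    {"custId": 53, "name": "joe", "age": 34},       # duplicate key
    {"custId": 111, "name": "sally", "age": 230},   # implausible age
]
report = profile(customers, "custId", ["name", "age"], ("age", 0, 120))
print(report)
```

Running such a profile before and after cleansing gives a simple measure of whether only clean data is reaching the warehouse.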

Data Quality Solution


Customized Programs
Strengths:
- Addresses specific needs
- No bulky one-time investment
Limitations:
- Tons of custom programs in different environments are difficult to manage
- Minor alterations demand coding efforts

Data Quality Assessment Tools
Strength:
- Provide automated assessment
Limitation:
- No measure of data accuracy

Data Quality Solution


Business Rule Discovery Tools
Strengths:
- Detect correlation in data values
- Can detect patterns of behavior that indicate fraud
Limitations:
- Not all variables can be discovered
- Some discovered rules might not be pertinent
- There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
Strengths:
- Usually are integrated packages with cleansing features as an add-on
Limitations:
- Error prevention at source is usually absent
- The ETL tools have limited cleansing facilities

Tools In The Market


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic

Data Extraction, Transformation, Load


ETL Architecture

[Figure: ETL architecture. Labels on the diagram: Visitors, Web Browsers, The Internet; External Data (Demographics, Household, Webographics, Income); Web Server Logs & E-comm Transaction Data; Flat Files; RDBMS; Other OLTP Systems; Scheduled Extraction; Staging Area (Clean, Transform, Match, Merge); Meta Data Repository; Scheduled Loading; Enterprise Data Warehouse. Stages shown along the bottom: Data Collection, Data Extraction, Data Transformation, Data Loading, Data Storage & Integration.]

ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Loading:
- Initial and incremental loading
- Updating of metadata
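The four stages above can be sketched end to end in a few lines of Python. This is a toy illustration, not how an ETL tool works internally: the CSV layout, the default of 0 for a missing quantity, the lower-casing of store codes and the list standing in for the warehouse are all assumptions.

```python
# Hypothetical sketch of extract -> cleanup -> transform -> load on in-memory data.
import csv
import io
from datetime import date

SOURCE_CSV = """order_id,store,qty,amt
o100,NYC,1,12
o102,nyc,2,11
o105,LA,,50
"""

def extract(text):
    """Extraction: read the records and keep those that qualify (here: have an order id)."""
    return [r for r in csv.DictReader(io.StringIO(text)) if r["order_id"]]

def cleanup(rows):
    """Cleanup: supply missing field values and make codes consistent."""
    for r in rows:
        r["qty"] = r["qty"] or "0"        # assumed default for a missing quantity
        r["store"] = r["store"].lower()   # consistent store codes
    return rows

def transform(rows, load_date):
    """Transformation: convert types, add a time attribute, summarize per store."""
    totals = {}
    for r in rows:
        t = totals.setdefault(r["store"], {"store": r["store"], "qty": 0, "amt": 0,
                                           "load_date": load_date.isoformat()})
        t["qty"] += int(r["qty"])
        t["amt"] += int(r["amt"])
    return list(totals.values())

def load(rows, warehouse):
    """Loading: append to the target table (a list stands in for the warehouse)."""
    warehouse.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(cleanup(extract(SOURCE_CSV)), date(2009, 1, 1)), warehouse)
print(loaded, warehouse)
```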


Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.

To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

Major components involved in ETL Processing



Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs.
Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes.
Extract: the process of reading data from a database.
Transform: the process of converting the extracted data.
Load: the process of writing the data into the target database.
Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
Provide a facility to specify a large number of transformation rules with a GUI
Generate programs to transform data
Handle multiple data sources
Handle data redundancy
Generate metadata as output
Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
PowerCentre/Mart from Informatica
Data Mart Solution from Sagent Technology
DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is Information...

That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
About the data being captured and loaded into the warehouse
Documented in IT tools that improve both business and technical understanding of data and data-related processes

Importance Of Metadata
Locating information
- Time spent in looking for information.
- How often is information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management


Provide a simple catalogue of business metadata descriptions and views
Document/manage metadata descriptions from an integrated development environment
Enable DW users to identify and invoke pre-built queries against the data stores
Design and enhance new data models and schemas for the data warehouse
Capture data transformation rules between the operational and data warehousing databases
Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users (business metadata)
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools


Third-Party Bridging Tools

Oracle Exchange
- Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik-Toolbus
- Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
- Hub-and-spoke solution for enabling metadata interoperability
- Ardent focussing on own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
- Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them

Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML: addresses context and data meaning, not presentation
- Can enable exchange over the web employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member), Viasoft

OLAP


Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques
Architectures
Features
Representative Tools

OLAP: On-Line Analytical Processing


OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.).
Used interchangeably with BI
Multidimensional view of data is the foundation of OLAP
Users: analysts, decision makers

Distinction between OLTP and OLAP


Source of data
- OLTP System: Operational data; OLTPs are the original source of the data
- OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: To control and run fundamental business tasks
- OLAP System: Decision support

What the data reveals
- OLTP System: A snapshot of ongoing business processes
- OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
- OLTP System: Short and fast inserts and updates initiated by end users
- OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data.
The edges of the cube are called dimensions.
Individual items within each dimension are called members.

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

27 x 4 = 108 cells

MDDB

[Figure: Sales Volumes cube with axes MODEL (Mini Van, Coupe, Sedan), DEALERSHIP (Carr, Gleason, Clyde) and COLOR (Blue, Red, White)]

3 x 3 x 3 = 27 cells
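The cell counts can be reproduced with a short Python sketch that stores the same volumes first as relational rows and then as a three-dimensional array. Only the 21 volumes listed on the slide are used; the remaining six are elided there, so the sketch leaves them as `None` rather than inventing values. The nested-list layout is an illustrative assumption, not how an MDDB stores data internally.

```python
# Sketch: the same sales volumes as relational rows and as a 3-D array.
models = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
colors = ["BLUE", "RED", "WHITE"]
dealers = ["Clyde", "Gleason", "Carr"]

# First 21 volumes as listed on the slide; the last 6 are not shown there.
volumes = [6, 3, 2, 5, 3, 1, 3, 1, 4, 3, 3, 3, 4, 3, 6, 2, 3, 5, 4, 3, 2] + [None] * 6

# Relational form: one row per combination, four columns per row.
keys = [(m, c, d) for m in models for c in colors for d in dealers]
rows = [(m, c, d, v) for (m, c, d), v in zip(keys, volumes)]
relational_cells = len(rows) * 4

# Multidimensional form: the position in the array encodes model, color and dealer,
# so only the volume itself has to be stored.
cube = [[[None] * len(dealers) for _ in colors] for _ in models]
for m, c, d, v in rows:
    cube[models.index(m)][colors.index(c)][dealers.index(d)] = v
cube_cells = len(models) * len(colors) * len(dealers)

print(relational_cells, cube_cells)
print(cube[0][1][0])   # MINI VAN, RED, Clyde
```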


Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array
- The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table

Storage Space
- Very low space consumption compared to a relational DB

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad hoc queries

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example


If members of different dimensions do not interact, a blank cell is left behind.

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: Employee Age grid with LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) on one axis and EMPLOYEE # (31, 41, 23, 01, 14, 54, 03, 12, 33) on the other; each employee fills a single cell and the remaining cells stay blank]
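The sparsity of this example can be quantified with a short Python sketch that builds the full LAST NAME by EMPLOYEE # grid and counts the filled cells. Representing the grid as a dictionary keyed by member pairs is an assumption made for the illustration.

```python
# Sketch: density of a LAST NAME x EMPLOYEE # grid holding one age per employee.
employees = [("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
             ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
             ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19)]

names = sorted({name for name, _, _ in employees})
numbers = sorted({emp for _, emp, _ in employees})

# Build the full two-dimensional grid; only the cell where a name meets its own
# employee number ever receives a value.
grid = {(n, e): None for n in names for e in numbers}
for name, emp, age in employees:
    grid[(name, emp)] = age

total = len(grid)
filled = sum(1 for v in grid.values() if v is not None)
print(total, filled, round(filled / total, 3))
```

Only 9 of the 81 cells carry data, because each last name corresponds to exactly one employee number.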


OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members
Trend analysis over sequential time periods
What-if scenarios
Slicing / dicing subsets for on-screen viewing
Rotation to new dimensional comparisons in the viewing area
Drill-down/up along the hierarchy
Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

Complex queries and sorts in a relational environment are translated to a simple rotation.

Sales Volumes

View #1 (MODEL by COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

View #2 (COLOR by MODEL, after rotating 90 degrees)
       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation


[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown in six orientations, View #1 to View #6, each reached by a 90-degree rotation]

A 3-dimensional array has 6 views.
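One way to read the view counts is that each view corresponds to one ordering of the cube's axes. Under that assumption (an interpretation of the slide, not something it states), the counts 2 and 6 are the number of permutations of 2 and 3 dimensions, as the Python sketch below shows.

```python
# Sketch: counting the views of an array as the orderings of its dimensions.
from itertools import permutations
from math import factorial

def views(dimensions):
    """Each ordering of the axes is one way of presenting the same array."""
    return list(permutations(dimensions))

two_d = views(["MODEL", "COLOR"])
three_d = views(["MODEL", "COLOR", "DEALERSHIP"])
print(len(two_d), len(three_d))
for view in three_d:
    print(" x ".join(view))
```

With this reading, an array with n dimensions has n! views, so the number of possible presentations grows quickly as dimensions are added.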



Features of OLAP - Slicing / Filtering


MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: Sales Volumes cube (MODEL, DEALERSHIP, COLOR) sliced down to models Mini Van and Coupe, dealerships Carr and Clyde, and colors Normal Blue and Metal Blue]

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales at Region/District/Dealership level

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
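Roll-up and drill-down along this hierarchy can be sketched in Python. The assignment of dealerships to districts (two per district, in the order listed) and the sales figures are assumptions made for the example; the slide only names the members of each level.

```python
# Hypothetical sketch: roll dealership sales up to district and region, and drill back down.
from collections import defaultdict

hierarchy = {  # dealership -> (district, region); the grouping is assumed
    "Clyde": ("Chicago", "Midwest"), "Gleason": ("Chicago", "Midwest"),
    "Carr": ("St. Louis", "Midwest"), "Levi": ("St. Louis", "Midwest"),
    "Lucas": ("Gary", "Midwest"), "Bolton": ("Gary", "Midwest"),
}
sales = {"Clyde": 10, "Gleason": 7, "Carr": 12, "Levi": 4, "Lucas": 6, "Bolton": 9}

def roll_up(level):
    """Aggregate dealership sales; level 0 = district, level 1 = region."""
    totals = defaultdict(int)
    for dealership, amount in sales.items():
        totals[hierarchy[dealership][level]] += amount
    return dict(totals)

def drill_down(district):
    """Return the dealership-level detail behind one district total."""
    return {d: sales[d] for d, (dist, _) in hierarchy.items() if dist == district}

print(roll_up(1))
print(roll_up(0))
print(drill_down("Chicago"))
```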


OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M), on a scale of 0 to 200, by region (East, West, Central) for Year 1999 and Year 2000]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M), on a scale of 0 to 90, by region (East, West, Central) for the 1st to 4th quarters of Year 1999]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by region (East, West, Central) for January, February and March of Year 1999.]

Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
  A multidimensional database (MDDB) provides both the database layer and the application logic layer.

ROLAP - Relational OLAP
  Accesses data stored in a relational data warehouse for OLAP analysis. The database and the application logic are provided as separate layers.

HOLAP - Hybrid OLAP
  The OLAP server routes queries first to the MDDB, then to the RDBMS, and processes the result on the fly in the server.

DOLAP - Desktop OLAP
  A personal MDDB server and application on the desktop.
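The HOLAP routing described above can be illustrated with a toy server. This is only a sketch under stated assumptions: the `HolapServer` class, its method names and the data are hypothetical, a dictionary stands in for the MDDB and a list of rows stands in for the RDBMS.

```python
class HolapServer:
    """Toy OLAP server: summaries live in a cube, detail lives in rows."""

    def __init__(self, summary_cube, detail_rows):
        self.summary_cube = summary_cube    # stands in for the MDDB
        self.detail_rows = detail_rows      # stands in for the RDBMS

    def total_sales(self, region, dealership=None):
        """Return (value, source) for a region or for one dealership in it."""
        if dealership is None and region in self.summary_cube:
            # Summary-level question: answered from the MDDB stand-in.
            return self.summary_cube[region], "MDDB"
        # Otherwise the answer is computed on the fly from the detail rows.
        value = sum(
            row["amount"] for row in self.detail_rows
            if row["region"] == region
            and (dealership is None or row["dealership"] == dealership)
        )
        return value, "RDBMS"

# Hypothetical detail data; only the Midwest summary has been pre-computed.
detail = [
    {"region": "Midwest", "dealership": "Clyde", "amount": 10},
    {"region": "Midwest", "dealership": "Carr", "amount": 5},
    {"region": "East", "dealership": "Dealer X", "amount": 7},
]
server = HolapServer(summary_cube={"Midwest": 15}, detail_rows=detail)
```

The caller receives the same kind of answer either way, which is the sense in which the source of the data stays transparent to the end user.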

MOLAP - MDDB storage

[Diagram components: OLAP Cube; OLAP Calculation Engine; and the clients: Web Browser, OLAP Tools, OLAP Applications.]

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for:
  Maximum query performance
  Optimum space utilization

ROLAP - Standard SQL storage

[Diagram components: Relational DW, accessed through SQL via an MDDB-to-relational mapping; OLAP Calculation Engine; and the clients: Web Browser, OLAP Tools, OLAP Applications.]

ROLAP - Features

Three-tier hardware/software architecture:
  GUI on client; multidimensional processing on mid-tier server; target database on database server
  Processing split between mid-tier and database servers
Ad hoc query capabilities to very large databases
DW integration
Data scalability

HOLAP - Combination of RDBMS and MDDB

[Diagram components: OLAP Cube and Relational DW (accessed through SQL); OLAP Calculation Engine; and any client: Web Browser, OLAP Tools, OLAP Applications.]

HOLAP - Features

RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of RDBMS and MDDB performance
Calculation engine provides full analysis features
Source of data transparent to end user

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
  MOLAP: With good design, 3 to 10 times
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis
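The comparison stresses that sparsity in an MDDB needs to be estimated. One rough way to do that, sketched here with made-up dimension sizes, is to compare the number of cells the cube could hold with the number of combinations that actually carry data. The `cube_density` helper is an illustration, not a vendor sizing formula.

```python
from math import prod

def cube_density(dimension_sizes, populated_cells):
    """Return (total_cells, density) for a cube with the given dimensions."""
    total_cells = prod(dimension_sizes)
    return total_cells, populated_cells / total_cells

# Hypothetical cube: 1,000 products x 500 stores x 365 days,
# of which only 2,000,000 combinations actually recorded a sale.
total, density = cube_density([1000, 500, 365], populated_cells=2_000_000)
print(f"{total:,} possible cells, {density:.2%} populated")
```

A low density means most cells are empty, which is the sparsity that a dense multidimensional layout would have to store or compress.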

Representative OLAP Tools:

Oracle Express Products
Hyperion Essbase
Cognos - PowerPlay
Seagate - Holos
SAS
MicroStrategy - DSS Agent
Informix MetaCube
Brio Query
Business Objects / Web Intelligence

Sample OLAP Applications

Sales Analysis
Financial Analysis
Profitability Analysis
Performance Analysis
Risk Management
Profiling & Segmentation
Scorecard Application
NPA Management
Strategic Planning
Customer Relationship Management (CRM)

Data Warehouse Testing

Data Warehouse Testing Overview

The cost of fixing software defects rises exponentially the later they are found in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.

The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

Difference In Testing Data Warehouse and Transaction System

Data warehouse testing is different on the following counts:
  User-triggered vs. system-triggered
  Volume of test data
  Possible scenarios / test cases
  Programming for testing challenge

Difference In Testing Data Warehouse and Transaction System

User-triggered vs. system-triggered

In a data warehouse, most of the testing is system-triggered. In production/source systems, most of the testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).

Difference In Testing Data Warehouse and Transaction System

Volume of test data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, because one tries to cover the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases
In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

Difference In Testing Data Warehouse and Transaction System

Programming for testing challenge
In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
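A stand-alone comparison script of the kind described above might look like the following sketch. The business rule in `transform`, the field names and the sample rows are all hypothetical; the point is only the shape of the check, which re-applies the expected rule to each source row and compares the result with what was loaded.

```python
def transform(row):
    """Hypothetical business rule: tidy the name and convert cents to dollars."""
    return {"id": row["id"], "name": row["name"].strip().upper(),
            "amount": row["amount_cents"] / 100}

def reconcile(source_rows, target_rows):
    """Compare pre-transformation rows with post-transformation rows.

    Returns a list of human-readable discrepancies; empty means a match.
    """
    issues = []
    if len(source_rows) != len(target_rows):
        issues.append(
            f"row count: source={len(source_rows)} target={len(target_rows)}")
    target_by_id = {row["id"]: row for row in target_rows}
    for src in source_rows:
        expected = transform(src)
        actual = target_by_id.get(src["id"])
        if actual is None:
            issues.append(f"id {src['id']}: missing in target")
        elif actual != expected:
            issues.append(f"id {src['id']}: expected {expected}, found {actual}")
    return issues

source = [{"id": 1, "name": " joe ", "amount_cents": 1200},
          {"id": 2, "name": "fred", "amount_cents": 1100}]
good_target = [transform(r) for r in source]
# A target with a wrongly scaled amount for id 2.
bad_target = [good_target[0], {"id": 2, "name": "FRED", "amount": 1.1}]
```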

Data Warehouse Testing Process

Data warehouse testing is basically divided into two parts:
  'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
  'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP

Testing phases consist of:
  Requirements testing
  Unit testing
  Integration testing
  Performance testing
  Acceptance testing

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
  Are the requirements complete?
  Are the requirements singular?
  Are the requirements unambiguous?
  Are the requirements developable?
  Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
  Whether the ETLs are accessing and picking up the right data from the right source.
  Whether all the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
  Testing the rejected records that don't fulfil the transformation rules.
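Testing rejected records can be sketched as a small unit test around a loader that separates good rows from rejects. The validation rule, the `load_with_rejects` helper and the sample rows are assumptions made for illustration; an actual ETL tool would express the rule in its own mapping language.

```python
def load_with_rejects(rows):
    """Split rows into loaded and rejected according to a sample rule.

    Hypothetical rule: qty must be a positive integer and prodId non-empty.
    """
    loaded, rejected = [], []
    for row in rows:
        qty_ok = isinstance(row.get("qty"), int) and row["qty"] > 0
        if qty_ok and row.get("prodId"):
            loaded.append(row)
        else:
            rejected.append(row)
    return loaded, rejected

rows = [
    {"prodId": "p1", "qty": 1},
    {"prodId": "", "qty": 2},      # rejected: missing product key
    {"prodId": "p2", "qty": -5},   # rejected: negative quantity
]
loaded, rejected = load_with_rejects(rows)
```

A useful invariant for such a test is that every input row ends up in exactly one of the two outputs, so nothing is silently dropped.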

Unit Testing
Unit testing the report data:
  Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
  Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
  Derivation formulae/calculation rules should be verified.

Integration Testing
Integration testing involves the following:
  Sequence of ETL jobs in a batch.
  Initial loading of records into the data warehouse.
  Incremental loading of records at a later date, to verify the newly inserted or updated data.
  Testing the rejected records that don't fulfil the transformation rules.
  Error log generation.
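The initial-load and incremental-load checks above can be sketched with an in-memory target. The `incremental_load` function, the key name and the rows are hypothetical; the sketch only shows what such a test would assert, namely the counts of inserted and updated records and the final state of the target.

```python
def incremental_load(target, changes, key="custId"):
    """Apply inserts and updates to target (a dict keyed by `key`).

    Returns counts so a test can check what the load reported.
    """
    inserted = updated = 0
    for row in changes:
        if row[key] in target:
            if target[row[key]] != row:
                target[row[key]] = row
                updated += 1
        else:
            target[row[key]] = row
            inserted += 1
    return {"inserted": inserted, "updated": updated}

# Initial load, then an incremental batch at a later date.
warehouse = {}
initial = [{"custId": 53, "city": "sfo"}, {"custId": 81, "city": "sfo"}]
first = incremental_load(warehouse, initial)
delta = [{"custId": 81, "city": "la"}, {"custId": 111, "city": "la"}]
second = incremental_load(warehouse, delta)
```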

Performance Testing
Performance testing should check for:
  ETL processes completing within the time window.
  Monitoring and measuring the data quality issues.
  Refresh times for standard/complex reports.
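The first check, that an ETL process completes within its time window, can be sketched as a timing wrapper. The `run_within_window` helper and the stand-in job are hypothetical; in practice the job would be the real batch and the window would come from the agreed load schedule.

```python
import time

def run_within_window(job, window_seconds):
    """Run job() and report whether it finished inside the time window."""
    started = time.perf_counter()
    result = job()
    elapsed = time.perf_counter() - started
    return {"result": result, "elapsed": elapsed,
            "within_window": elapsed <= window_seconds}

def sample_etl_job():
    """Hypothetical stand-in for an ETL batch: sums a range of numbers."""
    return sum(range(100_000))

report = run_within_window(sample_etl_job, window_seconds=5.0)
```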

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.

Questions

Thank You

Components of Warehouse
Source Tables: Real-time, volatile data in relational databases for transaction processing (OLTP). Sources can be any relational databases or flat files.
ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from sources to target.
Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
End-user tools for analysis and reporting: Produce reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture

This is a basic design, in which source files are loaded into a warehouse and users query the data for different purposes.

This design has a staging area, where the data is loaded and tested after cleansing and transformation. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.

Data Modeling
Effective way of using a Data Warehouse

Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
Relationship: Relates entities to other entities.

Different Perspectives of Data Modeling
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of Dimensional Data Models most commonly used:
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
Dimension: A category of information. For example, the Time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year -> Quarter -> Month -> Day.
Fact Table: A table that contains the measures of interest.
Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
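The role of surrogate keys in a slowly changing dimension can be illustrated with a small sketch. It assumes the common "Type 2" approach, in which each change to an attribute creates a new row with a new surrogate key and the old row is closed off; the slides do not say which approach they have in mind, and the class, field names and dates below are hypothetical.

```python
from itertools import count

class CustomerDimension:
    """Toy Type 2 slowly changing dimension keyed by a surrogate key."""

    def __init__(self):
        self.rows = []                 # every version ever loaded
        self._next_key = count(1)      # surrogate key generator

    def current(self, cust_id):
        """Return the open version for a natural key, or None."""
        for row in self.rows:
            if row["custId"] == cust_id and row["valid_to"] is None:
                return row
        return None

    def upsert(self, cust_id, city, load_date):
        """Add a new version when the city changes; return its surrogate key."""
        current = self.current(cust_id)
        if current and current["city"] == city:
            return current["sk"]              # nothing changed
        if current:
            current["valid_to"] = load_date   # close the old version
        row = {"sk": next(self._next_key), "custId": cust_id, "city": city,
               "valid_from": load_date, "valid_to": None}
        self.rows.append(row)
        return row["sk"]

dim = CustomerDimension()
k1 = dim.upsert(53, "sfo", "1997-01-07")
k2 = dim.upsert(53, "sfo", "1997-02-07")   # unchanged: same surrogate key
k3 = dim.upsert(53, "la", "1997-03-08")    # changed: new surrogate key
```

Fact rows loaded before the change keep pointing at the old surrogate key, so history is preserved even though the natural key (custId) is the same.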

Star Schema

Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
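The star schema above can be loaded into an in-memory SQLite database to show how a query joins the fact table to its dimensions. The table contents are the ones on the slide; the column types and the particular query are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO product  VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store    VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO customer VALUES (53,'joe','10 main','sfo'),
                            (81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                        ('o102','2/7/97',53,'p2','c1',2,11),
                        ('105','3/8/97',111,'p1','c3',5,50);
""")

# Total amount per store city and product name: one join per dimension.
rows = conn.execute("""
    SELECT s.city, p.name, SUM(f.amt) AS total_amt
    FROM sale AS f
    JOIN store   AS s ON s.storeId = f.storeId
    JOIN product AS p ON p.prodId  = f.prodId
    GROUP BY s.city, p.name
    ORDER BY s.city, p.name
""").fetchall()
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.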

Snowflake Schema

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.
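In the snowflake variant the dimension itself is split into related tables, so reaching an outer attribute takes more than one join. The sketch below loads the slide's tables into SQLite; the column types and the query are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT,
                     regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY,
                     cityId TEXT REFERENCES city(cityId),
                     tId TEXT REFERENCES sType(tId), mgr TEXT);
INSERT INTO region VALUES ('north','cold region'), ('south','warm region');
INSERT INTO city   VALUES ('sfo','1M','north'), ('la','5M','south');
INSERT INTO sType  VALUES ('t1','small','downtown'), ('t2','large','suburbs');
INSERT INTO store  VALUES ('s5','sfo','t1','joe'), ('s7','sfo','t2','fred'),
                          ('s9','la','t1','nancy');
""")

# Resolving a store's region takes two hops: store -> city -> region.
rows = conn.execute("""
    SELECT st.storeId, st.mgr, t.size, r.name
    FROM store AS st
    JOIN city   AS c ON c.cityId = st.cityId
    JOIN region AS r ON r.regId  = c.regId
    JOIN sType  AS t ON t.tId    = st.tId
    ORDER BY st.storeId
""").fetchall()
```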

Overview of Data Cleansing

The Need For Data Quality

Difficulty in decision making
Time delays in operation
Organizational mistrust
Data ownership conflicts
Customer attrition
Costs associated with:
  error detection
  error rework
  customer service
  fixing customer problems

Six Steps To Data Quality

1. Understand Information Flow In Organization
   Identify authoritative data sources
   Interview employees & customers
2. Identify Potential Problem Areas & Assess Impact
   Data entry points
   Cost of bad data
3. Measure Quality Of Data
   Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
4. Clean & Load Data
   Use data cleansing tools to clean data at the source
   Load only clean data into the data warehouse
5. Continuous Monitoring
   Schedule periodic cleansing of source data
6. Identify Areas of Improvement
   Identify & correct cause of defects
   Refine data capture mechanisms at source
   Educate users on importance of DQ
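Step 3, measuring the quality of data, can be sketched as a small profiling routine that counts missing values and duplicate keys. This is a hand-written illustration, not one of the business rule discovery tools the slide refers to; the `measure_quality` function and the sample records are hypothetical.

```python
from collections import Counter

def measure_quality(records, key, required):
    """Count missing required values and duplicate keys in a list of dicts."""
    missing = Counter()
    for record in records:
        for field in required:
            if record.get(field) in (None, ""):
                missing[field] += 1
    key_counts = Counter(record.get(key) for record in records)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return {"rows": len(records), "missing": dict(missing),
            "duplicate_keys": duplicates}

# Hypothetical customer extract with one missing city and one repeated key.
customers = [
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 81, "name": "fred", "city": ""},
    {"custId": 53, "name": "joe", "city": "sfo"},
]
report = measure_quality(customers, key="custId", required=["name", "city"])
```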

Data Quality Solution

Customized Programs
  Strengths:
    Address specific needs
    No bulky one-time investment
  Limitations:
    Large numbers of custom programs in different environments are difficult to manage
    Minor alterations demand coding effort
Data Quality Assessment Tools
  Strength:
    Provide automated assessment
  Limitation:
    No measure of data accuracy

Data Quality Solution

Business Rule Discovery Tools
  Strengths:
    Detect correlation in data values
    Can detect patterns of behavior that indicate fraud
  Limitations:
    Not all variables can be discovered
    Some discovered rules might not be pertinent
    There may be performance problems with large files or with many fields
Data Reengineering & Cleansing Tools
  Strengths:
    Usually are integrated packages with cleansing features as an add-on
  Limitations:
    Error prevention at source is usually absent
    The ETL tools have limited cleansing facilities

Tools In The Market

Business Rule Discovery Tools
  Integrity Data Reengineering Tool from Vality Technology
  Trillium Software System from Harte-Hanks Data Technologies
  Migration Architect from DB Star
Data Reengineering & Cleansing Tools
  Carlton Pureview from Oracle
  ETI-Extract from Evolutionary Technologies
  PowerMart from Informatica Corp
  Sagent Data Mart from Sagent Technology
Data Quality Assessment Tools
  Migration Architect, Evoke Axio from Evoke Software
  Wizrule from Wizsoft
Name & Address Cleansing Tools
  Centrus Suite from Sagent
  I.d.centric from First Logic

Data Extraction, Transformation, Load

ETL Architecture

[Diagram: the ETL flow in five stages, supported by a Meta Data Repository.
Data Collection: visitors, web browsers, the Internet; web server logs & e-commerce transaction data; flat files; external data (demographics, household, webographics, income); other OLTP systems.
Data Extraction: scheduled extraction into the staging area (RDBMS).
Data Transformation: clean, transform, match, merge.
Data Loading: scheduled loading.
Data Storage & Integration: Enterprise Data Warehouse.]

ETL Architecture

Data extraction:
  Rummages through a file or database
  Uses some criteria for selection
  Identifies qualified data
  Transports the data over onto another file or database

Data extraction cleanup:
  Restructuring of records or fields
  Removal of operational-only data
  Supply of missing field values
  Data integrity checks
  Data consistency and range checks, etc.

Data transformation:
  Integrating dissimilar data types
  Changing codes
  Adding a time attribute
  Summarizing data
  Calculating derived values
  Renormalizing data

Data loading:
  Initial and incremental loading
  Updating of metadata
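The extract, transform and load steps listed above can be sketched end to end with the standard library. Everything specific here is an assumption for illustration: the CSV layout, the cleaning rules, the `sales` table and the load date. A commercial ETL tool would express the same steps through mappings rather than hand-written code.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract with untidy names, mixed-case codes
# and one missing amount.
SOURCE_CSV = """custId,name,city,amount
53, joe ,SFO,12
81,fred,sfo,
111,sally,LA,50
"""

def extract(text):
    """Read rows from a flat file (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows, load_date):
    """Clean fields, supply missing values and add a time attribute."""
    cleaned = []
    for row in rows:
        cleaned.append((
            int(row["custId"]),
            row["name"].strip(),
            row["city"].strip().lower(),   # changing codes to one form
            float(row["amount"] or 0),     # supply of a missing field value
            load_date,                     # adding a time attribute
        ))
    return cleaned

def load(conn, rows):
    """Write transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(custId INTEGER, name TEXT, city TEXT, "
                 "amount REAL, load_date TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV), load_date="2009-01-01"))
loaded = conn.execute(
    "SELECT custId, name, city, amount FROM sales ORDER BY custId").fetchall()
```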

Why ETL?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.

To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.

Major components involved in ETL Processing

Design manager: Lets developers define source-to-target mappings, transformations, process flows and jobs.
Metadata management: Provides a repository to define, document and manage information about the ETL design and runtime processes.
Extract: The process of reading data from a database.
Transform: The process of converting the extracted data.
Load: The process of writing the data into the target database.
Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures and reconcile outputs with source systems.

ETL Tools
  Provide the facility to specify a large number of transformation rules with a GUI
  Generate programs to transform data
  Handle multiple data sources
  Handle data redundancy
  Generate metadata as output
  Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
  PowerCenter/PowerMart from Informatica
  Data Mart Solution from Sagent Technology
  DataStage from Ascential

Metadata Management

What Is Metadata?
Metadata is information:
  That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
  About the data being captured and loaded into the warehouse
  Documented in IT tools, which improves both business and technical understanding of data and data-related processes

Importance Of Metadata
Locating information
  How much time is spent looking for information?
  How often is the information found?
  What poor decisions were made based on incomplete information?
  How much money was lost or earned as a result?
Interpreting information
  How many times have businesses needed to rework or recall products?
  What impact does it have on the bottom line?
  How many mistakes were due to misinterpretation of existing documentation?
  How much misinterpretation results from too much metadata?
  How much time is spent trying to determine if any of the metadata is accurate?
Integrating information
  How do various data perspectives connect together?
  How much time is spent trying to figure that out?
  How much do the inefficiency and the lack of metadata affect decision making?

Requirements for DW Metadata Management

Provide a simple catalogue of business metadata descriptions and views
Document/manage metadata descriptions from an integrated development environment
Enable DW users to identify and invoke pre-built queries against the data stores
Design and enhance new data models and schemas for the data warehouse
Capture data transformation rules between the operational and data warehousing databases
Provide change impact analysis, and update across these technologies
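Two of the requirements above, capturing transformation rules and providing change impact analysis, can be sketched with a tiny in-memory catalogue. The `Mapping` record, the column names and the rules are hypothetical and stand in for whatever a real metadata repository would store.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    """One source-to-target transformation rule kept as metadata."""
    source: str     # operational column, e.g. "orders.amount_cents"
    target: str     # warehouse column, e.g. "sale.amt"
    rule: str       # plain-language description of the transformation

# Hypothetical catalogue entries; the names are illustrative only.
CATALOGUE = [
    Mapping("orders.amount_cents", "sale.amt", "divide by 100"),
    Mapping("orders.cust_no", "sale.custId", "look up surrogate key"),
    Mapping("crm.cust_no", "customer.custId", "copy as-is"),
]

def impact_of(source_prefix, catalogue):
    """List warehouse columns affected when a source table or column changes."""
    return sorted(m.target for m in catalogue
                  if m.source.startswith(source_prefix))

# Which warehouse columns depend on the operational "orders" table?
affected = impact_of("orders.", CATALOGUE)
```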

Consumers of Metadata
Technical Users
  Warehouse administrator
  Application developer
Business Users - business metadata
  Meanings
  Definitions
  Business rules
Software Tools
  Used in DW life-cycle development
  Metadata requirements for each tool must be identified
  The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools

Third-Party Bridging Tools
Oracle Exchange
  Technology of choice for a long list of repository, enterprise and workgroup vendors
Reischmann-Informatik Toolbus
  Features include facilitation of selective bridging of metadata
Ardent Software / Dovetail Software - Interplay
  Hub-and-spoke solution for enabling metadata interoperability
  Ardent is focusing on its own engagements, not selling it as an independent product
Informix's Metadata Plug-ins
  Available with Ardent DataStage version 3.6.2, free of cost, for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools

Metadata Repositories
  IBM, Oracle and Microsoft to offer free or near-free basic repository services
  Enable organisations to reuse metadata across technologies
  Integrate DB design, data transformation and BI tools from different vendors
  Multi-tool vendors are taking a bridged or federated rather than integrated approach to sharing metadata
  Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them

Components of Warehouse
Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

206

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

207

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Data Modeling
Effective way of using a Data Warehouse

208

2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
209
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling: Dimension: A category of information. For example, the time dimension. Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension. Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day. Fact Table: A table that contains the measures of interest. Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
210
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

211

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south

Dimension Table
cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.

region regId name north cold region south warm region

212

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Overview of Data Cleansing

213

2009 Wipro Ltd - Confidential

The Need For Data Quality


Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with error detection error rework customer service fixing customer problems

214

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Six Steps To Data Quality


1. Understand Information Flow In Organization
   - Identify authoritative data sources
   - Interview employees & customers

2. Identify Potential Problem Areas & Assess Impact
   - Data entry points
   - Cost of bad data

3. Measure Quality Of Data
   - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

4. Clean & Load Data
   - Use data cleansing tools to clean data at the source
   - Load only clean data into the data warehouse

5. Continuous Monitoring
   - Schedule periodic cleansing of source data

6. Identify Areas of Improvement
   - Identify & correct cause of defects
   - Refine data capture mechanisms at source
   - Educate users on importance of DQ
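Steps 3 and 4 can be illustrated with a small sketch. Everything here is hypothetical: the customer records, the field names, the list of valid city codes and the rejection rule are assumptions chosen only to show the idea of measuring quality first and loading only clean rows.

```python
from collections import Counter

# Hypothetical source records with typical defects.
records = [
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 81, "name": "fred", "city": ""},      # missing value
    {"custId": 111, "name": "sally", "city": "LA"},  # inconsistent code
    {"custId": 53, "name": "joe", "city": "sfo"},    # duplicate key
]
VALID_CITIES = {"nyc", "sfo", "la"}  # assumed reference list


def measure_quality(rows):
    """Count missing, invalid and duplicate values (step 3)."""
    key_counts = Counter(r["custId"] for r in rows)
    return {
        "missing": sum(1 for r in rows if not r["city"]),
        "invalid": sum(1 for r in rows
                       if r["city"] and r["city"] not in VALID_CITIES),
        "duplicate_keys": sum(n - 1 for n in key_counts.values() if n > 1),
    }


def clean(rows):
    """Normalize codes and keep only rows that pass the checks (step 4)."""
    seen, out = set(), []
    for r in rows:
        city = r["city"].strip().lower()
        if r["custId"] in seen or city not in VALID_CITIES:
            continue  # reject: duplicate key or still invalid
        seen.add(r["custId"])
        out.append({**r, "city": city})
    return out


report = measure_quality(records)
clean_rows = clean(records)
print(report, clean_rows)
```

Rejected rows would normally be logged and fed back to the source system rather than silently dropped, in line with step 6.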

Data Quality Solution


Customized Programs
- Strengths:
  - Address specific needs
  - No bulky one-time investment
- Limitations:
  - Tons of custom programs in different environments are difficult to manage
  - Minor alterations demand coding effort

Data Quality Assessment Tools
- Strength: provide automated assessment
- Limitation: no measure of data accuracy

Data Quality Solution


Business Rule Discovery Tools
- Strengths:
  - Detect correlation in data values
  - Can detect patterns of behavior that indicate fraud
- Limitations:
  - Not all variables can be discovered
  - Some discovered rules might not be pertinent
  - There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths:
  - Usually integrated packages with cleansing features as an add-on
- Limitations:
  - Error prevention at source is usually absent
  - The ETL tools have limited cleansing facilities

Tools In The Market


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic

Data Extraction, Transformation, Load


ETL Architecture

[Figure: ETL architecture. Sources: visitors reaching a web server through web browsers and the Internet (web server logs and e-commerce transaction data, flat files), external data (demographics, household, webographics, income), RDBMS and other OLTP systems. Scheduled extraction moves source data into a staging area (clean, transform, match, merge), supported by a metadata repository. Scheduled loading moves the result into the enterprise data warehouse.]

Stages: Data Collection, Data Extraction, Data Transformation, Data Loading, Data Storage & Integration

ETL Architecture
Data extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data extraction cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data loading:
- Initial and incremental loading
- Updating of metadata
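The extract, transform and load steps listed above can be sketched as three small Python functions. This is a minimal illustration under stated assumptions: the CSV source, the column names, the CITY_CODES mapping and the target table are all hypothetical, and a real ETL tool would add scheduling, logging and error handling.

```python
import csv
import io
import sqlite3

# Hypothetical source extract (flat file content).
SOURCE = """orderId,storeCode,qty,unitPrice
o100,NY,1,12.0
o102,NY,2,5.5
o103,,3,4.0
"""
CITY_CODES = {"NY": "nyc", "SF": "sfo"}  # assumed code mapping


def extract(text):
    """Read the source and keep rows meeting the selection criterion."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["storeCode"]]


def transform(rows, load_date):
    """Change codes, calculate a derived value, add a time attribute."""
    out = []
    for r in rows:
        qty = int(r["qty"])
        amt = qty * float(r["unitPrice"])  # derived value
        out.append((r["orderId"], CITY_CODES[r["storeCode"]],
                    qty, amt, load_date))
    return out


def load(conn, rows):
    """Write transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sale "
                 "(orderId TEXT, city TEXT, qty INTEGER, amt REAL, loadDate TEXT)")
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE), "2009-01-01"))
loaded = conn.execute(
    "SELECT orderId, city, amt FROM sale ORDER BY orderId").fetchall()
print(loaded)
```

Calling load again with only new rows would correspond to an incremental load. The row without a store code is filtered out at extraction, which stands in for the cleanup checks.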


Why ETL?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.

To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Major components involved in ETL Processing


- Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs
- Meta data management: provides a repository to define, document, and manage information about the ETL design and runtime processes
- Extract: the process of reading data from a database
- Transform: the process of converting the extracted data
- Load: the process of writing the data into the target database
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCenter/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes

Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users (business metadata)
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools


Third Party Bridging Tools

Oracle Exchange
- Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik-Toolbus
- Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
- Hub-and-spoke solution for enabling metadata interoperability
- Ardent is focusing on its own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
- Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them

Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML: addresses context and data meaning, not presentation
- Can enable exchange over the web, employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners, including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

OLAP: On-Line Analytical Processing


- OLAP can be defined as a technology that allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
- The term is used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers

Distinction between OLTP and OLAP


Source of data
- OLTP system: operational data; OLTPs are the original source of the data
- OLAP system: consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP system: to control and run fundamental business tasks
- OLAP system: decision support

What the data reveals
- OLTP system: a snapshot of ongoing business processes
- OLAP system: multi-dimensional views of various kinds of business activities

Inserts and updates
- OLTP system: short and fast inserts and updates initiated by end users
- OLAP system: periodic long-running batch jobs refresh the data

MDDB Concepts
- A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions).
- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...           ...    ...      ...

27 x 4 = 108 cells

MDDB

[Figure: Sales Volumes cube with three dimensions: MODEL (Mini Van, Coupe, Sedan), DEALERSHIP (Carr, Gleason, Clyde) and COLOR (Blue, Red, White).]

3 x 3 x 3 = 27 cells
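The cell counts on this slide can be reproduced with a small sketch that loads a few of the relational rows into a 3 x 3 x 3 nested list. The list-of-lists layout and the cell helper are assumptions made for illustration; an actual MDDB uses its own storage structures.

```python
MODELS = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
COLORS = ["BLUE", "RED", "WHITE"]
DEALERS = ["Clyde", "Gleason", "Carr"]

# First few relational rows from the slide: (model, color, dealer, volume).
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
]

# Cube addressed by position along each dimension; None marks an empty cell.
cube = [[[None] * len(DEALERS) for _ in COLORS] for _ in MODELS]
for model, color, dealer, vol in rows:
    cube[MODELS.index(model)][COLORS.index(color)][DEALERS.index(dealer)] = vol


def cell(model, color, dealer):
    """Read one cell by its three dimension members."""
    return cube[MODELS.index(model)][COLORS.index(color)][DEALERS.index(dealer)]


relational_cells = 27 * 4  # 27 rows x 4 columns
cube_cells = len(MODELS) * len(COLORS) * len(DEALERS)
print(relational_cells, cube_cells, cell("MINI VAN", "RED", "Clyde"))
```

The relational form repeats the member names on every row, while the cube stores only the measure and derives the members from the position.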


Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array
- The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table

Storage Space
- Very low space consumption compared to a relational DB

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example


If members of different dimensions do not interact, a blank cell is left behind.

Employee Age (relational table)

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: the same Employee Age data laid out as an array of LAST NAME by EMPLOYEE #, where each last name matches only one employee number, shown next to the dense Sales Volumes array of MODEL by COLOR (Mini Van 6 5 4, Coupe 3 5 5, Sedan 4 3 2 for Blue, Red, White).]
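A short sketch makes the sparsity of the employee array concrete. It assumes the array has one axis per distinct last name and one per distinct employee number, as the figure suggests, and uses the nine rows from the slide.

```python
# (last name, employee number, age) rows from the slide.
employees = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]

names = sorted({name for name, _, _ in employees})
emp_nos = sorted({emp for _, emp, _ in employees})

# Only combinations that actually occur hold a value.
populated = {(name, emp): age for name, emp, age in employees}

total_cells = len(names) * len(emp_nos)
density = len(populated) / total_cells
print(total_cells, len(populated), round(density, 3))
```

Adding a third dimension with the same one-to-one relationship would multiply the number of cells again without adding any populated ones, which is why sparsity grows with the number of dimensions.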


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

View #2 (rotated 90 degrees): Sales Volumes, COLOR by MODEL

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation


[Figure: six views (View #1 to View #6) of the Sales Volumes cube. Each view is reached by rotating the cube 90 degrees, which changes how the MODEL, COLOR and DEALERSHIP dimensions are arranged in the viewing area.]

A 3-dimensional array has 6 views.
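One way to read the view counts is that each view corresponds to an ordering of the dimensions, so an array with n dimensions has n! views. The sketch below illustrates that reading. Treating a rotation of the 2-dimensional array as a transpose is an assumption, and the rotate helper is hypothetical.

```python
from itertools import permutations

# Sales Volumes as MODEL (rows: Mini Van, Coupe, Sedan)
# by COLOR (columns: Blue, Red, White).
sales = [
    [6, 5, 4],
    [3, 5, 5],
    [4, 3, 2],
]


def rotate(matrix):
    """Swap the two axes, so rows become columns."""
    return [list(col) for col in zip(*matrix)]


views_2d = list(permutations(["MODEL", "COLOR"]))
views_3d = list(permutations(["MODEL", "COLOR", "DEALERSHIP"]))

print(rotate(sales))
print(len(views_2d), len(views_3d))
```

No data is re-queried or re-sorted in a rotation; only the assignment of dimensions to screen axes changes.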



Features of OLAP - Slicing / Filtering


An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube (MODEL, DEALERSHIP, COLOR) sliced down to models Mini Van and Coupe, dealerships Carr and Clyde, and colors Normal Blue and Metal Blue.]
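Slicing can be sketched as filtering a cube on a subset of members per dimension. The cell values below are hypothetical, since the slide gives none for this cube, and the dictionary layout and slice_cube helper are assumptions for illustration.

```python
# Cube keyed by (model, dealership, color); volumes are made up.
cube = {
    ("Mini Van", "Carr", "Normal Blue"): 4,
    ("Mini Van", "Clyde", "Metal Blue"): 2,
    ("Coupe", "Carr", "Metal Blue"): 3,
    ("Coupe", "Gleason", "Red"): 5,
    ("Sedan", "Clyde", "White"): 1,
}


def slice_cube(cube, models=None, dealers=None, colors=None):
    """Keep cells whose members are in the requested sets (None keeps all)."""
    def keep(member, wanted):
        return wanted is None or member in wanted

    return {
        key: vol
        for key, vol in cube.items()
        if keep(key[0], models) and keep(key[1], dealers) and keep(key[2], colors)
    }


blue_slice = slice_cube(
    cube,
    models={"Mini Van", "Coupe"},
    dealers={"Carr", "Clyde"},
    colors={"Normal Blue", "Metal Blue"},
)
print(blue_slice)
```

Restricting a single dimension to one member gives a classic slice; restricting several dimensions at once, as here, is usually called dicing.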

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at the Region, District or Dealership level.

Moving up and down in a hierarchy is referred to as drill-up (roll-up) and drill-down.
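Roll-up and drill-down along this hierarchy can be sketched as follows. The assignment of dealerships to districts and the sales figures are assumptions made for the example; the slide names the members of each level but not their values, and the extraction does not preserve which dealership belongs to which district.

```python
from collections import defaultdict

# Assumed parent of each member, one mapping per hierarchy level.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales at the lowest (dealership) level.
sales = {"Clyde": 10, "Gleason": 20, "Carr": 5,
         "Levi": 15, "Lucas": 8, "Bolton": 12}


def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


def drill_down(parent, values, parent_of):
    """Show the children that make up one parent's total."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


by_district = roll_up(sales, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
print(by_region, by_district, drill_down("Chicago", sales, DISTRICT_OF))
```

OLAP servers often precompute such totals, which is one source of the data explosion due to summarization mentioned earlier.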


OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by region (East, West, Central) for Year 1999 and Year 2000.]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) by region (East, West, Central) for the 1st to 4th quarters of Year 1999.]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by region (East, West, Central) for January, February and March of Year 1999.]

Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic are provided as separate layers

HOLAP - Hybrid OLAP
- The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Figure: MOLAP architecture. An OLAP cube (MDDB storage) and an OLAP calculation engine serve web browsers, OLAP tools and OLAP applications.]

MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  - Maximum query performance
  - Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture. A relational DW is accessed through SQL by an MDDB-relational mapping layer and an OLAP calculation engine, which serve web browsers, OLAP tools and OLAP applications.]

ROLAP - Features

Three-tier hardware/software architecture:
- GUI on client; multidimensional processing on mid-tier server; target database on database server
- Processing split between mid-tier and database servers

Other features:
- Ad hoc query capabilities against very large databases
- DW integration
- Data scalability

HOLAP - Combination of RDBMS and MDDB


[Figure: HOLAP architecture. An OLAP cube and a relational DW (accessed through SQL) sit behind one OLAP calculation engine, which serves any client: web browsers, OLAP tools and OLAP applications.]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user

Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction level data + summary in MDDB
- ROLAP: Relational OLAP = transaction level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: With good design, 3 to 10 times
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
- MOLAP: Medium: MDDB server + large disk space cost
- ROLAP: Low: only RDBMS + disk space cost
- HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions

The methodology required for testing a Data Warehouse is different from testing a typical transaction system


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge

Difference In Testing Data warehouse and Transaction System

User-triggered vs. system-triggered

In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles there cover system-triggered scenarios (like billing or valuation).

Difference In Testing Data warehouse and Transaction System


Volume of test data
- The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases
- In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.

Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
- In transaction systems, users/business analysts typically test the output of the system.
- In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
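A stand-alone comparison script of the kind described above might look like the following sketch. The transformation rule being tested (total amount per store), the field names and the sample rows are assumptions for illustration only.

```python
def expected_target(source_rows):
    """Re-apply the assumed transformation rule: total amt per store."""
    totals = {}
    for row in source_rows:
        totals[row["storeId"]] = totals.get(row["storeId"], 0) + row["amt"]
    return totals


def reconcile(source_rows, target_rows):
    """Return (storeId, expected, actual) for every mismatch."""
    expected = expected_target(source_rows)
    actual = {row["storeId"]: row["total_amt"] for row in target_rows}
    issues = []
    for store in sorted(set(expected) | set(actual)):
        if expected.get(store) != actual.get(store):
            issues.append((store, expected.get(store), actual.get(store)))
    return issues


# Pre-transformation (source) data and two candidate loads.
source = [
    {"storeId": "c1", "amt": 12},
    {"storeId": "c1", "amt": 11},
    {"storeId": "c3", "amt": 50},
]
good_target = [{"storeId": "c1", "total_amt": 23},
               {"storeId": "c3", "total_amt": 50}]
bad_target = [{"storeId": "c1", "total_amt": 23},
              {"storeId": "c3", "total_amt": 45}]

print(reconcile(source, good_target))
print(reconcile(source, bad_target))
```

An empty result means the loaded data matches what the rule predicts from the source. Each mismatch names the key and both values, which helps trace the defect back to a specific ETL step.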


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- Back-end testing, where the source systems' data is compared to the end-result data in the loaded area
- Front-end testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP

Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source
- Whether all the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that don't fulfil the transformation rules

Unit Testing
Unit testing the report data:
- Verify report data with the source: data in a data warehouse is stored at an aggregate level compared to source systems. The QA team should verify the granular data stored in the data warehouse against the source data available
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems
- Derivation formulae/calculation rules should be verified

Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify the newly inserted or updated data
- Testing the rejected records that don't fulfil the transformation rules
- Error log generation

Performance Testing
Performance testing should check for:
- ETL processes completing within the time window
- Monitoring and measuring the data quality issues
- Refresh times for standard/complex reports

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.


Questions


Thank You


Data Warehouse Architecture


What makes a Data Warehouse


Components of Warehouse
- Source Tables: real-time, volatile data in relational databases for transaction processing (OLTP). Sources can be any relational databases or flat files.
- ETL Tools: extract, cleanse, transform (aggregates, joins) and load the data from sources to target.
- Maintenance and Administration Tools: authorize and monitor access to the data, set up users, and schedule jobs to run in offshore periods.
- Modeling Tools: used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: produce reports and analyze the data from target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture


This is a basic design: source files are loaded into a warehouse, and users query the data for different purposes.

This design adds a staging area, where the data is loaded and tested after cleansing and transformation. The data is then loaded directly into the target database/warehouse, which is divided into data marts that different users access for their reporting and analysis purposes.

Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
- Entity: an object that can be observed and classified based on its properties and characteristics, such as employee, book or student
- Relationship: relates entities to other entities

Different perspectives of data modeling:
- Conceptual Data Model
- Logical Data Model
- Physical Data Model

Types of dimensional data models most commonly used:
- Star Schema
- Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: a category of information. For example, the Time dimension.
- Attribute: a unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
- Fact Table: a table that contains the measures of interest.
- Lookup Table: provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
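
A star schema like the one above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the column types and the choice of query are not part of the slide, and the rows are copied from the tables as shown. The query joins the fact table to two dimension tables and aggregates the measures.

```python
import sqlite3

# In-memory database; column types are an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
""")

# Rows taken from the slide.
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# The fact table joins directly to each dimension it needs (one hop each).
rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.qty), SUM(f.amt)
    FROM sale f
    JOIN product p ON p.prodId = f.prodId
    JOIN store   s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()

for name, city, qty, amt in rows:
    print(name, city, qty, amt)
```

Every dimension is reachable from the fact table in a single join, which is what keeps star-schema queries simple.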


Snowflake Schema
Dimension Tables

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.
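
To illustrate how a snowflaked dimension is queried, here is a minimal sketch using sqlite3. It assumes the store, city and region rows as read from the slide (the column types are an assumption), and shows that reaching the region name from a store takes a chain of joins rather than a single lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT,
                     regId TEXT REFERENCES region(regId));
CREATE TABLE store  (storeId TEXT PRIMARY KEY,
                     cityId TEXT REFERENCES city(cityId),
                     tId TEXT, mgr TEXT);
""")
conn.executemany("INSERT INTO region VALUES (?, ?)",
                 [("north", "cold region"), ("south", "warm region")])
conn.executemany("INSERT INTO city VALUES (?, ?, ?)",
                 [("sfo", "1M", "north"), ("la", "5M", "south")])
conn.executemany("INSERT INTO store VALUES (?, ?, ?, ?)",
                 [("s5", "sfo", "t1", "joe"),
                  ("s7", "sfo", "t2", "fred"),
                  ("s9", "la", "t1", "nancy")])

# store -> city -> region: two joins to reach an attribute that a star
# schema would typically keep as a column of the store dimension itself.
stores_per_region = conn.execute("""
    SELECT r.name, COUNT(*)
    FROM store s
    JOIN city   c ON c.cityId = s.cityId
    JOIN region r ON r.regId = c.regId
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()

print(stores_per_region)
```

The extra joins are the price of the reduced redundancy in the normalized dimension tables.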


Overview of Data Cleansing


The Need For Data Quality


Poor data quality leads to:
- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with:
  - error detection
  - error rework
  - customer service
  - fixing customer problems


Six Steps To Data Quality


1. Understand Information Flow In Organization
   - Identify authoritative data sources
   - Interview employees & customers
2. Identify Potential Problem Areas & Assess Impact
   - Data entry points
   - Cost of bad data
3. Measure Quality Of Data
   - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
4. Clean & Load Data
   - Use data cleansing tools to clean data at the source
   - Load only clean data into the data warehouse
5. Continuous Monitoring
   - Schedule periodic cleansing of source data
6. Identify Areas of Improvement
   - Identify & correct cause of defects
   - Refine data capture mechanisms at source
   - Educate users on importance of DQ
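
The "measure quality of data" step can be sketched as a small profiling function. This is a minimal illustration, not a substitute for a rule discovery tool: the function name `profile`, the sample records and the age range are all hypothetical.

```python
from collections import Counter

def profile(records, key, required, valid_range):
    """Count duplicate keys, missing required values and out-of-range values."""
    key_counts = Counter(r.get(key) for r in records)
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    missing = sum(1 for r in records for f in required
                  if r.get(f) in (None, ""))
    field, low, high = valid_range
    out_of_range = sum(1 for r in records
                       if r.get(field) is not None
                       and not (low <= r[field] <= high))
    return {
        "rows": len(records),
        "duplicate_keys": duplicates,
        "missing_values": missing,
        "out_of_range": out_of_range,
    }

# Hypothetical source records with one defect of each kind.
records = [
    {"custId": 53,  "name": "joe",   "city": "sfo", "age": 34},
    {"custId": 81,  "name": "fred",  "city": "",    "age": 41},   # missing city
    {"custId": 111, "name": "sally", "city": "la",  "age": 230},  # bad age
    {"custId": 53,  "name": "joe",   "city": "sfo", "age": 34},   # duplicate
]

report = profile(records, key="custId",
                 required=["name", "city"],
                 valid_range=("age", 0, 120))
print(report)
```

Running such a profile on a schedule gives a simple baseline for the continuous monitoring step.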

Data Quality Solution


Customized Programs
- Strengths:
  - Address specific needs
  - No bulky one-time investment
- Limitations:
  - Large numbers of custom programs in different environments are difficult to manage
  - Minor alterations demand coding effort

Data Quality Assessment Tools
- Strength: Provide automated assessment
- Limitation: No measure of data accuracy


Data Quality Solution


Business Rule Discovery Tools
- Strengths:
  - Detect correlations in data values
  - Can detect patterns of behavior that indicate fraud
- Limitations:
  - Not all variables can be discovered
  - Some discovered rules might not be pertinent
  - There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths:
  - Usually are integrated packages with cleansing features as add-ons
- Limitations:
  - Error prevention at source is usually absent
  - The ETL tools have limited cleansing facilities

Tools In The Market


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

[Figure: ETL architecture. Data sources: visitors (web browsers, the Internet), web server logs and e-commerce transaction data, flat files, RDBMS, other OLTP systems, and external data (demographics, household, webographics, income). Scheduled extraction feeds a staging area where data is cleaned, transformed, matched and merged; scheduled loading then populates the enterprise data warehouse. A metadata repository is also shown. Stages: Data Collection, Data Extraction, Data Transformation, Data Loading, Data Storage & Integration.]


ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata
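
The extraction, cleanup, transformation and loading activities listed above can be sketched as a tiny in-memory pipeline. This is a sketch under assumptions: the field names, the code mapping `GENDER_CODES`, the selection criterion and the range limits are all hypothetical, and a real ETL tool would read from and write to actual data stores.

```python
from datetime import date

# Hypothetical mapping used for the "changing codes" step.
GENDER_CODES = {"M": "Male", "F": "Female", "1": "Male", "2": "Female"}

def extract(source_rows, predicate):
    """Select qualifying rows and copy them out of the source."""
    return [dict(r) for r in source_rows if predicate(r)]

def clean(row):
    """Cleanup: drop operational-only data, supply missing values, range check."""
    row = dict(row)
    row.pop("session_id", None)            # operational-only field
    if row.get("country") in (None, ""):
        row["country"] = "UNKNOWN"         # supply missing field value
    if not (0 < row["qty"] <= 1000):       # range check (limits are assumed)
        return None
    return row

def transform(row, load_date):
    """Change codes, calculate a derived value, add a time attribute."""
    out = dict(row)
    out["gender"] = GENDER_CODES.get(str(row["gender"]), "Unknown")
    out["amount"] = row["qty"] * row["unit_price"]
    out["load_date"] = load_date.isoformat()
    return out

def run_etl(source_rows, load_date):
    target, rejected = [], []
    for row in extract(source_rows, lambda r: r["status"] == "COMPLETE"):
        cleaned = clean(row)
        if cleaned is None:
            rejected.append(row)
        else:
            target.append(transform(cleaned, load_date))
    return target, rejected

source = [
    {"status": "COMPLETE", "qty": 2, "unit_price": 10, "gender": "M",
     "country": "US", "session_id": "a1"},
    {"status": "PENDING", "qty": 1, "unit_price": 10, "gender": "F",
     "country": "US", "session_id": "a2"},
    {"status": "COMPLETE", "qty": -5, "unit_price": 10, "gender": "F",
     "country": "US", "session_id": "a3"},
    {"status": "COMPLETE", "qty": 1, "unit_price": 5, "gender": 2,
     "country": "", "session_id": "a4"},
]

target, rejected = run_etl(source, date(2009, 1, 31))
print(len(target), "loaded,", len(rejected), "rejected")
```

Keeping rejected rows in a separate list mirrors the common practice of routing records that fail checks to a reject area instead of silently dropping them.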


Why ETL ?
Companies have valuable data scattered throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.

To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Major components involved in ETL Processing


- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Metadata management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: The process of reading data from a database.
- Transform: The process of converting the extracted data.
- Load: The process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run, and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCenter/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users (business metadata)
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third Party Bridging Tools

Oracle Exchange
- Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik Toolbus
- Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
- Hub-and-spoke solution for enabling metadata interoperability
- Ardent is focusing on its own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
- Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools


OLAP: On-Line Analytical Processing


- OLAP can be defined as a technology that allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
- Used interchangeably with BI
- Multidimensional view of data is the foundation of OLAP
- Users: Analysts, Decision makers

Distinction between OLTP and OLAP


Source of data
- OLTP System: Operational data; OLTPs are the original source of the data
- OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: To control and run fundamental business tasks
- OLAP System: Decision support

What the data reveals
- OLTP System: A snapshot of ongoing business processes
- OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
- OLTP System: Short and fast inserts and updates initiated by end users
- OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions).
- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

27 x 4 = 108 cells

MDDB

[Figure: Sales Volumes cube with axes MODEL (Mini Van, Coupe, Sedan), DEALERSHIP (Carr, Gleason, Clyde) and COLOR (Blue, Red, White).]

3 x 3 x 3 = 27 cells
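
The cell counts on this slide can be reproduced with a short sketch that holds the same sales volumes once as relational rows and once as a cube keyed by (model, color, dealer). The dictionary-based cube is only an illustrative stand-in for an MDDB, and only the 21 volumes visible on the slide are used; the rows hidden behind the "..." are left out.

```python
from itertools import product

models = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
colors = ["BLUE", "RED", "WHITE"]
dealers = ["Clyde", "Gleason", "Carr"]

# The 21 volumes visible on the slide, in row order.
volumes = [6, 3, 2, 5, 3, 1, 3, 1, 4,
           3, 3, 3, 4, 3, 6, 2, 3, 5,
           4, 3, 2]

# product() varies the last list fastest, which matches the slide's row order
# (dealer changes on every row, then color, then model).
coords = list(product(models, colors, dealers))

# Relational form: one row of four columns per combination.
rows = [(m, c, d, v) for (m, c, d), v in zip(coords, volumes)]

# Multidimensional form: one cell per combination, addressed by its members.
cube = {(m, c, d): v for m, c, d, v in rows}

relational_cells = len(coords) * 4                       # 27 rows x 4 columns
cube_cells = len(models) * len(colors) * len(dealers)    # 3 x 3 x 3

print(relational_cells, cube_cells)
print(cube[("SPORTS COUPE", "RED", "Carr")])

# Aggregating along two dimensions is a sum over matching cells.
mini_van_total = sum(v for (m, _, _), v in cube.items() if m == "MINI VAN")
print(mini_van_total)
```

In the cube form the dimension members are stored once as coordinates instead of being repeated on every row, which is where the smaller cell count comes from.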


Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array
- The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than that offered by the relational table

Storage Space
- Very low space consumption compared to a relational DB

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Does not perform better than RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact, a blank cell is left behind.

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: Employee Age array with LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) on one axis and EMPLOYEE # (31, 41, 23, 01, 14, 54, 03, 12, 33) on the other. Each employee fills only one cell, so most cells are blank, unlike the fully populated Sales Volumes array (MODEL by COLOR).]
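
The degree of sparsity in this example can be worked out with a few lines of Python. The sketch below uses the nine employees from the slide and treats the array as a dictionary that only stores populated cells; the variable names are illustrative.

```python
# (last name, employee number, age) as listed on the slide.
employees = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]

names = [name for name, _, _ in employees]
emp_nos = [emp_no for _, emp_no, _ in employees]

# Only combinations that actually occur hold a value.
grid = {(name, emp_no): age for name, emp_no, age in employees}

total_cells = len(names) * len(emp_nos)   # 9 x 9 array
populated = len(grid)                     # one cell per employee
sparsity = 1 - populated / total_cells

print(total_cells, populated, round(sparsity, 3))
```

Because last name and employee number identify the same thing, the two dimensions never combine freely, and roughly 89% of the array stays empty.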


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods, what-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data


Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

(Rotate 90°)

View #2: Sales Volumes, COLOR by MODEL
       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.



Features of OLAP - Rotation


[Figure: six views (View #1 to View #6) of the Sales Volumes cube. Each view is obtained by rotating the cube 90 degrees so that MODEL, COLOR and DEALERSHIP are assigned to different axes.]

A 3-dimensional array has 6 views.
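
The view counts follow from the number of ways the dimensions can be ordered on the axes (n! for n dimensions). The sketch below checks this with itertools.permutations and shows a 2-dimensional rotation as a transpose; the helper names `views` and `rotate` are illustrative, and the 3 x 3 values are the MODEL by COLOR sales volumes from the rotation example.

```python
from itertools import permutations

def views(dimensions):
    """Every assignment of the dimensions to the display axes."""
    return list(permutations(dimensions))

def rotate(table):
    """Rotate a 2-dimensional view by swapping rows and columns."""
    return [list(column) for column in zip(*table)]

two_d = views(["MODEL", "COLOR"])
three_d = views(["MODEL", "COLOR", "DEALERSHIP"])
print(len(two_d), len(three_d))

# Rows: Mini Van, Coupe, Sedan; columns: Blue, Red, White.
view1 = [[6, 5, 4],
         [3, 5, 5],
         [4, 3, 2]]

# Rows: Blue, Red, White; columns: Mini Van, Coupe, Sedan.
view2 = rotate(view1)
for row in view2:
    print(row)
```

No data is re-queried or re-sorted in a rotation; only the mapping of dimensions to axes changes.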



Features of OLAP - Slicing / Filtering


MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: Sales Volumes cube (MODEL, DEALERSHIP, COLOR) and a slice of it restricted to the models Mini Van and Coupe, the dealerships Carr and Clyde, and the colors Normal Blue and Metal Blue.]
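
Slicing and dicing can be sketched on a cube stored as a dictionary keyed by (model, color, dealer). The helpers `slice_cube` and `dice_cube` are hypothetical names, and the handful of cells below is a subset of the sales volumes from the RDBMS v/s MDDB example rather than the exact cube drawn on this slide.

```python
# A few cells of the Sales Volumes cube: (model, color, dealer) -> volume.
cube = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Blue", "Carr"): 2,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Coupe", "Blue", "Clyde"): 3,
    ("Coupe", "Blue", "Carr"): 3,
    ("Coupe", "Red", "Carr"): 6,
}

def slice_cube(cube, axis, member):
    """Fix one dimension to a single member and drop that axis."""
    return {key[:axis] + key[axis + 1:]: value
            for key, value in cube.items() if key[axis] == member}

def dice_cube(cube, models, colors, dealers):
    """Keep the sub-cube made of the chosen members on every dimension."""
    return {key: value for key, value in cube.items()
            if key[0] in models and key[1] in colors and key[2] in dealers}

# Slice: everything sold in Blue, leaving a MODEL by DEALER table.
blue = slice_cube(cube, axis=1, member="Blue")
print(blue)

# Dice: Mini Vans in Blue or Red sold by Clyde.
sub_cube = dice_cube(cube, {"Mini Van"}, {"Blue", "Red"}, {"Clyde"})
print(sub_cube)
```

A slice reduces the number of dimensions by one, while a dice keeps all dimensions but narrows the members on each.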

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at the Region / District / Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
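
Roll-up and drill-down along the organization hierarchy can be sketched as follows. Note the assumptions: the slide does not say which dealership belongs to which district, so the mapping in `HIERARCHY` is hypothetical, and so are the sales figures.

```python
from collections import defaultdict

# Hypothetical dealership -> (district, region) mapping; the slide only
# names the members at each level.
HIERARCHY = {
    "Clyde": ("Chicago", "Midwest"),
    "Gleason": ("Chicago", "Midwest"),
    "Carr": ("St. Louis", "Midwest"),
    "Levi": ("St. Louis", "Midwest"),
    "Lucas": ("Gary", "Midwest"),
    "Bolton": ("Gary", "Midwest"),
}

# Hypothetical sales at the lowest (dealership) level.
sales = {"Clyde": 10, "Gleason": 7, "Carr": 12,
         "Levi": 5, "Lucas": 8, "Bolton": 3}

def roll_up(sales_by_dealer, level):
    """Aggregate dealership sales to the 'district' or 'region' level."""
    index = {"district": 0, "region": 1}[level]
    totals = defaultdict(int)
    for dealer, amount in sales_by_dealer.items():
        totals[HIERARCHY[dealer][index]] += amount
    return dict(totals)

def drill_down(sales_by_dealer, district):
    """Show the dealership-level detail behind one district total."""
    return {dealer: amount for dealer, amount in sales_by_dealer.items()
            if HIERARCHY[dealer][0] == district}

by_district = roll_up(sales, "district")
by_region = roll_up(sales, "region")
chicago_detail = drill_down(sales, "Chicago")
print(by_region, by_district, chicago_detail)
```

Each drill-down step simply replaces one aggregate with the members of the next level down that contributed to it.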


OLAP Reporting - Drill Down

[Figure: bar chart "Inflows (Region, Year)" showing Inflows ($M) for the East, West and Central regions in the years 1999 and 2000.]

OLAP Reporting - Drill Down

[Figure: bar chart "Inflows (Region, Year - Year 1999)" showing Inflows ($M) for the East, West and Central regions in each quarter of 1999.]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

[Figure: bar chart "Inflows (Region, Year - Year 1999 - 1st Qtr)" showing Inflows ($M) for the East, West and Central regions in January, February and March 1999.]

Drill-down from Quarter to Month


Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases for the database and application logic layers

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic are provided as separate layers

HOLAP - Hybrid OLAP
- The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop


MOLAP - MDDB storage

[Figure: MOLAP architecture. An OLAP cube feeds the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications.]



MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  - Maximum query performance
  - Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture. A relational DW is accessed through SQL, using an MDDB-relational mapping, by the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications.]

ROLAP - Features

- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier and database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability


HOLAP - Combination of RDBMS and MDDB


[Figure: HOLAP architecture. The OLAP calculation engine draws on both an OLAP cube and a relational DW (via SQL) and serves any client: web browsers, OLAP tools and OLAP applications.]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user


Architecture Comparison

Definition
- MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
- ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to Sparsity
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to Summarization
- MOLAP: With good design, 3 to 10 times
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query Execution Speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS then it is like ROLAP, otherwise like MOLAP

Cost
- MOLAP: Medium: MDDB server + large disk space cost
- ROLAP: Low: only RDBMS + disk space cost
- HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence


Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions

The methodology required for testing a Data Warehouse is different from testing a typical transaction system


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge


Difference In Testing Data warehouse and Transaction System.


User-Triggered vs. System-Triggered

In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). Very few test cycles cover the system-triggered scenarios (like billing or valuation).


Difference In Testing Data warehouse and Transaction System


Volume of Test Data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.

Possible Scenarios / Test Cases
In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.


Difference In Testing Data warehouse and Transaction System


Programming for Testing Challenge
In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare pre-transformation data to post-transformation data.
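
A stand-alone comparison script of the kind described here might look like the sketch below. Everything in it is an assumption for illustration: the function name `reconcile`, the record layout, and the transformation rule (source amounts in cents, target amounts in whole currency units).

```python
def reconcile(source_rows, target_rows, key, measure, transform=lambda v: v):
    """Compare pre-transformation source rows with post-transformation target rows."""
    expected = {r[key]: transform(r[measure]) for r in source_rows}
    actual = {r[key]: r[measure] for r in target_rows}
    return {
        "missing_in_target": sorted(set(expected) - set(actual)),
        "unexpected_in_target": sorted(set(actual) - set(expected)),
        "value_mismatches": sorted(k for k in expected.keys() & actual.keys()
                                   if expected[k] != actual[k]),
        "row_count_diff": len(target_rows) - len(source_rows),
    }

# Hypothetical source (amount in cents) and warehouse (amount in units) rows.
source = [
    {"orderId": "o100", "amt": 1200},
    {"orderId": "o102", "amt": 1100},
    {"orderId": "o105", "amt": 5000},
]
warehouse = [
    {"orderId": "o100", "amt": 12},
    {"orderId": "o102", "amt": 10},   # wrong value after transformation
]

result = reconcile(source, warehouse, key="orderId", measure="amt",
                   transform=lambda cents: cents // 100)
print(result)
```

Applying the business rule to the source side inside the test, independently of the ETL code, is what lets the script detect transformation errors rather than merely row-count differences.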


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP

Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source.
- Whether all the data transformations are correct according to the business rules and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that don't fulfil the transformation rules.


Unit Testing
Unit testing the report data:
- Verify report data with source: Data present in a data warehouse will be stored at an aggregate level compared to source systems. The QA team should verify the granular data stored in the data warehouse against the source data available.
- Field-level data verification: The QA team must understand the linkages for the fields displayed in the report and should trace back and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.


Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date to verify the newly inserted or updated data.
- Testing the rejected records that don't fulfil the transformation rules.
- Error log generation.


Performance Testing
Performance testing should check for:
- ETL processes completing within the time window.
- Monitoring and measuring the data quality issues.
- Refresh times for standard/complex reports.


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.


Questions


Thank You

350

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Trends in the Metadata Management Tools


Metadata Interchange Standards CDIF (CASE Data Interchange Format)
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts

OMG (Object Management Group)-CWM


XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories

MDC (Metadata Coalition)


Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft
351
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

OLAP

352

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Agenda
OLAP Definition Distinction between OLTP and OLAP

MDDB Concepts
Implementation Techniques Architectures

Features
Representative Tools

5/16/2013

353

353

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

OLAP: On-Line Analytical Processing


OLAP can be defined as a technology which allows the users to view the aggregate data across measurements (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.) Used interchangeably with BI Multidimensional view of data is the foundation of OLAP Users :Analysts, Decision makers
5/16/2013 354

354

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Distinction between OLTP and OLAP


OLTP System Source of data Operational data; OLTPs are the original source of the data To control and run fundamental business tasks A snapshot of ongoing business processes Short and fast inserts and updates initiated by end users
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

OLAP System Consolidation data; OLAP data comes from the various OLTP databases Decision support

Purpose of data

What the data reveals Inserts and Updates


5/16/2013

Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the 355 data

355

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
356
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS
MODEL MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SEDAN SEDAN SEDAN ... COLOR BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE DEALER Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr VOL. 6 3 2 5 3 1 3 1 4 3 3 3 4 3 6 2 3 5 4 3 2 ...

MDDB

Sales Volumes

M O D E L

Mini Van

Coupe Carr Gleason Clyde Blue Red White

Sedan

DEALERSHIP

COLOR

27 x 4 = 108 cells
357

3 x 3 x 3 = 27 cells

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array.
- The user can view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than a relational table offers.

Storage Space
- Very low space consumption compared to a relational DB.

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. A relational DB relies on indexes, sophisticated joins, etc., which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Increases with the number of dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Does not perform better than an RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example


If members of different dimensions do not interact, a blank cell is left behind.

Employee Age

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

Figure: the Employee Age array uses LAST NAME and EMPLOYEE # as its two dimensions. Each employee fills exactly one cell, so most cells are blank. The Sales Volumes array (MODEL x COLOR) is shown for contrast, with every cell filled.
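The sparsity in the Employee Age example can be quantified with a short sketch. The density measure used here (filled cells divided by total cells) is one simple way to express it and is an assumption of this example, not a formula from the slides.

```python
# Employee data from the table above: (last name, employee number, age).
employees = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]

# The two dimensions of the array and their members.
last_names = [name for name, _, _ in employees]
emp_numbers = [number for _, number, _ in employees]

# Only coordinates that actually occur hold a value.
array = {(name, number): age for name, number, age in employees}

total_cells = len(last_names) * len(emp_numbers)
filled_cells = len(array)
density = filled_cells / total_cells

print(filled_cells, total_cells, round(density, 3))  # 9 81 0.111
```

Adding a third dimension with nine members of the same kind would multiply the total cell count by nine while the number of filled cells stays at nine, which is why sparsity grows with the number of dimensions.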

OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

View #2: Sales Volumes after a 90-degree rotation, COLOR by MODEL

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation


Figure: the Sales Volumes cube (MODEL x COLOR x DEALERSHIP) shown in six orientations, View #1 to View #6. Each view is obtained by a 90-degree rotation that exchanges the positions of the MODEL, COLOR and DEALERSHIP axes.

A 3-dimensional array has 6 views.
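The view counts follow from treating each view as one ordering of the axes, so n dimensions give n! views. The sketch below checks this with itertools.permutations and shows that rotating the 2-dimensional Sales Volumes array is a transpose. The nested-dictionary layout and the rotate helper are illustrative choices, not part of any OLAP product.

```python
from itertools import permutations

def views(dimensions):
    """Return every ordering of the axes; each ordering is one view."""
    return list(permutations(dimensions))

two_d = views(["MODEL", "COLOR"])
three_d = views(["MODEL", "COLOR", "DEALERSHIP"])
print(len(two_d), len(three_d))  # 2 6

# View #1 of the 2-D example: rows are models, columns are colors.
view1 = {
    "Mini Van": {"Blue": 6, "Red": 5, "White": 4},
    "Coupe": {"Blue": 3, "Red": 5, "White": 5},
    "Sedan": {"Blue": 4, "Red": 3, "White": 2},
}

def rotate(table):
    """Swap the row and column axes of a nested-dict table."""
    rotated = {}
    for row, cells in table.items():
        for col, value in cells.items():
            rotated.setdefault(col, {})[row] = value
    return rotated

view2 = rotate(view1)
print(view2["Blue"])  # {'Mini Van': 6, 'Coupe': 3, 'Sedan': 4}
```

No data is re-queried or re-sorted during the rotation; only the roles of the axes change.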

Features of OLAP - Slicing / Filtering


An MDDB allows the end user to quickly slice in on the exact view of the data required.

Figure: the Sales Volumes cube (MODEL x DEALERSHIP x COLOR) reduced to a slice containing the models Mini Van and Coupe, the dealerships Carr and Clyde, and the colors Normal Blue and Metal Blue.
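A slice can be sketched as a filter over cube coordinates. The example below reuses volumes from the relational table earlier in this section (so it slices on Blue rather than on the Normal Blue / Metal Blue members of the figure); the slice_cube helper and the dictionary layout are assumptions made for illustration.

```python
DIMENSIONS = ("model", "color", "dealer")

# Part of the Sales Volumes cube, keyed by (model, color, dealer).
cube = {
    ("Mini Van", "Blue", "Clyde"): 6, ("Mini Van", "Blue", "Carr"): 2,
    ("Mini Van", "Red", "Clyde"): 5, ("Mini Van", "Red", "Carr"): 1,
    ("Coupe", "Blue", "Clyde"): 3, ("Coupe", "Blue", "Carr"): 3,
    ("Coupe", "Red", "Clyde"): 4, ("Coupe", "Red", "Carr"): 6,
    ("Sedan", "Blue", "Clyde"): 4, ("Sedan", "Blue", "Carr"): 2,
}

def slice_cube(cube, **members):
    """Keep only the cells whose coordinates fall in the requested member sets."""
    selected = {}
    for coords, value in cube.items():
        cell = dict(zip(DIMENSIONS, coords))
        if all(cell[dim] in allowed for dim, allowed in members.items()):
            selected[coords] = value
    return selected

blue_slice = slice_cube(
    cube,
    model={"Mini Van", "Coupe"},
    color={"Blue"},
    dealer={"Carr", "Clyde"},
)
print(len(blue_slice), sum(blue_slice.values()))  # 4 14
```

Restricting a single dimension to one member is usually called a slice, and restricting several dimensions at once, as here, a dice.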

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION:      Midwest
DISTRICT:    Chicago, St. Louis, Gary
DEALERSHIP:  Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at the Region, District or Dealership level.

Moving up in a hierarchy is referred to as drill-up (or roll-up), and moving down as drill-down.
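Roll-up and drill-down along the organization hierarchy can be sketched as follows. Two things here are assumptions, not facts from the slide: the assignment of dealerships to districts (the slide lists the levels but the extracted text does not show which dealership belongs to which district) and the sales figures, which are hypothetical.

```python
from collections import defaultdict

# ASSUMPTION: two dealerships per district, in the order they are listed.
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# HYPOTHETICAL sales per dealership, for illustration only.
sales = {"Clyde": 10, "Gleason": 20, "Carr": 30,
         "Levi": 40, "Lucas": 50, "Bolton": 60}

def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)

def drill_down(parent, values, parent_of):
    """Show the child-level values that make up one parent member."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}

by_district = roll_up(sales, district_of)
by_region = roll_up(by_district, region_of)

print(by_district)  # {'Chicago': 30, 'St. Louis': 70, 'Gary': 110}
print(by_region)    # {'Midwest': 210}
print(drill_down("Gary", sales, district_of))  # {'Lucas': 50, 'Bolton': 60}
```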

OLAP Reporting - Drill Down

Inflows (Region, Year)

Chart: Inflows ($M) by region (East, West, Central) for Year 1999 and Year 2000.

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

Chart: Inflows ($M) by region (East, West, Central) for the 1st to 4th quarters of Year 1999.

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

Chart: Inflows ($M) by region (East, West, Central) for January, February and March of Year 1999.

Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
- A multidimensional database provides both the database layer and the application logic layer.

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis. Database and application logic are provided as separate layers.

HOLAP - Hybrid OLAP
- The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server.

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop.

MOLAP - MDDB storage

Diagram: an OLAP cube (MDDB) feeds the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications.

MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  - maximum query performance
  - optimum space utilization

ROLAP - Standard SQL storage

Diagram: the OLAP calculation engine queries a relational DW with SQL through an MDDB-to-relational mapping, and serves web browsers, OLAP tools and OLAP applications.

ROLAP - Features

- Three-tier hardware/software architecture:
  - GUI on the client; multidimensional processing on the mid-tier server; target database on the database server
  - Processing split between the mid-tier and database servers
- Ad hoc query capabilities against very large databases
- DW integration
- Data scalability

HOLAP - Combination of RDBMS and MDDB


Diagram: the OLAP calculation engine reads from an OLAP cube and, through SQL, from a relational DW. It serves any client: web browsers, OLAP tools and OLAP applications.

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS combined with MDDB performance
- Calculation engine provides full analysis features
- Source of data is transparent to the end user

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
  MOLAP: With good design, 3 to 10 times
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query execution speed
  MOLAP: Fast (depends on the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


The cost of a software defect rises exponentially the later it is found in the development lifecycle. In data warehousing this is compounded by the additional business cost of using incorrect data to make critical business decisions.

The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge

Difference In Testing Data warehouse and Transaction System

User-triggered vs. system-triggered

In a data warehouse, most of the testing is system-triggered. In contrast, most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).

Difference In Testing Data warehouse and Transaction System


Volume of test data

The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, because one tries to fill up the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases

In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

Difference In Testing Data warehouse and Transaction System


Programming for testing challenge

In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
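A stand-alone comparison script of the kind described above might look like the following minimal sketch. The field names, the sample rows and the three checks (row count, amount total, missing keys) are hypothetical choices for illustration; a real script would read from the source system and the loaded warehouse tables.

```python
# HYPOTHETICAL pre-transformation rows, as they might arrive from a source system.
source_rows = [
    {"id": 1, "amount": "10.50", "country": "in"},
    {"id": 2, "amount": "4.25", "country": "us"},
    {"id": 3, "amount": "7.00", "country": "in"},
]

def transform(row):
    """Stand-in for the ETL transformation under test."""
    return {"id": row["id"],
            "amount": float(row["amount"]),
            "country": row["country"].upper()}

# Stand-in for the post-transformation data loaded into the warehouse.
target_rows = [transform(row) for row in source_rows]

def reconcile(source, target):
    """Compare pre- and post-transformation data; return a list of issues."""
    issues = []
    if len(source) != len(target):
        issues.append("row count mismatch")
    source_total = sum(float(row["amount"]) for row in source)
    target_total = sum(row["amount"] for row in target)
    if abs(source_total - target_total) > 1e-9:
        issues.append("amount total mismatch")
    missing = {row["id"] for row in source} - {row["id"] for row in target}
    if missing:
        issues.append(f"missing ids: {sorted(missing)}")
    return issues

print(reconcile(source_rows, target_rows))       # []
print(reconcile(source_rows, target_rows[:-1]))
```

The second call simulates a load that dropped one record, so all three checks report an issue.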

Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP

The testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs access and pick up the right data from the right source.
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that do not fulfil the transformation rules.

Unit testing the report data:
- Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, trace them back, and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.

Integration Testing
Integration testing involves the following:
- Sequence of ETL jobs in the batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date, to verify the newly inserted or updated data.
- Testing the rejected records that do not fulfil the transformation rules.
- Error log generation.

Performance Testing
Performance testing should check for:
- ETL processes completing within the time window.
- Monitoring and measuring the data quality issues.
- Refresh times for standard/complex reports.

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client in terms of ETL process integrity, business functionality and reporting.


Components of Warehouse
- Source Tables: real-time, volatile data in relational databases used for transaction processing (OLTP). The sources can be any relational databases or flat files.
- ETL Tools: extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and Administration Tools: authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
- Modeling Tools: used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: produce reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture


Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: data is cleansed, transformed, loaded and tested in the staging area, and is then loaded into the target database/warehouse. The warehouse is divided into data marts, which different users access for their reporting and analysis.

Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
- Entity: an object that can be observed and classified based on its properties and characteristics, such as employee, book or student.
- Relationship: relates entities to other entities.

Different Perspectives of Data Modeling
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of Dimensional Data Models most commonly used:
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: a category of information. For example, the Time dimension.
- Attribute: a unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year -> Quarter -> Month -> Day.
- Fact Table: a table that contains the measures of interest.
- Lookup Table: provides detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all the quarters available in the data warehouse.
- Surrogate Keys: used to avoid data integrity problems. They are helpful for slowly changing dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Dimension Table: product

prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store

storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale

orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer

custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
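A typical star-schema query joins the fact table to the dimension tables it references and aggregates a measure. The sketch below loads the sample rows above into an in-memory SQLite database using Python's standard sqlite3 module. The column types and the choice of SQLite are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER,
                       prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
""")

# Sample rows from the star schema slide.
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Total sales amount by store city and product name.
totals = conn.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale AS s
    JOIN store   AS st ON st.storeId = s.storeId
    JOIN product AS p  ON p.prodId  = s.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()

print(totals)  # [('la', 'bolt', 50), ('nyc', 'bolt', 12), ('nyc', 'nut', 11)]
```

Every join goes from the fact table directly to one dimension table, which is what gives the schema its star shape.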

Snowflake Schema
Table: store

storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType

tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city

cityId  pop  regId
sfo     1M   north
la      5M   south

Dimension Table: region

regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.

Overview of Data Cleansing


The Need For Data Quality


- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with
  - error detection
  - error rework
  - customer service
  - fixing customer problems

Six Steps To Data Quality


1. Understand Information Flow In Organization
   - Identify authoritative data sources
   - Interview employees & customers

2. Identify Potential Problem Areas & Assess Impact
   - Data entry points
   - Cost of bad data

3. Measure Quality Of Data
   - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

4. Clean & Load Data
   - Use data cleansing tools to clean data at the source
   - Load only clean data into the data warehouse

5. Continuous Monitoring
   - Schedule periodic cleansing of source data

6. Identify Areas of Improvement
   - Identify & correct the cause of defects
   - Refine data capture mechanisms at source
   - Educate users on the importance of DQ
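The "Measure Quality Of Data" step can be illustrated with a very small profiling sketch that counts records with missing values and duplicate keys. The records, field names and the profile function are hypothetical; commercial business rule discovery tools do far more than this.

```python
from collections import Counter

# HYPOTHETICAL customer records with typical defects.
records = [
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 81, "name": "fred", "city": ""},    # missing city
    {"custId": 53, "name": "joe", "city": "sfo"},  # duplicate key
    {"custId": 111, "name": None, "city": "la"},   # missing name
]

def profile(records, key, required):
    """Count rows, rows with a missing required field, and duplicate keys."""
    missing = sum(
        1 for record in records
        if any(record.get(field) in (None, "") for field in required)
    )
    key_counts = Counter(record[key] for record in records)
    duplicates = sum(count - 1 for count in key_counts.values() if count > 1)
    return {"rows": len(records), "missing": missing, "duplicates": duplicates}

report = profile(records, key="custId", required=("name", "city"))
print(report)  # {'rows': 4, 'missing': 2, 'duplicates': 1}
```

Running such a profile on a schedule is one simple way to support the continuous monitoring step.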

Data Quality Solution


Customized Programs
- Strengths:
  - Address specific needs
  - No bulky one-time investment
- Limitations:
  - Large numbers of custom programs in different environments are difficult to manage
  - Minor alterations demand coding effort

Data Quality Assessment Tools
- Strength: provide automated assessment
- Limitation: no measure of data accuracy

Business Rule Discovery Tools
- Strengths:
  - Detect correlation in data values
  - Can detect patterns of behavior that indicate fraud
- Limitations:
  - Not all variables can be discovered
  - Some discovered rules might not be pertinent
  - There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths:
  - Usually integrated packages with cleansing features as an add-on
- Limitations:
  - Error prevention at source is usually absent
  - The ETL tools have limited cleansing facilities

Tools In The Market


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic

Data Extraction, Transformation, Load


ETL Architecture

Diagram: data flows through five stages.
- Data Collection: visitors' web browsers over the Internet; web server logs and e-commerce transaction data; external data (demographics, household, webographics, income); other OLTP systems
- Data Extraction: scheduled extraction into the staging area (flat files, RDBMS)
- Data Transformation: clean, transform, match, merge
- Data Loading: scheduled loading
- Data Storage & Integration: enterprise data warehouse, with a metadata repository

ETL Architecture
Data Extraction
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data to another file or database

Data Extraction Cleanup
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Transformation
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Loading
- Initial and incremental loading
- Updating of metadata
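The four stages above can be sketched end to end with Python's standard csv and sqlite3 modules. Everything specific in this example is hypothetical: the input data, the customer_dim table, the country code mapping and the load date. It shows one cleanup rule (supplying a missing field value) and two transformations (changing codes, adding a time attribute).

```python
import csv
import io
import sqlite3

# HYPOTHETICAL source extract in CSV form.
raw = "id,name,country\n1, alice ,in\n2,bob,\n3,carol,us\n"

COUNTRY_NAMES = {"in": "India", "us": "United States"}  # code -> name

def extract(text):
    """Read the source records into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Trim whitespace and supply a placeholder for missing country codes."""
    return [{"id": int(row["id"]),
             "name": row["name"].strip(),
             "country": row["country"].strip() or "unknown"}
            for row in rows]

def transform(rows, load_date):
    """Change country codes to names and add a time attribute."""
    return [{**row,
             "country": COUNTRY_NAMES.get(row["country"], "Unknown"),
             "load_date": load_date}
            for row in rows]

def load(conn, rows):
    """Write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer_dim "
                 "(id INTEGER PRIMARY KEY, name TEXT, country TEXT, load_date TEXT)")
    conn.executemany(
        "INSERT INTO customer_dim VALUES (:id, :name, :country, :load_date)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract(raw)), "2009-01-01"))
loaded = conn.execute(
    "SELECT id, name, country, load_date FROM customer_dim ORDER BY id").fetchall()
print(loaded)
```

Keeping each stage as a separate function mirrors the stage boundaries on the slide and makes each stage testable on its own.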

Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.

To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.

Major components involved in ETL Processing

- Design manager: lets developers define source-to-target mappings, transformations, process flows and jobs.
- Metadata management: provides a repository to define, document and manage information about the ETL design and runtime processes.
- Extract: the process of reading data from a database.
- Transform: the process of converting the extracted data.
- Load: the process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, which improves both business and technical understanding of data and data-related processes

Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much misinterpretation results from too much metadata?
- How much time is spent trying to determine whether any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do the inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users - Business metadata
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools


Third Party Bridging Tools

Oracle Exchange
- Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik - Toolbus
- Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
- Hub-and-spoke solution for enabling metadata interoperability
- Ardent is focusing on its own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
- Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio and MicroStrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them

Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web, employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques / Architectures
- Features
- Representative Tools

OLAP: On-Line Analytical Processing


OLAP can be defined as a technology which allows the users to view the aggregate data across measurements (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.) Used interchangeably with BI Multidimensional view of data is the foundation of OLAP Users :Analysts, Decision makers
5/16/2013 424

424

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Distinction between OLTP and OLAP


OLTP System Source of data Operational data; OLTPs are the original source of the data To control and run fundamental business tasks A snapshot of ongoing business processes Short and fast inserts and updates initiated by end users
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

OLAP System Consolidation data; OLAP data comes from the various OLTP databases Decision support

Purpose of data

What the data reveals Inserts and Updates


5/16/2013

Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the 425 data

425

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
426
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS
MODEL MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SEDAN SEDAN SEDAN ... COLOR BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE DEALER Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr VOL. 6 3 2 5 3 1 3 1 4 3 3 3 4 3 6 2 3 5 4 3 2 ...

MDDB

Sales Volumes

M O D E L

Mini Van

Coupe Carr Gleason Clyde Blue Red White

Sedan

DEALERSHIP

COLOR

27 x 4 = 108 cells
427

3 x 3 x 3 = 27 cells

2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential

Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
  A great deal of information can be gleaned immediately on direct inspection of the array.
  The user views data along presorted dimensions, with the data arranged in an inherently more organized and accessible fashion than in a relational table.
Storage Space
  Very low space consumption compared to a relational DB.
Performance
  Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad hoc queries.
Ease of Maintenance
  No overhead, because data is stored in the same way it is viewed. A relational DB uses indexes, sophisticated joins, etc., which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Sparsity increases as dimensions are added

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Does not perform better than an RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example


If members of different dimensions do not interact, the corresponding cell is left blank.

Relational table:

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: the same data as a two-dimensional array, LAST NAME by EMPLOYEE #, with Employee Age as the cell value. Only 9 of the 9 x 9 = 81 cells are filled.]

Sales Volumes array (MODEL by COLOR) shown on the same slide:

           Blue  Red  White
Mini Van   6     5    4
Coupe      3     5    5
Sedan      4     3    2
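
The sparsity of the employee array can be quantified with a few lines. This is a minimal sketch using the nine employees from the slide; the name `sparsity` and the dictionary layout are assumptions for illustration.

```python
# (last name, employee number) -> age, taken from the table on the slide.
AGES = {
    ("SMITH", "01"): 21, ("REGAN", "12"): 19, ("FOX", "31"): 63,
    ("WELD", "14"): 31, ("KELLY", "54"): 27, ("LINK", "03"): 56,
    ("KRANZ", "41"): 45, ("LUCAS", "33"): 41, ("WEISS", "23"): 19,
}


def sparsity(cells):
    """Return (total cells, filled cells, empty fraction) for a 2-D array."""
    names = {name for name, _ in cells}
    numbers = {number for _, number in cells}
    total = len(names) * len(numbers)
    return total, len(cells), 1 - len(cells) / total


total, filled, empty_fraction = sparsity(AGES)
print(f"{filled} of {total} cells filled, {empty_fraction:.0%} empty")
```

Because each last name pairs with exactly one employee number, adding a tenth employee would add 19 more cells to the array but fill only one of them.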

OLAP Features
Calculations applied across dimensions, through hierarchies, and/or across members
Trend analysis over sequential time periods
What-if scenarios
Slicing / dicing subsets for on-screen viewing
Rotation to new dimensional comparisons in the viewing area
Drill-down / drill-up along the hierarchy
Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

           Blue  Red  White
Mini Van   6     5    4
Coupe      3     5    5
Sedan      4     3    2

View #2: Sales Volumes, COLOR by MODEL (rotated 90 degrees)

           Mini Van  Coupe  Sedan
Blue       6         3      4
Red        5         5      3
White      4         5      2

A 2-dimensional array has 2 views.
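
For a two-dimensional array, the rotation shown here amounts to swapping the two dimensions. A minimal sketch using the Sales Volumes figures from the slide (the function name `rotate` is an assumption for this example):

```python
MODELS = ["Mini Van", "Coupe", "Sedan"]
COLORS = ["Blue", "Red", "White"]

# View #1: one row per model, one column per color.
view1 = [
    [6, 5, 4],
    [3, 5, 5],
    [4, 3, 2],
]


def rotate(array):
    """Swap the two dimensions of a 2-D array (a transpose)."""
    return [list(column) for column in zip(*array)]


# View #2: one row per color, one column per model.
view2 = rotate(view1)
for color, row in zip(COLORS, view2):
    print(color, row)
```

Rotating a second time returns the original view, which is why a 2-dimensional array has exactly two views.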

Features of OLAP - Rotation


[Figure: six views (View #1 to View #6) of the Sales Volumes cube. Each view is obtained by a 90-degree rotation and shows the MODEL, COLOR, and DEALERSHIP dimensions in a different arrangement.]

A 3-dimensional array has 6 views.
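
The count of six follows from the number of ways to order three dimensions (3! = 6). The sketch below enumerates the orderings; the numbering it prints is an assumption and need not match the order of the views on the slide.

```python
from itertools import permutations
from math import factorial

DIMENSIONS = ("MODEL", "COLOR", "DEALERSHIP")

# Each ordering of the dimensions corresponds to one way of presenting the cube.
views = list(permutations(DIMENSIONS))
for number, axes in enumerate(views, start=1):
    print(f"View #{number}: " + " x ".join(axes))

# n dimensions give n! orderings: 2 for a 2-D array, 6 for a 3-D array.
assert len(views) == factorial(len(DIMENSIONS))
```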

Features of OLAP - Slicing / Filtering


An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube (MODEL, DEALERSHIP, COLOR) sliced down to the Mini Van and Coupe models, the Carr and Clyde dealerships, and the Normal Blue and Metal Blue colors.]
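
Slicing can be sketched as keeping only the cells whose members fall in the requested subsets. The volumes below are placeholders (a running number per cell), and the full member lists and the name `slice_cube` are assumptions for illustration; only the members selected in the slice come from the slide.

```python
from itertools import product

MODELS = ["Mini Van", "Coupe", "Sedan"]
DEALERS = ["Carr", "Gleason", "Clyde"]
COLORS = ["Normal Blue", "Metal Blue", "Red", "White"]

# Hypothetical volumes: each (model, dealer, color) cell gets a running number.
cube = {key: n for n, key in enumerate(product(MODELS, DEALERS, COLORS), start=1)}


def slice_cube(cube, models, dealers, colors):
    """Keep only the cells whose members are all in the requested subsets."""
    return {
        (m, d, c): v
        for (m, d, c), v in cube.items()
        if m in models and d in dealers and c in colors
    }


subset = slice_cube(
    cube,
    models={"Mini Van", "Coupe"},
    dealers={"Carr", "Clyde"},
    colors={"Normal Blue", "Metal Blue"},
)
print(len(cube), "cells in the cube;", len(subset), "cells in the slice")
```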

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at the region, district, or dealership level.

Moving up and moving down in a hierarchy are referred to as drill-up (roll-up) and drill-down, respectively.
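
Roll-up and drill-down along this hierarchy can be sketched as follows. The slide names the members of each level but not which dealership belongs to which district, so the mapping below (two dealerships per district) and the sales figures are assumptions made purely for illustration.

```python
# Assumed parent of each dealership and district (not given on the slide).
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales volumes at the lowest (dealership) level.
SALES = {"Clyde": 10, "Gleason": 7, "Carr": 4, "Levi": 6, "Lucas": 3, "Bolton": 5}


def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = {}
    for member, value in values.items():
        parent = parent_of[member]
        totals[parent] = totals.get(parent, 0) + value
    return totals


def drill_down(parent, values, parent_of):
    """Return the child-level values that make up one parent member."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


by_district = roll_up(SALES, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
chicago_detail = drill_down("Chicago", SALES, DISTRICT_OF)
print(by_region, by_district, chicago_detail)
```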

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M), on a scale of 0 to 200, by region (East, West, Central) for Year 1999 and Year 2000.]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M), on a scale of 0 to 90, by region (East, West, Central) for the 1st to 4th quarters of Year 1999.]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M), on a scale of up to 20, by region (East, West, Central) for January, February, and March of Year 1999.]

Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
A multidimensional database is used for both the database layer and the application logic layer.

ROLAP - Relational OLAP
Data stored in a relational data warehouse is accessed for OLAP analysis. The database and the application logic are provided as separate layers.

HOLAP - Hybrid OLAP
The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server.

DOLAP - Desktop OLAP
A personal MDDB server and application on the desktop.

MOLAP - MDDB storage

[Diagram: OLAP cube, OLAP calculation engine, and the clients it serves: web browser, OLAP tools, OLAP applications.]

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for:
  Maximum query performance
  Optimum space utilization

ROLAP - Standard SQL storage

[Diagram: relational DW accessed through SQL, MDDB-relational mapping, OLAP calculation engine, and the clients it serves: web browser, OLAP tools, OLAP applications.]

ROLAP - Features

Three-tier hardware/software architecture:
  GUI on the client; multidimensional processing on a mid-tier server; target database on a database server
  Processing split between the mid-tier and database servers
Ad hoc query capabilities against very large databases
DW integration
Data scalability
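
In a ROLAP setup the multidimensional request is ultimately answered with SQL against relational tables. The sketch below illustrates the idea with Python's built-in `sqlite3` module and a few rows from the car sales example; the table name `sales`, the column names, and the helper `aggregate` are assumptions, not part of any specific ROLAP product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, color TEXT, dealer TEXT, vol INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("MINI VAN", "BLUE", "Clyde", 6),
        ("MINI VAN", "BLUE", "Gleason", 3),
        ("MINI VAN", "BLUE", "Carr", 2),
        ("MINI VAN", "RED", "Clyde", 5),
        ("MINI VAN", "RED", "Gleason", 3),
        ("MINI VAN", "RED", "Carr", 1),
    ],
)

ALLOWED = {"model", "color", "dealer"}


def aggregate(conn, dimensions):
    """Sum the vol measure grouped by the requested dimension columns."""
    if not dimensions or not set(dimensions) <= ALLOWED:
        raise ValueError(f"unknown dimension in {dimensions}")
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(vol) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


by_color = aggregate(conn, ["color"])
by_dealer = aggregate(conn, ["dealer"])
print(by_color, by_dealer)
```

Each change of viewpoint becomes a new GROUP BY query, which is what the mid-tier server generates on the user's behalf.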

HOLAP - Combination of RDBMS and MDDB


[Diagram: OLAP cube and relational DW (accessed through SQL), OLAP calculation engine, and the clients it serves: any client, web browser, OLAP tools, OLAP applications.]

HOLAP - Features

RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of the RDBMS combined with MDDB performance
Calculation engine provides full analysis features
Source of data is transparent to the end user
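
The hybrid routing idea (try the MDDB first, fall back to the RDBMS, and hide the source from the user) can be sketched as below. Both stores are stand-ins built from plain Python structures, and all names are hypothetical.

```python
# Stand-in for the MDDB: pre-computed summaries by (model, color).
MDDB_SUMMARY = {("Mini Van", "Blue"): 11, ("Mini Van", "Red"): 9}

# Stand-in for the RDBMS: detail rows (model, color, dealer, volume).
RDBMS_DETAIL = [
    ("Mini Van", "Blue", "Clyde", 6),
    ("Mini Van", "Blue", "Gleason", 3),
    ("Mini Van", "Blue", "Carr", 2),
    ("Coupe", "Blue", "Clyde", 3),
    ("Coupe", "Blue", "Gleason", 3),
    ("Coupe", "Blue", "Carr", 3),
]


def query(model, color):
    """Return (total volume, source) for one model/color combination."""
    key = (model, color)
    if key in MDDB_SUMMARY:
        # Fast path: the summary is already stored in the MDDB.
        return MDDB_SUMMARY[key], "MDDB"
    # Fallback: aggregate the detail rows on the fly.
    total = sum(vol for m, c, _, vol in RDBMS_DETAIL if (m, c) == key)
    return total, "RDBMS"


print(query("Mini Van", "Blue"))
print(query("Coupe", "Blue"))
```

The caller receives the same kind of answer either way; the source label is returned here only to make the routing visible.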

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
  ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB
Data explosion due to Sparsity
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part
Data explosion due to Summarization
  MOLAP: With good design, 3 to 10 times
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent
Query Execution Speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.
Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost
Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

Oracle Express Products
Hyperion Essbase
Cognos - PowerPlay
Seagate - Holos
SAS
MicroStrategy - DSS Agent
Informix MetaCube
Brio Query
Business Objects / Web Intelligence

Sample OLAP Applications

Sales Analysis
Financial Analysis
Profitability Analysis
Performance Analysis
Risk Management
Profiling & Segmentation
Scorecard Application
NPA Management
Strategic Planning
Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


The cost associated with finding software defects increases exponentially the later they are found in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.

The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
User-triggered vs. system-triggered
Volume of test data
Possible scenarios / test cases
Programming for testing challenge

Difference In Testing Data warehouse and Transaction System

User-Triggered vs. System triggered

In a data warehouse, most of the testing is system-triggered. By contrast, most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). In those systems very few test cycles cover system-triggered scenarios (such as billing or valuation).

Difference In Testing Data warehouse and Transaction System


Volume of Test Data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, because one tries to cover the maximum possible combinations of dimensions and facts.

Possible scenarios / Test Cases
In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.
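
A rough sense of how quickly the test space grows can be had from simple counting. The dimension names below echo those used earlier in the deck, but the member counts are invented for illustration, and the helper `dimension_subsets` is an assumption.

```python
from itertools import combinations
from math import prod

# Hypothetical dimensions and the number of members in each.
MEMBER_COUNTS = {"Product": 50, "Organization": 20, "Customer": 1000, "Time": 36}


def dimension_subsets(dimensions):
    """Every non-empty set of dimensions a report could group by."""
    names = list(dimensions)
    return [
        subset
        for size in range(1, len(names) + 1)
        for subset in combinations(names, size)
    ]


views = dimension_subsets(MEMBER_COUNTS)
cells = prod(MEMBER_COUNTS.values())
print(len(views), "possible groupings;", cells, "lowest-level member combinations")
```

Even this small model has 15 groupings and 36 million lowest-level combinations before hierarchies and filters are considered, which is why test cases have to be sampled rather than enumerated.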

Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
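
A stand-alone comparison script of the kind described might look like the sketch below. The source rows, the target rows, and the transformation rule (cents to currency units, status code to label) are all hypothetical; a real script would read both sides from the source system and the warehouse.

```python
# Hypothetical source rows: (policy id, amount in cents, status code).
SOURCE = [(1, 125000, "A"), (2, 80050, "L"), (3, 4000, "A")]
# Hypothetical target rows as loaded by the ETL job under test.
TARGET = [(1, 1250.00, "Active"), (2, 800.50, "Lapsed"), (3, 40.00, "Active")]

STATUS = {"A": "Active", "L": "Lapsed"}


def transform(row):
    """The business rule the ETL is expected to apply (assumed here)."""
    key, cents, code = row
    return (key, cents / 100, STATUS[code])


def compare(source, target):
    """Return keys that are missing, unexpected, or different in the target."""
    expected = {row[0]: transform(row) for row in source}
    actual = {row[0]: row for row in target}
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
        "mismatched": sorted(
            k for k in expected.keys() & actual.keys() if expected[k] != actual[k]
        ),
    }


report = compare(SOURCE, TARGET)
print(report)
```

An empty report means every source row arrived in the target with the expected transformed values.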

Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP

The testing phases consist of:
Requirements testing
Unit testing
Integration testing
Performance testing
Acceptance testing

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
Are the requirements complete?
Are the requirements singular?
Are the requirements unambiguous?
Are the requirements developable?
Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
Whether the ETLs access and pick up the right data from the right source.
Whether all the data transformations are correct according to the business rules, and whether the data warehouse is correctly populated with the transformed data.
Testing the rejected records that don't fulfil the transformation rules.
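
Testing rejected records can be sketched as checking that every incoming record ends up either loaded or rejected, and that the rejects are the ones expected. The validation rule (age must be an integer from 18 to 100) and the record layout are hypothetical.

```python
def is_valid(record):
    """Hypothetical transformation rule: age is an integer from 18 to 100."""
    age = record.get("age")
    return isinstance(age, int) and 18 <= age <= 100


def load(records):
    """Split incoming records into loaded rows and rejected rows."""
    loaded, rejected = [], []
    for record in records:
        (loaded if is_valid(record) else rejected).append(record)
    return loaded, rejected


INCOMING = [
    {"emp": "01", "age": 21},
    {"emp": "12", "age": 19},
    {"emp": "99", "age": None},  # expected reject: missing age
    {"emp": "98", "age": 140},   # expected reject: out of range
]

loaded, rejected = load(INCOMING)
rejected_ids = [record["emp"] for record in rejected]

# Every record must end up in exactly one of the two outputs.
assert len(loaded) + len(rejected) == len(INCOMING)
print(rejected_ids)
```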

Unit Testing
Unit testing the report data:

Verify report data with source: Data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
Derivation formulae / calculation rules should be verified.

Integration Testing
Integration testing involves the following:
Sequence of ETL jobs in the batch
Initial loading of records into the data warehouse
Incremental loading of records at a later date, to verify the newly inserted or updated data
Testing the rejected records that don't fulfil the transformation rules
Error log generation
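
The incremental-load check can be sketched as follows: after an initial load, a later batch is applied, and the test verifies which keys were inserted, which were updated, and that everything else is unchanged. The warehouse is modelled as a dictionary, and all keys and values are hypothetical.

```python
def incremental_load(warehouse, delta):
    """Upsert delta rows (key -> row) into the warehouse; report what changed."""
    inserted = sorted(k for k in delta if k not in warehouse)
    updated = sorted(k for k in delta if k in warehouse and warehouse[k] != delta[k])
    warehouse.update(delta)
    return inserted, updated


warehouse = {1: ("Clyde", 6), 2: ("Gleason", 3)}  # state after the initial load
delta = {2: ("Gleason", 4), 3: ("Carr", 2)}       # later incremental batch
before = dict(warehouse)

inserted, updated = incremental_load(warehouse, delta)
untouched = [k for k in before if k not in delta]
print(inserted, updated, untouched)
```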

Performance Testing
Performance testing should check for:
ETL processes completing within the time window
Monitoring and measuring data quality issues
Refresh times for standard/complex reports

Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality, and reporting.

Questions


Thank You
