Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

© 2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse
2 Data Warehouse Architecture
3 Data Modeling for Data Warehouse
4 Overview of Data Cleansing
5 Data Extraction, Transformation, Load


Content [contd…]
6 Metadata Management
7 OLAP
8 Data Warehouse Testing


An Overview
Understanding What is a Data Warehouse



What is Data Warehouse?
Definitions of Data Warehouse
 A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. – WH Inmon
 A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support. – Babcock
 A data warehouse is a copy of transaction data specifically structured for query and analysis. – Ralph Kimball
 In simple terms: a data warehouse is a collection of data from different systems that supports business decisions, analysis and reporting.


Data Warehouse def. by WH Inmon
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
Subject Oriented
 Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated
 Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Nonvolatile
 Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
Time Variant
 In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture
What makes a Data Warehouse

Components of Warehouse
 Source Tables: Real-time, volatile data held in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.
 ETL Tools: To extract, cleanse, transform (aggregates, joins) and load the data from sources to the target.
 Maintenance and Administration Tools: To authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
 Modeling Tools: Used to design the data warehouse for high performance with dimensional data modeling techniques, and to map source to target files.
 Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
 End-user tools for analysis and reporting: To get reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture
In the basic design, source files are loaded directly into the warehouse and users query that data for different purposes.

A more complete design adds a staging area, where data is cleansed, transformed and tested before being loaded into the target database/warehouse. The warehouse is then divided into data marts, which different users access for their reporting and analysis purposes.


Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP systems; the dimensional data model is commonly used in OLAP.
E-R (Entity-Relationship) Data Model
Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, book or student.
Relationship: Relates entities to other entities.

 Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

 Types of dimensional data models – most commonly used:
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model
To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
 Dimension: A category of information. For example, the time dimension.
 Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
 Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
 Fact Table: A table that contains the measures of interest.
 Lookup Table: Provides the detailed information about an attribute. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
 Surrogate Keys: Warehouse-generated keys used in place of natural keys. They preserve data integrity, support Slowly Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
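The bullets above mention Slowly Changing Dimensions only in passing. The sketch below (not from the deck; plain Python with illustrative table and column names) shows one way surrogate keys can be assigned so that a changed attribute produces a new dimension row instead of overwriting history, the common "Type 2" approach.

    from itertools import count

    surrogate_key = count(start=1)   # warehouse-generated key sequence
    customer_dim = []                # all versions of all customers

    def load_customer(customer_id, city):
        """Return the surrogate key for the current version of a customer,
        expiring the old version and inserting a new one if 'city' changed."""
        current = next((r for r in customer_dim
                        if r["customer_id"] == customer_id and r["is_current"]), None)
        if current and current["city"] == city:
            return current["customer_sk"]            # nothing changed
        if current:
            current["is_current"] = False            # expire the old version
        row = {"customer_sk": next(surrogate_key),   # key referenced by fact rows
               "customer_id": customer_id,           # natural key from the source
               "city": city,
               "is_current": True}
        customer_dim.append(row)
        return row["customer_sk"]

    print(load_customer("C053", "sfo"))   # 1 - first version of the customer
    print(load_customer("C053", "sfo"))   # 1 - unchanged, same surrogate key
    print(load_customer("C053", "la"))    # 2 - attribute changed, new version and key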

Star Schema
Dimension Table: product
  prodId  name  price
  p1      bolt  10
  p2      nut   5

Dimension Table: store
  storeId  city
  c1       nyc
  c2       sfo
  c3       la

Fact Table: sale
  orderId  date    custId  prodId  storeId  qty  amt
  o100     1/7/97  53      p1      c1       1    12
  o102     2/7/97  53      p2      c1       2    11
  o105     3/8/97  111     p1      c3       5    50

Dimension Table: customer
  custId  name   address    city
  53      joe    10 main    sfo
  81      fred   12 main    sfo
  111     sally  80 willow  la

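As an illustration only (not part of the original deck), here is how the star schema above might be queried with pandas: join the fact table to its dimensions, then aggregate a measure by dimension attributes. Table and column names follow the sample data on this slide.

    import pandas as pd

    sale = pd.DataFrame({"orderId": ["o100", "o102", "o105"],
                         "custId": [53, 53, 111],
                         "prodId": ["p1", "p2", "p1"],
                         "storeId": ["c1", "c1", "c3"],
                         "qty": [1, 2, 5],
                         "amt": [12, 11, 50]})
    product = pd.DataFrame({"prodId": ["p1", "p2"], "name": ["bolt", "nut"], "price": [10, 5]})
    store = pd.DataFrame({"storeId": ["c1", "c2", "c3"], "city": ["nyc", "sfo", "la"]})

    # Join the fact table to its dimensions, then aggregate a measure by
    # dimension attributes - the typical star-schema query pattern.
    report = (sale.merge(product, on="prodId")
                  .merge(store, on="storeId")
                  .groupby(["city", "name"], as_index=False)["amt"].sum())
    print(report)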

Snowflake Schema
Dimension Table: store (snowflaked into sType, city and region)
  storeId  cityId  tId  mgr
  s5       sfo     t1   joe
  s7       sfo     t2   fred
  s9       la      t1   nancy

Dimension Table: sType
  tId  size   location
  t1   small  downtown
  t2   large  suburbs

Dimension Table: city
  cityId  pop  regId
  sfo     1M   north
  la      5M   south

Dimension Table: region
  regId  name
  north  cold region
  south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not normalized much and are frequently designed at a level of normalization short of third normal form. A join sketch over this snowflaked dimension follows below.

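A companion sketch (again illustrative, not from the deck): resolving the snowflaked store dimension back into a flat view requires chaining joins store → city → region, which is exactly the extra query work the paragraph above alludes to.

    import pandas as pd

    store = pd.DataFrame({"storeId": ["s5", "s7", "s9"],
                          "cityId": ["sfo", "sfo", "la"],
                          "tId": ["t1", "t2", "t1"],
                          "mgr": ["joe", "fred", "nancy"]})
    city = pd.DataFrame({"cityId": ["sfo", "la"], "pop": ["1M", "5M"],
                         "regId": ["north", "south"]})
    region = pd.DataFrame({"regId": ["north", "south"],
                           "name": ["cold region", "warm region"]})

    # Normalisation keeps each attribute once; queries pay for it with extra joins.
    flat_store = store.merge(city, on="cityId").merge(region, on="regId")
    print(flat_store[["storeId", "mgr", "cityId", "name"]])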

Overview of Data Cleansing


The Need For Data Quality
 Difficulty in decision making
 Time delays in operation
 Organizational mistrust
 Data ownership conflicts
 Customer attrition
 Costs associated with
  – error detection
  – error rework
  – customer service
  – fixing customer problems


Six Steps To Data Quality
Understand Information Flow In Organization
 Identify authoritative data sources
 Interview employees & customers

Identify Potential Problem Areas & Assess Impact
 Data entry points
 Cost of bad data

Measure Quality Of Data
 Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values (a sketch of such rule-based checks follows after this list)

Clean & Load Data
 Use data cleansing tools to clean data at the source
 Load only clean data into the data warehouse

Continuous Monitoring
 Schedule periodic cleansing of source data

Identify Areas of Improvement
 Identify & correct cause of defects
 Refine data capture mechanisms at source
 Educate users on importance of DQ
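As a rough illustration of the "measure quality" and "clean & load" steps (not one of the tools discussed on the following slides), the pandas sketch below applies three simple business rules - missing values, duplicate keys, out-of-range values - and keeps only the clean rows; column names and thresholds are assumptions.

    import pandas as pd

    customers = pd.DataFrame({
        "cust_id": [53, 53, 81, 111, 112],
        "name":    ["joe", "joe", "fred", "sally", None],
        "age":     [34, 34, 51, 29, 213],
    })

    issues = {
        "missing_name":     customers["name"].isna().sum(),
        "duplicate_ids":    customers["cust_id"].duplicated().sum(),
        "age_out_of_range": ((customers["age"] < 0) | (customers["age"] > 120)).sum(),
    }
    print(issues)   # counts like these feed the data-quality measurements

    # Load only clean rows into the warehouse staging table.
    clean = (customers.dropna(subset=["name"])
                      .drop_duplicates(subset=["cust_id"])
                      .query("0 <= age <= 120"))
    print(clean)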

Data Quality Solution
Customized Programs
 Strengths
  – Addresses specific needs
  – No bulky one-time investment
 Limitations
  – Tons of custom programs in different environments are difficult to manage
  – Minor alterations demand coding efforts

Data Quality Assessment Tools
 Strengths
  – Provide automated assessment
 Limitations
  – No measure of data accuracy


Data Quality Solution
Business Rule Discovery Tools
 Strengths
  – Detect correlation in data values
  – Can detect patterns of behavior that indicate fraud
 Limitations
  – Not all variables can be discovered
  – Some discovered rules might not be pertinent
  – There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
 Strengths
  – Usually integrated packages with cleansing features as add-ons
 Limitations
  – Error prevention at source is usually absent
  – The ETL tools have limited cleansing facilities


Tools In The Market
 Business Rule Discovery Tools
  – Integrity Data Reengineering Tool from Vality Technology
  – Trillium Software System from Harte-Hanks Data Technologies
  – Migration Architect from DB Star
 Data Reengineering & Cleansing Tools
  – Carlton Pureview from Oracle
  – ETI-Extract from Evolutionary Technologies
  – PowerMart from Informatica Corp
  – Sagent Data Mart from Sagent Technology
 Data Quality Assessment Tools
  – Migration Architect, Evoke Axio from Evoke Software
  – Wizrule from Wizsoft
 Name & Address Cleansing Tools
  – Centrus Suite from Sagent
  – I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

[Diagram: ETL architecture. Source systems – web server logs & e-commerce transaction data (flat files), RDBMS sources, external data (demographics, household, webographics, income) and other OLTP systems – feed a scheduled extraction into the staging area, where data is cleaned, transformed, matched and merged. Scheduled loading then populates the enterprise data warehouse, with a metadata repository describing the flow. Stages: data collection → data extraction → data transformation → data loading → data storage & integration.]

ETL Architecture
Data Extraction
 Rummages through a file or database
 Uses some criteria for selection
 Identifies qualified data
 Transports the data over onto another file or database

Data Extraction – Cleanup
 Restructuring of records or fields
 Removal of operational-only data
 Supply of missing field values
 Data integrity checks
 Data consistency and range checks, etc.

Data Transformation
 Integrating dissimilar data types
 Changing codes
 Adding a time attribute
 Summarizing data
 Calculating derived values
 Renormalizing data

Data Loading
 Initial and incremental loading
 Updating of metadata

A minimal end-to-end sketch of these steps follows below.

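The sketch referenced above assumes an in-memory CSV as the "source system" and SQLite as the "warehouse"; file names, columns and transformation rules are illustrative, not taken from the deck.

    import io
    import sqlite3
    import pandas as pd

    source_csv = io.StringIO(
        "order_id,order_date,amount\no100,1997-07-01,12\no102,1997-07-02,11\n")

    # Extract: select qualified data from the source.
    orders = pd.read_csv(source_csv, parse_dates=["order_date"])

    # Transform: add a time attribute and a derived value, as listed above.
    orders["order_month"] = orders["order_date"].dt.to_period("M").astype(str)
    orders["amount_with_tax"] = orders["amount"] * 1.1

    # Load: initial load into the target table (incremental loads would append).
    with sqlite3.connect("warehouse.db") as conn:
        orders.to_sql("fact_orders", conn, if_exists="replace", index=False)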

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
 The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
 To solve the problem, companies use extract, transform and load (ETL) software.
 The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.




Major components involved in ETL Processing
 Design manager: Lets developers define source-to-target mappings, transformations, process flows and jobs.
 Metadata management: Provides a repository to define, document and manage information about the ETL design and runtime processes.
 Extract: The process of reading data from a database.
 Transform: The process of converting the extracted data.
 Load: The process of writing the data into the target database.
 Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
 Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures and reconcile outputs with source systems.

ETL Tools
 Provide a facility to specify a large number of transformation rules with a GUI
 Generate programs to transform data
 Handle multiple data sources
 Handle data redundancy
 Generate metadata as output
 Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools – Second Generation
 PowerCenter/PowerMart from Informatica
 Data Mart Solution from Sagent Technology
 DataStage from Ascential


Metadata Management


What Is Metadata?
Metadata is information...
 That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
 About the data being captured and loaded into the warehouse
 Documented in IT tools that improve both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information?

How much money was lost or earned as a result? Interpreting information
How many times have businesses needed to rework or recall products?  What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?



Requirements for DW Metadata Management
 Provide a simple catalogue of business metadata descriptions and views
 Document and manage metadata descriptions from an integrated development environment
 Enable DW users to identify and invoke pre-built queries against the data stores
 Design and enhance new data models and schemas for the data warehouse
 Capture data transformation rules between the operational and data warehousing databases
 Provide change impact analysis and updates across these technologies

A sketch of what one catalogue entry might hold follows below.
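For illustration only, the sketch below shows the sort of record a very small metadata catalogue might keep per warehouse column to satisfy the first and fifth requirements above (business description plus source-to-target transformation rule); all field names and values are assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class ColumnMetadata:
        table: str
        column: str
        business_definition: str      # business metadata: what the value means
        source: str                   # technical metadata: where it comes from
        transformation_rule: str      # how the operational value is derived
        tags: list = field(default_factory=list)

    catalogue = [
        ColumnMetadata(
            table="fact_sales", column="net_amount",
            business_definition="Sale amount after discounts, in USD",
            source="ORDERS.AMT (billing system)",
            transformation_rule="AMT - DISCOUNT, rounded to 2 decimals",
            tags=["finance", "certified"],
        ),
    ]

    # A catalogue search is then a simple filter over these descriptions.
    print([c.column for c in catalogue if "finance" in c.tags])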

Consumers of Metadata
 Technical Users
  • Warehouse administrator
  • Application developer
 Business Users – business metadata
  • Meanings
  • Definitions
  • Business rules
 Software Tools
  • Used in DW life-cycle development
  • Metadata requirements for each tool must be identified
  • The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  • Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools
Third Party Bridging Tools  Oracle Exchange
– Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
– Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay
– ‘Hub and Spoke’ solution for enabling metadata interoperability – Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins
– Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools
Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products — e.g., One for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools
Metadata Interchange Standards  CDIF (CASE Data Interchange Format)
– Most frequently used interchange standard – Addresses only a limited subset of metadata artifacts

 OMG (Object Management Group)-CWM
– XML addresses context and data meaning, not presentation – Can enable exchange over the web employing industry standards for storing and sharing programming data – Will allow sharing of UML and MOF objects between various development tools and repositories

 MDC (Metadata Coalition)
– Based on XML/UML standards – Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft

OLAP


Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques / Architectures
Features
Representative Tools


OLAP: On-Line Analytical Processing
 OLAP can be defined as a technology which allows users to view aggregate data across measures (like maturity amount, interest rate, etc.) along with a set of related parameters called dimensions (like product, organization, customer, etc.)
 • Used interchangeably with 'BI'
 • A multidimensional view of data is the foundation of OLAP
 • Users: analysts, decision makers

Distinction between OLTP and OLAP
Source of data
  OLTP: Operational data; OLTP systems are the original source of the data
  OLAP: Consolidated data; OLAP data comes from the various OLTP databases
Purpose of data
  OLTP: To control and run fundamental business tasks
  OLAP: Decision support
What the data reveals
  OLTP: A snapshot of ongoing business processes
  OLAP: Multi-dimensional views of various kinds of business activities
Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users
  OLAP: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
 intimately related, and
 stored, viewed and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data.
 The edges of the cube are called dimensions
 Individual items within each dimension are called members
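An illustrative sketch (not from the deck): a tiny "cube" built with a pandas pivot table, with model and color as dimensions and sales volume as the measure; adding the dealer column as a further index level would give the three-dimensional cube used on the following slides.

    import pandas as pd

    sales = pd.DataFrame({
        "model":  ["Mini Van", "Mini Van", "Coupe", "Coupe", "Sedan"],
        "color":  ["Blue", "Red", "Blue", "White", "Red"],
        "dealer": ["Clyde", "Gleason", "Carr", "Clyde", "Carr"],
        "volume": [6, 5, 3, 2, 4],
    })

    # Each cell holds the aggregated measure for one member of every dimension;
    # empty cells are member combinations with no data (sparsity).
    cube = sales.pivot_table(index="model", columns="color",
                             values="volume", aggfunc="sum")
    print(cube)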

RDBMS v/s MDDB: Increased Complexity...
[Diagram: the same sales-volume data held two ways. In the relational DBMS it is a single table with columns MODEL, COLOR, DEALER and VOL., one row per Mini Van / Sports Coupe / Sedan × Blue / Red / White × Clyde / Gleason / Carr combination, i.e. 27 rows x 4 columns = 108 cells. In the MDDB the same data is a 3 x 3 x 3 cube over MODEL, COLOR and DEALERSHIP, i.e. 27 cells.]

Benefits of MDDB over RDBMS
 Ease of Data Presentation & Navigation
  – A great deal of information is gleaned immediately upon direct inspection of the array
  – The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than that offered by a relational table
 Storage Space
  – Very low space consumption compared to a relational DB
 Performance
  – Gives much better performance
  – A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries
 Ease of Maintenance
  – No overhead, as data is stored the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

• Sparsity
  – Input data in applications is typically sparse
  – Increases with increased dimensions
• Data Explosion
  – Due to sparsity
  – Due to summarization
• Performance
  – Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example
If members of different dimensions do not interact, a blank cell is left behind in the cube.

[Diagram: a two-dimensional Employee # × Age array built from a nine-row relational table (LAST NAME, EMP#, AGE). Because each employee has exactly one age, only one cell per employee holds a value and the rest of the array is blank, illustrating sparsity.]


OLAP Features
 Calculations applied across dimensions, through hierarchies and/or across members
 Trend analysis over sequential time periods
 What-if scenarios
 Slicing / dicing subsets for on-screen viewing
 Rotation to new dimensional comparisons in the viewing area
 Drill-down/up along the hierarchy
 Reach-through / drill-through to underlying detail data
(A small sketch of slicing, drill-down and rotation follows below.)

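The sketch referenced above (illustrative data and column names, not the deck's) shows how these operations look against a small pandas cube: slice, drill down along the time hierarchy, and rotate the view.

    import pandas as pd

    inflows = pd.DataFrame({
        "region":  ["East", "East", "West", "West", "Central", "Central"],
        "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
        "month":   ["Jan", "Apr", "Feb", "May", "Mar", "Jun"],
        "amount":  [20, 25, 15, 18, 10, 12],
    })

    cube = inflows.pivot_table(index="region", columns="quarter",
                               values="amount", aggfunc="sum")

    print(cube.loc[["East"]])     # slice: keep one member of a dimension
    print(inflows.groupby(["region", "quarter", "month"])["amount"].sum())  # drill-down to month
    print(cube.T)                 # rotation: swap the viewing axes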

Features of OLAP - Rotation

• Complex queries and sorts in a relational environment translate to a simple rotation of the cube.

[Diagram: the Sales Volumes cube shown with MODEL on the rows and COLOR on the columns (View #1), then rotated 90° so that COLOR is on the rows and MODEL on the columns (View #2).]

A 2-dimensional array has 2 views.

Features of OLAP - Rotation
[Diagram: the three-dimensional Sales Volumes cube (MODEL × COLOR × DEALERSHIP) rotated 90° at a time, producing six distinct orientations (Views #1-#6), each pairing a different dimension with the rows and columns of the display.]

A 3-dimensional array has 6 views.

Features of OLAP - Slicing / Filtering
 An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Diagram: from the full Sales Volumes cube, a slice restricted to two models (Mini Van, Coupe), two colors (Normal Blue, Metal Blue) and two dealerships (Carr, Clyde), leaving a small sub-cube on screen.]

Features of OLAP - Drill Down / Up

[Diagram: the ORGANIZATION dimension hierarchy – REGION (Midwest) → DISTRICT (Chicago, St. Louis, Gary) → DEALERSHIP (Clyde, Gleason, Carr, Levi, Lucas, Bolton) – with sales viewable at the region, district or dealership level.]

• Moving up and moving down a hierarchy is referred to as "drill-up" / "roll-up" and "drill-down".


OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) and Year (1999, 2000).]


OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for the four quarters of 1999.]

• Drill-down from Year to Quarter

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for January, February and March of 1999 (1st Qtr).]

• Drill-down from Quarter to Month


Implementation Techniques -OLAP Architectures

 MOLAP – Multidimensional OLAP
  Multidimensional databases for the database and application logic layer
 ROLAP – Relational OLAP
  Accesses data stored in a relational data warehouse for OLAP analysis; database and application logic are provided as separate layers
 HOLAP – Hybrid OLAP
  The OLAP server routes queries first to the MDDB, then to the RDBMS, and the results are processed on-the-fly in the server
 DOLAP – Desktop OLAP
  Personal MDDB server and application on the desktop


MOLAP - MDDB storage

[Diagram: MOLAP – the OLAP cube is stored in the MDDB and served by the OLAP calculation engine to web browsers, OLAP tools and OLAP applications.]

MOLAP - Features

 Powerful analytical capabilities (e.g., financial, forecasting, statistical)
 Aggregation and calculation capabilities
 Read/write analytic applications
 Specialized data structures for
  – maximum query performance
  – optimum space utilization

ROLAP - Standard SQL storage

[Diagram: ROLAP – data stays in the relational DW; the OLAP calculation engine maps the multidimensional view onto relational tables and issues SQL on behalf of web browsers, OLAP tools and OLAP applications.]

ROLAP - Features
 Three-tier hardware/software architecture:
  – GUI on the client; multidimensional processing on the mid-tier server; target database on the database server
  – Processing split between mid-tier and database servers
 Ad hoc query capabilities against very large databases
 DW integration
 Data scalability


HOLAP - Combination of RDBMS and MDDB
[Diagram: HOLAP – summary data in an OLAP cube plus detail data in the relational DW; the OLAP calculation engine serves any client (web browsers, OLAP tools, OLAP applications), issuing SQL to the RDBMS when the cube cannot answer a query.]

HOLAP - Features

 RDBMS used for detailed data stored in large databases
 MDDB used for fast, read/write OLAP analysis and calculations
 Scalability of the RDBMS plus MDDB performance
 Calculation engine provides full analysis features
 Source of the data is transparent to the end user


Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in the MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in the MDDB
Data explosion due to sparsity
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part
Data explosion due to summarization
  MOLAP: With a good design, 3 – 10 times
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent
Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum – if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP
Cost
  MOLAP: Medium – MDDB server + large disk space
  ROLAP: Low – only RDBMS + disk space
  HOLAP: High – RDBMS + disk space + MDDB server cost
Where to apply
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

Oracle Express
Hyperion Essbase
Cognos PowerPlay
Seagate Holos
SAS
MicroStrategy DSS Agent
Informix MetaCube
Brio Query
Business Objects / WebIntelligence


Sample OLAP Applications

 Sales Analysis  Financial Analysis  Profitability Analysis  Performance Analysis  Risk Management  Profiling & Segmentation  Scorecard Application  NPA Management  Strategic Planning  Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview
 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.

 The methodology required for testing a data warehouse is different from that used for testing a typical transaction system.


Difference In Testing Data warehouse and Transaction System
Data warehouse testing is different on the following counts:
 – User-triggered vs. system-triggered
 – Volume of test data
 – Possible scenarios / test cases
 – Programming for testing challenge


Difference In Testing Data warehouse and Transaction System….
 User-triggered vs. system-triggered

In a data warehouse, most of the testing is system-triggered. Most production/source-system testing covers the processing of individual transactions, which are driven by some input from the users (application forms, servicing requests). There are very few test cycles that cover system-triggered scenarios (such as billing or valuation).


Difference In Testing Data warehouse and Transaction System…
 Volume of Test Data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to cover the maximum possible combinations of dimensions and facts.
 Possible scenarios / Test Cases
In the case of a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.


Difference In Testing Data warehouse and Transaction System…
• Programming for testing challenge
In the case of transaction systems, users/business analysts typically test the output of the system. In the case of a data warehouse, most of the data-quality testing and ETL testing is done at the back end by running separate stand-alone scripts that compare data before and after transformation.


Data Warehouse Testing Process
Data warehouse testing is basically divided into two parts:
 'Back-end' testing, where the source-system data is compared to the end-result data in the loaded area
 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by the end-user tools, such as OLAP
Testing phases consist of:
 Requirements testing
 Unit testing
 Integration testing
 Performance testing
 Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
 Are the requirements complete?
 Are the requirements singular?
 Are the requirements unambiguous?
 Are the requirements developable?
 Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
• Whether the ETLs are accessing and picking up the right data from the right source.
• Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
• Testing the rejected records that don't fulfil transformation rules.
(A sketch of a simple back-end reconciliation check follows below.)
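The reconciliation sketch referenced above is illustrative only (plain Python and SQLite; table and column names are assumptions): it compares row counts and a control total between the source extract and the loaded target.

    import sqlite3

    def reconcile(conn, source_table, target_table, amount_col="amt"):
        """Compare row counts and a control total between source and target."""
        cur = conn.cursor()
        src_count, src_sum = cur.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_col}), 0) FROM {source_table}").fetchone()
        tgt_count, tgt_sum = cur.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_col}), 0) FROM {target_table}").fetchone()
        assert src_count == tgt_count, f"row count mismatch: {src_count} vs {tgt_count}"
        assert abs(src_sum - tgt_sum) < 1e-6, f"control total mismatch: {src_sum} vs {tgt_sum}"

    # Example run against an in-memory database standing in for staging + warehouse.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE stg_sales (amt REAL);  INSERT INTO stg_sales VALUES (12), (11), (50);
        CREATE TABLE dw_sales  (amt REAL);  INSERT INTO dw_sales  VALUES (12), (11), (50);
    """)
    reconcile(conn, "stg_sales", "dw_sales")
    print("source and target reconcile")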


Unit Testing…
Unit Testing the Report data:

• Verify report data with the source: Data present in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
• Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace back and compare them with the source systems.
• Derivation formulae / calculation rules should be verified.


Integration Testing
Integration testing will involve the following:

 Sequence of ETL jobs in the batch.
 Initial loading of records into the data warehouse.
 Incremental loading of records at a later date, to verify newly inserted or updated data.
 Testing the rejected records that don't fulfil transformation rules.
 Error log generation.


Performance Testing
Performance testing should check for:
 ETL processes completing within the time window.
 Monitoring and measuring data quality issues.
 Refresh times for standard/complex reports.


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.


Questions


Thank You

93

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

© 2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction, Transformation, Load

95

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Content [contd…]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

96

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

97

© 2009 Wipro Ltd - Confidential

What is Data Warehouse?
Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. – WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support – Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis – Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

98

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

99

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Architecture
What makes a Data Warehouse

100

© 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

© 2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction, Transformation, Load

102

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Content [contd…]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

103

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

104

© 2009 Wipro Ltd - Confidential

What is Data Warehouse?
Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. – WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support – Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis – Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

105

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

106

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Architecture
What makes a Data Warehouse

107

© 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

© 2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction, Transformation, Load

109

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Content [contd…]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

110

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

111

© 2009 Wipro Ltd - Confidential

What is Data Warehouse?
Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. – WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support – Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis – Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

112

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

113

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Architecture
What makes a Data Warehouse

114

© 2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

115

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Architecture
This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

116

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

117

© 2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student… Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.
o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models – most commonly used:
o Star Schema o Snowflake Schema
118
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model
To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
119
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

120

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south

Dimension Table
cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.

region regId name north cold region south warm region

121

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

122

© 2009 Wipro Ltd - Confidential

The Need For Data Quality
      Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with – error detection – error rework – customer service – fixing customer problems

123

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Six Steps To Data Quality
1. Understand Information Flow In Organization
    Identify authoritative data sources
    Interview Employees & Customers

2. Identify Potential Problem Areas & Assess Impact
    Data Entry Points
    Cost of bad data

3. Measure Quality Of Data
    Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

4. Clean & Load Data
    Use data cleansing tools to clean data at the source
    Load only clean data into the data warehouse

5. Continuous Monitoring
    Schedule Periodic Cleansing of Source Data

6. Identify Areas of Improvement
    Identify & Correct Cause of Defects
    Refine data capture mechanisms at source
    Educate users on importance of DQ
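To make the "measure quality of data" and "clean & load data" steps concrete, here is a small illustrative sketch in Python of the kind of rule-based checks such tools automate. The record layout and the specific rules are assumptions for the example only.

# Illustrative data-quality checks: missing, duplicate and out-of-range values.
# The field names and rules below are assumptions for the example only.
records = [
    {"custId": 53,  "name": "joe",   "age": 34,  "city": "sfo"},
    {"custId": 53,  "name": "joe",   "age": 34,  "city": "sfo"},   # duplicate
    {"custId": 81,  "name": "",      "age": 210, "city": "sfo"},   # missing name, bad age
    {"custId": 111, "name": "sally", "age": 29,  "city": None},    # missing city
]

issues = []
seen = set()
for i, rec in enumerate(records):
    if rec["custId"] in seen:
        issues.append((i, "duplicate custId"))
    seen.add(rec["custId"])
    for field in ("name", "city"):
        if not rec[field]:
            issues.append((i, f"missing {field}"))
    if not 0 <= rec["age"] <= 120:                 # simple range rule
        issues.append((i, "age out of range"))

bad_rows = {i for i, _ in issues}
clean = [r for i, r in enumerate(records) if i not in bad_rows]
print(issues)   # what would be sent back to the source for correction
print(clean)    # only clean rows go on to the warehouse load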
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

124

Data Quality Solution
Customized Programs
 Strengths
   – Address specific needs
   – No bulky one-time investment
 Limitations
   – Tons of custom programs in different environments are difficult to manage
   – Minor alterations demand coding efforts

Data Quality Assessment Tools
 Strengths
   – Provide automated assessment
 Limitations
   – No measure of data accuracy

125

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Quality Solution
Business Rule Discovery Tools
 Strengths
   – Detect correlations in data values
   – Can detect patterns of behavior that indicate fraud
 Limitations
   – Not all variables can be discovered
   – Some discovered rules might not be pertinent
   – There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
 Strengths
   – Usually are integrated packages with cleansing features as add-ons
 Limitations
   – Error prevention at source is usually absent
   – The ETL tools have limited cleansing facilities
126
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Tools In The Market
 Business Rule Discovery Tools
   – Integrity Data Reengineering Tool from Vality Technology
   – Trillium Software System from Harte-Hanks Data Technologies
   – Migration Architect from DB Star
 Data Reengineering & Cleansing Tools
   – Carlton Pureview from Oracle
   – ETI-Extract from Evolutionary Technologies
   – PowerMart from Informatica Corp
   – Sagent Data Mart from Sagent Technology
 Data Quality Assessment Tools
   – Migration Architect, Evoke Axio from Evoke Software
   – Wizrule from Wizsoft
 Name & Address Cleansing Tools
   – Centrus Suite from Sagent
   – I.d.centric from First Logic

127

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

128

© 2009 Wipro Ltd - Confidential

ETL Architecture

[Diagram: end-to-end ETL flow. Visitors reach the site through web browsers and the Internet; web server logs and e-commerce transaction data, external data (demographics, household, webographics, income) and other OLTP systems feed scheduled extraction into a staging area (flat files and an RDBMS) where data is cleaned, transformed, matched and merged; scheduled loading then populates the enterprise data warehouse, with a metadata repository supporting the whole process. Stages shown: Data Collection, Data Extraction, Data Transformation, Data Loading, Data Storage & Integration.]

129

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
  • Rummages through a file or database
  • Uses some criteria for selection
  • Identifies qualified data
  • Transports the data over onto another file or database

Data Transformation:
  • Integrating dissimilar data types
  • Changing codes
  • Adding a time attribute
  • Summarizing data
  • Calculating derived values
  • Renormalizing data

Data Extraction – Cleanup:
  • Restructuring of records or fields
  • Removal of operational-only data
  • Supply of missing field values
  • Data integrity checks
  • Data consistency and range checks, etc.

Data Loading:
  • Initial and incremental loading
  • Updating of metadata
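The stages above can be sketched end to end in a few lines. The Python example below is a minimal, illustrative pipeline, not a production design: the file name, field names and transformation rules are assumptions. It extracts qualifying rows from a CSV source, applies simple transformations (code changes, a time attribute, a derived value), and loads the result into a SQLite target with an initial-versus-incremental switch.

import csv, sqlite3
from datetime import date

def extract(path):
    """Extraction: read qualifying rows from a source file (selection criterion assumed)."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("status") == "ACTIVE":
                yield row

def transform(row):
    """Transformation: change codes, add a time attribute, derive a value."""
    return {
        "cust_id":   int(row["cust_id"]),
        "country":   {"IN": "India", "US": "United States"}.get(row["country"], "Other"),
        "amount":    round(float(row["qty"]) * float(row["unit_price"]), 2),
        "load_date": date.today().isoformat(),
    }

def load(rows, db_path, incremental=True):
    """Loading: initial (full) or incremental append into the target table."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                   (cust_id INT, country TEXT, amount REAL, load_date TEXT)""")
    if not incremental:
        con.execute("DELETE FROM sales_fact")      # initial load starts from empty
    con.executemany(
        "INSERT INTO sales_fact VALUES (:cust_id, :country, :amount, :load_date)", rows)
    con.commit()
    con.close()

# Usage (file and database names are hypothetical):
# load((transform(r) for r in extract("source_orders.csv")), "warehouse.db")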

130

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
 The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.

 To solve the problem, companies use extract, transform and load (ETL) software.
 The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

131

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

132

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing
 Design manager – Lets developers define source-to-target mappings, transformations, process flows, and jobs
 Metadata management – Provides a repository to define, document, and manage information about the ETL design and runtime processes
 Extract – The process of reading data from a database
 Transform – The process of converting the extracted data
 Load – The process of writing the data into the target database
 Transport services – ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components
 Administration and operation – ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems

133

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

ETL Tools
 Provide a facility to specify a large number of transformation rules with a GUI
 Generate programs to transform data
 Handle multiple data sources
 Handle data redundancy
 Generate metadata as output
 Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools – Second Generation
 PowerCentre/Mart from Informatica
 Data Mart Solution from Sagent Technology
 DataStage from Ascential
134

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Metadata Management

135

© 2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is information...
 That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
 About the data being captured and loaded into the warehouse
 Documented in IT tools that improve both business and technical understanding of data and data-related processes
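As an informal illustration of the what/when/who/where/how idea, a single metadata entry for one warehouse column might look like the Python sketch below. The structure and field names are assumptions, not a real repository format.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ColumnMetadata:
    """One illustrative metadata record for a warehouse column (assumed layout)."""
    target_table: str            # where the data lands
    target_column: str
    source_system: str           # where it came from
    source_field: str
    transformation_rule: str     # how it was derived
    business_definition: str     # what it means to business users
    loaded_by: str               # who/which job loaded it
    loaded_at: datetime = field(default_factory=datetime.utcnow)

entry = ColumnMetadata(
    target_table="sales_fact",
    target_column="amount",
    source_system="orders_oltp",
    source_field="qty * unit_price",
    transformation_rule="rounded to 2 decimals, currency converted to USD",
    business_definition="Net sale amount per order line",
    loaded_by="nightly_etl_job",
)
print(entry)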

136

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating information
  How much time is spent looking for information? How often is the information found? What poor decisions were made based on incomplete information? How much money was lost or earned as a result?

Interpreting information
  How many times have businesses needed to rework or recall products? What impact does it have on the bottom line? How many mistakes were due to misinterpretation of existing documentation? How much misinterpretation results from too much metadata? How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
  How do various data perspectives connect together? How much time is spent trying to figure that out? How much does the inefficiency and lack of metadata affect decision making?

137

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management
 Provide a simple catalogue of business metadata descriptions and views  Document/manage metadata descriptions from an integrated development environment  Enable DW users to identify and invoke pre-built queries against the data stores  Design and enhance new data models and schemas for the data warehouse  Capture data transformation rules between the operational and data warehousing databases  Provide change impact analysis, and update across these technologies
138
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical Users
   • Warehouse administrator
   • Application developer
 Business Users – business metadata
   • Meanings
   • Definitions
   • Business Rules
 Software Tools
   • Used in DW life-cycle development
   • Metadata requirements for each tool must be identified
   • The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
   • Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

139

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Third Party Bridging Tools  Oracle Exchange
– Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
– Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay
– ‘Hub and Spoke’ solution for enabling metadata interoperability – Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins
– Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
140
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products — e.g., One for AD and one for DW, with bridges between them

141

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Metadata Interchange Standards  CDIF (CASE Data Interchange Format)
– Most frequently used interchange standard – Addresses only a limited subset of metadata artifacts

 OMG (Object Management Group)-CWM
– XML-addresses context and data meaning, not presentation – Can enable exchange over the web employing industry standards for storing and sharing programming data – Will allow sharing of UML and MOF objects b/w various development tools and repositories

 MDC (Metadata Coalition)
– Based on XML/UML standards
– Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member), Viasoft
142
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

OLAP

143

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques
Architectures
Features
Representative Tools

144

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing
 OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
• Used interchangeably with 'BI'
• A multidimensional view of data is the foundation of OLAP
• Users: analysts, decision makers

145

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP
Source of data
  OLTP System: Operational data; OLTPs are the original source of the data
  OLAP System: Consolidated data; OLAP data comes from the various OLTP databases
Purpose of data
  OLTP System: To control and run fundamental business tasks
  OLAP System: Decision support
What the data reveals
  OLTP System: A snapshot of ongoing business processes
  OLAP System: Multi-dimensional views of various kinds of business activities
Inserts and Updates
  OLTP System: Short and fast inserts and updates initiated by end users
  OLAP System: Periodic long-running batch jobs refresh the data

146

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
 intimately related, and
 stored, viewed and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data.
 The edges of the cube are called dimensions
 Individual items within each dimension are called members
147
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...
Relational DBMS
  A single sales table with columns MODEL, COLOR, DEALER and VOL., one row for every
  MODEL x COLOR x DEALER combination (MINI VAN, SPORTS COUPE, SEDAN x BLUE, RED,
  WHITE x Clyde, Gleason, Carr), e.g. MINI VAN / BLUE / Clyde / 6.
  27 rows x 4 columns = 108 cells

MDDB
  A Sales Volumes cube with three dimensions, MODEL (Mini Van, Coupe, Sedan),
  COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr), holding one
  volume value per cell.
  3 x 3 x 3 = 27 cells

148
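A rough Python sketch of the same contrast (pandas is assumed to be available; the dimension members come from the slide, the volume figures are dummy values): the relational form is one long table, while the multidimensional form is the same data pivoted so that each dimension becomes an axis.

import itertools
import pandas as pd  # assumed available; used only to illustrate the pivot

models  = ["Mini Van", "Coupe", "Sedan"]
colors  = ["Blue", "Red", "White"]
dealers = ["Clyde", "Gleason", "Carr"]

# Relational form: one row per (model, color, dealer) combination -> 27 rows x 4 columns.
rows = [{"MODEL": m, "COLOR": c, "DEALER": d, "VOL": i + 1}
        for i, (m, c, d) in enumerate(itertools.product(models, colors, dealers))]
flat = pd.DataFrame(rows)
print(flat.shape)            # (27, 4)  -> 108 cells

# Multidimensional form: MODEL and DEALER on one axis, COLOR on the other.
cube = flat.pivot_table(index=["MODEL", "DEALER"], columns="COLOR", values="VOL")
print(cube.size)             # 27 cells, one per cube coordinate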

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS
 Ease of Data Presentation & Navigation
   – A great deal of information is gleaned immediately upon direct inspection of the array
   – The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table
 Storage Space
   – Very low space consumption compared to a relational DB
 Performance
   – Gives much better performance
   – A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad hoc queries
 Ease of Maintenance
   – No overhead, as data is stored in the same way it is viewed; in a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance
6/19/2012
149
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

149

Issues with MDDB

• Sparsity
  – Input data in applications is typically sparse
  – Increases with increased dimensions

• Data Explosion
  – Due to sparsity
  – Due to summarization

• Performance
  – Doesn't perform better than an RDBMS at high data volumes (>20–30 GB)

6/19/2012
150
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

150

Issues with MDDB - Sparsity Example
If members of different dimensions do not interact, a blank cell is left behind.

[Diagram: an Employee table (LAST NAME, EMP#, AGE) with rows SMITH 01 21, REGAN 12 19, FOX 31 63, WELD 14 31, KELLY 54 27, LINK 03 56, KRANZ 41 45, LUCAS 33 41, WEISS 23 19, loaded into a cube with LAST NAME and EMPLOYEE # as dimensions and AGE as the value. Because each employee number belongs to only one last name, only 9 of the 81 cells are filled; the rest are blank (sparse). The 3 x 3 Sales Volumes cube (MODEL x COLOR) is shown alongside for comparison, fully populated.]
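The sparsity can also be seen numerically. The following illustrative Python sketch (pandas assumed available) pivots the employee example above and counts how few cells are filled.

import pandas as pd  # assumed available

employees = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]
df = pd.DataFrame(employees, columns=["last_name", "emp_no", "age"])

# Pivot LAST NAME x EMP# with AGE as the cell value: most cells come out empty.
cube = df.pivot_table(index="last_name", columns="emp_no", values="age")
filled = cube.notna().sum().sum()
print(f"{filled} of {cube.size} cells filled "
      f"({100 * filled / cube.size:.0f}% dense; the rest is sparsity)")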

6/19/2012
151
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

151

OLAP Features
 Calculations applied across dimensions, through hierarchies and/or across members
 Trend analysis over sequential time periods
 What-if scenarios
 Slicing / dicing subsets for on-screen viewing
 Rotation to new dimensional comparisons in the viewing area
 Drill-down/up along the hierarchy
 Reach-through / drill-through to underlying detail data
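The rotation, slicing and drill-down features illustrated on the following slides can be mimicked on a small cube with ordinary data-frame operations. The sketch below (Python, pandas assumed available, figures invented for the example) is only meant to make the vocabulary concrete; it is not how an OLAP server is implemented.

import pandas as pd  # assumed available; figures below are invented for illustration

sales = pd.DataFrame({
    "model":  ["Mini Van", "Mini Van", "Coupe", "Coupe", "Sedan", "Sedan"],
    "color":  ["Blue", "Red", "Blue", "Red", "Blue", "Red"],
    "dealer": ["Clyde", "Carr", "Clyde", "Carr", "Clyde", "Carr"],
    "volume": [6, 5, 3, 5, 4, 3],
})

# A 2-D view of the cube: MODEL down the side, COLOR across the top.
view1 = sales.pivot_table(index="model", columns="color", values="volume", aggfunc="sum")

# Rotation: swap the axes (the "rotate 90 degrees" of the slides).
view2 = view1.T

# Slicing / dicing: keep only Blue sales for Mini Van and Coupe at the Clyde dealership.
slice_ = sales[(sales.color == "Blue")
               & (sales.model.isin(["Mini Van", "Coupe"]))
               & (sales.dealer == "Clyde")]

# Drill-up: roll volumes up from dealer level to model level; drill-down reverses it.
by_model  = sales.groupby("model")["volume"].sum()
by_dealer = sales.groupby(["model", "dealer"])["volume"].sum()

print(view1, view2, slice_, by_model, by_dealer, sep="\n\n")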

6/19/2012
152
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

152

Features of OLAP - Rotation

• Complex queries and sorts in a relational environment translate to a simple rotation.

[Diagram: a 2-D Sales Volumes array with MODEL (Mini Van, Coupe, Sedan) down the rows and COLOR (Blue, Red, White) across the columns. Rotating it 90° swaps the axes to give the COLOR-by-MODEL view. A 2-dimensional array has 2 views.]
6/19/2012
153
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

153

Features of OLAP - Rotation
[Diagram: rotating the 3-D Sales Volumes cube (MODEL x COLOR x DEALERSHIP) in 90° steps produces six different orientations, Views #1 through #6. A 3-dimensional array has 6 views.]
6/19/2012
154
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

154

Features of OLAP - Slicing / Filtering
 MDDB allows the end user to quickly slice in on the exact view of the data required.

[Diagram: slicing the Sales Volumes cube down to the Mini Van and Coupe models, the Normal Blue and Metal Blue colors, and the Carr and Clyde dealerships.]
6/19/2012
155
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

155

Features of OLAP - Drill Down / Up

[Diagram: ORGANIZATION dimension hierarchy. REGION (Midwest) → DISTRICT (Chicago, St. Louis, Gary) → DEALERSHIP (Clyde, Gleason, Carr, Levi, Lucas, Bolton). Sales can be viewed at the region, district or dealership level.]

• Moving up and moving down in a hierarchy is referred to as "drill-up" / "roll-up" and "drill-down"

6/19/2012
156
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

156

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) and Year (1999, 2000); y-axis 0–200 $M.]

6/19/2012
157
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

157

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for Year 1999, broken out by quarter (1st Qtr – 4th Qtr); y-axis 0–90 $M.]

• Drill-down from Year to Quarter
6/19/2012
158
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

158

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for 1st Qtr 1999, broken out by month (January, February, March); y-axis 0–20 $M.]

• Drill-down from Quarter to Month

159

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

 MOLAP - Multidimensional OLAP
 Multidimensional Databases for database and application logic layer

 ROLAP - Relational OLAP
 Access Data stored in relational Data Warehouse for OLAP Analysis.  Database and Application logic provided as separate layers

 HOLAP - Hybrid OLAP
 The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server

 DOLAP - Desktop OLAP
 Personal MDDB Server and application on the desktop

6/19/2012
160
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

160

MOLAP - MDDB storage

[Diagram: MOLAP architecture. An OLAP cube stored in a multidimensional database is served by an OLAP calculation engine to web browsers, OLAP tools and OLAP applications.]
6/19/2012
161
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

161

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for
  Maximum query performance
  Optimum space utilization
6/19/2012
162
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

162

ROLAP - Standard SQL storage

[Diagram: ROLAP architecture. The OLAP calculation engine holds an MDDB-to-relational mapping and issues SQL against the relational data warehouse, serving web browsers, OLAP tools and OLAP applications.]
6/19/2012
163
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

163

ROLAP - Features
 Three-tier hardware/software architecture:
   GUI on the client; multidimensional processing on a mid-tier server; target database on a database server
   Processing split between mid-tier and database servers
 Ad hoc query capabilities against very large databases
 DW integration
 Data scalability

6/19/2012
164
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

164

HOLAP - Combination of RDBMS and MDDB
[Diagram: HOLAP architecture. The OLAP calculation engine serves any client (web browsers, OLAP tools, OLAP applications) from an OLAP cube for summary data and issues SQL against the relational data warehouse for detail data.]
6/19/2012
165
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

165

HOLAP - Features

RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of RDBMS and MDDB performance
Calculation engine provides full analysis features
Source of data transparent to the end user

6/19/2012
166
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

166

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in the MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in the MDDB

Data explosion due to sparsity
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
  MOLAP: With good design, 3–10 times
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
  MOLAP: Medium (MDDB server + large disk space cost)
  ROLAP: Low (only RDBMS + disk space cost)
  HOLAP: High (RDBMS + disk space + MDDB server cost)

Where to apply?
  MOLAP: Small transactional data + complex model, where the data needs to be viewed/sorted
  ROLAP: Very large transactional data + frequent summary analysis
  HOLAP: Large transactional data + frequent summary analysis

6/19/2012
167
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

167

Representative OLAP Tools:

Oracle Express products
Hyperion Essbase
Cognos PowerPlay
Seagate Holos
SAS
MicroStrategy DSS Agent
Informix MetaCube
Brio Query
Business Objects / Web Intelligence

6/19/2012
168
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

168

Sample OLAP Applications

 Sales Analysis  Financial Analysis  Profitability Analysis  Performance Analysis  Risk Management  Profiling & Segmentation  Scorecard Application  NPA Management  Strategic Planning  Customer Relationship Management (CRM)
6/19/2012
169
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

169

Data Warehouse Testing

170

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview
 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions

 The methodology required for testing a Data Warehouse is different from testing a typical transaction system

171

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System
Data warehouse testing is different on the following counts:
– User-Triggered vs. System triggered
– Volume of Test Data
– Possible scenarios / Test Cases
– Programming for testing challenge

172

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System….
 User-Triggered vs. System triggered

In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). There are very few test cycles that cover system-triggered scenarios (like billing or valuation).

173

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System…
 Volume of Test Data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to cover the maximum possible combinations of dimensions and facts.
 Possible scenarios / Test Cases
In the case of a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

174

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System…
• Programming for testing challenge
In the case of transaction systems, users/business analysts typically test the output of the system. In the case of a data warehouse, most of the data warehouse data-quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare pre-transformation data to post-transformation data.

175

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Testing Process
Data warehouse testing is basically divided into two parts:
 'Back-end' testing, where the source system data is compared to the end-result data in the loaded area
 'Front-end' testing, where users check the data by comparing their MIS with the data displayed by the end-user tools, such as OLAP.
Testing phases consist of:
 Requirements testing
 Unit testing
 Integration testing
 Performance testing
 Acceptance testing

176

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
 Are the requirements complete?
 Are the requirements singular?
 Are the requirements unambiguous?
 Are the requirements developable?
 Are the requirements testable?

177

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures (a minimal sketch of such a check follows this list):
• Whether the ETLs are accessing and picking up the right data from the right source.
• All the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
• Testing the rejected records that don't fulfil the transformation rules.
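The following stand-alone check is an illustrative example of such a script, written in Python against assumed table names and an assumed business rule; a real project would point it at the actual source and target connections.

import sqlite3

def check_transformation(source_db, target_db):
    """Compare pre-transformation source rows with post-transformation target rows.
    Table names, columns and the business rule are assumptions for the example."""
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)
    failures = []

    # Rule 1: every ACTIVE source order should appear exactly once in the target fact.
    src_count = src.execute(
        "SELECT COUNT(*) FROM orders WHERE status = 'ACTIVE'").fetchone()[0]
    tgt_count = tgt.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
    if src_count != tgt_count:
        failures.append(f"row count mismatch: source {src_count} vs target {tgt_count}")

    # Rule 2: the transformed amount should equal qty * unit_price from the source.
    expected = src.execute(
        "SELECT ROUND(SUM(qty * unit_price), 2) FROM orders WHERE status = 'ACTIVE'"
    ).fetchone()[0]
    actual = tgt.execute("SELECT ROUND(SUM(amount), 2) FROM sales_fact").fetchone()[0]
    if expected != actual:
        failures.append(f"amount mismatch: expected {expected}, got {actual}")

    return failures   # an empty list means this unit check passed

# Usage (database names are hypothetical):
# failures = check_transformation("source.db", "warehouse.db")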

178

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Unit Testing…
Unit Testing the Report data:

• Verify report data with the source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
• Field-level data verification: the QA team must understand the linkages for the fields displayed in the report and should trace back and compare them with the source systems.
• Derivation formulae/calculation rules should be verified.

179

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve the following (a sketch of an incremental-load check appears after this list):

 Sequence of ETL jobs in a batch.
 Initial loading of records into the data warehouse.
 Incremental loading of records at a later date, to verify the newly inserted or updated data.
 Testing the rejected records that don't fulfil the transformation rules.
 Error log generation.
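As with the unit-test sketch above, the Python fragment below only illustrates the incremental-load check mentioned in the list; the table layout and the change-tracking column (last_updated) are assumptions.

import sqlite3

def verify_incremental_load(source_db, target_db, since):
    """Check that source rows changed after `since` made it into the target."""
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)

    changed = src.execute(
        "SELECT order_id FROM orders WHERE last_updated > ?", (since,)).fetchall()
    missing = [oid for (oid,) in changed
               if tgt.execute("SELECT 1 FROM sales_fact WHERE order_id = ?",
                              (oid,)).fetchone() is None]
    return missing    # ideally empty: every changed source row reached the warehouse

# Usage (database names and cutoff date are hypothetical):
# missing = verify_incremental_load("source.db", "warehouse.db", "2009-01-31")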

180

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Performance Testing
Performance Testing should check for :  ETL processes completing within time window.

 Monitoring and measuring the data quality issues.
 Refresh times for standard/complex reports.

181

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

182

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Questions

183

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Thank You

184

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

185

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Architecture
This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

186

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

187

© 2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student… Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.
o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models – most commonly used:
o Star Schema o Snowflake Schema
188
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model
To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
189
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

190

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south

Dimension Table
cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.

region regId name north cold region south warm region

191

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

192

© 2009 Wipro Ltd - Confidential

The Need For Data Quality
      Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with – error detection – error rework – customer service – fixing customer problems

193

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Six Steps To Data Quality
Understand Information Flow In Organization
 Identify authoritative data sources  Interview Employees & Customers

Identify Potential Problem Areas & Asses Impact

 Data Entry Points
 Cost of bad data

Measure Quality Of Data

 Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect values
 Use data cleansing tools to clean data at the source  Load only clean data into the data warehouse

Clean & Load Data

Continuous Monitoring

 Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement

 Identify & Correct Cause of Defects  Refine data capture mechanisms at source  Educate users on importance of DQ
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

194

Data Quality Solution
Customized Programs  Strengths: – Addresses specific needs – No bulky one time investment  Limitations – Tons of Custom programs in different environments are difficult to manage – Minor alterations demand coding efforts Data Quality Assessment tools  Strength – Provide automated assessment  Limitation – No measure of data accuracy

195

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Quality Solution
Business Rule Discovery tools  Strengths – Detect Correlation in data values – Can detect Patterns of behavior that indicate fraud  Limitations – Not all variables can be discovered – Some discovered rules might not be pertinent – There may be performance problems with large files or with many fields. Data Reengineering & Cleansing tools  Strengths – Usually are integrated packages with cleansing features as Addon  Limitations – Error prevention at source is usually absent – The ETL tools have limited cleansing facilities
196
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Tools In The Market
 Business Rule Discovery Tools – Integrity Data Reengineering Tool from Vality Technology – Trillium Software System from Harte -Hanks Data Technologies – Migration Architect from DB Star  Data Reengineering & Cleansing Tools – Carlton Pureview from Oracle – ETI-Extract from Evolutionary Technologies – PowerMart from Informatica Corp – Sagent Data Mart from Sagent Technology  Data Quality Assessment Tools – Migration Architect, Evoke Axio from Evoke Software – Wizrule from Wizsoft  Name & Address Cleansing Tools – Centrus Suite from Sagent – I.d.centric from First Logic

197

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

198

© 2009 Wipro Ltd - Confidential

ETL Architecture

Visitors

Web Browsers

The Internet

External Data – Demographics, Household, Webographics, Income

Staging Area
Web Server Logs & E-comm Transaction Data Flat Files

Meta Data Repository

Scheduled Extraction

RDBMS

•Clean •Transform •Match •Merge

Scheduled Loading

Enterprise Data Warehouse

Other OLTP Systems

Data Collection

Data Extraction

Data Transformation

Data Loading

Data Storage & Integration

199

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database

Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data

Data Extraction – Cleanup
Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, etc...

Data loading
Initial and incremental loading Updation of metadata

200

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.  The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.

 To solve the problem, companies use extract, transform and load (ETL) software.
 The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.

201

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

202

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing
  Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

   

203

ETL Tools
 Provides facility to specify a large number of transformation rules with a GUI  Generate programs to transform data  Handle multiple data sources  Handle data redundancy  Generate metadata as output  Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation  PowerCentre/Mart from Informatica  Data Mart Solution from Sagent Technology  DataStage from Ascential
204

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Metadata Management

205

© 2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is Information...

  

That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes

206

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information?

How much money was lost or earned as a result? Interpreting information
How many times have businesses needed to rework or recall products?  What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?

207

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management
 Provide a simple catalogue of business metadata descriptions and views  Document/manage metadata descriptions from an integrated development environment  Enable DW users to identify and invoke pre-built queries against the data stores  Design and enhance new data models and schemas for the data warehouse  Capture data transformation rules between the operational and data warehousing databases  Provide change impact analysis, and update across these technologies
208
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical Users • Warehouse administrator • Application developer Business Users -Business metadata • Meanings • Definitions • Business Rules Software Tools • Used in DW life-cycle development • Metadata requirements for each tool must be identified • The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository • Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

209

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Third Party Bridging Tools  Oracle Exchange
– Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
– Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay
– ‘Hub and Spoke’ solution for enabling metadata interoperability – Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins
– Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
210
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products — e.g., One for AD and one for DW, with bridges between them

211

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

© 2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction, Transformation, Load

213

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

© 2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction, Transformation, Load

215

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Content [contd…]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

216

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

217

© 2009 Wipro Ltd - Confidential

What is Data Warehouse?
Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. – WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support – Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis – Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

218

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

219

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Architecture
What makes a Data Warehouse

220

© 2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

221

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Architecture
This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

222

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

223

© 2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student… Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.
o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models – most commonly used:
o Star Schema o Snowflake Schema
224
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model
To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
225
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

226

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south

Dimension Table
cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.

region regId name north cold region south warm region

227

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

228

© 2009 Wipro Ltd - Confidential

The Need For Data Quality
      Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with – error detection – error rework – customer service – fixing customer problems

229

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Six Steps To Data Quality
Understand Information Flow In Organization
 Identify authoritative data sources  Interview Employees & Customers

Identify Potential Problem Areas & Asses Impact

 Data Entry Points
 Cost of bad data

Measure Quality Of Data

 Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect values
 Use data cleansing tools to clean data at the source  Load only clean data into the data warehouse

Clean & Load Data

Continuous Monitoring

 Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement

 Identify & Correct Cause of Defects  Refine data capture mechanisms at source  Educate users on importance of DQ
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

230

Data Quality Solution
Customized Programs  Strengths: – Addresses specific needs – No bulky one time investment  Limitations – Tons of Custom programs in different environments are difficult to manage – Minor alterations demand coding efforts Data Quality Assessment tools  Strength – Provide automated assessment  Limitation – No measure of data accuracy

231

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Quality Solution
Business Rule Discovery tools  Strengths – Detect Correlation in data values – Can detect Patterns of behavior that indicate fraud  Limitations – Not all variables can be discovered – Some discovered rules might not be pertinent – There may be performance problems with large files or with many fields. Data Reengineering & Cleansing tools  Strengths – Usually are integrated packages with cleansing features as Addon  Limitations – Error prevention at source is usually absent – The ETL tools have limited cleansing facilities
232
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Tools In The Market
 Business Rule Discovery Tools – Integrity Data Reengineering Tool from Vality Technology – Trillium Software System from Harte -Hanks Data Technologies – Migration Architect from DB Star  Data Reengineering & Cleansing Tools – Carlton Pureview from Oracle – ETI-Extract from Evolutionary Technologies – PowerMart from Informatica Corp – Sagent Data Mart from Sagent Technology  Data Quality Assessment Tools – Migration Architect, Evoke Axio from Evoke Software – Wizrule from Wizsoft  Name & Address Cleansing Tools – Centrus Suite from Sagent – I.d.centric from First Logic

233

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

234

© 2009 Wipro Ltd - Confidential

ETL Architecture

Visitors

Web Browsers

The Internet

External Data – Demographics, Household, Webographics, Income

Staging Area
Web Server Logs & E-comm Transaction Data Flat Files

Meta Data Repository

Scheduled Extraction

RDBMS

•Clean •Transform •Match •Merge

Scheduled Loading

Enterprise Data Warehouse

Other OLTP Systems

Data Collection

Data Extraction

Data Transformation

Data Loading

Data Storage & Integration

235

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database

Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data

Data Extraction – Cleanup
Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, etc...

Data loading
Initial and incremental loading Updation of metadata

236

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.  The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.

 To solve the problem, companies use extract, transform and load (ETL) software.
 The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.

237

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

238

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing
  Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

   

239

ETL Tools
 Provides facility to specify a large number of transformation rules with a GUI  Generate programs to transform data  Handle multiple data sources  Handle data redundancy  Generate metadata as output  Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation  PowerCentre/Mart from Informatica  Data Mart Solution from Sagent Technology  DataStage from Ascential
240

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Metadata Management

241

© 2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is Information...

  

That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes

242

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating information
 How much time is spent looking for information?
 How often is the information found?
 What poor decisions were made based on incomplete information?
 How much money was lost or earned as a result?

Interpreting information
 How many times have businesses needed to rework or recall products?
 What impact does it have on the bottom line?
 How many mistakes were due to misinterpretation of existing documentation?
 How much interpretation results from too much metadata?
 How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
 How do the various data perspectives connect together?
 How much time is spent trying to figure that out?
 How much does inefficiency and the lack of metadata affect decision making?

243

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management
 Provide a simple catalogue of business metadata descriptions and views  Document/manage metadata descriptions from an integrated development environment  Enable DW users to identify and invoke pre-built queries against the data stores  Design and enhance new data models and schemas for the data warehouse  Capture data transformation rules between the operational and data warehousing databases  Provide change impact analysis, and update across these technologies
244
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical Users
  • Warehouse administrator
  • Application developer
 Business Users - business metadata
  • Meanings
  • Definitions
  • Business rules
 Software Tools
  • Used in DW life-cycle development
  • Metadata requirements for each tool must be identified
  • The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  • Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

245

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Third Party Bridging Tools  Oracle Exchange
– Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
– Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay
– ‘Hub and Spoke’ solution for enabling metadata interoperability – Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins
– Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
246
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products — e.g., One for AD and one for DW, with bridges between them

247

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Metadata Interchange Standards  CDIF (CASE Data Interchange Format)
– Most frequently used interchange standard – Addresses only a limited subset of metadata artifacts

 OMG (Object Management Group)-CWM
– XML-addresses context and data meaning, not presentation – Can enable exchange over the web employing industry standards for storing and sharing programming data – Will allow sharing of UML and MOF objects b/w various development tools and repositories

 MDC (Metadata Coalition)
– Based on XML/UML standards – Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft
248
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

OLAP

249

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition Distinction between OLTP and OLAP

MDDB Concepts
Implementation Techniques Architectures

Features
Representative Tools

6/19/2012

250

250

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing
 OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
 • Used interchangeably with 'BI'
 • The multidimensional view of data is the foundation of OLAP
 • Users: analysts, decision makers
6/19/2012 251

251

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP
Source of data: OLTP has operational data (OLTPs are the original source of the data); OLAP has consolidated data (OLAP data comes from the various OLTP databases).
Purpose of data: OLTP is used to control and run fundamental business tasks; OLAP is used for decision support.
What the data reveals: OLTP gives a snapshot of ongoing business processes; OLAP gives multi-dimensional views of various kinds of business activities.
Inserts and updates: in OLTP, short and fast inserts and updates are initiated by end users; in OLAP, periodic long-running batch jobs refresh the data.

6/19/2012

252

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
 intimately related, and
 stored, viewed and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data.
 The edges of the cube are called dimensions
 Individual items within each dimension are called members
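A toy illustration of the idea in Python, using the MODEL x COLOR x DEALERSHIP example from the next slide (a dictionary keyed by one member from each dimension stands in for a cube cell; only a few cells are shown):

dimensions = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Clyde", "Gleason", "Carr"],
}

# One cell of the hypercube per combination of dimension members
sales_volumes = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Blue", "Gleason"): 3,
    ("Mini Van", "Blue", "Carr"): 2,
    # ... remaining cells
}

# A cell is retrieved directly by its coordinates along each dimension
print(sales_volumes[("Mini Van", "Blue", "Clyde")])   # 6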
253
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...
[Figure: the same Sales Volumes data shown two ways. As a relational table of (MODEL, COLOR, DEALER, VOL.) rows, e.g. (MINI VAN, BLUE, Clyde, 6), it needs 27 rows x 4 columns = 108 cells. As an MDDB cube with dimensions MODEL (Mini Van, Coupe, Sedan) x COLOR (Blue, Red, White) x DEALERSHIP (Clyde, Gleason, Carr), it needs 3 x 3 x 3 = 27 cells.]

254

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS
 Ease of Data Presentation & Navigation
  – A great deal of information is gleaned immediately upon direct inspection of the array
  – The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by a relational table
 Storage Space
  – Very low space consumption compared to a relational DB
 Performance
  – Gives much better performance
  – A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries
 Ease of Maintenance
  – No overhead, as data is stored the same way it is viewed; in a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance
6/19/2012
255
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

255

Issues with MDDB

• Sparsity
 - Input data in applications are typically sparse
 - Increases with increased dimensions

• Data Explosion
 - Due to sparsity
 - Due to summarization

• Performance
 - Doesn’t perform better than an RDBMS at high data volumes (>20-30 GB)

6/19/2012
256
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

256

Issues with MDDB - Sparsity Example
If members of different dimensions do not interact, a blank cell is left behind.

[Figure: an Employee Age matrix with LAST NAME on one axis and EMPLOYEE # on the other (Smith 01, Regan 12, Fox 31, Weld 14, Kelly 54, Link 03, Kranz 41, Lucas 33, Weiss 23). Each employee number pairs with exactly one last name, so only nine of the 9 x 9 cells hold an age (e.g. SMITH/01 = 21, REGAN/12 = 19, FOX/31 = 63) and the rest are blank, in contrast to the fully populated 3 x 3 Sales Volumes cube shown alongside.]
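Because only one last name goes with each employee number, a dictionary-of-keys layout (a common sparse-storage technique, shown here purely as an illustration rather than as how any particular MDDB product stores data) keeps just the populated cells:

# Only the non-blank cells are stored; every absent key is an implicit blank
employee_age = {
    ("SMITH", "01"): 21,
    ("REGAN", "12"): 19,
    ("FOX", "31"): 63,
    # ... one populated cell per employee
}

dense_cells = 9 * 9                 # the full LAST NAME x EMPLOYEE # matrix
stored_cells = len(employee_age)
print(f"{stored_cells} of {dense_cells} cells populated")   # 3 of 81 in this snippet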

6/19/2012
257
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

257

OLAP Features
 Calculations applied across dimensions, through hierarchies and/or across members
 Trend analysis over sequential time periods
 What-if scenarios
 Slicing / dicing subsets for on-screen viewing
 Rotation to new dimensional comparisons in the viewing area
 Drill-down / up along the hierarchy
 Reach-through / drill-through to underlying detail data

6/19/2012
258
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

258

Features of OLAP - Rotation

• Complex queries and sorts in a relational environment translate to simple rotation.
[Figure: the Sales Volumes grid laid out with MODEL and COLOR as its two axes (values 6/5/4 for Mini Van, 3/5/5 for Coupe, 4/3/2 for Sedan across Blue/Red/White) is rotated 90 degrees, swapping the row and column axes, to give View #1 and View #2.]

A 2-dimensional array has 2 views.
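In code, the same rotation is just a transpose of the two-dimensional array; a short sketch with the figures from the view above:

# Sales volumes with MODEL as rows and COLOR as columns (one view)
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]
by_model = [
    [6, 5, 4],   # Mini Van: Blue, Red, White
    [3, 5, 5],   # Coupe
    [4, 3, 2],   # Sedan
]

# Rotating 90 degrees swaps the axes: COLOR becomes rows, MODEL columns (the other view)
by_color = [list(row) for row in zip(*by_model)]
for color, row in zip(colors, by_color):
    print(color, row)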
6/19/2012
259
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

259

Features of OLAP - Rotation
[Figure: the three-dimensional Sales Volumes cube (MODEL x COLOR x DEALERSHIP, with members Mini Van/Coupe/Sedan, Blue/Red/White and Clyde/Gleason/Carr) rotated 90 degrees repeatedly to show all six possible orientations, Views #1 through #6.]

A 3-dimensional array has 6 views.
6/19/2012
260
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

260

Features of OLAP - Slicing / Filtering
 MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube is sliced down to the MODEL members Mini Van and Coupe, the COLOR members Normal Blue and Metal Blue, and the DEALERSHIP members Carr and Clyde, leaving only the sub-cube of interest.]
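A sketch of the same slice on a cube held as a Python dictionary; the dimension members come from the figure, but the volumes are made up for illustration:

sales_volumes = {
    ("Mini Van", "Normal Blue", "Carr"): 5,
    ("Mini Van", "Metal Blue", "Clyde"): 2,
    ("Sedan", "Red", "Gleason"): 7,
    # ... more cells
}

# Slice: keep only the models, colors and dealerships the user asked for
wanted_models = {"Mini Van", "Coupe"}
wanted_colors = {"Normal Blue", "Metal Blue"}
wanted_dealers = {"Carr", "Clyde"}

sliced = {
    cell: volume
    for cell, volume in sales_volumes.items()
    if cell[0] in wanted_models and cell[1] in wanted_colors and cell[2] in wanted_dealers
}
print(sliced)   # only the sub-cube of interest remains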
6/19/2012
261
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

261

Features of OLAP - Drill Down / Up

[Figure: the ORGANIZATION dimension hierarchy. REGION (Midwest) rolls up DISTRICT (Chicago, St. Louis, Gary), which rolls up DEALERSHIP (Clyde, Gleason, Carr, Levi, Lucas, Bolton); sales can be viewed at the region, district or dealership level.]

• Moving Up and moving down in a hierarchy is referred to as “drill-up” / “roll-up” and “drill-down”
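A sketch of rolling up along this hierarchy; the mapping of dealerships to districts and the sales figures are hypothetical, used only to show the mechanics:

# Hypothetical hierarchy mapping and dealership-level sales
dealership_to_district = {
    "Clyde": "Chicago", "Gleason": "Chicago", "Carr": "Chicago",
    "Levi": "St. Louis", "Lucas": "St. Louis", "Bolton": "Gary",
}
sales_by_dealership = {"Clyde": 120, "Gleason": 95, "Carr": 80,
                       "Levi": 60, "Lucas": 75, "Bolton": 50}

# Roll-up (drill-up): aggregate dealership-level sales to district level
sales_by_district = {}
for dealer, amount in sales_by_dealership.items():
    district = dealership_to_district[dealer]
    sales_by_district[district] = sales_by_district.get(district, 0) + amount

# Roll-up again to region level (all three districts belong to the Midwest region)
sales_by_region = {"Midwest": sum(sales_by_district.values())}
print(sales_by_district, sales_by_region)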

6/19/2012
262
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

262

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for Years 1999 and 2000.]

6/19/2012
263
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

263

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for Year 1999, broken down by quarter (1st Qtr to 4th Qtr).]

• Drill-down from Year to Quarter
6/19/2012
264
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

264

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for the 1st Qtr of Year 1999, broken down by month (January, February, March).]

• Drill-down from Quarter to Month

265

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

 MOLAP - Multidimensional OLAP
 Multidimensional Databases for database and application logic layer

 ROLAP - Relational OLAP
 Access data stored in a relational data warehouse for OLAP analysis
 Database and application logic provided as separate layers

 HOLAP - Hybrid OLAP
 OLAP Server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

 DOLAP - Desk OLAP
 Personal MDDB Server and application on the desktop

6/19/2012
266
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

266

MOLAP - MDDB storage

[Diagram: MOLAP storage. An OLAP cube (MDDB) sits behind the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications.]
6/19/2012
267
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

267

MOLAP - Features

 Powerful analytical capabilities (e.g., financial, forecasting, statistical)
 Aggregation and calculation capabilities
 Read/write analytic applications
 Specialized data structures for:
  - Maximum query performance
  - Optimum space utilization
6/19/2012
268
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

268

ROLAP - Standard SQL storage

[Diagram: ROLAP storage. The OLAP calculation engine performs MDDB-to-relational mapping, issuing SQL against a relational data warehouse, and serves web browsers, OLAP tools and OLAP applications.]
6/19/2012
269
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

269

ROLAP - Features
 Three-tier hardware/software architecture:
  - GUI on the client; multidimensional processing on a mid-tier server; target database on a database server
  - Processing split between the mid-tier and database servers
 Ad hoc query capabilities to very large databases
 DW integration
 Data scalability

6/19/2012
270
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

270

HOLAP - Combination of RDBMS and MDDB
[Diagram: HOLAP storage. The OLAP calculation engine sits in front of both an OLAP cube (MDDB) and a relational data warehouse accessed via SQL, and serves any client: web browsers, OLAP tools and OLAP applications.]
6/19/2012
271
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

271

HOLAP - Features

 RDBMS used for detailed data stored in large databases
 MDDB used for fast, read/write OLAP analysis and calculations
 Scalability of RDBMS and MDDB performance
 Calculation engine provides full analysis features
 Source of data transparent to the end user

6/19/2012
272
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

272

Architecture Comparison

Definition:
 MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
 ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
 HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity:
 MOLAP: High (may go beyond control; estimation is very important)
 ROLAP: No sparsity
 HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization:
 MOLAP: With good design, 3 - 10 times
 ROLAP: To the necessary extent
 HOLAP: To the necessary extent

Query execution speed:
 MOLAP: Fast (depends upon the size of the MDDB)
 ROLAP: Slow
 HOLAP: Optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost:
 MOLAP: Medium (MDDB server + large disk space cost)
 ROLAP: Low (only RDBMS + disk space cost)
 HOLAP: High (RDBMS + disk space + MDDB server cost)

Where to apply?
 MOLAP: Small transactional data + complex model, where the data needs to be viewed / sorted
 ROLAP: Very large transactional data with frequent summary analysis
 HOLAP: Large transactional data with frequent summary analysis

6/19/2012
273
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

273

Representative OLAP Tools:

Oracle Express products
Hyperion Essbase
Cognos PowerPlay
Seagate Holos
SAS
MicroStrategy DSS Agent
Informix MetaCube
Brio Query
Business Objects / Web Intelligence

6/19/2012
274
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

274

Sample OLAP Applications

 Sales Analysis  Financial Analysis  Profitability Analysis  Performance Analysis  Risk Management  Profiling & Segmentation  Scorecard Application  NPA Management  Strategic Planning  Customer Relationship Management (CRM)
6/19/2012
275
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

275

Data Warehouse Testing

276

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview
 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions

 The methodology required for testing a Data Warehouse is different from testing a typical transaction system

277

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System
Data warehouse testing is different on the following counts: – User-Triggered vs. System triggered – Volume of Test Data – Possible scenarios/ Test Cases – Programming for testing challenge

278

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System….
 User-Triggered vs. System triggered

In a data warehouse, most of the testing is system-triggered. In production/source systems, most testing covers the processing of individual transactions, which are driven by some input from the users (application forms, servicing requests, etc.). There are very few test cycles that cover system-triggered scenarios (like billing or valuation).

279

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System…
 Volume of Test Data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to cover the maximum possible combinations of dimensions and facts.
 Possible scenarios / Test Cases
In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

280

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System…
• Programming for testing challenge
In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare pre-transformation data to post-transformation data.
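A minimal sketch of such a stand-alone back-end check, assuming hypothetical staging (pre-transformation) and warehouse (post-transformation) tables in SQLite; the database, table and column names are placeholders:

import sqlite3

conn = sqlite3.connect("warehouse.db")

# Compare a record count and a control total between the source staging table and the target fact table
src_count, src_total = conn.execute(
    "SELECT COUNT(*), COALESCE(SUM(qty * unit_price), 0) FROM stg_orders WHERE status = 'ACTIVE'"
).fetchone()
tgt_count, tgt_total = conn.execute(
    "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM sales_fact"
).fetchone()

assert src_count == tgt_count, f"Row count mismatch: {src_count} source vs {tgt_count} target"
assert abs(src_total - tgt_total) < 0.01, f"Control total mismatch: {src_total} vs {tgt_total}"
print("Pre- vs post-transformation reconciliation passed")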

281

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Testing Process
Data warehouse testing is basically divided into two parts:
 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by the end-user tools, such as OLAP.
Testing phases consist of:
 Requirements testing
 Unit testing
 Integration testing
 Performance testing
 Acceptance testing

282

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
 Are the requirements complete?
 Are the requirements singular?
 Are the requirements unambiguous?
 Are the requirements developable?
 Are the requirements testable?

283

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
• Whether ETLs are accessing and picking up the right data from the right source.
• Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
• Testing the rejected records that don’t fulfil transformation rules.

284

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Unit Testing…
Unit Testing the Report data:

• Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the data warehouse against the source data available.
• Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace back and compare them with the source systems.
• Derivation formulae / calculation rules should be verified.

285

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve following:

 Sequence of ETL jobs in a batch.
 Initial loading of records into the data warehouse.
 Incremental loading of records at a later date, to verify the newly inserted or updated data.
 Testing the rejected records that don’t fulfil transformation rules.
 Error log generation.

286

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Performance Testing
Performance Testing should check for :  ETL processes completing within time window.

 Monitoring and measuring the data quality issues.
 Refresh times for standard/complex reports.

287

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

288

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Questions

289

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Thank You

290

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential



ETL Architecture

Visitors

Web Browsers

The Internet

External Data – Demographics, Household, Webographics, Income

Staging Area
Web Server Logs & E-comm Transaction Data Flat Files

Meta Data Repository

Scheduled Extraction

RDBMS

•Clean •Transform •Match •Merge

Scheduled Loading

Enterprise Data Warehouse

Other OLTP Systems

Data Collection

Data Extraction

Data Transformation

Data Loading

Data Storage & Integration

310

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database

Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data

Data Extraction – Cleanup
Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, etc...

Data loading
Initial and incremental loading Updation of metadata

311

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.  The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.

 To solve the problem, companies use extract, transform and load (ETL) software.
 The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.

312

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

313

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing
  Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

   

314

ETL Tools
 Provides facility to specify a large number of transformation rules with a GUI  Generate programs to transform data  Handle multiple data sources  Handle data redundancy  Generate metadata as output  Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation  PowerCentre/Mart from Informatica  Data Mart Solution from Sagent Technology  DataStage from Ascential
315

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Metadata Management

316

© 2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is Information...

  

That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes

317

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information?

How much money was lost or earned as a result? Interpreting information
How many times have businesses needed to rework or recall products?  What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?

318

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management
 Provide a simple catalogue of business metadata descriptions and views  Document/manage metadata descriptions from an integrated development environment  Enable DW users to identify and invoke pre-built queries against the data stores  Design and enhance new data models and schemas for the data warehouse  Capture data transformation rules between the operational and data warehousing databases  Provide change impact analysis, and update across these technologies
319
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical Users • Warehouse administrator • Application developer Business Users -Business metadata • Meanings • Definitions • Business Rules Software Tools • Used in DW life-cycle development • Metadata requirements for each tool must be identified • The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository • Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

320

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Third Party Bridging Tools  Oracle Exchange
– Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
– Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay
– ‘Hub and Spoke’ solution for enabling metadata interoperability – Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins
– Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
321
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products — e.g., One for AD and one for DW, with bridges between them

322

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Metadata Interchange Standards  CDIF (CASE Data Interchange Format)
– Most frequently used interchange standard – Addresses only a limited subset of metadata artifacts

 OMG (Object Management Group)-CWM
– XML-addresses context and data meaning, not presentation – Can enable exchange over the web employing industry standards for storing and sharing programming data – Will allow sharing of UML and MOF objects b/w various development tools and repositories

 MDC (Metadata Coalition)
– Based on XML/UML standards – Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft
323
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

OLAP

324

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition Distinction between OLTP and OLAP

MDDB Concepts
Implementation Techniques Architectures

Features
Representative Tools

6/19/2012

325

325

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing
 OLAP can be defined as a technology which allows the users to view the aggregate data across measurements (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.) • Used interchangeably with ‘BI’ • Multidimensional view of data is the foundation of OLAP • Users :Analysts, Decision makers
6/19/2012 326

326

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP
OLTP System Source of data Operational data; OLTPs are the original source of the data To control and run fundamental business tasks A snapshot of ongoing business processes Short and fast inserts and updates initiated by end users
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

OLAP System Consolidation data; OLAP data comes from the various OLTP databases Decision support

Purpose of data

What the data reveals Inserts and Updates
6/19/2012

Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the 327 data

327

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is  intimately related and  stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data.  The edges of the cube are called dimensions  Individual items within each dimensions are called members
328
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...
Relational DBMS
MODEL MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SEDAN SEDAN SEDAN ... COLOR BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE … DEALER Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr … VOL. 6 3 2 5 3 1 3 1 4 3 3 3 4 3 6 2 3 5 4 3 2 ...

MDDB

Sales Volumes

M O D E L

Mini Van

Coupe Carr Gleason Clyde Blue Red White

Sedan

DEALERSHIP

COLOR

27 x 4 = 108 cells
329

3 x 3 x 3 = 27 cells

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS
 Ease of Data Presentation & Navigation – A great deal of information is gleaned immediately upon direct inspection of the array – User is able to view data along presorted dimensions with data arranged in an inherently more organized, and accessible fashion than the one offered by the relational table.  Storage Space – Very low Space Consumption compared to Relational DB  Performance – Gives much better performance. – Relational DB may give comparable results only through database tuning (indexing, keys etc), which may not be possible for ad-hoc queries.  Ease of Maintenance – No overhead as data is stored in the same way it is viewed. In Relational DB, indexes, sophisticated joins etc. are used which require considerable storage and maintenance
6/19/2012
330
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

330

Issues with MDDB

• Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions

• Data Explosion
-Due to Sparsity -Due to Summarization

• Performance
-Doesn’t perform better than RDBMS at high data volumes (>20-30 GB)

6/19/2012
331
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

331

Issues with MDDB - Sparsity Example
If dimension members of different dimensions do not interact , then blank cell is left behind.
Employee Age
21 19 63 31 27 56 45 41 19
31 41 23 01 14 54 03 12 33

LAST NAME EMP# AGE SMITH 01 21 REGAN 12 Sales Volumes 19 FOX 31 63 Miini Van WELD 14 6 5 31 4 M O KELLY 54 3 5 27 Coupe D 5 E L LINK 03 56 4 3 2 Sedan KRANZ 41 45 Blue Red White LUCUS 33 COLOR 41 WEISS 23 19

Smith

Regan

Fox

L A S T N A M E

Weld

Kelly

Link

Kranz

Lucas

Weiss

EMPLOYEE #

6/19/2012
332
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

332

OLAP Features
 Calculations applied across dimensions, through hierarchies and/or across members  Trend analysis over sequential time periods,  What-if scenarios.  Slicing / Dicing subsets for on-screen viewing  Rotation to new dimensional comparisons in the viewing area  Drill-down/up along the hierarchy  Reach-through / Drill-through to underlying detail data

6/19/2012
333
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

333

Features of OLAP - Rotation

• Complex Queries & Sorts in Relational environment translated to simple rotation.
Sales Volumes

M O D E L

Mini Van

6 3 4
Blue

5 5 3
Red

4 5 2
White

Coupe

C O L O R ( ROTATE 90 )
o

Blue

6 5 4

3 5 5
MODEL

4 3 2
Sedan

Red

Sedan

White

Mini Van Coupe

COLOR

View #1

View #2

2 dimensional array has 2 views.
6/19/2012
334
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

334

Features of OLAP - Rotation
Sales Volumes

M O D E L

Mini Van Coupe Carr Gleason Clyde Blue Red White

Sedan

C O L O R

Blue

Red White Sedan Coupe Mini Van Carr Gleason Clyde

C O L O R

Blue

Red White Carr Gleason Clyde Mini Van Coupe Sedan

COLOR

( ROTATE 90 )

o

MODEL

( ROTATE 90 )

o

DEALERSHIP

( ROTATE 90 )

o

DEALERSHIP

DEALERSHIP

MODEL

View #1
D E A L E R S H I P D E A L E R S H I P

View #2

View #3

Carr Gleason Mini Van Coupe Sedan White Red Blue

Carr Gleason Blue Red White

Mini Van

Clyde

Clyde Mini Van Coupe Sedan

M O D E L

Coupe Blue Red White Clyde Gleason Carr

Sedan

COLOR

( ROTATE 90 )

o

MODEL

( ROTATE 90 )

o

DEALERSHIP

MODEL

COLOR

COLOR

View #4

View #5

View #6

3 dimensional array has 6 views.
6/19/2012
335
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

335

Features of OLAP - Slicing / Filtering
 MDDB allows end user to quickly slice in on exact view of the data required.

Sales Volumes

M O D E L

Mini Van Mini Van

Coupe

Coupe Normal Metal Blue Blue

Carr Clyde

Carr Clyde

Normal Blue

Metal Blue

DEALERSHIP

COLOR
6/19/2012
336
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

336

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
REGION Midwest

DISTRICT

Chicago

St. Louis

Gary

DEALERSHIP

Clyde

Gleason

Carr

Levi

Lucas

Bolton

Sales at region/District/Dealership Level

• Moving Up and moving down in a hierarchy is referred to as “drill-up” / “roll-up” and “drill-down”

6/19/2012
337
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

337

OLAP Reporting - Drill Down

Inflows ( Region , Year)
200 150 Inflows 100 ($M) 50 0 Year Year 1999 2000 Years

East West Central

6/19/2012
338
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

338

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999)
90 80 70 60 50 Inflows ( $M) 40 30 20 10 0

East West Central

1st Qtr

2nd Qtr 3rd Qtr Year 1999

4th Qtr

• Drill-down from Year to Quarter
6/19/2012
339
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

339

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999 - 1st Qtr)
20 15 Inflows ( $M 10 ) 5

East West Central
January February March Year 1999

0

• Drill-down from Quarter to Month

340

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

 MOLAP - Multidimensional OLAP
 Multidimensional Databases for database and application logic layer

 ROLAP - Relational OLAP
 Access Data stored in relational Data Warehouse for OLAP Analysis.  Database and Application logic provided as separate layers

 HOLAP - Hybrid OLAP
 OLAP Server routes queries first to MDDB, then to RDBMS and result processed on-the-fly in Server

 DOLAP - Desk OLAP
 Personal MDDB Server and application on the desktop

6/19/2012
341
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

341

MOLAP - MDDB storage

OLAP
Cube
OLAP Calculation Engine

Web Browser

OLAP Tools

OLAP Appli cations
6/19/2012
342
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

342

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
6/19/2012
343
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

343

ROLAP - Standard SQL storage

MDDB - Relational Mapping

Relational DW

Web Browser
OLAP Calculation Engine

SQL

OLAP Tools

OLAP Applications
6/19/2012
344
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

344

ROLAP - Features  Three-tier hardware/software architecture:
 GUI on client; multidimensional processing on midtier server; target database on database server  Processing split between mid-tier & database servers

Ad hoc query capabilities to very large databases DW integration Data scalability

6/19/2012
345
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

345

HOLAP - Combination of RDBMS and MDDB
OLAP Cube
Any Client

Relational DW

Web Browser
OLAP Calculation Engine

SQL

OLAP Tools

OLAP Applications
6/19/2012
346
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

346

HOLAP - Features

RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user

6/19/2012
347
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

347

Architecture Comparison

MOLAP
Definition

ROLAP

HOLAP
Hybrid OLAP = ROLAP + summary in MDDB Sparsity exists only in MDDB part To the necessary extent

MDDB OLAP = Relational OLAP = Transaction level data + Transaction level data + summary in MDDB summary in RDBMS Good Design 3 – 10 times High (May go beyond control. Estimation is very important) Fast - (Depends upon the size of the MDDB) No Sparsity To the necessary extent

Data explosion due to Sparsity Data explosion due to Summarization Query Execution Speed

Slow

Optimum - If the data is fetched from RDBMS then it’s like ROLAP otherwise like MOLAP. High: RDBMS + disk space + MDDB Server cost Large transactional data + frequent summary analysis

Cost

Medium: MDDB Server + large disk space cost

Low: Only RDBMS + disk space cost

Where to apply?

Small transactional Very large transactional data + complex model + data & it needs to be frequent summary viewed / sorted analysis

6/19/2012
348
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

348

Representative OLAP Tools:

Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS

Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence

6/19/2012
349
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

349

Sample OLAP Applications

 Sales Analysis  Financial Analysis  Profitability Analysis  Performance Analysis  Risk Management  Profiling & Segmentation  Scorecard Application  NPA Management  Strategic Planning  Customer Relationship Management (CRM)
6/19/2012
350
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

350

Data Warehouse Testing

351

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview
 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions

 The methodology required for testing a Data Warehouse is different from testing a typical transaction system

352

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System
Data warehouse testing is different on the following counts: – User-Triggered vs. System triggered – Volume of Test Data – Possible scenarios/ Test Cases – Programming for testing challenge

353

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System….
 User-Triggered vs. System triggered

In data Warehouse, most of the testing is system triggered. Most of the production/Source system testing is the processing of individual transactions, which are driven by some input from the users (Application Form, Servicing Request.). There are very few test cycles, which cover the system-triggered scenarios (Like billing, Valuation.)

354

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System…
 Volume of Test Data The test data in a transaction system is a very small sample of the overall production data. Data Warehouse has typically large test data as one does try to fill-up maximum possible combination of dimensions and facts.  Possible scenarios/ Test Cases In case of Data Warehouse, the permutations and combinations one can possibly test is virtually unlimited due to the core objective of Data Warehouse is to allow all possible views of data.

355

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System…
• Programming for testing challenge In case of transaction systems, users/business analysts typically test the output of the system. In case of data warehouse, most of the 'Data Warehouse data Quality testing' and ETL testing is done at backend by running separate stand-alone scripts. These scripts compare preTransformation to post Transformation of data.

356

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Testing Process
Data-Warehouse testing is basically divided into two parts :  'Back-end' testing where the source systems data is compared to the endresult data in Loaded area  'Front-end' testing where the user checks the data by comparing their MIS with the data displayed by the end-user tools like OLAP. Testing phases consists of :  Requirements testing  Unit testing  Integration testing  Performance testing  Acceptance testing

357

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Requirements testing
The main aim for doing Requirements testing is to check stated requirements for completeness. Requirements can be tested on following factors.  Are the requirements Complete?  Are the requirements Singular?  Are the requirements Ambiguous?  Are the requirements Developable?  Are the requirements Testable?

358

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures: •Whether ETLs are accessing and picking up right data from right source.

•All the data transformations are correct according to the business rules and data warehouse is correctly populated with the transformed data.
•Testing the rejected records that don’t fulfil transformation rules.

359

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Unit Testing…
Unit Testing the Report data:

•Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available •Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems •Derivation formulae/calculation rules should be verified

360

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve following:

 Sequence of ETLs jobs in batch.  Initial loading of records on data warehouse.  Incremental loading of records at a later date to verify the newly inserted or updated data.  Testing the rejected records that don’t fulfil transformation rules.  Error log generation

361

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Performance Testing
Performance Testing should check for :  ETL processes completing within time window.

 Monitoring and measuring the data quality issues.
 Refresh times for standard/complex reports.

362

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

363

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Questions

364

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Thank You

365

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Metadata Interchange Standards
 CDIF (CASE Data Interchange Format)
– The most frequently used interchange standard
– Addresses only a limited subset of metadata artifacts
 OMG (Object Management Group) - CWM
– XML-based; addresses context and data meaning, not presentation
– Can enable exchange over the web using industry standards for storing and sharing programming data
– Will allow sharing of UML and MOF objects between various development tools and repositories
 MDC (Metadata Coalition)
– Based on XML/UML standards
– Promoted by Microsoft along with 20 partners, including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member), Viasoft
366
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

OLAP

367

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition Distinction between OLTP and OLAP

MDDB Concepts
Implementation Techniques Architectures

Features
Representative Tools

6/19/2012

368

368

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing
 OLAP can be defined as a technology that allows users to view aggregate data across measures (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
• Often used interchangeably with 'BI'
• A multidimensional view of data is the foundation of OLAP
• Users: analysts, decision makers
6/19/2012 369

369

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP
Source of data:
– OLTP system: operational data; the OLTP systems are the original source of the data
– OLAP system: consolidated data; OLAP data comes from the various OLTP databases
Purpose of data:
– OLTP system: to control and run fundamental business tasks
– OLAP system: decision support
What the data reveals:
– OLTP system: a snapshot of ongoing business processes
– OLAP system: multi-dimensional views of various kinds of business activities
Inserts and updates:
– OLTP system: short and fast inserts and updates initiated by end users
– OLAP system: periodic long-running batch jobs refresh the data

370

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database (MDDB) is a software system designed to allow for efficient and convenient storage and retrieval of data that is
 intimately related, and
 stored, viewed and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data.
 The edges of the cube are called dimensions
 Individual items within each dimension are called members
A toy hypercube is sketched below.
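A small Python sketch (loosely following the Model x Color x Dealership example on the next slide; the values are invented) shows dimensions, members and cells:

import numpy as np

models  = ["Mini Van", "Coupe", "Sedan"]      # members of the MODEL dimension
colors  = ["Blue", "Red", "White"]            # members of the COLOR dimension
dealers = ["Clyde", "Gleason", "Carr"]        # members of the DEALERSHIP dimension

cube = np.zeros((len(models), len(colors), len(dealers)))   # 3 x 3 x 3 = 27 cells
cube[models.index("Mini Van"), colors.index("Blue"), dealers.index("Clyde")] = 6
cube[models.index("Coupe"), colors.index("Red"), dealers.index("Carr")] = 4

# Summing over the dealer axis yields the Model x Color view of the cube.
print(cube.sum(axis=2))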
371
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...
Relational DBMS (one row per combination of dimension members):

MODEL      COLOR  DEALER   VOL.
MINI VAN   BLUE   Clyde    6
MINI VAN   BLUE   Gleason  3
MINI VAN   BLUE   Carr     2
...        ...    ...      ...

27 rows x 4 columns = 108 cells

MDDB (Sales Volumes hypercube):
Dimensions: MODEL (Mini Van, Coupe, Sedan) x COLOR (Blue, Red, White) x DEALERSHIP (Clyde, Gleason, Carr)

3 x 3 x 3 = 27 cells

372

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS
 Ease of Data Presentation & Navigation
– A great deal of information is gleaned immediately upon direct inspection of the array
– The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than that offered by a relational table
 Storage Space
– Very low space consumption compared to a relational DB
 Performance
– Gives much better performance
– A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries
 Ease of Maintenance
– No overhead, as data is stored in the same way it is viewed; in a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance
6/19/2012
373
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

373

Issues with MDDB

• Sparsity
– Input data in applications is typically sparse
– Sparsity increases as dimensions are added

• Data Explosion
– Due to sparsity
– Due to summarization

• Performance
– Does not perform better than an RDBMS at high data volumes (>20-30 GB)

A quick illustration of how sparsity grows with dimensionality is given below.
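A back-of-the-envelope illustration (all numbers invented): the same number of loaded facts fills an ever smaller fraction of the cube as dimensions are added.

facts = 1_000                       # assumed number of loaded fact records
for dims in (2, 3, 4):
    cells = 100 ** dims             # assume 100 members per dimension
    print(f"{dims} dimensions: {cells:,} cells, density {facts / cells:.6%}")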

6/19/2012
374
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

374

Issues with MDDB - Sparsity Example
If members of different dimensions do not interact, a blank cell is left behind.

Example: Employee Age data

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

(Figure: a Last Name x Employee # cube built from this table. Only the nine cells where an employee number meets its owner's last name hold an age value; the remaining 72 of the 81 cells are blank, i.e. the cube is sparse.)

6/19/2012
375
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

375

OLAP Features
 Calculations applied across dimensions, through hierarchies and/or across members  Trend analysis over sequential time periods,  What-if scenarios.  Slicing / Dicing subsets for on-screen viewing  Rotation to new dimensional comparisons in the viewing area  Drill-down/up along the hierarchy  Reach-through / Drill-through to underlying detail data
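As a rough pandas illustration of a few of these operations (rotation, slicing and drill-up), using invented figures in the deck's Model/Color/Dealership example:

import pandas as pd

sales = pd.DataFrame({
    "model":  ["Mini Van", "Mini Van", "Coupe", "Coupe", "Sedan", "Sedan"],
    "color":  ["Blue", "Red", "Blue", "White", "Red", "White"],
    "dealer": ["Clyde", "Gleason", "Carr", "Clyde", "Gleason", "Carr"],
    "volume": [6, 5, 3, 2, 3, 5],
})

# Rotation: Model rows vs. Color columns (pivot the viewing area).
view1 = sales.pivot_table(values="volume", index="model", columns="color", aggfunc="sum")

# Slicing: fix one member of the Color dimension.
blue_slice = sales[sales["color"] == "Blue"]

# Drill-up / roll-up: aggregate away the Dealer level of the hierarchy.
by_model = sales.groupby("model")["volume"].sum()

print(view1, blue_slice, by_model, sep="\n\n")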

6/19/2012
376
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

376

Features of OLAP - Rotation

• Complex queries and sorts in a relational environment translate into a simple rotation of the cube.

Sales Volumes (View #1: MODEL rows x COLOR columns):

           Blue  Red  White
Mini Van    6     5    4
Coupe       3     5    5
Sedan       4     3    2

(ROTATE 90°) gives View #2: COLOR rows x MODEL columns, i.e. the same nine cells with rows and columns exchanged.

A 2-dimensional array has 2 views.
6/19/2012
377
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

377

Features of OLAP - Rotation
Sales Volumes: a 3-dimensional array (MODEL x COLOR x DEALERSHIP).

(Figure: rotating the cube in 90° steps exposes each pair of dimensions as the front face - MODEL x COLOR, COLOR x DEALERSHIP, DEALERSHIP x MODEL, and their transposes - giving Views #1 to #6.)

A 3-dimensional array has 6 views.
6/19/2012
378
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

378

Features of OLAP - Slicing / Filtering
 The MDDB allows the end user to quickly slice in on the exact view of the data required.

(Figure: from the full Sales Volumes cube, a slice keeps only the members of interest in each dimension - e.g. models Mini Van and Coupe, colors Normal Blue and Metal Blue, dealerships Carr and Clyde - leaving a small sub-cube.)
6/19/2012
379
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

379

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION hierarchy:
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at the Region, District, or Dealership level.

• Moving up and moving down a hierarchy is referred to as "drill-up" / "roll-up" and "drill-down"

6/19/2012
380
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

380

OLAP Reporting - Drill Down

Inflows (Region, Year)
(Chart: Inflows in $M by region - East, West, Central - for Year 1999 and Year 2000.)

6/19/2012
381
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

381

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)
(Chart: Inflows in $M by region - East, West, Central - for Year 1999, broken down into 1st-4th quarters.)

• Drill-down from Year to Quarter
6/19/2012
382
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

382

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)
(Chart: Inflows in $M by region - East, West, Central - for the 1st quarter of 1999, broken down into January, February and March.)

• Drill-down from Quarter to Month

383

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

 MOLAP - Multidimensional OLAP
 Multidimensional databases provide both the database and the application logic layer

 ROLAP - Relational OLAP
 Accesses data stored in a relational data warehouse for OLAP analysis
 Database and application logic are provided as separate layers

 HOLAP - Hybrid OLAP
 The OLAP server routes queries first to the MDDB, then to the RDBMS, and the results are processed on-the-fly in the server

 DOLAP - Desktop OLAP
 A personal MDDB server and application on the desktop

6/19/2012
384
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

384

MOLAP - MDDB storage

(Diagram: the OLAP cube (MDDB storage) feeds the OLAP calculation engine, which serves web browsers, OLAP tools, and OLAP applications.)
6/19/2012
385
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

385

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for:
– Maximum query performance
– Optimum space utilization
6/19/2012
386
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

386

ROLAP - Standard SQL storage

(Diagram: the OLAP calculation engine performs an MDDB-to-relational mapping, issuing SQL against the relational data warehouse and serving web browsers, OLAP tools, and OLAP applications.)
6/19/2012
387
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

387

ROLAP - Features  Three-tier hardware/software architecture:
 GUI on client; multidimensional processing on midtier server; target database on database server  Processing split between mid-tier & database servers

Ad hoc query capabilities to very large databases DW integration Data scalability
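In rough outline, a multidimensional request ("volume by model and color") is answered by generating SQL against the relational store. Table and column names below are illustrative, an in-memory SQLite database stands in for the warehouse, and real engines also use ROLLUP/CUBE extensions where the RDBMS supports them.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (model TEXT, color TEXT, dealer TEXT, volume INT);
    INSERT INTO sales VALUES
        ('Mini Van', 'Blue', 'Clyde', 6), ('Mini Van', 'Red', 'Gleason', 5),
        ('Coupe', 'Blue', 'Carr', 3),     ('Sedan', 'White', 'Carr', 5);
""")

# Roughly the SQL a ROLAP engine would emit for a 2-D Model x Color view.
for row in conn.execute(
    "SELECT model, color, SUM(volume) FROM sales GROUP BY model, color"
):
    print(row)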

6/19/2012
388
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

388

HOLAP - Combination of RDBMS and MDDB
(Diagram: any client, web browser, OLAP tool, or OLAP application queries the OLAP calculation engine, which answers summary requests from the OLAP cube (MDDB) and detail requests via SQL against the relational data warehouse.)
6/19/2012
389
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

389

HOLAP - Features

RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of the RDBMS combined with MDDB performance
Calculation engine provides full analysis features
Source of the data is transparent to the end user

6/19/2012
390
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

390

Architecture Comparison

Definition:
– MOLAP: MDDB OLAP = transaction-level data + summary in an MDDB
– ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS
– HOLAP: Hybrid OLAP = ROLAP + summary in an MDDB
Data explosion due to sparsity:
– MOLAP: High (may go beyond control; estimation is very important)
– ROLAP: No sparsity
– HOLAP: Sparsity exists only in the MDDB part
Data explosion due to summarization:
– MOLAP: With good design, 3-10 times
– ROLAP: To the necessary extent
– HOLAP: To the necessary extent
Query execution speed:
– MOLAP: Fast (depends upon the size of the MDDB)
– ROLAP: Slow
– HOLAP: Optimum - if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP
Cost:
– MOLAP: Medium - MDDB server + large disk space cost
– ROLAP: Low - only RDBMS + disk space cost
– HOLAP: High - RDBMS + disk space + MDDB server cost
Where to apply:
– MOLAP: Small transactional data + complex model + frequent summary analysis
– ROLAP: Very large transactional data that needs to be viewed/sorted
– HOLAP: Large transactional data + frequent summary analysis

6/19/2012
391
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

391

Representative OLAP Tools:

Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS

Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence

6/19/2012
392
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

392

Sample OLAP Applications

 Sales Analysis  Financial Analysis  Profitability Analysis  Performance Analysis  Risk Management  Profiling & Segmentation  Scorecard Application  NPA Management  Strategic Planning  Customer Relationship Management (CRM)
6/19/2012
393
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

393

Data Warehouse Testing

394

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview
 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions

 The methodology required for testing a Data Warehouse is different from testing a typical transaction system

395

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System
Data warehouse testing is different on the following counts:
– User-triggered vs. system-triggered
– Volume of test data
– Possible scenarios / test cases
– Programming for testing challenge

396

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System….
 User-Triggered vs. System triggered

In a data warehouse, most of the testing is system-triggered. Most source/production system testing covers the processing of individual transactions, which are driven by some input from the users (e.g., an application form or a servicing request). There are very few test cycles that cover system-triggered scenarios (like billing or valuation).

397

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System…
 Volume of Test Data: the test data in a transaction system is a very small sample of the overall production data. A data warehouse typically needs a large volume of test data, as one tries to cover the maximum possible combinations of dimensions and facts.
 Possible scenarios / Test Cases: in the case of a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

398

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System…
• Programming for testing challenge: in transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts that compare pre-transformation data with post-transformation data.

399

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential



Data Warehouse Architecture
This is a basic design: source files are loaded into a warehouse, and users query the data for different purposes.

A variation adds a staging area, into which the data is loaded after cleansing and transformation, and where it is tested. From there it is loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.

410

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

411

© 2009 Wipro Ltd - Confidential

Data Modeling
The E-R data model is commonly used in OLTP systems, while the dimensional data model is commonly used in OLAP.
E-R (Entity-Relationship) Data Model
– Entity: an object that can be observed and classified based on its properties and characteristics, e.g. employee, book, student.
– Relationship: relates entities to other entities.

 Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

 Types of dimensional data models - most commonly used:
o Star Schema
o Snowflake Schema
412
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model
To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
 Dimension: a category of information, e.g. the time dimension.
 Attribute: a unique level within a dimension, e.g. Month is an attribute in the Time dimension.
 Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension, e.g. one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
 Fact Table: a table that contains the measures of interest.
 Lookup Table: provides detailed information about the attributes, e.g. the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
 Surrogate Keys: used to protect data integrity and to key the warehouse tables independently of the source systems; they are helpful for slowly changing dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
413
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

A sketch of a typical star join over these tables follows.
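Re-creating the sample tables in pandas (for illustration only; in practice this is a SQL join) shows the typical star join: the fact table is joined to its dimension tables and the measure is aggregated by dimension attributes - here, total sale amount per store city and product name. The customer dimension joins in exactly the same way.

import pandas as pd

product = pd.DataFrame({"prodId": ["p1", "p2"], "name": ["bolt", "nut"], "price": [10, 5]})
store = pd.DataFrame({"storeId": ["c1", "c2", "c3"], "city": ["nyc", "sfo", "la"]})
sale = pd.DataFrame({                      # fact table from the slide
    "orderId": ["o100", "o102", "105"],
    "date":    ["1/7/97", "2/7/97", "3/8/97"],
    "custId":  [53, 53, 111],
    "prodId":  ["p1", "p2", "p1"],
    "storeId": ["c1", "c1", "c3"],
    "qty":     [1, 2, 5],
    "amt":     [12, 11, 50],
})

# Star join: fact -> product dimension, fact -> store dimension.
joined = sale.merge(product, on="prodId").merge(store, on="storeId")
print(joined.groupby(["city", "name"])["amt"].sum())   # amount by city and product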

414

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Snowflake Schema
Snowflake example: the store dimension is normalized into sType, city, and region lookup tables.

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where the speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not highly normalized and are frequently designed at a level of normalization short of third normal form.

415

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

416

© 2009 Wipro Ltd - Confidential

The Need For Data Quality
 Difficulty in decision making
 Time delays in operations
 Organizational mistrust
 Data ownership conflicts
 Customer attrition
 Costs associated with:
– error detection
– error rework
– customer service
– fixing customer problems

417

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Six Steps To Data Quality
1. Understand information flow in the organization
 Identify authoritative data sources
 Interview employees & customers
2. Identify potential problem areas & assess impact
 Data entry points
 Cost of bad data
3. Measure quality of data
 Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
4. Clean & load data
 Use data cleansing tools to clean data at the source
 Load only clean data into the data warehouse
5. Continuous monitoring
 Schedule periodic cleansing of source data
6. Identify areas of improvement
 Identify & correct causes of defects
 Refine data capture mechanisms at the source
 Educate users on the importance of DQ
A small cleansing sketch in this spirit follows.
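The following sketch measures and then cleans a toy extract with pandas; the column names, the valid age range and the data itself are assumptions made for illustration.

import pandas as pd

raw = pd.DataFrame({                       # illustrative dirty extract
    "cust_id": [1, 2, 2, 4],
    "age":     [34, None, 27, 212],
    "email":   ["a@x.com", "b@x.com", "b@x.com", None],
})

# Measure: count missing, duplicate and out-of-range values before cleansing.
issues = {
    "missing_values":   int(raw.isna().sum().sum()),
    "duplicate_ids":    int(raw["cust_id"].duplicated().sum()),
    "age_out_of_range": int((raw["age"].notna() & ~raw["age"].between(0, 120)).sum()),
}
print(issues)

# Clean: drop duplicates and rows that fail the basic rules; load only clean data.
clean = (raw.drop_duplicates(subset="cust_id")
            .dropna(subset=["email"])
            .query("0 <= age <= 120"))
print(clean)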
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

418

Data Quality Solution
Customized Programs  Strengths: – Addresses specific needs – No bulky one time investment  Limitations – Tons of Custom programs in different environments are difficult to manage – Minor alterations demand coding efforts Data Quality Assessment tools  Strength – Provide automated assessment  Limitation – No measure of data accuracy

419

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Quality Solution
Business Rule Discovery tools  Strengths – Detect Correlation in data values – Can detect Patterns of behavior that indicate fraud  Limitations – Not all variables can be discovered – Some discovered rules might not be pertinent – There may be performance problems with large files or with many fields. Data Reengineering & Cleansing tools  Strengths – Usually are integrated packages with cleansing features as Addon  Limitations – Error prevention at source is usually absent – The ETL tools have limited cleansing facilities
420
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Tools In The Market
 Business Rule Discovery Tools
– Integrity Data Reengineering Tool from Vality Technology
– Trillium Software System from Harte-Hanks Data Technologies
– Migration Architect from DB Star
 Data Reengineering & Cleansing Tools
– Carlton Pureview from Oracle
– ETI-Extract from Evolutionary Technologies
– PowerMart from Informatica Corp
– Sagent Data Mart from Sagent Technology
 Data Quality Assessment Tools
– Migration Architect, Evoke Axio from Evoke Software
– Wizrule from Wizsoft
 Name & Address Cleansing Tools
– Centrus Suite from Sagent
– I.d.centric from First Logic

421

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

422

© 2009 Wipro Ltd - Confidential

ETL Architecture

(Diagram: visitors' web browsers on the Internet generate web server logs and e-commerce transaction data (flat files); together with other OLTP systems and external data (demographics, household, webographics, income), these feed a scheduled extraction into a staging area (RDBMS) where data is cleaned, transformed, matched and merged, followed by scheduled loading into the enterprise data warehouse. A metadata repository spans the whole process. Stages: data collection → data extraction → data transformation → data loading → data storage & integration.)

423

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
– Rummages through a file or database
– Uses some criteria for selection
– Identifies qualified data and transports it over onto another file or database

Data Transformation:
– Integrating dissimilar data types
– Changing codes
– Adding a time attribute
– Summarizing data
– Calculating derived values
– Renormalizing data

Data Extraction - Cleanup:
– Restructuring of records or fields
– Removal of operational-only data
– Supply of missing field values
– Data integrity checks
– Data consistency and range checks, etc.

Data Loading:
– Initial and incremental loading
– Updating of metadata

A bare-bones sketch of these stages is given below.
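The sketch below is only illustrative: the file name, the column names and the SQLite target are assumptions, and a real implementation would sit inside the project's ETL tool or scheduler rather than hand-written Python.

import csv
import sqlite3
from datetime import datetime, timezone

def extract(path):
    # Rummage through a flat file and yield its rows.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for r in rows:
        if not r.get("customer_id"):               # reject rows failing basic checks
            continue
        yield {
            "customer_id": int(r["customer_id"]),
            "amount": round(float(r["amount"]), 2),                # standardize values
            "load_ts": datetime.now(timezone.utc).isoformat(),     # add a time attribute
        }

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (customer_id INT, amount REAL, load_ts TEXT)")
    conn.executemany("INSERT INTO fact_sales VALUES (:customer_id, :amount, :load_ts)", list(rows))
    conn.commit()

conn = sqlite3.connect(":memory:")
# load(transform(extract("daily_sales.csv")), conn)   # wiring; the input file is assumed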

424

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
 The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
 To solve the problem, companies use extract, transform and load (ETL) software.
 The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

425

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential


Major components involved in ETL Processing
 Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs.
 Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes.
 Extract: the process of reading data from a database.
 Transform: the process of converting the extracted data.
 Load: the process of writing the data into the target database.
 Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
 Administration and operation: ETL utilities let administrators schedule, run, and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential


427

ETL Tools
 Provide a facility to specify a large number of transformation rules with a GUI
 Generate programs to transform data
 Handle multiple data sources
 Handle data redundancy
 Generate metadata as output
 Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environments

ETL Tools - Second Generation:
 PowerCenter/PowerMart from Informatica
 Data Mart Solution from Sagent Technology
 DataStage from Ascential
428

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Metadata Management

429

© 2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is information...
 that describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
 about the data being captured and loaded into the warehouse
 documented in IT tools that improves both business and technical understanding of data and data-related processes

430

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information?

How much money was lost or earned as a result? Interpreting information
How many times have businesses needed to rework or recall products?  What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?

431

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management
 Provide a simple catalogue of business metadata descriptions and views
 Document and manage metadata descriptions from an integrated development environment
 Enable DW users to identify and invoke pre-built queries against the data stores
 Design and enhance new data models and schemas for the data warehouse
 Capture the data transformation rules between the operational and data warehousing databases
 Provide change impact analysis and updates across these technologies
One hedged way to record such source-to-target transformation rules is sketched below.
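The record layout below is illustrative, not a standard; the field names and the CRM/DW identifiers are assumptions made for the example.

from dataclasses import dataclass, field

@dataclass
class MappingMetadata:
    source: str            # e.g. "CRM.CUSTOMER.BIRTH_DT" (hypothetical)
    target: str            # e.g. "DW.DIM_CUSTOMER.BIRTH_DATE" (hypothetical)
    rule: str              # transformation / business rule applied by the ETL
    owner: str = "unknown"
    tags: list = field(default_factory=list)

catalogue = [
    MappingMetadata("CRM.CUSTOMER.BIRTH_DT", "DW.DIM_CUSTOMER.BIRTH_DATE",
                    "convert DD-MON-YY to ISO date", owner="ETL team"),
]
# Simple catalogue query: which warehouse columns derive from CRM.CUSTOMER?
print([m.target for m in catalogue if m.source.startswith("CRM.CUSTOMER")])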
432
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical users
• Warehouse administrator
• Application developer
 Business users - business metadata
• Meanings
• Definitions
• Business rules
 Software tools
• Used in DW life-cycle development
• Metadata requirements for each tool must be identified
• The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
• Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

433

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Third Party Bridging Tools  Oracle Exchange
– Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
– Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay
– ‘Hub and Spoke’ solution for enabling metadata interoperability – Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins
– Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
434
© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools
Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products — e.g., One for AD and one for DW, with bridges between them

435

© 2009 Wipro Ltd - Confidential © 2009 Wipro Ltd - Confidential
