
Data Warehousing.

Data Warehouse basic concepts
Data Warehouse Approach
Data Modeling concepts
OLAP (Online Analytical Processing)
Data Warehouse Implementation
Next steps in Data Warehousing
March 2005

Data Warehouse- Concepts

Module 1 Data Warehouse basic concepts

What is DSS?

Decision Support System. Mainly used by a business to take strategic decisions based on trends (comparing the current fiscal year to previous ones) and to project numbers based on history and some parameters. It is not used to run the business; OLTP systems take care of the day-to-day activities. For example, SAP Order Management handles the orders an organization receives. In the DSS we collect all of this data to do the analysis.

Advantages of DSS

A grocery store chain in the US feeds various information from its DSS directly to store managers. For example, the system can predict a stock outage for a particular item in a store: based on history, the system knows there should be a sale of that item every 3 hours, and if it has not seen a transaction in the last 3 hours it sends an SMS to the current shift manager's mobile. That is the level you can reach with a DSS, though it takes time to get there. (Such a DW is now called an Active DW.) One retail major does customer profiling, store sales analysis, etc. on its data warehouse, implemented on Teradata.
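The outage alert described above can be sketched as a simple check over last-sale timestamps; the items and times below are made up for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical feed from the Active DW: item -> timestamp of last recorded sale.
last_sale = {
    "milk":  datetime(2005, 3, 1, 9, 30),
    "bread": datetime(2005, 3, 1, 13, 45),
}

def outage_alerts(now, expected_interval=timedelta(hours=3)):
    """Return items with no sale within the expected interval --
    in an Active DW these would trigger the SMS to the shift manager."""
    return [item for item, ts in last_sale.items()
            if now - ts > expected_interval]

alerts = outage_alerts(datetime(2005, 3, 1, 14, 0))  # milk last sold 4.5 h ago
```

A real Active DW would evaluate this continuously against the transaction feed rather than a static dictionary.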

OLTP

Online Transaction Processing system. Examples of OLTP systems are order management, payroll, etc. The database is usually designed in 3rd normal form. All Data Manipulation Language (DML) statement types are active: Insert, Update, Delete, Select. Deals with specific data (customer x, product z, etc.).

OLTP vs DSS

OLTP:
More DML operations (inserts, updates, deletes)
Point queries; very specific when issuing queries
Less history (approximately 6 months to 1 year)
Used for day-to-day activities (a must to run the business)

DSS:
No change in the data (no updates and deletes)
Queries based on a time period, a set of products, a set of customers, etc.
Maintains history
Used mainly for analytics (trend analysis, customer behavior, etc.)
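The contrast above can be sketched with two queries against the same table; the schema and data are illustrative, using SQLite for brevity:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, product TEXT, qty INT, yr INT)")
con.executemany("INSERT INTO orders VALUES (?,?,?,?)", [
    ("x", "z", 5, 2004), ("x", "z", 3, 2005), ("y", "z", 7, 2005),
])

# OLTP-style point query: one specific customer and product.
point = con.execute(
    "SELECT qty FROM orders WHERE customer='x' AND product='z' AND yr=2005"
).fetchone()[0]

# DSS-style analytical query: totals by year for trend analysis.
trend = con.execute(
    "SELECT yr, SUM(qty) FROM orders GROUP BY yr ORDER BY yr"
).fetchall()
```

The point query touches one row; the analytical query scans the whole history, which is why the two workloads are kept on separate systems.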

General DSS Architecture

(Diagram) Source data from OLTP 1, OLTP 2, the marketplace, and web clicks flows through ETL (a tool or T-SQL) into a staging DB and ODS, then into the data warehouse database, which feeds OLAP cube databases, predefined reports, ad hoc reporting, and data mining. Finally, close the loop: write the findings from the DSS back to the OLTP systems.

Example for a DSS

(Diagram) Data from OLTP 1 through OLTP 4 is consolidated into the data warehouse, which supports reporting, OLAP, and analytics.


Module 2 Data Warehouse Approach

Distributed Approach

Various departments can start creating different data marts. Each can work independently and see ROI in a short span. In the long run, integrating these data marts adds complexity, and cost will be higher as there are more systems to maintain.

Distributed Approach

Gives only part of the answer. Requires time and effort to put the pieces together. No guarantee it's the right answer.


Centralized Approach

A centralized data warehouse contains the data in one place, making it easy to answer any business question. In the long run it has a cost advantage over a non-centralized data warehouse. It is not as easy to implement, as it needs more time and resources, and ROI won't be seen until the implementation is complete. So the recommended approach for implementing a centralized data warehouse is to start with one subject area and keep adding one subject area at a time; this way the organization gets to see ROI at various stages.

Centralized Approach to DSS

Delivers one version of the truth for increased confidence and speed in decision-making



Module 3
Data Modeling concepts

Data Modeling
OLTP: 3rd normal form
DSS: dimensional modeling (star schema, snowflake schema)
Why dimensional modeling?

Data Modeling for OLTP

Usually 3rd normal form. Advantages: flexibility to accommodate changes; no redundancy of data in the model. Disadvantages: complex queries to generate reports, as the number of tables to join is usually high.

Dimensional Modeling for DSS


Star schema, snowflake schema. Based on the RDBMS, we have to choose which type of model suits better. Example: Teradata is a parallel-processing database engine that can return results in reasonable time, so with it we can design the enterprise data model in 3rd normal form. We can't take the same approach with SQL Server or Oracle; there we should think of denormalizing the data model. A star schema makes queries run faster because the number of tables to join is smaller. In a star schema all the hierarchies defined for a dimension are stored in a single table, so data redundancy is high. In a snowflake schema each level of the hierarchy gets its own table. That is the difference between the star schema and the snowflake schema.
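A minimal star schema along these lines, sketched in SQLite with illustrative table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Star schema sketch: one denormalized dimension table per subject,
# one central fact table holding the measures.
con.executescript("""
CREATE TABLE store_dim   (store_key INT PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE product_dim (prod_key  INT PRIMARY KEY, name TEXT, brand TEXT);
CREATE TABLE sales_fact  (store_key INT, prod_key INT, dollars REAL);
""")
con.execute("INSERT INTO store_dim VALUES (1, 'LA', 'West')")
con.execute("INSERT INTO product_dim VALUES (10, 'Bread', 'Acme')")
con.execute("INSERT INTO sales_fact VALUES (1, 10, 56.0)")

# One join per dimension answers the question -- far fewer joins than 3NF.
row = con.execute("""
  SELECT s.region, p.brand, SUM(f.dollars)
  FROM sales_fact f
  JOIN store_dim s   ON f.store_key = s.store_key
  JOIN product_dim p ON f.prod_key  = p.prod_key
  GROUP BY s.region, p.brand
""").fetchone()
```

In a 3NF model the same question would require additional joins through city/region and product/brand tables.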

Star Schema

A star schema is optimized for queries. You will have some level of redundant data in a star-schema-based data model.

Star Schema (in RDBMS) / Star Schema Example / Star Schema with Sample Data: (diagrams of a central fact table surrounded by dimension tables)

Snowflake

A snowflake schema won't have much redundant data, as most dimensions will have lookup tables. This way the number of joins between tables increases. Both schemas have advantages and disadvantages, so analyze the end users' requirements and space constraints to pick the best one.

The Snowflake Schema


(Diagram) The Store Dimension (STORE KEY, Store Description, City, District ID, State) snowflakes into a District table (District_ID, District Desc., Region_ID), which in turn points to a Region table (Region_ID, Region Desc., Regional Mgr.). The Store Fact Table (STORE KEY, PRODUCT KEY, PERIOD KEY) carries the measures Dollars, Units, and Price.

The Snowflake Schema

No LEVEL column in dimension tables. Dimension tables are normalized by decomposing at the attribute level. Each dimension table has one key for each level of the dimension's hierarchy. The lowest-level key joins the dimension table to both the fact table and the lower-level attribute table.

The Snowflake Schema

Additional features: the original Store Dimension table, completely denormalized, is kept intact, since certain queries can benefit from its all-encompassing content. In practice, start with a star schema and create the snowflakes with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table.
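Creating a snowflake table from the star dimension with a query, rather than a separate extract, might look like this (SQLite, illustrative names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Denormalized star-style store dimension (illustrative columns).
con.execute("CREATE TABLE store_dim "
            "(store_key INT, store TEXT, district_id INT, district TEXT)")
con.executemany("INSERT INTO store_dim VALUES (?,?,?,?)", [
    (1, "Store A", 100, "North"),
    (2, "Store B", 100, "North"),
    (3, "Store C", 200, "South"),
])

# Snowflake the hierarchy level out of the dimension with a query:
# no separate extract needed, and the values already agree with store_dim.
con.execute("""
  CREATE TABLE district AS
  SELECT DISTINCT district_id, district FROM store_dim
""")
rows = con.execute("SELECT * FROM district ORDER BY district_id").fetchall()
```

The original denormalized dimension stays intact for queries that prefer it.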

Disadvantage: complicated maintenance and metadata; an explosion in the number of tables in the database.

ETL (E Extract)

Extract: getting data out of the source systems. This may be just a DTS package that pulls the data, or an export of a table to a flat file on the source system. In Teradata we have the FastExport utility to export data to a flat file. In Oracle we can spool query output to a flat file (for example from SQL*Plus). In SQL Server we can use a DTS package to do the same job.
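A minimal sketch of the extract step, dumping a table to a delimited flat file (SQLite and the csv module stand in for the vendor utilities; table and file names are made up):

```python
import csv
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INT, name TEXT)")
con.executemany("INSERT INTO customer VALUES (?,?)", [(1, "Acme"), (2, "Beta")])

# Dump the table to a delimited flat file -- the lowest common denominator
# that FastExport, SQL*Plus spooling, or a DTS package would also produce.
with open("customer.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])                          # header row
    writer.writerows(con.execute("SELECT id, name FROM customer"))
```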

ETL (T Transform)

Transform: it is not necessary to have the same data model in source and destination. When the source data model differs from the destination's, we have to modify the source data to fit the destination's data model. This process is called transformation. Example: when we receive reseller information from various distributors, we don't get the geo information. So the transformation logic has code that assigns the respective geo based on the country from which the data comes. This is a simple example of a transformation.
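The geo assignment described above can be sketched as a lookup applied during transformation; the country-to-geo mapping here is a made-up example:

```python
# Hypothetical lookup: the distributor feed carries only a country code,
# so the transform step derives the geo the warehouse model expects.
COUNTRY_TO_GEO = {"US": "Americas", "DE": "EMEA", "JP": "APAC"}

def transform(reseller_rows):
    """Add a 'geo' attribute to each reseller record based on its country."""
    return [dict(row, geo=COUNTRY_TO_GEO.get(row["country"], "Unknown"))
            for row in reseller_rows]

out = transform([{"name": "R1", "country": "DE"},
                 {"name": "R2", "country": "BR"}])
```

Unmapped countries fall through to "Unknown" so they can be caught by data-quality checks rather than silently dropped.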

ETL (L Load)

Load: loading the transformed data into the destination data model (the data warehouse). Just as each RDBMS has export functionality, each has a utility to import data into the database: Teradata - FastLoad; Oracle - SQL*Loader; Sybase - bcp.
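A minimal load step, bulk-inserting a flat file into a table (SQLite's executemany stands in for FastLoad, SQL*Loader, or bcp; names are illustrative):

```python
import csv
import sqlite3

# Write a small flat file to stand in for the extract step's output.
with open("sale_load.csv", "w", newline="") as f:
    csv.writer(f).writerows([["p1", 12], ["p2", 11]])

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sale (prod_id TEXT, amt INT)")

# Bulk-insert the file into the target table.
with open("sale_load.csv", newline="") as f:
    con.executemany("INSERT INTO sale VALUES (?, ?)",
                    [(p, int(a)) for p, a in csv.reader(f)])

total = con.execute("SELECT SUM(amt) FROM sale").fetchone()[0]
```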

Data Refresh in DSS


We have to refresh the data in the DSS from the various source systems in a timely manner. While doing so, we either do a full refresh of a particular table or capture only the changed data (a process called a delta load). Usually we use delta refresh for fact tables and full refresh for dimension tables. As the environment gets bigger and bigger, almost all tables become delta loads.
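One common way to implement a delta load is a watermark on a load timestamp; a sketch with made-up data:

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sale (id INT, amt INT, load_ts INT)")
src.executemany("INSERT INTO sale VALUES (?,?,?)",
                [(1, 10, 100), (2, 20, 150), (3, 30, 205)])

# Delta refresh: pull only rows newer than the warehouse's last watermark,
# then advance the watermark for the next run (assumes at least one new row).
last_watermark = 150
delta = src.execute(
    "SELECT id, amt, load_ts FROM sale WHERE load_ts > ?", (last_watermark,)
).fetchall()
new_watermark = max(ts for _, _, ts in delta)
```

A full refresh, by contrast, would simply truncate and reload the whole table.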


Module 4 OLAP (Online Analytical Processing)

What is OLAP?

Online Analytical Processing: viewing data in a multidimensional way.

Why OLAP?

Slice and dice for the data warehouse. An RDBMS is a two-dimensional way of storing/viewing data; OLAP is a multidimensional way of storing/viewing data.

Types of OLAP

Three types of OLAP in the industry:
1. MOLAP - Multidimensional OLAP (e.g., MS OLAP Services, Essbase, Cognos)
2. ROLAP - Relational OLAP (e.g., Business Objects, MicroStrategy)
3. HOLAP - Hybrid OLAP

Aggregates

Add up amounts for day 1. In SQL: SELECT SUM(amt) FROM sale WHERE date = 1

sale:
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

Answer: 81

Aggregates

Add up amounts by day. In SQL: SELECT date, SUM(amt) FROM sale GROUP BY date

(Same sale table as above.)

ans:
date  sum
1     81
2     48

Another Example

Add up amounts by day and product. In SQL: SELECT prodId, date, SUM(amt) FROM sale GROUP BY date, prodId

(Same sale table as above.)

ans:
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

Moving from a finer grouping to a coarser one is a roll-up; the reverse is a drill-down.
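The aggregations above can be verified directly, for example in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INT, amt INT)")
con.executemany("INSERT INTO sale VALUES (?,?,?,?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Total for day 1.
day1 = con.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]

# Totals by day.
by_day = con.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Totals by day and product (the finer-grained roll-up).
rollup = con.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
```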

Example

(Diagram: a sales cube with axes Product, Store, and Time.) Dimensions: Time, Product, Store. Attributes: Product (upc, price, ...). Hierarchies: Product → Brand; Day → Week → Quarter; Store → Region → Country. Rolling up to brand, region, or week aggregates the cube along the corresponding hierarchy. Example cell: 56 units of bread sold in LA on Monday.

Summary of Operations

Aggregation (roll-up): aggregate (summarize) data to the next higher dimension element, e.g., total sales by city, year → total sales by region, year
Navigation to detailed data (drill-down)
Selection (slice): defines a subcube, e.g., sales where city = 'Gainesville' and date = 1/15/90, or the top 3% of cities by average income
Calculation and ranking
Visualization operations (e.g., pivot)
Time functions, e.g., time average


Module 5 Data Warehouse Implementation Steps

Typical Approach
Data warehouse implementation is a cyclic process involving the following steps:
Requirement Gathering
Requirement Analysis
Requirement Validation
Logical Modeling
Physical Design
Implementation
Validation
The cycle repeats for any upgrades or enhancements.

Requirement Gathering

Identify the business objectives
Identify the reporting requirements
Identify the frequency of report generation
Granularity of information
Business rules

Requirement Analysis

Study the requirements captured
Identify the subject areas
Identify the measures and criteria fields
Identify the granularity of information required

Requirement Validation

Validate the analysis with the customer
Document sign-off

Logical Modeling

Identify facts and dimensions
Create the logical model

Physical Design

Analyze source systems with respect to the logical model
Data quality analysis
Physical design:
Data types, indexes, partitioning, database creation, etc.
Source-to-target mapping
Capture transformation rules
Capture derivation rules for derived fields
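Capturing the source-to-target mapping and derivation rules as data keeps them testable; a sketch with hypothetical columns and rules (the 10% discount is an assumed example):

```python
# Hypothetical source-to-target map: target column -> (source field, rule).
MAPPING = {
    "customer": ("cust_name", lambda v: v.strip().upper()),       # standardize case
    "net_amt":  ("gross_amt", lambda v: round(v * 0.9, 2)),       # assumed 10% discount rule
}

def map_row(src_row):
    """Apply the captured transformation/derivation rules to one source row."""
    return {tgt: rule(src_row[src]) for tgt, (src, rule) in MAPPING.items()}

row = map_row({"cust_name": " acme ", "gross_amt": 100.0})
```

In practice the mapping document is usually a spreadsheet or metadata table; encoding it as data like this lets the ETL jobs and their unit tests share one definition.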

Implementation

Database creation
Staging design (design extraction jobs)
Develop ETL jobs
Unit testing of ETL jobs
Schedule jobs
Test load
Data validation
Performance monitoring
ETL job tuning
Test database performance tuning
Final loading of data from source to target


Module 6 Next steps in Data Warehousing

Data Mining

Literally, the purpose is to exploit raw data clusters as if they were gold mines, that is, to look for treasures buried inside. In practice, companies are trying to automate the discovery of the trends hidden within large amounts of data, allowing the forecast of future behaviors. Mining tools provide sophisticated algorithms to find specific trends in the available data. Example: MS Analysis Server provides algorithms such as Decision Trees and Clustering.

Difference between OLAP and Data Mining

OLAP: Who were my 10 best customers last year?
Data Mining: Which 10 customers offer me the greatest profit potential?

OLAP: What was the response rate to our mailing?
Data Mining: What is the profile of people who are likely to respond to future mailings?

Business Activity Monitoring (BAM)

BAM is technology used to actively monitor the DW or OLTP systems for certain values. The system can run a set of processes when it finds an exception and send the information to the relevant owners so they can take action. Based on the findings, it immediately updates the relevant OLTP system (conceptually this is called closing the loop between DSS and OLTP). Example: INFORAY is a BAM tool which you can use on the DW.

Q&A

Thank You