DAY I

WHAT IS DATA WAREHOUSE?
VIEW – I: DATA IN JAIL
VIEW – II: OPERATIONAL Vs. ANALYTICAL
VIEW – III: DATA INTEGRATION
VIEW – IV: TEXT BOOK

DATA WAREHOUSE
Definition

A DATA WAREHOUSE IS A SUBJECT-ORIENTED, INTEGRATED, TIME-VARIANT, NON-VOLATILE COLLECTION OF DATA IN SUPPORT OF MANAGEMENT’S DECISION-MAKING PROCESS.

- W. H. Inmon, 1993

DAY II

DWh Architecture – 3 Tier
Multidimensional Database
Data Cube
Lattice of Cuboids
Dimension Modeling

3-Tier architecture
[Figure: 3-tier DWh architecture. Components shown: Background Process (Cleaning, Integration, etc.), DW Server with META DATA, OLAP Engine, Analyses and Mining.]

Lattice of Cuboids
• Multidimensional data can be viewed as a lattice of cuboids.
• The cuboid C[A1, A2, …, An] at the finest level of granularity is called the base cuboid.
• The coarsest level consists of one cell with numeric measures aggregated over all n dimensions. This is called the apex cuboid.
• In the lattice of cuboids, all other cuboids lie between the base cuboid and the apex cuboid.
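The lattice can be sketched by listing which dimensions each cuboid keeps. This is only an illustration: the three dimension names are assumed, and dimension hierarchies are ignored, so n dimensions give 2^n cuboids.

```python
from itertools import combinations

# Hypothetical dimensions, chosen only for illustration.
dimensions = ("time", "item", "location")

# A cuboid is identified by the subset of dimensions it groups by.
# Level k of the lattice holds the cuboids that keep k dimensions.
lattice = {
    k: [frozenset(c) for c in combinations(dimensions, k)]
    for k in range(len(dimensions) + 1)
}

apex_cuboid = lattice[0][0]                 # keeps no dimension: a single cell
base_cuboid = lattice[len(dimensions)][0]   # keeps all n dimensions
cuboid_count = sum(len(level) for level in lattice.values())

for k, level in lattice.items():
    print(k, [sorted(c) for c in level])
```

Every cuboid at levels 1 to n-1 lies between the apex and the base cuboid.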

Essential Features
A multidimensional data model has the following basic conceptual components:
• Summary measure
• Summary function
• Dimension
• Dimension hierarchy

Types of Dimension
There are different types of dimensions:
• Structural Dimension
• Information Dimension
• Partitioning Dimension
• Category Dimension

Steps of Data Modeling
• Business Model:
 – Develop a normalized entity-relationship model of the business model of the DWh.
• Logical Model:
 – Translate this into a dimensional model.
• Physical Model:
 – Translate this into a physical model.

Business Model
• To develop a normalized ER model, simply focus on the structure.
• Do not try to put any thought into how the information will be retrieved nor what it will be used for.
• Example: Sales warehouse
 – Who bought the product
 – Who sold the product
 – What was sold
 – When was it sold
 – How was it sold
Note that in the operational system, usefulness of the data is determined by assessing its necessity within the process.

Sample Relational Representation
Table Goods: attributes Product, Category, Industry; artificial key: PID
Table Location: attributes City, State, Country; artificial key: LocID
Table Time: attributes Date, Week, Month, Quarter, Year; artificial key: TimeID
Important: often, such tables are unnormalized. WHY? Transitive functional dependencies:
PID --> Product --> Category --> Industry, so it is not in 3NF!
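The transitive dependency can be checked on sample rows. This is a minimal sketch: the rows of Goods are invented, and fd_holds is a hypothetical helper, not part of any library.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if attribute lhs functionally determines rhs in rows."""
    seen = {}
    for row in rows:
        # The first value seen for an lhs value must match all later ones.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

# Hypothetical rows of the Goods table.
goods = [
    {"PID": 1, "Product": "Milk", "Category": "Dairy", "Industry": "Food"},
    {"PID": 2, "Product": "Butter", "Category": "Dairy", "Industry": "Food"},
    {"PID": 3, "Product": "Soap", "Category": "Toiletries", "Industry": "Consumer goods"},
]

chain = [("PID", "Product"), ("Product", "Category"), ("Category", "Industry")]
chain_holds = all(fd_holds(goods, a, b) for a, b in chain)

# Category repeats across rows, so it is not a key, yet it determines Industry.
category_is_key = len({row["Category"] for row in goods}) == len(goods)
print(chain_holds, category_is_key)
```

A non-key attribute (Category) determining another non-key attribute (Industry) is what breaks 3NF here.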

Automobile Sales (1)
• The sales subject area for an automobile manufacturer.
• What kind of analytical needs would an automobile manufacturer have beyond simple reports like the latest uptick or downturn in sales?
• How is each model selling?
• Which flavors of that model sell best?
Note
– Time and product dimensions are standard components of any sales data warehouse.
– Identifying industry-specific product attributes creates an effective sales data warehouse.

Automobile Sales (2)
• The dealer takes delivery of the vehicle and is invoiced by the manufacturer. This step establishes dealer invoice, MSRP base price, and option price.
• Dealer adds on other revenue-generating items.
• The sale is completed. The actual sale price and the down payment are determined at this stage. The customer either pays cash, finances the vehicle, or leases the vehicle.
• The dealer receives incentive payments from the manufacturer at various times, for meeting sales goals and moving stagnant models.

Automobile Sales (3)
• Time dimension: date, day, day of week, month, year, and quarter.
 – Rich time dimension gives flexibility in reporting
• Product dimension: product id, the model, make and year, color.
 – Product dimension is the discriminator to differentiate the product in the marketplace.
• Discriminator: descriptive characteristics of a product that further describe it and are relevant to purchasing decisions.
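A rich time dimension can be derived from the calendar date alone. The sketch below assumes calendar quarters (a fiscal calendar would differ); the function and column names are illustrative.

```python
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def time_dimension_row(d):
    """Derive the time-dimension attributes for one calendar date."""
    return {
        "date": d.isoformat(),
        "day": d.day,
        "day_of_week": DAY_NAMES[d.weekday()],  # weekday(): Monday is 0
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,      # calendar quarter, 1 to 4
        "year": d.year,
    }

row = time_dimension_row(date(2024, 9, 1))
print(row)
```

Storing these attributes once per date lets reports group by any of them without date arithmetic at query time.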

Other Dimensions
• Categories: minivan, cars, MUV
• Demographic dimension
 – Demographic data of customers is also of importance.
 – Age, gender, income, geography
• Dealer dimension
• “Method of payment” dimension, if you are interested in more than the number of vehicles sold.
 – Note that financing is a popular means of generating sales.
 – Loan, lease, instalments, cash etc.
Be careful before finalising the dimensions.

Measures for the Sales fact table
• Fact: Actual sales price
 – Meaning: The amount of money paid by the customer to the dealer for the vehicle. This includes any amount received by finance agency.
 – Source: Dealers sales reporting system
• Fact: MSRP
 – Meaning: The base price for the model, style, and the line
 – Source: Manufacturer’s internal price listing
• Fact: Options price
 – Meaning: The price of manufacturer’s options added to the base price to determine the full manufacturer’s price
 – Source: Manufacturer’s option price listings
• Fact: Dealers add-on
 – Meaning: The price of stereo, floor mats, other add-ons
 – Source: Dealers sales reporting system

Measures for the Sales fact table
• Fact: MSRP full price
 – Meaning: MSRP base price + option price
 – Source: Derived at loading from other data
• Fact: Dealers credit
 – Meaning: Total credits provided to the dealer for various reasons
 – Source: Marketing dept information, accounting systems
• Fact: Dealer invoice
 – Meaning: The amount paid at the invoice by the dealer before applying for credits
 – Source: Dealers invoicing system
• Fact: Down payment
 – Meaning: The cash provided by the customer to the dealer at the time of purchase
 – Source: Dealers sales reporting system
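MSRP full price is derived at loading from two other measures. A minimal sketch of that load step follows; the field names and figures are invented for illustration.

```python
def load_sales_row(source_row):
    """Copy one source record and add the derived MSRP full price."""
    row = dict(source_row)
    # MSRP full price = MSRP base price + option price, computed once at load.
    row["msrp_full_price"] = row["msrp_base_price"] + row["options_price"]
    return row

# Hypothetical source record.
loaded = load_sales_row(
    {"msrp_base_price": 20000, "options_price": 1500, "dealer_add_ons": 400}
)
print(loaded["msrp_full_price"])
```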

Tips
• There should never be any doubt about the precise meaning of any measure in the fact table.
• Each should be explicitly defined
 – what it means
 – how it is calculated
 – how it relates to other measures
• If a measure is not what business persons expect, do not include it.
• We can answer: Since first September, which dealers have the highest percentages of lease transactions?
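The lease question can be sketched as a query. For brevity this uses one flattened table instead of separate dealer, time and payment dimensions; the table, columns and rows are all assumed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales_fact (dealer TEXT, sale_date TEXT, payment_method TEXT);
INSERT INTO sales_fact VALUES
  ('Dealer A', '2024-09-03', 'lease'),
  ('Dealer A', '2024-09-10', 'cash'),
  ('Dealer B', '2024-09-05', 'lease'),
  ('Dealer B', '2024-09-12', 'lease'),
  ('Dealer B', '2024-08-20', 'cash');
""")

# In SQLite a comparison yields 0 or 1, so SUM counts the lease rows.
lease_share = con.execute("""
    SELECT dealer,
           100.0 * SUM(payment_method = 'lease') / COUNT(*) AS lease_pct
    FROM sales_fact
    WHERE sale_date >= '2024-09-01'
    GROUP BY dealer
    ORDER BY lease_pct DESC
""").fetchall()
print(lease_share)
```

The answer is only meaningful if "lease transaction" and the date of sale are defined precisely, which is the point of the tips above.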

Automobile Sales (4)
• Manufacturer’s proceeds and dealer’s proceeds are other measures.
• Note that many measures are related. They can be derived from other measures.
• Whether to store the derived measures or to compute them when required?
• It is better to store these. Computing them when required:
 – Makes reporting process complex
 – Increases chances of error
 – Limits the capabilities of ad hoc access
 – Saves disk space
 Costs outweigh benefits.

WE CAN GROUP THIS INFORMATION INTO ONE STRUCTURE
FACT TABLE
FACTS: ACTUAL SALE PRICE, MSRP SALE PRICE, OPTIONS PRICE, FULL PRICE, DEALER ADD-ONS, DEALER CREDITS, DEALER INVOICE, AMOUNT OF DOWN PAYMENT
PRODUCT: MODEL NAME, MODEL YEAR, CATEGORY, EXTERIOR COLOR, INTERIOR COLOR, …
TIME: YEAR, QUARTER, MONTH, DAY OF THE WEEK, DATE, …
DEMOGRAPHY: AGE, GENDER, SALARY, PROFESSION, ADDRESS, FAMILY SIZE, …
PAYMENT METHOD: EMI, INTEREST RATE, LENGTH, …

Let us model as relations
• Facts become a table
• Each dimension becomes a table
• Let’s look at the dimension table first.
 – product id
 – product name
 – product category
 – ext. color
 – int. color

Dimension Table
• Large number of attributes
• Fewer records
• Can include textual attributes
 – Dimension table attributes are normally not for calculation.
 – Mostly for description
• Attributes not directly related
 – Package size, product brand
• Flattened out, not normalized
• Multiple hierarchies
• Dimension table key (primary key)

How is Fact Table built?
• Recall: one cell of fact corresponds to one combination of values of all dimensions.
• Let’s look at the fact table.
 – product id
 – time id
 – demography id
 – payment id
 – fact1
 – fact2…
Key of the fact table is the combination of primary keys of dimension tables.
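The composite key can be sketched with SQLite. Only one measure is shown, and the column names follow the list above with underscores added; this is an illustration, not a prescribed layout.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE sales_fact (
        product_id INTEGER,
        time_id INTEGER,
        demography_id INTEGER,
        payment_id INTEGER,
        actual_sale_price REAL,
        PRIMARY KEY (product_id, time_id, demography_id, payment_id)
    )
""")
con.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 1, 21000.0)")
con.execute("INSERT INTO sales_fact VALUES (1, 2, 1, 1, 20500.0)")  # new time_id: allowed

try:
    # Same combination of dimension keys as the first row.
    con.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 1, 19999.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = con.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(duplicate_rejected, row_count)
```

One row per combination of dimension values is exactly the "one cell" recalled above.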

Fact Table
• Concatenated fact table key
• Grain or level of data identified
• Large number of records
• Few attributes, deep and not wide
• Sparsity of data
• Degenerate dimensions

[Figure: Fact Table surrounded by Dimension Tables 1 to 5]

Star Schema
• Single Fact Table
• Many Dimension Tables, one for each dimension
• The Fact Table contains the detail summary data. Its primary key has one key per dimension.
• Each dimension is a single, highly denormalized table.
• The Dimension Tables consist of columns that correspond to attributes of the dimension.
• Each tuple in the Fact Table corresponds to one and only one tuple in each Dimension Table, whereas one tuple in a Dimension Table may correspond to more than one tuple in the Fact Table.

Star Schema
Star Join:
• A large detail “parent” table P is joined with several small “child” tables that describe join keys in the parent table.
• P gets denormalized by the join.
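A star join can be sketched with SQLite: the fact table plays the parent P and two small dimension tables play the children. The table contents are invented, and only two dimensions are used to keep the example short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, model_name TEXT, category TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE fact (product_key INTEGER, time_key INTEGER, sale_price REAL);
INSERT INTO product VALUES (1, 'Model X', 'minivan'), (2, 'Model Y', 'cars');
INSERT INTO time_dim VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 200.0);
""")

# Each fact row picks up the descriptive attributes of its dimension rows.
sales_by_category_quarter = con.execute("""
    SELECT p.category, t.quarter, SUM(f.sale_price)
    FROM fact f
    JOIN product p ON p.product_key = f.product_key
    JOIN time_dim t ON t.time_key = f.time_key
    GROUP BY p.category, t.quarter
    ORDER BY p.category, t.quarter
""").fetchall()
print(sales_by_category_quarter)
```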

STAR SCHEMA
FACT: time_key, demo_key, Product_key, Dealer_key, Finance_key, MEASURE
TIME
PRODUCT: Product_key, Model_name, Styling_pack, Line, Category, Exterior color, firstmodelyear
DEMOGRAPHY
DEALER
MOP: Finance_key, Finance_type, Term_in_months, Rate, agent

Grocery Store
• Nothing is known about Customer.
• Dimensions: Day, Item, Store
• Measures: Quantity_sold, Gross_sales, Net_sales, Basket_count, margin.
• Let us introduce yet another dimension “Promotion”
• Limitations:
 – Promotions are of different types
 – A single sale may be attributed to more than one promotion mode.
 – For some items, there may not be any promotion scheme
How do you model this dimension?

Grocery Store (2)
• Many-to-many relationships between dimension and fact
• Promotion dimension to be normalized or not?
• TIPS:
• Remember that the DWh is a DB designed for reporting and analysis. Resist all urges to normalize dimensions. To do so will degrade performance.
• Avoid many-to-many relations by introducing a dummy entity in the middle.
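One way to read the "dummy entity in the middle" tip is a promotion group table between the fact and the Promotion dimension. The sketch below assumes that reading; promotion_group, the 'none' promotion row and all data are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE promotion (promotion_key INTEGER PRIMARY KEY, promotion_type TEXT);
-- The entity in the middle: each group lists the promotions that applied together.
CREATE TABLE promotion_group (group_key INTEGER, promotion_key INTEGER);
-- Each fact row points to exactly one group.
CREATE TABLE sales_fact (item TEXT, group_key INTEGER, quantity_sold INTEGER);
INSERT INTO promotion VALUES (0, 'none'), (1, 'coupon'), (2, 'display');
INSERT INTO promotion_group VALUES (10, 0), (11, 1), (12, 1), (12, 2);
INSERT INTO sales_fact VALUES ('tea', 10, 5), ('soap', 11, 3), ('rice', 12, 4);
""")

quantity_by_promotion = con.execute("""
    SELECT p.promotion_type, SUM(f.quantity_sold)
    FROM sales_fact f
    JOIN promotion_group g ON g.group_key = f.group_key
    JOIN promotion p ON p.promotion_key = g.promotion_key
    GROUP BY p.promotion_type
    ORDER BY p.promotion_type
""").fetchall()
print(quantity_by_promotion)
```

The 'rice' sale is counted under both coupon and display, so the per-promotion totals add up to more than the 12 units actually sold. Items without a promotion point to the 'none' group, so no fact row is lost in the join.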

Snowflake Schema
Fact Table: time-key, item-key, location-key, Rupees-sold, units-sold
Dimension Table (item): item-key, item-name, brand, type, supplier-key
Dimension Table (time): time-key, year, quarter, month, day
Dimension Table (location): location-key, street, city-key
Dimension Table (supplier): supplier-key, supplier-name, supplier-address, supplier-type
Dimension Table (city): city-key, city-name, state, country, pin-code
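In a snowflake schema a query may need one extra join per normalised level. The sketch below follows the item and supplier tables of the figure with a reduced set of columns; the rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_name TEXT, supplier_type TEXT);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT,
    brand TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key)
);
CREATE TABLE sales_fact (item_key INTEGER REFERENCES item(item_key), units_sold INTEGER);
INSERT INTO supplier VALUES (1, 'Supplier A', 'wholesale'), (2, 'Supplier B', 'retail');
INSERT INTO item VALUES (1, 'pen', 'Brand P', 1), (2, 'ink', 'Brand Q', 1), (3, 'pad', 'Brand R', 2);
INSERT INTO sales_fact VALUES (1, 10), (2, 5), (3, 7);
""")

# Reaching supplier_type takes two joins: fact -> item -> supplier.
units_by_supplier_type = con.execute("""
    SELECT s.supplier_type, SUM(f.units_sold)
    FROM sales_fact f
    JOIN item i ON i.item_key = f.item_key
    JOIN supplier s ON s.supplier_key = i.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
print(units_by_supplier_type)
```

In a star schema the supplier attributes would sit directly in the item dimension and the second join would disappear.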

Snowflake schema
Dimension tables are normalised
[Figure: Fact Table with Dimension Tables 1 to 3]


Flowers selling
• Selling flowers via toll-free phone line
• Dimensions: Customer, Product, Time, Channel
• Fact: full price, discount, commission, net proceeds

More Fact Tables
• Late supply of flowers at a funeral or wedding negates the value.
• We can add two more measures:
 – Expected delivery date and hour
  • (Source: Order Entry System)
 – Actual delivery date and hour
  • (Source: Point_of_sales)
• Assume that due to late delivery, the amount is refunded.
• We can add one more Fact table to the schema.
 – RETURN_FACT Table
 – With one additional dimension Return_reason
GALAXY SCHEMA

GALAXY schema
MORE THAN ONE FACT TABLE
[Figure: fact tables sharing Dimension Table 1 and Dimension Table 2]
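A galaxy schema for the flower example can be sketched with two fact tables that share the product dimension. The column refund_amount and all rows are assumptions; each fact table is aggregated separately so that sales rows are not duplicated by matching return rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE return_reason (reason_key INTEGER PRIMARY KEY, reason TEXT);
CREATE TABLE sales_fact (product_key INTEGER, net_proceeds REAL);
CREATE TABLE return_fact (product_key INTEGER, reason_key INTEGER, refund_amount REAL);
INSERT INTO product VALUES (1, 'Roses'), (2, 'Lilies');
INSERT INTO return_reason VALUES (1, 'late delivery');
INSERT INTO sales_fact VALUES (1, 50.0), (1, 70.0), (2, 40.0);
INSERT INTO return_fact VALUES (1, 1, 50.0);
""")

# One subquery per fact table, both keyed on the shared product dimension.
sold_and_refunded = con.execute("""
    SELECT p.name,
           (SELECT SUM(s.net_proceeds) FROM sales_fact s
             WHERE s.product_key = p.product_key) AS sold,
           (SELECT COALESCE(SUM(r.refund_amount), 0) FROM return_fact r
             WHERE r.product_key = p.product_key) AS refunded
    FROM product p
    ORDER BY p.name
""").fetchall()
print(sold_and_refunded)
```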

SALES_FACT: Time_key, Customer_key, Product_key, Channel_key, Full_price, Discount_amount, Commission, Net_proceeds
TIME
PRODUCT: Product_key, Name, Type, Occasion, Expected-display-life, Season, Price_range, Date_available-from, description
CUSTOMER: Customer_key, First_name, Last_name, Bill_to_address, Bill_to_city, Bill_state, Age, Gender, Marital_status, Household_income
CHANNEL: Channel_key, Type, Name, Parent_company, Region, Location_city, Original_contract_year

Galaxy Schema
Fact Table: time key, item key, location key, Rupees sold, units sold
Fact Table: time key, item key, location key, Rupees sold, units sold
Dimension Table (item): item key, item name, brand, type, supplier key
Dimension Table (time): time key, year, quarter, month, day
Dimension Table (location): location key, street, city key
Dimension Table (supplier): supplier key, supplier name, supplier address, supplier type
Dimension Table (city): city key, city name, state, country, pin code