DATA WAREHOUSING FUNDAMENTALS

Definition of Data Warehouse (Inmon): "A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions."
OR


The data warehouse is an informational environment that:
- Provides an integrated and total view of the enterprise
- Makes the enterprise's current and historical information easily available for decision making
- Makes decision-support transactions possible without hindering operational systems
- Renders the organization's information consistent
- Presents a flexible and interactive source of strategic information

OR

"A copy of the transactional data specially structured for reporting and analysis."

Organizations' Use of Data Warehousing
- Retail: customer loyalty, market planning
- Financial: risk management, fraud detection
- Manufacturing: cost reduction, logistics management
- Utilities: asset management, resource management
- Airlines: route profitability, yield management

Data Warehouse - Subject Oriented
- Organized around major subjects, such as customer, account, and sales.
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
- Example: operational systems such as Customer Billing, Order Processing, and Accounts Receivable map to warehouse subjects such as Customer Data, Account, Sales, and REG Data.

Data Warehouse - Integrated
- Constructed by integrating multiple, heterogeneous data sources: relational or other databases, flat files, external data.
- Data cleaning and data integration techniques are applied (a small standardization sketch follows below).
- Ensures consistency in naming conventions, encoding structures, and attribute measures among the different data sources.
- When data is moved to the warehouse, it is converted.
- Example: operational systems hold a Savings Account, a Loans Account, and a Checking Account separately; in the data warehouse they are integrated under a single subject = Account.
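A minimal sketch of the kind of standardization applied during integration, written in Python. The gender encodings and the customer records here are hypothetical illustrations, not data from any particular source system.

```python
# Minimal sketch of integration-time standardization: three heterogeneous
# sources encode the same attribute differently, and the warehouse load
# converts them to one encoding. The mappings below are hypothetical.

GENDER_MAP = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}

def standardize_gender(value):
    """Convert any source-specific encoding to the warehouse standard."""
    return GENDER_MAP.get(str(value).strip().lower(), "U")  # 'U' = unknown

# Records arriving from three different operational systems
source_a = {"cust_id": 101, "gender": "male"}
source_b = {"cust_id": 102, "gender": 0}
source_c = {"cust_id": 103, "gender": "F"}

for rec in (source_a, source_b, source_c):
    rec["gender"] = standardize_gender(rec["gender"])
    print(rec)   # all records now use the single encoding M/F/U
```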

Data Warehouse - Non-Volatile
- A physically separate store of data transformed from the operational environment.
- Operational update of data does not occur in the data warehouse environment.
- Does not require transaction processing, recovery, and concurrency control mechanisms.
- Requires only two operations: loading of data and access of data.
- Example: an operational Order Processing system performs create, insert, update, delete, and access operations, while the warehouse's Sales Data is only loaded and accessed.

Data Warehouse - Time Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems.
- Operational database: current-value data.
- Data warehouse data: provide information from a historical perspective (e.g., the past 5-10 years).
- Every key structure in the data warehouse contains an element of time, but the key of operational data may or may not contain a "time element".
- Example: an operational Deposit System holds roughly 60-90 days of data, while Customer Data in the warehouse covers 5-10 years.

Data Warehouse - OLTP vs OLAP

OLTP (On-line Transaction Processing):
- holds current data
- useful for end users
- stores detailed data
- data is dynamic
- repetitive processing (one record processed at a time)
- high level of transaction throughput
- predictable pattern of usage
- transaction driven
- application oriented
- supports day-to-day decisions
- response time is very quick
- serves a large number of operational users

OLAP (On-line Analytical Processing):
- holds historic and integrated data
- useful for EIS and DSS
- stores detailed and summarized data
- data is largely static
- ad-hoc, unstructured and heuristic processing (a group of records processed in a batch)
- medium or low level of transaction throughput
- unpredictable pattern of usage
- analysis driven
- subject oriented
- supports strategic decisions
- response time is optimum (not instantaneous)
- serves a relatively low number of managerial users

Data Warehouse Architecture
[Diagram: source data passes through a staging area before being loaded into the data warehouse.]

Data Warehouse vs Data Mart

Data Warehouse:
- Corporate / enterprise-wide
- Union of all data marts
- Data received from the staging area
- Structure for a corporate view of data
- Queries on the presentation resource
- Organized on an E-R model

Data Mart:
- Departmental
- A single business process
- Star join (facts & dimensions)
- Structure to suit the departmental view of the data
- Technology optimal for data access and analysis

To Meet Requirements within the Data Warehouse
- The data is organized differently (e.g., multidimensionally):
  - Star schema
  - Snowflake schema
- The data is viewed differently.
- Data is stored differently:
  - Vector (array) storage
- Data is indexed differently (see the sketch below):
  - Bitmap indexes
  - Join indexes
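A rough illustration of the bitmap-index idea in plain Python. The sales rows and the region column are hypothetical, and a real warehouse builds these structures inside the database engine rather than in application code.

```python
# Sketch of a bitmap index on a low-cardinality column: one bit-vector per
# distinct value, with one bit per row position.

rows = [
    {"sale_id": 1, "region": "EAST"},
    {"sale_id": 2, "region": "WEST"},
    {"sale_id": 3, "region": "EAST"},
    {"sale_id": 4, "region": "NORTH"},
]

# Build one bit-vector per distinct value of the indexed column.
bitmap = {}
for pos, row in enumerate(rows):
    bitmap.setdefault(row["region"], 0)
    bitmap[row["region"]] |= 1 << pos

# Answer "region = 'EAST'" by reading the bit-vector instead of scanning rows.
east_bits = bitmap["EAST"]
matches = [rows[i] for i in range(len(rows)) if east_bits & (1 << i)]
print(matches)   # the two EAST rows

# ANDing or ORing bit-vectors combines predicates cheaply, which is why
# analytical databases favour bitmap indexes for filter-heavy queries.
```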

Star Schema
Star schema: "A modeling technique used to map multidimensional decision-support data into a relational database for the purpose of performing advanced data analysis."
OR
"A relational database schema organized around a central table (the fact table) joined to a few smaller tables (dimension tables) using foreign key references."
Types of star schema:
1) Basic star schema, or star schema
2) Extended star schema, or snowflake schema

Multidimensional Modeling
Multidimensional modeling is based on the concept of the star schema. A star schema consists of two types of tables:
1) Fact table
2) Dimension table
Fact table: "The fact table contains the transactional data generated out of business transactions."
Dimension table: "The dimension table contains master data or referential data used to analyze the transactional data."

The fact table contains two types of columns:
1) Measures
2) Key section

Example fact table:
- Key section: Date, Prod_id, Cust_id
- Measures: Sales_revenue, Tot_quantity, Unit_cost, Sale_price

The data warehouse supports 3 types of measures:
1) Additive measures: "Measures that can be involved in calculations in order to derive new measures."
2) Non-additive measures: "Measures that cannot participate in calculations."
3) Semi-additive measures: "Measures that can participate in calculations depending on the context" - that is, measures that can be added across some dimensions but not across others (a small sketch follows below).
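A small sketch contrasting additive and semi-additive behaviour. Sales_revenue comes from the fact table described above; the account balance column is a hypothetical semi-additive measure added only for illustration.

```python
# Sales_revenue is additive: it can be summed across every dimension.
# Balance is used here as a hypothetical semi-additive measure: it can be
# summed across customers for one period, but not across time periods,
# where the latest value (or an average) is used instead.

facts = [
    {"date": "2024-01", "cust_id": 1, "sales_revenue": 100, "balance": 500},
    {"date": "2024-02", "cust_id": 1, "sales_revenue": 150, "balance": 450},
    {"date": "2024-01", "cust_id": 2, "sales_revenue": 200, "balance": 900},
]

# Additive: total revenue across both the date and customer dimensions.
total_revenue = sum(f["sales_revenue"] for f in facts)

# Semi-additive: balances may be summed across customers for one period...
jan_balance = sum(f["balance"] for f in facts if f["date"] == "2024-01")

# ...but summing across periods would double count; take the latest instead.
cust1_latest_balance = max(
    (f for f in facts if f["cust_id"] == 1), key=lambda f: f["date"]
)["balance"]

print(total_revenue, jan_balance, cust1_latest_balance)   # 450 1400 450
```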

Types of Star Schema
The data warehouse supports 2 types of star schemas:
1) Basic star schema, or star schema
2) Extended star schema, or snowflake schema

Star schema: "Fact tables exist in normalized format, whereas dimension tables exist in denormalized format."
Snowflake schema: "Both fact and dimension tables exist in normalized format."
Factless fact table (or coverage table): "Events or transactions can occur without measures, resulting in a fact table without measures."

Example of Star Schema
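Since the diagram is not reproduced here, the following sketch builds an equivalent star schema in SQLite from Python. The fact-table columns follow the key section and measures listed earlier; the dimension attributes (prod_name, category, city, and so on) are assumptions for illustration only.

```python
# Minimal star schema sketch: three dimension tables and one central fact
# table joined to them by foreign keys.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, cal_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product  (prod_id INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
CREATE TABLE dim_customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT, city TEXT);

-- Central fact table: key section plus measures.
CREATE TABLE fact_sales (
    date_id       INTEGER REFERENCES dim_date(date_id),
    prod_id       INTEGER REFERENCES dim_product(prod_id),
    cust_id       INTEGER REFERENCES dim_customer(cust_id),
    sales_revenue REAL,
    tot_quantity  INTEGER,
    unit_cost     REAL,
    sale_price    REAL
);
""")
print("star schema created")
```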

Example of Snowflake Schema
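A matching sketch of how the product dimension above could be snowflaked: the category attribute moves into its own normalized lookup table. Table and column names are hypothetical.

```python
# Snowflaking the product dimension: the denormalized category column is
# replaced by a foreign key to a separate category table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (prod_id     INTEGER PRIMARY KEY,
                           prod_name   TEXT,
                           category_id INTEGER REFERENCES dim_category(category_id));
""")
print("snowflaked product dimension created")
```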

Data Warehouse - Slowly Changing Dimensions
Slowly changing dimensions: dimensions that change over time are called slowly changing dimensions. For instance, a product price changes over time, people change their names for some reason, and country and state names may change over time. These are a few examples of slowly changing dimensions, since some changes happen to them over a period of time.

Type 1: Overwriting the old values
Type 2: Creating an additional record
Type 3: Creating new fields

SCD Type 1
- Type 1: Overwriting the old values.

Product price in 2004:

  Product ID (PK) | Year | Prod Name | Price
  1               | 2004 | Product1  | 150

- In the year 2005, if the price of the product changes to $250, then the old values of the columns "Year" and "Price" are updated and replaced with the new values:

  Product ID (PK) | Year | Prod Name | Price
  1               | 2005 | Product1  | 250

- With Type 1 there is no way to find out the old value of the product "Product1" in year 2004, since the table now contains only the new price and year information. (A small update sketch follows below.)
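A minimal sketch of a Type 1 change using SQLite from Python, mirroring the product table above: the existing row is simply overwritten in place, so the 2004 values are lost.

```python
# SCD Type 1: overwrite the old values.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, year INTEGER, prod_name TEXT, price REAL)")
con.execute("INSERT INTO product VALUES (1, 2004, 'Product1', 150)")

# Price changes to 250 in 2005: update year and price in place.
con.execute("UPDATE product SET year = 2005, price = 250 WHERE product_id = 1")

print(con.execute("SELECT * FROM product").fetchall())  # [(1, 2005, 'Product1', 250.0)]
```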

SCD Type 2
- Type 2: Creating an additional record.

  PRODUCT
  Product ID (PK) | Effective Date time (PK) | Year | Product Name | Price | Expiry Date time
  1               | 01-01-2004 12.00AM       | 2004 | Product1     | 150   | 12-31-2004 11.59PM
  1               | 01-01-2005 12.00AM       | 2005 | Product1     | 250   |

(A small expire-and-insert sketch follows below.)
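A minimal sketch of a Type 2 change: the currently open record is expired and a new record is inserted with its own effective date, so the full history is preserved. Column names follow the table above in simplified form.

```python
# SCD Type 2: expire the current row, insert an additional row.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE product (
    product_id INTEGER, effective_date TEXT, expiry_date TEXT,
    year INTEGER, prod_name TEXT, price REAL,
    PRIMARY KEY (product_id, effective_date))""")
con.execute("INSERT INTO product VALUES (1, '2004-01-01', NULL, 2004, 'Product1', 150)")

def scd2_update(con, product_id, change_date, year, price):
    # Expire the currently open record...
    con.execute("UPDATE product SET expiry_date = ? WHERE product_id = ? AND expiry_date IS NULL",
                (change_date, product_id))
    # ...and add the new version as an additional record.
    con.execute("INSERT INTO product VALUES (?, ?, NULL, ?, 'Product1', ?)",
                (product_id, change_date, year, price))

scd2_update(con, 1, "2005-01-01", 2005, 250)
print(con.execute("SELECT * FROM product ORDER BY effective_date").fetchall())
```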

SCD Type 3
- Type 3: Creating new fields.
- With Type 3, only the latest change to the values can be seen. The example below illustrates how to add new columns and keep track of the changes. From it, we are able to see the current price and the previous price of the product Product1.

  Product ID (PK) | Current Year | Product Name | Current Product Price | Old Product Price | Old Year
  1               | 2005         | Product1     | 250                   | 150               | 2004

- The problem with the Type 3 approach is that if the product price changes continuously over the years, the complete history is not stored; only the latest change is kept. For example, if in the year 2006 Product1's price changes to $350, we would no longer be able to see the 2004 price, since the old values would have been overwritten with the 2005 product information:

  Product ID (PK) | Current Year | Product Name | Current Product Price | Old Product Price | Old Year
  1               | 2006         | Product1     | 350                   | 250               | 2005

(A small column-shift sketch follows below.)
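A minimal sketch of a Type 3 change: the current values are shifted into the "old" columns and then overwritten, so only one previous version survives.

```python
# SCD Type 3: new columns hold only the previous value, so each change
# overwrites the change before it.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE product (
    product_id INTEGER PRIMARY KEY, current_year INTEGER, prod_name TEXT,
    current_price REAL, old_price REAL, old_year INTEGER)""")
con.execute("INSERT INTO product VALUES (1, 2005, 'Product1', 250, 150, 2004)")

def scd3_update(con, product_id, year, price):
    # Shift the current values into the "old" columns, then overwrite them.
    con.execute("""UPDATE product
                   SET old_price = current_price, old_year = current_year,
                       current_price = ?, current_year = ?
                   WHERE product_id = ?""", (price, year, product_id))

scd3_update(con, 1, 2006, 350)
# The 2004 price is gone: only the 2006 and 2005 values remain visible.
print(con.execute("SELECT * FROM product").fetchall())
```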

Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:
- Extracting data from outside sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database or data warehouse)

Extract:
- The first part of an ETL process involves extracting the data from the source systems.
- Most data warehousing projects consolidate data from different source systems. Common data source formats are relational databases and flat files, but they may include non-relational database structures such as Information Management System (IMS), or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM); data may even be fetched from outside sources through web spidering or screen scraping.
- Extraction converts the data into a format suitable for transformation processing. (A small extract sketch follows below.)

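A minimal extract sketch in Python: rows are pulled from a relational source and from a flat file into one common in-memory format for the transform stage. The database path, table, and column names are hypothetical.

```python
# Extract rows from two hypothetical sources into a single common format
# (a list of dicts) that the transform stage can process uniformly.
import csv
import sqlite3

def extract_from_database(db_path):
    con = sqlite3.connect(db_path)
    cur = con.execute("SELECT cust_id, cust_name, city FROM customers")
    return [dict(zip(("cust_id", "cust_name", "city"), row)) for row in cur]

def extract_from_flat_file(csv_path):
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))   # one dict per record

# The functions are not invoked here because the sources are placeholders;
# in practice each returns a list of dicts ready for transformation.
```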
Transform
- The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target.
- Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database (two of them are sketched below):
  - Generating surrogate-key values
  - Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
  - Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column into individual values in different columns)
  - Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
  - Lookup and validation of the relevant data from tables or referential files for slowly changing dimensions
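A small sketch of two of the transformations listed above: generating a surrogate key and splitting a comma-separated column into separate columns. The record layout is hypothetical.

```python
# Two transform-stage operations: surrogate-key generation and column splitting.
import itertools

surrogate_key = itertools.count(1)   # simple monotonically increasing key

def transform(record):
    row = dict(record)
    row["customer_sk"] = next(surrogate_key)             # surrogate-key value
    # Split "city,state,zip" held in one source column into three columns.
    city, state, zip_code = row.pop("location").split(",")
    row.update({"city": city, "state": state, "zip": zip_code})
    return row

print(transform({"cust_id": 101, "location": "Austin,TX,73301"}))
```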

Load
- The load phase loads the data into the end target, usually the data warehouse (DW).
- Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative data, with updates of the extracted data frequently done on a daily, hourly, weekly, or monthly basis; other DWs (or even other parts of the same DW) may add new data in a historicized form.
- As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields), and these also contribute to the overall data quality performance of the ETL process.

Real-life ETL Cycle
The typical real-life ETL cycle consists of the following execution steps (a skeleton of the cycle follows below):
- Cycle initiation
- Build reference data
- Extract (from sources)
- Validate
- Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
- Stage (load into staging tables, if used)
- Audit reports (for example, on compliance with business rules; also, in case of failure, helps to diagnose/repair)
- Publish (to target tables)
- Archive
- Clean up
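A skeleton of that cycle in Python, with each step reduced to a stub so the control flow (validate, transform, stage, audit, then publish) is visible. The function names are placeholders, not any tool's actual API.

```python
# Skeleton of the ETL cycle listed above; each step is a trivial stub.

def build_reference_data():   return {}
def extract(source):          return list(source)
def validate(rows):           assert all("id" in r for r in rows)
def transform(rows, ref):     return [dict(r, loaded=True) for r in rows]   # clean, apply rules
def stage(rows):              print("staged", len(rows), "rows")            # staging tables
def audit(rows):              return len(rows) > 0                          # business-rule check
def publish(rows):            print("published", len(rows), "rows")         # target tables
def archive(rows):            pass
def cleanup():                print("cleanup done")

def run_etl_cycle(sources):
    ref = build_reference_data()
    for source in sources:
        raw = extract(source)
        validate(raw)
        clean = transform(raw, ref)
        stage(clean)
        if audit(clean):          # publish only when the audit passes
            publish(clean)
        archive(raw)
    cleanup()

run_etl_cycle([[{"id": 1}, {"id": 2}]])
```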

A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve the overall performance of ETL processes when dealing with large volumes of data. ETL applications implement three main types of parallelism:
- Data: splitting a single sequential file into smaller data files to provide parallel access (see the sketch below).
- Pipeline: allowing the simultaneous running of several components on the same data stream, for example looking up a value on record 1 at the same time as adding two fields on record 2.
- Component: the simultaneous running of multiple processes on different data streams in the same job, for example sorting one input file while removing duplicates on another file.

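A minimal sketch of the "data" style of parallelism described above: a large input is split into chunks that are transformed in parallel worker processes. The chunking and the transformation are deliberately trivial placeholders.

```python
# Data parallelism: split one large input into chunks and process the
# chunks concurrently in a pool of worker processes.
from multiprocessing import Pool

def transform_chunk(chunk):
    return [value * 2 for value in chunk]      # stand-in for real ETL work

if __name__ == "__main__":
    data = list(range(100_000))
    chunks = [data[i:i + 25_000] for i in range(0, len(data), 25_000)]
    with Pool(processes=4) as pool:
        results = pool.map(transform_chunk, chunks)   # chunks processed in parallel
    print(sum(len(r) for r in results))               # all 100,000 rows transformed
```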
AB INITIO INTRODUCTION
- Data processing tool from Ab Initio Software Corporation (http://www.abinitio.com)
- "Ab initio" is Latin for "from the beginning"
- Designed to support the largest and most complex business applications
- Graphical, intuitive, and fits the way your business works

Importance of Ab Initio Compared to Other ETL Tools
1) Able to process huge amounts of data in a short span of time.
2) Easy to write complex and custom ETL logic, especially in the case of banking and financial applications (e.g., amortization).
3) Ab Initio follows all three types of parallelism.
4) The data parallelism of Ab Initio is one feature which makes it distinct from the other ETL tools.
5) When handling complex logic that an ETL tool needs to handle, you can write custom code, as it is Pro*C-based code.