
Infosys Technologies Limited

Education and Research Department

Data Warehouses
- Modeling and Design
May 2001

Document no. ER/CORP/CRS/SE18/002

Authorized by Dr. V. P. Kochikar

Version revision 1.10

Signature / Date


Modification Log
Version/Revision  Date       Author(s)       Description
0.00a             19 Sep 97  Ramkumar R.     Draft version.
0.00b             03 Dec 97  Ramkumar R.     Incorporates comments by S. Yegneshwar (more detail on dimensional modeling concepts) and N. Ranga (emphasize the difference between OLTP and DW). Write-up on OLTP upload issues required.
1.00              04 Jan 01  Jitendra Sahoo  Draft version with modifications before review.
1.10              02 May 01  Jitendra Sahoo  Unit 1: Introduction to Data Warehousing; Unit 3: Data Warehouse Architecture; Unit 4: Managing a Data Warehouse Project; Unit 5: Codd's OLAP features; case studies at the end of each unit.


Course Description

Course number: SE18
Course name: Data Warehouses: Modeling & Design
Author: Jitendra Sahoo
Pre-requisites for attending course: Knowledge of RDBMS and data modeling
Suggested course duration: 2 days (6 hours: 3 sessions of 2 hours each)

Course objectives

1. To introduce data warehousing concepts and terminology.
   Demonstrable knowledge/skills: Knowledge of the basic terms and terminology, and of where key technologies such as OLAP, multidimensional databases and data mining fit into the data warehousing landscape.

2. To introduce data warehouse modeling and design.
   Demonstrable knowledge/skills: Knowledge of conceptualizing a data warehouse and defining its scope; ability to understand dimensional modeling and explain the rationale behind the design of a data warehouse.

3. To introduce data warehouse architecture and framework.
   Demonstrable knowledge/skills: Knowledge of the architecture of, and the approach to building, a data warehouse.

4. To address issues in data warehouse projects, and data warehouse performance issues.
   Demonstrable knowledge/skills: Knowledge of how to manage data warehouse projects and address their issues; ability to form a basic approach to selecting data warehouse tools; understanding of the importance of sizing a warehouse.

5. To introduce OLAP fundamentals.
   Demonstrable knowledge/skills: Knowledge of Codd's OLAP features.


Course Design

1. Unit name: To introduce data warehousing concepts and terminology.
   Unit objectives and keywords: Survey the landscape of data warehousing concepts and technology. Keywords: data warehouse, data mart, data mining, OLAP, multidimensional database.
   Estimated delivery time: 1 h

2. Unit name: To introduce data warehouse modeling and design.
   Unit objectives and keywords: Introduce the dimensional model through case studies (dimension, fact, star schema, application constraint, browsing, drilling down, primary/secondary dimension, partitioning dimension, custom fact table). Introduce techniques to handle large and/or time-varying dimension tables (snowflake, minidimension, slowly changing dimension). Introduce techniques for handling queries requiring summarization of data (data-intensive/selective queries, full aggregation, level technique, dynamic/SQL-based aggregation, aggregate navigators, snapshots, dirty dimension, degenerate dimension, factless fact table, coverage table).
   Estimated delivery time: 2 h

3. Unit name: To introduce data warehouse architecture and framework.
   Unit objectives and keywords: Highlight common physical design and implementation techniques used by designers and database vendors (bitmap index, data cardinality, aggregate explosion). Bring together concepts from previous units into an integrated case study.
   Estimated delivery time: 1 h

4. Unit name: To address issues in data warehouse projects, and data warehouse performance issues.
   Unit objectives and keywords: Highlight the importance of sizing. Explain maintenance and archiving techniques and procedures.
   Estimated delivery time: 1.5 h

5. Unit name: To introduce OLAP fundamentals.
   Unit objectives and keywords: Highlight E. F. Codd's OLAP features and the different categories of OLAP models.


Main References:
1. W. H. Inmon (199?), Building the Data Warehouse, John Wiley, NY.
2. R. Kimball (1996), The Data Warehouse Toolkit, John Wiley, NY.
3. M. E. Meredith and A. Khader (1997), Divide and Aggregate: Designing Large Warehouses, technical report, Miller Freeman Inc.
4. Designing the Data Warehouse on Relational Databases, technical report, Stanford Technology Group, Inc. (an Informix company).
5. Anjaneyulu Marempudi, Data Warehouses for Better Business (marempudi@msn.com).
6. SAS Institute website, http://www.sas.com [sas].


Review Record

Version.Revision: 1.00

Date: 09 May 2001
Reviewer: Mr. Neelkanth Mishra
Description:
1. Cover real project-related issues, e.g. information sharing (one of the biggest soft bottlenecks in a DW project), problems of scalability and sizing (e.g. one starts with 20 users, but after success 50 more want to be added), changing business models, the need for top-management support, information security, etc.
2. It may be difficult for the reader to read through 50 pages and then aggregate the material that he needs.
3. Presentation-wise as well, some of the key words in paragraphs should be highlighted/underlined, as the information may otherwise seem too diffuse for the reader.
Remarks by the author: The reviewer's comments and remarks have been taken into consideration. Most of the points have been incorporated.

Date: 09 May 2001
Reviewer: Mr. Assis Dash
Description:
1. Please add the project issues as caveats for the attendees.
2. Identify and list the risks perceived in a data warehousing project.
3. Please include as many case studies as possible to give a real-life scenario.
4. Some amount of reformatting will help.
Remarks by the author: Assis was involved since the beginning. He suggested project-related issues while the HDD was being made. Some of the material has also been derived from his project, white paper, and documents.


Unit 1. An Introduction to Data Warehousing

What is data warehousing?
According to Bill Inmon, known as the father of data warehousing, a data warehouse is "a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions".
- Subject-oriented means that all relevant data about a subject is gathered and stored as a single set in a useful format.
- Integrated refers to data being stored in a globally accepted fashion with consistent naming conventions, measurements, encoding structures, and physical attributes, even when the underlying operational systems store the data differently.
- Non-volatile means the data warehouse is read-only: data is loaded into the data warehouse and accessed there.
- Time-variant means that data keeps getting added as time goes on and is retained over time; time is the most important dimension of a warehouse.

Data warehousing is a concept: a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating, in order to make better business decisions. Data warehousing is not just the data in the data warehouse, but also the architecture and tools to collect, query, analyze and present information.

Unit 1.1 Data warehousing concepts

Operational vs. informational data: Operational data is the data you use to run your business. This data is what is typically stored, retrieved, and updated by your Online Transaction Processing (OLTP) system. An OLTP system may be, for example, a reservations system, an accounting application, or an order entry application. Informational data is created from the wealth of operational data that exists in your business, plus some external data useful for analyzing your business. Informational data is what makes up a data warehouse. Informational data is typically:
- Summarized operational data
- De-normalized and replicated data
- Infrequently updated from the operational systems
- Optimized for decision support applications
- Possibly "read only" (no updates allowed)
- Stored on separate systems to lessen the impact on operational systems


A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions [Inm]. The end-users of a data warehouse are usually business analysts, as distinct from field personnel or call takers.

Question: What do you think is the skill profile of the data warehouse end-user?

                          Operational                                     Decision support
Data Content              Current values                                  Archival, summarized, calculated data
Data Organization         Application by application                      Subject areas across enterprise
Nature of Data            Dynamic                                         Static until refreshed
Data Structure & Format   Complex; suitable for operational computation   Simple; suitable for business analysis
Access Probability        High                                            Moderate to low
Data Update               Updated on a field-by-field basis               Accessed and manipulated; no direct update
Usage                     Highly structured repetitive processing         Highly unstructured analytical processing
Response Time             Sub-second to 2-3 seconds                       Seconds to minutes

Source: [STG]

Question: Do the descriptions under "Data Structure & Format" fit in with the skill profiles of the respective end-users?

A data mart is a scaled-down deployment of a data warehouse that contains data focusing on a departmental user's analytical requirements. For example, the Ohio-based Huntington Bank Corporation set up a data mart for its general ledger system, to get the ledger system's functional information to the bank's financial analysts and budget coordinators quickly.

Data mining is the process of examining data for trends and patterns that might have evaded human analysis. For example, Shoko's Sunday circulars contained coupons advertising health and beauty aids, consumables, and household chemicals, which are all located on the left-hand side of its stores. Shoko's data mining exercise revealed that people who were coming in to shop gravitated to the left-hand side of the store for the promotional items and were not necessarily shopping the whole store. Consequently, it added apparel promotions to the Sunday circulars.

An On-Line Analytical Processing (OLAP) application is intended to provide end-users the ability to perform any business logic and statistical analysis that is relevant. This analysis must happen fast, i.e., it must deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.

Multidimensional databases are non-relational DBMS products that are specialized for the kinds of queries used in data warehouses. This is in contrast to using specialized analysis tools that run on top of a traditional RDBMS.

What is the ROI for a data warehouse? A recent study [Fis] of 45 major companies by the International Data Corporation found an average three-year return on investment in data warehouse systems of 401%. More instructive is the very wide range of returns reported by the companies, from 16,000 percent to minus 1,857 percent. Moral: data warehousing is not a silver bullet; use with care!

Multi-dimensional data structures can be implemented with multidimensional databases or extended RDBMSs. Relational databases can support this structure through specific database designs (schemas), such as the star schema, intended for multi-dimensional analysis and highly


indexed or summarized designs. These structures are sometimes referred to as relational OLAP (ROLAP)-based structures.

Metadata/Information Catalogue: Metadata describes the data that is contained in the data warehouse (e.g. data elements and their business-oriented descriptions) as well as the source of that data and the transformations or derivations that may have been performed to create each data element.

Unit 1.2 Benefits of Data Warehousing

A well designed and implemented data warehouse can be used to:
- Understand business trends and make better forecasting decisions
- Bring better products to market in a more timely manner
- Analyze daily sales information and make quick decisions that can significantly affect your company's performance

Data warehousing can be a key differentiator in many different industries. At present, some of the most popular data warehouse applications include:
- sales and marketing analysis across all industries
- inventory turn and product tracking in manufacturing
- category management, vendor analysis, and marketing program effectiveness analysis in retail
- profitable lane or driver risk analysis in transportation
- profitability analysis or risk assessment in banking
- claims analysis or fraud detection in insurance

Unit 1.3 The Data Warehousing Application Class: How It Has Evolved

Throughout the history of systems development, the primary emphasis has been given to operational systems and the data they process. But the fundamental requirements of operational and analysis systems are different: operational systems need performance, whereas analysis systems need flexibility and broad scope. It has rarely been acceptable to have business analysis interfere with and degrade the performance of the operational systems. Data warehousing has quickly evolved into a unique and popular business application class. Early builders of data warehouses already consider their systems to be key components of their IT strategy and architecture. In building a data warehouse application, the source inputs are listed below.

1. Data from legacy systems

In the 1970s virtually all business system development was done on the IBM mainframe computers using tools such as Cobol, CICS, IMS, DB2, etc. The 1980s brought in the new mini-computer platforms such as AS/400 and VAX/VMS. The late eighties and early nineties made UNIX a popular server platform with the introduction of client/server architecture. By some estimates, more than 70 percent of business data for large corporations still resides in the mainframe environment.

2. Extracted data from micro (desktop) databases


In recent times advanced users frequently use desktop database programs that allow them to store and work with information extracted from the legacy sources. Many desktop reporting and analysis tools are increasingly targeted towards end users and have gained considerable popularity on the desktop. A downside to this is the difficulty of sharing analyses with others: e.g. during budgeting, one user (say the boss) may create analysis models (say allocation rules) that are to be used by all others, and the first user then generates the final output by putting these analyses together. Furthermore, the semantics of the data may need to be standardized before it is released to the users; in a desktop environment, this may be nearly impossible. As the data is stored on disparate systems, it is very difficult to ensure that updates to the data are communicated to all users. For example, say sales data comes in, and one person sends brand-wise summaries to some key users, who then forward them to their subordinates. Some hours later it is realized that data from one of the warehouses was missed out, and revised reports are sent. Result: different people working on different versions of the same data, and unnecessary reconciliation issues cropping up later.

3. Decision-Support and Management Information Systems (MIS)

The last category of analysis systems has been decision support systems and executive information systems. Decision support systems tend to focus more on detail and are targeted towards lower- to mid-level managers. Executive information systems have generally provided a higher level of consolidation and a multi-dimensional view of the data, as high-level executives need the ability to slice and dice the same data more than to drill down to review the data detail. This category is the closest to data warehousing applications: these systems hold data in descriptive standard business terms, rather than in cryptic computer field names, and data names and data structures in these systems are designed for use by non-technical users.

Data warehousing applications are in prominence today because the key technologies are available, hardware prices are down, good server software exists, internet applications are widespread and, most importantly, plenty of tools are available.

Unit 1.4 Self-Review through a Case Study

A large grocery chain has around 500 large stores in 3 states, each with many departments. Each store deals with around 60,000 products: 40,000 products are bought from external vendors, and the remaining 20,000 are prepared in the stores' own departments. Each product has a product code.

1. Which is the primary key?
2. What is the place for data collection?
3. What are the different business activities?
4. What will the management be interested in?
5. What should be the business goal?
6. What is the grain?


7. What are the business measurements for the fact table?
8. Give an approximate database size, and the size of the fact table.

After answering the above questions, attempt the following conceptual questions.

1. Define a grain statement.
2. Define a measure.
3. Find the difference between OLTP and OLAP. Supply two SQL statements, one to illustrate each kind of system.
4. Explain the importance of time with respect to both OLTP and OLAP.
5. Do OLAP and data warehousing go together?


The dimensional model

Traditional normalized database designs are inappropriate for data warehouses for two reasons:
- DSS processing can involve accessing hundreds of thousands of rows at a time across several tables. Complex joins can seriously compromise performance.
- Usage/access paths in an OLTP environment are known a priori. In DSS, usage is very unstructured; users often decide what data to analyze moments before they request it, and applications cannot be hardcoded for a particular schema.

Data modeling is a useful design tool because it allows automatic generation of a normalized database schema from an ER diagram. Because traditional normalized database designs are inappropriate, traditional data modeling is also inappropriate as a design tool. It continues to be a useful tool for modeling and understanding business information (the business essence) in a technology-independent way, and provides a foundation for mapping the data in the operational data stores to that in the data warehouse.

Database designs for data warehouses follow a star schema. There are one or more central fact tables, each surrounded by several dimension tables that provide the foreign keys that define each fact table row. The fact table is like a transaction table while the dimension tables are like master tables. For example, if the attendance of a student in a course is a transaction, the associated dimensions would be Course, Instructor, Student and Course_date. For each such transaction, we might like to store no_of_days_attended; this would then be the fact of the transaction.

[Figure: star schema with the dimension tables Course, Instructor, Student and Time surrounding the Attendance fact table (Course_id, Student_id, Date_id, Instructor_id, #of_days_attended, lab_days_attended).]

Note that Date_id is actually a foreign key. Having a separate table called Time is essential for performing the kind of temporal analysis that is typically required in a business context.
The driver of data warehouse design is the nature of the standard data warehouse query, which is Give me [aggregated] facts broken down by dimensions D1 and D2 for such-and-such time period. For example, we might be interested in looking at attendance by Course by Day, or the sum of attendances by Course by Year, by Student by Quarter, and so on. This translates into SQL which looks something like this: select D1.attrib_1, D2.attrib2, sum(F.fact1), sum(F.fact2) from fact F, dimension1 D1, dimension2 D2, time T where F.dim1_key = D1.key and F.dim2_key = D2.key and F.timekey = T.key and T.quarter = 1Q1997 group by D1.attrib1, D2.attrib2 order by D1.attrib1, D2.attrib2 Exercise: Write an SQL that reveals the number of attendances from SBU1 by Course by Year. Note that time is conceptually just another dimension. A dimensional constraint or filter such as T.quarter = 1Q1997 is called an application constraint. Exercise : You run a grocery store chain. Your sales fact table records, for each sale, the UPC, the date and time, store id, details of promotional offers on the product sold, and of course the sale


quantity and selling price. Imagine the kinds of temporal analyses you would like to do with your sales data. What attributes would you like the Time dimension table to have? How large could this dimension table possibly get?

Browsing is the activity of exploring a single dimension table, prior to firing the template query above, with a two-fold purpose: (i) choosing attributes for the select clause of the query; this might be done by simply dragging the attribute name onto a graphic representing the template query or report; and (ii) choosing application constraints for selecting a subset of rows of the table for the query. Consider the completely denormalized Product table, with each row containing information about product, product category, and package description. The browsing activity might result in the application constraint Package_desc = 'TetraPak' after a "select package_desc where category = 'beverage'". Drilling down is the action of dragging an attribute name (from a dimension table) onto an existing report.

The size of a dimension table is invariably a tiny fraction of the size of the fact table. Besides, a data warehouse is updated only once a day, and it is only on a small fraction of these days that a dimension table is ever updated. The dimension tables are thus left completely unnormalized. A policy of having completely unnormalized dimension tables allows graphical browsing and automatic SQL generation for all standard user queries of the kind mentioned above. It also pays to have as many descriptive or qualifying attributes for each dimension as can be imagined, so that the end-user can set a variety of application constraints. (Consider, for example, an analyst who wants to know how the sale of paints on Holi (the festival) days differs from sales on other days.)

There is a subtle difference between a dimension and an entity. Time is a dimension, but is associated with a great many entities, as shown in the accompanying figure.

[Figure: Time dimension hierarchies -- Day rolling up to Month, Quarter and Year, and Day rolling up to Week, 4-week period and 13-week period.]

For convenience, one of the hierarchies (the one most commonly used in queries) is usually designated the primary dimension, and every other hierarchy a secondary dimension. Each level in a hierarchy is said to roll up to the next level (though it is apparent that roll-ups are not always uniquely defined). Dimensional modeling, then, attempts to depict the facts and their associated dimensions, without explicitly depicting the entities and relationships that make up a dimensional hierarchy.

Exercise: You want to analyze course attendances as well as course nominations. Develop a star schema to do this. Would you choose to have one fact table or multiple? If you choose the former, will you have any new dimensions?

In general, a single fact table is a good idea where multiple types of facts share a subtype-supertype relationship with the bulk of the attributes being common. For example, transactions involving bank accounts have different flavors depending on whether it is a savings account or a checking account that is being operated. This difference in flavor manifests itself as mild variations in the composition of attributes that make up the transaction fact. With relatively unrelated fact types, on the other hand, the number of common attributes is small, so the preferred choice is to have a custom fact table for each fact type, and to replicate common attributes in all custom fact tables to avoid joins.

Exercise: What happens to a table that represents a dimension which has subtypes?
It is common to have data points (facts) that are described as an adjective of your base data (e.g. actual sales and budgeted sales). Rather than anticipate all adjectives during warehouse design, we can create a partitioning dimension that holds only the adjectives and their descriptions (each adjective is called a partition), with the fact table row containing a column called just "sales",


along with a new foreign key called partition. This makes it easy to add a new type of fact such as forecast sales: just insert a row in the partitioning dimension for the new adjective "forecast", and have the fact table foreign key partition indicate "forecast" for each record that represents a forecast.
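As a minimal sketch of how this might look in SQL (the table and column names here are illustrative assumptions, not taken from the course material):

insert into partition_dim (partition_key, partition_desc)
values (3, 'Forecast sales');

-- forecast facts reuse the single generic "sales" measure column
insert into sales_fact (time_key, product_key, partition_key, sales)
values (19970101, 101, 3, 25000);

-- actual, budgeted and forecast figures can then be compared in one query
select p.partition_desc, sum(f.sales)
from sales_fact f, partition_dim p
where f.partition_key = p.partition_key
group by p.partition_desc;

No change to the structure of the fact table is needed when a further adjective is introduced later.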

Unit 1.5 Issues in Dimensional Modeling

Big dimensions

Denormalization increases redundancy and, consequently, size. Sometimes a fully denormalized dimension table does become uncomfortably large. When this happens, the dimension may be normalized or "snowflaked" in the following manner. The dimension table stores one key for each level of the dimension's hierarchy. The lowest-level key joins the dimension table to the central fact table. The rest of the keys join the dimension table to the corresponding higher-level tables. In a snowflake schema, every dimension is normalized in this manner. The word snowflake refers to the shape of the fully normalized schema when represented graphically.

Exercise: Snowflake the Time dimension. How will you handle multiple hierarchies? Do you think Time is a big dimension?

There is significant difference of opinion about whether snowflaking should be done at all, even for big dimensions. Ralph Kimball [Kim] believes it should never be done, while the Stanford Technology Group (an Informix company) believes it is useful, since in a dimension table of 500,000 rows it is conceivable to save two kilobytes per row through normalization and hence save a full gigabyte of disk. Kimball compares the saving to the overall size of the typical warehouse, which is about 50 GB. Performance is also a factor to be considered: without snowflaking, a query that needs to analyze sales by brand will have to rummage through 500,000 product rows to filter out perhaps a score of brands. Another factor is the complexity of the structures as perceived by the end-user (a business analyst), and the associated loss of browsing ability (recall the definition of browsing given earlier). Finally, load programs and overall maintenance become more difficult to manage as the data model becomes more complex.

A dimension such as Customer may have many qualifying attributes, such as age, sex, and income_level, which are of interest to the business analyst not as specific values but as combinations of brackets. Instead of retaining these as individual attributes in the Customer dimension table, we can replace them with a demographics_key that points to a row in a minidimension table as shown:

Demographics minidimension: demographics_key, age_bracket, income_bracket, sex, ...
Sales fact: time_key, customer_key, demographics_key, product_key, ...
Customer dimension: customer_key, demographics_key, first_name, address, ...

Such a schema speeds up queries with complex demographic conditions. Also note that not all combinations need to be stored in the minidimension table: a customer in the age group <10 is unlikely to be in a high-income category. Question: Why make demographics_key a part of the fact table?

Dimensions change their characteristics, albeit slowly. When a customer changes her address, we can either modify the address in the customer's record in the Customer dimension table (and lose historical information) or insert a new customer dimension record to capture history.
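A minimal sketch of the second option, often called a "type 2" slowly changing dimension (the surrogate-key scheme and names below are illustrative assumptions):

-- the customer keeps her business id but gets a new surrogate key;
-- the old row, with the old address, is left untouched
insert into customer_dim (customer_key, customer_id, first_name, address, effective_date)
values (50871, 'C1042', 'Asha', '17 MG Road, Pune', date '2001-05-01');

-- facts loaded after the change reference the new surrogate key,
-- so history already recorded against the old key is preserved
select c.address, sum(f.sale_value)
from sales_fact f, customer_dim c
where f.customer_key = c.customer_key
group by c.address;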


Exercise: You may be able to limit the number of changes of interest; for example, you might rarely analyze data that is over a year old, and it is rare for marital status to change more than three times a year. How could you handle this efficiently?

A demographics_key can also undergo changes over time. A customer may move into a higher income bracket, requiring a change in demographics_key. It is easy to see that demographics_key can be treated just like an attribute, albeit a more complex one.

Unit 1.6 Aggregation strategies

Data-intensive queries access a large number of rows. Data-selective queries touch only a few rows, but contain complex and diverse selection criteria. Data warehouse end-users take more strategic decisions and hence execute a large number of data-intensive queries, unlike OLTP end-users. A typical data-intensive query would be "give me sales by region for each brand". Pre-aggregation strategies are required to reduce response time for such queries. A simple measure of the need for pre-aggregation is the compression ratio, which is (the number of rows reported) / (the number of rows retrieved).

Full aggregation refers to the precomputation and storage of all possible aggregates (i.e., combinations of all levels of all dimensions).

Exercise: How would you estimate the increase in the overall database size for full aggregation? How many additional tables are necessary to store precomputed aggregates?

The level technique allows the answer to this last question to be zero. With this technique, every aggregate fact is stored in the base fact table. Fact table rows that store product-category sales by store by day would have the product_key point to rows in the Product table that identify a product category rather than an individual product. A level field in the Product table would have level = 'category' for such rows and level = 'base' for rows representing individual products.

Exercise: How would you determine the total number of fact and dimension tables for full aggregation using a different dimension table for each level and a different fact table for each aggregate?

The level technique and the separate-table technique are of course two ends of the [full aggregation] spectrum. With a hybrid approach, some aggregates can be stored using the level technique while others can be stored in separate tables. It may be better to use the level technique for cases where new facts keep pouring in.

The alternative to precomputation is dynamic or SQL-based aggregation, which is meaningful for aggregates that are not needed often enough to warrant precomputation. For example, aggregate sales for product categories can be computed by selecting SUM(sale_value) and grouping by category. Aggregates for the complete product hierarchy (sales by sub-category, category, brand, etc.) can be computed by successive select statements that group at the respective level. Again, a hybrid approach is possible along this axis: for example, monthly totals may be precomputed (pre-aggregated) while yearly totals may be arrived at using SQL-based aggregation of monthly totals.

Aggregate navigators are tools that allow user applications to fire base-level SQL as though they were performing pure dynamic aggregation. The navigator maintains definitions of the current aggregation table structure, and uses this to rephrase the SQL to access the relevant aggregate tables instead. The algorithm typically looks for the smallest aggregate fact table whose associated dimension tables contain all the dimensional attributes required for the supplied query. Once this is done, the base-level fact and dimension table names are replaced with the aggregate fact and dimension table names.
For example, a base-level query such as:

select category_description, sum(qty)
from sales_fact, product, store, time
where { join conditions on product, store & time }
and store.city = 'Cincinnati'
and time.day = '01011996'
group by category_description

might be rewritten by the navigator as:

select category_description, sum(qty)
from category_sales_fact, category_product, store, time
where { join conditions on category_product, store & time }
and store.city = 'Cincinnati'
and time.day = '01011996'
group by category_description
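Returning to the level technique described above, a minimal sketch of how a query reads the aggregate rows that share the base fact table (names are illustrative assumptions):

-- aggregate fact rows point to Product rows whose level is 'category'
select p.category_description, sum(f.qty)
from sales_fact f, product p
where f.product_key = p.product_key
and p.level = 'category'
group by p.category_description;

Note that ordinary base-level queries must then constrain on level = 'base' to avoid double counting.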


Exercise: Can you use partitioning dimensions to handle adjectives for aggregates? How about "Average" itself as an adjective?

Unit 1.7 Self-Review Case Study on Dimensional Modeling

The following case study is to be read and studied along with the solution given; at the end of it one should be clear about how to draw a dimensional model. The key to creating an efficient MDDB application is thorough analysis of both the data and its users. After the data elements have been identified for reporting to the end users, the business entities fall into distinct groups of variables with similar characteristics, or dimensions. For example, consider a sales organization which sells articles to different customers through different suppliers spread across various geographical locations. From the transaction data (base tables) of the organization, we can design a fact table which contains the denormalized sales data at the granularity required for creating an MDDB. This fact table stores the Units and the Dollars of the sales volume at a daily level. The fact table for this case can be outlined as follows:

Product | Geography | Supplier | Time | Units | Dollars

In this example, the first four columns represent the key determinants of the two facts (products sold in Units and in Dollars). In an MDDB model, the fields of the four dimensions must intersect to determine the values of the facts. To create the dimensions for the MDDB which is to be built from the base table, it is advisable to have dimension tables for each of the dimensions. These dimension tables should be used to derive the dimension fields, hierarchies and other attributes, if required. In this case, the dimensions could be derived as follows:

Product:
PROD GRP   PROD FAMILY   ARTICLE
G1         F1            A1
G1         F2            A2

Geography:
COUNTRY   REGION   SHOP
C1        R1       S1
C2        R2       S2

Supplier:
GROUP SUPP   SUPPLIER
G1           SU1
G2           SU2

Time:
YEAR   QTR   MONTH   DAY
Y1     Q1    M1      D1
Y1     Q1    M1      D2

Measure:
Measure Code   Measure Name               Precision
Units          Products Sold in Units     Unit
Dollars        Products Sold in Dollars   K

Note that:
- This design complies with the classic star schema design of a data warehouse.
- The fact table contains a compound primary key, with one segment for each dimension, and additional columns of additive, numeric facts.
- Each dimension in the design has a defined hierarchy. The parent-child relationship (additive/semi-additive/non-additive) could be business driven. Alternatively, the dimension tables can be designed to carry an attribute indicating the level of each record.
- Each dimension contains (and is not restricted to) a key segment.

Deriving a dimension from a table in an MDDB is always advisable for the following primary reasons:
- Dimension size can be reduced by selecting only the valid dimension fields from the table, thus reducing the size of the MDDB.
- Modifications to the dimension hierarchies can be handled easily.
- It facilitates better maintenance of the cube build process.
- It facilitates standardization of dimensions across different MDDB applications in an organization.
- Descriptions and levels of the dimension fields can be stored in the tables.

The facts of the data have been clubbed into a Measure dimension to store different attributes of the facts and to handle any changes in the future (for example, Precision is one of the properties of the facts included here).
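A minimal DDL sketch of this star schema (the data types and key columns are illustrative assumptions, not part of the case study):

create table time_dim (
    time_key integer primary key,
    day      varchar(10),
    month    varchar(10),
    qtr      varchar(10),
    year     varchar(10)
);

create table product_dim (
    product_key integer primary key,
    article     varchar(20),
    prod_family varchar(20),
    prod_grp    varchar(20)
);

-- geography_dim, supplier_dim and measure_dim would be built the same way

create table sales_fact (
    product_key   integer references product_dim,
    geography_key integer,
    supplier_key  integer,
    time_key      integer references time_dim,
    units         numeric(12,2),
    dollars       numeric(14,2),
    primary key (product_key, geography_key, supplier_key, time_key)
);

The compound primary key mirrors the four dimensions, and the remaining columns are the additive, numeric facts noted above.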

Unit 2. Data Warehouse Architecture


Unit 2.1 Data Warehousing Architecture Model

The following components should be considered for a successful implementation of a data warehousing solution:
- An open data warehousing architecture with common interfaces for product integration
- Data modeling, with the ability to model star schemas and multi-dimensionality
- Extraction and transformation/propagation tools to load the data warehouse
- A data warehouse database server
- Analysis/end-user tools: OLAP/multidimensional analysis, report and query
- Tools to manage information about the warehouse (metadata)
- Tools to manage the data warehouse environment

Transforming operational data into informational data: Creating the informational data, that is, the data warehouse, from the operational systems is a key part of the overall data warehousing solution. Building the informational database is done with the use of transformation or propagation tools. These tools not only move the data from multiple operational systems, but often manipulate the data into a more appropriate format for the warehouse. This could mean (a simple SQL sketch is given after this list):
- The creation of new fields that are derived from existing operational data
- Summarizing data to the most appropriate level needed for analysis
- Denormalizing the data for performance purposes
- Cleansing the data to ensure that integrity is preserved
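A minimal sketch of such a transformation step, assuming hypothetical operational tables order_line and product_master and a warehouse table daily_product_sales (all names are illustrative):

-- derive, summarize and denormalize in a single load step
insert into daily_product_sales (sale_date, product_code, product_desc, units, revenue)
select o.order_date,
       o.product_code,
       p.product_desc,               -- denormalized from the product master
       sum(o.qty),                   -- summarized to the daily level
       sum(o.qty * o.unit_price)     -- derived field
from order_line o, product_master p
where o.product_code = p.product_code
group by o.order_date, o.product_code, p.product_desc;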


Even with the use of automated tools, however, the time and costs required for data conversion are often significant. Bill Inmon has estimated that 80% of the time required to build a data warehouse is typically consumed by the conversion process.

Data warehouse database servers -- the heart of the warehouse: Once ready, data is loaded into a relational database management system (RDBMS) which acts as the data warehouse. Some of the requirements of database servers for data warehousing include performance, capacity, scalability, open interfaces, multiple data structures, optimizer support for star schemas, and bitmapped indexing. Some of the popular data stores for data warehousing are relational databases like Oracle, DB2 and Informix, or specialized data warehouse databases like Red Brick and SAS. To provide the level of performance needed for a data warehouse, an RDBMS should provide capabilities for parallel processing on Symmetric Multiprocessor (SMP) or Massively Parallel Processor (MPP) machines, near-linear scalability, data partitioning, and system administration.

Data Warehousing Solutions - what is hot?

Solution Area                            Product                  Vendor
Report and Query                         Impromptu                Cognos
                                         BrioQuery                Brio Technology
                                         Business Objects         Business Objects Inc.
                                         Crystal Reports          Seagate Software
OLAP / MD analysis                       DSS Agent/Server         Microstrategy
                                         DecisionSuite            Information Advantage
                                         EssBase                  Hyperion Solutions
                                         Express Server           Oracle Corp.
                                         PowerPlay                Cognos Corporation
                                         Brio Enterprise          Brio Technology
                                         Business Objects         Business Objects
Data mining                              Enterprise Miner         SAS Institute
                                         Clementine               SPSS
                                         Discovery Server         Pilot Software
                                         Intelligent Miner        IBM
                                         Darwin                   Thinking Machines
Data Modeling                            ER/Win                   Platinum
Data extraction, transformation, load    DataPropagator           IBM
                                         InfoPump                 Platinum Technology
                                         Integrity Data Re-Eng.   Vality Technology
                                         Warehouse Manager        Prism Solutions
                                         PowerMart                Informatica
Databases for data warehousing           DB2                      IBM
                                         Oracle Server            Oracle
                                         MS SQL Server            Microsoft
                                         RedBrick Warehouse       Red Brick Corp.
                                         SAS System               SAS Institute
                                         Teradata DBS             NCR

Unit 3. Issues in Data Warehousing Projects

It is important to recognize the issues involved in building, and hence in managing, the data warehouse. Interestingly, some users may say that a data warehouse of only 50 gigabytes is not a full-fledged data warehouse, and they may refer to it instead as a data mart. For a smaller company, however, 50 gigabytes or even much less can represent every relevant piece of information covering the last 10 years, and can well constitute a powerful data warehouse.


The issues a project leader must keep in mind are [sas]:
1. Separating the data for business analysis from the operational data.
2. The logical transformation of the data, including data warehouse modeling and denormalization of the data.
3. The issues associated with physical transformation of the data.
4. The generation of summary views.

Unit 3.1 Issue 1: Carrying data from OLTP systems to the warehouse

The issues here are how and when to separate the data. The reasons for separating the operational data from the analysis data have not significantly changed with the evolution of data warehousing systems, except that they are now considered more formally during the data warehouse building process. In the analysis and design phase, building the data warehouse is a journey that starts from the existing ER model. With advances in technology, data warehousing systems have moved from merely producing standard reports to supporting very sophisticated online analysis, including multi-dimensional analysis.

Data warehousing systems are most successful when data can be combined from more than one operational system. When the data needs to be brought together from more than one source application, it is natural that this integration be done at a place independent of the source applications. The primary reason for combining data from multiple source applications is the ability to cross-reference data from these applications. Nearly all data in a typical data warehouse is built around the time dimension.

The data warehouse system can serve not only as an effective platform to merge data from multiple current applications; it can also integrate multiple versions of the same application. For example, an organization may have migrated to a new standard business application that replaces an old mainframe-based, custom-developed legacy application. The data warehouse system can serve as a very powerful and much needed platform to combine the data from the old and the new applications. Designed properly, the data warehouse can allow for year-on-year analysis even though the base operational application has changed.

Operational systems are designed for acceptable performance for pre-defined transactions. For example, an order processing system might specify the number of active order takers and the average number of orders for each operational hour. Even the query and reporting transactions against the operational system are most likely to be predefined with predictable volume. Even though many of the queries and reports that are run against a data warehouse are predefined, it is nearly impossible to accurately predict the activity against a data warehouse.

Data in the warehouse is mostly non-volatile. This attribute has many very important implications for the kind of data that is brought to the data warehouse and for the timing of the data transfer. Many data warehousing projects have failed miserably when they attempted to synchronize volatile data between the operational and data warehousing systems.

In short, the separation of operational data from the analysis data is the most fundamental data warehousing concept. Not only is the data stored in a structured manner outside the operational system; businesses today are allocating considerable resources to build data warehouses at the same time that the operational applications are deployed.

Unit 3.2 Issue 2: Logical transformation of operational data

The data is logically transformed when it is brought to the data warehouse from the operational systems. The issues associated with the logical transformation of data brought from the operational systems to the data warehouse may require considerable analysis and design effort. The architecture of the data warehouse and the data warehouse model greatly impact the success of the project. This section reviews some of the most fundamental concepts of relational database


theory that do not fully apply to data warehousing systems. Even though most data warehouses are deployed on relational database platforms, some basic relational principles are knowingly modified when developing the logical and physical model of the data warehouse.

One important concern is the possibility of synchronizing data in the source systems: e.g. if the product codes are not standard across the source systems, and product attributes are stored across systems, it becomes impossible to maintain all the product attributes in the warehouse. This is one of the most important concerns to be taken care of before initiating a data warehousing project. While data scrubbing and cleaning can take care of past data, for continuous updates to be handled in an efficient manner these requirements become essential.

The data warehouse model needs to be extensible and structured such that data from different applications can be added as a business case is made for that data. A data warehouse project in most cases cannot include data from all possible applications right from the start. Many of the successful data warehousing projects have taken an incremental approach, adding data from the operational systems and aligning it with the existing data.

The data warehouse model aligns with the business structure: a data warehouse logical model aligns with the business structure rather than the data model of any particular application. The same logic can be applied to entities in an entity relationship diagram, which are used as the starting point for operational systems: entities tend to be narrowly defined in individual applications, and one consolidated attribute base needs to be created for the warehouse (the concept of an enterprise data model is useful in this context).

Unit 3.2.1 De-normalization of data

A data modeler building a data warehouse takes the normalized logical data model and converts it into a physical data model that is significantly de-normalized. De-normalization reduces the need for database table joins in the queries. Some of the reasons for de-normalizing the data warehouse model are the same as they would be for an operational system, namely performance and simplicity.

Static relationships in historical data: another reason that de-normalization is an important process in data warehouse modeling is that the relationships between many attributes do not change in this historical data. An important example is the price of a product. The prices in an operational system may change constantly; some of these price changes may be carried to the data warehouse with a periodic snapshot of the product price table. In a data warehousing system you would carry, with each order, the list price of the product at the time the order was placed, regardless of the selling price for that order. In general, an operational system maintains dynamic relationships between business entities, whereas a data warehouse system captures relationships between business entities at a given point in time.
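A minimal sketch of this idea (names are illustrative): the order fact carries the list price that was in force when the order was placed, so later price changes in the operational product tables do not disturb history:

create table order_fact (
    order_key     integer primary key,
    product_key   integer,
    time_key      integer,
    qty           numeric(10,2),
    selling_price numeric(10,2),
    list_price    numeric(10,2)   -- denormalized copy, frozen at order time
);

-- discounting behaviour can then be analyzed without a price-history join
select t.year, avg(f.selling_price / f.list_price)
from order_fact f, time_dim t
where f.time_key = t.time_key
group by t.year;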

Unit 3.3 Issue 3: Physical transformation of operational data

Historical data and the current operational application data are likely to have some missing or invalid values. Physical transformation of the data homogenizes and purifies it; these data warehousing processes are typically known as data scrubbing or data staging. Physical transformation includes the use of easy-to-understand standard business terms, and standard values for the data. A complete data dictionary associated with the data warehouse can be a very useful tool. During these physical transformation processes the data is sometimes staged before it is entered into the data warehouse. The data may be combined from multiple applications during this staging step, or the integrity of the data may be checked during this process.


The terms and names used in the operational systems are transformed into uniform standard business terms by the data warehouse transformation processes. It is important to give a single physical definition to each attribute. As an attribute is defined physically for the data warehouse, it is essential to use meaningful data types and lengths; use the standard data length and data type for each attribute everywhere it is used. A functional data dictionary can facilitate this consistent use of physical attributes.

A second important point is to use entity attribute values consistently. All attributes in the data warehouse need to be consistent in the use of predefined values. Different source applications invariably use different attribute values to represent the same meaning. These different values need to be converted into a single, most sensible value as the data is loaded into the data warehouse. Or, if the data is to be used by the same set of users, one may need to store the different attributes too, so that users do not see a disconnect between their operational and decision support systems.

A far more important problem is inconsistent definition and use of the entities themselves, e.g. some applications may be storing information at the price code detail level (encoded in the product code), while others may be storing it at a planning code level (all price codes, variants, etc. are clubbed). Moreover, because of user habits, some of the codes used in the planning system may be outdated and replaced by new codes in the sales system. Clubbing information from multiple sources then becomes a big problem.
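A minimal sketch of such a value conversion during loading (the source codes shown are invented for illustration):

-- different source applications encode the same attribute differently;
-- map them to one standard value as the row enters the warehouse
select c.cust_no,
       case c.gender_code
            when 'M' then 'Male'
            when '1' then 'Male'
            when 'F' then 'Female'
            when '2' then 'Female'
            else 'Unknown'
       end as sex
from source_customer c;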

Unit 3.4 Issue 4: Data with default and missing values to be interpreted consistently.
The data brought into the data warehouse is sometimes incomplete or contains values that cannot be transformed properly. It is very important for the data warehouse transformation process to use intelligent default values for the missing or corrupt data, and to devise a mechanism for users of the data warehouse to be aware of these default values. Some data attributes can easily be defaulted to a reasonable value when the original is missing or corrupt. Other values can be obtained by referencing other current data; for example, a missing product attribute such as unit-of-measure on an order entity can be obtained by accessing the current product database.

Some attributes cannot be filled by defaults. In fact, it may be dangerous to attempt to assign a default for certain types of missing values: a poor default may corrupt the data and lead to invalid analysis at a later stage. In these cases, it is safest to leave the missing values blank, or to pick a specific value or symbol that indicates a missing value. Suppose, for example, that the data for February is not stored in the data warehouse: missing data for part of the year prevents any meaningful year-on-year analysis. It is therefore important to design a good system to log and identify data that is missing from the data warehouse. When a user runs a query against the data warehouse, it is essential to understand the population against which the query is run. Accurate and complete transformations help maintain the integrity of the data warehouse.
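A minimal sketch of defaulting during the load, using the standard SQL COALESCE function (table and column names are illustrative):

-- fall back to the current product database when unit_of_measure is
-- missing on the incoming order record; flag truly unknown cases
select o.order_no,
       coalesce(o.unit_of_measure, p.unit_of_measure, 'UNKNOWN') as unit_of_measure
from source_order o
     left join product_master p on p.product_code = o.product_code;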

Unit 3.5 Issue 5: Mapping data to reflect the business view

Summary views often are generated not only by summarizing the detail data but also by applying business rules to the detail data. For example, the summary views may contain a filter that applies the exact business rules for considering an order a sale or a filter that applies the business rules for allocating a sale to a channel entity. The summary views can hide the complexities of the detail data from the end user for many, if not most, analysis tasks. The business rules that are applied in generating summary views can be complex. These business rules may determine exactly what constitutes a sale or they may determine how a sale is allocated to a sales or channel entity. In addition to applying the business rules while generating summary views, the data warehousing system may perform complex database operations such as multi-table joins. Product sales may be computed by joining the Sales, Invoice, and Product


tables. The criteria to join these tables may be complex. While individuals mining data in the warehouse detail records need to understand all the complexities of business rules, most users can retrieve effective summary business information without fully understanding the detail data. The single most important reason for building the summary views is the significant performance gains they facilitate. The summary views in a data warehouse provide multiple views into the same detail data. These views are predefined dimensions into the detail data. These views provide an efficient method for the analyst to link with the detail data when necessary.
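A minimal sketch of such a summary view (the join criteria and the business-rule filter shown are invented placeholders):

create view product_sales_summary as
select p.product_code,
       t.month,
       sum(i.line_amount) as net_sales
from sales s
     join invoice i on i.sales_id = s.sales_id
     join product p on p.product_key = i.product_key
     join time_dim t on t.time_key = s.time_key
where s.order_status = 'BOOKED'      -- business rule: what counts as a sale
group by p.product_code, t.month;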

Unit 3.6 Issue 6: Selecting Tools to be used against the data warehouse
In most data warehousing projects, there is a need to select a preferred data warehouse access tool for the most active users. A small number of users generate most of the analysis activity against the data warehouse. The data warehouse performance can be tuned to the requirements of the tool appropriate for these active users. This tool can be used for training and demonstration of the data warehouse. A user can start with a low-level tool that is already familiar to him or her. After becoming familiar with the data warehouse he or she may be able to justify the cost and effort involved with using a more complex tool.

Unit 3.6.1 An Evaluation Checklist

The choice of an OLAP tool for a particular environment and application depends on the key requirements of the analysts, programmers and the end-users of the application. Before getting into the details of subjecting an OLAP tool to any evaluation criteria and performing any test, one needs to have the performance requirements very clear to guide the evaluation process. Some of the focus areas can be found out by having the following questions answered at the outset:[das/rak]

Data Access Features
- What would be the final data format?
- What are the common selection criteria to be used?
- Are mathematical operations (addition, subtraction, etc.) critical to the selection?
- Are statistical operations (statistical functions such as mean, average, standard deviation) critical?
- Is the use of other data manipulation features (sort, discard, filter, shifting) critical?

Data Exploration Features
- Is graphical representation very important?
- Are traffic-light analysis, key performance analysis, pattern matching, and lifetime analysis useful for the users?

Usability
- How much OLAP familiarity exists among the users?
- Do the users have any quantified performance expectations?
- Are expert options or user programming required?
- Is the GUI a deciding factor?
- What is the tolerable online response time?

Size and Scalability
- How big is the current volume of data?
- What is the data growth potential in the future? How fast is the data growing?
- Are the users platform-specific for the tool?

- What is the batch update window?
- What is the maximum number of dimensions for a single MDDB?
- What is the maximum aggregation level?

Critical Resources
- What are the critical system resources that need to be optimally used by the OLAP tool?
- What are the resources that are factors for evaluation?

Set of Features
- What features will be high priority for the users?
- Will the features help the users do their work more productively?

Limitations and Constraints
- What are the limitations and constraints that should be ruled out?
- Are there any problems in the current tool (if any)?

Unit 3.7 Performance considerations

Physical design for a data warehouse is concerned primarily with query performance and less with storage or update performance. Queries can be speeded up in two ways: by speeding up the retrieval of rows from an individual table, and by speeding up the multiple-table join process.

Speeding up retrieval. A bitmap index creates an array where the columns are the domain of the indexed field and the rows correspond to the rows of the table. If we indexed Marital_Status with values Single, Married and Other, we would have three columns in the bitmap. Each value in the array is an on/off bit that indicates the value of the field in the corresponding row. This indexing scheme speeds up row selection by the use of bitwise operations. Bitmap indexes are often used in conjunction with B-trees. Suppose you want to index product_category, for which there are 1000 distinct values, and suppose there are 100 leaves in the B-tree; each leaf will then represent a range of 10 values for the field being indexed (product_category). A bitmap can then be maintained at each leaf; each such bitmap will have the same number of rows as before but only 10 columns. Bitmap-based techniques are suitable only for low-cardinality data (i.e., the indexed field must have no more than a couple of hundred distinct values). For medium- and high-cardinality data, a more traditional B-tree implementation is usually used. Not surprisingly, bitmap-based indexing schemes are costly to update, but this is acceptable: remember that updates are a rare phenomenon in a data warehouse!

The use of aggregates can relieve the pressure to build indexes. A query that does not constrain on a given dimensional level (e.g., "get sales by brand by region" does not constrain on the product dimension levels below the level brand) can be redirected to a suitable aggregate table. Thus only one sort order on the master composite index on the fact table needs to be built. In the sales data warehouse, for example, this composite index could be time by product by store. Put differently, only queries that constrain on the lowest levels will use the base fact table and this composite index.

Speeding up joins. Traditional databases typically join two tables at a time; this can be disastrous for data warehouse queries. Worse, the performance varies dramatically with the order in which the tables are joined. A DBMS specialized for data warehouses will instead proceed as follows. First, all dimensional constraints are evaluated and a list of the respective primary keys is generated. These are combined to generate a sorted list of composite keys that is matched with the fact table index, which is itself a sorted list of composite keys. Note that this process can be parallelized, and indeed is, by leading database vendors.
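Returning to the bitmap indexes discussed under "Speeding up retrieval", a minimal sketch using Oracle-style syntax (the index and table names are illustrative; other warehouse DBMSs expose bitmap indexing differently):

-- low-cardinality attributes are good candidates for bitmap indexes
create bitmap index ix_cust_marital on customer_dim (marital_status);
create bitmap index ix_cust_income on customer_dim (income_bracket);

-- the optimizer can combine the two bitmaps to resolve this filter cheaply
select count(*)
from customer_dim
where marital_status = 'Single'
and income_bracket = 'High';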


Another possibility is to reduce the number of dimension tables by creating dimensions whose instances are actually combinations of two or more dimensions. If, for example, we were interested only in 200 brand-region combinations, a brand-region dimension table containing 200 rows could replace the individual brand and region tables.

Storage: aggregate explosion. Full aggregation is often dangerous in practice. Consider a sales data warehouse with 10,000 products and 100 stores. On any given day, not all products are sold in all stores; perhaps only 1% of the possible 10^6 combinations actually occur. Yet most products will be sold somewhere, and each store will sell something, so that both product-wise and store-wise [daily] aggregates are relevant. The number of such aggregate rows is 10,100 (10,000 product totals plus 100 store totals), so the database size will roughly double if these aggregates are precomputed. Adding a dimension (say, customer) will clearly compound the problem. (Exercise: how?) The compounded growth factor (CGF) [Pen] indicates the database size with full aggregation as a multiple of its size with no pre-aggregation, and is usually between 1.5 and 2.5 per dimension.

The solution to this problem is trial and error: use business analysis and query patterns to decide which aggregates are worth precomputing. Pareto's law can be expected to hold in this case: 80% of queries will utilize only 20% of all possible aggregates, so it is possible to meet performance requirements adequately by constructing only a fraction of all possible aggregates.
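As an illustration of building only the aggregates that earn their keep, the sketch below precomputes just the two daily aggregates discussed above (by product and by store) instead of the full set of possible combinations. The CREATE TABLE AS SELECT form and the table names (sales_fact, sales_by_product_day, sales_by_store_day) are assumptions made for the sake of the example.

-- Daily product-wise aggregate: at most about 10,000 rows per day
CREATE TABLE sales_by_product_day AS
SELECT product_key, time_key,
       SUM(sales_amount) AS sales_amount,
       SUM(units_sold)   AS units_sold
FROM   sales_fact
GROUP  BY product_key, time_key;

-- Daily store-wise aggregate: about 100 rows per day
CREATE TABLE sales_by_store_day AS
SELECT store_key, time_key,
       SUM(sales_amount) AS sales_amount,
       SUM(units_sold)   AS units_sold
FROM   sales_fact
GROUP  BY store_key, time_key;

Any further aggregates (by month, by product group, and so on) would be added only if query patterns show they are used often enough to justify the storage.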

Unit 3.8

Risks in Datawarehousing Projects.

* Definitions of data are almost always inconsistent across user types, upstream data systems and applications; no two users or systems agree on a common definition easily. This will surface in the RA/Design stage.
* Data ownership in data warehouses is weak, so the sanctity of the data is often suspect. This comes out as a problem only when an application is developed to show the data to the users, i.e., in the Testing/Implementation stage.
* There is high dependency on upstream systems: any delay in making the upstream interfaces ready affects the RA/Design/Development cycle.
* In a reporting application, problems that mostly originate in the upstream systems are attributed to the application itself, giving rise to end-user dissatisfaction. This will surface in the Acceptance/Testing/Implementation stages.
* Obtaining test data for data validation is a risk if the real data is highly confidential. This affects the Development/Testing stage.
* The development cycle is squeezed because of the high visibility of the reporting application, while the background process of collecting data from different sources is not visible to the users. This affects the Development/Implementation stage.
* Data-processing time needs to be minimized to ensure availability of the most up-to-date data worldwide at all times. This affects the Production/Implementation stage.
* Since the datamart/warehouse support team is located in one country, it is often expected to address issues from users located in different time zones. This affects the post-production/maintenance stage. [dash]

Unit 3.9

Case study: Insurance

This case study will touch upon almost all concepts that have been introduced before. Think of an insurance company that insures automobiles, homes and individuals. A transaction is related either to policy formulation or to claims processing. An insurance company sells coverages, and the data warehouse is to be used to assess the profitability of the existing coverages. Also of importance is the efficiency of claims redressal. Exercise: List some queries that would be appropriate for an insurance data warehouse to address.


Because a transaction is related either to policy formulation or to claims processing, we can have two fact tables, one for policy formulation and one for claims processing. Question: What about having just one fact table? Conversely, what about having one fact table for each type of claim (small/large) or each type of policy (domestic/automobile/industrial)?

The policy creation and claims processing fact tables have the following structures (attributes in italics indicate those which will also appear in the monthly policy and claims snapshots).

Policy Creation Fact Table
TransactionDate: time dimension with alternate hierarchies
EffectiveDate: synonym of TransactionDate
InsuredParty#: big, dirty dimension
Employee# (Agent/Broker/Rater/Underwriter)
Coverage#: this is the company's product; may be represented nicely as a subtype/supertype hierarchy
CoveredItem#: each Coverage specifies some CoveredItems
Policy#: quite possibly a degenerate dimension
TransactionType: (Create/Rate/Underwrite/Cancel...)
Fact: set of transaction attributes functionally determined by the above; the attribute set may differ for each coverage

Additional policy creation snapshot attributes:
Premium accrued
Premium due
No-claims bonus (derived)

Claims Processing Fact Table
TransactionDate
EffectiveDate
InsuredParty#
Employee#: the authorizer of the claim
Coverage#
CoveredItem#
Policy#
Claimant#: usually a dirty dimension
Claim#: a codified description of a claim
ThirdParty#: (Witness/Expert/Payee); this assumes only one predefined third party is involved
TransactionType: (Open/Set reserve/Inspect/Pay...)
Fact: set of transaction attributes functionally determined by the above

Additional claims processing snapshot attributes:
OutstandingClaims: number of outstanding claims at a point in time (semi-additive)
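A minimal relational sketch of the policy creation star is given below, assuming integer surrogate keys and a heavily simplified column list; every table and column name here is illustrative rather than part of the case study itself.

CREATE TABLE policy_creation_fact (
    transaction_date_key INTEGER     NOT NULL REFERENCES time_dim,
    effective_date_key   INTEGER     NOT NULL REFERENCES time_dim,
    insured_party_key    INTEGER     NOT NULL REFERENCES insured_party_dim,
    employee_key         INTEGER     NOT NULL REFERENCES employee_dim,
    coverage_key         INTEGER     NOT NULL REFERENCES coverage_dim,
    covered_item_key     INTEGER     NOT NULL REFERENCES covered_item_dim,
    policy_number        VARCHAR(20) NOT NULL,  -- degenerate dimension: no policy_dim table
    transaction_type     VARCHAR(12) NOT NULL,  -- Create/Rate/Underwrite/Cancel...
    premium_amount       DECIMAL(12,2)          -- one of the transaction-level facts
);

The claims processing fact table would look similar, with Claimant#, Claim#, ThirdParty# and its own fact columns added.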

Questions:
InsuredParty is a dirty dimension, which means that multiple instances of InsuredParty may actually represent the same insured party. How can we clean it?
Does CoveredItem need to be distinguished from Coverage? Or does a coverage automatically specify a CoveredItem too? Is CoveredItem a big dimension? Will CoveredItem figure in a fact table representing a monthly policy status snapshot?
The Rater may assign a risk_grade to the policy during the Rate transaction. To which table/dimension does risk_grade belong?
A degenerate dimension is one which has no separate dimension table. How might Policy be a degenerate dimension?
The CoveredItem called automobile has attributes different from the CoveredItem called computer. Are Automobile and Computer subclasses (subtypes) or instances of CoveredItem? When and how would you split the policy fact table to handle subtypes represented by different attribute sets of Fact? Would this require queries to always access more than one fact table?
Should the attributes of a claim (as represented by the codified identifier Claim#) constitute a dimension, or should they be part of the FACTS of the claims processing fact table? Note that the attribute set of a claim depends on the coverage scheme.

Coverages come in a huge number of flavors, just as do customers (insured parties). The set of attributes that are most often specified in combination to browse coverages can be hived off into a minidimension; examples of such attributes could be risk_level and market_segment.

The subset of rows in the policy creation fact table that represent policy cancellations may have no useful FACT attributes, since the purpose of each such row is merely to record the fact of cancellation of an existing policy. If we fragment the fact table horizontally so as to create a
separate table to store just policy cancellations, we have what is called a factless fact table. Each row of a factless fact table contains only foreign keys corresponding to the relevant dimensions, but no fact attributes per se.

A factless fact table can also be used to store information about coverages and covered items that did not have any buyers in, say, a given month. In the simplest case this table would have columns for the foreign keys Coverage#, CoveredItem# and Month. At the end of each month, a row would be inserted into this table for each Coverage/CoveredItem combination that did not attract any buyers (did not figure in any new policy). This is an example of a coverage table (not to be confused with insurance coverages). Exercise: Write an SQL query to determine the number of coverage/covered item combinations that did not figure in any new policy in a given month (a possible sketch appears after the Workout below).

Aggregates. Imagine a requirement to report, for each coverage, the total premiums received and total claim payments made by calendar month. This needs to be further aggregated to report total premium and claims payments made by month. Note that the requirement to aggregate across policies usually implies a requirement to also aggregate across other dimensions such as Employee and InsuredParty to get a useful result.

Exercises:
How would you design the warehouse to handle a query along the lines of "How many of our customers have chosen money-back policies?" [Hint: this may require some changes to the operational system too.]
If each precomputed aggregate is stored in a separate aggregate fact table, what is the maximum number of such aggregate fact tables required?

Workout: A sales and marketing data warehouse needs to be built for a manufacturer of hospital health care products. Salesmen are assigned territories, which roll up to districts, regions and areas. A product rolls up to subgroup, group and family, where a subgroup is defined by the assembly line that it rolls out from. (A single manufacturing facility may, of course, have more than one assembly line.) The products are typically bought by hospitals through buying groups to which the respective hospitals belong, though a hospital may sometimes buy through a direct contract or even through a buying group of which it is not a member. We need to be able to analyze these different types of sales. Design the data warehouse.
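A possible sketch of the month-end coverage table processing is given below; the table names (coverage_gap, coverage_covered_item, policy_creation_fact) and the month_key value are illustrative assumptions, and real implementations will differ in how the month and the valid coverage/covered-item combinations are identified.

-- Insert one row for each coverage / covered-item combination that
-- appeared in no new policy during the month just closed
INSERT INTO coverage_gap (coverage_key, covered_item_key, month_key)
SELECT c.coverage_key, c.covered_item_key, 200105
FROM   coverage_covered_item c
WHERE  NOT EXISTS (
         SELECT 1
         FROM   policy_creation_fact f
                JOIN time_dim t ON f.transaction_date_key = t.time_key
         WHERE  f.coverage_key     = c.coverage_key
         AND    f.covered_item_key = c.covered_item_key
         AND    f.transaction_type = 'Create'
         AND    t.month_key        = 200105);

-- The exercise's count is then a single scan of the coverage table
SELECT COUNT(*)
FROM   coverage_gap
WHERE  month_key = 200105;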

Unit 3.10 Self-Review: A Case Study to arrive at a complete Datawarehouse Solution.

Data Warehousing for Finance Systems


About this case study

Customer Name: A leading computer manufacturer
Industry: Computer Hardware and Software Manufacture
Project(s) Name: Data Warehousing
Service offered: RA/Design/Development/Implementation/Warranty
Focus Area: Data Reporting for Finance and Operations
Technology: RDBMS/OLAP
User Profile: Finance Managers, Analysts, Planners and CFO office
Application Features: (listed below)


24*7 Worldwide Access
Executive Level Slicing/Dicing/Drilling
Global application with regional granularity
Seamless DSS to corp. changes
2 Data warehouses, 1 Datamart and 6 OLAP Cubes
4 Years Historical, Current and 2 years Forecast data available for analysis on MDDB
Upstream/downstream interfaces
Ad-hoc and Canned reporting
Bookmarking
Business KPIs
Robust security features
Admin features
Bulletin Board

Client Profile:
The client designs, manufactures and markets personal computers and related personal computing and communication solutions for sale primarily to education, creative, consumer and business customers. It is a leader in computing, known for revolutionary products and innovative designs.

Client's driver for the warehousing project:


Why did the customer have to undertake this project? What were its drivers? What did it want at the end of the day?

The OLAP Reporting Systems started as part of an initiative to meet the analytical reporting needs of the Client's Finance & Operations executives through presentation of useful data in a timely and accurate manner. Finance executives at the Client's site needed key global and regional Actual, Plan and Forecast data for Trend Analysis, Visualization, Budgeting, Planning and Modeling to support decision-making. The transaction systems did not segment or aggregate data for business analysis, so there was a need for a common, consistent, fast and analysis-ready tool for Finance users worldwide. Because the financial data is highly sensitive, two-layer user security was required to prevent users from accessing data outside their area, as well as during the freeze period. Before this project, Finance data available on mainframe systems was manually processed to produce custom reports for top management; access was restricted, and was limited by user expertise to data within a certain timeframe.


Unit 4.

OLAP FUNDAMENTALS.

In 1993, E.F. Codd & Associates published a white paper, commissioned by Arbor Software (now Hyperion Solutions), entitled 'Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate'. Dr Codd is, of course, very well known as a respected database researcher from the 1960s through to the late 1980s and is credited with being the inventor of the relational database model, but his OLAP rules proved to be controversial due to being vendor-sponsored, rather than mathematically based. [olap]

Basic Features
1. Multidimensional Conceptual View. Dr Codd believes this to be the central core of OLAP.
2. Intuitive Data Manipulation. Dr Codd prefers data manipulation to be done through direct actions on cells in the view, without recourse to menus or multiple actions.
3. Accessibility. In this rule, Dr Codd essentially describes OLAP engines as middleware, sitting between heterogeneous data sources and an OLAP front-end. Most products can achieve this, but often with more data staging and batching than vendors like to admit.
4. Batch Extraction vs Interpretive. This rule effectively requires that products offer both their own staging database for OLAP data and live access to external data. Today, this would be regarded as the definition of a hybrid OLAP, which is indeed becoming the most popular architecture, so Dr Codd has proved to be very perceptive in this area.
5. OLAP Analysis Models. Dr Codd requires that OLAP products should support all four analysis models that he describes in his white paper (Categorical, Exegetical, Contemplative and Formulaic). Perhaps Dr Codd was anticipating data mining in this rule?
6. Client Server Architecture. Dr Codd requires not only that the product should be client/server but that the server component of an OLAP product should be sufficiently intelligent that various clients can be attached with minimum effort and programming for integration. This is a much tougher test than simple client/server, and relatively few products qualify. Perhaps he
was anticipating a widely accepted API standard, which OLE DB for OLAP is expected to become.
7. Transparency. This test is also a tough but valid one. Full compliance means that a user of, say, a spreadsheet should be able to get full value from an OLAP engine and not even be aware of where the data ultimately comes from. Like the previous feature, this is a tough test for openness.
8. Multi-User Support. Dr Codd recognizes that OLAP applications are not all read-only and says that, to be regarded as strategic, OLAP tools must provide concurrent access (retrieval and update), integrity and security.

Special Features
9. Treatment of Non-Normalized Data. This refers to the integration between an OLAP engine and de-normalized source data. Dr Codd points out that any data updates performed in the OLAP environment should not be allowed to alter stored de-normalized data in feeder systems; changes made in the OLAP environment should instead be regarded as calculated cells within the OLAP database.
10. Storing OLAP Results: Keeping Them Separate from Source Data. This is really an implementation rather than a product issue. In effect, Dr Codd is endorsing the widely held view that read-write OLAP applications should not be implemented directly on live transaction data, and that OLAP data changes should be kept distinct from transaction data.
11. Extraction of Missing Values. All missing values are cast in the uniform representation defined by the Relational Model.
12. Treatment of Missing Values. All missing values are to be ignored by the OLAP analyzer regardless of their source.

Reporting Features
13. Flexible Reporting. Dr Codd requires that the dimensions can be laid out in any way that the user requires in reports. We would agree, and most products are capable of this in their formal report writers. Dr Codd does not explicitly state whether he expects the same flexibility in the interactive viewers.
14. Uniform Reporting Performance. Dr Codd requires that reporting performance not be significantly degraded by increasing the number of dimensions or database size. Curiously, nowhere does he mention that the performance must be fast, merely that it be consistent. There are differences between products, but the principal factor that affects performance is the degree to which the calculations are performed in advance and where live calculations are done (client, multidimensional server engine or RDBMS). This is far more important than database size, number of dimensions or report complexity.
15. Automatic Adjustment of Physical Level. Dr Codd requires that the OLAP system adjust its physical schema automatically to adapt to the type of model, data volumes and sparsity.

Dimension Control
16. Generic Dimensionality. Dr Codd takes the purist view that each dimension must be equivalent in both its structure and operational capabilities. However, he does allow additional operational capabilities to be granted to selected dimensions (presumably including time), but he insists that such additional functions should be grantable to any dimension.
17. Unlimited Dimensions & Aggregation Levels. Technically, no product can possibly comply with this feature, because there is no such thing as an unlimited entity on a limited computer. In any case, few applications need more than about eight or ten dimensions, and few hierarchies have more than about six consolidation levels.
18. Unrestricted Cross-dimensional Operations. Dr Codd asserts, and we agree, that all forms of calculation must be allowed across all dimensions, not just the 'measures' dimension. In fact, many products that use only relational storage are weak in this area. Most products with a multidimensional database are strong. These types of calculations are important if you are doing complex calculations, not just cross tabulations, and are particularly relevant in applications that analyse profitability.


In an OLAP server, data can be stored in three different ways:
Multidimensional OLAP (MOLAP)
Relational OLAP (ROLAP)
Hybrid OLAP (HOLAP)


MOLAP
MOLAP is a high performance, multidimensional data storage format. With MOLAP, data is stored on the OLAP server. MOLAP gives the best query performance, because it is specifically optimized for multidimensional data queries. MOLAP storage is appropriate for small to medium-sized data sets where copying all of the data to the multidimensional format would not require significant loading time or utilize large amounts of disk space.

ROLAP
With ROLAP, the data remains in the original relational tables. A separate set of relational tables is used to store and reference aggregation data. ROLAP is ideal for large databases or legacy data that is infrequently queried.

HOLAP
HOLAP combines elements from MOLAP and ROLAP. HOLAP keeps the original data in relational tables but stores aggregations in a multidimensional format. HOLAP provides connectivity to large data sets in relational tables while taking advantage of the faster performance of the multidimensional aggregation storage.


References and bibliography:
[Fis] L. Fisher (1996). Along the Infobahn: Data Warehouses, in Strategy & Business, Booz, Allen and Hamilton, Inc.
[Guff] F. McGuff (1997). Data Modeling for Data Warehouses, http://members.com/fmcguff/dwmodel
[Inm] W. H. Inmon (199?). Building the Data Warehouse, John Wiley, NY.
[In2] W. H. Inmon (1996). Creating the Data Warehouse Data Model from the Corporate Data Model, Prism Solutions Tech Topic Vol. 1 No. 2, Prism Solutions Inc., Sunnyvale, CA.
[Kim] R. Kimball (1996). The Data Warehouse Toolkit, John Wiley, NY. (Primary reference.)
[Mer] M. E. Meredith and A. Khader (1997). Divide and Aggregate: Designing Large Warehouses, technical report, Miller Freeman Inc.
[STG] Designing the Data Warehouse on Relational Databases, technical report, Stanford Technology Group, Inc. (an Informix Company).
[Pen] N. Pendse (1997). Database Explosion, Business Intelligence Ltd.
[an] Anjaneyulu Marempudi (marempudi@msn.com). Data Warehousing - For Better Business Decisions.
[dash] Assis Dash. Data Warehousing for Finance Systems, Infosys.
[sas] SAS Institute website, http://www.sas.com
[olap] N. Pendse (2000). What is OLAP?, Business Intelligence Ltd.
[dash/rakesh] Assis Dash and Dr. Rakesh Agarwal. Multidimensional Database Tool Evaluation.
