Introduction to Data Warehousing

Topics Covered
• OLTP (Online Transaction Processing) System
• OLTP Data – Nature
• OLTP Shortcomings
• DWH Emergence
• What is Data Warehouse?
• Data Warehouse vs. Operational Systems
• DWH – Characteristics and Attributes
• Features of Data Warehouse
• Elements of Data Warehouse
• Method of Development - Operational Systems
• Method of Development - Data Warehouse
• DWH - Architecture

Topics Covered (contd.)
• Data Modeling Techniques
• Dimension Modeling
• Star Schema
• Snowflake Schema
• Design Principle
• Data Marts
• Metadata
• Surrogate Keys
• Types of Facts
• Slowly Changing Dimension
• Conformed Dimension
• Factless Fact
• Data Warehouse (Do’s & Don’ts)
• On Line Analytical Processing
• ROLAP/MOLAP/HOLAP
• Data Mining
• ETL/OLAP/Data Mining Tools

OLTP (Online Transaction Processing) System
• Improves operational efficiency
• Produces daily or monthly reports to be used by middle and lower management
• Keeps detailed information

OLTP Data – Nature
• High volume
• Changes with time
• Only current data available
• Answers simple queries
• Little help to decision makers

OLTP Shortcomings
• Focus on transactions
• Large amount of data, but
  – Related to transactions
  – Does not maintain historical data
  – Does not maintain summarized data
• Does not support analytical reports

DWH Emergence
• Management more information conscious
• Desktop power increasing
• Hardware prices decreasing
• Increasing power of server software
• Explosion of the internet
• End users more technology savvy

What is Data Warehouse?
• A data warehouse is a structured, extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long period, expressed in simple business terms, and summarized for quick analysis.
• Collection of corporate information, derived directly from operational systems and some external data.
• Also defined as a subject-oriented, integrated, non-volatile, time-variant (historical) collection of data designed to address the DSS needs.
• The purpose of a data warehouse is to support business decisions and not business operations.

Data Warehouse vs. Operational Systems
• Operational systems are primarily concerned with the handling of a single transaction.
• They basically deal with pre-defined events and hence require faster access.
• Each transaction deals with a small amount of data.
• A data warehouse deals with large amounts of data, which are aggregate in nature.
• A data warehouse is developed incrementally (time taken to deliver the benefits is long).
…Contd

• Since the usage patterns of the data warehouse and the operational system are different, data from the operational system and the data warehouse should not be mixed.
• Time sensitivity of data
  – Operational system requires current data
  – Data warehouse requires historical data

DWH – Characteristics and Attributes
A Data Warehouse is
• Subject Oriented
• Integrated
• Time-variant
• Non-volatile
• Summarized

DWH – Characteristics and Attributes
• Subject Oriented
  – The Data Warehouse world is oriented around the major subjects of the enterprise such as Customer, Product, Vendor. On the contrary, the Operational world is designed around applications such as loans, savings, pensions, insurance etc.
• Integrated
  – Data found in the warehouse is integrated. Always, with no exception.
  – Data is integrated in terms of
    • Naming convention
    • Consistent measurement of variables
    • Encoding structures etc.

DWH – Characteristics and Attributes
• Time-variant
  – Data found in the warehouse is time variant (time series). In other words, the data represents a long time horizon, from five to ten years. The time horizon for an operational system is 60-90 days.
  – DWH may not have the most current information.
• Nonvolatile
  – Data is loaded into the warehouse and after that the data in the warehouse does not change. The Data Warehouse is updated by batch processing and no online updates are allowed.
  – Only the following kinds of operations occur in data warehousing
    • Initial loading of data
    • Access of data
    • Periodic addition of data

DWH – Characteristics and Attributes
• Summarized
  – In a Data Warehouse, summary views and aggregates of the operational data are kept so as to provide faster retrieval of aggregated information.

Features of Data Warehouse
• Repository of information
• Improved access to integrated data
• Provides historical perspective
• Variety of end-users use it for different purposes
• Requires a major system integration effort
• Reduces the reporting and analysis impact on operational systems

Elements of Data Warehouse
• Source:
  – Flat files
  – Source database
  – Any other form
• Data Staging Area: intermediate area
• Target:
  – Database which holds the Data Warehouse or Data Mart

Method of Development - Operational Systems
  – Define requirements
  – Analysis and planning based on requirements
  – Model (E-R Model)
  – Physical design
  – Development (coding)
  – Quality assurance and user acceptance
  – Implementation

Method of Development - Data Warehouse (uses an iterative development methodology)
• Subject Definition
• Data Identification or Data Discovery
• Data Acquisition
• Data Cleansing
• Data Transformation
• Data Loading
• Exploitation

Method of Development - Data Warehouse
• Subject Definition
  – What do I want to analyze?
  – What would be the dimensions?
  – Steps
    • Logical concept
    • Build logical data model
    • Develop transformational model
    • Translate logical model to physical model
• Data Identification or Data Discovery
  – How can I get what I want to analyze?
  – Where is the needed information/data stored?

Method of Development - Data Warehouse
• Data Acquisition
  • Extracting data from RDBMS/DBMS/flat files
• Data Cleansing
  • Removal of inconsistent data
  • Removal of unwanted data
  • Removing extreme cases (data-mining)
• Data Transformation
  • Convert to consistent business-oriented format.
  • Generate derived information not stored in OLTP systems.

Method of Development - Data Warehouse
Data Transformation
• Two steps are involved in this process.
  – Integration and Conversion
    » Consistent naming conventions
    » Consistent encoding structures
  – Summarization
    » Keeps summarized information
    » Reduces the volume of data to be processed
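The two transformation steps can be sketched in a few lines of Python. This is only an illustration: the two source systems (loans and savings), their field names, the gender encodings and the function names are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific gender codes to one warehouse encoding.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}

def integrate(record, amount_field):
    """Integration and conversion: consistent names and encodings."""
    return {
        "customer": record["cust"].strip().title(),
        "gender": GENDER_CODES[str(record["gender"]).lower()],
        "amount": float(record[amount_field]),
    }

def summarize(rows):
    """Summarization: keep one total per customer instead of every row."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)

# Two assumed source extracts with different field names and encodings.
loans = [{"cust": "christina ", "gender": "f", "loan_amt": "100.0"}]
savings = [{"cust": "Christina", "gender": 0, "bal": 250}]

clean = [integrate(r, "loan_amt") for r in loans] + [integrate(r, "bal") for r in savings]
summary = summarize(clean)
```

After integration both rows use the same customer name and gender code, so the summarization step can combine them into a single total.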

Method of Development - Data Warehouse
• Loading the Warehouse
  • Periodic loading from OLTP environment.
  • Provides a time variant attribute to data.
• Exploitation
  • Enables the users to view, analyze and report on data
    – Simple query and reporting
    – Multidimensional analysis
    – OLAP using Slice and Dice, Drilling

DWH - Architecture
• Major Components
  – Data identification
  – Cleanup
  – Extraction, transformation and loading tools
  – Metadata repository
  – Data Marts
  – Data query, reporting, analysis and mining tools
  – Data Warehouse administration and management

Data Warehouse
Advantages of DWH:
• There are many advantages to using a data warehouse; some of them are:
• Data warehouses enhance end-user access to a wide variety of data.
• Decision support system users can obtain specified trend reports, e.g. the item with the most sales in a particular area within the last two years.
• Data warehouses can be a significant enabler of commercial business applications, particularly customer relationship management (CRM) systems.

Business Intelligence:
Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining.
Business intelligence applications can be:
• Mission-critical and integral to an enterprise's operations or occasional to meet a special requirement
• Enterprise-wide or local to one division, department, or project
• Centrally initiated or driven by user demand
The term business intelligence (BI) dates to 1958.[1] It refers to technologies, applications, and practices for the collection, integration, analysis, and presentation of business information and also sometimes to the information itself. BI systems provide historical, current, and predictive views of business operations, most often using data that has been gathered into a data warehouse or a data mart and occasionally working from operational data. Software elements support reporting, interactive "slice-and-dice" pivot-table analyses, visualization, and statistical data mining. Applications tackle sales, production, financial, and many other sources of business data for purposes that include, notably, business performance management.

DWH - Architecture
[Figure: architecture diagram. Labels: source systems FS1, FS2, ..., FSn and Legacy System; Extraction; Transmission over the Network; Staging Area with Cleansing and Transformation; Aggregation, Summarization and Data Mart Population; ODS, DW and data marts DM1, DM2, ..., DMn; OLAP, Analysis and Knowledge Discovery; Metadata Layer.]

Data Modeling Techniques
• ER Modeling is based on the entities and the relationships between those entities. The ER model is an abstraction tool because it can be used to understand and simplify the ambiguous data relationships in the business world.
• Dimension Modeling uses three basic concepts:
  – Measures
  – Facts
  – Dimensions
Dimension Modeling is powerful in representing the requirements of business users in the context of database tables.

Dimension Modeling
• Measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions.
• Fact is a logical collection of related measures and dimensions. In a Data Warehouse, facts are implemented in the core tables, consisting of measures, in which all the numeric data is stored.
• Dimensions are the parameters over which we want to perform Online Analytical Processing.
• Dimension hierarchies enable dimensions to be arranged into one or many hierarchies. Each hierarchy can also have multiple hierarchy levels.

Dimension Modeling
• The business model translates into a specific design called the DIMENSIONAL MODEL (also called STAR MODEL).
• The outcome of the DIMENSIONAL MODEL is the STAR SCHEMA or SNOWFLAKE SCHEMA.

Star Schema
• Attributes
  • A single fact table, with detail and summary data
  • Fact table primary key has only one key column per dimension
  • Each dimension is a single table, highly de-normalized
• Benefits
  • Easy to understand, easy to define hierarchies, reduces number of physical joins, low maintenance, very simple metadata.
• Drawbacks
  • Summary data in the fact table yields poorer performance for summary levels; huge dimension tables are a problem.

Star Schema
A group of facts connected to multiple dimensions
[Figure: fact table Financial Transactions in the center, connected to the dimensions Channel, Time, Customer, Organization and Product.]
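A star schema of this shape can be sketched with Python's built-in sqlite3 module. The table and column names below are assumptions chosen for the example, and only three of the five dimensions are modeled to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one key column per dimension plus the numeric measure.
CREATE TABLE fact_transactions (
    time_key     INTEGER REFERENCES dim_time(time_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    amount       REAL,
    PRIMARY KEY (time_key, customer_key, product_key)
);
INSERT INTO dim_time VALUES (1, '2003-01-14', '2003-01'), (2, '2003-01-15', '2003-01');
INSERT INTO dim_customer VALUES (1001, 'Christina', 'Illinois');
INSERT INTO dim_product VALUES (1, 'Savings'), (2, 'Loan');
INSERT INTO fact_transactions VALUES
    (1, 1001, 1, 100.0), (2, 1001, 1, 50.0), (2, 1001, 2, 75.0);
""")

# One join per dimension used in the query: total amount by month and product.
rows = conn.execute("""
    SELECT t.month, p.name, SUM(f.amount)
    FROM fact_transactions f
    JOIN dim_time t    ON t.time_key = f.time_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY t.month, p.name
    ORDER BY p.name
""").fetchall()
```

Each dimension is a single flat table, so any analysis needs at most one join per dimension it touches.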

Snowflake Schema
• The snowflake schema is an extension of the star schema, where each point of the star explodes into more points.
• Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table.
• While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.
Usage: Whether one uses a star or a snowflake largely depends on personal preference and business needs.

Snow-flake Schema (= Extended Star Schema)
• A group of facts connected to dimensions, which are split across multiple hierarchies and attributes
[Figure: fact table Financial Transactions connected to the dimensions Time, Product, Channel, Organization and Customer, with Segment and Geography split out as further tables.]

Design Principle
• The first step in design is to decide what business process(es) to model, by combining an understanding of the business with an understanding of what data is available.
• The second step in the design is to decide on the grain of the fact table in each business process.

Design Principle
Designing a Fact Table.
The first step in designing a fact table is to determine the granularity of
the fact table. By granularity, we mean the lowest level of information that
will be stored in the fact table. This involves two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information
   will be kept.
Which Dimensions To Include
Determining which dimensions to include is usually a straightforward
process, because business processes will often dictate clearly what the
relevant dimensions are. The determining factors usually go back to the
requirements.
For example, in an off-line retail world, the dimensions for a sales fact
table are usually time, geography, and product. This list, however, is by
no means a complete list for all off-line retailers.


What Level Within Each Dimension To Include
• Determining which part of the hierarchy the information is stored along
each dimension is a bit more tricky. This is where user requirements
(both stated and possibly future) play a major role.
• In the above example, will the supermarket want to do analysis
at the hourly level? (i.e., looking at how certain products may
sell by different hours of the day.) If so, it makes sense to use 'hour' as
the lowest level of granularity in the time dimension.
If daily analysis is sufficient, then 'day' can be used as the lowest level
of granularity.
• Note that sometimes the users will not specify certain requirements,
but based on the industry knowledge, the data warehousing team may
foresee that certain requirements will be forthcoming that may result in
the need of additional details. In such cases, it is prudent for the data
warehousing team to design the fact table such that lower-level
information is included. This will avoid possibly needing to re-design
the fact table in the future. On the other hand, trying to anticipate all
future requirements is an impossible


Design Principle


• A data warehouse almost always demands data expressed at the
lowest possible grain of each dimension, not because queries need
to cut through the database in very precise ways, but...
• Efforts to normalize any of the tables in a dimensional database
solely in order to save disk space are a waste of time.
• Dimension tables must not be normalized but should remain as
flat tables. Normalized dimension tables destroy the ability to
browse, and the disk space savings gained by normalizing the
dimension tables are typically less than percent of the total disk
space needed for the overall schema.


Design Principle
• The use of pre-stored summaries (aggregates) is the single most effective tool the data warehouse designer has to control performance.
• Each type (grain) of aggregate should occupy its own fact table, and should be supported by the proper set of dimension tables containing only those dimensional attributes that are defined for that grain of aggregate.
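A pre-stored aggregate at its own grain can be sketched as below. The daily sales rows, the store and product codes, and the function name are assumptions made for the example; the aggregate keeps only the attributes defined at the (month, store) grain.

```python
from collections import defaultdict

# Base fact rows at daily grain: (date, store, product, sales_amount).
daily_sales = [
    ("2003-01-14", "S1", "P1", 10.0),
    ("2003-01-15", "S1", "P1", 5.0),
    ("2003-02-01", "S1", "P1", 7.0),
    ("2003-01-15", "S2", "P1", 3.0),
]

def build_monthly_aggregate(rows):
    """Roll daily rows up into a separate (month, store) aggregate table."""
    agg = defaultdict(float)
    for date, store, _product, amount in rows:
        month = date[:7]  # 'YYYY-MM' serves as the month-level key
        agg[(month, store)] += amount
    return dict(agg)

# The aggregate lives in its own table; the daily fact table is left untouched.
monthly_by_store = build_monthly_aggregate(daily_sales)
```

Queries at month level can then read the small aggregate table instead of scanning every daily row.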

Data Marts
• What is a Data Mart?
  – It is a subset of a Data Warehouse with a specific purpose in mind.
• The key to a successful Data Warehouse lies in getting a data mart in place as soon as possible, rather than implementing the entire Data Warehouse initiative in one go.

Metadata
• What is Metadata?
  – Data about data.
• The metadata repository / document gives a detailed description of the source, structure, content and attributes of the data warehouse.
• Metadata is created using data modeling tools, ETL tools (e.g. ORACLE Warehouse Builder, INFORMATICA) or manually.

Surrogate Keys
• A surrogate key is a substitution for the natural primary key.
• It is just a unique identifier or number for each row that can be used for the primary key to the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.
• Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, and you could consider creating a surrogate key called, say, AIRPORT_KEY. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.

Surrogate Key
Pros
• Surrogate keys never need changing
• Save space
• Improve query performance
Cons
• Overhead in the key generation process
• The user cannot understand the key, thus the table.
• If new developers take over, they will also have to figure out the keys.
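A surrogate key generator can be sketched as follows. The class name, the airport names and the rename scenario are assumptions for the example; in practice a database sequence or identity column usually plays this role.

```python
import itertools

class SurrogateKeyGenerator:
    """Hands out one integer key per distinct natural key."""

    def __init__(self, start=1):
        self._next = itertools.count(start)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Assign a new number only the first time a natural key is seen.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next)
        return self._by_natural_key[natural_key]

    def rename(self, old, new):
        """The natural key changed; the surrogate key stays the same."""
        self._by_natural_key[new] = self._by_natural_key.pop(old)

airport_keys = SurrogateKeyGenerator()
k1 = airport_keys.key_for("Bombay Airport")   # hypothetical AIRPORT_NAME values
k2 = airport_keys.key_for("O'Hare")
airport_keys.rename("Bombay Airport", "Mumbai Airport")
k3 = airport_keys.key_for("Mumbai Airport")
```

The renamed airport keeps its AIRPORT_KEY, so fact rows that reference it never need changing. The cost is the extra lookup step, which is the key generation overhead listed under Cons.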

Types of Facts
There are three types of facts:
• Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Example Additive:
Fact table (Retailer) with the following columns:
• Date
• Store
• Product
• Sales_Amount
The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table: date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

Example Semi-Additive/Non-Additive:
Fact table (bank) with the following columns:
• Date
• Account
• Current_Balance
• Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.
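The difference between summing a semi-additive fact across accounts and across time can be sketched as below. The sample balances and function names are assumptions for the example.

```python
# Snapshot rows: (date, account, current_balance).
balances = [
    ("2003-01-01", "A", 100.0), ("2003-01-01", "B", 50.0),
    ("2003-01-02", "A", 120.0), ("2003-01-02", "B", 50.0),
]

def total_balance_on(rows, date):
    """Summing across accounts for one day is meaningful."""
    return sum(bal for d, _acct, bal in rows if d == date)

def closing_balance(rows, account):
    """Across time, take the latest balance rather than the sum."""
    latest_date = max(d for d, acct, _bal in rows if acct == account)
    return next(bal for d, acct, bal in rows if acct == account and d == latest_date)

bank_total = total_balance_on(balances, "2003-01-02")
account_a = closing_balance(balances, "A")
# Summing account A over both days would give 220.0, which describes no real balance.
```

A reporting layer therefore needs to know, per fact, which dimensions it may be summed along.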

Types of Fact Tables
Based on the above classifications, there are two types of fact tables:
• Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
• Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Slowly Changing Dimension
• The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time.
Example:
• Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:

  Cust_Key  Name       State
  1001      Christina  Illinois

• At a later date, she moved to Los Angeles, California in January 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

Solving a Slowly Changing Dimension
• There are in general three ways to solve this type of problem, and they are categorized as follows:
• Type 1: The new record replaces the original record. No trace of the old record exists.
• Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
• Type 3: The original record is modified to reflect the change.

Type 1
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. After Christina's move, the record reads:

  Cust_Key  Name       State
  1001      Christina  California

Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
When to use Type 1:
- Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Type 2
• In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

  Cust_Key  Name       State
  1001      Christina  Illinois
  1010      Christina  California

Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
When to use Type 2:
- Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
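A Type 2 change can be sketched as below. The `customer_id` column is an assumption added for the example: some stable natural identifier is needed to tie the two rows to the same person, and the slide does not name one. The function name is also hypothetical.

```python
def apply_type2_change(dimension, natural_id, new_state, new_key):
    """Type 2: keep the old row and append a new row with its own key."""
    # The most recently added row for this customer is the current one.
    current = [r for r in dimension if r["customer_id"] == natural_id][-1]
    new_row = dict(current, cust_key=new_key, state=new_state)
    dimension.append(new_row)
    return new_row

customer_dim = [
    {"cust_key": 1001, "customer_id": "C-1", "name": "Christina", "state": "Illinois"},
]
apply_type2_change(customer_dim, "C-1", "California", new_key=1010)
```

Facts recorded before the move keep pointing at key 1001 and facts after it use 1010, which is how the history is preserved.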

Type 3
• In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
• To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns: Customer Key, Name, Original State, Current State, Effective Date.
• After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

  Customer Key  Name       Original State  Current State  Effective Date
  1001          Christina  Illinois        California     January 15, 2003

Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.
Usage:
• Type 3 is rarely used in actual practice.
When to use Type 3:
• Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
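A Type 3 change, including the loss of the intermediate value on a second move, can be sketched as below. The dictionary keys mirror the columns listed above; the function name is an assumption for the example.

```python
from datetime import date

def apply_type3_change(row, new_state, effective):
    """Type 3: overwrite the current value, keep only the original value."""
    row["current_state"] = new_state
    row["effective_date"] = effective
    return row

customer = {
    "cust_key": 1001, "name": "Christina",
    "original_state": "Illinois", "current_state": "Illinois",
    "effective_date": None,
}

apply_type3_change(customer, "California", date(2003, 1, 15))
after_first_move = dict(customer)  # copy of the row after the first change

# A second move overwrites California; only Illinois and Texas remain.
apply_type3_change(customer, "Texas", date(2003, 12, 15))
```

The row count never grows, but the row has room for only one previous value.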

Conformed Dimension
• A conformed dimension is a single, coherent view of the same piece of data throughout the organization. The same dimension is used in all subsequent star schemas defined. This enables reporting across the complete data warehouse in a simple format.
• A conformed dimension is a set of data attributes that have been physically implemented in multiple data marts using the same structure, attributes, domain values, definitions and concepts in each implementation.

Conformed Dimension
• Unlike in operational systems where data redundancy is normally avoided, data replication is expected in the Data Warehouse world. To provide fast access and intuitive "drill down" capabilities of data originating from multiple operational systems, it is often necessary to replicate dimensional data in Data Warehouses and in Data Marts.
• For example, the Calendar dimension is commonly needed in most data marts. By making this Calendar dimension adhere to a single structure, regardless of which data mart it is used in within your organization, you can query by date/time from one data mart to another. Conformed dimensions promote flexibility in your querying while supporting the benefits of ease of query and departmental subject areas that the star-schema approach affords.

Factless Fact
A factless fact table is a fact table that does not contain numeric additive values, but is composed exclusively of keys.
There are two types of factless fact tables:
  – Event-tracking
  – Coverage

courses. numeric facts. There are no additive. 54 . teachers. A factless fact table for recording student attendance on a daily basis at a college. students. and facilities. such as college student class attendance.Factless Fact – Event Tracking Event tracking records and tracks event that have occurred. The five dimension tables contain rich descriptions of dates.

Factless Fact - Coverage
A coverage factless table supports the dimensional model when the primary fact table is sparse, for example, a sales promotion factless table.
A factless coverage table is used in conjunction with an ordinary sales fact table to answer the question, "Which products were on promotion that did not sell?"
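The promotion question can be sketched as a set difference between the coverage table and the sales fact table. The dates, store and product codes below are assumptions made for the example.

```python
# Coverage factless fact table: keys only, no measures.
# Each row says "this product was on promotion in this store on this day".
promotion_coverage = {
    ("2003-01-15", "S1", "P1"),
    ("2003-01-15", "S1", "P2"),
    ("2003-01-15", "S1", "P3"),
}

# Ordinary sales fact table: (date, store, product, sales_amount).
sales_fact = [
    ("2003-01-15", "S1", "P1", 20.0),
    ("2003-01-15", "S1", "P4", 5.0),
]

# Key combinations that actually sold.
sold = {(d, s, p) for d, s, p, _amount in sales_fact}

# Promoted combinations with no matching sale.
promoted_but_unsold = sorted(promotion_coverage - sold)
```

The sales fact table alone cannot answer the question, because products that did not sell have no rows in it.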

Data Warehouse (Do’s & Don’ts)
Do’s
• Do understand how the data warehouse will support the strategic goals of the company.
• Place data warehouses as close to the user as practical to ensure convenience of access and to lower network costs.
• Emphasize data cleansing
• Plan for huge storage and performance related issues in advance
• Do think that there are patterns in the data of your company. The patterns are waiting to be detected. Detection of patterns in your data is called Data Mining.
• De-normalize data
• Defer functionality (think incremental)

Data Warehouse (Do’s & Don’ts)
Don’ts
• Don't think normalization. (Storage is cheap)
• Don't go for a Big-Bang approach (iterative approach is the right way)
• Don't think of it as a product (it is rather a process)
• Don't think only of current needs when planning infrastructure like storage or speed (think future)
• Don't emphasize tools. (Emphasize technology)
• Don't think reports; think analysis, data mining
• Don't ignore user training & maintenance. They are as important as the design itself.

On Line Analytical Processing
• OnLine - Connected to, actively working with
• Analytical - of or relating to analysis
• Processing - to move from one state to another leading to a specific result
• OLAP - Connecting to a data source to analyze information for a specific purpose

On Line Analytical Processing
• OLAP enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
• OLAP Activities
  – Slice & Dice - Slice and Dice is an ability to move between different combinations of dimensions to see different slices of the information.
  – Drill-down - Drilling down or up is a specific analytical technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down). The drilling paths may be defined by the hierarchies within dimensions or other relationships that may be dynamic within or between dimensions.
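Slice and dice and drill-down can be sketched with one small roll-up function over fact rows. The column names, sample rows and the `rollup` helper are assumptions for the example; an OLAP tool does the same kind of regrouping interactively.

```python
from collections import defaultdict

# Fact rows: (year, month, region, product, sales).
facts = [
    (2003, "2003-01", "East", "P1", 10.0),
    (2003, "2003-01", "West", "P1", 4.0),
    (2003, "2003-02", "East", "P2", 6.0),
    (2004, "2004-01", "East", "P1", 8.0),
]
COLUMNS = ("year", "month", "region", "product")

def rollup(rows, dims, **filters):
    """Sum sales grouped by `dims`, after slicing on `filters`."""
    totals = defaultdict(float)
    for row in rows:
        rec = dict(zip(COLUMNS, row[:4]))
        if all(rec[k] == v for k, v in filters.items()):
            totals[tuple(rec[d] for d in dims)] += row[4]
    return dict(totals)

by_year = rollup(facts, ["year"])                             # most summarized view
by_month_2003 = rollup(facts, ["month"], year=2003)           # drill down into 2003
east_by_product = rollup(facts, ["product"], region="East")   # slice on region, view by product
```

Drilling down follows the year to month hierarchy of the time dimension, while slicing fixes one dimension value and regroups by another.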

On Line Analytical Processing - FASMI test
• Fast means faster response time
• Analysis means that the system can cope with any business logic and statistical analysis
• Shared is for multiple access
• Multidimensional view of data including full support of hierarchies
• Information is all of the data and derived information needed

ROLAP/MOLAP/HOLAP
• ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the data is really stored in a Relational Database (RDBMS) like Oracle. The RDBMS will store data at a fine grain level; response times are usually slow.
• MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with dimensions, but the data is stored in a Multidimensional database (MDBMS) like Oracle Express Server. In a MOLAP system a lot of queries have a finite answer, and performance is usually critical and fast.
• HOLAP stands for Hybrid OLAP; it is a combination of both worlds. Seagate Software's Holos is an example HOLAP environment. In a HOLAP system one will find queries on aggregated data as well as on detailed data.

Data Mining
Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
  – Automated prediction of trends and behaviors
  – Automated discovery of previously unknown patterns

Data Mining
Commonly used data mining techniques
• Decision trees
• Rule induction
• Artificial Neural Networks
• Clustering
• Market Basket Analysis
• Link Analysis
Applications
• Forecasting
• Risk Management
• Market Management

ETL Tools
• Informatica
• DataStage
• Oracle Warehouse Builder (OWB)

OLAP Tools
• Cognos Products
  • Impromptu
  • Transformer
  • PowerPlay
  • Visualizer
• Oracle Products
  • Oracle Discoverer Administrator
  • Discoverer Plus
  • Discoverer Desktop

OLAP Tools
• Primary Business Objects products
  – Business Objects - Full client reporting tool.
  – WebIntelligence - Thin client reporting tool.
  – Designer - Interface to design universes.
  – Supervisor - User administration and metadata management.
  – Broadcast Agent - Scheduling and distribution tool.

Data Mining Tools
  – BusinessObject Miner
  – Cognos 4Thought
  – Cognos Scenarios
  – Oracle Data Miner

Useful Web Sites/Books
  – BusinessObjects: www.businessobjects.com
  – COGNOS: www.cognos.com
  – Seagate: www.seagate.com
  – SAS Institute: www.sas.com
  – Data Warehousing: www.dw-institute.com, www.dwinfocenter.org
  – Other links:
    http://www.learndatamodeling.com
    http://www.kimballgroup.com/html/designtips.html
    http://www.1keydata.com/datawarehousing/concepts.html
  – The Data Warehouse Toolkit - Ralph Kimball, Publisher - John Wiley & Sons Inc.

Thank You

How would a grain change impact the universe? Is the report grain or DB level grain?
2 aspects:
1. If your fact table is changing, i.e. if you are moving to a lower grain, then your DB is changing and the universe will also be impacted. You probably need to create a dimension, and the fact table will change, so the join conditions will change and the reports will change as well.
2. If you are moving to a greater grain, i.e. from day to month: if there is no change in the fact table (DB), we may not have to do anything; at report level we need to pull the month portion from the time dimension. This applies to additive facts; for non-additive facts we need to consider the last day flag of the time dimension. If instead of date_key you have month_key (this is possible when you maintain a day dimension as well as a month dimension), then your universe join conditions will change.

Which index do you use in DWH?
Bitmap indexes are widely used in data warehousing environments. The environments typically have large amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:
  – Reduced response time for large classes of ad hoc queries.
  – Reduced storage requirements compared to other indexing techniques.
  – Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory.
  – Efficient maintenance during parallel DML and loads.
Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically.

Which index do you use in DWH?
Bitmap index
  – Cardinality
The advantages of using bitmap indexes are greatest for columns in which the ratio of the number of distinct values to the number of rows in the table is small. We refer to this ratio as the degree of cardinality. A gender column, which has only two distinct values (male and female), is optimal for a bitmap index. However, data warehouse administrators also build bitmap indexes on columns with higher cardinalities. For example, on a table with one million rows, a column with 10,000 distinct values is a candidate for a bitmap index. A bitmap index on this column can outperform a B-tree index, particularly when this column is often queried in conjunction with other indexed columns. In fact, in a typical data warehouse environment, a bitmap index can be considered for any non-unique column.
  – Null values
Unlike most other types of indexes, bitmap indexes include rows that have NULL values. Indexing of nulls can be useful for some types of SQL statements, such as queries with the aggregate function COUNT.
  – Using Bitmap Join Indexes in Data Warehouses
In addition to a bitmap index on a single table, you can create a bitmap join index, which is a bitmap index for the join of two or more tables. In a bitmap join index, the bitmap for the table to be indexed is built for values coming from the joined tables. In a data warehousing environment, the join condition is an equi-inner join between the primary key column or columns of the dimension tables and the foreign key column or columns in the fact table. A bitmap join index can improve the performance by an order of magnitude. By storing the result of a join, the join can be avoided completely for SQL statements using a bitmap join index. Furthermore, since a bitmap join index is most likely to have a much smaller number of distinct values than a regular bitmap index on the join column, the bitmaps compress better, yielding less space consumption than a regular bitmap index on the join column.
Refer to the links below for bitmap join indexes. You can also refer to the "bitmap index in DWH.doc" document.
Link:
http://download.oracle.com/docs/cd/B14117_01/server.101/b10736/indexes.htm
http://download.oracle.com/docs/cd/B14117_01/server.101/b10736/schemas.htm#g1008401
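The idea behind a bitmap index can be sketched with Python integers used as bit strings. This is a conceptual illustration only, not how Oracle implements or compresses bitmap indexes; the sample columns and function names are assumptions for the example.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into a list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two low-cardinality columns of a six-row table.
gender = ["M", "F", "F", "M", None, "F"]
region = ["East", "East", "West", "West", "East", "East"]

gender_idx = build_bitmap_index(gender)
region_idx = build_bitmap_index(region)

# WHERE gender = 'F' AND region = 'East': AND the bitmaps first,
# and only then look up the surviving rows in the table.
hits = matching_rows(gender_idx["F"] & region_idx["East"])

# NULL values get their own bitmap, so they can be counted from the index alone.
null_count = bin(gender_idx[None]).count("1")
```

Combining conditions is a single bitwise operation per predicate, which is why multi-condition WHERE clauses benefit most.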

Bitmap index in Star Schema:
To get the best possible performance for star queries, it is important to follow some basic guidelines:
  – A bitmap index should be built on each of the foreign key columns of the fact table or tables.
  – The initialization parameter STAR_TRANSFORMATION_ENABLED should be set to TRUE. This enables an important optimizer feature for star queries. It is set to FALSE by default for backward compatibility.

What is a degenerate dimension?
In a data warehouse, a degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table. Degenerate dimensions are often used when a fact table's grain represents transactional level data and one wishes to maintain system specific identifiers such as order numbers, invoice numbers and the like without forcing their inclusion in their own dimension. The decision to use degenerate dimensions is often based on the desire to provide a direct reference back to a transactional system without the overhead of maintaining a separate dimension table.