Data Warehousing Concepts:

What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

• Subject Oriented
• Integrated
• Nonvolatile
• Time Variant

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

Contrasting OLTP and Data Warehousing Environments

Figure 1-1 illustrates key differences between an OLTP system and a data warehouse.

Figure 1-1 Contrasting OLTP and Data Warehousing Environments

Data Warehouse Architectures

Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:

• Data Warehouse Architecture (Basic)
• Data Warehouse Architecture (with a Staging Area)
• Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture (Basic)

Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.

Figure 1-2 Architecture of a Data Warehouse

In Figure 1-2, the metadata and raw data of a traditional OLTP system are present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view.

Data Warehouse Architecture (with a Staging Area)

In Figure 1-2, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 1-3 illustrates this typical architecture.

Figure 1-3 Architecture of a Data Warehouse with a Staging Area

Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. Figure 1-4 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts

Business Intelligence - Data Warehouse - ETL:

Data Warehousing Schemas

A schema is a collection of database objects, including tables, views, indexes, and synonyms. You can arrange schema objects in the schema models designed for data warehousing in a variety of ways. Most data warehouses use a dimensional model.

The model of your source data and the requirements of your users help you design the data warehouse schema. You can sometimes get the source model from your company's enterprise data model and reverse-engineer the logical data model for the data warehouse from this. The physical implementation of the logical data warehouse model may require some changes to adapt it to your system parameters: size of machine, number of users, storage capacity, type of network, and software.

Star Schemas

The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables, as shown in Figure 2-1.

Figure 2-1 Star Schema
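As a rough illustration of the star shape, the sketch below builds one fact table with two dimension tables in an in-memory SQLite database and answers a "best customer for this item last year" style question. The table names, column names, and sample rows are all hypothetical and exist only for this example.

```python
import sqlite3

# Hypothetical star schema: one fact table in the center, two dimension tables as points.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    sale_year    INTEGER,
    amount       REAL
);
INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO dim_product  VALUES (10, 'Widget'), (20, 'Gadget');
INSERT INTO fact_sales VALUES
    (1, 10, 2023, 100.0),
    (2, 10, 2023, 250.0),
    (1, 20, 2023, 400.0),
    (2, 10, 2022, 900.0);
""")


def best_customer(product_name, year):
    """Return (customer_name, total_amount) for the top buyer of one product in one year."""
    return conn.execute(
        """
        SELECT c.customer_name, SUM(f.amount) AS total
        FROM fact_sales f
        JOIN dim_customer c ON c.customer_key = f.customer_key
        JOIN dim_product  p ON p.product_key  = f.product_key
        WHERE p.product_name = ? AND f.sale_year = ?
        GROUP BY c.customer_name
        ORDER BY total DESC
        LIMIT 1
        """,
        (product_name, year),
    ).fetchone()


print(best_customer("Widget", 2023))
```

Every dimension is one join away from the fact table, which is what keeps star-schema queries simple.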

Snowflake Schema Architecture

Snowflake schema architecture is a more complex variation of a star schema design. The main difference is that dimension tables in a snowflake schema are normalized, so they have a typical relational database design.

Snowflake schemas are generally used when a dimension table becomes very big and when a star schema can't represent the complexity of a data structure. For example, if a PRODUCT dimension table contains millions of rows, a snowflake schema can help by moving some data out to another table (BRANDS, for instance), which reduces redundancy in the dimension. The problem is that the more normalized the dimension table is, the more complicated the SQL joins that must be issued to query it. This is because, in order for a query to be answered, many tables need to be joined and aggregates generated.
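The trade-off can be seen in a small sketch. Here a hypothetical product dimension has its brand attributes moved out into a separate brand table, so a report by brand needs one more join than it would in a star schema. All names and rows are invented for illustration.

```python
import sqlite3

# Hypothetical snowflake: the product dimension is normalized into product + brand.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_brand (brand_key INTEGER PRIMARY KEY, brand_name TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_key    INTEGER REFERENCES dim_brand(brand_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);
INSERT INTO dim_brand   VALUES (1, 'NorthStar'), (2, 'Blue Ridge');
INSERT INTO dim_product VALUES (10, 'Widget', 1), (20, 'Gadget', 1), (30, 'Gizmo', 2);
INSERT INTO fact_sales  VALUES (10, 100.0), (20, 50.0), (30, 75.0), (30, 25.0);
""")


def sales_by_brand():
    """Total sales per brand; note the extra join through dim_product to reach dim_brand."""
    rows = conn.execute(
        """
        SELECT b.brand_name, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_brand   b ON b.brand_key   = p.brand_key
        GROUP BY b.brand_name
        ORDER BY b.brand_name
        """
    ).fetchall()
    return dict(rows)


print(sales_by_brand())
```

The brand name is stored once per brand instead of once per product, at the cost of the second join.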

Data Warehousing Objects

Fact tables and dimension tables are the two types of objects commonly used in dimensional data warehouse schemas.

Fact tables are the large tables in your warehouse schema that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit.

Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. Dimension tables are usually textual and descriptive and you can use them as the row headers of the result set. Examples are customers or products.

Fact Tables

A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation.

Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "factless fact tables", or "junction tables". Factless fact tables can, for example, be used for modeling many-to-many relationships or capturing events.
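The difference between additive, semi-additive, and non-additive facts is easier to see with numbers. The sketch below uses a few made-up fact rows (day, product, units sold, units in stock); the row layout and values are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (day, product, units_sold, units_in_stock)
rows = [
    ("Mon", "Widget", 6, 20),
    ("Mon", "Gadget", 3, 10),
    ("Tue", "Widget", 2, 14),
    ("Tue", "Gadget", 1, 7),
    ("Tue", "Gizmo", 6, 30),
]

# Additive: units sold can be summed across every dimension.
total_units_sold = sum(sold for _day, _product, sold, _stock in rows)

# Semi-additive: stock can be summed across products within a day,
# but summing the daily totals across days gives a meaningless number.
stock_by_day = defaultdict(int)
for day, _product, _sold, stock in rows:
    stock_by_day[day] += stock
stock_by_day = dict(stock_by_day)
closing_stock = stock_by_day["Tue"]  # take the latest day instead of adding days together

# Non-additive: averages cannot be combined by adding or averaging them again.
sold_by_day = defaultdict(list)
for day, _product, sold, _stock in rows:
    sold_by_day[day].append(sold)
daily_averages = {day: sum(v) / len(v) for day, v in sold_by_day.items()}
average_of_averages = sum(daily_averages.values()) / len(daily_averages)
true_average = total_units_sold / len(rows)

print(total_units_sold, stock_by_day, closing_stock)
print(average_of_averages, true_average)
```

The average of the daily averages differs from the average over all rows because the days have different row counts, which is why an average is usually stored as a sum and a count rather than as a finished figure.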

Creating a New Fact Table

You must define a fact table for each star schema. From a modeling standpoint, the primary key of the fact table is usually a composite key that is made up of all of its foreign keys.

Dimension Tables

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time.

Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.

Types of Dimension Tables:

1) Conformed Dimension
2) Junk Dimension

Conformed Dimensions (CD): these dimensions are built once in your model and can be reused multiple times with different fact tables. For example, consider a model containing multiple fact tables, representing different data marts. Now look for a dimension that is common to these fact tables. In this example, let's consider that the product dimension is common and hence can be reused by creating shortcuts and joining the different fact tables. Some examples are the time dimension, customer dimension, and product dimension.

Hierarchies

Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.

Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies: one for product categories and one for product suppliers.

Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse.

When designing hierarchies, you must consider the relationships in business structures, for example, a divisional multilevel sales organization.

Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.

Levels

A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies.
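The month-to-quarter-to-year rollup described above can be sketched in a few lines. The monthly figures and the helper names (quarter_of, roll_up) are hypothetical; the point is only that each level's totals are built from its children.

```python
from collections import defaultdict

# Hypothetical monthly sales keyed by (year, month): the lowest level of the hierarchy.
monthly_sales = {
    (2023, 1): 100,
    (2023, 2): 120,
    (2023, 3): 80,
    (2023, 4): 200,
    (2024, 1): 50,
}


def quarter_of(month):
    """Map a month number (1-12) to its parent quarter (1-4)."""
    return (month - 1) // 3 + 1


def roll_up(values, parent_of):
    """Aggregate child-level totals into their parent level."""
    totals = defaultdict(int)
    for key, value in values.items():
        totals[parent_of(key)] += value
    return dict(totals)


# month -> quarter -> year, each level built from the one below it
quarterly_sales = roll_up(monthly_sales, lambda k: (k[0], quarter_of(k[1])))
yearly_sales = roll_up(quarterly_sales, lambda k: k[0])

print(quarterly_sales)
print(yearly_sales)
```

Drilling down is the reverse walk: from a year total to its quarters, then from a quarter to its months.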

Typical Dimension Hierarchy

Figure 2-2 illustrates a dimension hierarchy based on customers.

Figure 2-2 Typical Levels in a Dimension Hierarchy

SCD - Slowly changing dimensions

Type 1 SCD DW architecture applies when no history is kept in the database. The new, changed data simply overwrites old entries. This approach is used quite often for data that changes over time because data quality errors are being corrected (misspellings, data consolidations, trimming spaces, language-specific characters). Type 1 SCD is easy to maintain and is used mainly when losing the ability to track the old history is not an issue.

In the Type 2 SCD model the whole history is stored in the database. An additional dimension record is created, so the segmenting between the old record values and the new (current) value is easy to extract and the history is clear. The fields 'effective date' and 'current indicator' are very often used in that dimension.

Type 3 SCD - only the information about a previous value of a dimension is written into the database. An 'old' or 'previous' column is created which stores the immediate previous attribute. In Type 3 SCD users are able to describe history immediately and can report both forward and backward from the change. However, that model can't track all historical changes, such as when a dimension changes twice or more. That would require creating additional columns to store historical data and could make the whole data warehouse schema very complex.

Type 4 SCD - the idea is to store all historical changes in a separate historical data table for each of the dimensions.

Surrogate Keys:

In order to manage Slowly Changing Dimensions properly and easily it is highly recommended to use Surrogate Keys in the Data Warehouse tables. A Surrogate Key is a technical key added to a fact table or a dimension table which is used instead of a business key (like product ID or customer ID). Surrogate keys are usually numeric and unique on a table level, which makes it easy to distinguish and track values changed over time.

In practice, in big production Data Warehouse environments, mostly the Slowly Changing Dimensions Type 1, Type 2 and Type 3 are considered and used. It is a common practice to apply different SCD models to different dimension tables (or even columns in the same table) depending on the business reporting needs of a given type of data.
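A Type 2 change combined with a surrogate key might look like the following minimal sketch. It keeps the dimension as a list of rows in memory; the class name, the column names, and the extra 'end_date' field are assumptions made for the example, not part of any particular tool.

```python
from datetime import date
from itertools import count


class CustomerDimension:
    """Hypothetical in-memory customer dimension handled as Type 2 SCD."""

    def __init__(self):
        self.rows = []
        self._next_key = count(1)  # surrogate key generator, independent of the business key

    def current(self, customer_id):
        """Return the current row for a business key, or None if the key is unknown."""
        for row in self.rows:
            if row["customer_id"] == customer_id and row["is_current"]:
                return row
        return None

    def apply(self, customer_id, city, effective_date):
        """Record an attribute value; a changed value adds a new row instead of overwriting."""
        old = self.current(customer_id)
        if old is not None:
            if old["city"] == city:
                return old["customer_key"]  # nothing changed, keep the current row
            old["is_current"] = False       # close the old version
            old["end_date"] = effective_date
        new_row = {
            "customer_key": next(self._next_key),  # surrogate key
            "customer_id": customer_id,            # business key
            "city": city,
            "effective_date": effective_date,
            "end_date": None,
            "is_current": True,
        }
        self.rows.append(new_row)
        return new_row["customer_key"]


dim = CustomerDimension()
first_key = dim.apply("C-100", "Boston", date(2022, 1, 1))
second_key = dim.apply("C-100", "Denver", date(2023, 6, 1))
repeat_key = dim.apply("C-100", "Denver", date(2023, 7, 1))
print(first_key, second_key, repeat_key, len(dim.rows))
```

Because each version of the customer gets its own surrogate key, fact rows loaded before the move keep pointing at the Boston version while later facts point at the Denver version. A Type 1 change would instead overwrite 'city' on the existing row.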

Data Mining Tools:

Data mining tools offer a number of data discovery techniques to provide expertise to the data and to help identify a relevant set of attributes in the data:

• Data manipulation, which consists of construction of new data subsets derived from existing data sources.
• Browsing, auditing and visualization of the data, which helps identify non-typical, suspected relationships between variables in the data.
• Hypothesis testing

A group of the most significant data mining tools is represented by:

• SPSS Clementine
• SAS Enterprise Miner
• IBM DB2 Intelligent Miner
• STATISTICA Data Miner
• Pentaho Data Mining (WEKA)
• Isoft Alice

ETL process and concepts

ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks:

• extracting data from source operational or archive systems which are the primary source of data for the data warehouse
• transforming the data, which may involve cleaning, filtering, validating and applying business rules
• loading the data into a data warehouse or any other database or application that houses data

The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems to feed a data warehouse and form data marts.
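The three ETL tasks can be sketched end to end with the Python standard library. The source here is a small inline CSV standing in for an operational system, and the target is an in-memory SQLite table; the data, the table name 'fact_orders', and the cleaning rules are all assumptions for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; the second row carries a bad amount on purpose.
SOURCE_CSV = """order_id,customer,amount
1, acme ,100.50
2,Globex,not-a-number
3,ACME,49.50
"""


def extract(text):
    """Extract: read raw records from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: validate amounts, trim and normalize customer names."""
    clean = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # filter out rows that fail validation
        clean.append((int(rec["order_id"]), rec["customer"].strip().upper(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
acme_total = warehouse.execute(
    "SELECT SUM(amount) FROM fact_orders WHERE customer = 'ACME'"
).fetchone()[0]
print(acme_total)
```

Keeping the three steps as separate functions mirrors the staging idea: the transform step can be tested and rerun without touching either the source system or the warehouse.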
