
Lecture Notes For DBMS and Data Mining and Data Warehousing

UNIT V Lecture 31

Data Warehouse Architecture

Three-Tier Architecture
1. Warehouse database server
Almost always a relational DBMS; rarely flat files.
2. OLAP servers
o Relational OLAP (ROLAP): an extended relational DBMS that maps operations on multidimensional data to standard relational operations.
o Multidimensional OLAP (MOLAP): a special-purpose server that directly implements multidimensional data and operations.
3. Clients
Query and reporting tools, analysis tools, and data mining tools (e.g., trend analysis, prediction).
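To make the ROLAP mapping concrete, a multidimensional request such as "total sales by product and city" becomes an ordinary relational aggregation. A minimal sketch, assuming hypothetical tables sales(product_id, store_id, amount) and store(store_id, city):

-- The multidimensional roll-up "total sales by product and city"
-- expressed as a standard relational GROUP BY (table names are illustrative):
SELECT s.product_id, st.city, SUM(s.amount) AS total_sales
FROM sales s
JOIN store st ON st.store_id = s.store_id
GROUP BY s.product_id, st.city;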

DATA WAREHOUSE COMPONENTS


The data in a data warehouse comes from operational systems of the organization as well as from other external sources. These are collectively referred to as source systems. The data extracted from source systems is stored in an area called the data staging area, where it is cleaned, transformed, combined, and de-duplicated to prepare it for use in the data warehouse. The data staging area is generally a collection of machines where simple activities like sorting and sequential processing take place. The data staging area does not provide any query or presentation services. As soon as a system provides query or presentation services, it is categorized as a presentation server. A presentation server is the target machine on which the data loaded from the data staging area is organized and stored for direct querying by end users, report writers, and other applications. The three different kinds of systems that are required for a data warehouse are:

1. Source Systems
2. Data Staging Area
3. Presentation Servers
The data travels from source systems to presentation servers via the data staging area. The entire process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and transfer). Oracle's ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server's ETL tool is called Data Transformation Services (DTS). A typical architecture of a data warehouse is shown below:

Each component and the tasks performed by it are explained below:
1. OPERATIONAL DATA
The data for the data warehouse is supplied from:
o Mainframe systems, in the traditional network and hierarchical formats.
o Relational DBMSs such as Oracle and Informix.
o In addition to this internal data, external data obtained from commercial databases and from databases associated with suppliers and customers.

2. LOAD MANAGER
The load manager performs all the operations associated with the extraction and loading of data into the data warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse. The size and complexity of this component will vary between data warehouses, and it may be constructed using a combination of vendor data loading tools and custom-built programs.

3. WAREHOUSE MANAGER
The warehouse manager performs all the operations associated with the management of data in the warehouse. This component is built using vendor data management tools and custom-built programs. The operations performed by the warehouse manager include:
o Analysis of data to ensure consistency
o Transformation and merging of source data from temporary storage into the data warehouse tables
o Creation of indexes and views on the base tables
o Denormalization
o Generation of aggregations
o Backing up and archiving of data
In certain situations, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate (a short SQL sketch of the indexing and aggregation tasks follows this component list).
4. QUERY MANAGER
The query manager performs all operations associated with the management of user queries. This component is usually constructed using vendor end-user access tools, data warehouse monitoring tools, database facilities, and custom-built programs. The complexity of a query manager is determined by the facilities provided by the end-user access tools and the database.
5. DETAILED DATA
This area of the warehouse stores all the detailed data in the database schema. In most cases the detailed data is not stored online but is aggregated to the next level of detail. However, detailed data is added regularly to the warehouse to supplement the aggregated data.
6. LIGHTLY AND HIGHLY SUMMARIZED DATA
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient, as it is subject to change on an ongoing basis in order to respond to changing query profiles. The purpose of the summarized information is to speed up query performance. The summarized data is updated continuously as new data is loaded into the warehouse.
7. ARCHIVE AND BACKUP DATA
This area of the warehouse stores detailed and summarized data for the purposes of archiving and backup. The data is transferred to storage archives such as magnetic tapes or optical disks.
8. META DATA

The data warehouse also stores all the metadata (data about data) definitions used by all processes in the warehouse. It is used for a variety of purposes, including:
(i) The extraction and loading process: metadata is used to map data sources to a common view of information within the warehouse.
(ii) The warehouse management process: metadata is used to automate the production of summary tables.
(iii) The query management process: metadata is used to direct a query to the most appropriate data source.
The structure of the metadata differs in each process, because the purpose is different. More about metadata will be discussed in the later Lecture Notes.

9. END-USER ACCESS TOOLS
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools. Examples of end-user access tools include:
(i) Reporting and Query Tools
(ii) Application Development Tools
(iii) Executive Information Systems Tools
(iv) Online Analytical Processing Tools
(v) Data Mining Tools
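As noted under the warehouse manager above, index creation and aggregation generation are routine warehouse manager tasks. A minimal, hypothetical SQL sketch (all table and column names here are illustrative, not from any particular product):

-- Illustrative fact table: fact_sales(date_key, store_key, product_key, amount)
-- 1. Create an index to speed up joins on a dimension key:
CREATE INDEX idx_sales_date ON fact_sales (date_key);
-- 2. Generate a lightly summarized (aggregated) table from the detailed data:
CREATE TABLE summary_sales_by_store AS
SELECT store_key, SUM(amount) AS total_amount
FROM fact_sales
GROUP BY store_key;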


Lecture 32

Data Warehouse vs. Data Marts

Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.
o Requires extensive business modeling
o May take years to design and build
Data marts: departmental subsets that focus on selected subjects, e.g., a marketing data mart covering customers, products, and sales.
o Faster roll-out, but complex integration in the long run
Virtual warehouse: views over operational databases.
o Materialize some summary views for efficient query processing
o Easier to build
o Requires excess capacity on the operational DB servers
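A virtual warehouse can be as little as a set of views defined directly over the operational tables, some of which may be materialized for query performance. A minimal sketch, assuming a hypothetical operational table orders(region, order_date, amount):

-- A summary view over an operational table (names are illustrative):
CREATE VIEW v_sales_by_region_year AS
SELECT region,
       EXTRACT(YEAR FROM order_date) AS order_year,
       SUM(amount) AS total_amount
FROM orders
GROUP BY region, EXTRACT(YEAR FROM order_date);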

Design & Operational Process
o Define the architecture. Do capacity planning.
o Integrate DB and OLAP servers, storage, and client tools.
o Design the warehouse schema and views.
o Design the physical warehouse organization: data placement, partitioning, access methods.
o Connect the sources: gateways, ODBC drivers, wrappers.
o Design & implement scripts for data extraction, load, and refresh.
o Define metadata and populate the repository.
o Design & implement end-user applications.
o Roll out the warehouse and applications.
o Monitor the warehouse.

OLAP for Decision Support
o The goal of OLAP is to support ad-hoc querying for the business analyst.
o Business analysts are familiar with spreadsheets, so OLAP extends the spreadsheet analysis model to work with warehouse data:
- large data sets
- semantically enriched to understand business terms (e.g., time, geography)
- combined with reporting features
o The multidimensional view of data is the foundation of OLAP.

Multidimensional Data Model
o A database is a set of facts (points) in a multidimensional space.
o A fact has a measure: a quantity that is analyzed, e.g., sale amount, budget.

o A fact also has a set of dimensions on which the data is analyzed, e.g., the store, product, and date associated with a sale amount.
o Dimensions form a sparsely populated coordinate system.
o Each dimension has a set of attributes, e.g., the owner, city, and county of a store.
o Attributes of a dimension may be related by a partial order:
- Hierarchy: e.g., street > city > county
- Lattice: e.g., date > month > year, date > week > year
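A dimension with a hierarchy is often stored as a single table whose attribute columns sit at different levels of that hierarchy. A hypothetical sketch of a store dimension carrying the street > city > county hierarchy:

-- Illustrative store dimension; street, city, and county form a hierarchy
CREATE TABLE dim_store (
    store_key INTEGER PRIMARY KEY,
    owner     VARCHAR(50),
    street    VARCHAR(50),  -- lowest level
    city      VARCHAR(50),  -- street rolls up to city
    county    VARCHAR(50)   -- city rolls up to county
);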

Operations in Multidimensional Data Model
o Aggregation (roll-up)
- dimension reduction: e.g., total sales by city
- summarization over an aggregate hierarchy: e.g., total sales by city and year -> total sales by region and by year
o Navigation to detailed data (drill-down)
- e.g., (sales - expense) by city, top 3% of cities by average income
o Selection (slice) defines a subcube
- e.g., sales where city = Palo Alto and date = 1/15/96
o Visualization operations (e.g., pivot)
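In a ROLAP setting these operations translate into ordinary SQL. A hedged sketch against a hypothetical table sales(city, region, sale_date, amount):

-- Roll-up: total sales by city, then further summarized by region
SELECT city, SUM(amount) AS total_sales FROM sales GROUP BY city;
SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region;
-- Slice: fix a dimension value to define a subcube
SELECT city, amount
FROM sales
WHERE city = 'Palo Alto' AND sale_date = DATE '1996-01-15';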

Approaches to OLAP Servers
o Relational OLAP (ROLAP)
- relational and specialized relational DBMSs store and manage the warehouse data
- OLAP middleware supports the missing pieces: optimization for each DBMS back end, aggregation navigation logic, and additional tools and services
o Multidimensional OLAP (MOLAP)
- array-based storage structures
- direct access to array data structures
- domain-specific enrichment


Lecture 33

Warehouse Database Schema


The data warehouse environment usually transforms the relational data model into special architectures. The determination of which schema model should be used for a data warehouse should be based upon an analysis of the project requirements, the available tools, and the project team's preferences. Points to note:
o ER design techniques are not appropriate
o The design should reflect the multidimensional view of the data
The principal schema models are:
o Star Schema
o Snowflake Schema
o Fact Constellation Schema

Star schema
What is a star schema? The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of a fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas the dimension tables are denormalized. Despite being the simplest architecture, the star schema is the most commonly used nowadays and is recommended by Oracle.

Fact Tables

A fact table typically has two types of columns: foreign keys to dimension tables, and measures, i.e., columns that contain numeric facts. A fact table can contain facts at the detail level or at an aggregated level.
Dimension Tables
A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional values; they are normally descriptive, textual values. Dimension tables are generally smaller in size than the fact table. Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.
The main characteristics of a star schema:
o Simple structure -> easy-to-understand schema
o Great query effectiveness -> small number of tables to join
o Relatively long time to load data into the dimension tables -> the denormalization and redundancy of the data mean the tables can be large
o The most commonly used schema in data warehouse implementations -> widely supported by a large number of business intelligence tools
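This structure can be sketched in SQL DDL. The tables below mirror the Fact.Sales / Dim.Date / Dim.Store / Dim.Product example discussed next, with abridged, illustrative column lists and portable underscore names in place of the dot-notation:

-- Abridged, illustrative DDL for the star schema example that follows
CREATE TABLE Dim_Date    (Date_PK INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Store   (Store_PK INTEGER PRIMARY KEY, Country VARCHAR(40));
CREATE TABLE Dim_Product (Product_PK INTEGER PRIMARY KEY,
                          Brand VARCHAR(40), Product_Category VARCHAR(40));
CREATE TABLE Fact_Sales (
    Date_FK    INTEGER REFERENCES Dim_Date(Date_PK),
    Store_FK   INTEGER REFERENCES Dim_Store(Store_PK),
    Product_FK INTEGER REFERENCES Dim_Product(Product_PK),
    Units_Sold INTEGER,                          -- the measure
    PRIMARY KEY (Date_FK, Store_FK, Product_FK)  -- compound key of dimension FKs
);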

Example

Star schema used by example query.


Consider a database of sales, perhaps from a store chain, classified by date, store, and product. The image of the schema to the right is a star schema version of the sample schema provided in the snowflake schema article. Fact.Sales is the fact table, and there are three dimension tables: Dim.Date, Dim.Store, and Dim.Product. Each dimension table has a primary key on its PK column, relating to one of the columns (viewed as rows in the example schema) of the Fact.Sales table's three-column (compound) primary key (Date_FK, Store_FK, Product_FK). The non-primary-key [Units Sold] column of the fact table represents a measure or metric that can be used in calculations and analysis. The non-primary-key columns of the dimension tables represent additional attributes of the dimensions (such as the Year of the Dim.Date dimension).
Using schema descriptors with dot-notation, combined with simple suffix decorations for column differentiation, makes it easier to write the SQL for star schema queries, because fewer underscores are required and table aliasing is minimized. Most SQL database engines allow schema descriptors and also permit decoration suffixes on surrogate key columns. Square brackets, which are physically easy to type (no shift key needed), are not intrusive and make the code easier to read. For example, the following query extracts how many TV sets have been sold, for each brand and country, in 1997:

SELECT Brand, Country, SUM ([Units Sold])
FROM Fact.Sales
  JOIN Dim.Date    ON Date_FK = Date_PK
  JOIN Dim.Store   ON Store_FK = Store_PK
  JOIN Dim.Product ON Product_FK = Product_PK
WHERE [Year] = 1997
  AND [Product Category] = 'tv'
GROUP BY Brand, Country


Lecture 34

Snowflake schema
A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized with each dimension represented by a single table. A complex snowflake shape emerges when the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and the child tables have multiple parent tables ("forks in the road"). The "snowflaking" effect only affects the dimension tables and not the fact tables.
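Normalizing a dimension splits it into related tables. A hedged sketch of how the product dimension might be snowflaked, using abridged column lists and the table names that appear in the example query below:

-- Snowflaked product dimension: brand and category moved to their own tables
CREATE TABLE Dim_Brand            (Id INTEGER PRIMARY KEY, Brand VARCHAR(40));
CREATE TABLE Dim_Product_Category (Id INTEGER PRIMARY KEY, Product_Category VARCHAR(40));
CREATE TABLE Dim_Product (
    Id                  INTEGER PRIMARY KEY,
    Brand_Id            INTEGER REFERENCES Dim_Brand(Id),
    Product_Category_Id INTEGER REFERENCES Dim_Product_Category(Id)
);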

Example


Snowflake schema used by example query.

The example schema shown to the right is a snowflaked version of the star schema example provided in the star schema article. The following example query is the snowflake schema equivalent of the star schema example code, and it returns the total number of units sold by brand and by country for 1997. Notice that the snowflake schema query requires many more joins than the star schema version in order to fulfill even a simple query. The benefit of using the snowflake schema in this example is that the storage requirements are lower, since the snowflake schema eliminates many duplicate values from the dimensions themselves.
SELECT B.Brand, G.Country, SUM (F.Units_Sold)
FROM Fact_Sales F (NOLOCK)
  INNER JOIN Dim_Date D (NOLOCK)             ON F.Date_Id = D.Id
  INNER JOIN Dim_Store S (NOLOCK)            ON F.Store_Id = S.Id
  INNER JOIN Dim_Geography G (NOLOCK)        ON S.Geography_Id = G.Id
  INNER JOIN Dim_Product P (NOLOCK)          ON F.Product_Id = P.Id
  INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id
  INNER JOIN Dim_Brand B (NOLOCK)            ON P.Brand_Id = B.Id
WHERE D.Year = 1997
  AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country

Common uses
Star and snowflake schemas are most commonly found in multidimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not normalized much and are frequently designed at a level of normalization short of third normal form.
The decision whether to employ a star schema or a snowflake schema should consider the relative strengths of the database platform in question and the query tool to be employed. Star schemas should be favored with query tools that largely expose users to the underlying table structures, and in environments where most queries are simpler in nature. Snowflake schemas are often better with more sophisticated query tools that isolate users from the raw table structures, and for environments having numerous queries with complex criteria.
Benefits of "snowflaking"

o Some OLAP multidimensional database modeling tools that use dimensional data marts as data sources are optimized for snowflake schemas.
o If a dimension is very sparse (i.e., most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database, and snowflaking may be appropriate.
o A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalized. A snowflake schema will therefore be easier to implement.
o A snowflake schema can sometimes reflect the way in which users think about data. Users may prefer to generate queries using a star schema in some cases, although this may or may not be reflected in the underlying organization of the database.
o Some users may wish to submit queries to the database which, using conventional multidimensional reporting tools, cannot be expressed within a simple star schema. This is particularly common in data mining of customer databases, where a common requirement is to locate common factors between customers who bought products meeting complex criteria. Some snowflaking would typically be required to permit simple query tools to form such a query, especially if provision for these forms of query was not anticipated when the data warehouse was first designed.

Fact Constellation schema


What is a fact constellation schema? For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into several star schemas, each describing facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables. The main shortcoming of the fact constellation schema is its more complicated design, because many variants for particular kinds of aggregation must be considered and selected. Moreover, the dimension tables are still large.
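A minimal sketch of the idea, reusing the illustrative star schema DDL from the star schema section: a second, hypothetical fact table for shipments shares the same dimension tables as Fact_Sales, and the two stars together form a constellation:

-- Hypothetical second fact table sharing the dimensions of Fact_Sales
CREATE TABLE Fact_Shipping (
    Date_FK       INTEGER REFERENCES Dim_Date(Date_PK),
    Store_FK      INTEGER REFERENCES Dim_Store(Store_PK),
    Product_FK    INTEGER REFERENCES Dim_Product(Product_PK),
    Units_Shipped INTEGER,
    PRIMARY KEY (Date_FK, Store_FK, Product_FK)
);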


Lecture 35

THE ETL (EXTRACT, TRANSFORM, LOAD) PROCESS
In this section we will discuss the four major processes of the data warehouse: extract (pull the data from the operational systems and bring it to the data warehouse), transform (convert the data into the internal format and structure of the data warehouse), cleanse (make sure the data is of sufficient quality to be used for decision making), and load (put the cleansed data into the data warehouse). The four processes from extraction through loading are often referred to collectively as data staging.

EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful in decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse. Many commercial tools are available to help with the extraction process; Data Junction is one such commercial product. The user of one of these tools typically has an easy-to-use windowed interface by which to specify the following:
(i) Which files and tables are to be accessed in the source database?
(ii) Which fields are to be extracted from them? This is often done internally by an SQL SELECT statement.
(iii) What are those fields to be called in the resulting database?
(iv) What is the target machine and database format of the output?
(v) On what schedule should the extraction process be repeated?
TRANSFORM
Operational databases are developed based on whatever set of priorities applies at the time, and these keep changing with the requirements. Therefore those who develop a data warehouse based on these databases are typically faced with inconsistency among their data sources. The transformation process deals with rectifying any such inconsistency.
One of the most common transformation issues is attribute naming inconsistency: it is common for a given data element to be referred to by different data names in different databases. Employee Name may be EMP_NAME in one database and ENAME in another. Thus one set of data names is picked and used consistently in the data warehouse. Once all the data elements have the right names, they must be converted to common formats. The conversion may encompass the following:
(i) Characters may have to be converted from ASCII to EBCDIC or vice versa.
(ii) Mixed text may be converted to all uppercase for consistency.
(iii) Numerical data must be converted into a common format.
(iv) Data formats have to be standardized.
(v) Measurements may have to be converted (e.g., Rs/$).
(vi) Coded data (Male/Female, M/F) must be converted into a common format.
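A hedged SQL sketch of a few of these conversions, assuming a hypothetical staging table stg_employee(ENAME, GENDER_CODE, SALARY_RS) and an assumed exchange rate:

-- Illustrative transform step: rename an attribute, normalize case,
-- decode a coded column, and convert a measurement
SELECT UPPER(ENAME)       AS EMP_NAME,   -- consistent name, uppercase text
       CASE GENDER_CODE
            WHEN 'M' THEN 'Male'
            WHEN 'F' THEN 'Female'
       END                AS GENDER,     -- coded data to a common format
       SALARY_RS / 83.0   AS SALARY_USD  -- assumed Rs-to-$ conversion rate
FROM stg_employee;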

All these transformation activities can be automated, and many commercial products are available to perform the tasks. DataMAPPER from Applied Database Technologies is one such comprehensive tool.
CLEANSING
Information quality is the key consideration in determining the value of information. The developer of the data warehouse is not usually in a position to change the quality of the underlying historic data, though a data warehousing project can put a spotlight on data quality issues and lead to improvements for the future. It is, therefore, usually necessary to go through the data entered into the data warehouse and make it as error-free as possible. This process is known as data cleansing. Data cleansing must deal with many types of possible errors, including missing data and incorrect data at one source, and inconsistent data and conflicting data when two or more sources are involved. The several algorithms followed to clean the data will be discussed in the coming lecture notes.
LOADING
Loading often implies the physical movement of the data from the computer(s) storing the source database(s) to the one that will store the data warehouse database, assuming they are different. This takes place immediately after the extraction phase. The most common channel for data movement is a high-speed communication link. For example, Oracle Warehouse Builder is the tool from Oracle which provides the features to perform the ETL tasks on an Oracle data warehouse. (A minimal load sketch follows the metadata list below.)
Metadata Repository
Administrative metadata:
o source databases and their contents
o gateway descriptions
o warehouse schema, view & derived data definitions
o dimensions, hierarchies
o pre-defined queries and reports
o data mart locations and contents
o data partitions
o data extraction, cleansing, and transformation rules, defaults
o data refresh and purging rules
o user profiles, user groups
o security: user authorization, access control
Business metadata:
o business terms and definitions
o ownership of data
o charging policies
Operational metadata:
o data lineage: history of migrated data and sequence of transformations applied
o currency of data: active, archived, purged
o monitoring information: warehouse usage statistics, error reports, audit trails
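As referenced in the LOADING section above, here is a minimal sketch of a load step, assuming the cleansed data sits in a hypothetical staging table stg_sales with the same columns as the warehouse fact table:

-- Illustrative load step: move cleansed staging rows into the warehouse
INSERT INTO fact_sales (date_key, store_key, product_key, amount)
SELECT date_key, store_key, product_key, amount
FROM stg_sales;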

Warehouse Design Tools
Creating and managing a warehouse is hard, and several classes of tools help:
o Development tools
- defining & editing metadata repository contents (schemas, scripts, rules)
- queries and reports
- shipping metadata to and from the RDBMS catalogue (e.g., Prism Warehouse Manager)
o Planning & analysis tools
- impact of schema changes
- capacity planning
- refresh performance: changing refresh rates or time windows

Warehouse Management Tools
o Monitoring and reporting tools (e.g., HP Intelligent Warehouse Advisor)
- which partitions, summary tables, and columns are used
- query execution times
- for summary tables, types & frequencies of roll-downs
- warehouse usage over time (to detect peak periods)
o Systems and network management tools (e.g., HP OpenView, IBM NetView, Tivoli): traffic, utilization
o Exception reporting/alerting tools (e.g., DB2 Event Alerters, Information Advantage InfoAgents & InfoAlert): runaway queries
o Analysis/visualization tools: OLAP on the metadata

OLAP Tools
Existing tools: Seagate, Brio, Cognos.
Functionality:
- choice of tables
- allowing the user to specify relationships between tables
- use of filtering conditions
- construction of cubes on the fly
Main problems: cost per license, poor semantics of aggregations across tables, performance for multiple-dimension cubes.
Visual OLAP tool: Tableau.

