You are on page 1of 23

DATA WAREHOUSING AND MINING

UNIT-II
DATA WAREHOUSING
Syllabus
Data Warehouse Components, Building a Data warehouse, Mapping Data Warehouse to a
Multiprocessor Architecture, Data Extraction, Clean up and Transformation Tools, Meta data
Objective
To know about data warehousing components
To know the considerations of building a data warehouse
How to map data warehouse to a multiprocessor architecture
To know about Data Extraction, Clean up and Transformation Tools
To know details about Meta data
Data Warehouse Components
1. Overall Architecture
The data warehousing architecture is based on a relational database management system server
that functions as the central repository for informational data. Typically the source data for the
warehouse is coming from the operational applications. As the data enters the data warehouse, it is
transformed into an integrated structure and format.
2. Data Warehouse Database
The central data warehouse database is a cornerstone of the data warehousing environment.
This data base is almost implemented on relational data base management system. Some of the
approaches needed for the data base are
Parallel relational database designs that require a parallel computing platform.
An innovative approach to speed up a traditional RDBMS MDDBS that are based on
proprietary data base technology or implemented using RDBMS
3. Sourcing, Acquisition, Clean Up and transformation Tools
These tools perform all of the conversion, summarization, key changes, structural changes and
condensation needed to transform disparate data into information that can be used by decision support
tool. These functionality includes
Removing unwanted data from operational databases
Converting to common data names and definitions
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating sources data definitions changes
These tools have to deal with some significant issues as follows
Data heterogeneity: This is the difference in the way how the data is defined and used in different
models
Database heterogeneity: DBMS are very different in data models data access language data
languages, data navigation and so on
4. Meta data
Meta data is data about data that describes the data warehouse.Meta data can be classified into
1. Technical metadata:
This contains information about data warehouse data for use by warehouse designers and
administratorswhen carrying out warehouse development and management tasks, Technical metadata
documents include
Information about data sources
Transformation descriptions
Warehouse object and data structure definitions for data targets.
The rules used to perform data clean up and data enhancement.
2. Business Meta data:
Contains information that gives users an easy-to-understand perspective of the information
stored in the data warehouse .Business metadata documents information about.
Subject areas and information object type, including queries, reports, images, video,
and/or audio clips.
Internet home pages.
5. Access Tools
Users interact with the data warehousing using front-end tools. Tools are divided into five
groups
Data query and reporting tools
Application development tools
Executive information system tools
Online analytical processing tools
Data mining tools
1. Query and Reporting Tools

This category is further divided into two groups: Reporting and Managed query tools
Reporting tools can be divided into production reporting tools and desktop report writers

Production reporting tools will let companies generate regular operational reports
or support high-volume batch jobs, such as calculating and paychecks
Report writers, on the other hand, are inexpensive desktop tools designed for end
user.
Managed query tools shield end users from the complexities of SQL and database structures by
inserting a metalayer between users and the database.
Metalayer
It is the software that provides subject oriented views of a database and supports point-and-
click creation of SQL.
2. Applications The tools require such a complex set of queries and sophisticated data models that
the business users may find themselves overwhelmed by the need to become SQL and/or data
modeling reports.
3. OLAP: These tools are based on the concepts of multi-dimensional data bases and allow a
sophisticated user to analyze the data using elaborate, multi dimensional complex views.

4. Data Mining

A critical success factor for any business today is its ability to use information
effectively. Data mining as the process of discovering meaningful new correlations, patterns,
and trends by digging into large amounts of data stored in warehouses, using AI and statistical
and mathematical techniques.
In these areas, data mining can reach beyond the capabilities of the OLAP, especially
since the major attraction of the data mining is its ability to build predictive rather than
retrospective models. Most organizations engage in data mining to
Discover Knowledge
Visualize data
Correct data
5. Data Visualization

Data visualization is not a separate class of tools; rather than it is a method of


presenting the output of the previously mentioned tools in such a way that the entire problem
or/and the solution.
Data visualization goes far beyond simple bar and pie charts. It is a collection of
complex techniques that currently represents an area of intense research.
6.Data Marts

Data marts are presented as an inexpensive alternative to a data warehouse that takes
significantly less time and money to build. It is a subsidiary to a data warehouse of integrated data. It
is created for a dedicated set of users.

A data mart is a set of de normalized, summarized or aggregated data. Data mart is separate
data base server, often on local area network serving a dedicated group of users.

Two types of data marts.

1. Dependent data mart: The data content is sourced from a data ware house, have a high value
because no matter how may are deployed and no matter how many different technology are
use.
2. Independent data mart: Unfortunately the misleading statements about the simplicity and low
cost of data marts sometimes result in organizations or vendors incorrectly positioning them as
an alternative to the data warehouse .This view point defines independent data marts.

The concept of independent data mart is dangerous one. Each enterprise will start to design their
own data marts with out integration. The complex many-to-one problem will be diverted to many-to-
many sourcing and management nightmare. Scalability of data mart is complex.

The business drivers underlying such developments include

Extremely urgent user requirement

The absence of a budget for a full data warehouse strategy

The absence of a sponsor for an enterprise wide decision support strategy

The decentralization of business units

The attraction of easy-to-use tools and a mind-sized project.

The recommended approach by Ralph Kimball is as follows. For any two data marts in an
enterprise, the common dimension must conform to the equality and roll-up rule. In summary data
mart presents two problems those are scalability and integration

6. Data warehouse Administration and Management

Managing data warehouse includes

Security and priority management

Monitoring updates from multiple sources

Data quality checks

Managing and updating meta data

Auditing and reporting data warehouse usage and status.

Purging data

Replicating sub setting and distributing data


Back up and recovery

Data warehouse storage management.

7. Information Delivery System

The information delivery component is used to enable the process of subscribing for data
warehouse information and having it delivered to one or more destinations of choice according to
some user-specified scheduling algorithm. Information delivery system distributes ware house stored
data and other information objects to other data warehouses and end user products

Building a Data warehouse

1. Business Considerations: Return on Investment

1. Approach

The subject oriented nature of the data warehouse determines the scope of the
information in the data warehouse. Organizations embarking on data warehousing
development can chose on of the two approaches

Top-down approach: Meaning that the organization has developed an enterprise data
model, collected enterprise wide business requirement, and decided to build an
enterprise data warehouse with subset data marts

Bottom-up approach: Implying that the business priorities resulted in developing


individual data marts, which are then integrated into the enterprise data warehouse.

2. Organizational Issues

The requirements and environments associated with the informational applications of a


data warehouse are different. Therefore an organization will need to employ different
development practices than the ones it uses for operational applications

2. Design Consideration

In general, a data warehouses design point is to consolidate data from multiple, often
heterogeneous, sources into a query data base. The main factors include
Heterogeneity of data sources, which affects data conversion, quality, time-liness

Use of historical data, which implies that data may be old

Tendency of database to grow very large

Data Content: Typically a data warehouse may contain detailed data, but the data is cleaned up
and transformed to fit the warehouse model, and certain transactional attributes of the data are filtered
out. The content and the structure of the data warehouses are reflected in its data model. The data
model is a template for how information will be organized with in the integrated data warehouse
framework.

Meta data: Defines the contents and location of the data in the warehouse, relationship between
the operational databases and the data warehouse, and the business view of the warehouse data that are
accessible by end-user tools. the warehouse design should prevent any direct access to the warehouse
data if it does not use meta data definitions to gain the access.

Data distribution: As the data volumes continue to grow, the data base size may rapidly
outgrow a single server. Therefore, it becomes necessary to know how the data should be divided
across multiple servers. The data placement and distribution design should consider several options
including data distribution by subject area, location, or time.

Tools: Data warehouse designers have to be careful not to sacrifice the overall design to fit to a
specific tool. Selected tools must be compatible with the given data warehousing environment each
other.

Performance consideration: Rapid query processing is a highly desired feature that should be
designed into the data warehouse.

Nine decisions in the design of a data warehouse:

1. Choosing the subject matter

2. Deciding what a fact table represents

3. Identifying and conforming the decisions

4. choosing the facts


5. Storing pre calculations in the fact table

6. Rounding out the dimension table

7. Choosing the duration of the data base

8. The need to track slowly changing dimensions

9. Deciding the query priorities and the query modes.

3. Technical Considerations

A number of technical issues are to be considered when designing and implementing a


data warehouse environment .these issues includes.

The hardware platform that would house the data warehouse.

The data base management system that supports the warehouse data base.

The communication infrastructure that connects the warehouse, data marts, operational
systems, and end users.

The hardware platform and software to support the meta data repository

The systems management framework that enables the centralized management and
administration of the entire environment.

4. Implementation Considerations

A data warehouse can not be simply bought and installed-its implementation requires the
integration of many products with in a data ware house.

Access tools

Data Extraction, clean up, Transformation, and migration

Data placement strategies

Meta data

User sophistication levels: Casual users, Power users, Experts


Mapping Data Warehouse to a Multiprocessor Architecture

1. Relational Data base Technology for data warehouse

The size of a data warehouse rapidly approaches the point where the search of a data warehouse
rapidly approaches the point where the search for better performance and scalability becomes a
real necessity. The search is pursuing two goals

Speed Up: the ability to execute the same request on the same amount of data in less time

Scale-Up: The ability to obtain the same performance on the same request as the data base size
increases.

1. Types of Parallelism

Parallel execution of tasks with in the SQL statements can be done in either of two ways.

Horizontal parallelism: Which means that the data base is partitioned across multiple disks
and the parallel processing occurs in the specific tasks, that is performed concurrently on
different processors against different sets of data

Vertical Parallelism: which occurs among different tasks all components query operations
are executed in parallel in a pipelined fashion. In other words an output from one task
becomes an input into another task as soon as records become available.
2. Data Partitioning

Data partitioning is a key requirement for effective parallel execution of data base operations.
It spreads data from data base tables across multiple disks so that I/O operations such as read
and write can be performed in parallel.

Random partitioning includes random data striping across multiple disks on single servers. In
round robin partitioning, each new record id placed on the new disk assigned to the data base.

Intelligent partitioning assumes that DBMS knows where a specific record id located and does
not waste time searching for it across all disks. This partitioning allows a DBMS to fully
exploit parallel architectures and also enables higher availability.

Intelligent partitioning includes

Hash Partitioning : Hash algorithm is used to calculate the partition number

Key range partitioning : Partitions are based on the partition key

Schema partitioning :Each table is placed in each disk, Useful for small references

User-defined partitioning: Tables are partitioned based on user defined expressions.

2. Database Architecture for parallel Processing

1. Shared-Memory Architecture

Also called as shared-everything style .Traditional approach to implement an RDBMS on


SMP hardware. Simple to implement. The key point of this approach is that a single RDBMS
server can potentially utilize all processors, access all memory, and access the entire
database, thus providing the user with a consistent single system image
2. Shared-disk Architecture

It implements the concept of shared ownership of the entire data base between RDBMS servers,
each of which is running on a node of distributed memory system. Each RDBMS server can
read, write, update and delete records from the same shared data base, which would require the
system to implement a form of distributed lock manager (DLM).

Pining:

In worst case scenario, if all nodes are reading and updating same data, the RDBMS and its
DLM will have to spend a lot of resources synchronizing multiple buffer pool. This problem is
called as pining

Data skew: Un even distribution of data

Shared-disk architectures can reduce performance bottle-necks resulting from data skew
3. Shared-Nothing Architecture

The data is partitioned across many disks, and DBMS is partitioned across multiple
conservers, each of which resides on individual nodes of the parallel system and has an
ownership of its own disk and thus, its own data base partition.

It offers non-linear scalability These requirements includes

Support for function shipping

Parallel join strategies

Support for data repartitioning

Query compilation

Support for data base transactions

Support for the single system image of the data base environment.
4. Combined Architecture

Interserver parallelism of the distributed memory architecture means that each query is
parallelized across multiple servers. While intraserver parallelism of the shared memory
architecture means that a query is parallelized with in the server.

3. Parallel RDBMS Feature

Some of the demands from the DBMS vendors are

Scope and techniques of parallel DBMS operations

Optimized implementation

Application transparency

The parallel environment

DBMS Management tools

Price/Performance
4. Alternative Technologies

In addition to parallel data base technology, a number of vendors are working on other solutions
improving performance in data warehousing environments. These includes

Advanced database indexing products

Specialized RDBMS designed specially for the data warehousing

Multidimensional data bases

5. Parallel DBMS Vendors

1. Oracle

2. Informix

3. IBM

4. Sybase

5. Microsoft

Data Extraction, Clean up and Transformation Tools

1. Tool Requirements

The tools that enable sourcing of the proper data contents and formats from operational
and external data stores into the data warehouse have to perform a number of important
tasks that include

Data transformation from one format to another on the basis of possible differences
between the source and target platforms

Data consolidation and integration, which may include combining several source
records into a single record to be loaded into the warehouse.

Meta data synchronization and management and calculation based on the application of
the business rules that force certain transformation.
2. Vendor Approaches

The tasks of capturing data from a source data system, cleaning and transforming it, and then
loading the results into a target data system can be carried out either by separate products, or
by single integrated solution.

Code generator

Data base data replication tools

Rule-driven dynamic transformation engines

3. Access to Legacy Data

The middleware strategy is the foundation for the tools such as Enterprise/Access from Apertus
Corporation.

The data layer provides data access and transaction services for management of
corporate data asserts

The process layer provides services to manage automation and support for current
business processes.

The user layer manages user interaction with process and/or data layer services. It
allows the user interfaces to change independently of the underlying business
processes.

Meta data
1. Meta data-definition

Meta data is one of the most important aspects of data warehousing. It is data about data stored in the
warehouse and its users.

Meta data contains

The location of and description of the warehouse system and data components

Names, definition, structure and content of the data warehouse and end user views

Identification of authoritative data sources


Integration and transformation rules used to populate the data warehouse; these include the
mapping method from operational data bases into the warehouse, and algorithm used to
convert

Integration and transformation rules used to deliver data to end-user analytical tools

2. Meta data interchange initiative

A MetaData Coalition Introduction

The MetaData Coalition was founded by a group of industry-leading vendors aimed at defining
a tactical set of standard specifications for the access and interchange of meta data between different
software tools. What follows is an overview of Version 1.0 of the MetaData Interchange Specification
(MDIS) initiative taken by the MetaData Coalition. Goals of the MetaData Interchange Specification
Initiative

Situation Analysis

The volatility of our global economy and an increasingly competitive business climate are
driving companies to leverage their existing resources in new, creative, and more effective ways.
Enterprise data, once viewed as merely fodder for the operational systems that ran the day-to-day
mechanics of business, is now being recognized not only as one of these valuable resources but as a
strategic business asset.

However, as the rate of change continues to accelerate-in response to both business pressures
and technological advancement-managing this strategic asset and providing timely, accurate, and
manageable access to enterprise data becomes increasingly critical. This need to find faster, more
comprehensive and efficient ways to provide access to and manage enterprise data has given rise to a
variety of new architectures and approaches, such as data warehouses, distributed client/server
computing, and integrated enterprise-wide applications.

In these new environments, meta data, or the information about the enterprise data, is emerging
as a critical element in effective data management. Vendors as well as users have been quick to
appreciate the value of meta data, but the rapid proliferation of data manipulation and management tools
has resulted in almost as many different "flavors" and treatments of meta data as there are tools.
The Challenge

To enable full-scale enterprise data management, different tools must be able to freely and easily
access, and in some cases manipulate and update, the meta data created by other tools and stored in a
variety of different storage facilities. The only viable mechanism to enable disparate tools from
independent vendors to exchange this meta data is to establish at least a minimum common denominator
of interchange specifications and guidelines to which the different vendors' tools can comply.

Establishing and adhering to a core set of industry meta data interchange specifications will
enable IS managers to select what they perceive as "best of breed" to build the tool infrastructure that
best fits their unique environment needs. In choosing the interchange-compliant tools, they can be
assured of the accurate and efficient exchange of meta data essential to meeting their users' business
information needs.

The MetaData Coalition was established to bring industry vendors and users together to
address a variety of difficult problems and issues with regard to exchanging, sharing, and managing
meta data. This is intended as a coalition of interested parties with a common focus and shared goals,
not a traditional standards body or regulatory group in any way.

Terminology and Basic Assumptions

The MetaData Interchange Specification (MDIS) draws a distinction between: The Application
Metamodel - the tables, etc., used to "hold" the meta data for schemas, etc., for a particular
application; for example, the set of tables used to store meta data in Composer may differ significantly
from those used by the Bachman Data Analyst.

The MetaData Metamodel - the set of objects that the MetaData Interchange Specification can be
used to describe. These represent the information that is common (i.e., represented) by one or more
classes of tools, such as data discovery tools, data extraction tools, replication tools, user query tools,
database servers, etc. The meta data metamodel should be:

Independent of any application metamodel


Character-based so as to be hardware/platform-independent
Fully qualified so that the definition of each object is uniquely identified

Basic Assumptions

The MetaData Coalition has made the following assumptions:

Because users' information needs are growing more complex, the IS organization would ideally
like the interchange specification to support (to the greatest extent possible) the bidirectional
interchange of meta data so that updates can be made in the most natural place. For example,
the user might initially specify the source-to-target mapping between a legacy database and a
RDBMS target in a CASE tool but, after using a data extraction tool to generate and execute
programs to actually move the data, discover that the mapping was somehow incorrect. The
most natural place to test out the "fix" to this problem is in the context of the data extraction
tool. Once the correction is verified, one updates the metamodel in the CASE tool, rather than
having to go to the CASE tool, change the mapping, and trigger the meta data interchange
between the CASE tool and the data extraction tool before being able to test the new mapping.
Vendors would like to support the MetaData Interchange Specification with a minimum amount
of additional development. In light of these assumptions, the meta data model must be
sufficiently extensible to allow a vendor to store the entire metamodel for any application. In
other words, MDIS should provide mechanisms for extending the meta data model so that
additional (and possibly encrypted) information can be passed. An example of when a vendor
might want encryption is in the case of a tool that generates parameters for invoking some
internal routine. Because these parameters might provide other vendors with information
regarding what is considered a proprietary part of their tool, the vendor may wish to encrypt
these parameters.

If one assumed that all updates to the model occurred in the context of a single tool, e.g., the CASE
tool in the example above, the MDIS would not benefit from "carrying along" any of the tool-specific
meta data. However, as the above example indicates, this assumption is not the "natural" meta data
interchange flow. Consequently, some type of mechanism for providing extensions to the type of
information exchanged by the interchange specification is necessary if one hopes to achieve
bidirectional interchange between vendor applications.
The MetaData Interchange Framework

For Version 1.0, the MetaData Council is recommending the ASCII-based batch approach so
that vendors can implement support for the specification with minimum overhead and the customer
benefits from the availability of meta data interchange as quickly as possible.

ASCII Batch Approach

An ASCII Batch approach relies on the ASCII file format that contains the description of the
common meta data components and standardized access requirements that make up the interchange
specification meta data model. In this approach, the entire ASCII file containing the MDIS schema and
access parameters is reloaded whenever a tool accesses the meta data through the specification API.

This approach requires only the addition of a simple import/export function to the tools and
would not require updating the tool in the event of meta data model changes, because the most up-to-
date schema will always be available through the access framework. This eliminates the amount of
retrofitting required to enable tools to remain compliant with the MDIS, because the burden for update
stays primarily within the framework itself.

The MetaData Interchange Specification

There are two basic aspects of the proposed specification:

Those that pertain to the semantics and syntax used to represent the meta data to be exchanged.
These items are those that are typically found in a specifications document.
Those that pertain to some framework in which the specification will be used. This second set
of items is two file-based semaphores that are used by the specification's import and export
functions to help the user of the specification control consistency.
Components defining the semantics and syntax that define the specification:

The Metamodel

The MetaData Interchange Specification Metamodel describes the entities and relationships that
are used to directly represent meta data in the MDIS. The goal in designing this metamodel is twofold:
To choose the set of entities and relationships that represents the objects that the majority of
tools require.
To provide some mechanism for extensibility in the case that some tool requires the
representation of some other type of object. Section 5 describes the metamodel for Version 1.0
of the MetaData Interchange Specification. In the rest of this document the entities that are
directly represented by the specification are referred to as objects in the "public view," while
any other meta data stored in the interchange file is referred to as "private meta data" (i.e., tool-
specific meta data).

The Mechanism for Extending the Metamodel

The mechanism chosen to provide extensibility to the specification is analogous to the


"properties" object found in LISP environments: a character field of arbitrary length that consists of a
set of identifiers and a value, where the identifiers are used by the import function of the specification to
locate and identify the private meta data in question and the value is the actual meta data itself. Note
that because some tools may consider their private meta data proprietary, the actual value for this meta
data may be encrypted.

The MDIS Access Framework

Version 1.0 of the MDIS includes information which will support a bidirectional flow of meta
data while maintaining meta data consistency.

Three types of information are required:

Versioning information in the header of the file containing the meta data
A Tool Profile which describes what type of data elements a tool directly represents and/or
updates
A Configuration Profile which describes the "legal flow of meta data." For example, although
source-to-target mapping may be specified in the context of some analysis tool, once that meta
data has been exported to ETI*EXTRACT and the mapping is changed because of errors found
in expected data, one may want to require that all future changes to mapping originate in
ETI*EXTRACT. If the configuration profile is set properly, the import function for
ETI*EXTRACT would err off if asked to import a conversion specification from the analysis
tool with a version number greater than the version number of the one originally imported from
the mapping tool.

The components of the meta data interchange standard frameworks are

The standard meta data model

The standard access framework

Tool profile

The user configuration

3. Meta data repository

The data warehouse architecture framework represents a higher level of abstraction than the
meta data interchange standard framework and by design. The warehouse design should
prevent any direct access to the warehouse data if it does not use Meta data definitions to gain
the access

The framework provides the following benefits

It provides a comprehensive suite of tools for enterprise wide meta data management

It reduces and eliminates information redundancy, inconsistency, and under utilization.


It simplifies management and improves organization, control, and accounting of
information assets.

It increases identification, understanding, coordinating and utilization of enterprise


wide information assets

It provides effective data administration tools to better manage corporate information


assets with full-function data dictionary

It increases flexibility, control, and reliability of the application development process


and accelerates internal application development.

It leverages investment in legacy systems with the ability inventory and utilize existing
application

It provides a universal relational model for heterogeneous RDBMS to interact and


share information.

It enforces CASE development standards and eliminates redundancy with bthe ability
to share and reuse meta data.
4. Meta data Management

A frequently occurring problem in data warehousing is the inability to communicate to the end
user what information resides in the data warehouse how it can be accessed.

The key to providing users and applications with a roadmap to the information stored in the
warehouse is the meta data.

It defines all data elements and their attributes, data sources and timing, and the rules that govern
data use and data transformation. Meta data needs to be collected as the warehouse is designed and
built. Must enforce integrity and redundancy.

SUMMARY

This unit covers the basics components of data warehousing. This includes the architecture of
data warehousing, components, building data warehousing

Data marts are presented as an inexpensive alternative to a data warehouse that takes
significantly less time and money to build. It is a subsidiary to a data warehouse of integrated
data. It is created for a dedicated set of users.

The information delivery component is used to enable the process of subscribing for data
warehouse information and having it delivered to one or more destinations of choice according
to some user-specified scheduling algorithm. Information delivery system distributes ware
house stored data and other information objects to other data warehouses and end user products

KEY TERMS

Data base components-Parallelism-Metadata-Shared Memory Architecture-Shared disk


architecture-Shared nothing architecture-Data Clean up, Transformation tools.