
Visit: www.geocities.com/chinna_chetan05/forfriends.html

WHAT IS A DATA WAREHOUSE?


A data warehouse, in its simplest sense, is no more than a collection of the key
pieces of information used to manage and direct the business for the most profitable
outcome. In other words, a data warehouse is the data (meta/fact/dimension/aggregation)
and the process managers (load/warehouse/query) that make information available,
enabling people to make informed decisions.

DEFINITION:

“A data warehouse is a subject-oriented, integrated, time-varying, non-volatile
collection of data in support of the management’s decision-making process.”

SUBJECT-ORIENTED:

Data are organized according to subject instead of application. For example, an
insurance company using a data warehouse would organize its data by customer,
premium, and claim instead of by product (auto, life, etc.). Data organized by
subject contain only the information necessary for decision-support processing.

NON-VOLATILE:

A data warehouse is always a physically separate store of data, which is transformed
from the application data found in the operational environment. The data are not
updated or changed in any way once they enter the data warehouse, but are only
loaded, refreshed and accessed for queries.

TIME-VARYING:

The data warehouse contains a place for storing data that are 5 to 10 years old, or
older, to be used for comparisons, trends and forecasting.

Email: chinna_chetan05@yahoo.com

INTEGRATED:

A data warehouse is usually constructed by integrating multiple, heterogeneous

sources such as relational databases, flat files, and OLTP files. Data cleaning and data

integration techniques are applied to maintain consistency in naming convention,

measures of variables, encoding structure, and physical attributes.
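The kind of consistency described above can be sketched in a few lines. This is only an illustration: the two source layouts, the field names and the gender codes are hypothetical, and a real integration step would cover many more attributes.

```python
# Hypothetical records from two heterogeneous sources (all names are invented).
relational_rows = [{"cust_id": 1, "gender": "M", "premium_usd": 120.0}]
flat_file_rows = [{"CustomerNo": "2", "sex": "female", "premium_cents": "9550"}]

# One encoding structure for the warehouse, whatever each source uses.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def from_relational(row):
    """Map a relational-source row to the warehouse naming convention."""
    return {
        "customer_id": int(row["cust_id"]),
        "gender": GENDER_CODES[row["gender"].lower()],
        "premium_usd": float(row["premium_usd"]),
    }

def from_flat_file(row):
    """Map a flat-file row: rename fields, recode gender, convert cents to dollars."""
    return {
        "customer_id": int(row["CustomerNo"]),
        "gender": GENDER_CODES[row["sex"].lower()],
        "premium_usd": int(row["premium_cents"]) / 100,
    }

def integrate(relational, flat):
    """Combine both sources into one consistently named and measured list."""
    rows = [from_relational(r) for r in relational]
    rows += [from_flat_file(r) for r in flat]
    return sorted(rows, key=lambda r: r["customer_id"])

warehouse_rows = integrate(relational_rows, flat_file_rows)
```

After integration every row uses the same field names, the same gender codes and the same unit of measure, regardless of where it came from.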

DATA WAREHOUSE ARCHITECTURE:


A data warehouse must be architected to support three major driving factors:

• Populating the warehouse,

• Day-to-day management of the warehouse,

• The ability to cope with requirements evolution.

Based on the logical data model of the data warehouse, we shall see the popular 3-tier
architecture and the components of the warehouse at each layer. Tier 1 is essentially
the warehouse server, Tier 2 is the OLAP engine for analytical processing, and Tier 3
is a client containing reporting tools, visualization tools, data mining tools, query
tools, etc. There is also the back-end process, which is concerned with extracting data
from multiple operational databases and external sources; with cleaning, transforming
and integrating data for loading into the data warehouse server; and, of course, with
periodically refreshing the warehouse.

Tier 1 contains the main data warehouse. It can follow one of three models, or some
combination of them: a single enterprise warehouse, several departmental data marts,
or a virtual warehouse. Tier 2 can be designed in three different ways, namely ROLAP,
MOLAP and extended SQL OLAP.
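As a rough illustration of the ROLAP idea, in which analytical queries are answered from relational tables, the sketch below uses Python's built-in sqlite3 module as a stand-in for the warehouse server. The fact and dimension tables, their columns and the sales_by helper are assumptions made for this example only.

```python
import sqlite3

# In-memory database standing in for the Tier 1 warehouse server (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (product_id INTEGER, year INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'auto'), (2, 'life');
INSERT INTO fact_sales VALUES (1, 2004, 100.0), (1, 2005, 150.0), (2, 2005, 80.0);
""")

def sales_by(conn, *dimensions):
    """Translate a request for 'sales by these dimensions' into a GROUP BY query."""
    allowed = {"category", "year"}
    if not dimensions or not set(dimensions) <= allowed:
        raise ValueError("dimensions must be chosen from 'category' and 'year'")
    cols = ", ".join(dimensions)
    sql = (
        f"SELECT {cols}, SUM(amount) FROM fact_sales "
        "JOIN dim_product USING (product_id) "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()

by_category = sales_by(conn, "category")
by_category_and_year = sales_by(conn, "category", "year")
```

Asking for fewer dimensions gives a more summarized answer, while adding a dimension gives a more detailed one.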

WAREHOUSE SERVER:

As mentioned earlier, there are three data warehouse models.


ENTERPRISE WAREHOUSE:

This model collects all the information about the subjects, spanning the entire
organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers. An enterprise data warehouse
has traditionally required a mainframe-class system.

DATA MARTS:

Data marts are partitions of the overall data warehouse. A data mart is a subset of a
data warehouse built specifically for a department. Data marts may also contain some
overlapping data. The physical data marts together serve as the conceptual data
warehouse. These marts must provide the easiest possible access to the information
required by their user communities.

STAND-ALONE DATA MART:

This approach enables a department to implement a data mart with minimal or no

impact on the enterprise’s operational database.

DEPENDENT DATA MART:

Here the management of the data sources by the enterprise database is required. These
data sources include operational databases and external sources of data.

VIRTUAL DATA WAREHOUSE:

In a virtual warehouse, we have a logical description of all the databases and their
structures, and individuals who want to get information from those databases do not
have to know anything about them. This approach creates a single “virtual database”
from all data sources. The data sources can be local or remote. In this type of data
warehouse, the data is not moved from the sources. Instead, the user gets direct
access to the data.


A virtual database is easy and fast to build, but it is not without problems. Since
the queries must compete with the production data transactions, query performance can
be considerably degraded. Since there is no metadata, no summary data and no history,
all the queries must be repeated, creating an additional burden on the system.

METADATA: “data about data”

Metadata provides a catalogue of data in the data warehouse and the pointers to this
data.

A metadata repository should contain:

• A description of the structure of the data warehouse.

• Operational metadata, such as data linkages, currency of data and monitoring

information.

• The summarization processes which include dimension definition, partitions,

aggregation, etc.

• Details of data sources.

• Data related to system performance.

• Business metadata, which includes business terms and definitions, and

changing policies.

TYPES OF METADATA:

Due to the variety of metadata, it is necessary to categorize it into different types
based on how it is used.

1. BUILD-TIME METADATA:

Whenever we design and build a warehouse, the metadata that we generate can be
termed build-time metadata. This metadata links business and warehouse terminology
and describes the data’s technical structure. It is the primary source of most of the
metadata used in the warehouse.


2. USAGE METADATA:

When the warehouse is in production, usage metadata, which is derived from
build-time metadata, is an important tool for users and data administrators. This
metadata is used differently from build-time metadata, and its structure must
accommodate this fact.

3. CONTROL METADATA:

Most control metadata is of interest only to system programmers. However, one
subset, which is generated and used by the tools that populate the warehouse, is of
considerable interest to users and data warehouse administrators. It provides vital
information about the timeliness of warehouse data and lets users track the sequence
and timing of warehouse events.

DATA WAREHOUSE PROCESS MANAGERS:

The data warehouse process managers are pieces of software responsible for the flow,
maintenance and upkeep of the data, both into and out of the data warehouse
database.

There are three different data warehouse process managers:
• LOAD MANAGER
• WAREHOUSE MANAGER
• QUERY MANAGER

LOAD MANAGER:

The load manager is responsible for any data transformation required and for the
loading of data into the database. The responsibilities are as follows:
• Data source interaction
• Data transformation
• Data load
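The three responsibilities above can be sketched as three small functions. This is a minimal illustration, assuming a CSV extract as the data source and an in-memory SQLite table as the target; the column names and the staging_policy table are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical extract delivered by a source system.
SOURCE_CSV = "policy_no,premium,start\nP-1, 120.50 ,2005-01-31\nP-2,80,2005-02-15\n"

def extract(text):
    """Data source interaction: read the raw records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Data transformation: trim stray whitespace and convert types."""
    return [
        (r["policy_no"].strip(), float(r["premium"]), r["start"].strip())
        for r in rows
    ]

def load(conn, rows):
    """Data load: insert the cleaned rows and report how many are now stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_policy "
        "(policy_no TEXT, premium REAL, start_date TEXT)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO staging_policy VALUES (?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM staging_policy").fetchone()[0]

conn = sqlite3.connect(":memory:")
loaded = load(conn, transform(extract(SOURCE_CSV)))
```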

WAREHOUSE MANAGER:

The warehouse manager is responsible for maintaining the data while it is in the data
warehouse. The responsibilities are listed below:
• Data movement
• Metadata management
• Performance monitoring and tuning
• Data archiving

QUERY MANAGER:

The query manager has several distinct responsibilities:
• User access to the data
• Query scheduling
• Query monitoring
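A toy version of these three responsibilities is sketched below. The QueryManager class, its priority scheme and the user names are all assumptions for illustration; a production query manager would be considerably more involved.

```python
import heapq
import itertools

class QueryManager:
    """Hypothetical query manager: checks access, schedules and records queries."""

    def __init__(self, allowed_users):
        self.allowed_users = set(allowed_users)
        self._queue = []                  # heap of pending queries
        self._ticket = itertools.count()  # keeps equal priorities in arrival order
        self.history = []                 # monitoring: what ran, for whom

    def submit(self, user, sql, priority=5):
        """User access: only known users may queue a query (lower number runs first)."""
        if user not in self.allowed_users:
            raise PermissionError(f"{user} may not query the warehouse")
        heapq.heappush(self._queue, (priority, next(self._ticket), user, sql))

    def run_all(self, execute):
        """Query scheduling: run pending queries in priority order."""
        results = []
        while self._queue:
            priority, _, user, sql = heapq.heappop(self._queue)
            results.append(execute(sql))
            self.history.append((user, sql, priority))
        return results

manager = QueryManager(allowed_users={"analyst", "director"})
manager.submit("analyst", "monthly trend report", priority=5)
manager.submit("director", "sales summary", priority=1)
executed = manager.run_all(lambda sql: f"ran: {sql}")
```

The query submitted second runs first because it carries the higher priority, and the history list records the order in which queries actually ran.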

SECURITY:

Security can affect many different parts of the data warehouse, such as:
• User access
• Data load
• Data movement
• Query generation

PERFORMANCE IMPACT OF SECURITY:

Security also costs. Any security that is implemented will cost in terms of either

processing power or disk space, or both.

VIEWS:

Views are a standard RDBMS mechanism for applying restrictions to data access.
Using views, however, brings some common restrictions of its own:
• Restricted DML operations
• Lost query optimization paths
• Restrictions on parallel processing of view projections
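A small sketch of both sides of this trade-off, using SQLite through Python's sqlite3 module. The claims table and the north_claims view are hypothetical. SQLite has no user accounts, so the sketch only shows how a view narrows what a query can see, and that plain SQLite views reject DML.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claims (claim_id INTEGER, region TEXT, customer TEXT, amount REAL);
INSERT INTO claims VALUES (1, 'north', 'Asha', 500.0), (2, 'south', 'Ravi', 900.0);
-- The view exposes only one region and hides the customer column.
CREATE VIEW north_claims AS
    SELECT claim_id, amount FROM claims WHERE region = 'north';
""")

# A user limited to the view sees a restricted slice of the data.
visible = conn.execute("SELECT * FROM north_claims").fetchall()

# Restricted DML: SQLite views are read-only unless INSTEAD OF triggers are defined.
try:
    conn.execute("INSERT INTO north_claims VALUES (3, 10.0)")
    dml_allowed = True
except sqlite3.OperationalError:
    dml_allowed = False
```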

DATA MOVEMENT:

Because of the volumes of data being handled in the data warehouse, data movement
is an expensive process, in terms of both resources and time.

There are a number of different ways in which bulk data movements can occur:
1. Data loads
2. Aggregation creation
3. Results temporary tables
4. Data extracts

AUDITING:

Clearly any auditing that has to be performed will have a CPU impact, because

each audited action will require some code to be run. Auditing also requires disk

space.
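Both costs can be seen in a minimal sketch, assuming a hypothetical audited decorator and an in-memory list standing in for the audit trail on disk: every audited call runs extra code, and every call leaves a record that has to be stored.

```python
import functools
import json
import time

audit_log = []  # stands in for an audit table or file on disk (an assumption)

def audited(action):
    """Wrap a function so that each call appends one audit record."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user, *args, **kwargs):
            started = time.perf_counter()
            result = func(user, *args, **kwargs)
            # Extra code run for every audited action: the CPU cost.
            record = {
                "user": user,
                "action": action,
                "seconds": time.perf_counter() - started,
            }
            audit_log.append(json.dumps(record))
            return result
        return wrapper
    return decorator

@audited("query")
def run_query(user, sql):
    """Placeholder for real query execution."""
    return f"rows for {sql}"

run_query("analyst1", "SELECT 1")

# Every record takes up space: the disk cost.
audit_bytes = sum(len(line) for line in audit_log)
```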

BACKUP AND RECOVERY:

Backup is one of the most important regular operations carried out on any system.

BACKUP STRATEGIES:

1. EFFECT ON DATABASE DESIGN:

There is a major interaction between the backup strategy and the database design. The
two go hand in hand. Data warehouses are such large and complex systems that backup
should be an integral part of the system. We need to design the whole data warehouse
system in a unified fashion. It is particularly important to manage the design of the
backup, the database and the overnight processing together.

2. DESIGN STRATEGIES:

Read-only tables are one of the main weapons in the battle to reduce the amount of
data that needs to be backed up.

Another way of reducing the regular backup requirements is to reduce the amount of
journaling or redo generated. This is possible with some RDBMSs, because they
allow you to turn off logging for certain operations.

RECOVERY STRATEGIES:

The recovery strategy will be built around the backup strategy. Any recovery situation
naturally implies that some failure has occurred.

Whatever software we choose, the recovery steps for the failure scenarios below need
to be fully documented:
• Instance failure
• Media failure
• Loss or damage of table space
• Loss or damage of redo log files
• Loss or damage of archive log files
• Failure during data movements
• And others

There are a number of data movement scenarios that need to be covered:
• Data load into staging tables
• Movement from staging to fact table
• Partition roll-up into larger partitions
• Creation of aggregations
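One possible way to make a movement from staging to fact table recoverable is to run it as a single transaction, so that a failure part-way through leaves both tables as they were and the step can simply be repeated. The sketch below assumes SQLite and hypothetical staging_sales and fact_sales tables; other RDBMSs offer their own mechanisms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging_sales (sale_id INTEGER, amount REAL);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, amount REAL NOT NULL);
INSERT INTO staging_sales VALUES (1, 10.0), (2, NULL), (3, 30.0);
""")

def move_staging_to_fact(conn):
    """Copy staging rows to the fact table and clear staging, all or nothing."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO fact_sales SELECT sale_id, amount FROM staging_sales"
            )
            conn.execute("DELETE FROM staging_sales")
        return True
    except sqlite3.IntegrityError:
        return False

# The NULL amount violates the fact table's NOT NULL constraint, so nothing moves.
first_attempt = move_staging_to_fact(conn)
rows_after_failure = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]

# Recovery step: repair the bad staging row, then rerun the same movement.
with conn:
    conn.execute("UPDATE staging_sales SET amount = 20.0 WHERE sale_id = 2")
second_attempt = move_staging_to_fact(conn)
```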

DISASTER RECOVERY:

Recovering from a disaster requires the following:
• Replacement / standby systems
• Sufficient tape and disk capacity
• Communication links to users
• Communication links to data sources
• Copies of all relevant pieces of software
• Backup of database
• Application-aware systems administration and operations staff

DATA WAREHOUSE APPLICATIONS:

1. SALES ANALYSIS:
• Determine real-time product sales
• Analyze historical product sales
• Evaluate successful products and determine key success factors
• Rapidly identify preferred customer segments
• Quickly isolate past preferred customers who no longer buy

2. FINANCIAL ANALYSIS:
• Compare actuals to budgets on a timely basis
• Review past cash flow trends and forecast future needs
• Identify and analyze key expense generators
• Receive near-real-time, interactive financial statements

3. HUMAN RESOURCE ANALYSIS:
• Evaluate trends in benefit program use
• Identify wage and benefit costs to determine company-wide variation
• Review compliance levels for EEOC and other regulated activities

4. OTHER AREAS:
• Warehouses have also been applied to areas such as logistics, inventory,
purchasing, detailed transaction analysis, and load balancing.

DATA MINING:
Data mining is the non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data. Data mining attempts to seek out
patterns and trends in the data and infers rules from these patterns.

DEFINITIONS:

The term ‘data mining’ refers to the finding of relevant and useful information from

databases. A few definitions are given below:

“Data mining or knowledge discovery in databases, as it is known, is the non-

trivial extraction of implicit, previously unknown and potentially useful

information from the data. This encompasses a number of technical approaches,

such as clustering, data summarization, classification, finding dependency

networks, analyzing changes, and detecting anomalies.”

“Data mining is the search for the relationships and global patterns that exist in

large databases but are hidden among vast amounts of data, such as the

relationship between patient data and their medical diagnosis. This relationship

represents valuable knowledge about the database, and the objects in the

database, if the database is a faithful mirror of the real world registered by the

database.”

“Data mining is the process of discovering meaningful, new correlation patterns
and trends by sifting through large amounts of data stored in repositories, using
pattern recognition techniques as well as statistical and mathematical
techniques.”


DATA MINING TECHNIQUES:

Researchers identify two fundamental goals of data mining:
• Prediction
• Description

Prediction makes use of existing variables in the database in order to predict

unknown or future values of interest.

Description focuses on finding patterns describing the data and the subsequent

presentation for user interpretation.
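The two goals can be contrasted on a toy yearly sales series. The figures are invented, and a least-squares straight line is only one simple stand-in for a predictive model.

```python
from statistics import mean

# Invented yearly sales figures: (year, sales).
history = [(2001, 100.0), (2002, 110.0), (2003, 120.0), (2004, 130.0)]

def fit_line(points):
    """Prediction: fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in points)
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return slope, y_bar - slope * x_bar

def describe(points):
    """Description: summarize the existing data for the user to interpret."""
    ys = [y for _, y in points]
    return {"min": min(ys), "max": max(ys), "mean": mean(ys)}

slope, intercept = fit_line(history)
forecast_2005 = slope * 2005 + intercept  # an unknown future value of interest
summary = describe(history)
```

The forecast uses existing variables to estimate a value that is not yet in the database, while the summary only restates what the data already contains.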

DM techniques can be classified as:
• User-guided or verification-driven data mining
• Discovery-driven or automatic discovery of rules

Most techniques of data mining have elements of both models.

VERIFICATION MODEL:

In this process of data mining, the user makes a hypothesis and tests the hypothesis

on the data to verify its validity. The emphasis is on the user who is responsible for

formulating the hypothesis and issuing the query on the data to affirm or negate the

hypothesis.
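A minimal sketch of the verification model: the user supplies the hypothesis (here, that customers under 25 file more claims on average than older customers) and the system only runs the query that checks it. The records are invented, and a real test would also ask whether the difference is statistically significant.

```python
# Invented customer records.
customers = [
    {"age": 22, "claims": 3},
    {"age": 24, "claims": 2},
    {"age": 31, "claims": 1},
    {"age": 40, "claims": 1},
    {"age": 55, "claims": 0},
]

def average_claims(rows, predicate):
    """Query issued by the user: mean number of claims for the selected rows."""
    selected = [r["claims"] for r in rows if predicate(r)]
    return sum(selected) / len(selected)

# The user formulates the hypothesis; the data affirms or negates it.
young = average_claims(customers, lambda r: r["age"] < 25)
others = average_claims(customers, lambda r: r["age"] >= 25)
hypothesis_supported = young > others
```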

DISCOVERY MODEL:

The discovery model differs in its emphasis, in that the system automatically
discovers important information hidden in the data. The data is sifted in search of
frequently occurring patterns, trends and generalizations about the data without
intervention or guidance from the user.

The typical discovery-driven tasks are:
• Discovery of association rules
• Discovery of classification rules
• Clustering
• Discovery of frequent episodes
• Deviation detection

These tasks are of an exploratory nature and cannot be directly handed over to

currently available database technology.
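To give a feel for one of these tasks, the sketch below discovers simple one-to-one association rules from a handful of invented market-basket transactions using the usual support and confidence measures. It checks every pair of items exhaustively, which is workable only for tiny examples; it is not an efficient algorithm such as Apriori.

```python
from itertools import combinations

# Invented market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def find_rules(transactions, min_support, min_confidence):
    """Return rules (lhs, rhs, support, confidence) meeting both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            pair_support = support({lhs, rhs}, transactions)
            if pair_support < min_support:
                continue
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found

rules = find_rules(transactions, min_support=0.5, min_confidence=0.9)
```

With these thresholds the only rule found is "butter implies bread": every basket containing butter also contains bread, and the pair appears in half of all baskets. No hypothesis was supplied by the user.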

MINING PROBLEMS:

A data mining system can either be a portion of a data warehousing system or a
stand-alone system. Data for data mining need not always be enterprise-related data
residing in a relational database. Data sources are very diverse and appear in varied
forms. They can be textual data, image data, CAD data, map data, ECG data or the
much-talked-about genome data.

The DM problems for different types of data:

SEQUENCE MINING:

Sequence mining is concerned with mining sequence data. In the discovery of
association rules, we are interested in finding associations between items irrespective
of their order of occurrence; in sequence mining, the order in which items or events
occur is what matters. Another related area, which falls into the larger domain of
temporal data mining, is trend discovery. One characteristic of sequence-pattern
discovery in comparison with trend discovery is the lack of shapes, since the causal
impact of a series of events cannot be shaped.
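The role of order can be made concrete with a small sketch that counts how often one event is followed, at any later point, by another. The event sequences are invented, and counting ordered pairs is only the simplest possible form of sequential pattern.

```python
from collections import Counter
from itertools import combinations

# Invented event sequences, one per customer session.
sequences = [
    ["login", "search", "buy"],
    ["login", "buy"],
    ["search", "login", "buy"],
]

def ordered_pair_support(sequences):
    """Fraction of sequences in which event a occurs somewhere before event b."""
    counts = Counter()
    for seq in sequences:
        # combinations() keeps the original order, so (a, b) means a came first.
        counts.update(set(combinations(seq, 2)))
    return {pair: n / len(sequences) for pair, n in counts.items()}

pattern_support = ordered_pair_support(sequences)
```

Here "login then buy" holds in every sequence, while "login then search" and "search then login" each hold in only one of the three. An order-blind association view would treat login and search simply as co-occurring in two of the three sequences.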

WEB MINING:

With the huge amount of information available online, the WWW is a fertile area for
data mining research. Web mining is the use of data mining techniques to
automatically discover and extract information from web documents and services.


Web mining can be broken down into the following subtasks:

1. Resource finding.

2. Information selection and preprocessing.

3. Generalization.

4. Analysis.

TEXT MINING:

The term text mining, or KDT (Knowledge Discovery in Text), was first proposed by
Feldman and Dagan in 1996. Presently the term text mining is being used to cover
many applications, such as text categorization, exploratory data analysis, text
clustering, finding patterns in text databases, finding sequential patterns in texts,
IE (Information Extraction), empirical computational linguistic tasks, and
association discovery.

SPATIAL DATA MINING:

Spatial data mining is the branch of data mining that deals with spatial (location) data.
The immense explosion in geographically referenced data occasioned by developments
in IT, digital mapping, remote sensing, and the global diffusion of GIS places
demands on developing data-driven inductive approaches to spatial analysis and
modelling.

ISSUES AND CHALLENGES IN DM:

Data mining systems depend on databases to supply the raw input and this raises

problems, such as that databases tend to be dynamic, incomplete, noisy and large.

The difficulties in data mining can be categorized as:
• Limited information
• Noise or missing data
• User interaction and prior knowledge
• Uncertainty
• Size, updates and irrelevant fields
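As one small example of coping with missing data, the sketch below fills gaps in an invented series with the mean of the known values. Mean imputation is just one simple option among many, and it is an assumption of this sketch rather than a recommendation.

```python
from statistics import mean

# Invented readings; None marks a missing value.
readings = [12.0, None, 11.0, 13.0, None]

def impute_mean(values):
    """Replace each missing value with the mean of the values that are present."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute when every value is missing")
    fill = mean(known)
    return [fill if v is None else v for v in values]

completed = impute_mean(readings)
```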

DM APPLICATION AREAS:

The applications can be naturally divided into three broad categories:

1. Business and E-Commerce Data.

2. Scientific, Engineering and Health care data.

3. Multimedia Documents and Web Data.

A. BUSINESS AND E-COMMERCE DATA

This is a major source category of data mining applications.

BUSINESS TRANSACTIONS:

Modern businesses deal with millions of customers and billions of transactions.
Business enterprises need the information contained in this data in order to
function effectively in today’s competitive world.

ELECTRONIC COMMERCE:

Not only does electronic commerce produce large data sets in which the analysis
of marketing patterns and risk patterns is critical, but it is also important to do this
in near-real time, in order to meet the demands of online transactions.

B. SCIENTIFIC, ENGINEERING AND HEALTH CARE DATA

• GENOMIC DATA:
Genomic sequencing and mapping efforts have produced a number of databases
which are accessible on the web. Finding relationships between these data
sources is another fundamental challenge for data mining.

• SENSOR DATA:
Remote sensing data is another source of voluminous data. Remote sensing
satellites and a variety of other sensors produce large amounts of geo-referenced
data.

• SIMULATION DATA:
Simulation is now accepted as an important mode of science, supplementing
theory and experiment. Data mining and, more generally, data-intensive
computing are proving to be a critical link between theory, simulation, and
experiment.

• HEALTH CARE DATA:
Hospitals, health care organizations, insurance companies, and the concerned
government agencies accumulate large collections of patient data and other
health care-related data.

C. MULTIMEDIA DOCUMENTS AND WEB DATA

• MULTIMEDIA DOCUMENTS:
Today’s technology for retrieving multimedia items on the web is far from
satisfactory. It is becoming harder to extract meaningful information from the
archives of multimedia data as the volume grows.

• WEB DATA:
The data on the web is growing not only in volume but also in complexity. Web
data now includes not only text, audio and video material, but also streaming
data and numerical data.

OTHER APPLICATION AREAS:
• RISK ANALYSIS
• TARGETED MARKETING
• CUSTOMER RETENTION
• PORTFOLIO MANAGEMENT
• BRAND LOYALTY
• BANKING

The application areas in banking are:
1. detecting patterns of fraudulent credit card use
2. identifying ‘loyal’ customers
3. predicting customers likely to change their card affiliation
4. determining credit card spending by customer groups
5. finding hidden correlations between different financial indicators
6. identifying stock trading rules from historical market data.
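The first item in the list can be illustrated with a very simple form of deviation detection: flag card transactions whose amount lies unusually far from the cardholder's typical spending. The amounts and the threshold of two standard deviations are invented, and real fraud detection uses far richer features than the amount alone.

```python
from statistics import mean, stdev

def unusual_transactions(amounts, threshold=2.0):
    """Return amounts more than threshold sample standard deviations from the mean."""
    mu, sigma = mean(amounts), stdev(amounts)
    if sigma == 0:
        return []  # all amounts identical, nothing stands out
    return [a for a in amounts if abs(a - mu) / sigma > threshold]

# Invented spending history for one card.
card_history = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 500.0]
suspicious = unusual_transactions(card_history)
```

Only the 500.0 transaction is flagged, since every other amount lies well within two standard deviations of the mean.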
