


Data Warehousing

Data warehouses provide a great deal of opportunity for performing data mining tasks
such as classification and clustering. Typically, updates are collected and applied to the data
warehouse periodically in batch mode, and all patterns derived from the warehouse by a data
mining algorithm have to be updated as well. Due to the very large size of the databases, it is
highly desirable to perform these updates incrementally. Incremental DBSCAN, the first
incremental clustering algorithm, is based on the clustering algorithm DBSCAN, which is
applicable to any database containing data from a metric space, e.g., to a spatial database or to a
WWW-log database. Due to the density-based nature of DBSCAN, the insertion or deletion of an
object affects the current clustering only in the neighborhood of that object. Thus, efficient
algorithms can be given for incremental insertions and deletions to an existing clustering. Based
on the formal definition of clusters, it can be proven that the incremental algorithm yields the
same result as DBSCAN. A performance evaluation of Incremental DBSCAN on a spatial
database demonstrates the efficiency of the algorithm: it yields significant speed-up factors over
DBSCAN even for large numbers of daily updates in a data warehouse.
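The locality argument above can be sketched in a few lines. The following is a toy illustration, not the paper's actual algorithm: `region_query`, `affected_on_insert`, and the sample points are hypothetical, and a real Incremental DBSCAN would additionally re-evaluate core-object status and cluster membership inside the affected set.

```python
from math import dist

def region_query(points, p, eps):
    """Return indices of all stored points within eps of p."""
    return [i for i, q in enumerate(points) if dist(p, q) <= eps]

def affected_on_insert(points, new_point, eps):
    """On insertion, only objects in the eps-neighborhood of the new point
    (and, through them, their own neighborhoods) can change their core
    property or cluster membership; the rest of the clustering is untouched."""
    seeds = region_query(points, new_point, eps)
    affected = set(seeds)
    for i in seeds:
        affected.update(region_query(points, points[i], eps))
    return affected

# two well-separated groups of points
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
aff = affected_on_insert(points, (0.05, 0.05), eps=0.5)
# the far-away group around (5, 5) is not affected by the insertion
```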

An Introduction to Data Warehousing

Data warehousing has quickly evolved into a unique and popular business application class.
Early builders of data warehouses already consider their systems to be key components of their IT
strategy and architecture. Numerous examples can be cited of highly successful data warehouses
developed and deployed for businesses of all sizes and types. Hardware and software vendors have
quickly developed products and services that specifically target the data warehousing market.

What is a data warehouse?

A data warehouse is a structured, extensible environment designed for the analysis of non-
volatile data, logically and physically transformed from multiple source applications to align with
a common business structure, updated and maintained over a long time period, expressed in simple
business terms, and summarized for quick analysis.
A data warehouse is managed data situated after and outside the operational systems. The
term describes the practice of managing data after it passes through the operational systems and the
types of analysis generated from this historical data. The fundamental requirements of the
operational and analysis systems are different: the operational systems need performance, whereas
the analysis systems need flexibility and broad scope. It has rarely been acceptable to have business
analysis interfere with and degrade the performance of the operational systems.

Data from legacy systems

In the 1970s virtually all business system development was done on IBM mainframe
computers using tools such as COBOL, CICS, IMS, and DB2. The 1980s brought in new mini-
computer platforms such as the AS/400 and VAX/VMS, and the late eighties and early nineties made
UNIX a popular server platform with the introduction of the client/server architecture. Despite all the
changes in platforms, architectures, tools, and technologies, a remarkably large number of
business applications continue to run in the mainframe environment of the 1970s.

Decision-Support and Executive Information Systems

Analysis systems have traditionally taken the form of decision support systems and executive
information systems. Decision support systems tend to focus more on detail and are targeted towards
lower to mid-level managers. Executive information systems have generally provided a higher level
of consolidation and a multi-dimensional view of the data, since high-level executives need the
ability to slice and dice the same data more than the ability to drill down to review the data detail.
Today's data warehousing systems provide the analytical tools pioneered by both of these precursors.

Emergence of key enabling technologies

The most significant set of factors has been the enormous forward movement in hardware
and software technologies. Along with the increase in power, sophisticated processor hardware
architectures such as symmetric multi-processing have come to mainstream computing on
inexpensive machines.
Powerful desktop hardware and software have allowed for the development of the client/server, or
multi-tier, computing architecture. Almost all data warehouses are accessed by personal-computer-
based tools, ranging from the simple query capabilities available with most productivity packages
to incredibly powerful graphical multi-dimensional analysis tools, all backed by the ever increasing
power of server software. The Internet/intranet trend also has very important implications for data
warehousing applications. These are the highlights of the technological revolution that has greatly
impacted data warehousing.
The use of technology by mid- and upper-level managers has increased significantly; they
have decisively moved beyond using the personal computer for email. This hands-on use of
information and technology by upper management has facilitated the sponsorship of larger projects
such as data warehousing.

Data warehousing attributes and concepts
The concepts are grouped into four sub-sections.

“Warehousing” data outside the operational systems
The primary concept of data warehousing is that the data to be analyzed must be stored and
accessed separately from the data in the operational systems.

Integrating data from more than one operational system
Data warehousing systems are most successful when data can be combined from more than
one operational system. When the data needs to be brought together from more than one source
application, it is natural that this integration be done at a place independent of the source
applications. Before
CPU ,the
, of structured data warehouses, analysts
Desktop & in many instances would
Memory from more than one operational system Server
combine data extracted into a single spreadsheet or a
Power Power
database. The data warehouse may very effectively combine data from multiple source applications

l Revolution Hardware &
2 Email: prices

Many large data warehouse architectures allow for the source applications to be integrated
into the data warehouse incrementally.

The primary reason for combining data from multiple source applications is the ability to
cross-reference data from these applications. The typical data warehouse is built around the time
dimension. Time is the primary filtering criterion for a very large percentage of all activity against
the data warehouse. An analyst may generate queries for a given week, month, quarter, or a year.
Another popular query in many data warehousing applications is the review of year-on-year
activity. The time dimension in the data warehouse also serves as a fundamental cross-referencing
attribute. The ability to establish and understand the correlation between activities of different
organizational groups within a company is often cited as the single biggest advanced feature of the
data warehousing systems.
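As a small illustration of time as the primary filtering criterion, here is a hedged sketch in Python over hypothetical fact rows; the `sales` data and field names are invented for the example, and a real warehouse would do this in its query layer.

```python
from datetime import date

# hypothetical fact rows keyed on the time dimension
sales = [
    {"day": date(2023, 1, 15), "region": "east", "amount": 100.0},
    {"day": date(2023, 2, 3),  "region": "west", "amount": 250.0},
    {"day": date(2024, 1, 20), "region": "east", "amount": 300.0},
]

def for_quarter(rows, year, quarter):
    """Filter fact rows to a single quarter: time as the primary filter."""
    months = range(3 * (quarter - 1) + 1, 3 * quarter + 1)
    return [r for r in rows if r["day"].year == year and r["day"].month in months]

q1_2023 = for_quarter(sales, 2023, 1)

# year-on-year review: the same quarter in consecutive years
yoy = (sum(r["amount"] for r in for_quarter(sales, 2024, 1))
       - sum(r["amount"] for r in for_quarter(sales, 2023, 1)))
```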
The data warehouse system can serve not only as an effective platform to merge data from
multiple current applications; it can also integrate multiple versions of the same application. The
data warehouse system can serve as a very powerful and much needed platform to combine the data
from the old and the new applications. Designed properly, the data warehouse can allow for year-
on-year analysis even though the base operational application has changed.
Even though many of the queries and reports that are run against a data warehouse are
predefined, it is nearly impossible to accurately predict the activity against a data warehouse. The
process of data exploration in a data warehouse takes a business analyst through previously
undefined paths. It is also common to have runaway queries in a data warehouse that are triggered
by unexpected results or by a user's lack of understanding of the data model. Further, many of the
analysis processes tend to be all-encompassing, whereas the operational processes are well defined.

Data is mostly non-volatile

Another key attribute of the data in a data warehouse system is that the data is brought to
the warehouse after it has become mostly non-volatile. This means that after the data is in the data
warehouse, there are no modifications to be made to this information. This attribute of the data
warehouse has many very important implications for the kind of data that is brought to the data
warehouse and the timing of the data transfer. It is important to realize that once data is brought to
the data warehouse, it should be modified only on rare occasions. It is very difficult, if not
impossible, to maintain dynamic data in the data warehouse.
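The non-volatility attribute can be made concrete with a minimal sketch: a hypothetical append-only table that accepts loads but refuses in-place modification. The class and its methods are illustrative, not a real warehouse API.

```python
class WarehouseTable:
    """Append-only store: rows can be loaded, but once loaded they are
    not modified -- data in the warehouse is mostly non-volatile."""

    def __init__(self):
        self._rows = []

    def load(self, row):
        # snapshot the row at load time
        self._rows.append(dict(row))

    def update(self, *args, **kwargs):
        # deliberately unsupported: the warehouse holds historical facts
        raise NotImplementedError("warehouse data is non-volatile")

t = WarehouseTable()
t.load({"order_id": 1, "status": "shipped"})
```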
In short, the separation of operational data from the analysis data is the most fundamental
data-warehousing concept. Not only is the data stored in a structured manner outside the
operational system, businesses today are allocating considerable resources to build data warehouses
at the same time that the operational applications are deployed.

Logical transformation of operational data
This sub-section explores the concepts associated with the data warehouse logical model.
The data is logically transformed when it is brought to the data warehouse from the operational
systems. The issues associated with the logical transformation of data brought from the operational


systems to the data warehouse may require considerable analysis and design effort. Some of the
most fundamental concepts of relational database theory do not fully apply to data warehousing
systems. Even though most data warehouses are deployed on relational database platforms, some
basic relational principles are knowingly modified when developing the logical and physical model
of a data warehouse.

Structured extensible data model

The data warehouse model outlines the logical and physical structure of the data warehouse:
the form of an efficient data warehouse that is expandable to accommodate all of the business data
from multiple operational applications.
The data modeling process needs to structure the data in the data warehouse independent of
the relational data model that may exist in any of the operational systems. The data warehouse
model is likely to be less normalized than an operational system model. The operational systems
are likely to have large amounts of overlapping business reference data. Information about current
products is likely to be used in varying forms in many of the operational systems. The data
warehouse system needs to consolidate all of the reference data. The data warehouse reference table
for products would consolidate and maintain all attributes associated with products that are relevant
for the analysis processes. Some attributes that are essential to the operational system are likely to
be deemed unnecessary for the data warehouse.
[Figure: product reference data (product code, price/inventory) from the ordering process and
product inventory applications consolidated into the data warehouse]

The data warehouse model needs to be extensible and structured such that the data from
different applications can be added as a business case can be made for the data. A data warehouse
project in most cases cannot include data from all possible applications right from the start. Many
of the successful data warehousing projects have taken an incremental approach to adding data
from the operational systems and aligning it with the existing data.
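A minimal sketch of this incremental consolidation, assuming two hypothetical feeds (an ordering application and an inventory application) that share a product code as the business key; the function and data are invented for illustration.

```python
def consolidate(existing, new_rows, key):
    """Incrementally merge reference rows from another source application
    into the warehouse reference table, keyed on a shared business key.
    Existing attribute values are kept; new attributes are added."""
    merged = {r[key]: dict(r) for r in existing}
    for row in new_rows:
        target = merged.setdefault(row[key], {})
        for col, val in row.items():
            target.setdefault(col, val)  # keep existing, add what's new
    return merged

# hypothetical feeds: the ordering app and the inventory app
ordering = [{"product_code": "P1", "description": "widget"}]
inventory = [{"product_code": "P1", "on_hand": 42},
             {"product_code": "P2", "on_hand": 7}]
reference = consolidate(ordering, inventory, "product_code")
```

Each additional source application can be folded in with another `consolidate` call, which is the incremental approach the text describes.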
The structure of the data in any single source application is likely to be inadequate for the
data warehouse, since the structure in a single application may be influenced by many factors. The
data warehouse model breaks away from the limitations of the source application data models
and builds a flexible model that parallels the business structure. This extensible data model is easy
for business analysts as well as managers to understand.

Transformation of the operational state information
It is essential to understand the implications of not being able to maintain the state
information of the operational system when the data is moved to the data warehouse. Many of the
attributes of entities in the operational system are very dynamic and constantly modified. A data

4 Email:

warehouse generally does not contain information about entities that are dynamic and constantly
going through state changes.
As an example of operational state information, consider the order entity in an operational
system and the view that an order is ready to be filled. This operational state information cannot
be carried over to the data warehouse system.

Physical transformation of operational data

Physical transformation of data homogenizes and purifies the data. These data warehousing
processes are typically known as "data scrubbing" or "data staging" processes. With effective
"data scrubbing", the burden of analyzing unclean data can be greatly diminished. Physical
transformation includes the use of easy-to-understand standard business terms and standard values
for the data. A complete data dictionary associated with the data warehouse can be a very useful
tool. The data may be combined from multiple applications during this "staging" step, and the
integrity of the data may be checked during this process.
Historical data and current operational application data are likely to have some missing or
invalid values. It is essential to manage missing values and incomplete transformations while
moving the data to the data warehousing system.
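A toy "data scrubbing" step might look like the following; the `MISSING` sentinel set, the default values, and the upper-casing rule are all assumptions made for illustration, not a standard.

```python
MISSING = {"", "N/A", "?", None}

def scrub(row, defaults):
    """Standardize values and flag missing ones instead of silently
    loading them: managing incomplete data during staging."""
    clean, problems = {}, []
    for col, value in row.items():
        if value in MISSING:
            clean[col] = defaults.get(col)  # substitute a documented default
            problems.append(col)            # ...and log the gap
        elif isinstance(value, str):
            clean[col] = value.strip().upper()  # standard business values
        else:
            clean[col] = value
    return clean, problems

row = {"state": " ny ", "balance": "N/A"}
clean, problems = scrub(row, {"balance": 0.0})
```

The `problems` list feeds the kind of missing-data log discussed later, so an analyst can understand the population a query actually runs against.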

Single physical definition of an attribute

Different systems may evolve to use different lengths and data types for the same data
element. The software of an operational application may support very limited data types and it may
impose severe limitations on the names. Software of another application may support a very rich set
of data types, and it may be very flexible with the naming conventions.

[Figure: data from System A and System B passing through a transformation into detailed data
in the data warehouse system]
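The idea of a single physical definition can be sketched as a small coercion step applied to every feed; the CHAR(10)/VARCHAR(60) source definitions and the 30-character warehouse standard below are hypothetical.

```python
def conform(value, length, typ):
    """Coerce a source value to the warehouse's single physical
    definition: one data type, one maximum length."""
    v = typ(value)
    if isinstance(v, str):
        v = v[:length].strip()  # enforce the warehouse length, drop padding
    return v

# System A stores the customer name as CHAR(10) with trailing blanks;
# System B as VARCHAR(60). The warehouse standardizes both on 30 chars.
a = conform("ACME CO   ", 30, str)
b = conform("Acme Corporation of America, Incorporated and Sons", 30, str)
```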

It is important to design a good system to log and identify data that is missing from the data
warehouse. When a user runs a query against the data warehouse, it is essential to understand the
population against which the query is run.

Data warehousing is now a predominant feature of all kinds of databases, and current
technology applies it hand in hand with them. For instance, in multinational banks, data is stored
daily across a large number of databases as huge numbers of transactions take place. Whenever a
customer or account holder wants to review all of his transactions over a given period, the
information is retrieved from the data warehouse through servers at the ATM centers as well as at
the bank, and fast retrieval is achieved using data mining concepts.
The physical transformation concepts for data warehousing systems are summarized below.

[Figure: a data warehouse receiving automatically generated updates and serving quick query
responses and analysis]

- In the banking, stocks, marketing, and financial sectors, the databases grow day by day, so
all of the data must be stored in the data warehouse using data warehousing techniques.
- In hospitals, the databases of operations conducted and the reports of patients must be
reviewed by senior doctors, who can then guide junior doctors through the same databases;
this reduces the data handled as well as the time spent.


The data warehousing system may perform complex database operations such as multi-
table joins; product sales, for example, may be computed by joining the Sales, Invoice, and Product
tables. Such a query will most likely run against a sharply smaller table, since the summary views
will have reduced the data from multiple tables containing millions of rows to tens of thousands of
rows. (Summarization in very small units, or the combination of multiple summary views into one
data table, may limit this reduction.) The summary views in a data warehouse provide multiple
predefined views, along chosen dimensions, into the same detail data, and they give the analyst an
efficient method for linking back to the detail data when necessary. There is a very interesting
phenomenon that is observed with many data warehousing projects.
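A summary view is, in essence, a group-by over the detail rows. A minimal sketch, with invented dimension and measure names, shows how many detail rows collapse into one summary row per dimension combination:

```python
from collections import defaultdict

def summarize(detail_rows, dims, measure):
    """Build a summary view: collapse detail rows into one row per
    combination of the chosen dimensions, totalling the measure."""
    totals = defaultdict(float)
    for row in detail_rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)

detail = [
    {"month": "2023-01", "product": "P1", "amount": 10.0},
    {"month": "2023-01", "product": "P1", "amount": 5.0},
    {"month": "2023-01", "product": "P2", "amount": 7.0},
]
view = summarize(detail, ("month", "product"), "amount")
```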
In recent years data has been processed in very large amounts, so several database
companies have developed software to store the data in structured formats and to reduce the
retrieval time over the large number of databases accumulated over past years. The data is stored
into the tables directly in a formatted manner in order to reduce the time spent on data mining, and
the storage space is also reduced. Various trends have also changed the way data is formatted in
the data warehouse.
Individuals mining the detail records in the warehouse need to understand all the data that
underlies them.




Once the data is entered into the structured tables, the data mining process determines how
the data is stored and in what fashion it is retrieved, in order to reduce the processing time and the
disk storage space. There is a process of cleaning this data, bringing it into some form where it can
be analyzed, selecting the actual task-relevant data, applying the appropriate data mining algorithm,
and finally obtaining the results. Data mining can be classified into descriptive data mining, where
you are trying to describe a given data set, and predictive data mining, where, given a data set, you
try to predict something else. There are different views and different classifications of data mining.
The KDD process involves data cleaning, data integration, data selection, transformation, data
mining, pattern evaluation, and knowledge presentation. Data mining can be performed on a
variety of information repositories: relational databases, the World Wide Web, and so on.


In this introduction to data mining we will look at what motivated the area of data mining
and what its driving forces were, along with various techniques for performing data mining such as
association rules, classification, and clustering.

A Brief History of Data Mining Society

In 1989 the first workshop on knowledge discovery in databases, organized by
Piatetsky-Shapiro and Frawley, was held. From 1991 to 1994 various workshops on knowledge
discovery in databases were held, involving Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy.
From 1995 to 1998 there were international conferences on knowledge discovery in databases
and data mining, and the journal Data Mining and Knowledge Discovery was launched in 1997.
Then in 1998 ACM SIGKDD was established, with the SIGKDD conference and the
SIGKDD Explorations magazine, and more conferences in data mining took place, including the
Pacific-Asia Conference on Knowledge Discovery and Data Mining, the SIAM Data Mining
conference, and the IEEE International Conference on Data Mining.

Motivation: Why Data Mining?

The amount of data being collected everywhere leads to tremendous amounts of data stored
in databases and data warehouses, essentially records of transactions and other information.
Data warehousing and online analytical processing (OLAP) form a process of getting all the strategic information
required by an organization into one place and providing a set of analytical tools over it, whereas data mining deals more
with the extraction of interesting knowledge (rules, regularities, parameters, and constraints) from data in large databases.
This is how we arrived at the current state of explosion of data and information in various persistent stores and sources.
What is Data Mining?

Data mining is also known as knowledge discovery in databases, and there are several alternative names for it. "Data
mining" is in a sense a misnomer, since what takes place is a process of knowledge discovery in databases; better names for
the whole area are knowledge discovery (or mining) in databases, KDD as it is properly known, knowledge extraction, data
or pattern analysis, and pattern recognition.

Data Mining -- Potential Applications

Database analysis and support.
Market Analysis and Management.
Corporate Analysis and Risk Management.
Fraud Detection and Management.

Data Mining: a KDD Process

Data mining is essentially the core step of a longer process known as the knowledge
discovery in databases (KDD) process. Essentially you have the databases here; these could be
independent, autonomous information sources, or they may have been integrated in some form.
The raw data in these databases undergoes some amount of cleaning, and the relevant data is put
into the data warehouse. Once the data is in the data warehouse, you select the relevant data from
it. There is a process of cleaning this data, bringing it into some form where it can be analyzed,
getting the actual task-relevant data, applying the appropriate data mining algorithm, and finally
getting the results.
The KDD process begins with learning the application domain, that is, the relevant prior
knowledge and the goals of the application. The data mining step itself may be classification,
regression, association, or clustering.
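The steps above can be sketched end to end as a toy pipeline; the "mining" step here is just a frequency count standing in for a real algorithm, and the data and field names are invented for the example.

```python
from collections import Counter

def kdd_pipeline(raw_rows):
    """Toy end-to-end KDD flow: clean, select task-relevant data,
    mine (a trivial frequency pattern), and present the result."""
    # 1. cleaning: drop rows with missing values
    cleaned = [r for r in raw_rows if None not in r.values()]
    # 2. selection/transformation: keep only the task-relevant attribute
    items = [r["item"] for r in cleaned]
    # 3. mining: find the most frequent item (stand-in for a real algorithm)
    pattern = Counter(items).most_common(1)[0]
    # 4. presentation
    return {"rows_used": len(cleaned), "top_pattern": pattern}

result = kdd_pipeline([
    {"item": "bread"}, {"item": "milk"}, {"item": "bread"}, {"item": None},
])
```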

Data Mining and Business Intelligence

Data mining and business intelligence can be pictured as a pyramid: at the lowest level you have the paper files, the
information providers, database systems, and online transaction processing systems. Data mining is the information
discovery aspect of the knowledge discovery process.

Architecture of Typical Data Mining System

A database or a data warehouse server acts as the back end for the data mining engine,
which implements the various data mining techniques.

Data Mining: On What Kind of Data?

Data mining works over different kinds of data repositories: relational databases, data
warehouses, transactional databases, and heterogeneous and legacy databases.
Data Mining Functionalities

Data mining functionality describes what exactly you get out of data mining:
1) Association
2) Classification and prediction
3) Cluster analysis
4) Outlier Analysis
5) Trend and evolution analysis
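As a tiny taste of the association functionality, the following counts item pairs that co-occur in enough transactions; the basket data and the support threshold are invented for the example, and a real miner (e.g. Apriori) would also derive rules and confidences.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Association mining in miniature: count the item pairs that appear
    together in at least min_support transactions."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(set(t)), 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}

baskets = [["bread", "milk"], ["bread", "milk", "eggs"], ["milk", "eggs"]]
pairs = frequent_pairs(baskets, min_support=2)
```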


Are the discovered patterns interesting? An interestingness measure is used to evaluate them.
Data Mining: Confluence of Multiple Disciplines

Classification of Data Mining

Data mining can be classified into descriptive data mining, where you are trying to describe
a given data set, and predictive data mining, where, given a data set, you try to predict something
else. There are different views and different classifications of data mining, based on the kinds of
databases to be mined, the kinds of knowledge to be discovered, the kinds of techniques utilized,
and the kinds of applications adopted.
A Multi-Dimensional view of Data Mining classification
OLAP mining: An Integration of Data Mining and Data Warehousing
An OLAM Architecture

There are various APIs: one between the multidimensional database and the actual
databases and data warehouses, and another at the cube level between the OLAM and OLAP
engines. The OLAP engine is where you have the data cube, and the OLAM engine is where the
data mining techniques are implemented; it is the interaction between them that finally produces
the result.

Major Issues in Data Mining

The mining methodology and user interaction
- Mining different kinds of knowledge in databases.
- Interactive mining of knowledge at multiple levels of abstraction.
- Incorporation of background knowledge, which is very essential.
- Data Mining query languages and ad-hoc Data Mining.
- Expression and visualization of Data Mining results.
- Pattern evaluation, which deals with the interestingness problem: how good are the results
that you have obtained?

Performance and scalability
Diversity of Data Types

A data mining tool kit should be able to handle both relational and complex types of data,
and mine information from heterogeneous databases and global information systems such as the
World Wide Web.


In day-to-day life we look for databases where the retrieval of data is fast and structured;
the data stored and mined here is already in a formatted manner, so access will take less time
compared to raw data warehousing. Applications of data mining include:
- Reducing the time, data, and space requirements of different databases. For instance, an
account holder's data retrieved at an ATM center is stored in ascending order, and retrieval follows
an algorithm chosen to minimize retrieval time; but when a customer wants to withdraw cash,
another algorithm is followed in order to maintain the database.
- In voice synthesizers, all the voices in the database have to be maintained in order of
quality and quantity, and the retrieval time is reduced by using data mining.


Data mining deals with discovering interesting patterns from large amounts of data.
It is a natural evolution of database technology, in great demand, with wide applications.
Almost every major organization and Fortune 500 company uses data mining technology
to understand its customers, to compete with its competitors, to enhance the working of the
organization, to develop new marketing strategies, and to make better use of its resources.
The KDD process involves data cleaning, data integration, data selection, transformation, data
mining, pattern evaluation, and knowledge presentation. Data mining can be performed on a
variety of information repositories: relational databases, the World Wide Web, and so on. Data
mining functionalities deal with characterization, discrimination, association, classification,
clustering, outlier analysis, and trend analysis. We have looked at the classification of data mining
systems and the major issues in data mining.
