In a world of exponentially growing data, the extraction of knowledge or desired information
from data is called Data Mining or KDD (Knowledge Discovery in Databases). KDD has mostly been
used by artificial intelligence and machine learning researchers. This paper surveys the latest
trends and methods for analyzing data from different perspectives to find relationships and
patterns among dozens of fields in large relational databases.

A Data Warehouse is a repository of data gathered from multiple sources and stored under
a unified schema at a single site. In this paper, we discuss Data Warehouse design using star
and snowflake schemas. The star schema is used most frequently, as it has several advantages
over other schemas. Snowflake schemas normalize dimensions to eliminate redundancy.

Both Data Mining and Data Warehousing are important in the present competitive
market, with applications such as Customer Retention, Marketing, Risk Assessment, Fraud
Detection and others.

In today’s fiercely competitive marketplace, companies have an insatiable need for
information. It is becoming increasingly clear that the companies poised to experience the greatest
success will be those that can effectively leverage their data to meet organizational needs,
build solid relationships with stakeholders and, above all, meet the demands of today’s customers.
Customer data, financial data and Internet click-stream data are a powerful asset, provided they
can be integrated and utilized to enhance customer experiences.
The ability to access meaningful data, and to move and share it throughout an
organization between departments, officers and business partners in a timely, efficient manner
through familiar query and analytical tools, is critical. But with the proliferation of
mixed-system environments that are integrated with decision support systems, data marts and
warehouses, and electronic business solutions, the challenges increase.
DEF: A Database is a collection of non-redundant data which is sharable between different

What is Data Mining?
Data Mining software is one of a number of analytical tools for analyzing data. It
allows users to analyze data from many different dimensions or angles, categorize it, and
summarize the relationships identified. Technically, Data Mining is the process of finding
correlations or patterns among dozens of fields in large relational databases.
Data Mining is defined as “the non-trivial extraction of implicit, previously
unknown, potentially useful and understandable knowledge from data”.
As mentioned previously, the field of Data Mining is very broad, and there are many
methods and technologies which have become dominant in the field. Not only have there been
developments in the “traditional” areas of Data Mining, there are other areas which have been
identified as being important as future trends in the field.

Latest Trends in Technologies and Methods:
A number of Data Mining trends, in terms of technologies and methodologies, are currently
being developed and researched. These trends include methods for analyzing more complex forms of
data, as well as specific techniques and methods. The trends identified include distributed Data
Mining, hypertext/hypermedia mining, ubiquitous Data Mining, as well as multimedia, spatial, and
time series/sequential Data Mining. These are examined in the following sections.

Distributed / Collective Data Mining:
Much of the Data Mining being done currently focuses on distributed and collective
Data Mining. Mining information that is located in different places, in different physical
locations, is generally known as distributed Data Mining.
Distributed Data Mining (DDM) offers an alternative to traditional analysis by
combining localized data analysis with a “global data model”. That is, local data analysis is
performed at each site to generate partial data models, and the local models from the different
data sites are then combined to develop the global model.
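
The local-then-global idea can be sketched in a few lines of Python. This is only an
illustration under simple assumptions: the "model" is a mean and variance, each site shares only
its summary statistics, and the names LocalModel, fit_local and combine as well as the site data
are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class LocalModel:
    """Partial model built at one site: summary statistics only."""
    count: int
    total: float
    total_sq: float


def fit_local(values: Iterable[float]) -> LocalModel:
    """Local analysis: summarize one site's data without moving the raw rows."""
    vals = list(values)
    return LocalModel(len(vals), sum(vals), sum(v * v for v in vals))


def combine(models: List[LocalModel]) -> dict:
    """Merge the partial models into a global mean and population variance."""
    n = sum(m.count for m in models)
    s = sum(m.total for m in models)
    sq = sum(m.total_sq for m in models)
    mean = s / n
    variance = sq / n - mean * mean
    return {"count": n, "mean": mean, "variance": variance}


# Hypothetical data held at three separate sites.
site_data = [[10.0, 12.0, 14.0], [20.0, 22.0], [30.0]]
local_models = [fit_local(d) for d in site_data]
global_model = combine(local_models)
print(global_model)
```

Only the three numbers in each LocalModel leave a site, which is the point of the approach:
the global model is built without gathering the raw data in one place.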
Ubiquitous Data Mining (UDM):
The advent of laptops, palmtops, cell phones, and wearable computers is making
ubiquitous access to large quantities of data possible. Advanced analysis of data for extracting
useful knowledge is the next natural step in the world of ubiquitous computing. Accessing and
analyzing data from a ubiquitous computing device, and managing data in a mobile environment,
pose many challenges.
Human-computer interaction is another challenging aspect of UDM. Visualizing
patterns such as classifiers, clusters and associations on portable devices is usually difficult.
Moreover, the sociological and psychological aspects of integrating Data Mining
technology into our lifestyle are yet to be explored.
Hypertext and Hypermedia Data Mining:
Hypertext and Hypermedia Data Mining can be characterized as mining data which
includes text, hyperlinks, text markups, and various other forms of hypermedia information.
As such, it is closely related to both web mining and multimedia mining; although these are
often treated as separate areas, in reality they are quite close in terms of content and
applications. While the World Wide Web is substantially composed of hypertext and hypermedia
elements, there are other hypertext/hypermedia data sources which are not found on the web.
Some of the important Data Mining techniques used for hypertext and hypermedia
data mining include classification (supervised learning), clustering (unsupervised learning),
semi-supervised learning, and social network analysis.
Multimedia Data Mining:
Multimedia Data Mining is the mining and analysis of various types of data,
including images, video, audio, and animation. The idea of mining data which contains different
kinds of information is the main objective of multimedia Data Mining. As multimedia Data Mining
incorporates the areas of text mining, as well as hypertext/hypermedia mining, these fields are
closely related. Much of the information describing these other areas also applies to multimedia
Data mining.
A developing area in multimedia Data Mining is audio Data Mining, which offers an
alternative to techniques such as visual data mining, where interesting patterns are disclosed
by observing graphical displays.
Spatial and Geographic Data Mining:
Spatial Data Mining is “the extraction of implicit knowledge, spatial relationships or
other patterns not explicitly stored in spatial databases.” Spatial data can be indexed using
multidimensional structures and requires special spatial data access methods, together with
spatial knowledge representation and the ability to handle geometric calculations.
Analyzing spatial and geographic data includes tasks like understanding and browsing
spatial data, uncovering relationships between spatial data items, and analysis using spatial
databases and spatial knowledge bases. These techniques are useful in fields such as
remote sensing, medical imaging, navigation, and related uses.
Time Series/Sequence Data Mining:
Another important area in Data Mining centers on the mining of time series and
sequence-based data. This involves the mining of a sequence of data that is either referenced by
time or simply ordered in a sequence.
In general, time series analysis focuses on the components which exist within the data:
trend movements, seasonal variations, cyclical variations and random movements.
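
As a rough illustration of these components, the sketch below separates a short series into a
trend (centered moving average), a seasonal pattern and a random remainder. It assumes an additive
model and an odd-length period, and it does not model cyclical variations; the function names and
the sample series are hypothetical.

```python
from statistics import mean
from typing import List, Optional


def moving_average(series: List[float], window: int) -> List[Optional[float]]:
    """Centered moving average; positions without a full window are None."""
    half = window // 2
    trend: List[Optional[float]] = [None] * len(series)
    for i in range(half, len(series) - half):
        trend[i] = mean(series[i - half:i + half + 1])
    return trend


def seasonal_indices(series, trend, period: int) -> List[float]:
    """Average detrended value for each position within the period."""
    buckets = [[] for _ in range(period)]
    for i, (value, t) in enumerate(zip(series, trend)):
        if t is not None:
            buckets[i % period].append(value - t)
    return [mean(b) if b else 0.0 for b in buckets]


# Hypothetical series: a linear trend plus a repeating pattern of period 3.
pattern = [2.0, 0.0, -2.0]
series = [float(i) + pattern[i % 3] for i in range(12)]

trend = moving_average(series, window=3)
seasonal = seasonal_indices(series, trend, period=3)
# Whatever trend and seasonality do not explain is the random movement.
residual = [
    None if t is None else v - t - seasonal[i % 3]
    for i, (v, t) in enumerate(zip(series, trend))
]
```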

Sequential pattern mining focuses on the identification of sequences which occur frequently in a
time series or sequence of data. In general, full periodicity is the situation where every data
point in time contributes to the behavior of the series.
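
A minimal sketch of finding frequent sequences is shown below. It is a simplification: it counts
only contiguous subsequences of a fixed length, whereas general sequential pattern mining also
allows gaps between items. The function name and the event logs are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def frequent_subsequences(
    sequences: List[Sequence[str]], length: int, min_support: int
) -> Dict[Tuple[str, ...], int]:
    """Return contiguous subsequences found in at least min_support sequences."""
    support: Counter = Counter()
    for seq in sequences:
        # A set makes each pattern count at most once per sequence.
        seen = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}


# Hypothetical event logs, one ordered list per customer.
logs = [
    ["login", "search", "buy", "logout"],
    ["login", "search", "logout"],
    ["search", "buy", "logout"],
]
patterns = frequent_subsequences(logs, length=2, min_support=2)
print(patterns)
```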
Constraint-based Data Mining:
Many of the Data Mining techniques which currently exist are very useful but lack
the benefit of any guidance or user control. One method of introducing some form of human
involvement into Data Mining is constraint-based Data Mining, which incorporates constraints
that guide the process.
Constraints can specify the type of knowledge to be mined, such as clustering, association
or classification. Further types are data constraints, dimension constraints and rule constraints.
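
To make the idea concrete, here is a small sketch in which the user steers a simple association
search with a data constraint (which transactions to look at) and a rule constraint (which rule
shapes are of interest). Dimension constraints are not illustrated. The function mine_rules, its
parameters and the sales records are hypothetical, and only one-item rules are considered.

```python
from itertools import permutations
from typing import Callable, List, Set, Tuple


def mine_rules(
    transactions: List[dict],
    data_constraint: Callable[[dict], bool],
    rule_constraint: Callable[[str, str], bool],
    min_support: float,
    min_confidence: float,
) -> List[Tuple[str, str, float, float]]:
    """Mine one-item -> one-item rules that satisfy the user's constraints."""
    # Data constraint: restrict mining to the relevant transactions.
    baskets: List[Set[str]] = [
        set(t["items"]) for t in transactions if data_constraint(t)
    ]
    n = len(baskets)
    if n == 0:
        return []
    items = sorted(set().union(*baskets))
    rules = []
    for lhs, rhs in permutations(items, 2):
        # Rule constraint: skip rule shapes the user is not interested in.
        if not rule_constraint(lhs, rhs):
            continue
        both = sum(1 for b in baskets if lhs in b and rhs in b)
        lhs_count = sum(1 for b in baskets if lhs in b)
        support = both / n
        confidence = both / lhs_count if lhs_count else 0.0
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical sales records.
sales = [
    {"region": "north", "items": ["bread", "milk"]},
    {"region": "north", "items": ["bread", "milk", "eggs"]},
    {"region": "north", "items": ["bread", "eggs"]},
    {"region": "south", "items": ["milk", "eggs"]},
]
rules = mine_rules(
    sales,
    data_constraint=lambda t: t["region"] == "north",
    rule_constraint=lambda lhs, rhs: rhs == "milk",
    min_support=0.5,
    min_confidence=0.6,
)
```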
Phenomenal Data Mining:
Phenomenal Data Mining is not a term for a data mining project that went well.
Instead, it focuses on the relationships between data and the phenomena which are inferred from
the data. One aspect of phenomenal Data Mining, and in particular of the goal to infer phenomena
from data, is the need to have access to some facts about the relations between the data and the
related phenomena. These facts could be included in the program which examines data for
phenomena, or placed in a database which can be drawn upon when doing the data mining. Part of
the challenge in creating such a knowledge base involves the coding of common sense into a
database, which has proved to be a difficult problem so far.

Applications of Data Mining:-
Data Mining collects, stores and organizes data for use in areas such as:
• Data Mining and customer relationship management (CRM) software for solving business
decision problems
• Privacy of data in insurance companies and government agencies
• Fraud detection in telecommunications and stock exchanges
• Medical diagnosis to detect abnormal patterns
• Airline reservation to maximize seat utilization
• Intelligence agencies, to detect abnormal behavior by their employees

What Is Data Warehousing?

A Data Warehouse is a relational database that is designed for query and analysis
rather than for transaction processing. It contains historical data derived from transaction data. It
separates analysis workload from transaction workload and enables an organization to consolidate
data from several sources.
The Need for Data Warehousing
Companies build Data Warehouses because their information cannot be adequately
analyzed in the form in which it is currently stored. One of the fundamental ideas in the theory of
Data Warehousing is the difference between operational data processing and decision-support data
processing.

A common way of introducing Data Warehousing is to refer to the characteristics of a
Data Warehouse:
• Subject Oriented
• Integrated
• Nonvolatile
• Time Variant

Subject Oriented:- The data in the warehouse is defined in business terms and is grouped under
business-oriented subject headings such as customers, products, sales analysis reports and
marketing campaigns. This is achieved through data modeling.
Integrated:- Integration is closely related to subject orientation. Data Warehouses must put data
from disparate sources into a consistent format. They must resolve problems such as naming
conflicts and inconsistencies among units of measure. When they achieve this, they are said to be
integrated.
Non-volatile:- Once loaded into the Data Warehouse, the data is not updated. It acts as a stable
resource for consistent reporting and comparative analysis.
Time-variant:- All data in the Data Warehouse is time-stamped at the time of entry into the
warehouse or when it is summarized within the warehouse, so that it acts as a chronological
record and makes historical and trend analysis possible.

Architecture of Data Warehouse:-
Data Warehouses and their architectures vary depending upon the specifics of an
organization's situation.
Three common architectures are:
• Data Warehouse Architecture (Basic)
• Data Warehouse Architecture (with a Staging Area)
• Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture(Basic):

In the basic architecture, end users directly access data derived from several source systems
through the data warehouse. The metadata and raw data of a traditional online transaction
processing (OLTP) system are present, as is an additional type of data, summary data. Summaries
are very valuable in data warehouses because they pre-compute long operations in advance. In
Oracle, a summary is called a materialized view.
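
The value of summary data can be illustrated with a small sketch. It uses Python's built-in
sqlite3 module; SQLite has no materialized views, so the summary is emulated here with an ordinary
table built from a query. The table names, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-01-05', 'widget', 10.0),
        ('2024-01-20', 'widget', 15.0),
        ('2024-01-21', 'gadget', 7.5),
        ('2024-02-02', 'widget', 20.0);

    -- Summary table: monthly totals per product, computed once in advance.
    CREATE TABLE monthly_sales_summary AS
        SELECT substr(sale_date, 1, 7) AS month,
               product,
               SUM(amount) AS total_amount,
               COUNT(*) AS sale_count
        FROM sales
        GROUP BY substr(sale_date, 1, 7), product;
""")

# Reports read the small summary instead of re-scanning the raw rows.
summary = conn.execute(
    "SELECT month, product, total_amount, sale_count "
    "FROM monthly_sales_summary ORDER BY month, product"
).fetchall()
print(summary)
```

Unlike a real materialized view, this table is not refreshed automatically; it would have to be
rebuilt whenever new rows are loaded.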

Data Warehouse Architecture(with a staging area):
Operational data needs to be cleaned and processed before it is put into the warehouse.
We can do this programmatically, although most data warehouses use a staging area instead.
A staging area simplifies building summaries and general warehouse management.

Data Warehouse Architecture(with a staging area & Data marts):
We may want to customize our warehouse's architecture for different groups
within our organization. We can do this by adding data marts, which are systems designed for a
particular line of business.

Processes within a Data Warehouse:-
• Extract and load the data
• Clean and transform data into a form that can cope with large data volumes and provide
good query performance
• Backup and archive data
• Manage queries, and direct them to the appropriate data sources
Data Warehouses are not just large databases; they are large, complex environments that integrate
many different technologies. As such, they require a lot of maintenance and management.
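
The first two processes in the list above can be sketched as a small pipeline: raw rows are
loaded unchanged into a staging table, then cleaned and transformed on their way into the
warehouse table. This is only an illustration using Python's built-in sqlite3 module; the table
names, the clean function and the source rows are hypothetical, and backup and query management
are not shown.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple

# Hypothetical raw records extracted from source systems.
source_rows: List[Dict[str, str]] = [
    {"cust": " Alice ", "amount": "100.50", "currency": "usd"},
    {"cust": "BOB", "amount": "n/a", "currency": "USD"},
    {"cust": "carol", "amount": "42", "currency": "USD"},
]


def clean(cust: str, amount: str, currency: str) -> Optional[Tuple[str, float, str]]:
    """Transform step: normalize names and types, reject unparseable rows."""
    try:
        value = float(amount)
    except ValueError:
        return None
    return (cust.strip().title(), value, currency.upper())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (cust TEXT, amount TEXT, currency TEXT)")
conn.execute(
    "CREATE TABLE warehouse_sales (customer TEXT, amount REAL, currency TEXT)"
)

# Extract and load: raw rows land in the staging area unchanged.
conn.executemany(
    "INSERT INTO staging VALUES (:cust, :amount, :currency)", source_rows
)

# Clean and transform: only valid, normalized rows reach the warehouse.
staged = conn.execute("SELECT cust, amount, currency FROM staging").fetchall()
cleaned = [row for row in (clean(*s) for s in staged) if row is not None]
conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)", cleaned)

loaded = conn.execute(
    "SELECT customer, amount, currency FROM warehouse_sales ORDER BY customer"
).fetchall()
print(loaded)
```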

Schemas in Data Warehouse:
A schema is a collection of database objects, including tables, views, indexes, and
synonyms. There are a variety of ways of arranging schema objects in the schema models designed
for Data Warehousing. The commonly used schemas are the star schema and the snowflake schema.

Star Schema:
The star schema is the simplest schema. It is called a star schema because the
entity-relationship diagram of this schema resembles a star. The center of the star consists of a
large fact table and the points of the star are the dimension tables.

A Star schema is characterized by one or more very large fact tables that contain the
primary information in the Data warehouse, and a number of much smaller dimension tables, each
of which contains information about the entries for a particular attribute in the fact table.

The main advantages of star schemas are:
• Provide a direct and intuitive mapping between the business entities being analyzed by end
users and the schema design.
• Are widely supported by a large number of business intelligence tools.
Star schemas are used for both simple data marts and very large data warehouses. A
star join is a primary key to foreign key join of the dimension tables to a fact table.
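
A minimal star schema and star join might look as follows. The sketch uses Python's built-in
sqlite3 module, and the table names, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: small and descriptive, one row per entity.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER
    );

    -- Fact table: holds the measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        quantity INTEGER,
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'ebook', 'media');
    INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
    INSERT INTO fact_sales VALUES
        (1, 10, 3, 30.0), (1, 11, 1, 10.0), (2, 10, 5, 25.0), (2, 11, 2, 10.0);
""")

# Star join: each foreign key in the fact table meets a dimension's primary key.
star_join = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON f.product_id = p.product_id
    JOIN dim_date AS d ON f.date_id = d.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(star_join)
```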

Snowflake Schema:
The Snowflake schema is a more complex data warehouse model than a star schema,
and is a type of star schema. It is called a snowflake schema because the diagram of the schema
resembles a snowflake.
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the
dimension data has been grouped into multiple tables instead of one large table.
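
The sketch below shows one way a dimension could be normalized into a snowflake: the category
attributes move out of the product dimension into their own table. It uses Python's built-in
sqlite3 module, and all table names, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Category attributes live in their own table, so each category
    -- and its department are stored once rather than once per product.
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY, category TEXT, department TEXT
    );
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        revenue REAL
    );

    INSERT INTO dim_category VALUES (1, 'hardware', 'goods'), (2, 'media', 'digital');
    INSERT INTO dim_product VALUES (1, 'widget', 1), (2, 'bolt', 1), (3, 'ebook', 2);
    INSERT INTO fact_sales VALUES (1, 30.0), (2, 5.0), (3, 25.0), (1, 10.0);
""")

# Queries need one extra join for each normalized level of the dimension.
by_department = conn.execute("""
    SELECT c.department, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON f.product_id = p.product_id
    JOIN dim_category AS c ON p.category_id = c.category_id
    GROUP BY c.department
    ORDER BY c.department
""").fetchall()
print(by_department)
```

The trade-off is visible in the query: redundancy in the dimension is removed, at the cost of a
longer join path than in the star schema.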

Warehouse with a database:

One thing that remains constant, especially in the corporate world, is “change”, and these
days change is occurring at an ever-increasing rate. A key challenge is implementing an
information infrastructure that allows a company to respond rapidly to change. One solution to
this challenge is the data warehouse.

Data warehousing is an information infrastructure based on detailed data that supports the
decision-making process and gives businesses the ability to access and analyze data to increase
their competitive advantage.
Data warehousing is a process, not an off-the-shelf solution you buy: hardware, database
and tools are integrated into an evolving information infrastructure that changes with the
dynamics of the business.

This paper has focused on the importance, major trends and methods of Data Mining, as well
as on the architecture and design of the Data Warehouse using the schemas involved in
managing it effectively.

It would not be overly optimistic to say that Data Mining has a bright and promising
future, and that the years to come will bring many new developments, methods, and technologies.
The field of Data Mining is still young enough that the possibilities are still limitless. Data
Mining tools are continually evolving, building on ideas from the latest scientific research. Many
of these tools incorporate the latest algorithms taken from AI, neural networks, statistics and
optimization.

A Data Warehouse usually contains historical data derived from transaction data, but it
can include data from other sources. The determination of which schema model should be used for
a Data Warehouse should be based upon the requirements and preferences of the Data Warehouse
project team. Star schemas are widely supported by a large number of business intelligence tools,
whereas Snowflake schemas normalize dimensions to eliminate redundancy.

