Visit: www.geocities.com/chinna_chetan05/forfriends.html

A Paper On

“DATA MINING”
--- The era of knowledge engineering.

Email: chinna_chetan05@yahoo.com

Abstract :

Most organizations have accumulated a great deal of data, but what they really
want is information. What is the profitability of the customers? Which products are
normally sold together? Which customers are likely to jump ship? These are common
business questions, but the answers aren't easy to find. The newest, hottest technology
to address these concerns is data mining. Data Mining is the process of automated
extraction of predictive information from large databases. It predicts future trends and
finds behavior that the experts may miss as it lies beyond their expectations. Data
Mining is part of a larger process called knowledge discovery; specifically, the step in
which advanced statistical analysis and modeling techniques are applied to the data to
find useful patterns and relationships. In this paper we present an overview of the
different processes and techniques involved in Data Mining and, with the help of a
case study of an airline, illustrate the advantages of data mining.

1. Introduction to Data Mining :
Data mining can be defined as "a decision support process in which we search for
patterns of information in data." This search may be done by the user alone,
i.e. just by performing queries, in which case it is quite hard and in most
cases not comprehensive enough to reveal intricate patterns. Data mining uses
sophisticated statistical analysis and modeling techniques to uncover such
patterns and relationships hidden in organizational databases - patterns that
ordinary methods might miss. Once found, the information needs to be presented
in a suitable form, with graphs, reports, etc.

1.1 Data Mining Processes
From a process-oriented view, there are three classes of data mining activity:
discovery, predictive modeling and forensic analysis, as shown in the figure.

Discovery is the process of looking in a database to find hidden patterns
without a predetermined idea or hypothesis about what the patterns may be. In
other words, the program takes the initiative in finding what the interesting
patterns are, without the user thinking of the relevant questions first.

In predictive modeling, patterns discovered from the database are used to
predict the future. Predictive modeling thus allows the user to submit records
with some unknown field values, and the system will guess the unknown values
based on previous patterns discovered from the database. While discovery finds
patterns in data, predictive modeling applies the patterns to guess values for
new data items.
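
The split between discovery and predictive modeling can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
records, the field names ("segment", "class") and the helper functions are
invented here, and the "pattern" is simply the most common value of one field
for each value of another.

```python
from collections import Counter, defaultdict

# Hypothetical customer records; field names and values are invented.
known = [
    {"segment": "business", "class": "first"},
    {"segment": "business", "class": "first"},
    {"segment": "business", "class": "economy"},
    {"segment": "leisure", "class": "economy"},
    {"segment": "leisure", "class": "economy"},
]

def discover(records, given, target):
    """Discovery: for each value of `given`, find the most common `target`."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[given]][record[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def predict(patterns, record, given, target):
    """Predictive modeling: guess the unknown `target` field of a new record."""
    filled = dict(record)
    filled[target] = patterns.get(record[given])  # None if value never seen
    return filled

patterns = discover(known, "segment", "class")
new_record = predict(patterns, {"segment": "business", "class": None},
                     "segment", "class")
```

Real tools discover far richer patterns than a per-value majority, but the
division of labour is the same: one step extracts patterns from complete
records, the other applies them to records with missing values.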

Forensic analysis is the process of applying the extracted patterns to find
anomalous or unusual data elements. To discover the unusual, we first find what
is the norm, and then we detect those items that deviate from the usual within
a given threshold. Discovery helps us find "usual knowledge," but forensic
analysis looks for unusual and specific cases.

1.2 Data Mining Users and Activities
Data mining activities are usually performed by three different classes of
users - executives, end users and analysts.
• Executives need top-level insights and spend far less time with computers
than the other groups.
• End users are sales people, market researchers, scientists, engineers,
physicians, etc.
• Analysts may be financial analysts, statisticians, consultants, or database
designers.

These users usually perform three types of data mining activity within a
corporate environment: episodic, strategic and continuous data mining.

In episodic mining we look at data from one specific episode such as a specific
direct marketing campaign. We may try to understand this data set, or use it for
prediction on new marketing campaigns. Analysts usually perform episodic
mining.

In strategic mining we look at larger sets of corporate data with the intention
of gaining an overall understanding of specific measures such as profitability.
Hence, a strategic mining exercise may look to answer questions such as: "where
do our profits come from?" or "how do our customer segments and product usage
patterns relate to each other?"

In continuous mining we try to understand how the world has changed within a
given time period and try to gain an understanding of the factors that
influence change. For instance, we may ask: "how have sales patterns changed
this month?" or "what were the changing sources of customer attrition last
quarter?"

1.3 Data Mining Applications
Virtually any process from pharmacology to customer service can be studied,
understood, and improved using data mining. The top three end
uses of data mining are, not surprisingly, in the marketing area - customer
profiling, targeted marketing, and market-basket analysis.

In customer profiling, characteristics of good customers are identified with
the goals of predicting who will become one and helping marketers target new
prospects. Data mining can find patterns in a customer database that can be
applied to a prospect database so that customer acquisition can be
appropriately targeted. For example, by identifying good candidates for mail
offers or catalogs, direct-mail marketers can reduce expenses and increase
their sales. Targeting specific promotions to existing and potential customers
offers similar benefits.

Market-basket analysis helps retailers understand which products are purchased
together or by an individual over time. With data mining, retailers can
determine which products to stock in which stores, and even how to place them
within a store. Data mining can also help assess the effectiveness of
promotions and coupons.

Another common use of data mining in many organizations is to help manage
customer relationships. By determining characteristics of customers who are
likely to leave for a competitor, a company can take action to retain those
customers, because doing so is usually far less expensive than acquiring a new
customer.

Fraud detection is of great interest to telecommunications firms, credit-card
companies, insurance companies, stock exchanges, and government agencies. The
aggregate total for fraud losses is enormous. But with data mining, these
companies can identify potentially fraudulent transactions and contain the
damage.

Financial companies use data mining to determine market and industry
characteristics as well as predict individual company and stock performance.
Another interesting niche application is in the medical field: data mining can
help predict the effectiveness of surgical procedures, diagnostic tests,
medications, service management, and process control.

1.4 Data Mining Techniques
Data Mining has three major components: Clustering or Classification,
Association Rules and Sequence Analysis.
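
Of these components, association rules are the easiest to illustrate at small
scale, using the market-basket setting described in Section 1.3. The sketch
below is an assumption-laden illustration, not the method of any particular
tool: the baskets are invented, only rules between pairs of items are
considered, and the thresholds `min_support` and `min_confidence` are
arbitrary. Support is the fraction of baskets containing both items;
confidence is the fraction of baskets containing the left-hand item that also
contain the right-hand item.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def association_rules(baskets, min_support=0.3, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for rules between item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

rules = association_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets, "milk -> butter" holds with full confidence while
"butter -> milk" does not pass the threshold, which shows that a rule and its
reverse have to be assessed separately.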

1.4.1 Classification
The clustering techniques analyze a set of data and generate a set of grouping
rules that can be used to classify future data. The mining tool automatically
identifies the clusters by studying the patterns in the training data. Once the
clusters are generated, classification can be used to identify the cluster to
which an input belongs. For example, one may classify diseases and provide the
symptoms that describe each class or subclass.

1.4.2 Association
An association rule is a rule that implies certain association relationships
among a set of objects in a database. In this process we discover a set of
association rules at multiple levels of abstraction from the relevant set(s) of
data in a database. For example, one may discover a set of symptoms often
occurring together with certain kinds of diseases and further study the reasons
behind them.

1.4.3 Sequential Analysis
In sequential analysis, we seek to discover patterns that occur in sequence.
This deals with data that appear in separate transactions (as opposed to data
that appear in the same transaction, as in the case of association), e.g. a
shopper buys item A in the first week of the month and then buys item B in the
second week.

1.4.4 Neural Nets and Decision Trees
For any given problem, the nature of the data will affect the techniques we
choose. Consequently, we'll need a variety of tools and technologies to find
the best possible model. Classification models are among the most common, so
the more popular ways of building them are explained here.
Classifications typically involve at least one of two workhorse statistical
techniques - logistic regression (a generalization of linear regression) and
discriminant analysis. However, as data mining becomes more common, neural nets
and decision trees are also getting more consideration. Although complex in
their own way, these methods require less statistical sophistication on the
part of the user.
Neural nets use many parameters (the nodes in the hidden layer) to build a
model that takes and combines a set of inputs to predict a continuous or
categorical variable.
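
How such a net combines its inputs can be sketched as a single forward pass.
The code below is a minimal illustration, assuming one hidden layer, a sigmoid
function at every node, no bias terms, and made-up weights that have not been
trained; none of these choices come from the paper. Training, which would
adjust the weights, is not shown.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """One forward pass through a net with a single hidden layer."""
    # Each hidden node applies a function (here a sigmoid) to the
    # weighted sum of the input values that feed into it.
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        for weights in hidden_weights
    ]
    # The output node combines the hidden values in the same way.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical, untrained weights: two inputs, two hidden nodes, one output.
hidden_weights = [[0.5, -0.4], [0.3, 0.8]]
output_weights = [1.2, -0.7]
score = forward([1.0, 0.0], hidden_weights, output_weights)
```

The output is a number between 0 and 1, which could be read as a continuous
prediction or thresholded to obtain a category.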

Source: "Introduction to Data Mining and Knowledge Discovery" by "Two Crows
Corporation"

The value from each hidden node is a function of the weighted sum of the values
from all the preceding nodes that feed into it. The process of building a model
involves finding the connection weights that produce the most accurate results
by "training" the neural net with data. The most common training method is
back-propagation, in which the output result is compared with known correct
values. After each comparison, the weights are adjusted and a new result
computed. After enough passes through the training data, the neural net
typically becomes a very good predictor.

Decision trees represent a series of rules that lead to a class or value. For
example, we may wish to classify loan applicants as good or bad credit risks.
The figure below shows a simple decision tree that solves this problem. Armed
with this tree and a loan application, a loan officer could determine whether
an applicant is a good or bad credit risk. An individual with
"Income > $40,000" and "High Debt" would be classified as a "Bad Risk," whereas
an individual with "Income < $40,000" and "Job > 5 Years" would be classified
as a "Good Risk."

Source: "Introduction to Data
Mining and Knowledge Discovery" by "Two Crows
Corporation"
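
The loan example can be written out as nested rules. In the sketch below only
two branches are taken from the text ("Income > $40,000" with "High Debt" is a
bad risk; "Income < $40,000" with "Job > 5 Years" is a good risk). The other
two branches, the handling of an income of exactly $40,000, and the function
and parameter names are assumptions made for illustration.

```python
def classify_applicant(income, high_debt, years_in_job):
    """Classify a loan applicant by walking a small decision tree."""
    if income > 40_000:
        # From the text: high income with high debt is a bad risk.
        # Assumed: high income without high debt is a good risk.
        return "Bad Risk" if high_debt else "Good Risk"
    # From the text: lower income with more than 5 years in the job is good.
    # Assumed: lower income with 5 years or fewer in the job is a bad risk.
    return "Good Risk" if years_in_job > 5 else "Bad Risk"

print(classify_applicant(income=50_000, high_debt=True, years_in_job=2))
print(classify_applicant(income=30_000, high_debt=False, years_in_job=6))
```

Each path from the root to a leaf is one rule, which is why a loan officer can
follow the tree by hand.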

2. Data Mining on Frequent Flier Program - A case study :

2.1 Introduction
An application was built for an airline company that wanted to explore hidden
trends in its data. The airline wanted to improve its service levels by
identifying customer behavior in different sectors. The data that was mined
contained information about the members of the frequent flier program and
their travel as well as redemption details. The model below was broadly
followed for building the mining prototype.

Building the above model is a continuous process incorporating several feedback
loops and considerable interaction among the components. At each stage, there
are various checks to ensure that the model is in fact meeting the required
objectives.

2.2 Problem Selection
To make the best use of data mining, one must make a clear statement of the
objectives. For example, we may wish to increase the response to a direct mail
campaign. Different goals, such as "increasing the response rate" and
"increasing the value of a response," will require very different models. An
effective problem statement will also include a way to measure the results of
the knowledge discovery project.
The points that were selected for mining in the Frequent Flier Program are:
1. To identify the characteristics of the customers who are frequent users of
the airline. These characteristics were the sectors most frequently flown, the
class flown, the period of the year, and hometown vis-a-vis sector flown.
2. To find the relationship among the sectors based on customer behavior.
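
Characteristics such as the sectors most frequently flown and the class flown
have to be derived from individual flight records. The sketch below shows one
way this could be done; the record layout (member, sector, class), the sample
rows and the `summarize` helper are hypothetical and are not taken from the
airline's actual data.

```python
from collections import Counter, defaultdict

# Hypothetical flight records: (member_id, sector, travel_class).
flights = [
    ("M1", "Bombay-Delhi", "business"),
    ("M1", "Bombay-Delhi", "business"),
    ("M1", "Bombay-Madras", "economy"),
    ("M2", "Delhi-Jammu", "economy"),
    ("M2", "Jammu-Srinagar", "economy"),
    ("M2", "Delhi-Jammu", "economy"),
]

def summarize(flights, top_n=3):
    """Reduce raw flights to each member's top sectors and usual class."""
    sectors = defaultdict(Counter)
    classes = defaultdict(Counter)
    for member, sector, travel_class in flights:
        sectors[member][sector] += 1
        classes[member][travel_class] += 1
    return {
        member: {
            "top_sectors": [s for s, _ in sectors[member].most_common(top_n)],
            "usual_class": classes[member].most_common(1)[0][0],
        }
        for member in sectors
    }

profiles = summarize(flights)
```

One row per member, as produced here, is the kind of summarized input that the
mining techniques of Section 1.4 can then work on.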

2.3 Solution Selection
This step required a number of iterations to converge on a final approach to
solution selection. For problem 2.2.1, neural networks were used to find out
the impact of various attributes on the members' flying behavior. Then rule
induction was applied to get the final result as a set of rules. For problem
2.2.2, factor analysis and association rules were used to find out the
relationship among the sectors, based on the customer behavior.

2.4 Data Selection & Preparation
This step is the most time consuming. The data preparation steps may consume
between 50 and 85 percent of the time and effort of the whole knowledge
discovery process. Here the flight data was summarized to capture the essential
elements of each customer's flying and redemption behavior using the following
fields:
• The three most common sectors flown by the customer.
• The class that he flies in these sectors.
• The three most common sectors in which the customer redeems.
• The classes flown when redeeming.
• For each of these, the number of different dates on which the journeys were
made, and how many people traveled with the member.
• Consolidation of travel information based on the number of flights.

The following issues in data quality were found:
• An insufficient number of attributes had been captured about the customer
and the flights, making the mining algorithms inefficient. For example,
customer demographics such as income, marital status, purpose of journey etc.
were not present.
• In many of the fields that were captured, much of the data was incomplete or
inaccurate. For example, the profession field, which might be one of the most
important for determining flying behavior, had code 'O' (other) for more than
60% of the records.

2.5 Build Model
Discovery analysis and predictive modeling were identified as the appropriate
activities to build the model. Association rules and factor analysis techniques
were used for discovery analysis, and neural nets, rule induction and decision
trees were used for predictive modeling.
The following tools were used to mine the data - Clementine, Business Miner and
Intelligent Miner. Clementine was used for applying neural networks, rule
induction and association. Business Miner was used for building decision trees
and Intelligent Miner for doing factor analysis and finding associations.

2.6 Results

2.6.1 Category Type
The different categories to which the fliers belong were Members, Invitees,
Discretionary, Corporate clients, NRI etc. The most important characteristic in
which frequent fliers differed from their counterparts was category type.
Invitees and discretionary fliers were found to have lower than average usage,
whereas members flew more.

2.6.2 Booking Type
88% of the customers booked their tickets through agents and made an average of
39 trips per year, which was significantly lower than for the other booking
types, viz. in-house travel department, self and secretary.

2.6.3 Sector Type
Most of the flights among the metros were made by a common set of people. For
example, the fliers in the Bombay-Delhi sectors normally fly in the
Bombay-Madras, Bombay-Calcutta, Bombay-Bangalore and Bombay-Ahmedabad sectors
and in the return sectors. A significant number of people flew in the
Gauhati-Bagdogra sector but there wasn't a single flier in the return sector.
Instead there were flights in the Bagdogra-Delhi sector by the people who had
flown in the Gauhati-Bagdogra sector. The same people were also flying in the
Delhi-Gauhati and Delhi-Bagdogra sectors.
Most of the flights made in the Bangalore-Hyderabad sector were made
by the fliers of the Delhi-Bangalore sector.
Fliers in the Jammu-Srinagar sector were basically the fliers of the
Delhi-Jammu sector. Also, most of the fliers in the Delhi-Srinagar sector flew
in the Delhi-Jammu sector as well.

2.7 Model Monitoring
After using the model, one should measure how well it has worked. For example,
suppose we build a model that identifies people who are likely to leave our
long distance telephone service for another (known as churn). We know the rate
of churn prior to using the model, and we can predict what the churn rate will
be after we design interventions intended to keep good customers. Notice that
it's not the model alone but the actions taken based on the model that will
determine its success. The results obtained from the airline application were
checked against the original database and were found to be significant.

3. Conclusion :
Data mining offers great promise in helping organizations uncover hidden
patterns in their data. However, data mining tools must be guided by users who
understand the business, the data and the general nature of the analytical
methods involved. Realistic expectations can yield rewarding results across a
wide range of applications, from improving revenues to reducing costs.
Building models is only one step in knowledge discovery. It's vital to collect
and prepare the data properly and to check models against the real world. The
"best" model is often found after building models of several different types
and by trying out various technologies or algorithms.

The data mining area is still relatively young, and tools that support the
whole of the data mining process in an easy-to-use fashion are rare. However,
one of the most important issues facing researchers is the use of techniques
against very large data sets. The mining techniques are rooted in Artificial
Intelligence, where they are generally executed against small sets of data
that can fit in memory. However, in data mining applications these
techniques must be applied to data held in very large databases. Possible
approaches include the use of parallelism and the development of new
database-oriented techniques. However, much work is required before data mining
can be successfully applied to large data sets. Only then will the true
potential of data mining be realized.

References
• Data preparation for data mining, Dorian Pyle.
• Visualizing data mining models, Kurt Thearling,
http://www3.shore.net/~kht/text/dmviz/modelviz.htm
• Data Mining and Knowledge Discovery in Databases,
http://www.cs.sfu.ca/research/groups/DB/sections/publication/kdd/kdd.html
