Data Mining

• Data mining is the process of analyzing data
from different views and summarizing it into
useful information

• “Data mining, also popularly referred to as
knowledge discovery from data (KDD), is the
automated or convenient extraction of patterns
representing knowledge implicitly stored or
captured in large databases, data warehouses, the
Web, other massive information repositories or
data streams.”
• It is the process of analyzing data from different
perspectives and summarizing it into useful
information - information that can be used to
increase revenue, cut costs, or both

• Technically, data mining is the process of finding
correlations or patterns among dozens of fields
in large relational databases.
• It can be used for descriptive as well as
predictive analytics
Motivation:
• Data explosion problem
– Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
Data Mining Tasks
• Prediction Tasks
– Use some variables to predict unknown or future values of other variables
• Description Tasks
– Find human-interpretable patterns that describe the data.
• Common data mining tasks
– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]
Classification: Application

• Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a
competitor.
– Approach:
• Use detailed records of transactions with each past
and present customer to find attributes.
– How often the customer calls, where they call, what
time of day they call most, their financial status,
marital status, etc.
• Label the customers as loyal or disloyal.
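The approach above can be sketched in a few lines once customers are labeled. This is a minimal illustration, not the method the slides prescribe: it assumes a 1-nearest-neighbor classifier (a technique swapped in for simplicity), and the two attributes (calls per month, months as a customer) and all records are hypothetical.

```python
import math

# Hypothetical labeled history: (calls per month, months as customer) -> label.
history = [
    ((40, 36), "loyal"),
    ((35, 48), "loyal"),
    ((50, 24), "loyal"),
    ((5, 3), "disloyal"),
    ((8, 6), "disloyal"),
    ((3, 2), "disloyal"),
]

def predict(customer):
    """Return the label of the most similar past customer (1-nearest neighbor)."""
    _, label = min(history, key=lambda rec: math.dist(rec[0], customer))
    return label

# Classify two new (hypothetical) customers.
print(predict((6, 4)))
print(predict((45, 30)))
```

In practice the attributes would usually be rescaled first so that no single attribute dominates the distance.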
Clustering Definition

• Given a set of data points, each having a set of
attributes, and a similarity measure among them, find
clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one
another.
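One way to make this definition concrete is k-means, sketched below under stated assumptions: Euclidean distance stands in for the similarity measure, the six 2-D points are made up, and the starting centroids are picked by hand so the run is repeatable.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the assigned points (keep the old one if a cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

Points that end up in the same list are closer to their own centroid than to the other one, which is the "more similar within, less similar across" property stated above.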
Clustering: Application

• Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be selected
as a market target to be reached with a distinct marketing
mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
Association Rule Discovery: Definition

• Given a set of records, each of which contains some number of
items from a given collection:
– Produce dependency rules which will predict occurrence of an item
based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
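The strength of the two rules can be checked directly against the five transactions. The sketch below uses support and confidence, the standard measures for association rules; the slides do not name them, so treat their use here as an assumption, and the function names are illustrative.

```python
# Transactions copied from the table above (TID -> set of items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

On this data, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.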
Association Rule Discovery: Application

• Supermarket shelf management.
– Goal: To identify items that are bought together by
sufficiently many customers.
– Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
– A classic rule:
• If a customer buys diapers and milk, then he is very likely to buy
beer.
Outlier Detection

• In data mining, anomaly detection (also outlier
detection) is the identification of items, events or
observations which do not conform to an expected
pattern or other items in a dataset.
• Typically the anomalous items will translate to some
kind of problem such as bank fraud, a structural
defect, medical problems or errors in a text.
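A simple way to flag observations that do not conform to the rest is a z-score test, sketched here as one possible approach rather than the definitive one. The threshold of 2 standard deviations and the transaction amounts are assumptions chosen for illustration.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts; one is far from the others.
amounts = [52, 48, 50, 47, 55, 51, 49, 500]
print(find_outliers(amounts))
```

Because a single extreme value also inflates the mean and standard deviation, robust variants (for example, based on the median) are often preferred on small samples.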
Sequential pattern mining
• Sequential pattern mining is a topic of data
mining concerned with finding statistically relevant patterns
between data examples where the values are delivered in a
sequence
• Applications of sequential pattern mining:
– Customer shopping sequences: first buy a computer, then a CD-ROM,
and then a digital camera, within 3 months
– Medical treatment
– Natural disasters (e.g., earthquakes)
– Stock markets
– Telephone calling patterns
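The shopping-sequence example can be tested against purchase histories by counting how many customers bought the items in that order. This is a minimal sketch: the histories are hypothetical, gaps between purchases are allowed, and the "within 3 months" time window is ignored for simplicity.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so each match must come after the previous one.
    return all(item in remaining for item in pattern)

# Hypothetical purchase histories, one list per customer, oldest purchase first.
histories = [
    ["computer", "CD-ROM", "digital camera"],
    ["computer", "printer", "CD-ROM", "digital camera"],
    ["CD-ROM", "computer"],
    ["computer", "digital camera"],
]

pattern = ["computer", "CD-ROM", "digital camera"]
matches = sum(contains_in_order(h, pattern) for h in histories)
pattern_support = matches / len(histories)
print(f"{matches} of {len(histories)} customers follow the pattern "
      f"(support={pattern_support:.2f})")
```

A pattern is usually reported only when its support exceeds a chosen minimum, which filters out sequences that occur by chance.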
Regression
• Regression is another predictive data mining model and is also
known as a supervised learning technique. It analyzes how the
values of some attributes depend on the values of other
attributes, mainly those present in the same item. In regression,
the target values are known. For example, you can predict a
child’s behavior based on family history.
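The dependency of one attribute on another can be illustrated with a least-squares line fit. The sketch assumes the simplest case (one input attribute, a linear relationship), and the sample values are made up so that the true relationship is y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data with known target values.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict the target for an unseen x = 6
print(slope, intercept, prediction)
```

The known target values (`ys`) are what make this a supervised technique: the model is fitted to them and then used to predict targets for new inputs.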
Applications

• Banking: loan/credit card approval
– predict good customers based on old customers
• Customer relationship management:
– identify those who are likely to leave for a competitor
• Targeted marketing:
– identify likely responders to promotions
• Fraud detection: telecommunications, financial
transactions
– identify fraudulent events from an online stream of
events
• Manufacturing and production:
– automatically adjust knobs when process parameters
change
Applications (continued)

• Medicine: disease outcome, effectiveness of treatments
– analyze patient disease history: find relationship
between diseases
• Molecular/Pharmaceutical: identify new drugs
• Web site/store design and promotion:
– find affinity of visitor to pages and modify layout
Applications (continued)

• Market Basket Analysis

• Agriculture:
– Data mining is an emerging technology in agriculture for crop
yield analysis with respect to four parameters: year, rainfall,
production, and area of sowing. Yield prediction is a very important
agricultural problem that remains to be solved based on the
available data.
Data Mining Process

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing and Evaluation
6. Deployment
Step 1: Business Understanding

• The key element of any data mining study is to know what the
study is for
• Specific goals are needed, such as “What are the common
characteristics of the customers we have lost to our competitors
recently?” or “What are typical profiles of our customers, and how
much value does each of them provide to us?”
• At this early stage, a budget to support the study should also be
established
Step 2: Data Understanding

• Identify the relevant data from many available databases
• First and foremost, the analyst should be clear and concise about the
description of the data mining task so that the most relevant data can be
identified
• Normally, data sources for business applications include
demographic data (such as income, education, number of
households, and age), sociographic data (such as hobby, club
membership, and entertainment), transactional data (sales record,
credit card spending, issued checks), and so on
Step 3: Data Preparation

• The purpose of data preparation (more commonly called data
preprocessing) is to take the data identified in the previous step and
prepare it for analysis by data mining methods
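The slides do not list specific preparation steps, so the following is only an illustrative sketch of two commonly used ones, both assumptions here: filling missing values with the mean, and rescaling a numeric attribute to the range 0 to 1. The income values are hypothetical.

```python
def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical income attribute with one missing entry.
incomes = [30000, None, 50000, 70000]
cleaned = fill_missing(incomes)
scaled = min_max_scale(cleaned)
print(cleaned)
print(scaled)
```

Rescaling matters for distance-based methods, where an attribute with large raw values would otherwise dominate the similarity measure.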
Step 4: Model Building

• In this step, various modelling techniques are selected and applied
to an already prepared data set in order to address the specific
business need
• Depending on the business need, the data mining task can be of a
prediction (either classification or regression), an association, or a
clustering type. Each of these data mining tasks can use a variety of
data mining methods and algorithms
• Some of the most popular algorithms include decision trees for
classification, k-means for clustering, and the Apriori algorithm for
association rule mining
Step 5: Testing and Evaluation

• The developed models are assessed and evaluated for their accuracy
and generality. This step assesses whether the selected model (or
models) meets the business objectives and, if so, to what extent
(i.e., whether more models need to be developed and assessed).
Another option is to test the developed model(s) in a real-world
scenario if time and budget constraints permit.
• The outcome of the developed models is expected to relate to the
original business objectives, but other findings are often
discovered as well. These may not be related to the original
objectives, yet they might unveil additional information or hints
for future directions.
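Accuracy, mentioned in the first point above, is commonly measured on records that were held back from model building. The sketch below assumes such a held-out set already exists; the predicted and actual labels are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of held-out records whose predicted label matches the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels for five held-out customers.
actual = ["loyal", "loyal", "disloyal", "loyal", "disloyal"]
predicted = ["loyal", "disloyal", "disloyal", "loyal", "disloyal"]

score = accuracy(predicted, actual)
print(f"accuracy = {score:.2f}")
```

Measuring on held-out records rather than on the training data is what gives an indication of generality, the second quality named above.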
Step 6: Deployment

• Development and assessment of the models is not the end of the
data mining project. Even if the purpose of the model is to have a
simple exploration of the data, the knowledge gained from such
exploration will need to be organized and presented in a way that
the end user can understand and benefit from.
• Depending on the requirements, the deployment phase can be as
simple as generating a report or as complex as implementing a
repeatable data mining process across the enterprise. In many cases,
it is the customer, not the data analyst, who carries out the
deployment steps. However, even if the analyst will not carry out
the deployment effort, it is important for the customer to understand
up front what actions need to be carried out in order to actually
make use of the created models.