
Data Mining

Overview
 Introduction
 Explanation of Data Mining Techniques
 Advantages
 Applications
 Privacy
Data Mining
 What is Data Mining?
 “The process of semi-automatically analyzing large
databases to find useful patterns” (Silberschatz)
 KDD – “Knowledge Discovery in Databases” (3)
 “Attempts to discover rules and patterns from data”
 Discover Rules → Make Predictions
 Areas of Use
 Internet – Discover needs of customers
 Economics – Predict stock prices
 Science – Predict environmental change
 Medicine – Match patients with similar problems → cure
Example of Data Mining
 A credit card company wants to discover information about its
clients from its databases. It wants to find:
 Clients who respond to promotions in “Junk Mail”
 Clients who are likely to switch to a competitor
 Clients who are likely not to pay
 Services that clients use, in order to promote services affiliated
with the credit card company
 Anything else that may help the company provide or promote
services that help its clients and ultimately make more
money.
Data Mining & Data Warehousing
 Data Warehouse: “is a repository (or archive) of
information gathered from multiple sources, stored under
a unified schema, at a single site.” (Silberschatz)
 Collect data → Store in a single repository
 Allows for easier query development, as only a single repository
needs to be queried.

 Data Mining:
 Analyzing databases or Data Warehouses to discover
patterns about the data to gain knowledge.
 Knowledge is power.
Discovery of Knowledge
Data Mining Techniques
 Classification
 Clustering
 Regression
 Association Rules
Classification
 Classification: Given a set of items that fall into several classes, and
given past instances (training instances) with their
associated classes, classification is the process of predicting the
class of a new item.
 The goal is therefore to classify the new item, i.e. to identify the
class to which it belongs.
 Example: A bank wants to classify its Home Loan Customers into
groups according to their response to bank advertisements. The
bank might use the classifications “Responds Rarely, Responds
Sometimes, Responds Frequently”.
 The bank will then attempt to find rules about the customers
that respond Frequently and Sometimes.
 The rules could be used to predict needs of potential customers.
Technique for Classification
 Decision-Tree Classifiers
[Figure: decision tree. The root node splits on Job (Engineer, Carpenter,
Doctor); each job node splits on Income, with cut-offs <30K, >50K, <40K,
>90K, <50K, >100K; each leaf is labeled Bad or Good.]
Predicting credit risk of a person with the jobs specified.
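The tree in the figure can be read as a set of nested rules: split on job first, then on income. The sketch below is only an illustration. The pairing of income cut-offs with jobs in `RULES` is an assumption, since the extracted figure does not show which cut-off belongs to which job, and the function name `credit_risk` is hypothetical.

```python
# Hypothetical pairing of jobs with income cut-offs (in thousands).
# The slide lists the cut-offs but not which job each one belongs to.
RULES = {
    "Engineer": (30, 50),    # Bad below 30K, Good above 50K (assumed)
    "Carpenter": (40, 90),   # Bad below 40K, Good above 90K (assumed)
    "Doctor": (50, 100),     # Bad below 50K, Good above 100K (assumed)
}


def credit_risk(job, income_k):
    """Walk the tree: first split on job, then on income (in thousands)."""
    if job not in RULES:
        return "Unknown"  # the tree has no branch for this job
    bad_below, good_above = RULES[job]
    if income_k < bad_below:
        return "Bad"
    if income_k > good_above:
        return "Good"
    return "Unknown"  # income falls between the two cut-offs


print(credit_risk("Engineer", 20))
print(credit_risk("Doctor", 120))
```

In practice the splits are learned from training instances rather than written by hand; the hand-written table only shows how a finished tree is applied to a new item.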


Clustering
 “Clustering algorithms find groups of items that are
similar. … It divides a data set so that records with
similar content are in the same group, and groups are
as different as possible from each other.” (2)
 Example: An insurance company could use clustering to
group clients by their age, location and types of
insurance purchased.
 The categories are unspecified; this is referred to
as ‘unsupervised learning’.
Clustering
 Group Data into Clusters
 Similar data is grouped in the same cluster
 Dissimilar data is placed in different clusters

 How is this achieved?
 K-Nearest Neighbor
 A classification method that classifies a point by
calculating the distances between the point and points in
the training data set. Then it assigns the point to the
class that is most common among its k-nearest
neighbors (where k is an integer). (2)
 Hierarchical
 Group data into trees
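The k-nearest-neighbor description above can be sketched in a few lines. This is a minimal illustration using Euclidean distance and a majority vote; the function name `knn_classify`, the feature choice (age, number of policies) and the data are made up for the example and do not come from the slides.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Classify `point` by majority vote among its k nearest training points.

    `training` is a list of (features, label) pairs.
    """
    # Sort training instances by Euclidean distance to the new point.
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    # Count the labels of the k closest instances and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up training data: (age, number of policies) -> client group.
training = [
    ((25, 1), "young"),
    ((27, 1), "young"),
    ((30, 2), "young"),
    ((55, 3), "senior"),
    ((60, 4), "senior"),
    ((62, 3), "senior"),
]

print(knn_classify((28, 1), training))
print(knn_classify((58, 3), training))
```

Note that this method needs labeled training data, so it matches the quoted definition ("a classification method") rather than unsupervised clustering.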
Regression
 “Regression deals with the prediction of a value, rather
than a class.” (1, P747)
 Example: Find out if there is a relationship between
smoking and cancer-related illness in patients.
 Given values: X1, X2, …, Xn
 Objective: predict variable Y
 One way is to estimate coefficients a0, a1, …, an in
 Y = a0 + a1X1 + a2X2 + … + anXn
 Linear Regression
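For the simplest case of a single predictor, Y = a0 + a1X1, the coefficients can be estimated with the standard least-squares formulas. The sketch below assumes that one-variable case; the function name `fit_line` and the data points are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the model Y = a0 + a1 * X1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Made-up points that lie exactly on Y = 1 + 2 * X1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a0, a1 = fit_line(xs, ys)
print(a0, a1)

# Predict Y for a new value of X1.
print(a0 + a1 * 5)
```

With several predictors X1 … Xn the same idea applies, but the coefficients are usually found with a linear-algebra routine rather than by hand.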
Regression
 Example graph:
 Line of Best Fit
 Curve Fitting
Association Rules
 “An association algorithm creates rules that describe how
often events have occurred together.” (2)

 Example: When a customer buys a hammer, then 90%
of the time they will buy nails.
Association Rules
 Support: “is a measure of what fraction of the
population satisfies both the antecedent and the
consequent of the rule”(1, p748)
 Example:
 Hotdog buns and hotdog sausages are bought together in
99% of all purchases = high support
 Hotdog buns and hangers are bought together in 0.005% of
all purchases = low support
 Rules with high support are worth careful attention
 E.g. Hotdog sausages should be placed near hotdog buns
in supermarkets if there is also high confidence.
Association Rules
 Confidence: “is a measure of how often the consequent is
true when the antecedent is true.” (1, p748)
 Example:
 90% of Hotdog bun purchases are accompanied by hotdog
sausages.
 High confidence is meaningful as we can derive rules.
 Hotdog bun → Hotdog sausage
 Two rules may have different confidence levels and
have the same support.
 E.g. Hotdog sausage → Hotdog bun may have a
much lower confidence than Hotdog bun → Hotdog
sausage, yet they both can have the same support.
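Support and confidence can be computed directly from a list of transactions. The sketch below follows the two quoted definitions; the function names and the five made-up baskets are assumptions for illustration. It also shows how the two directions of a rule share the same support but differ in confidence.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing both antecedent and consequent."""
    both = antecedent | consequent
    hits = sum(1 for t in transactions if both <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0  # the rule never applies in this data
    hits = sum(1 for t in with_antecedent if consequent <= t)
    return hits / len(with_antecedent)


# Made-up shopping baskets.
baskets = [
    {"bun", "sausage"},
    {"bun", "sausage"},
    {"sausage"},
    {"sausage"},
    {"sausage", "hanger"},
]

bun, sausage = {"bun"}, {"sausage"}
print(support(baskets, bun, sausage), confidence(baskets, bun, sausage))
print(support(baskets, sausage, bun), confidence(baskets, sausage, bun))
```

In this data both rules have support 2/5, but bun → sausage holds every time buns are bought, while sausage → bun holds in only 2 of the 5 sausage purchases.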
Advantages of Data Mining
 Provides new knowledge from existing data
 Public databases
 Government sources
 Company Databases

 Old data can be used to develop new knowledge

 New knowledge can be used to improve services or products

 Improvements lead to:


 Bigger profits
 More efficient service
Uses of Data Mining
 Sales/ Marketing
 Diversify target market
 Identify clients’ needs to increase response rates
 Risk Assessment
 Identify customers that pose high credit risk
 Fraud Detection
 Identify people misusing the system. E.g. People who have
two Social Security Numbers
 Customer Care
 Identify customers likely to change providers
 Identify customer needs
Applications of Data Mining
[Figure] Source: IDC 1998 (4)

Privacy Concerns
 Effective Data Mining requires large sources of data
 To achieve a wide spectrum of data, link multiple data
sources
 Linking sources can be problematic for privacy.
For example, suppose the following histories of a customer
were linked:
 Shopping History
 Credit History
 Bank History
 Employment History

 The user’s life story can be painted from the collected
data
References
1. Silberschatz, Korth, Sudarshan, “Database System
Concepts”, 5th Edition, McGraw-Hill, 2005
2. http://www.twocrows.com/glossary.htm, “Two Crows,
Data Mining Glossary”
3. http://en.wikipedia.org/wiki/Data_mining, “Wikipedia”
4. http://phoenix.phys.clemson.edu/tutorials/excel/regression.html
5. http://wwwmaths.anu.edu.au/~steve/pdcn.pdf
