Introduction

Data Mining
• Data Mining is defined as the procedure of
extracting information from huge sets of data.

• Data mining is mining knowledge from data.


Why Data Mining?
Evolution of Data
Applications
• Loan/credit card approval
• Market segmentation
• Fraud detection
• Better marketing
• Trend analysis
• Customer churn
• Web site design and promotion

Loan/Credit card approvals

• In a modern society, a bank does not know
its customers.

• The only knowledge a bank has is the
information stored in its computers.

• Credit agencies and banks collect a lot of
customers’ behavioural data from many
sources. This information is used to predict
the chances of a customer paying back a
loan.

Market Segmentation
• Large amounts of data about customers
contain valuable information.

• The market may be segmented into many
subgroups according to variables that are good
discriminators.

• It is not always easy to find variables that will
help in market segmentation.

Fraud Detection
• Very challenging since it is difficult to define
characteristics of fraud. Often based on
detecting changes from the norm.

• In statistics, it is common to throw out the
outliers but in data mining it may be useful to
identify them since they could either be due to
errors or perhaps fraud.
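
Detecting "changes from the norm" can be sketched with a simple interquartile-range rule. This is only a minimal illustration, not a fraud-detection method from the slides: the transaction amounts are hypothetical, and the 1.5 × IQR cutoff is a common convention assumed here.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical transaction amounts for one customer (illustrative only).
amounts = [42, 38, 51, 47, 40, 45, 39, 44, 950, 43]

# Rather than discarding the outlier, keep it for inspection:
# it may be a data-entry error or a fraudulent transaction.
suspicious = iqr_outliers(amounts)
print(suspicious)  # [950]
```

Quartiles are used instead of the mean and standard deviation because a single extreme value shifts them far less.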

Better Marketing
• When customers buy new products, other
products may be suggested to them when
they are ready.

• As noted earlier, in mail order marketing, for
example, one wants to know:
- will the customer respond?
- will the customer buy and how much?
- will the customer pay for the purchase?

Trend analysis

• In a large company, not all trends are always
visible to the management.

• It is then useful to use data mining software
that will identify trends.

Customer Churn
• In businesses like telecommunications,
companies are trying very hard to keep their
good customers and to perhaps persuade good
customers of their competitors to switch to
them.

• In such an environment, businesses want to
find which customers are good, why customers
switch and what makes customers loyal.

• It is cheaper to develop a retention plan and
retain an old customer than to bring in a new
customer.

Web site design

• A Web site is effective only if the visitors
easily find what they are looking for.

• Data mining can help discover affinity of
visitors to pages and the site layout may be
modified based on this information.
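
One rough way to approximate page affinity is to count how often two pages are visited in the same session. The sketch below assumes hypothetical session logs and page names; the slides do not prescribe any particular technique.

```python
from collections import Counter
from itertools import combinations

def page_affinity(sessions):
    """Count, for every pair of pages, the sessions in which both were visited."""
    pairs = Counter()
    for pages in sessions:
        # sorted(set(...)) ignores repeat visits and gives each pair one canonical order
        for a, b in combinations(sorted(set(pages)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical visitor sessions (illustrative only).
sessions = [
    ["home", "products", "cart"],
    ["home", "products"],
    ["home", "support"],
    ["products", "cart"],
    ["home", "products", "support"],
]

affinity = page_affinity(sessions)
top_pair, top_count = affinity.most_common(1)[0]
print(top_pair, top_count)  # ('home', 'products') 3
```

Pairs with high counts are candidates for direct links between the pages when the layout is revised.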

What is Machine Learning (ML)
• Machine Learning is a type of AI that provides
computers with the ability to learn without being
explicitly programmed.

• Machine learning focuses on the development of
computer programs that can change when
exposed to new data.
• Normally 2/3 of the data is used for training and
1/3 for testing.
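
A 2/3 to 1/3 split can be sketched as a shuffle followed by a cut. The function name, the fixed seed, and the stand-in data below are assumptions made for illustration.

```python
import random

def train_test_split(instances, test_fraction=1/3, seed=0):
    """Shuffle a copy of the instances and hold out test_fraction of them for testing."""
    rows = list(instances)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

# Hypothetical dataset of 24 instances, represented here by their indices.
data = list(range(24))
train, test = train_test_split(data)
print(len(train), len(test))  # 16 8
```

Shuffling before the cut matters when the file is ordered, for example by class, because otherwise the test set would not be representative.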
Machine Learning Algorithms
Classification Algorithms
Anomaly detection algorithm
Regression algorithm
• This algorithm is used to predict numeric values
• For example:
Clustering algorithm
• It helps to understand the structure of the dataset
• It divides the dataset into groups with similar
characteristics
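
As one example of a clustering algorithm, k-means groups instances around centroids. The slides do not name a specific algorithm, so this choice, the one-dimensional data, and the starting centroids are all assumptions for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Tiny k-means for 1-D data: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it in place if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical monthly spend per customer (illustrative only).
spend = [12, 15, 14, 90, 95, 88, 13, 92]
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 100.0])
print(centroids)  # [13.5, 91.25]
print(clusters)   # [[12, 15, 14, 13], [90, 95, 88, 92]]
```

The two groups that emerge (low and high spenders) were not labeled in advance, which is what distinguishes clustering from classification.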
Data mining with Weka
• After installation
• Create a shortcut on the desktop
• ARFF format (Attribute-Relation File Format)
• 14 instances and five attributes (center left)

• Attributes are called outlook, temperature,
humidity, windy, and play (lower left).

• The first attribute, outlook, is selected by default
(you can choose others by clicking them) and has
no missing values, three distinct values, and no
unique values.

• The actual values are sunny, overcast, and rainy,
and they occur five, four, and five times,
respectively.
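
The per-attribute summary described above can be reproduced by hand. The sketch below builds a reduced, hypothetical ARFF text containing only the outlook attribute, with the value counts taken from the slide (row order is arbitrary), and parses just the small subset of ARFF it needs. It is not Weka's own loader.

```python
from collections import Counter

# Reduced, hypothetical ARFF file: only the outlook column, 14 instances.
ARFF_TEXT = """\
% comment lines start with a percent sign
@relation weather-outlook-only

@attribute outlook {sunny, overcast, rainy}

@data
""" + "\n".join(["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5)

def parse_arff(text):
    """Parse a tiny subset of ARFF: attribute names and comma-separated data rows."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@attribute"):
            attributes.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows

def summarize(values):
    """Missing, distinct and unique counts; 'unique' means a value that occurs exactly once."""
    counts = Counter(v for v in values if v != "?")  # "?" marks a missing value in ARFF
    return {
        "missing": sum(1 for v in values if v == "?"),
        "distinct": len(counts),
        "unique": sum(1 for c in counts.values() if c == 1),
        "counts": dict(counts),
    }

attributes, rows = parse_arff(ARFF_TEXT)
outlook = [row[attributes.index("outlook")] for row in rows]
summary = summarize(outlook)
print(len(rows), summary)
```

The result has no missing values, three distinct values, and no unique values, matching what the slide reports for outlook.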
The Edit button brings up an editor that allows you
to inspect the data, search for particular values and
edit them, and delete instances and attributes.

Right-clicking on values and column headers brings
up corresponding context menus.
Activity: Exploring a dataset
• Open the contact-lenses dataset.
• 1. How many instances are there?
• 24
• 2. How many attributes are there?
• 5
• 3. How many possible values are there for
the age attribute?
• 3
• 4. Which of these attributes has "reduced" as a
possible value?
• tear-prod-rate
