
INTRODUCTION

DATA MINING

Instructor:
Doaa Adil Mohamed Altayeb
Email : doaaarabi@gmail.com

12/1/20
Content

 Current Situation.
 Why mine data?
 What is data mining?
 What is not data mining?
 Related disciplines.
 Decisions in Data Mining.
 Data Mining tasks.
 Data mining and different concepts.
 Mining market.
Current Situation …

 Data explosion problem: we are drowning in data, but starving for knowledge!

 Data rich but information poor!


Why Mine Data? Opportunities

 Lots of data is being collected and warehoused.

 Computers have become cheaper and more powerful.
Why Mine Data? Business Viewpoint

 Competitive Pressure is Strong.


Why Mine Data? Scientific Viewpoint

 Data collected and stored at enormous speeds (GB/hour)

 Traditional techniques infeasible for raw data
Motivation ….
“Necessity is the Mother of Invention”

 Solution: Data Warehousing & Data Mining.


What Is Data Mining?

 Data mining:

“Extraction of interesting, non-trivial, implicit, previously unknown and potentially useful information or patterns from data in large databases”
Data mining and (KDD) Process

 Data mining is the core of the Knowledge Discovery in Databases (KDD) process.

[Figure: the KDD process, from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Examples: What is (not) Data Mining?

 What is not Data Mining?
– Look up a phone number in a phone directory.
– Query a Web search engine for information about “Amazon”.

 What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area).
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com).
Data Mining: Combination of Multiple Disciplines

[Figure: Data Mining at the centre, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science and Other Disciplines]
Decisions in Data Mining

 Databases to be mined

 Knowledge to be mined

 Techniques utilized

 Applications adapted
Data Mining Tasks

 Task categories:
─ Predictive tasks.
─ Descriptive tasks.
Predictive Tasks

• Predictive modelling is similar to the human learning experience.
• The model is developed using a supervised learning approach, which has two phases:

1. Training builds a model using a large sample of historical data, called a training set, where the class labels are known.

2. Testing tries the model out on new, previously unseen data to determine its accuracy and physical performance characteristics.
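
The two phases above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any particular tool: the records, the single numeric attribute and the helper names `train_threshold` and `accuracy` are all hypothetical, and the "model" is nothing more than a cut-off value on that attribute.

```python
# Hypothetical labelled records: (attribute value, known class label).
training_set = [(12, False), (25, False), (38, False), (47, False),
                (55, True), (61, True), (74, True), (90, True)]
test_set = [(20, False), (49, False), (52, True), (80, True)]


def train_threshold(records):
    """Training phase: pick the cut-off with the fewest errors on the training set."""
    best_cut, best_errors = None, len(records) + 1
    for cut, _ in records:  # candidate cut-offs are the observed values
        errors = sum((value >= cut) != label for value, label in records)
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


def accuracy(cut, records):
    """Testing phase: fraction of records whose predicted label matches the known one."""
    correct = sum((value >= cut) == label for value, label in records)
    return correct / len(records)


model = train_threshold(training_set)
train_accuracy = accuracy(model, training_set)
test_accuracy = accuracy(model, test_set)
print(model, train_accuracy, test_accuracy)
```

On this toy data the learned cut-off fits the training set perfectly but misclassifies one of the four test records, which is why accuracy is measured on records that were not used for training.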
Descriptive Tasks

• Find human-interpretable patterns that describe the data.

• Use an unsupervised learning approach:

₋ The class labels of the training data are unknown.
₋ The goal is to establish the existence of relationships, classes or clusters in the data.
Common Data Mining tasks

 Predictive:
• Classification
• Regression
• Deviation Detection

 Descriptive:
• Clustering
• Association Rule Discovery
• Sequential Pattern Discovery
Classification Definition

• Given a collection of records (training set), find a model for the class attribute as a function of the values of the other attributes.

• Goal: previously unseen records should be assigned a class as accurately as possible.
Example of a Decision Tree

Training Data (attribute types: Refund = categorical, Marital Status = categorical, Taxable Income = continuous, Cheat = class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO



Apply Model to Test Data

Start from the root of the tree.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Model: Decision Tree:

Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO
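
Walking a record down the tree can be written as a small function. The sketch below hard-codes the tree from the slide rather than learning it from data; the function name `classify` is hypothetical, and because the slide labels the income branches only as "< 80K" and "> 80K", sending an income of exactly 80K to the YES branch is an assumption.

```python
def classify(refund, marital_status, taxable_income_k):
    """Follow the slide's decision tree from the root and return the Cheat label."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (in thousands).
    # Assumption: exactly 80K falls on the ">" side of the split.
    return "No" if taxable_income_k < 80 else "Yes"


# Training data from the slide: (Tid, Refund, Marital Status, Taxable Income in K, Cheat).
training_data = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Fraction of training records the tree labels correctly.
training_accuracy = sum(
    classify(refund, status, income) == cheat
    for _, refund, status, income, cheat in training_data
) / len(training_data)

# The test record from the slide: Refund = No, Married, 80K.
test_prediction = classify("No", "Married", 80)
print(training_accuracy, test_prediction)
```

For the test record, the path is Refund = No, then MarSt = Married, which ends at a NO leaf before the income split is ever reached.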
Classification Application

 Direct Marketing
Goal: Reduce cost of mailing by targeting
a set of consumers likely to buy a new
cell-phone product.
Regression

• Predict the value of a given continuous-valued variable based on the values of other variables.
Regression Application

 Predicting sales amounts:
Goal: Predict the sales amount of a new product based on advertising expenditure.
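
One simple way to make such a prediction is to fit a straight line by ordinary least squares. The slides do not name a regression method, so this choice is an assumption, and the advertising and sales figures below are invented (and deliberately exactly linear) purely for illustration; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: advertising expenditure and resulting sales (same units).
advertising = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(advertising, sales)

# Predict sales for a new product with an advertising expenditure of 6.
predicted_sales = slope * 6 + intercept
print(slope, intercept, predicted_sales)
```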
Deviation/Anomaly Detection

• Detect significant deviations from normal behavior.


Deviation Detection Application

 Credit Card Fraud Detection:
Goal: Predict fraudulent cases in credit card transactions.
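
A very simple notion of "significant deviation" is a value that lies many standard deviations away from the mean. The sketch below flags such values with a z-score test; the transaction amounts, the threshold of 2 and the name `find_deviations` are assumptions made for illustration, and real fraud detection typically uses far richer features than the amount alone.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts for one card.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]


def find_deviations(values, z_threshold=2.0):
    """Return the values whose z-score exceeds the threshold in absolute value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


suspicious = find_deviations(amounts)
print(suspicious)
```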
Clustering Definition

 Given a set of data points, each having a set of attributes, find clusters such that:
 Data points in one cluster are more similar to one another.
 Data points in separate clusters are less similar to one another.
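
The definition above can be made concrete with k-means, one common iterative clustering algorithm. The slides do not name an algorithm, so this choice is an assumption, as are the sample points, the use of squared Euclidean distance as the similarity measure, and the decision to start from the first two points as centroids.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its points, until nothing changes."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical 2-D data points forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centroids = kmeans(points, initial_centroids=points[:2])
print(labels, centroids)
```

Even though both starting centroids lie in the same group, the assignments settle after a few iterations with the three low points in one cluster and the three high points in the other.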
Illustrating Clustering

Original points:
Illustrating Clustering (cont.)
[Figure: six scatter plots of the data points (x versus y), one per iteration, labelled Iteration 1 through Iteration 6]
Clustering Application

 Market Segmentation:
Goal: Subdivide a market into distinct subsets of customers, where each subset can be reached with a distinct marketing mix.
Association Rule Definition

• Given a set of records, each containing a number of items, find dependency rules that predict the occurrence of an item based on the occurrences of other items.


Association Rule: Application

 Supermarket shelf management:
Goal: To identify items that are bought together by sufficiently many customers.

TID  Items
1    Bread, Coke, Milk
2    Water, Bread
3    Water, Coke, Diaper, Milk
4    Water, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Water}
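
How strong a rule is can be quantified with support and confidence, the two standard measures for association rules. The slides do not mention them, so introducing them here is an addition; the sketch below computes both for the two discovered rules from the five transactions in the table, with `count`, `support` and `confidence` as hypothetical helper names.

```python
# The five transactions from the slide, as sets of items.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Water", "Bread"},
    {"Water", "Coke", "Diaper", "Milk"},
    {"Water", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)


def support(antecedent, consequent):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


milk_coke = (support({"Milk"}, {"Coke"}), confidence({"Milk"}, {"Coke"}))
diaper_milk_water = (
    support({"Diaper", "Milk"}, {"Water"}),
    confidence({"Diaper", "Milk"}, {"Water"}),
)
print(milk_coke, diaper_milk_water)
```

Milk appears in four transactions and Coke accompanies it in three of them, so {Milk} --> {Coke} has support 3/5 and confidence 3/4; {Diaper, Milk} --> {Water} has support 2/5 and confidence 2/3.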
Sequential Pattern Discovery: Definition

• Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Sequential Pattern Discovery Application

 Healthcare area:
Goal: To help properly identify a disease based on a sequence of symptoms.

- (headache , fever) (backache) (…) --> (Malaria)
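
The basic operation behind such rules is checking whether a sequence of event sets occurs, in order, within an object's timeline. The sketch below shows that check only; the patient timelines are invented, `contains_sequence` is a hypothetical name, and the pattern uses just the first two steps of the slide's example because the remaining steps are elided there.

```python
def contains_sequence(timeline, pattern):
    """True if each event set in the pattern is contained in some step of the
    timeline, with the matching steps occurring in the same order."""
    pos = 0
    for event_set in pattern:
        # Advance to the next timeline step that contains this event set.
        while pos < len(timeline) and not event_set <= timeline[pos]:
            pos += 1
        if pos == len(timeline):
            return False
        pos += 1  # later pattern steps must match strictly later timeline steps
    return True


# First two steps of the slide's pattern: (headache, fever) then (backache).
pattern = [{"headache", "fever"}, {"backache"}]

# Hypothetical timelines: one set of observed symptoms per visit.
patient_a = [{"headache", "fever"}, {"nausea"}, {"backache"}]
patient_b = [{"backache"}, {"headache", "fever"}]

match_a = contains_sequence(patient_a, pattern)
match_b = contains_sequence(patient_b, pattern)
print(match_a, match_b)
```

The second timeline contains the same symptoms but in the opposite order, so it does not match; the ordering is what distinguishes a sequential pattern from an ordinary association rule.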


Mining market

• Around 20 to 30 mining tool vendors.

• 15 major data mining software systems in 2017:

1. SAP Business Objects
2. Oracle Data Mining
3. RapidMiner
4. Microsoft SharePoint
5. IBM Cognos
6. Orange
7. SPSS Modeler
8. Dundas BI
9. Sisense
10. Board
11. Salesforce Analytics Cloud
12. DOMO
13. KNIME
14. Limestats
15. RockDaisy
END

Thank you for your Attention
