Instructor: Doaa Adil Mohamed Altayeb

INTRODUCTION
DATA MINING
Instructor: Doaa Adil Mohamed Altayeb
Email: doaaarabi@gmail.com
12/1/20
Content
• Current Situation.
• Why mine data?
• What is data mining?
• What is not data mining?
• Related disciplines.
• Decisions in Data Mining.
• Data Mining tasks.
• Data mining and different concepts.
• Mining market.
Current Situation …
• Data explosion problem: We are drowning in data, but starving for knowledge!
• Data rich but information poor!

Why Mine Data? Opportunities
• Lots of data is being collected and warehoused.
• Computers have become cheaper and more powerful.
Why Mine Data? Business View
• Competitive pressure is strong.

Why Mine Data? Scientific Viewpoint
• Data is collected and stored at enormous speeds (GB/hour).
• Traditional techniques are infeasible for raw data.
Motivation …
“Necessity is the Mother of Invention”
• Solution: Data Warehousing & Data Mining.

What Is Data Mining?
• Data mining: “Extraction of interesting, non-trivial, implicit, previously unknown and potentially useful information or patterns from data in large databases”
Data Mining and the KDD Process
• Data mining is the core of the Knowledge Discovery in Databases (KDD) process.
• Steps shown in the figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Examples: What is (not) Data Mining?
• What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
• What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Data Mining: Combination of Multiple Disciplines
• Database Technology
• Statistics
• Machine Learning
• Visualization
• Information Science
• Other Disciplines
Decisions in Data Mining
• Databases to be mined
• Knowledge to be mined
• Techniques utilized
• Applications adapted
Data Mining Tasks
• Task categories:
─ Predictive tasks.
─ Descriptive tasks.
Predictive Tasks
• Predictive modelling is similar to the human learning experience.
• The model is developed using a supervised learning approach, which has two phases:
1. Training builds a model using a large sample of historical data, called a training set, where the class labels are known.
2. Testing determines the model's accuracy and physical performance characteristics.
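The two phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made up, and the "model" is a single threshold on one numeric attribute (the helper names `train_threshold` and `accuracy` are hypothetical, not from the slides).

```python
def train_threshold(training_set):
    """Training phase: choose the threshold that classifies the most
    training records correctly (class labels are known here)."""
    best_threshold, best_correct = None, -1
    for threshold, _ in training_set:
        correct = sum((x >= threshold) == label for x, label in training_set)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, records):
    """Testing phase: fraction of records whose predicted label matches."""
    correct = sum((x >= threshold) == label for x, label in records)
    return correct / len(records)


# Hypothetical historical data: (attribute value, class label).
records = [(x, x >= 50) for x in range(0, 100, 5)]
training_set = records[::2]   # used to build the model
test_set = records[1::2]      # held back to measure accuracy

model = train_threshold(training_set)
test_accuracy = accuracy(model, test_set)
print("threshold:", model, "test accuracy:", test_accuracy)
```

The key design point is that the test records are never seen during training, so the measured accuracy reflects how the model handles previously unseen data.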
Descriptive Tasks
• Find human-interpretable patterns that describe the data.
• Use an unsupervised learning approach:
– The class labels of the training data are unknown.
– The goal is to establish the existence of relationships, classes or clusters in the data.
Common Data Mining Tasks
• Predictive:
– Classification
– Regression
– Deviation Detection
• Descriptive:
– Clustering
– Association Rule Discovery
– Sequential Pattern Discovery
Classification Definition
• Given a collection of records (training set), find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
Example of a Decision Tree
Training Data (attribute types: Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
├─ Yes: NO
└─ No: MarSt
   ├─ Single, Divorced: TaxInc
   │  ├─ < 80K: NO
   │  └─ > 80K: YES
   └─ Married: NO

Apply Model to Test Data
Start from the root of the tree.

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Model:
Refund
├─ Yes: NO
└─ No: MarSt
   ├─ Single, Divorced: TaxInc
   │  ├─ < 80K: NO
   │  └─ > 80K: YES
   └─ Married: NO
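Walking the tree from the root can be written as a short function. The sketch below hard-codes the tree from the slide; the function name `classify` and the dictionary keys are illustrative choices. The slide only labels the income branches "< 80K" and "> 80K", so treating exactly 80K as the "> 80K" branch is an assumption (it does not affect the test record, which is decided at the MarSt node).

```python
def classify(record):
    """Follow the decision tree from the root and return the Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["Marital Status"] == "Married":
        return "No"
    # Single or Divorced: split on taxable income (in thousands).
    # Assumption: a value of exactly 80 is sent down the "> 80K" branch.
    return "No" if record["Taxable Income"] < 80 else "Yes"


test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
prediction = classify(test_record)
print("Cheat =", prediction)  # Refund = No, then MarSt = Married, so "No"
```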
Classification Application
• Direct Marketing
Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Regression
• Predict the value of a given continuous-valued variable based on the values of other variables.
Regression Application
• Predicting sales amounts:
Goal: Predict the sales amounts of a new product based on advertising expenditure.
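As a minimal sketch of this application, the code below fits a straight line by ordinary least squares and uses it to predict sales for a new advertising budget. The slides do not name a specific regression method, so simple linear regression is an assumption, and the figures are hypothetical, noise-free numbers chosen only to keep the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data (arbitrary currency units).
advertising = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 45.0, 65.0, 85.0, 105.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 60.0 + intercept  # forecast for a budget of 60
print(slope, intercept, predicted_sales)
```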
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior.

Deviation Detection Application
• Credit Card Fraud Detection:
Goal: Predict fraudulent cases in credit card transactions.
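One very simple way to flag a significant deviation is to compare each transaction amount with the mean and standard deviation of all amounts. This z-score rule is a stand-in chosen for illustration, not the method of any real fraud system; the amounts and the threshold of 2 standard deviations are assumptions.

```python
from statistics import mean, stdev


def find_deviations(amounts, threshold=2.0):
    """Return the amounts lying more than `threshold` sample standard
    deviations away from the mean."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    return [a for a in amounts if abs(a - mu) > threshold * sigma]


# Hypothetical card transactions; one amount is far from normal behavior.
transactions = [25.0, 30.0, 27.5, 22.0, 31.0, 28.0, 26.5, 950.0, 29.0, 24.0]
flagged = find_deviations(transactions)
print(flagged)
```

A rule like this only looks at one attribute; it is meant to show the idea of "deviation from normal behavior", nothing more.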
Clustering Definition
• Given a set of data points, each having a set of attributes, find clusters such that:
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Illustrating Clustering
[Figure: original points]
Illustrating Clustering (cont.)
[Figure: six panels, Iteration 1 to Iteration 6, showing the cluster assignments at each iteration; x axis from -2 to 2, y axis from 0 to 3]
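The iteration-by-iteration picture is typical of an iterative algorithm such as k-means, although the slides do not name the algorithm. The sketch below assumes k-means with Euclidean distance; the points and starting centroids are hypothetical values placed inside the plotted x and y ranges.

```python
import math


def kmeans(points, centroids, iterations=6):
    """Repeat two steps: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]  # keep the old centroid if a cluster is empty
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming three loose groups.
points = [(-1.5, 0.5), (-1.0, 0.8), (-1.2, 0.3),
          (1.0, 0.5), (1.4, 0.7), (1.2, 0.2),
          (0.0, 2.5), (0.2, 2.8), (-0.1, 2.2)]
initial_centroids = [(-1.0, 0.0), (1.0, 0.0), (0.0, 3.0)]

centroids, clusters = kmeans(points, initial_centroids)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Points that end up in the same list are closer to their shared centroid than to any other centroid, which is the "more similar to one another" property from the definition.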
Clustering Application
• Market Segmentation:
Goal: Subdivide a market into distinct subsets of customers, where each subset can be reached with a distinct marketing mix.
Association Rule Definition
• Given a set of records, each containing a number of items:
find dependency rules which will predict the occurrence of an item based on occurrences of other items.

Association Rule: Application
• Supermarket shelf management:
Goal: To identify items that are bought together by sufficiently many customers.

TID  Items
1    Bread, Coke, Milk
2    Water, Bread
3    Water, Coke, Diaper, Milk
4    Water, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Water}
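To make "bought together by sufficiently many customers" concrete, the sketch below counts how often the two discovered rules hold in the five transactions from the table. Support and confidence are the usual measures for association rules, but the slides do not name them, so their use here is an assumption; the function names are illustrative.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Water", "Bread"},
    {"Water", "Coke", "Diaper", "Milk"},
    {"Water", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Water"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs, rhs),
          "confidence:", confidence(lhs, rhs))
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, so the support is 3/5 and the confidence is 3/4. For {Diaper, Milk} --> {Water} the values are 2/5 and 2/3.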
Sequential Pattern Discovery: Definition
• Given a set of objects, with each object associated with its own timeline of events:
find rules that predict strong sequential dependencies among different events.
Sequential Pattern Discovery Application
• Healthcare area:
Goal: To help properly identify a disease based on the sequence of symptoms.
- (headache, fever) (backache) (…) --> (Malaria)
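A building block for finding such rules is checking whether a patient's timeline contains a given sequence of symptom groups in the right order. The sketch below is an illustration only: the patient timelines are invented, the symptoms "nausea" and "cough" are hypothetical, and the pattern uses just the two symptom groups spelled out in the rule (the elided part of the rule is not modelled).

```python
def contains_sequence(timeline, pattern):
    """True if every element of `pattern` (a set of events) is contained
    in some step of `timeline`, with the steps occurring in order."""
    position = 0
    for step in timeline:
        if position < len(pattern) and pattern[position] <= step:
            position += 1
    return position == len(pattern)


pattern = [{"headache", "fever"}, {"backache"}]

# Hypothetical patients: each timeline is a list of visits,
# each visit a set of observed symptoms.
patients = {
    "patient_a": [{"headache", "fever"}, {"backache"}, {"nausea"}],
    "patient_b": [{"backache"}, {"headache", "fever"}],
    "patient_c": [{"headache", "fever", "cough"}, {"cough"}, {"backache"}],
}

matches = {name: contains_sequence(t, pattern) for name, t in patients.items()}
pattern_support = sum(matches.values()) / len(patients)
print(matches, pattern_support)
```

Patient B shows the same symptoms in the opposite order, so the pattern does not match; order is what distinguishes a sequential pattern from an association rule.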

Mining Market
• Around 20 to 30 mining tool vendors.
• 15 major data mining software systems in 2017:
1. SAP Business Objects
2. Oracle Data Mining
3. RapidMiner
4. Microsoft SharePoint
5. IBM Cognos
6. Orange
7. SPSS Modeler
8. Dundas BI
9. Sisense
10. Board
11. Salesforce Analytics Cloud
12. DOMO
13. KNIME
14. Limestats
15. RockDaisy
END
Thank you for your Attention
