
Université d’Alger 1

Benyouçef Benkhedda

Data Mining
Dr. BOUFENAR Chaouki

Master 1
Ingénierie des Systèmes Informatiques Intelligents
2018/2019



What Motivated Data Mining?
 Natural evolution of information technology

 Wide availability of huge amounts of data


 Imminent need for turning data into useful information



What is Data Mining ?

“Data mining refers to extracting or ‘mining’ knowledge from large amounts of data” [1]

The term “Data Mining” is a misnomer?!

A more appropriate term would be “Knowledge Mining”

Data mining = Knowledge Discovery from Data (KDD)



What is Data Mining ?

Data mining is the core of the KDD process:

Cleaning → Preprocessing → Transformation → Data mining → Evaluation



Data mining as a confluence of
multiple disciplines

[Figure: Data Mining at the confluence of multiple disciplines, including algorithms and visualisation]



Goals of Data Mining

Predictions

Examples: earthquakes, sales volumes



Goals of Data Mining

Identification
Examples: security & crime detection; mining gene expression data for drug discovery



Goals of Data Mining

Classification



Goals of Data Mining

Optimisation

Time Optimisation

Space Optimisation

Sales maximisation



Data Mining Techniques



Classification Vs Prediction

Target attribute:
Categorical/Discrete → Classification
Numerical/Continuous → Prediction

Classification examples:
 learn which loan applicants are “safe” and which are “risky” for the bank
 analyze breast cancer data in order to predict which one of three specific treatments a patient should receive

Prediction example:
 a marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics



Classification process
Learning

Training data → Classification Algorithm → Classification rules:
 IF age = youth THEN
loan_decision = risky
 IF income = high THEN
loan_decision = safe
 IF age = middle_aged AND income = low
THEN loan_decision = risky



Classification process
Classification

Test data → Classification rules

New data: (Dnnnn, Middle-age, Low)
loan_decision = ?  →  risky



Data Cleaning and Preprocessing

Real-world data

Missing values | Noisy | Inconsistent

 Ignore the tuple


 Fill in the missing value manually
 Use a global constant (“Unknown” or −∞)
 Use the attribute mean
 Use the most probable value
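
A minimal pandas sketch of some of these strategies on hypothetical data (column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({"age": [23, 35, None, 51],
                   "income": [30000.0, None, 42000.0, None]})

ignored = df.dropna()                                   # ignore the tuple
constant = df.fillna({"income": -1})                    # use a global constant
by_mean = df.fillna({"income": df["income"].mean()})    # use the attribute mean
```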



Data Cleaning and Preprocessing

Real-world data

Missing values | Noisy | Inconsistent

Noise is a random error or variance in a measured variable

Example
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Irregular data → Smoothing (remove irregularities) → Regular data

Smoothing techniques: Binning | Regression | Clustering



Data Cleaning and Preprocessing
Sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34

Binning
Partition into (equal-frequency) bins:
 Bin 1: 4, 8, 15
 Bin 2: 21, 21, 24
 Bin 3: 25, 28, 34

Smoothing by bin means:
 Bin 1: 9, 9, 9
 Bin 2: 22, 22, 22
 Bin 3: 29, 29, 29

Smoothing by bin boundaries:
 Bin 1: 4, 4, 15
 Bin 2: 21, 21, 24
 Bin 3: 25, 25, 34
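
An illustrative NumPy sketch reproducing the slide's bins and both smoothing variants:

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted
bins = np.array_split(prices, 3)                         # equal-frequency bins of size 3

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [np.full(len(b), round(b.mean())) for b in bins]
# -> [9 9 9], [22 22 22], [29 29 29]

# Smoothing by bin boundaries: every value is replaced by the closest bin boundary
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]
# -> [4 4 15], [21 21 24], [25 25 34]
```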



Data Cleaning and Preprocessing
Regression

Regression is a set of statistical methods for estimating the relationships among variables

Linear regression quantifies the relationship between one or more predictor variables
(independent or explanatory variables) and one outcome variable (dependent variable)
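
A small scikit-learn sketch of smoothing with linear regression on synthetic noisy data (all values are assumed for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float).reshape(-1, 1)                 # one predictor variable
y = 3.0 * x.ravel() + 5.0 + rng.normal(scale=10.0, size=50)   # noisy outcome variable

model = LinearRegression().fit(x, y)
y_smooth = model.predict(x)    # smoothed values lie on the fitted line
```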



Data Cleaning and Preprocessing
Clustering

Similar values are organized into groups, or clusters. Values that fall outside of the set of clusters may be considered outliers.
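
One possible sketch with k-means: points far from every cluster centre are flagged as outliers (the data and the threshold are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # cluster 1
               rng.normal(5, 0.5, (50, 2)),   # cluster 2
               [[10.0, 10.0]]])               # a point far from both clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = km.transform(X).min(axis=1)            # distance to the nearest cluster centre
outliers = X[dist > 3.0]                      # arbitrary threshold (assumption)
```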



Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Smoothing Aggregation Generalisation Normalisation Attribute construction



Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Aggregation Generalisation Normalisation Attribute construction


The daily sales data may be aggregated so as to compute monthly and annual total amounts.

 Data cubes store multidimensional aggregated information.



Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Generalisation Normalisation Attribute construction

Low-level data → Generalisation → Higher-level concepts

 Categorical attributes, like street, can be generalized to higher-level concepts, like city or country
 Values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior
Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Normalisation Attribute construction

The attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0 or 0.0 to 1.0.

 Min-max normalisation
Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
• Then $73,600 is mapped to: (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716

 z-score normalisation
Let the mean of income be Ā = 54,000 and its standard deviation σ_A = 16,000.
• A value of $73,600 for income is transformed to: (73,600 − 54,000) / 16,000 = 1.225
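
The same two computations in Python, as a quick check of the slide's numbers:

```python
v = 73_600
v_minmax = (v - 12_000) / (98_000 - 12_000) * (1.0 - 0.0) + 0.0   # min-max to [0, 1]
v_zscore = (v - 54_000) / 16_000                                   # z-score
print(round(v_minmax, 3), round(v_zscore, 3))                      # 0.716 1.225
```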
Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Attribute construction

New attributes are constructed and added from the given set of attributes to help the mining process.

For example, we may wish to add the attribute area based on the attributes height and width.

Attribute construction can thus discover missing information about the relationships between attributes.
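
A short pandas illustration of the height/width example (values are made up):

```python
import pandas as pd

df = pd.DataFrame({"height": [2.0, 3.5, 1.2], "width": [4.0, 1.5, 2.5]})
df["area"] = df["height"] * df["width"]   # constructed attribute
```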



Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant.

if the task is to classify customers as to whether or not they are likely to purchase a popular
new CD at AllElectronics when notified of a sale, attributes such as the customer’s
telephone number are likely to be irrelevant, unlike attributes such as age or music_taste.

 Speeds up the mining process by removing irrelevant or redundant attributes
 Reduces the number of attributes appearing in the discovered patterns
 Makes the patterns easier to understand.
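
One possible sketch of attribute subset selection with scikit-learn's SelectKBest on synthetic data (this filter approach is just an example, not necessarily the method intended by the course):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 attributes, only 4 of which are informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)
X_reduced = selector.transform(X)             # keep only the 4 most relevant attributes
print(selector.get_support(indices=True))     # indices of the retained attributes
```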



Data Mining

Techniques

Supervised (Classification)
o Data are labeled with pre-defined classes
o Test data are classified into these classes

Unsupervised (Clustering)
o Class labels are unknown
o Establish the existence of classes (clusters) in the data



Performance Measure
Confusion Matrix (TP = true positives, FP = false positives, FN = false negatives, TN = true negatives)

Error Rate = (FP + FN) / (TP + FP + TN + FN)

Precision = TP / (TP + FP)

FP Rate = FP / (FP + TN)

Specificity = TN / (TN + FP) = 1 − FP Rate

Sensitivity = Recall = TP / (TP + FN)
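
A plain-Python sketch computing these measures from assumed confusion-matrix counts:

```python
def confusion_metrics(tp, fp, fn, tn):
    return {
        "error_rate":  (fp + fn) / (tp + fp + fn + tn),
        "precision":   tp / (tp + fp),
        "fp_rate":     fp / (fp + tn),
        "specificity": tn / (tn + fp),    # = 1 - FP rate
        "sensitivity": tp / (tp + fn),    # = recall
    }

print(confusion_metrics(tp=40, fp=10, fn=5, tn=45))   # counts are illustrative only
```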
Splitting the Dataset
Data set

Train Validation Test

Training Dataset : The sample of data used to fit the model.

Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyper-parameters.

Test Dataset: The sample of data used to provide an unbiased evaluation of a final
model fit on the training dataset.
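
One common way to obtain the three subsets with scikit-learn is two successive splits (the 60/20/20 proportions are assumed purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the training set, then split the remainder into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
# -> 60% train, 20% validation, 20% test
```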



Overfitting

Overfitting : a model that is too specialized on Training Set data and that will not
generalize well

• Training data contain generalisable correlations, but also fluctuations, random variations, noise, and outliers.
• Fitting these latter properties leads to overfitting and bad predictions on test data.



Overfitting

• Blue line : a prediction function


• Green points : Training Set data
• Red points : Testing Set data



Overfitting

Underfitting (high bias) | Appropriate fitting | Overfitting (high variance)



How to avoid Overfitting?

Gather more data

Data augmentation

Simplify the model

Early termination

L1 and L2 regularization

For Deep Learning : Dropout and Dropconnect



How to avoid Overfitting?

Gather more data: the more data you get, the less likely the model is to overfit.

Data augmentation

Simplify the model: the model becomes unable to overfit all the samples.

Early termination: forces the model to generalize.

L1 and L2 regularization

For Deep Learning: Dropout and Dropconnect



How to avoid Overfitting?

Gather more data: the more data you get, the less likely the model is to overfit.

Data augmentation



How to avoid Overfitting?

Gather more data

Data augmentation: collecting more data is a tedious and expensive process, so create new samples by transforming existing ones.
 Dilation, Rotation, Adding noise, …
Simplify the model

Early termination

L1 and L2 regularization

For Deep Learning : Dropout and Dropconnect



How to avoid Overfitting?

Gather more data

Data augmentation

Simplify the model: reduce its complexity, e.g. the number of estimators in a random forest or the number of parameters in a neural network. A lighter model also trains faster and runs faster.

Early termination

L1 and L2 regularization

For Deep Learning: Dropout and Dropconnect



How to avoid Overfitting?

Gather more data

Data augmentation

Simplify the model

Early termination: when the validation error starts to increase, it's time to stop!

L1 and L2 regularization

For Deep Learning : Dropout and Dropconnect



How to avoid Overfitting?
Gather more data

Data augmentation

Simplify the model

Early termination

L1 and L2 regularization: add a penalty to the loss function.
• The L1 penalty minimizes the absolute magnitude of the weights.
• The L2 penalty minimizes the squared magnitude of the weights.
• The model is forced to make compromises on its weights, as it can no longer make them arbitrarily large.
• This makes the model more general, which helps combat overfitting.

For Deep Learning: Dropout and Dropconnect
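
An illustrative scikit-learn sketch of both penalties (Lasso applies an L1 penalty, Ridge an L2 penalty; the synthetic data and the regularisation strength alpha are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

l1 = Lasso(alpha=1.0).fit(X, y)   # L1: penalises the absolute magnitude of the weights
l2 = Ridge(alpha=1.0).fit(X, y)   # L2: penalises the squared magnitude of the weights

print(l1.coef_)   # some weights are driven exactly to zero
print(l2.coef_)   # weights are shrunk but rarely exactly zero
```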



How to avoid Overfitting?

Gather more data

Data augmentation

Simplify the model

Early termination

L1 and L2 regularization

For Deep Learning : Dropout and Dropconnect


Randomly deactivate either neurons (dropout) or connections (dropconnect) during training.
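
A minimal NumPy sketch of the dropout idea only (dropconnect would mask weights instead of activations); the function name and the inverted-dropout rescaling are assumptions for illustration:

```python
import numpy as np

def dropout(activations, p=0.5, rng=None):
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones((2, 8))          # toy layer activations
print(dropout(h, p=0.5))     # roughly half the units are zeroed on each call
```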



Data Mining
Supervised

Step 01: Learning
Training data → Learning Algorithm → Model (as generalisable as possible)

Step 02: Testing
Test data → Model → Evaluation



Data Mining
Supervised Learning



Data Mining
Unsupervised
 The data have no target attribute

 We want to explore the data to find some intrinsic structures in them

Methods

Hierarchical:
 Hierarchical Cluster Analysis (Classification Ascendante Hiérarchique)

Not Hierarchical:
 K-Means (centres mobiles)
 Self-Organizing Maps Clustering (Cartes topologiques de Kohonen)



Data Mining
Unsupervised

Hierarchical method

The optimal number of classes is determined by reading the tree


Very expensive in computation time

Not Hierarchical method

Allows the classification of huge data sets

The number of classes must be imposed initially (see the sketch below)
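
A hedged scikit-learn sketch contrasting the two families on synthetic data (the cluster count and linkage choice are arbitrary assumptions):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Not hierarchical: the number of classes is imposed up front
labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative): builds a tree, here simply cut into 3 clusters
labels_hc = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X)
```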



Data Mining
Hierarchical method



Data Mining
Hierarchical method

Principle

 Repeatedly combine the two nearest objects

Data structures

 Data matrix (Object-by-Variable)
 Dissimilarity matrix (Object-by-Object)

 n objects (persons)
 p variables (age, height, weight, gender, …)



Data Mining
Hierarchical method

Classification criteria

 Similarity measurement between objects: Euclidean distance

  d(i, j) = sqrt( (x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_ip − x_jp)² )

 Similarity measurement between groups of objects

• Single linkage (smallest pairwise distance between the groups)

• Complete linkage (largest pairwise distance between the groups)
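
An illustrative NumPy/SciPy sketch of the two linkage criteria between two hypothetical clusters A and B:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # hypothetical cluster A
B = np.array([[4.0, 0.0], [5.0, 1.0]])   # hypothetical cluster B

d = cdist(A, B)            # all pairwise Euclidean distances between A and B
single = d.min()           # single linkage: smallest pairwise distance
complete = d.max()         # complete linkage: largest pairwise distance
print(single, complete)
```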



Data Mining
Hierarchical method

How good is a partition?

 Intra-cluster distance for each cluster is minimal
 Inter-cluster distance between clusters is maximal

Two criteria?

Intra-cluster distance = SSE (Sum of Squared Errors) within clusters = Intra inertia

  I_intra = Σ_q Σ_{i ∈ cluster q} ‖x_i − m_q‖²,  where m_q is the mean of cluster q

Inter-cluster distance = SSE (Sum of Squared Errors) between clusters = Inter inertia

  I_inter = Σ_q n_q ‖m_q − m‖²,  where m is the mean of all objects and n_q the size of cluster q



Data Mining
Hierarchical method

How good is a partition?

Huygens' theorem

  Σ_i ‖x_i − m‖² = Σ_q Σ_{i ∈ cluster q} ‖x_i − m_q‖² + Σ_q n_q ‖m_q − m‖²

  Total inertia = Intra inertia + Inter inertia

One criterion: since the total inertia is fixed, minimising the intra inertia is equivalent to maximising the inter inertia.
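
A quick NumPy check of this decomposition on toy data (points and partition are made up for illustration):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])   # toy objects
labels = np.array([0, 0, 1, 1])                                   # a partition into 2 clusters
m = X.mean(axis=0)                                                 # global mean

total = ((X - m) ** 2).sum()
intra = sum(((X[labels == q] - X[labels == q].mean(axis=0)) ** 2).sum()
            for q in np.unique(labels))
inter = sum((labels == q).sum() * ((X[labels == q].mean(axis=0) - m) ** 2).sum()
            for q in np.unique(labels))

print(total, intra + inter)   # both print 51.0: total inertia = intra + inter
```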



References

[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc., 2006.



