
Université d’Alger 1

Benyouçef Benkhedda

Data Mining
Dr. BOUFENAR Chaouki

Master 1
Ingénierie des Systèmes Informatiques Intelligents
2018/2019



What Motivated Data Mining?
 Natural evolution of information technology

 Wide availability of huge amounts of data


 Imminent need for turning data into useful information



What is Data Mining ?

“Data mining refers to extracting or ‘mining’ knowledge from large amounts of data” [1]

The term “Data Mining” is a misnomer?!

A more appropriate term would be “Knowledge Mining”

Data mining = Knowledge Discovery from Data (KDD)



What is Data Mining ?

Data mining is the core of the KDD process:

Cleaning → Preprocessing → Transformation → Data mining → Evaluation



Data mining as a confluence of
multiple disciplines

[Figure: Data Mining at the confluence of multiple disciplines, including algorithms and visualisation]



Goals of Data Mining

Predictions

Examples: earthquakes, sales volumes



Goals of Data Mining

Identification
Examples: security & crime detection; mining gene expression data for drug discovery



Goals of Data Mining

Classification



Goals of Data Mining

Optimisation

Time Optimisation

Space Optimisation

Sales maximisation



Data Mining Techniques



Classification Vs Prediction

Target attribute:
Categorical/Discrete → Classification
Numerical/Continuous → Prediction

Classification examples:
 learn which loan applicants are “safe” and which are “risky” for the bank
 analyze breast cancer data in order to predict which one of three specific treatments a patient should receive

Prediction example:
 a marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics



Classification process
Learning

Training data → Classification Algorithm → Classification rules:
 IF age = youth THEN
loan_decision = risky
 IF income = high THEN
loan_decision = safe
 IF age = middle_aged AND income = low
THEN loan_decision = risky



Classification process
Classification

Test data → Classification rules

New data: (Dnnnn, Middle-age, Low)
loan_decision = ?  →  risky



Data Cleaning and Preprocessing

Real-world data

Missing values | Noisy | Inconsistent

 Ignore the tuple


 Fill in the missing value manually
 Use a global constant (“Unknown” or −∞)
 Use the attribute mean
 Use the most probable value
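
A minimal pandas sketch of some of these strategies on hypothetical data (column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({"age": [23, 35, None, 51],
                   "income": [30000.0, None, 42000.0, None]})

ignored = df.dropna()                                   # ignore the tuple
constant = df.fillna({"income": -1})                    # use a global constant
by_mean = df.fillna({"income": df["income"].mean()})    # use the attribute mean
```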



Data Cleaning and Preprocessing

Real-world data

Missing values | Noisy | Inconsistent

Noise is a random error or variance in a measured variable

Example
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Irregular data → Smoothing (remove irregularities) → Regular data

Smoothing techniques: Binning | Regression | Clustering



Data Cleaning and Preprocessing
Sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34

Binning
Partition into (equal-frequency) bins:
 Bin 1: 4, 8, 15
 Bin 2: 21, 21, 24
 Bin 3: 25, 28, 34

Smoothing by bin means:
 Bin 1: 9, 9, 9
 Bin 2: 22, 22, 22
 Bin 3: 29, 29, 29

Smoothing by bin boundaries:
 Bin 1: 4, 4, 15
 Bin 2: 21, 21, 24
 Bin 3: 25, 25, 34
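
An illustrative NumPy sketch reproducing the slide's bins and both smoothing variants:

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted
bins = np.array_split(prices, 3)                         # equal-frequency bins of size 3

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [np.full(len(b), round(b.mean())) for b in bins]
# -> [9 9 9], [22 22 22], [29 29 29]

# Smoothing by bin boundaries: every value is replaced by the closest bin boundary
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]
# -> [4 4 15], [21 21 24], [25 25 34]
```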



Data Cleaning and Preprocessing
Regression

Regression is a set of statistical methods for estimating the relationships among variables

Linear regression quantifies the relationship between one or more predictor variables
(independent or explanatory variables) and one outcome variable (dependent variable)
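
A small scikit-learn sketch of smoothing with linear regression on synthetic noisy data (all values are assumed for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float).reshape(-1, 1)                 # one predictor variable
y = 3.0 * x.ravel() + 5.0 + rng.normal(scale=10.0, size=50)   # noisy outcome variable

model = LinearRegression().fit(x, y)
y_smooth = model.predict(x)    # smoothed values lie on the fitted line
```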



Data Cleaning and Preprocessing
Clustering

Similar values are organized into groups, or clusters. Values that fall outside of the set of clusters may be considered outliers.
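
One possible sketch with k-means: points far from every cluster centre are flagged as outliers (the data and the threshold are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # cluster 1
               rng.normal(5, 0.5, (50, 2)),   # cluster 2
               [[10.0, 10.0]]])               # a point far from both clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = km.transform(X).min(axis=1)            # distance to the nearest cluster centre
outliers = X[dist > 3.0]                      # arbitrary threshold (assumption)
```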



Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Smoothing Aggregation Generalisation Normalisation Attribute construction



Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Aggregation Generalisation Normalisation Attribute construction


The daily sales data may be aggregated so as to compute monthly and annual total amounts.

 Data cubes store multidimensional aggregated information.



Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Generalisation Normalisation Attribute construction

Low-level data → Generalisation → Higher-level concepts

 Categorical attributes, like street, can be generalized to higher-level concepts, like city or country
 Values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior
Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Normalisation Attribute construction

The attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0 or 0.0 to 1.0.

 Min-max normalisation
Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
• Then $73,600 is mapped to: (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716

 z-score normalisation
Let the mean of income be Ā = 54,000 and its standard deviation σ_A = 16,000.
• A value of $73,600 for income is transformed to: (73,600 − 54,000) / 16,000 = 1.225
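
The same two computations in Python, as a quick check of the slide's numbers:

```python
v = 73_600
v_minmax = (v - 12_000) / (98_000 - 12_000) * (1.0 - 0.0) + 0.0   # min-max to [0, 1]
v_zscore = (v - 54_000) / 16_000                                   # z-score
print(round(v_minmax, 3), round(v_zscore, 3))                      # 0.716 1.225
```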
Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Attribute construction

New attributes are constructed and added from the given set of attributes to help the mining process.

For example, we may wish to add the attribute area based on the attributes height and width.

Attribute construction can thus discover missing information about the relationships between attributes.
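
A short pandas illustration of the height/width example (values are made up):

```python
import pandas as pd

df = pd.DataFrame({"height": [2.0, 3.5, 1.2], "width": [4.0, 1.5, 2.5]})
df["area"] = df["height"] * df["width"]   # constructed attribute
```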



Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant.

if the task is to classify customers as to whether or not they are likely to purchase a popular
new CD at AllElectronics when notified of a sale, attributes such as the customer’s
telephone number are likely to be irrelevant, unlike attributes such as age or music_taste.

 Speeds up the mining process by removing irrelevant or redundant attributes
 Reduces the number of attributes appearing in the discovered patterns
 Makes the patterns easier to understand.
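
One possible sketch of attribute subset selection with scikit-learn's SelectKBest on synthetic data (this filter approach is just an example, not necessarily the method intended by the course):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 attributes, only 4 of which are informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)
X_reduced = selector.transform(X)             # keep only the 4 most relevant attributes
print(selector.get_support(indices=True))     # indices of the retained attributes
```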



Data Mining

Techniques

Supervised (Classification)
o Data are labeled with pre-defined classes
o Test data are classified into these classes

Unsupervised (Clustering)
o Class labels are unknown
o Establish the existence of classes (clusters) in the data



Performance Measure
Confusion Matrix (TP = true positives, FP = false positives, FN = false negatives, TN = true negatives)

Error Rate = (FP + FN) / (TP + FP + TN + FN)

Precision = TP / (TP + FP)

FP Rate = FP / (FP + TN)

Specificity = TN / (TN + FP) = 1 − FP Rate

Sensitivity = Recall = TP / (TP + FN)
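
A plain-Python sketch computing these measures from assumed confusion-matrix counts:

```python
def confusion_metrics(tp, fp, fn, tn):
    return {
        "error_rate":  (fp + fn) / (tp + fp + fn + tn),
        "precision":   tp / (tp + fp),
        "fp_rate":     fp / (fp + tn),
        "specificity": tn / (tn + fp),    # = 1 - FP rate
        "sensitivity": tp / (tp + fn),    # = recall
    }

print(confusion_metrics(tp=40, fp=10, fn=5, tn=45))   # counts are illustrative only
```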
Splitting the Dataset
Data set

Train Validation Test

Training Dataset : The sample of data used to fit the model.

Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyper-parameters.

Test Dataset: The sample of data used to provide an unbiased evaluation of a final
model fit on the training dataset.
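
One common way to obtain the three subsets with scikit-learn is two successive splits (the 60/20/20 proportions are assumed purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the training set, then split the remainder into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
# -> 60% train, 20% validation, 20% test
```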



Overfitting

Overfitting : a model that is too specialized on Training Set data and that will not
generalize well

• Training data contain generalisable correlations, but also fluctuations, random variations, noise, and outliers.
• Fitting these latter properties leads to overfitting and bad predictions on test data.



Overfitting

• Blue line : a prediction function


• Green points : Training Set data
• Red points : Testing Set data



Overfitting

Underfitting (high bias) | Appropriate fitting | Overfitting (high variance)



How to avoid Overfitting?

Gather more data

Data augmentation

Simplify the model

Early termination

L1 and L2 regularization

For Deep Learning : Dropout and Dropconnect



How to avoid Overfitting?

Gather more data: the more data you get, the less likely the model is to overfit.

Data augmentation

Simplify the model: the model becomes unable to overfit all the samples.

Early termination: forces the model to generalize.

L1 and L2 regularization

For Deep Learning: Dropout and Dropconnect



How to avoid Overfitting?

Gather more data: the more data you get, the less likely the model is to overfit.

Data augmentation



How to avoid Overfitting?

Gather more data

Data augmentation: collecting more data is a tedious and expensive process, so create new samples by transforming existing ones.
 Dilation, Rotation, Adding noise, …
Simplify the model

Early termination

L1 and L2 regularization

For Deep Learning : Dropout and Dropconnect



How to avoid Overfitting?

Gather more data

Data augmentation

Simplify the model: reduce its complexity, e.g. the number of estimators in a random forest or the number of parameters in a neural network. A lighter model also trains faster and runs faster.

Early termination

L1 and L2 regularization

For Deep Learning: Dropout and Dropconnect



How to avoid Overfitting?

Gather more data

Data augmentation

Simplify the model

Early termination: when the validation error starts to increase, it's time to stop!

L1 and L2 regularization

For Deep Learning : Dropout and Dropconnect



How to avoid Overfitting?
Gather more data

Data augmentation

Simplify the model

Early termination

L1 and L2 regularization: add a penalty to the loss function.
• The L1 penalty minimizes the absolute magnitude of the weights.
• The L2 penalty minimizes the squared magnitude of the weights.
• The model is forced to make compromises on its weights, as it can no longer make them arbitrarily large.
• This makes the model more general, which helps combat overfitting.

For Deep Learning: Dropout and Dropconnect
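
An illustrative scikit-learn sketch of both penalties (Lasso applies an L1 penalty, Ridge an L2 penalty; the synthetic data and the regularisation strength alpha are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

l1 = Lasso(alpha=1.0).fit(X, y)   # L1: penalises the absolute magnitude of the weights
l2 = Ridge(alpha=1.0).fit(X, y)   # L2: penalises the squared magnitude of the weights

print(l1.coef_)   # some weights are driven exactly to zero
print(l2.coef_)   # weights are shrunk but rarely exactly zero
```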



How to avoid Overfitting?

Gather more data

Data augmentation

Simplify the model

Early termination

L1 and L2 regularization

For Deep Learning : Dropout and Dropconnect


Randomly deactivate either neurons (dropout) or connections (dropconnect) during training.
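
A minimal NumPy sketch of the dropout idea only (dropconnect would mask weights instead of activations); the function name and the inverted-dropout rescaling are assumptions for illustration:

```python
import numpy as np

def dropout(activations, p=0.5, rng=None):
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones((2, 8))          # toy layer activations
print(dropout(h, p=0.5))     # roughly half the units are zeroed on each call
```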



Data Mining
Supervised

Step 01: Learning
Training data → Learning Algorithm → Model (as generalisable as possible)

Step 02: Testing
Test data → Model → Evaluation



Data Mining
Supervised Learning



Data Mining
Unsupervised
 The data have no target attribute

 We want to explore the data to find some intrinsic structures in them

Methods

Hierarchical:
 Hierarchical Cluster Analysis (Classification Ascendante Hiérarchique)

Not Hierarchical:
 K-Means (centres mobiles)
 Self-Organizing Maps Clustering (Cartes topologiques de Kohonen)



Data Mining
Unsupervised

Hierarchical method

The optimal number of classes is determined by reading the tree


Very expensive in computation time

Not Hierarchical method

Allows the classification of huge data sets

The number of classes must be imposed initially (see the sketch below)
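
A hedged scikit-learn sketch contrasting the two families on synthetic data (the cluster count and linkage choice are arbitrary assumptions):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Not hierarchical: the number of classes is imposed up front
labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative): builds a tree, here simply cut into 3 clusters
labels_hc = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X)
```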



Data Mining
Hierarchical method



Data Mining
Hierarchical method

Principle

 Repeatedly combine the two nearest objects

Data structures

 Data matrix (Object-by-Variable)
 Dissimilarity matrix (Object-by-Object)

 n objects (persons)
 p variables (age, height, weight, gender, …)



Data Mining
Hierarchical method

Classification criteria

 Similarity measurement between objects: Euclidean distance

  d(i, j) = sqrt( (x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_ip − x_jp)² )

 Similarity measurement between groups of objects

• Single linkage (smallest pairwise distance between the groups)

• Complete linkage (largest pairwise distance between the groups)
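
An illustrative NumPy/SciPy sketch of the two linkage criteria between two hypothetical clusters A and B:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # hypothetical cluster A
B = np.array([[4.0, 0.0], [5.0, 1.0]])   # hypothetical cluster B

d = cdist(A, B)            # all pairwise Euclidean distances between A and B
single = d.min()           # single linkage: smallest pairwise distance
complete = d.max()         # complete linkage: largest pairwise distance
print(single, complete)
```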



Data Mining
Hierarchical method

How good is a partition?

 Intra-cluster distance for each cluster is minimal
 Inter-cluster distance between clusters is maximal

Two criteria?

Intra-cluster distance = SSE (Sum of Squared Errors) within clusters = Intra inertia

  I_intra = Σ_q Σ_{i ∈ cluster q} ‖x_i − m_q‖²,  where m_q is the mean of cluster q

Inter-cluster distance = SSE (Sum of Squared Errors) between clusters = Inter inertia

  I_inter = Σ_q n_q ‖m_q − m‖²,  where m is the mean of all objects and n_q the size of cluster q



Data Mining
Hierarchical method

How good is a partition?

Huygens' theorem

  Σ_i ‖x_i − m‖² = Σ_q Σ_{i ∈ cluster q} ‖x_i − m_q‖² + Σ_q n_q ‖m_q − m‖²

  Total inertia = Intra inertia + Inter inertia

One criterion: since the total inertia is fixed, minimising the intra inertia is equivalent to maximising the inter inertia.
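
A quick NumPy check of this decomposition on toy data (points and partition are made up for illustration):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])   # toy objects
labels = np.array([0, 0, 1, 1])                                   # a partition into 2 clusters
m = X.mean(axis=0)                                                 # global mean

total = ((X - m) ** 2).sum()
intra = sum(((X[labels == q] - X[labels == q].mean(axis=0)) ** 2).sum()
            for q in np.unique(labels))
inter = sum((labels == q).sum() * ((X[labels == q].mean(axis=0) - m) ** 2).sum()
            for q in np.unique(labels))

print(total, intra + inter)   # both print 51.0: total inertia = intra + inter
```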



References

[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc., 2006.



