Introduction and preliminaries
Data Mining for Business and Governance
Dr. Gonzalo Nápoles
Learning goals
Learning goals covered in the course
Use exploratory data analysis and visualization techniques to extract
insights from raw data describing a classification problem.
Remarks
This learning goal will pave the way
for more complex content.
Learning goals covered in the course
Design and configure supervised models to tackle classification
problems while understanding their building blocks.
We will study
Naïve Bayes, Random Forests, Nearest
Neighbors and Decision Trees.
Learning goals covered in the course
Design and configure unsupervised models to extract patterns from the
data by means of cluster analysis and association rules.
We will study
k-means, fuzzy c-means, hierarchical
clustering, association rules.
Learning goals covered in the course
Compute measures associated with relevant algorithmic components
of supervised and unsupervised data mining models.
We will study
entropy, information gain, distance functions,
performance metrics.
Learning goals covered in the course
Draw conclusions on the potential and limitations of datasets, algorithms
and models, and their application in society.
We will study
hyperparameters and expected performance,
explainable AI and fairness.
Course organization
Course organization
The course will be delivered through theoretical lectures and practical
tutorials guided by the instructors. Tutorials are aimed at further
elaborating on main theoretical concepts.
For tutorials
Students will receive Python notebooks
with enough explanations.
Course organization
The course will be evaluated through a final exam. The exam will be
written, on-campus, and closed-book, consisting of 30 multiple-choice
questions carrying equal weight.
Remark
Coding skills will not be assessed
in the final exam!
Course organization
We will publish weekly quizzes on Canvas with exercises resembling the
structure and complexity of those in the final exam.
Additionally
There will be neither a midterm exam nor
a programming project.
Course organization
The reading material (consisting of selected book chapters) will help
students polish their understanding of the concepts discussed
during the theoretical lectures.
Remark
Reading these chapters is optional yet
highly recommended.
Getting started
Pattern classification
Training data used to build the model (X1, X2 and X3 are features describing the problem; Y is the outcome):

X1    X2    X3    Y
0.5   0.9   0.5   c1
0.2   0.5   0.1   c2
0.5   0.9   0.4   c1
0.1   1.0   0.9   c3
0.4   1.0   1.0   c2
0.9   0.3   0.5   c1
1.0   0.1   0.8   c3
1.0   0.4   1.0   c1
0.5   0.0   0.5   c2
0.8   0.0   0.9   c2
1.0   1.0   1.0   c1
0.5   0.7   0.3   c3
0.6   0.8   0.2   ?    ← new instance

§ In this problem, we have three numerical variables (features) to be used to predict the outcome (decision class).
§ This problem is multi-class since we have three possible outcomes.
§ The goal in pattern classification is to build a model able to generalize well beyond the historical training data.
How to proceed with this new instance? A minimal code sketch follows below.
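As a preview, here is a minimal sketch (assuming scikit-learn is available; the decision tree is just one of the classifiers listed in the learning goals) of fitting a model on the labelled rows above and predicting the label of the new instance:

```python
from sklearn.tree import DecisionTreeClassifier

# Training data: rows are instances, columns are the features X1, X2, X3.
X_train = [[0.5, 0.9, 0.5], [0.2, 0.5, 0.1], [0.5, 0.9, 0.4], [0.1, 1.0, 0.9],
           [0.4, 1.0, 1.0], [0.9, 0.3, 0.5], [1.0, 0.1, 0.8], [1.0, 0.4, 1.0],
           [0.5, 0.0, 0.5], [0.8, 0.0, 0.9], [1.0, 1.0, 1.0], [0.5, 0.7, 0.3]]
y_train = ["c1", "c2", "c1", "c3", "c2", "c1", "c3", "c1", "c2", "c2", "c1", "c3"]

model = DecisionTreeClassifier().fit(X_train, y_train)  # build the classifier
print(model.predict([[0.6, 0.8, 0.2]]))                  # label the new instance
```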
What will we cover in this lecture?
We will discuss how to deal with missing values, how to compute the
correlation/association between two features, how to encode categorical
features, and how to handle class imbalance.
In the tutorial
We will further elaborate on these topics
and exploratory data analysis.
Missing values
Missing values
Training data used to build the model (features describing the problem and the outcome):

X1    X2    X3    Y
0.5   ?     0.5   c1
0.2   0.5   0.1   c2
0.5   0.9   0.4   c1
0.1   ?     ?     c3
0.4   ?     1.0   c2
0.9   ?     0.5   c1
1.0   0.1   0.8   c3
1.0   ?     ?     c1
0.5   0.0   0.5   c2
0.8   ?     0.9   c2
1.0   ?     1.0   c1
0.5   ?     ?     c3
0.5   ?     0.7   c2
0.5   0.9   0.1   c1

§ Sometimes, we have instances that have missing values for some features.
§ It is of paramount importance to deal with this situation before building any machine learning or data mining model.
§ Missing values might result from fields that are not always applicable, incomplete measurements, or lost values.
Imputation strategies for missing values
The simplest strategy would be to remove the feature containing missing
values. This strategy is recommended when the majority of the instances
(observations) have missing values for that feature.
However
There are situations in which we have only a
few features or the feature we want to
remove is deemed relevant.
Imputation strategies for missing values
If we have scattered missing values and few features, we might want to
remove the instances having missing values.
However
There are situations in which we have a
limited number of instances.
Imputation strategies for missing values
The third strategy is the most popular. It consists of replacing the missing
values for a given feature with a representative value such as the mean,
the median or the mode of that feature.
However
We need to be aware that we are
introducing noise.
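A minimal sketch of this strategy (assuming pandas is available; the toy columns are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [43, np.nan, 25, 42, np.nan, 59],
                   "city": ["NY", "NY", None, "LA", "NY", "LA"]})

df["age"] = df["age"].fillna(df["age"].median())      # numerical feature: median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical feature: mode
print(df)
```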
Imputation strategies for missing values
Fancier strategies include estimating the missing values with a machine
learning model trained on the non-missing information.
Remark
More about missing values will be
covered in the Statistics course.
Autoencoders to impute missing values
Autoencoders are deep neural networks that involve two neural blocks
named encoder and decoder. The encoder reduces the problem
dimensionality while the decoder completes the pattern.
Learning
They use unsupervised learning to adjust the weights
that connect the neurons.
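A minimal sketch of this idea (assuming TensorFlow/Keras is available; the layer sizes and the initial mean-fill step are illustrative assumptions, not a prescribed recipe):

```python
import numpy as np
import tensorflow as tf

def autoencoder_impute(X, hidden=2, epochs=200):
    X = X.copy()
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])    # start with mean imputation

    n_features = X.shape[1]
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(hidden, activation="relu"),  # encoder: reduces dimensionality
        tf.keras.layers.Dense(n_features),                 # decoder: completes the pattern
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, X, epochs=epochs, verbose=0)           # unsupervised: reconstruct the input

    X[mask] = model.predict(X, verbose=0)[mask]         # keep observed values, replace missing ones
    return X
```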
Missing values and recommender systems
(Figure: latent features)
Feature scaling
Normalization

x' = (x − min(x)) / (max(x) − min(x))

where x is the original value and x' the new value; min(x) and max(x) are the minimum and maximum feature values.

§ Different features might encode different measurements and scales (the age and height of a person).
§ Normalization allows encoding all numeric features in the [0,1] scale.
§ We subtract the minimum from the value to be transformed and divide the result by the feature range.
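A minimal sketch of min-max normalization (plain NumPy; the columns and values are illustrative):

```python
import numpy as np

X = np.array([[25, 1.62], [43, 1.80], [57, 1.75]])  # age (years), height (m)

X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # each column mapped to [0, 1]
print(X_norm)
```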
Standardization

x' = (x − µ(x)) / σ(x)

where x is the original value and x' the new value; µ(x) is the mean and σ(x) the standard deviation of the feature.

§ This transformation method is similar to normalization, but the transformed values might not be in the [0,1] interval.
§ We subtract the mean from the value to be transformed and divide the result by the standard deviation.
§ Normalization and standardization might lead to different scaling results.
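A matching sketch for standardization (same illustrative data as above):

```python
import numpy as np

X = np.array([[25, 1.62], [43, 1.80], [57, 1.75]])  # age (years), height (m)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit standard deviation per column
print(X_std)                                  # values are not confined to [0, 1]
```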
Normalization versus standardization
(Figure: (a) original data, (b) normalized, (c) standardized)
These feature scaling approaches might be
affected by extreme values.
Feature interaction
Correlation between two numerical variables
Sometimes, we need to measure the correlation between numerical
features describing a certain problem domain.
For example
What is the correlation between gender
and income in Sweden?
Correlation between two numerical variables
To what extent can the data be approximated
with a linear regression model?
Pearson’s correlation

R = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )

where x_i and y_i are the i-th values of the variables x and y, and x̄ and ȳ are their mean values.

§ It is used when we want to determine the correlation between two numerical variables given k observations.
§ It is intended for numerical variables only and its value lies in [-1,1].
§ The order of variables does not matter since the coefficient is symmetric.
Correlation between age and glucose levels

i    Age (x)   Glucose (y)   (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²   (y_i − ȳ)²
1    43        99            33                    3.36          324
2    21        65            322.66                406.69        256
3    25        79            32.33                 261.36        4
4    42        75            -5                    0.69          36
5    57        87            95                    250.69        36
6    59        81            0                     318.02        0
     x̄ = 41.16  ȳ = 81       Σ = 478               Σ = 1240.83   Σ = 656

R = 478 / √(1240.83 × 656) ≈ 0.53
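A minimal check of this computation (plain NumPy; np.corrcoef is a standard alternative):

```python
import numpy as np

age = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

dx, dy = age - age.mean(), glucose - glucose.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(round(r, 2))                       # 0.53
print(np.corrcoef(age, glucose)[0, 1])   # same value via NumPy
```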
Association between two categorical variables
Sometimes, we need to measure the association degree between two
categorical (ordinal or nominal) variables.
For example
What is the association between
gender and eye color?
The χ² association measure

χ² = Σ_{i=1..k} (O_i − E_i)² / E_i

where O_i is the observed value, E_i the expected value, and k the number of observations.

§ It is used when we want to measure the association between two categorical variables given k observations.
§ We should compare the frequencies of values appearing together with their individual frequencies.
§ The first step in that regard would be to create a contingency table.
The χ² association measure

χ² = Σ_{i=1..m} Σ_{j=1..n} (O_ij − E_ij)² / E_ij,   with   E_ij = (p_i · p_j) / k

where O_ij counts how many times categories i and j were observed together, and p_i and p_j are the individual frequencies of the categories.

§ Let us assume that a categorical variable X involves m possible categories while Y involves n categories.
§ The observed value gives how many times each combination was found.
§ The expected value is the multiplication of the individual frequencies divided by the number of observations.
Association between gender and eye color

This is the contingency table for two categorical variables; the first one (gender) has two categories and the second one (eye color) has three categories:

         blue   green   brown   total
male     6      8       12      26
female   9      5       10      24
total    15     13      22      50

How to proceed? We have 26 males, of which 6 have blue eyes, 8 have green eyes and 12 have brown eyes. The number of people with blue, green and brown eyes is 15, 13 and 22, respectively.

Contribution of the male row to χ²:
χ²_(1) = (6 − 26·15/50)²/(26·15/50) + (8 − 26·13/50)²/(26·13/50) + (12 − 26·22/50)²/(26·22/50)
         (blue)                       (green)                      (brown)
Association between gender and eye color

Using the same contingency table, we now consider the female row.

How to proceed? We have 24 females, of which 9 have blue eyes, 5 have green eyes and 10 have brown eyes. The number of people with blue, green and brown eyes is 15, 13 and 22, respectively.

Contribution of the female row to χ²:
χ²_(2) = (9 − 24·15/50)²/(24·15/50) + (5 − 24·13/50)²/(24·13/50) + (10 − 24·22/50)²/(24·22/50)
         (blue)                       (green)                      (brown)
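A minimal sketch that reproduces this computation (plain NumPy; scipy.stats.chi2_contingency would give the same statistic together with a p-value):

```python
import numpy as np

# Contingency table: rows = gender (male, female), columns = eye color (blue, green, brown).
observed = np.array([[6, 8, 12],
                     [9, 5, 10]])

row_totals = observed.sum(axis=1, keepdims=True)   # 26, 24
col_totals = observed.sum(axis=0, keepdims=True)   # 15, 13, 22
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 2))   # ≈ 1.40
```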
Encoding strategies
Encoding categorical features
Some machine learning and data mining algorithms (or platforms) cannot
operate with categorical features.
Therefore
We need to encode these features as
numerical quantities.
Encoding categorical features
The first strategy is referred to as label encoding and consists of assigning
integer numbers to each category. It only makes sense if there is an
ordinal relationship among the categories.
For example
Weekdays, months, star-based hotel
ratings, income categories.
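A minimal sketch of label encoding (assuming pandas; the rating values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"hotel_rating": ["3-star", "1-star", "5-star", "3-star"]})

# The integer mapping preserves the ordinal relationship among the categories.
order = {"1-star": 1, "2-star": 2, "3-star": 3, "4-star": 4, "5-star": 5}
df["hotel_rating_encoded"] = df["hotel_rating"].map(order)
print(df)
```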
One-hot encoding

§ It is used to encode nominal features that lack an ordinal relationship.
§ Each category of the categorical feature is transformed into a binary feature such that one marks the category.
§ This strategy often increases the problem dimensionality notably since each feature is encoded as a binary vector.

For example, we have three instances of a problem aimed at classifying animals given a set of features (not shown for simplicity). In these instances, we replace the categorical feature with three binary features:

             cat   dog   rabbit
Instance 1   1     0     0
Instance 2   0     1     0
Instance 3   0     0     1
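A minimal sketch using pandas (pd.get_dummies is one common way to do this; scikit-learn's OneHotEncoder is another):

```python
import pandas as pd

df = pd.DataFrame({"animal": ["cat", "dog", "rabbit"]})
one_hot = pd.get_dummies(df["animal"], dtype=int)  # one binary column per category
print(one_hot)
```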
Class imbalance
Class imbalance
§ Sometimes, we have problems with many
more instances belonging to one decision
class than to the other classes.
§ In this example, we have more instances
labelled with the negative decision class
than the positive one.
§ Classifiers are tempted to recognize the
majority decision class only.
Simple strategies
§ One strategy is to select some instances
from the majority decision class (undersampling),
provided we retain enough instances.
§ Another method consists of creating new
instances belonging to the minority class,
for example random copies (oversampling).
§ These strategies are applied to the data
when building the model (see the sketch below).
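A minimal sketch of both ideas (plain pandas resampling; the toy data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10),
                   "y": ["neg"] * 8 + ["pos"] * 2})   # 8 negative vs. 2 positive instances

majority = df[df["y"] == "neg"]
minority = df[df["y"] == "pos"]

# Undersampling: keep only as many majority instances as there are minority ones.
under = pd.concat([majority.sample(len(minority), random_state=0), minority])

# Oversampling: add random copies of the minority class until both classes match.
over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=0)])
```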
SMOTE

§ SMOTE (Synthetic Minority Oversampling Technique) is a popular strategy to deal with class imbalance.
§ SMOTE creates synthetic instances in the neighbourhoods of instances belonging to the minority class.
§ Caution is advised since the classifier is forced to learn from artificial instances, which might induce noise. SMOTE arbitrarily assumes that artificial instances belong to the minority class.

(Figure: green squares denote synthetic instances generated around the minority instances.)
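A minimal sketch (assuming the imbalanced-learn package is installed; the toy data are illustrative):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced problem: roughly 90% of the instances belong to one class.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print(Counter(y))        # class distribution before resampling

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))    # both classes are now equally represented
```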