Introduction and preliminaries
Data Mining for Business and Governance
Dr. Gonzalo Nápoles
Learning goals
Learning goals covered in the course
Use exploratory data analysis and visualization techniques to extract
insights from raw data describing a classification problem.
Remarks
This learning goal will pave the way
for more complex content.
Learning goals covered in the course
Design and configure supervised models to tackle classification
problems while understanding their building blocks.
We will study
Naïve Bayes, Random Forests, Nearest
Neighbors and Decision Trees.
Learning goals covered in the course
Design and configure unsupervised models to extract patterns from the
data by means of cluster analysis and association rules.
We will study
k-means, fuzzy c-means, hierarchical
clustering, association rules.
Learning goals covered in the course
Compute measures associated with relevant algorithmic components
of supervised and unsupervised data mining models.
We will study
entropy, information gain, distance functions,
performance metrics.
Learning goals covered in the course
Draw conclusions on the potential and limitations of datasets, algorithms
and models, and their application in society.
We will study
hyperparameters and expected performance,
explainable AI and fairness.
Course organization
Course organization
The course will be delivered through theoretical lectures and practical
tutorials guided by the instructors. Tutorials are aimed at further
elaborating on main theoretical concepts.
For tutorials
Students will receive Python notebooks
with enough explanations.
Course organization
The course will be evaluated through a final exam. The exam will be
written, on-campus, and closed-book, consisting of 30 multiple-choice
questions carrying equal weight.
Remark
Coding skills will not be assessed
in the final exam!
Course organization
We will publish weekly quizzes on Canvas with exercises resembling the
structure and complexity of those in the final exam.
Additionally
There will be neither a midterm exam nor
a programming project.
Course organization
The reading material (consisting of selected book chapters) will help
students polish their understanding of the concepts discussed
during the theoretical lectures.
Remark
Reading these chapters is optional yet
highly recommended.
Getting started
Pattern classification
Training data used to build the model (X1, X2 and X3 are features describing the problem; Y is the outcome):

X1    X2    X3    Y
0.5   0.9   0.5   c1
0.2   0.5   0.1   c2
0.5   0.9   0.4   c1
0.1   1.0   0.9   c3
0.4   1.0   1.0   c2
0.9   0.3   0.5   c1
1.0   0.1   0.8   c3
1.0   0.4   1.0   c1
0.5   0.0   0.5   c2
0.8   0.0   0.9   c2
1.0   1.0   1.0   c1
0.5   0.7   0.3   c3
0.6   0.8   0.2   ?    ← new instance

§ In this problem, we have three numerical variables (features) to be used to predict the outcome (decision class).
§ This problem is multi-class since we have three possible outcomes.
§ The goal in pattern classification is to build a model able to generalize well beyond the historical training data.
How to proceed with this new instance? A minimal code sketch follows below.
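As a preview, here is a minimal sketch (assuming scikit-learn is available; the decision tree is just one of the classifiers listed in the learning goals) of fitting a model on the labelled rows above and predicting the label of the new instance:

```python
from sklearn.tree import DecisionTreeClassifier

# Training data: rows are instances, columns are the features X1, X2, X3.
X_train = [[0.5, 0.9, 0.5], [0.2, 0.5, 0.1], [0.5, 0.9, 0.4], [0.1, 1.0, 0.9],
           [0.4, 1.0, 1.0], [0.9, 0.3, 0.5], [1.0, 0.1, 0.8], [1.0, 0.4, 1.0],
           [0.5, 0.0, 0.5], [0.8, 0.0, 0.9], [1.0, 1.0, 1.0], [0.5, 0.7, 0.3]]
y_train = ["c1", "c2", "c1", "c3", "c2", "c1", "c3", "c1", "c2", "c2", "c1", "c3"]

model = DecisionTreeClassifier().fit(X_train, y_train)  # build the classifier
print(model.predict([[0.6, 0.8, 0.2]]))                  # label the new instance
```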
What will we cover in this lecture?
We will discuss how to deal with missing values, how to compute the
correlation/association between two features, how to encode categorical
features, and how to handle class imbalance.
In the tutorial
We will further elaborate on these topics
and exploratory data analysis.
Missing values
Missing values
Training data used to build the model (features describing the problem and the outcome):

X1    X2    X3    Y
0.5   ?     0.5   c1
0.2   0.5   0.1   c2
0.5   0.9   0.4   c1
0.1   ?     ?     c3
0.4   ?     1.0   c2
0.9   ?     0.5   c1
1.0   0.1   0.8   c3
1.0   ?     ?     c1
0.5   0.0   0.5   c2
0.8   ?     0.9   c2
1.0   ?     1.0   c1
0.5   ?     ?     c3
0.5   ?     0.7   c2
0.5   0.9   0.1   c1

§ Sometimes, we have instances that have missing values for some features.
§ It is of paramount importance to deal with this situation before building any machine learning or data mining model.
§ Missing values might result from fields that are not always applicable, incomplete measurements, or lost values.
Imputation strategies for missing values
The simplest strategy would be to remove the feature containing missing
values. This strategy is recommended when the majority of the instances
(observations) have missing values for that feature.
However
There are situations in which we have only a
few features or the feature we want to
remove is deemed relevant.
Imputation strategies for missing values
If we have scattered missing values and few features, we might want to
remove the instances having missing values.
However
There are situations in which we have a
limited number of instances.
Imputation strategies for missing values
The third strategy is the most popular. It consists of replacing the missing
values for a given feature with a representative value such as the mean,
the median or the mode of that feature.
However
We need to be aware that we are
introducing noise.
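A minimal sketch of this strategy (assuming pandas is available; the toy columns are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [43, np.nan, 25, 42, np.nan, 59],
                   "city": ["NY", "NY", None, "LA", "NY", "LA"]})

df["age"] = df["age"].fillna(df["age"].median())      # numerical feature: median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical feature: mode
print(df)
```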
Imputation strategies for missing values
Fancier strategies include estimating the missing values with a machine
learning model trained on the non-missing information.
Remark
More about missing values will be
covered in the Statistics course.
Autoencoders to impute missing values
Autoencoders are deep neural networks that involve two neural blocks
named encoder and decoder. The encoder reduces the problem
dimensionality while the decoder completes the pattern.
Learning
They use unsupervised learning to adjust the weights
that connect the neurons.
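A minimal sketch of this idea (assuming TensorFlow/Keras is available; the layer sizes and the initial mean-fill step are illustrative assumptions, not a prescribed recipe):

```python
import numpy as np
import tensorflow as tf

def autoencoder_impute(X, hidden=2, epochs=200):
    X = X.copy()
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])    # start with mean imputation

    n_features = X.shape[1]
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(hidden, activation="relu"),  # encoder: reduces dimensionality
        tf.keras.layers.Dense(n_features),                 # decoder: completes the pattern
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, X, epochs=epochs, verbose=0)           # unsupervised: reconstruct the input

    X[mask] = model.predict(X, verbose=0)[mask]         # keep observed values, replace missing ones
    return X
```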
Missing values and recommender systems
(Figure: latent features)
Feature scaling
Normalization

x' = (x − min(x)) / (max(x) − min(x))

where x is the original value and x' the new value; min(x) and max(x) are the minimum and maximum feature values.

§ Different features might encode different measurements and scales (the age and height of a person).
§ Normalization allows encoding all numeric features in the [0,1] scale.
§ We subtract the minimum from the value to be transformed and divide the result by the feature range.
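A minimal sketch of min-max normalization (plain NumPy; the columns and values are illustrative):

```python
import numpy as np

X = np.array([[25, 1.62], [43, 1.80], [57, 1.75]])  # age (years), height (m)

X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # each column mapped to [0, 1]
print(X_norm)
```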
Standardization

x' = (x − µ(x)) / σ(x)

where x is the original value and x' the new value; µ(x) is the mean and σ(x) the standard deviation of the feature.

§ This transformation method is similar to normalization, but the transformed values might not be in the [0,1] interval.
§ We subtract the mean from the value to be transformed and divide the result by the standard deviation.
§ Normalization and standardization might lead to different scaling results.
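A matching sketch for standardization (same illustrative data as above):

```python
import numpy as np

X = np.array([[25, 1.62], [43, 1.80], [57, 1.75]])  # age (years), height (m)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit standard deviation per column
print(X_std)                                  # values are not confined to [0, 1]
```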
Normalization versus standardization
(Figure: (a) original data, (b) normalized, (c) standardized)
These feature scaling approaches might be
affected by extreme values.
Feature interaction
Correlation between two numerical variables
Sometimes, we need to measure the correlation between numerical
features describing a certain problem domain.
For example
What is the correlation between gender
and income in Sweden?
Correlation between two numerical variables
To what extent can the data be approximated
with a linear regression model?
Pearson’s correlation

R = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )

where x_i and y_i are the i-th values of the variables x and y, and x̄ and ȳ are their mean values.

§ It is used when we want to determine the correlation between two numerical variables given k observations.
§ It is intended for numerical variables only and its value lies in [-1,1].
§ The order of variables does not matter since the coefficient is symmetric.
Correlation between age and glucose levels

i    Age (x)   Glucose (y)   (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²   (y_i − ȳ)²
1    43        99            33                    3.36          324
2    21        65            322.66                406.69        256
3    25        79            32.33                 261.36        4
4    42        75            -5                    0.69          36
5    57        87            95                    250.69        36
6    59        81            0                     318.02        0
     x̄ = 41.16  ȳ = 81       Σ = 478               Σ = 1240.83   Σ = 656

R = 478 / √(1240.83 × 656) ≈ 0.53
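A minimal check of this computation (plain NumPy; np.corrcoef is a standard alternative):

```python
import numpy as np

age = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

dx, dy = age - age.mean(), glucose - glucose.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(round(r, 2))                       # 0.53
print(np.corrcoef(age, glucose)[0, 1])   # same value via NumPy
```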
Association between two categorical variables
Sometimes, we need to measure the association degree between two
categorical (ordinal or nominal) variables.
For example
What is the association between
gender and eye color?
The χ² association measure

χ² = Σ_{i=1..k} (O_i − E_i)² / E_i

where O_i is the observed value, E_i the expected value, and k the number of observations.

§ It is used when we want to measure the association between two categorical variables given k observations.
§ We should compare the frequencies of values appearing together with their individual frequencies.
§ The first step in that regard would be to create a contingency table.
The χ² association measure

χ² = Σ_{i=1..m} Σ_{j=1..n} (O_ij − E_ij)² / E_ij,   with   E_ij = (p_i · p_j) / k

where O_ij counts how many times categories i and j were observed together, and p_i and p_j are the individual frequencies of the categories.

§ Let us assume that a categorical variable X involves m possible categories while Y involves n categories.
§ The observed value gives how many times each combination was found.
§ The expected value is the multiplication of the individual frequencies divided by the number of observations.
Association between gender and eye color

This is the contingency table for two categorical variables; the first one (gender) has two categories and the second one (eye color) has three categories:

         blue   green   brown   total
male     6      8       12      26
female   9      5       10      24
total    15     13      22      50

How to proceed? We have 26 males, of which 6 have blue eyes, 8 have green eyes and 12 have brown eyes. The number of people with blue, green and brown eyes is 15, 13 and 22, respectively.

Contribution of the male row to χ²:
χ²_(1) = (6 − 26·15/50)²/(26·15/50) + (8 − 26·13/50)²/(26·13/50) + (12 − 26·22/50)²/(26·22/50)
         (blue)                       (green)                      (brown)
Association between gender and eye color

Using the same contingency table, we now consider the female row.

How to proceed? We have 24 females, of which 9 have blue eyes, 5 have green eyes and 10 have brown eyes. The number of people with blue, green and brown eyes is 15, 13 and 22, respectively.

Contribution of the female row to χ²:
χ²_(2) = (9 − 24·15/50)²/(24·15/50) + (5 − 24·13/50)²/(24·13/50) + (10 − 24·22/50)²/(24·22/50)
         (blue)                       (green)                      (brown)
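A minimal sketch that reproduces this computation (plain NumPy; scipy.stats.chi2_contingency would give the same statistic together with a p-value):

```python
import numpy as np

# Contingency table: rows = gender (male, female), columns = eye color (blue, green, brown).
observed = np.array([[6, 8, 12],
                     [9, 5, 10]])

row_totals = observed.sum(axis=1, keepdims=True)   # 26, 24
col_totals = observed.sum(axis=0, keepdims=True)   # 15, 13, 22
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 2))   # ≈ 1.40
```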
Encoding strategies
Encoding categorical features
Some machine learning and data mining algorithms (or platforms) cannot
operate with categorical features.
Therefore
We need to encode these features as
numerical quantities.
Encoding categorical features
The first strategy is referred to as label encoding and consists of assigning
integer numbers to each category. It only makes sense if there is an
ordinal relationship among the categories.
For example
Weekdays, months, star-based hotel
ratings, income categories.
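A minimal sketch of label encoding (assuming pandas; the rating values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"hotel_rating": ["3-star", "1-star", "5-star", "3-star"]})

# The integer mapping preserves the ordinal relationship among the categories.
order = {"1-star": 1, "2-star": 2, "3-star": 3, "4-star": 4, "5-star": 5}
df["hotel_rating_encoded"] = df["hotel_rating"].map(order)
print(df)
```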
One-hot encoding

§ It is used to encode nominal features that lack an ordinal relationship.
§ Each category of the categorical feature is transformed into a binary feature such that one marks the category.
§ This strategy often increases the problem dimensionality notably since each feature is encoded as a binary vector.

For example, we have three instances of a problem aimed at classifying animals given a set of features (not shown for simplicity). In these instances, we replace the categorical feature with three binary features:

             cat   dog   rabbit
Instance 1   1     0     0
Instance 2   0     1     0
Instance 3   0     0     1
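A minimal sketch using pandas (pd.get_dummies is one common way to do this; scikit-learn's OneHotEncoder is another):

```python
import pandas as pd

df = pd.DataFrame({"animal": ["cat", "dog", "rabbit"]})
one_hot = pd.get_dummies(df["animal"], dtype=int)  # one binary column per category
print(one_hot)
```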
Class imbalance
Class imbalance
§ Sometimes, we have problems with many
more instances belonging to one decision
class than to the other classes.
§ In this example, we have more instances
labelled with the negative decision class
than the positive one.
§ Classifiers are tempted to recognize the
majority decision class only.
Simple strategies
§ One strategy is to select some instances
from the majority decision class (undersampling),
provided we retain enough instances.
§ Another method consists of creating new
instances belonging to the minority class,
for example random copies (oversampling).
§ These strategies are applied to the data
when building the model (see the sketch below).
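A minimal sketch of both ideas (plain pandas resampling; the toy data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10),
                   "y": ["neg"] * 8 + ["pos"] * 2})   # 8 negative vs. 2 positive instances

majority = df[df["y"] == "neg"]
minority = df[df["y"] == "pos"]

# Undersampling: keep only as many majority instances as there are minority ones.
under = pd.concat([majority.sample(len(minority), random_state=0), minority])

# Oversampling: add random copies of the minority class until both classes match.
over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=0)])
```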
SMOTE

§ SMOTE (Synthetic Minority Oversampling Technique) is a popular strategy to deal with class imbalance.
§ SMOTE creates synthetic instances in the neighbourhoods of instances belonging to the minority class.
§ Caution is advised since the classifier is forced to learn from artificial instances, which might induce noise. SMOTE arbitrarily assumes that artificial instances belong to the minority class.

(Figure: green squares denote synthetic instances generated around the minority instances.)
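A minimal sketch (assuming the imbalanced-learn package is installed; the toy data are illustrative):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced problem: roughly 90% of the instances belong to one class.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print(Counter(y))        # class distribution before resampling

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))    # both classes are now equally represented
```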