
CPE316 - Introduction to Machine Learning

Week 3
Introduction to Supervised Learning

Assoc. Prof. Dr. Caner ÖZCAN


Nothing is impossible. The word itself says: "I’m possible!"
~ Audrey Hepburn
Machine Learning Glossary

• https://developers.google.com/machine-learning/glossary#model
• https://scikit-learn.org/stable/glossary.html
• https://seaborn.pydata.org/examples/index.html
Applications

• Association
• Supervised Learning
  • Classification
  • Regression
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning: Uses

• Prediction of future cases: use the rule to predict the output for future inputs
• Knowledge extraction: the rule is easy to understand
• Compression: the rule is simpler than the data it explains
• Outlier detection: exceptions that are not covered by the rule, e.g., fraud
Supervised Learning

• We discuss supervised learning starting from the simplest case, which is learning a class from its positive and negative examples.
• We generalize and discuss the case of multiple classes, then regression, where the outputs are continuous.
Learning a Class from Examples

• Class C of a "family car"
• Prediction: Is car x a family car?
• Knowledge extraction: What do people expect from a family car?
• Output: positive (+) and negative (−) examples
• Input representation: $x_1$: price, $x_2$: engine power
Training set X

• '+' denotes a positive example of the class (a family car)
• '−' denotes a negative example (not a family car)

Figure: Training set for the class of a "family car." Each data point corresponds to one example car.
Training set X

Each input is a two-dimensional vector,

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

where $x_1$ is the price (e.g., in U.S. dollars) and $x_2$ is the engine power (e.g., engine volume in cubic centimeters).

Each car is represented by such an ordered pair (x, r), where the label r is

$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is a positive example} \\ 0 & \text{if } \mathbf{x} \text{ is a negative example} \end{cases}$$

The training set contains N such examples,

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

where the superscript t indexes the examples.
Class C

$$(p_1 \le \text{price} \le p_2) \ \text{AND}\ (e_1 \le \text{engine power} \le e_2)$$

• After further discussions with the expert and the analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should each lie in a certain range.

Figure: Example of a hypothesis class. The class of family car is a rectangle in the price-engine power space.
Class C
• C is the actual class and h is our induced hypothesis.
• A point where C is 1 but h is 0 is a false negative, and a point where C is 0 but h is 1 is a false positive.
• The other points, namely true positives and true negatives, are correctly classified.
• The equation above fixes H, the hypothesis class from which we believe C is drawn, namely, the set of axis-aligned rectangles.
• The learning algorithm then finds the particular hypothesis $h \in H$, specified by a particular quadruple $(p_1^h, p_2^h, e_1^h, e_2^h)$, to approximate C as closely as possible, as the sketch below illustrates.

Figure: Example of a hypothesis class.
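A minimal sketch of such a rectangle hypothesis in Python; the quadruple values and the example cars are made-up numbers, not values from the lecture:

```python
# A rectangle hypothesis h for the "family car" class.
# The quadruple (p1, p2, e1, e2) below is a hypothetical choice.

def make_rectangle_hypothesis(p1, p2, e1, e2):
    """Return h(x): 1 if price and engine power fall inside the rectangle."""
    def h(x):
        price, engine_power = x
        return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0
    return h

h = make_rectangle_hypothesis(p1=10_000, p2=25_000, e1=1200, e2=2000)
print(h((18_000, 1600)))  # inside the rectangle -> 1
print(h((40_000, 3000)))  # outside -> 0
```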
Hypothesis class H

• The aim is to find an $h \in H$ that is as similar as possible to C.
• The hypothesis h makes a prediction for an instance x:

$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ classifies } \mathbf{x} \text{ as positive} \\ 0 & \text{if } h \text{ classifies } \mathbf{x} \text{ as negative} \end{cases}$$

• The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X.
• The error of hypothesis h given the training set X is

$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbb{1}\big(h(\mathbf{x}^t) \ne r^t\big)$$

where $\mathbb{1}(a \ne b)$ is 1 if $a \ne b$ and 0 if $a = b$.
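A short sketch of this empirical error computation; the hypothesis boundaries and the training pairs are hypothetical illustration values:

```python
# Empirical error: count (and proportion) of training examples misclassified
# by h. The hypothesis rectangle and the data are hypothetical.

def h(x):
    price, engine_power = x
    return 1 if (10_000 <= price <= 25_000) and (1200 <= engine_power <= 2000) else 0

training_set = [
    ((18_000, 1600), 1),   # (x, r) pairs: r = 1 for positive examples
    ((40_000, 3000), 0),
    ((12_000, 1100), 1),   # h gets this one wrong (engine power below 1200)
]

errors = sum(1 for x, r in training_set if h(x) != r)   # sum of 1(h(x^t) != r^t)
print(errors, errors / len(training_set))               # error count and proportion
```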


Hypothesis class H

• In our example, the hypothesis class H is the set of all possible axis-aligned rectangles.
• Each quadruple $(p_1^h, p_2^h, e_1^h, e_2^h)$ defines one hypothesis h from H, and we need to choose the best one.
• Generalization is how well our hypothesis will correctly classify future examples that are not part of the training set.
S, G, and the Version Space

• The most specific hypothesis, S, is the tightest rectangle that includes all the positive examples and none of the negative examples.
• The most general hypothesis, G, is the largest rectangle we can draw that includes all the positive examples and none of the negative examples.
• Any h ∈ H between S and G is consistent with the training set, and such hypotheses make up the version space (Mitchell, 1997).

Figure: the most specific hypothesis S and the most general hypothesis G.
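For axis-aligned rectangles, computing S is simple: it is the bounding box of the positive examples. A minimal sketch, with hypothetical data points:

```python
# The most specific hypothesis S is the tightest axis-aligned rectangle
# around the positive examples: their bounding box. Points are hypothetical.

positives = [(18_000, 1600), (12_000, 1300), (20_000, 1800)]

prices = [p for p, _ in positives]
powers = [e for _, e in positives]
S = (min(prices), max(prices), min(powers), max(powers))  # (p1, p2, e1, e2)
print(S)  # (12000, 20000, 1300, 1800)
```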
Margin

• Choose the h with the largest margin.
• It seems intuitive to choose h halfway between S and G; this is to increase the margin, which is the distance between the boundary and the instances closest to it.
• We should use an error (loss) function that checks not only whether an instance is on the correct side of the boundary, but also how far away it is.
• Instead of an h(x) that returns 0/1, we need a hypothesis that returns a value carrying a measure of the distance to the boundary.
VC Dimension

• N points can be labeled in $2^N$ ways as +/−.
• H shatters N points if, for every one of these labelings, there exists an $h \in H$ consistent with it.
• The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis (VC) dimension of H: VC(H) = N.

Figure: An axis-aligned rectangle shatters 4 points. Only rectangles covering two points are shown.
Probably Approximately Correct (PAC) Learning

• How many training examples N should we have such that, with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)

• Each strip has probability at most ε/4.
• The probability that one random instance misses a strip is at least 1 − ε/4.
• The probability that all N instances miss a strip is $(1 - \varepsilon/4)^N$.
• The probability that all N instances miss any of the 4 strips is at most $4(1 - \varepsilon/4)^N$.
• We require $4(1 - \varepsilon/4)^N \le \delta$; using $(1 - x) \le e^{-x}$,
• $4e^{-\varepsilon N/4} \le \delta$, which gives $N \ge (4/\varepsilon)\log(4/\delta)$.

Figure: The difference between h and C is the sum of four rectangular strips, one of which is shaded.
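As a quick worked example, with illustrative values $\varepsilon = 0.1$ and $\delta = 0.05$ (the log is the natural logarithm, since the bound comes from $(1-x) \le e^{-x}$):

$$N \ge \frac{4}{\varepsilon}\log\frac{4}{\delta} = 40 \log 80 \approx 40 \times 4.38 \approx 175.3$$

so N = 176 training examples suffice for this ε and δ.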
Noise and Model Complexity

• Noise is any unwanted anomaly in the data. Due to noise, the class may be more difficult to learn, and zero error may be infeasible with a simple hypothesis class.

• Still, use the simpler rectangle because it is:
  • simpler to use (lower computational complexity),
  • easier to train (lower space complexity, fewer parameters),
  • easier to explain (more interpretable), and
  • it generalizes better (lower variance; Occam's razor).

Figure: When there is noise, there is not a simple boundary between the positive and negative instances.
Multiple Classes, $C_i$, $i = 1,\dots,K$

The training set is $\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N}$, where $\mathbf{r}$ has K dimensions and

$$r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases}$$

In a K-class problem, we train K hypotheses $h_i(\mathbf{x})$, $i = 1,\dots,K$, such that

$$h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases}$$

Figure: Three hypotheses, each one covering the instances of one class and leaving outside the instances of the other two classes. '?' marks reject regions where no class, or more than one class, is chosen.
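A small sketch of this K-dimensional label encoding; the class labels below are hypothetical:

```python
# One-hot encoding of class labels r^t for a K-class problem:
# r_i = 1 if the example belongs to class C_i, 0 otherwise.

import numpy as np

K = 3
labels = [0, 2, 1, 0]       # class index of each training example (hypothetical)
R = np.eye(K)[labels]        # shape (N, K); row t is the vector r^t
print(R)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```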
Regression

• In classification, given an input, the output that is generated is Boolean; it is a yes/no answer.
• When the output is a numeric value, what we would like to learn is not a class, C(x) ∈ {0, 1}, but a numeric function.
• In machine learning, the function is not known, but we have a training set of examples drawn from it:

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}, \quad r^t \in \mathbb{R}$$

• If there is no noise, the task is interpolation: we would like to find the function f(x) that passes through these points such that

$$r^t = f(\mathbf{x}^t)$$
Regression

• In time-series prediction, for example, we have data up to the present and we want to predict the value for the future.
• In regression, there is noise added to the output of the unknown function:

$$r^t = f(\mathbf{x}^t) + \epsilon$$

where $f(\mathbf{x}) \in \mathbb{R}$ is the unknown function and $\epsilon$ is random noise.

• We would like to approximate the output by our model g(x). The empirical error on the training set X is

$$E(g \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big[r^t - g(\mathbf{x}^t)\big]^2$$
Regression

• The square of the difference is one error (loss) function that can be used; another is the absolute value of the difference.
• Our aim is to find the g(·) that minimizes the empirical error.
• Our approach is the same; we assume a hypothesis class for g(·) with a small set of parameters.
• If we assume that g(x) is linear, we have

$$g(\mathbf{x}) = w_1 x_1 + \cdots + w_d x_d + w_0 = \sum_{j=1}^{d} w_j x_j + w_0$$
Regression

• Let us now go back to our example in the previous section where we estimated the price of a used car.
• There we used a single-input linear model

$$g(x) = w_1 x + w_0$$

where $w_1$ and $w_0$ are the parameters to learn from data. The $w_1$ and $w_0$ values should minimize

$$E(w_1, w_0 \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big[r^t - (w_1 x^t + w_0)\big]^2$$

• The output may be taken as a higher-order function of the input, for example, quadratic:

$$g(x) = w_2 x^2 + w_1 x + w_0$$
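Minimizing this error has a closed-form solution: setting the partial derivatives with respect to $w_1$ and $w_0$ to zero gives $w_1 = \mathrm{Cov}(x, r)/\mathrm{Var}(x)$ and $w_0 = \bar{r} - w_1\bar{x}$. A minimal sketch with made-up data:

```python
# Closed-form least-squares fit of g(x) = w1*x + w0.
# The (x, r) pairs are hypothetical illustration values.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # e.g., car age in years
r = np.array([9.0, 7.1, 5.2, 2.9])   # e.g., price in thousands

w1 = np.cov(x, r, bias=True)[0, 1] / np.var(x)  # Cov(x, r) / Var(x)
w0 = r.mean() - w1 * x.mean()
print(w1, w0)  # slope and intercept minimizing the squared error
```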
Regression

• Linear, second-order, and sixth-order polynomials are fitted to the same set of points:

$$g(x) = w_1 x + w_0$$
$$g(x) = w_2 x^2 + w_1 x + w_0$$

• The highest order gives a perfect fit, but given this much data it is very unlikely that the real curve is so shaped.
• The second order seems better than the linear fit in capturing the trend in the training data.
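A quick way to reproduce this comparison, assuming noisy points sampled from some underlying curve (the data here is synthetic, not the figure's data):

```python
# Fit polynomials of degree 1, 2, and 6 to the same noisy points
# and compare training errors.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 7)
r = np.sin(2 * np.pi * x) + rng.normal(scale=0.15, size=x.size)

for degree in (1, 2, 6):
    coeffs = np.polyfit(x, r, degree)     # least-squares polynomial fit
    g = np.polyval(coeffs, x)
    print(degree, np.mean((r - g) ** 2))   # training error shrinks to ~0 at degree 6
```

With 7 points, the degree-6 polynomial interpolates them exactly: zero training error, but almost certainly a bad model of the real curve.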
Model Selection & Generalization
• If the training set we are given contains only a small subset of all
possible instances, the solution is not unique.
• This is an example of an ill-posed problem;where the data by itself is
not sufficient to find a unique solution.
• If learning is ill-posed, and data by itself is not sufficient to find the
solution, we should make some extra assumptions to have a unique
solution with the data we have.
• The set of assumptions we make to have learning possible is called
the inductive bias of the learning algorithm.
• The need for inductive bias, assumptions about hypothesis class H

25
Model Selection & Generalization
• Learning is not possible without inductive bias, and now
the question is how to choose the right bias.
• This is called model selection, which is choosing between
possible H.
• In answering this question, we should remember that the
aim of machine learning is rarely to replicate the training
data but the prediction for new cases.
• We would like to be able to generate the right output for
an input instance outside the training set, one for which
the correct output is not given in the training set.
• How well a model trained on the training set predicts the
right output for new instances is called generalization.
26
*Yi-xin
Underfitting

• For best generalization, we should match the complexity of the hypothesis class H with the complexity of the function underlying the data.
• If H is less complex than the function, we have underfitting; for example, when trying to fit a line to data sampled from a third-order polynomial.
• In such a case, as we increase the complexity, the training error decreases.
• But if we have an H that is too complex, the data is not enough to constrain it and we may end up with a bad hypothesis, h ∈ H.

(Image: Siora Photography)
Overfitting

• If there is noise, an overcomplex hypothesis may learn not only the underlying function but also the noise in the data and may make a bad fit; for example, when fitting a sixth-order polynomial to noisy data sampled from a third-order polynomial.
• This is called overfitting.
• In such a case, having more training data helps, but only up to a certain point.
• Given a training set and H, we can find the h ∈ H that has the minimum training error, but if H is not chosen well, no matter which h ∈ H we pick, we will not have good generalization.

(Image: Sharoon Saxena)
Triple Trade-Off

• In all learning algorithms that are trained from example data, there is a trade-off between three factors:
  • the complexity of the hypothesis class we fit to the data, C(H), namely, the capacity of the hypothesis class,
  • the amount of training data, N, and
  • the generalization error, E, on new examples.

• As N increases, E decreases.
• As C(H) increases, E first decreases and then increases.
Train and Validation Set

• We can measure the generalization ability of a hypothesis, namely, the quality of its inductive bias, if we have access to data outside the training set.
• We simulate this by dividing the dataset we have into two parts.
• We use one part for training (i.e., to fit a hypothesis); the remaining part is called the validation set and is used to test the generalization ability.
• Assuming large enough training and validation sets, the hypothesis that is the most accurate on the validation set is the best one (best inductive bias).
Cross-Validation and Test Set

• This process of splitting the data is called cross-validation.
• Note that if we then need to report the error to give an idea about the expected error of our best model, we should not use the validation error.
• We have used the validation set to choose the best model, so it has effectively become a part of the training set.
• We need a test set, containing examples not used in training or validation.
• We split the data, for example, as:
  • Training set (50%)
  • Validation set (25%)
  • Test set (25%)
• Resampling is used when there is little data. A split sketch follows below.
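A minimal sketch of this 50/25/25 split using scikit-learn's train_test_split; the X and y arrays here are dummy stand-ins for real features and labels:

```python
# 50/25/25 train/validation/test split via two successive splits.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)   # dummy features
y = np.arange(100) % 2               # dummy labels

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=42)          # 50% train
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)  # 25% val, 25% test

print(len(X_train), len(X_val), len(X_test))  # 50 25 25
```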
Holdout Method for Model Evaluation and Selection

• For model evaluation, the dataset is split into two parts; generally, a 70-30% split is used.
• For model selection, the dataset is split into three different sets: training, validation, and test.
• The hold-out method can also be used for hyperparameter tuning.
Leave-p-out Cross-Validation

• Leave-p-out cross-validation uses p observations as the validation set and the remaining observations as the training set.
• This is repeated for all ways of cutting the original sample into a validation set of p observations and a training set.
K-Fold Cross-Validation

• For more advanced statistical evaluation, experienced experimenters often prefer the so-called K-fold cross-validation.
• To begin with, the set of pre-classified examples is divided into K equally sized (or almost equally sized) subsets, which machine-learning jargon sometimes (not quite correctly) refers to as "folds."
K-Fold Cross-Validation

• K-fold cross-validation then runs K experiments.
• In each, one of the K subsets is removed so as to be used only for testing (this guarantees that, in each run, a different testing set is used).
• The training is then carried out on the union of the remaining K−1 subsets.
• Finally, the results are averaged and the standard deviation is calculated, as in the sketch below.
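A short sketch of this procedure with scikit-learn; the dataset and classifier are arbitrary stand-ins, not part of the lecture:

```python
# K-fold cross-validation: K experiments, each holding out one fold for
# testing and training on the union of the remaining K-1 folds.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

scores = cross_val_score(clf, X, y, cv=5)  # K = 5 folds
print(scores.mean(), scores.std())          # average accuracy and its spread
```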
Dimensions of a Supervised Learner

1. Model:

$$g(\mathbf{x} \mid \theta)$$

where g(·) is the model, x is the input, and θ are the parameters.

2. Loss function:

$$E(\theta \mid \mathcal{X}) = \sum_{t} L\big(r^t, g(\mathbf{x}^t \mid \theta)\big)$$

where $r^t$ is the desired output and $g(\mathbf{x}^t \mid \theta)$ is our approximation to it.

3. Optimization procedure (to find the θ* that minimizes the total error):

$$\theta^* = \arg\min_{\theta} E(\theta \mid \mathcal{X})$$

where arg min returns the argument that minimizes.
Homework

End-to-End Machine Learning Project (Chapter 2)

Read the chapter and apply all the code by creating your own notebook.

Due date: next week
Homework

Discover and Visualize the Data to Gain Insights
• Visualizing Geographical Data
• Looking for Correlations
• Experimenting with Attribute Combinations

Prepare the Data for Machine Learning Algorithms
• Data Cleaning
• Handling Text and Categorical Attributes
• Custom Transformers
• Feature Scaling
• Transformation Pipelines
CPE316 - Introduction to Machine Learning

Week 3
LAB

Assoc. Prof. Dr. Caner ÖZCAN
Data Exploration and Cleaning

• Data science is a discipline that uses mathematical, statistical, and programming skills to analyze and process data. This field is used to explore large datasets, analyze them, and derive meaningful insights from the data.
• Data exploration and cleaning are essential steps in data science, and they are crucial in ensuring the accuracy of the data. Proper execution of these steps helps data scientists obtain more accurate results.
• In this lesson, we will explore the Titanic disaster training set.
• The Titanic dataset is a collection of data that represents the passengers who survived or perished in the sinking of the Titanic. The dataset contains various features of the passengers.
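As a starting sketch, assuming the Kaggle Titanic training file has been downloaded locally as train.csv (the file name and path are assumptions, not given in the lecture):

```python
# First look at the Titanic training set: load, inspect, basic cleaning.
# Assumes the Kaggle Titanic "train.csv" is in the working directory.

import pandas as pd

df = pd.read_csv("train.csv")

print(df.shape)            # rows and columns
print(df.head())           # first few passengers
print(df.isnull().sum())   # missing values per column (Age, Cabin, Embarked...)

# Basic cleaning: fill missing ages with the median, drop the sparse Cabin column.
df["Age"] = df["Age"].fillna(df["Age"].median())
df = df.drop(columns=["Cabin"])
```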
What is Seaborn?

• Seaborn is a data visualization library for the Python programming language.
• Seaborn is based on Matplotlib and aims to create more attractive and informative graphics using Matplotlib's basic plotting functions.
• Seaborn is particularly useful for statistical data visualization, providing a high-level interface to create various types of statistical graphics.
• It provides a range of functions for visualizing data in the form of heatmaps, line plots, scatter plots, bar plots, etc., and makes it easy to customize and style the plots.
Commonly Used Plots
1. Line plot - sns.lineplot()
2. Scatter plot - sns.scatterplot()
3. Bar plot - sns.barplot()
4. Box plot - sns.boxplot()
5. Violin plot - sns.violinplot()
6. Heat map - sns.heatmap()
7. Pair plot - sns.pairplot()
8. Facet grid - sns.FacetGrid()
9. Joint plot - sns.jointplot()
10. Count plot - sns.countplot()
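For example, two of these on the Titanic DataFrame loaded above; the df variable and the Survived/Pclass/Age column names follow the Kaggle dataset:

```python
# Two quick seaborn plots on the Titanic data loaded earlier.

import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(data=df, x="Pclass", hue="Survived")  # survival counts per class
plt.show()

sns.boxplot(data=df, x="Survived", y="Age")          # age distribution by outcome
plt.show()
```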
Visualization with Plotly

• Plotly provides data visualization in Python. It comes with different functions.
• It is an interactive graphics library.
• Plotly is a great charting alternative for your data exploration and analysis.
• It offers interactive dashboards that help you navigate and better understand your data.
• Before you start visualizing, you should install the dependencies:

pip install plotly
Visualization with Plotly

• Plotly has a pattern, and visualization is made according to this pattern:
  • trace
  • data
  • layout
  • fig
Visualization with Plotly

• When we are creating the trace part:
  • x = the column to be placed on the x-axis.
  • y = the column to be placed on the y-axis.
  • mode = the type of plot to be used.
  • name = the name of the trace.
  • marker = used with a dictionary; defines the color and transparency.
  • text = the information shown when hovering over a point on the plot.
Visualization with Plotly

• data = the list to which traces are added.
• layout = a dictionary that contains the following:
  • title = the title describing the data.
  • xaxis = a dictionary that contains the following:
    • title = the name of the x-axis.
    • ticklen = the length of the ticks on the x-axis.
    • zeroline = when false, the line crossing zero is disabled.
Visualization with Plotly

• fig = a figure containing the data and the layout.
• iplot() = plots the figure, which contains the data and layout.
• The most common visualization types with Plotly (an end-to-end sketch follows below):
  • 3D Scatter Plot
  • Bar Plot
  • Box Plot
  • Scatter Plot
  • Scatter Plot Matrix
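Putting the pattern together, a minimal sketch on the Titanic DataFrame from earlier; df and the Age/Fare/Name columns are assumptions carried over from the Kaggle dataset:

```python
# The trace -> data -> layout -> fig -> iplot() pattern, end to end.
# Assumes the Titanic DataFrame df with Age, Fare, and Name columns.

import plotly.graph_objs as go
from plotly.offline import iplot

trace = go.Scatter(
    x=df["Age"],
    y=df["Fare"],
    mode="markers",
    name="passengers",
    marker=dict(color="rgba(16, 112, 2, 0.6)"),
    text=df["Name"],                # shown when hovering over a point
)

data = [trace]
layout = dict(
    title="Fare vs. Age",
    xaxis=dict(title="Age", ticklen=5, zeroline=False),
)
fig = dict(data=data, layout=layout)
iplot(fig)  # renders the interactive figure (e.g., in a notebook)
```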
The Digit Dataset

• This dataset is made up of 1797 8x8 images.
• Each image is of a hand-written digit.
• In order to utilize an 8x8 image like this, we would first have to transform it into a feature vector of length 64.

https://scikit-learn.org/stable/auto_examples/index.html#dataset-examples
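A small sketch of that transformation with scikit-learn; load_digits already provides both forms:

```python
# The digits dataset: 1797 images of shape 8x8, flattened to 64 features.

from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8)  - raw images
print(digits.data.shape)    # (1797, 64)    - same images as feature vectors

# Flattening by hand gives the same result:
flattened = digits.images.reshape(len(digits.images), -1)
print((flattened == digits.data).all())  # True
```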
The Iris Dataset

• This dataset consists of the petal and sepal measurements of 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray.
• The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length, and Petal Width.
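Loading it is one call; a short sketch:

```python
# The iris dataset: 150 samples x 4 features, 3 classes.

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4)
print(iris.feature_names)   # sepal/petal length and width
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
```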
Plot Randomly Generated Classification Dataset

• This example plots several randomly generated classification datasets.
• For easy visualization, all datasets have 2 features, plotted on the x and y axes. The color of each point represents its class label.
• The first 4 plots use make_classification with different numbers of informative features, clusters per class, and classes.
• The final 2 plots use make_blobs and make_gaussian_quantiles.
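A condensed sketch of one such plot; the generator parameters are illustrative choices, not the ones used in the scikit-learn example:

```python
# One randomly generated 2-feature classification dataset, plotted.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, n_classes=3, random_state=1)

plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")  # color = class label
plt.title("make_classification: 2 informative features, 3 classes")
plt.show()
```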
Plot Randomly Generated Multilabel Dataset

• This illustrates the make_multilabel_classification dataset generator.
• Each sample consists of counts of two features (up to 50 in total), which are differently distributed in each of two classes.
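A matching sketch; the parameters mirror the description above (2 count features, 2 classes, up to 50 counts per sample) but are otherwise illustrative:

```python
# Randomly generated multilabel data: each sample is a vector of feature
# counts, and each sample may belong to several classes at once.

from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(
    n_samples=100, n_features=2, n_classes=2, n_labels=1,
    length=50, random_state=1)

print(X[:3])  # counts of the two features for the first samples
print(Y[:3])  # multilabel indicator rows, e.g. [1 0], [1 1]
```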
References
• Ethem Alpaydin, Introduction to Machine Learning, 3rd ed., The MIT Press, 2014.
• Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2011.
• Tom Mitchell, Machine Learning, McGraw Hill, 1997.
• Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd ed., New York: Prentice Hall, 2009.
• Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly Media, 2019.
• https://developers.google.com/machine-learning/crash-course/
• https://www.kaggle.com/kanncaa1/machine-learning-tutorial-for-beginners
• Yi-xin, https://medium.com/@yixinsun_56102/understanding-generalization-error-in-machine-learning-e6c03b203036
• Siora Photography, https://unsplash.com/?utm_source=medium&utm_medium=referral
• Sharoon Saxena, https://www.analyticsvidhya.com/blog/2020/02/underfitting-overfitting-best-fitting-machine-learning/
• https://scikit-learn.org/stable/auto_examples/index.html#dataset-examples
• https://en.wikipedia.org/wiki/Cross-validation_(statistics)
• https://vitalflux.com/hold-out-method-for-training-machine-learning-model/
