# Introduction to Machine Learning

Ibrahim Sabek
Computer and Systems Engineering Department, Faculty of Engineering, Alexandria University, Egypt


Agenda
1. Machine learning overview and applications
2. Supervised vs. Unsupervised learning
3. Generative vs. Discriminative models
4. Overview of Classification
5. The big picture
6. Bayesian inference
7. Summary
8. Feedback

Machine learning overview and applications

What is Machine Learning (ML)?
Definition: algorithms for inferring unknowns from knowns.
What do we mean by "inferring"? How do we get unknowns from knowns?

ML applications
- Spam detection
- Handwriting recognition
- Speech recognition
- Netflix's recommendation system

Classes of ML models
- Supervised vs. Unsupervised
- Generative vs. Discriminative

Supervised vs. Unsupervised learning

Supervised vs. Unsupervised
Supervised: given (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), choose a function f such that f(x_i) = y_i.
- x_i ∈ R^2: data points
- y_i: class label or value

- Classification: y_i ∈ {finite set}
- Regression: y_i ∈ R
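To make the supervised regression case concrete, here is a minimal sketch in plain Python: a least-squares fit of f(x) = a·x + b. The toy dataset and the variable names are illustrative, not from the slides.

```python
# Least-squares fit of f(x) = a*x + b on a toy supervised dataset.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1

n = len(xs)
mx = sum(xs) / n  # mean of x
my = sum(ys) / n  # mean of y

# Closed-form least-squares estimates for slope and intercept.
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def f(x):
    """Learned function: f(x_i) should reproduce y_i."""
    return a * x + b

print(a, b, f(4.0))  # -> 2.0 1.0 9.0
```

On this noise-free data the fit recovers the generating line exactly; with noisy labels it would return the best line in the squared-error sense.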


Unsupervised: given (x_1, x_2, ..., x_n), find patterns in the data.
- x_i ∈ R^2: data points

- Clustering
- Density estimation
- Dimensionality reduction
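As an illustrative sketch of clustering, here is Lloyd's k-means algorithm in plain Python on two well-separated toy groups; the data and initialization are ours, not from the slides.

```python
# Lloyd's k-means on toy 2D points: two tight groups, k = 2.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = [(0, 0), (10, 10)]  # simple initialization

def sqdist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

for _ in range(10):  # a few Lloyd iterations are enough here
    # Assignment step: each point joins its nearest center's cluster.
    labels = [min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
              for p in pts]
    # Update step: each center moves to the mean of its assigned points.
    for j in range(len(centers)):
        members = [p for p, l in zip(pts, labels) if l == j]
        if members:
            centers[j] = (sum(p[0] for p in members) / len(members),
                          sum(p[1] for p in members) / len(members))

print(labels)  # the two tight groups end up in separate clusters
```

Note that k-means only sees the x_i: there are no labels y_i anywhere, which is exactly what makes this unsupervised.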


Variations on Supervised and Unsupervised
Semi-supervised: given (x_1, y_1), (x_2, y_2), ..., (x_k, y_k) and unlabeled x_{k+1}, x_{k+2}, ..., x_n, predict y_{k+1}, y_{k+2}, ..., y_n.

Active learning: the learner chooses which points x_i to query labels for.

Decision theory: measuring prediction performance on unlabeled data.

Reinforcement learning:
- maximize rewards (minimize losses) through actions
- maximize the overall lifetime reward

Generative vs. Discriminative models
Given (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) and a new point (x, y):

Discriminative: estimate p(y = 1 | x) and p(y = 0 | x) for y ∈ {0, 1}.

Generative: estimate the joint distribution p(x, y).
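As a sketch of the generative route, the snippet below fits p(y) and a Gaussian class-conditional p(x | y) per class on toy 1D data, then recovers p(y | x) by Bayes' rule. The data and function names are illustrative.

```python
import math

# Toy 1D training data: class 0 centered near 0, class 1 near 5.
data = [(-1.0, 0), (0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]

def fit(data):
    """Estimate the prior p(y) and a Gaussian p(x | y) for each class."""
    params = {}
    for c in (0, 1):
        xs = [x for x, y in data if y == c]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[c] = (len(xs) / len(data), mean, var)  # (prior, mean, variance)
    return params

def posterior(params, x):
    """p(y | x) = p(x | y) p(y) / p(x), with p(x) = sum over y of the joint."""
    def joint(c):
        prior, mean, var = params[c]
        lik = math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        return prior * lik
    z = joint(0) + joint(1)  # p(x), by marginalizing the joint over y
    return {c: joint(c) / z for c in (0, 1)}

post = posterior(fit(data), 5.0)
print(post[1])  # close to 1: x = 5 looks like class 1
```

Because the model estimates the full joint p(x, y), it can also be sampled from to generate new x, which a purely discriminative model cannot do.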

Overview of Classification

k-Nearest Neighbor classification (kNN)
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} and a new point x, where x_i ∈ R, y_i ∈ {0, 1}:
- Dissimilarity metric: d(x, x') = ||x − x'||_2 (nearest neighbor when k = 1)
- Probabilistic interpretation: for fixed k, p(y | x, D) = fraction of points x_i in N_k(x) such that y_i = y, and ŷ = argmax_y p(y | x, D)
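The two bullets above can be sketched directly in plain Python: sort by Euclidean distance, keep the k nearest, and take the majority vote (the argmax over the label fractions). The toy dataset is ours.

```python
from collections import Counter
import math

# Toy labeled data: class 0 near the origin, class 1 near (5, 5).
D = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0),
     ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]

def knn_predict(D, x, k=3):
    """Majority vote over the k nearest neighbors under Euclidean distance."""
    neighbors = sorted(D, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(y for _, y in neighbors)  # fraction of each label in N_k(x)
    return votes.most_common(1)[0][0]         # argmax_y p(y | x, D)

print(knn_predict(D, (0.5, 0.5)))  # -> 0
print(knn_predict(D, (5.5, 5.5)))  # -> 1
```

kNN has no training phase at all: the "model" is the dataset itself, and all work happens at query time.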

Classification trees (CART)
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} and a new x, where x_i ∈ R, y_i ∈ {0, 1}:
- Build a binary tree by recursively splitting the data
- Choose splits to minimize the classification error in each leaf
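A depth-1 tree (a "decision stump") already shows the split-selection idea: try each candidate threshold, let each leaf predict its majority label, and keep the threshold with the lowest total error. The data below is an assumed toy example.

```python
from collections import Counter

# Toy 1D training data: labels flip between x = 2 and x = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]

def fit_stump(xs, ys):
    """Depth-1 classification tree: pick the threshold minimizing total
    misclassification, with each leaf predicting its majority label."""
    best = None
    cand = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(cand, cand[1:])]
    for t in thresholds:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        pl = Counter(left).most_common(1)[0][0]   # majority label, left leaf
        pr = Counter(right).most_common(1)[0][0]  # majority label, right leaf
        err = sum(y != pl for y in left) + sum(y != pr for y in right)
        if best is None or err < best[0]:
            best = (err, t, pl, pr)
    return best[1:]  # (threshold, left_label, right_label)

t, pl, pr = fit_stump(xs, ys)
predict = lambda x: pl if x <= t else pr
print(t, predict(1.5), predict(3.5))  # -> 2.5 0 1
```

A full CART tree applies the same search recursively inside each leaf until the leaves are pure enough or too small to split.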

Regression trees (CART)
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} and a new x, where x_i ∈ R, y_i ∈ R. The same tree construction applies, with each leaf predicting a real value (e.g., the mean of its y_i).

Bootstrap aggregation (Bagging)
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} drawn i.i.d. from P, and a new x where x_i ∈ R, y_i ∈ R, we need to find its y value.

- Intuition: averaging makes your prediction closer to the true label.
- Build different training datasets by drawing (x_k, y_k) i.i.d. from uniform(D), i.e., sampling D with replacement.
- The final label y is the average of the labels predicted from the different datasets.
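The three steps above can be sketched as follows; for brevity the base learner is a 1-nearest-neighbor regressor, and the dataset, parameter B, and seed are all assumed for illustration.

```python
import random

# Toy regression data with noise-free labels y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 2.0, 4.0, 6.0, 8.0]

def nn_predict(sample, x):
    """Base learner: 1-nearest-neighbor regression on a bootstrap sample."""
    return min(sample, key=lambda p: abs(p[0] - x))[1]

def bagged_predict(xs, ys, x, B=50, seed=0):
    """Average the base learner's predictions over B bootstrap resamples."""
    rng = random.Random(seed)
    D = list(zip(xs, ys))
    preds = []
    for _ in range(B):
        sample = [rng.choice(D) for _ in D]  # draw from uniform(D) with replacement
        preds.append(nn_predict(sample, x))
    return sum(preds) / B

pred = bagged_predict(xs, ys, 2.2)
print(pred)  # a smoothed estimate near the underlying value 4.4
```

Each bootstrap sample omits some points, so individual predictions vary; averaging them reduces that variance, which is the whole point of bagging.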

Random forests
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} where x_i ∈ R, y_i ∈ R, for i = 1, ..., B:
- Choose a bootstrap sample D_i from D
- Construct tree T_i using D_i such that, at each node, a random subset of features is chosen and only those features are considered for splitting

Given x, take the majority vote (for classification) or the average (for regression) over T_1, ..., T_B.
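A minimal sketch of this recipe, assuming depth-1 trees (stumps) and one randomly chosen feature per tree; the 2D toy data, B, and seed are illustrative choices of ours.

```python
import random
from collections import Counter

# Toy 2D classification data: either feature separates the classes.
D = [((0, 0), 0), ((1, 1), 0), ((0, 1), 0), ((1, 0), 0),
     ((4, 4), 1), ((5, 5), 1), ((4, 5), 1), ((5, 4), 1)]

def fit_stump(sample, feat):
    """Depth-1 tree on one feature: best threshold, majority label per side."""
    best = None
    vals = sorted({x[feat] for x, _ in sample})
    thresholds = [(a + b) / 2 for a, b in zip(vals, vals[1:])] or [vals[0]]
    for t in thresholds:
        left = [y for x, y in sample if x[feat] <= t]
        right = [y for x, y in sample if x[feat] > t]
        pl = Counter(left).most_common(1)[0][0] if left else Counter(right).most_common(1)[0][0]
        pr = Counter(right).most_common(1)[0][0] if right else pl
        err = sum(y != pl for y in left) + sum(y != pr for y in right)
        if best is None or err < best[0]:
            best = (err, feat, t, pl, pr)
    return best[1:]  # (feature, threshold, left_label, right_label)

def fit_forest(D, B=25, seed=0):
    """B stumps, each on a bootstrap sample D_i and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(B):
        sample = [rng.choice(D) for _ in D]  # bootstrap sample D_i
        feat = rng.randrange(2)              # random feature considered for splitting
        forest.append(fit_stump(sample, feat))
    return forest

def predict(forest, x):
    """Majority vote across the trees (classification)."""
    votes = Counter(pl if x[f] <= t else pr for f, t, pl, pr in forest)
    return votes.most_common(1)[0][0]

forest = fit_forest(D)
print(predict(forest, (0.5, 0.5)), predict(forest, (4.5, 4.5)))
```

The per-tree feature restriction decorrelates the trees, so the vote averages over more diverse models than plain bagging would.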

The big picture
Given an expected loss E[L(y, f(x))] and D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} where x_i ∈ R, y_i ∈ R, we want to estimate p(y | x).
- Discriminative: estimate p(y | x) directly from D (kNN, trees, SVM).
- Generative: estimate p(x, y) directly from D, and then p(y | x) = p(x, y) / p(x); also, p(x, y) = p(x | y) p(y).
- Parameters/latent variables θ: by including parameters, we have p(x, y | θ). For a discrete parameter space,
  p(y | x, D) = Σ_θ p(y | x, D, θ) p(θ | x, D)
  where p(y | x, D, θ) is nice, p(θ | x, D) (the posterior distribution of θ) is nasty, and the summation (or integration, for a continuous space) is nasty and often intractable.
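For a small discrete parameter space, the sum over θ is perfectly tractable. As an illustrative example (numbers ours), take a coin with unknown bias θ on a three-point grid, observe some flips, and compute the predictive probability of heads:

```python
# Discrete-θ Bayesian prediction for a coin:
# p(heads | D) = Σ_θ p(heads | θ) p(θ | D), with p(θ | D) ∝ p(D | θ) p(θ).
thetas = [0.2, 0.5, 0.8]                      # candidate coin biases
prior = {t: 1 / len(thetas) for t in thetas}  # uniform prior p(θ)

heads, tails = 8, 2  # the observed data D

# Posterior p(θ | D) by Bayes' rule (the binomial coefficient cancels).
unnorm = {t: (t ** heads) * ((1 - t) ** tails) * prior[t] for t in thetas}
z = sum(unnorm.values())
posterior = {t: u / z for t, u in unnorm.items()}

# Posterior predictive: the sum over θ from the slide.
p_heads = sum(t * posterior[t] for t in thetas)
print(p_heads)  # ≈ 0.76, dominated by θ = 0.8
```

The intractability arises when θ lives in a high-dimensional continuous space, where this sum becomes an integral with no closed form.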

Approaches to computing p(y | x, D) = Σ_θ p(y | x, D, θ) p(θ | x, D):
- Exact inference: multivariate Gaussians, graphical models
- Point estimate of θ: Maximum Likelihood Estimation (MLE), Maximum A Posteriori (MAP) with θ_est = argmax_θ p(θ | x, D)
- Deterministic approximation: Laplace approximation, variational methods
- Stochastic approximation: importance sampling, Gibbs sampling

Bayesian inference
"Put distributions on everything, and then use the rules of probability to infer values."
Aspects of Bayesian inference:
- Priors: assuming a prior distribution p(θ)
- Procedures: minimizing the expected loss (averaging over θ)
Pros:
- Uncertainty is handled in a principled way (everything is a distribution)
Cons:
- Must assume a prior
- Exact computation can be intractable

Directed graphical models
"Bayesian networks" or "conditional independence diagrams". Why? Tractable inference.
- Factorization of the probabilistic model
- Notational device
- Visualization for inference algorithms
Example of thinking graphically about p(a, b, c):
p(a, b, c) = p(c | a, b) p(a, b) = p(c | a, b) p(b | a) p(a)
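The chain-rule factorization can be checked numerically on any joint table. Below, an arbitrary (assumed) joint over three binary variables is built, the conditionals are computed by marginalization, and their product is compared term by term against the joint:

```python
import itertools

# An arbitrary positive table over three binary variables, normalized to a joint.
weights = {(a, b, c): 1 + a + 2 * b + 3 * c + a * c
           for a, b, c in itertools.product((0, 1), repeat=3)}
z = sum(weights.values())
p = {abc: w / z for abc, w in weights.items()}

def marginal(fixed):
    """Sum the joint p(a, b, c) over the variables not in `fixed`."""
    return sum(prob for (a, b, c), prob in p.items()
               if all(dict(a=a, b=b, c=c)[k] == v for k, v in fixed.items()))

# Check p(a, b, c) = p(c | a, b) p(b | a) p(a) at every point.
for a, b, c in itertools.product((0, 1), repeat=3):
    p_a = marginal({'a': a})
    p_b_given_a = marginal({'a': a, 'b': b}) / p_a
    p_c_given_ab = p[(a, b, c)] / marginal({'a': a, 'b': b})
    assert abs(p[(a, b, c)] - p_c_given_ab * p_b_given_a * p_a) < 1e-12

print("chain-rule factorization verified")
```

This holds for any joint; a graphical model becomes useful when conditional independences let some factors drop their parents, e.g. p(c | a, b) simplifying to p(c | b).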

Summary
- Machine learning is an essential field for everyday life.
- Machine learning is a broad world; we have only scratched the surface in this session :D

Feedback
Your feedback is welcome at alex.acm.org/feedback/machine/