Introduction to Machine Learning

Ibrahim Sabek
Computer and Systems Engineering Department, Faculty of Engineering, Alexandria University, Egypt

1 / 33

Agenda
1. Machine learning overview and applications
2. Supervised vs. Unsupervised learning
3. Generative vs. Discriminative models
4. Overview of Classification
5. The big picture
6. Bayesian inference
7. Summary
8. Feedback
2 / 33

Machine learning overview and applications

What is Machine Learning (ML)?
Definition: algorithms for inferring unknowns from knowns.
What do we mean by "inferring"? How do we get unknowns from knowns?

ML applications
- Spam detection
- Handwriting recognition
- Speech recognition
- Netflix recommendation system

Classes of ML models
- Supervised vs. Unsupervised
- Generative vs. Discriminative

5 / 33

Supervised vs. Unsupervised learning

Supervised vs. Unsupervised
Supervised: Given (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), choose a function f such that f(x_i) = y_i
- x_i ∈ R^2: the data points
- y_i: a class label or value
- Classification: y_i ∈ {finite set}
- Regression: y_i ∈ R
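To make the two supervised settings concrete, here is a minimal sketch added to these notes; the toy data and the scikit-learn model choices are assumptions for illustration, not part of the slides:

    # Minimal sketch: supervised learning as "choose f with f(x_i) ≈ y_i".
    # The toy data and the model choices are illustrative assumptions.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))             # x_i in R^2

    # Classification: y_i in a finite set {0, 1}
    y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
    print(clf.predict([[0.5, 0.5]]))          # predicted class label

    # Regression: y_i in R
    y_reg = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)
    reg = LinearRegression().fit(X, y_reg)
    print(reg.predict([[0.5, 0.5]]))          # predicted real value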

7 / 33

Supervised vs. Unsupervised learning

Supervised vs. Unsupervised
Unsupervised: Given (x_1, x_2, ..., x_n), find patterns in the data.
- x_i ∈ R^2: the data points
- Clustering
- Density estimation
- Dimensionality reduction
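As an added illustration of the clustering task (the toy data and the use of k-means via scikit-learn are assumptions, not from the slides):

    # Minimal unsupervised-learning sketch: cluster unlabeled points with k-means.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Two blobs in R^2; no labels y_i are given
    X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
                   rng.normal(loc=5.0, size=(50, 2))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)   # discovered cluster centers
    print(km.labels_[:10])       # cluster assignments for the first 10 points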

11 / 33

Supervised vs. Unsupervised learning

Variations on Supervised and Unsupervised
Semi-supervised: Given (x_1, y_1), (x_2, y_2), ..., (x_k, y_k), x_{k+1}, x_{k+2}, ..., x_n, predict y_{k+1}, y_{k+2}, ..., y_n
Active learning: the learner chooses which unlabeled points to query labels for.

16 / 33

Supervised vs. Unsupervised learning

Variations on Supervised and Unsupervised
Decision theory: measure the prediction performance on unlabeled data via a loss function.
Reinforcement learning:
- take actions that maximize rewards (minimize losses)
- maximize the overall lifetime reward

18 / 33

Generative vs. Discriminative models

Generative vs. Discriminative models
Given (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), and a new point (x, y)

Discriminative:
- estimate p(y = 1|x) and p(y = 0|x) for y ∈ {0, 1}

Generative:
- estimate the joint distribution p(x, y)
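As an added sketch of this contrast (the specific models are assumptions chosen for illustration): logistic regression estimates p(y|x) directly, while Gaussian naive Bayes estimates p(x|y) and p(y) and applies Bayes' rule.

    # Discriminative vs. generative on the same toy data (illustrative assumption).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    disc = LogisticRegression().fit(X, y)     # models p(y|x) directly
    gen = GaussianNB().fit(X, y)              # models p(x|y) and p(y)

    x_new = np.array([[0.3, -0.1]])
    print(disc.predict_proba(x_new))          # [p(y=0|x), p(y=1|x)]
    print(gen.predict_proba(x_new))           # p(y|x) obtained via Bayes' rule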

21 / 33

Overview of Classification

k-Nearest Neighbor classification (kNN)
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} and a new point (x, y), where x_i ∈ R, y_i ∈ {0, 1}

- Dissimilarity metric: d(x, x') = ||x − x'||_2 (used directly for k = 1)
- Probabilistic interpretation: for a fixed k, p(y|x, D) = fraction of points x_i in N_k(x) such that y_i = y
- Prediction: ŷ = argmax_y p(y|x, D)
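A minimal from-scratch sketch of this interpretation (the toy data and the choice of k are assumptions for illustration):

    # kNN: predict y_hat = argmax_y p(y|x, D), where p(y|x, D) is the fraction
    # of the k nearest neighbours with label y.
    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        dists = np.linalg.norm(X_train - x_new, axis=1)   # d(x, x') = ||x - x'||_2
        nearest = np.argsort(dists)[:k]                   # indices of N_k(x)
        votes = Counter(y_train[nearest].tolist())        # label counts in N_k(x)
        return votes.most_common(1)[0][0]                 # argmax_y p(y|x, D)

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 2))
    y_train = (X_train[:, 0] > 0).astype(int)
    print(knn_predict(X_train, y_train, np.array([0.4, -0.2]), k=5))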
22 / 33

Overview of Classification

Classification trees (CART)
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} and a new x, where x_i ∈ R, y_i ∈ {0, 1}
- Build a binary tree
- Minimize the error in each leaf
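A minimal sketch of a CART-style classifier (the scikit-learn usage, toy data, and depth limit are assumptions for illustration):

    # Classification tree: binary splits chosen to reduce error in each leaf.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)   # labels from an axis-aligned rule

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(tree.predict([[0.5, 0.5], [-0.5, 0.5]]))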

23 / 33

Overview of Classification

Regression trees (CART)
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} and a new x, where x_i ∈ R, y_i ∈ R
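The same idea as a brief regression sketch (library usage and toy data are assumptions); each leaf predicts a constant value such as the mean of its training targets:

    # Regression tree: piecewise-constant fit of a noisy 1-D function.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 2 * np.pi, size=(200, 1)), axis=0)
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    reg_tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
    print(reg_tree.predict([[1.0], [4.0]]))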

24 / 33

Overview of Classification

Bootstrap aggregation (Bagging)
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} drawn iid from P, and a new x where x_i ∈ R, y_i ∈ R; we need to find its y value.
- Intuition: averaging makes your prediction closer to the true label.
- Draw different training datasets D_i by sampling points (x_k, y_k) iid from uniform(D), i.e., bootstrap resampling with replacement.
- The final label y is the average of the labels predicted from the different datasets.
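A from-scratch sketch of the procedure (the number of bootstrap rounds B, the base model, and the toy data are assumptions for illustration):

    # Bagging: resample D with replacement, fit a model on each resample,
    # and average the predictions for a new x.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 2 * np.pi, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

    B = 25
    x_new = np.array([[1.5]])
    preds = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))    # sample uniform(D) with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(tree.predict(x_new)[0])

    print(np.mean(preds))   # bagged prediction = average over the B models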

26 / 33

Overview of Classification

Random forests
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} where x_i ∈ R, y_i ∈ R
For i = 1, ..., B:
- Choose a bootstrap sample D_i from D
- Construct a tree T_i using D_i such that, at each node, a random subset of the features is chosen and only those features are considered for splitting.

Given x , take majority vote (for classification) or average (for regression).
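A brief usage sketch with scikit-learn (the toy data are an assumption for illustration): the number of trees B corresponds to n_estimators and the per-node feature subset to max_features.

    # Random forest: bagging over trees plus a random feature subset at each split.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    rf = RandomForestClassifier(n_estimators=50, max_features="sqrt").fit(X, y)
    print(rf.predict(X[:5]))   # majority vote over the trees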

27 / 33

The big picture

The big picture
Given the expected loss E[L(y, f(x))] and D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} where x_i ∈ R, y_i ∈ R, we want to estimate p(y|x).

Discriminative: estimate p(y|x) directly using D.
- kNN, trees, SVM

Generative: estimate p(x, y) directly using D, and then

    p(y|x) = p(x, y) / p(x),

where we also have p(x, y) = p(x|y) p(y).

Params/latent variables θ: by including parameters, we have p(x, y|θ)
- For a discrete parameter space: p(y|x, D) = Σ_θ p(y|x, D, θ) p(θ|x, D)
- p(y|x, D, θ) is nice; p(θ|x, D) is nasty (the posterior distribution on θ)
- The summation (or integration in the continuous case) is nasty and often intractable
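A toy numerical illustration of the discrete-θ sum (this coin-flip example is an assumption added here, not from the slides):

    # Posterior predictive for a Bernoulli model on a discrete grid of θ:
    # p(y=1|D) = sum_θ p(y=1|θ) p(θ|D).
    import numpy as np

    thetas = np.linspace(0.01, 0.99, 99)           # discrete grid of parameters
    prior = np.ones_like(thetas) / len(thetas)     # uniform prior p(θ)

    D = np.array([1, 0, 1, 1, 1, 0, 1])            # observed coin flips
    likelihood = thetas ** D.sum() * (1 - thetas) ** (len(D) - D.sum())

    posterior = likelihood * prior
    posterior /= posterior.sum()                   # p(θ|D): the "nasty" term, tractable here

    p_y1 = np.sum(thetas * posterior)              # sum_θ p(y=1|θ) p(θ|D)
    print(p_y1)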
28 / 33

The big picture

The big picture
p(y|x, D) = Σ_θ p(y|x, D, θ) p(θ|x, D)

Exact inference:
- Multivariate Gaussian
- Graphical models

Point estimate of θ:
- Maximum Likelihood Estimation (MLE)
- Maximum A Posteriori (MAP): θ̂ = argmax_θ p(θ|x, D)

Deterministic approximation:
- Laplace approximation
- Variational methods

Stochastic approximation:
- Importance sampling
- Gibbs sampling
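Continuing the coin-flip toy example from the previous slide (an added illustration with an assumed prior), the MLE maximizes the likelihood while the MAP estimate maximizes the posterior:

    # Point estimates on the same Bernoulli grid: MLE vs. MAP.
    import numpy as np

    thetas = np.linspace(0.01, 0.99, 99)
    D = np.array([1, 0, 1, 1, 1, 0, 1])
    likelihood = thetas ** D.sum() * (1 - thetas) ** (len(D) - D.sum())

    prior = thetas * (1 - thetas)                  # assumed Beta(2, 2)-shaped prior (unnormalized)
    posterior = likelihood * prior                 # proportional to p(θ|D)

    theta_mle = thetas[np.argmax(likelihood)]      # argmax_θ p(D|θ)
    theta_map = thetas[np.argmax(posterior)]       # argmax_θ p(θ|D)
    print(theta_mle, theta_map)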
29 / 33

Bayesian inference

Bayesian inference
"Put distributions on everything, and then use the rules of probability to infer values."

Aspects of Bayesian inference:
- Priors: assume a prior distribution p(θ)
- Procedures: minimize expected loss (averaging over θ)

Pros:
- Directly answers questions
- Avoids overfitting

Cons:
- Must assume a prior
- Exact computation can be intractable

30 / 33

Bayesian inference

Directed graphical models
Also called "Bayesian networks" or "conditional independence diagrams".
Why? Tractable inference.

- Factorization of the probabilistic model
- Notational device
- Visualization for inference algorithms

Example of thinking graphically about p(a, b, c):

    p(a, b, c) = p(c|a, b) p(a, b) = p(c|a, b) p(b|a) p(a)
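A small added sketch of this factorization with binary variables and made-up conditional probability tables (all numbers are assumptions for illustration):

    # Chain-rule factorization p(a, b, c) = p(c|a, b) p(b|a) p(a).
    p_a = {1: 0.3, 0: 0.7}
    p_b_given_a = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8}   # keys: (b, a)
    p_c_given_ab = {(1, a, b): 0.5 + 0.2 * a + 0.1 * b for a in (0, 1) for b in (0, 1)}
    p_c_given_ab.update({(0, a, b): 1.0 - p_c_given_ab[(1, a, b)]
                         for a in (0, 1) for b in (0, 1)})               # keys: (c, a, b)

    def joint(a, b, c):
        return p_c_given_ab[(c, a, b)] * p_b_given_a[(b, a)] * p_a[a]

    # Sanity check: the joint sums to 1 over all assignments of (a, b, c)
    print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))
    print(joint(1, 0, 1))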

31 / 33

Summary

Summary
Machine learning is an essential part of our everyday lives. It is a broad field, and this session has only scratched the surface.

32 / 33

Feedback

Feedback
Your feedback is welcome at alex.acm.org/feedback/machine/

33 / 33
