
MACHINE LEARNING
SUPERVISED LEARNING
LEC-7-8

Nazia Bibi
CLASSIFICATION

LEARNING A CLASS FROM EXAMPLES
• Class C of a "family car"
• Prediction: Is car x a family car?
• Knowledge extraction: What do people expect from a family car?
• Positive (+) and negative (–) examples
• Input representation: x1: price, x2: engine power

TRAINING SET X
For each car, the input is the vector

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}

and the label is

r = \begin{cases} 1 & \text{if } x \text{ is positive} \\ 0 & \text{if } x \text{ is negative} \end{cases}

For N training examples:

X = \{ x^t, r^t \}_{t=1}^{N}

CLASS C

(p_1 \le \text{price} \le p_2) \ \text{AND} \ (e_1 \le \text{engine power} \le e_2)

for suitable values of p_1, p_2, e_1 and e_2.

Class C is defined by a rectangle in the price-engine power space.
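A hypothesis of this form can be written directly as a predicate on the two features. A minimal sketch; the boundary values below are made up for illustration only:

```python
# Sketch: class C as an axis-aligned rectangle in the (price, engine power) plane.
p1, p2 = 10_000, 20_000      # assumed price range (hypothetical values)
e1, e2 = 100, 200            # assumed engine-power range (hypothetical values)

def is_family_car(price, engine_power):
    """Return 1 if the car falls inside the rectangle, else 0."""
    return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0

print(is_family_car(15_000, 150))  # 1: inside the rectangle
print(is_family_car(45_000, 300))  # 0: outside the rectangle
```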

CLASS C

(p_1 \le \text{price} \le p_2) \ \text{AND} \ (e_1 \le \text{engine power} \le e_2)

• This equation fixes the hypothesis class H: the set of axis-aligned rectangles.
• The learning algorithm finds a particular hypothesis h ∈ H that approximates C as closely as possible.
• The expert defines the hypothesis class; the algorithm finds its parameters.
• A "false positive" is when a good-quality item gets rejected.
• A "false negative" is when a poor-quality item gets accepted.
(Here the positive class corresponds to rejecting an item.)

HYPOTHESIS CLASS H
h(x) = \begin{cases} 1 & \text{if } h \text{ classifies } x \text{ as positive} \\ 0 & \text{if } h \text{ classifies } x \text{ as negative} \end{cases}

Training error: the predictions of h that do not match the required values in X:

E(h \mid X) = \sum_{t=1}^{N} 1\big( h(x^t) \neq r^t \big)
HYPOTHESIS CLASS H: HOW TO READ THE ERROR?

E(h \mid X) = \sum_{t=1}^{N} 1\big( h(x^t) \neq r^t \big)

The error of hypothesis h, given the training set X.
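As a sketch (not from the slides), the training error simply counts the examples on which h disagrees with the label; here h is any 0/1 classifier and X is a list of (x, r) pairs:

```python
# Sketch: empirical error E(h | X) = number of mismatches between h(x^t) and r^t.
def empirical_error(h, X):
    """h: function mapping a feature vector to 0/1; X: list of (x, r) pairs."""
    return sum(1 for x, r in X if h(x) != r)

# Hypothetical usage with the rectangle hypothesis from before:
# h = lambda x: is_family_car(x[0], x[1])
# X = [((15_000, 150), 1), ((45_000, 300), 0), ((12_000, 90), 1)]
# empirical_error(h, X)  -> 1 (the third example is misclassified)
```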

S, G, AND THE VERSION SPACE

• Most specific hypothesis, S
• Most general hypothesis, G
• Any h ∈ H between S and G is consistent with the training set; together these hypotheses make up the version space.
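For the rectangle hypothesis class, the most specific hypothesis S is the tightest axis-aligned rectangle that encloses all positive examples. A minimal sketch, with made-up data:

```python
# Sketch: S = tightest axis-aligned rectangle around the positive examples.
def most_specific_rectangle(X):
    """X: list of ((x1, x2), r) pairs; returns (p1, p2, e1, e2) describing S."""
    positives = [x for x, r in X if r == 1]
    xs1 = [x[0] for x in positives]
    xs2 = [x[1] for x in positives]
    return min(xs1), max(xs1), min(xs2), max(xs2)

# Hypothetical data: ((price, engine power), label)
X = [((12_000, 120), 1), ((18_000, 160), 1), ((40_000, 300), 0)]
print(most_specific_rectangle(X))  # (12000, 18000, 120, 160)
```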

MULTIPLE CLASSES, C_i, i = 1, ..., K

X = \{ x^t, r^t \}_{t=1}^{N}, \qquad r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}

Train K hypotheses h_i(x), i = 1, ..., K:

h_i(x^t) = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}
MULTIPLE CLASSES, C_i, i = 1, ..., K

• A K-class problem is treated as K two-class problems, as in the sketch below.
• Positive examples for the class "Luxury Sedan"; ALL the rest are negative examples.
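A sketch of the one-vs-rest idea: each class i gets its own two-class problem in which its examples are relabelled 1 and all others 0. `train_binary_classifier` is a hypothetical placeholder for any two-class learner (for example the rectangle-fitting procedure above):

```python
# Sketch: turn a K-class problem into K two-class problems (one-vs-rest).
def one_vs_rest(X, labels, K, train_binary_classifier):
    """X: feature vectors; labels: class indices in 0..K-1.
    train_binary_classifier is a hypothetical two-class learner."""
    hypotheses = []
    for i in range(K):
        r = [1 if y == i else 0 for y in labels]   # class i vs. the rest
        hypotheses.append(train_binary_classifier(X, r))
    return hypotheses                              # h_i(x), i = 0..K-1
```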

LINEAR REGRESSION

EXAMPLE

David Beckham: 1.83m Brad Pitt: 1.83m George Bush :1.81m


Victoria Beckham: 1.68m Angelina Jolie: 1.70m Laura Bush: ?

 To predict height of the wife in a couple, based on the husband’s height


 Response (out come or dependent) variable (Y):
 height of the wife
 Predictor (explanatory or independent) variable (X): 18
 height of the husband
WHAT IS LINEAR
• Remember this?

WHAT IS LINEAR
• A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

EXAMPLE
• Dataset giving the living areas and prices of 50 houses

EXAMPLE
• We can plot this data.
• Given data like this, how can we learn to predict the prices of other houses as a function of the size of their living areas?

NOTATIONS
• The "input" variables: x^(i) (living area in this example)
• The "output" or target variable that we are trying to predict: y^(i) (price)
• A pair (x^(i), y^(i)) is called a training example.
• A list of m training examples {(x^(i), y^(i)); i = 1, ..., m} is called a training set.
• X denotes the space of input values, and Y the space of output values.

REGRESSION
Given a training set, the goal is to learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.

CHOICE OF HYPOTHESIS
• Decision: how to represent the hypothesis h
• For linear regression, we assume that the hypothesis is linear:

h(x) = \theta_0 + \theta_1 x

HYPOTHESIS
• Generally we will have more than one input feature:

h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2

where x_1 = living area and x_2 = number of bedrooms.
HYPOTHESIS
• Hypothesis:

h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2

• To show the dependence on θ:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \quad \text{OR} \quad h(x \mid \theta) = \theta_0 + \theta_1 x_1 + \theta_2 x_2

• This is the price that the hypothesis predicts for a given house with living area x_1 and number of bedrooms x_2.
HYPOTHESIS
h ( x)   0  1 x1   2 x2
 For conciseness

Define x0  1 h ( x)   0 x0  1 x1   2 x2
2
h ( x )    i xi θs are called the parameters and
i 0 are real numbers
 For n features
n Job of learning alogrithm to
h ( x )    i xi   T X find or learn these
i 0
parameters
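A minimal sketch of the vectorized hypothesis: prepend x_0 = 1 and take the dot product with θ. The parameter values are made up for illustration:

```python
# Sketch: h_theta(x) = theta^T x with x_0 = 1 prepended.
def h(theta, x):
    """theta: parameters [theta_0, ..., theta_n]; x: features [x_1, ..., x_n]."""
    x = [1.0] + list(x)                          # define x_0 = 1
    return sum(t * xi for t, xi in zip(theta, x))

# Hypothetical parameters: base price 50, +0.1 per sq.ft, +20 per bedroom
print(h([50.0, 0.1, 20.0], [2104, 3]))           # predicted price: 320.4
```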
CHOOSING THE REGRESSION LINE

• Which of these lines to choose?

[Two scatter plots of Y vs. X, each with a different candidate line]
y  h ( x)   0  1 x

CHOSING THE REGRESSION LINE


The predicted value is:
yˆ i  h ( xi )   0  1 xi

Y The true value for xi is yi

yˆi
Error or residual yˆi  yi
yi

Consider this point xi

xi X 30
CHOOSING THE REGRESSION LINE

• How to choose this best-fit line? In other words, how to choose the θ's?
• Minimize the sum of the squared distances of the points y_i from the line over the m training examples (why squared?):

\min_\theta \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2
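This objective can be written directly in code. A minimal sketch for a single input feature, with made-up data:

```python
# Sketch: sum of squared errors J(theta) for simple linear regression.
def sse(theta0, theta1, xs, ys):
    """Sum over the m training examples of (h_theta(x_i) - y_i)^2."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: two candidate lines, the second fits better.
xs, ys = [1, 2, 3], [2, 4, 6]
print(sse(0.0, 1.0, xs, ys))   # 1 + 4 + 9 = 14
print(sse(0.0, 2.0, xs, ys))   # 0: this line passes through every point
```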
min J ( )

GRADIENT DESCENT
 Chose initial values of θ0 and θ1 and continue moving the
direction of steepest descente
J(θ)

32
θ0
θ1
GRADIENT DESCENT

• Choose initial values of θ_0 and θ_1 and keep moving in the direction of steepest descent, as in the sketch below.
• The step size is controlled by a parameter called the learning rate.
• The starting point is important.
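A minimal batch gradient-descent sketch for the two-parameter case; the data, learning rate and iteration count are illustrative only:

```python
# Sketch: batch gradient descent for J(theta) = sum_i (theta0 + theta1*x_i - y_i)^2.
def gradient_descent(xs, ys, lr=0.01, iters=1000):
    theta0, theta1 = 0.0, 0.0                            # starting point
    for _ in range(iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = 2 * sum(errors)                          # dJ/dtheta0
        grad1 = 2 * sum(e * x for e, x in zip(errors, xs))   # dJ/dtheta1
        theta0 -= lr * grad0                             # step against the gradient
        theta1 -= lr * grad1
    return theta0, theta1

print(gradient_descent([1, 2, 3], [2, 4, 6]))            # approaches (0, 2)
```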

MODEL SELECTION

g  x   w 1x  w 0
Life is not as simple as Linear

g  x   w 2x 2  w 1x  w 0
 Non-Linear Regression

Higher order polynomial


GENERALIZATION
• Generalization: how well a model performs on new data
• Overfitting: the chosen hypothesis is too complex.
  For example: fitting a 3rd-order polynomial to linear data.
• Underfitting: the chosen hypothesis is too simple.
  For example: fitting a line to a quadratic function.

CROSS VALIDATION

• To estimate generalization error, we need data unseen during training. We split the data as:
  • Training set (50%)
  • Validation set (25%)
  • Test (publication) set (25%)
• Choose the hypothesis that is best on the validation set: cross-validation.
CROSS VALIDATION
• Example: find the right order of polynomial in regression.
  • Use the training set to estimate the coefficients.
  • Calculate the errors on the validation set.
  • Choose the order with the least validation error (see the sketch below).
• Question: what is the expected error of the chosen model?
  • We can NOT use the validation error: the validation data has been used to choose the model, so it is effectively part of training.
  • Use the TEST data set.
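A sketch of this procedure using numpy's polynomial fitting; the split sizes follow the slide (50/25/25), but the data are made up:

```python
# Sketch: pick the polynomial order with the lowest validation error,
# then report the error of the chosen model on the held-out test set.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)   # truly linear data

# 50% train / 25% validation / 25% test, as on the slide.
x_tr, y_tr = x[:100], y[:100]
x_val, y_val = x[100:150], y[100:150]
x_te, y_te = x[150:], y[150:]

def val_error(order):
    coeffs = np.polyfit(x_tr, y_tr, order)           # fit on the training set
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

best_order = min(range(1, 6), key=val_error)          # choose on validation error
coeffs = np.polyfit(x_tr, y_tr, best_order)
test_error = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
print(best_order, test_error)                         # report error on the TEST data
```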

SUMMARY
• Model:

h_\theta(x) \ \text{or} \ h(x \mid \theta)

• Loss function:

E(\theta \mid x) = J(\theta) = \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2

• Optimization:

\min_\theta E(\theta \mid x)
COVARIANCE
\text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1}

• cov(X, Y) > 0: X and Y are positively correlated
• cov(X, Y) < 0: X and Y are inversely correlated
• cov(X, Y) = 0: X and Y are uncorrelated (independence implies zero covariance, but zero covariance does not imply independence)
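A minimal sketch of this sample covariance (with the n − 1 denominator), using made-up data:

```python
# Sketch: sample covariance, cov(x, y) = sum((x_i - x_bar)(y_i - y_bar)) / (n - 1).
def sample_cov(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

print(sample_cov([1, 2, 3], [2, 4, 6]))   # 2.0 (positively correlated)
```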

CORRELATION COEFFICIENT
• Pearson's correlation coefficient is standardized covariance:

r = \frac{\text{cov}(x, y)}{\sqrt{\text{var}(x)\,\text{var}(y)}}
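Building on the covariance sketch above, the correlation coefficient just rescales the covariance by the standard deviations (square roots of the sample variances):

```python
# Sketch: Pearson's r = cov(x, y) / sqrt(var(x) * var(y)), with sample statistics.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    return cov / sqrt(var_x * var_y)

print(pearson_r([1, 2, 3], [2, 4, 6]))    # 1.0: a perfect positive linear relationship
```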

CORRELATION COEFFICIENT
• Measures the relative strength of the linear relationship between two variables
• Unit-less
• Ranges between –1 and +1
• The closer to –1, the stronger the negative linear relationship
• The closer to +1, the stronger the positive linear relationship
• The closer to 0, the weaker any linear relationship

CORRELATION COEFFICIENT
[Four scatter plots of Y vs. X illustrating r = -0.8, r = -0.6, r = +0.8 and r = +0.2]
CORRELATION COEFFICIENT
[Scatter plots of Y vs. X contrasting strong relationships with weak relationships]
CORRELATION ANALYSIS

• Correlation coefficient (also called Pearson's product-moment coefficient):

r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-products.

COVARIANCE
• Covariance is similar to correlation:

\text{Cov}(A, B) = E\big( (A - \bar{A})(B - \bar{B}) \big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}

• Correlation coefficient:

r_{A,B} = \frac{\text{Cov}(A, B)}{\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective mean or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
CO-VARIANCE: AN EXAMPLE

• It can be simplified in computation as

\text{Cov}(A, B) = E(A \cdot B) - \bar{A}\bar{B} = \frac{\sum_{i=1}^{n} a_i b_i}{n} - \bar{A}\bar{B}
CO-VARIANCE: AN EXAMPLE
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?

• E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
• Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
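The arithmetic above can be checked with a few lines of Python (a sketch using the simplified formula from the previous slide):

```python
# Sketch: verify the stock example, Cov(A, B) = (sum of a_i*b_i)/n - mean(A)*mean(B).
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_A = sum(A) / n                              # 4.0
mean_B = sum(B) / n                              # 9.6
cov_AB = sum(a * b for a, b in zip(A, B)) / n - mean_A * mean_B
print(cov_AB)                                    # 4.0 (up to float rounding): the stocks rise together
```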
ACKNOWLEDGEMENTS
• Dr. Imran Siddiqi, Bahria University, Islamabad
• Machine Learning, Andrew Ng, Stanford University
• Machine Intelligence, Dr. M. Hanif, UET, Lahore
• Lecture Slides, Introduction to Machine Learning, E. Alpaydin, MIT Press

