
Machine Learning

Linear Regression

Na Lu
Xi’an Jiaotong University
Machine Learning

• Machine learning: the field of study that gives computers the ability to learn without being explicitly programmed.
  – Arthur Samuel (1959)
Machine Learning

• Well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
  – Tom Mitchell (1998)
Question
Three types of learning

• Supervised learning
  – Learn to predict an output when given an input vector.
• Reinforcement learning
  – Learn to select an action to maximize payoff.
• Unsupervised learning
  – Discover a good internal representation of the input and the structure hidden in it.
Two types of supervised learning
• Each training case consists of an input vector x and a
target output y.
• Regression: The target output is a real number or a
whole vector of real numbers.
– The price of a stock.
– The temperature during a day.
– Aim: to get as close as you can to the real number.
• Classification: The target output is a class label.
– The simplest case is a choice between 1 and 0.
– Facial identities with multiple labels.
– Aim: to classify the input into the correct category.
How does supervised learning work?

• Start by choosing a model class: y = f(x; W)
  – A model class f is a family of mappings from the input vector x to the predicted output y, parameterized by W.
• The goal of learning is to reduce the discrepancy between the model's predicted output and the actual output on the given input-output pairs.
  – The squared L2 norm (least squares) is a widely used measure of the discrepancy for regression.
  – For classification, the L2 norm can be used, but other measures are usually more suitable.
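As a minimal sketch (not from the slides), the following Octave lines measure the squared-error (L2) discrepancy for a hypothetical linear model class f(x; W) = W·x; all values are purely illustrative:

x = [1 2 3 4];                        % illustrative inputs
y = [2.1 3.9 6.2 7.8];                % illustrative actual outputs
W = 2;                                % a candidate parameter for the model class f(x; W) = W*x
y_pred = W * x;                       % model predictions
discrepancy = sum((y_pred - y).^2);   % squared L2 norm of the prediction error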
Reinforcement learning
• The output is an action or a sequence of actions, and the only supervisory signal is an occasional scalar reward.
  – The goal of action selection is to maximize the expected future reward.
  – A discount factor is employed to incorporate delayed rewards.
• Difficulties in reinforcement learning:
  – The rewards are delayed, so it is hard to know where we went wrong.
  – A scalar reward does not supply much information.
  – Far fewer parameters can be learned with reinforcement learning than with supervised or unsupervised learning.
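To make the role of the discount factor concrete, the discounted return that the agent tries to maximize is commonly written as (a standard formulation, not taken from these slides):

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = ∑_{k=0}^{∞} γ^k r_{t+k+1},   with discount factor 0 ≤ γ < 1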
Unsupervised learning

• The aim of unsupervised learning is harder to define.
  – One major aim is to find an internal representation of the input for subsequent supervised or reinforcement learning.
  – Much of the related research focuses on clustering.
  – Unsupervised learning was largely ignored by the machine learning community for about 40 years.
Other goals of unsupervised learning

• To provide a compact, low-dimensional representation of the input.
  – High-dimensional inputs typically reside on or near a low-dimensional manifold (or several such manifolds).
  – Principal component analysis (PCA) is a representative linear method.
• To find clusters in the input.
  – Clusters can be interpreted as a very sparse code in which only one feature is nonzero.
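As a brief illustration (not part of the original slides), principal component analysis can be carried out with the SVD; this Octave sketch projects centered data onto its first principal component, with the random data matrix and the single-component choice as illustrative assumptions:

X = randn(100, 5);                              % illustrative data: one example per row
Xc = X - repmat(mean(X, 1), size(X, 1), 1);     % center each feature
[U, S, V] = svd(Xc);                            % SVD of the centered data
Z = Xc * V(:, 1);                               % 1-D representation along the first principal component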
Question
Supervised learning

• Data from Portland, Oregon, US
• Supervised learning: the "right answer" is given for each example
• Regression: predict a continuous-valued output (house price)
Supervised learning

• Breast cancer (malignant or benign?)
• Supervised learning: the correct labels are given
• Classification: discrete-valued output (0 or 1)
Supervised learning

• More than one feature considered

– Clump thickness
– Uniformity of cell size
– Uniformity of cell shape
– ……
Unsupervised learning
Question
Applications
Cocktail party problem

• Two overlapping sources are recorded by two microphones; the learning algorithm separates the mixed recordings into Output 1 and Output 2.
• A single line of Octave code (based on the SVD) performs the separation:

% x holds the mixed microphone recordings; the SVD recovers the separating directions in W
[W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1) .* x) * x');
Problem

Housing Prices (Portland, OR)

[Figure: scatter plot of price (in 1000s of dollars, 0 to 500) against size (feet², 0 to 3000).]
Supervised Learning: given the “right answer” for each example in the data.
Regression Problem: predict a real-valued output.

Training set of housing prices (Portland, OR):

  Size in feet² (x)    Price ($) in 1000's (y)
  2104                 460
  1416                 232
  1534                 315
  852                  178
  ...                  ...
Notation:
  m = number of training examples
  x's = “input” variable / features
  y's = “output” variable / “target” variable

  (x, y): one training example
  (x^(i), y^(i)): the i-th training example
How do we represent h?

[Diagram: the Training Set is fed to a Learning Algorithm, which outputs a hypothesis h; h maps the size of a house (x) to an estimated price (y).]

h_θ(x) = θ_0 + θ_1 x

Linear regression with one variable (univariate linear regression).
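A minimal Octave sketch of this hypothesis; the parameter values are purely illustrative:

theta0 = 50;  theta1 = 0.15;          % illustrative intercept and slope
h = @(x) theta0 + theta1 * x;         % univariate linear hypothesis h_theta(x) = theta0 + theta1*x
price_estimate = h(2104);             % estimated price (in $1000s) for a 2104 ft^2 house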
Question
Linear regression with one variable

Cost function

Training Set:

  Size in feet² (x)    Price ($) in 1000's (y)
  2104                 460
  1416                 232
  1534                 315
  852                  178
  ...                  ...

Hypothesis:  h_θ(x) = θ_0 + θ_1 x

Parameters:  θ_0, θ_1

How do we choose the θ's?
[Figure: three small plots showing different straight-line hypotheses corresponding to different choices of θ_0 and θ_1.]

J(θ_0, θ_1) = (1/(2m)) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Goal:  minimize J(θ_0, θ_1) over θ_0, θ_1

Idea: choose θ_0, θ_1 so that h_θ(x) is close to y for our training examples.
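A direct Octave translation of this cost function (a minimal sketch: the name computeCost and the design-matrix convention with a leading column of ones are my own choices, not from the slides):

function J = computeCost(X, y, theta)
  % Cost J(theta_0, theta_1) for linear regression.
  % X: m-by-2 design matrix [ones(m,1), x]; y: m-by-1 targets; theta: [theta_0; theta_1].
  m = length(y);
  errors = X * theta - y;                 % h_theta(x^(i)) - y^(i) for every training example
  J = (1 / (2 * m)) * sum(errors .^ 2);   % squared-error cost
end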
Question
Linear regression with one variable

Cost function intuition I

Hypothesis:  h_θ(x) = θ_0 + θ_1 x
Simplified (set θ_0 = 0):  h_θ(x) = θ_1 x

Parameter:  θ_1

Cost function:  J(θ_1) = (1/(2m)) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Goal:  minimize J(θ_1) over θ_1
(for fixed θ_1, h_θ(x) is a function of x)        (J(θ_1) is a function of the parameter θ_1)

[Figure: left, a candidate line h_θ(x) = θ_1 x through the training points (1,1), (2,2), (3,3); right, the value of J(θ_1) plotted against θ_1.]

J(θ_1) = (1/(2m)) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
       = (1/(2m)) ∑_{i=1}^{m} (θ_1 x^(i) − y^(i))²
(for fixed θ_1 = 0.5, h_θ(x) is a function of x)        (J(θ_1) is a function of the parameter θ_1)

[Figure: left, the line h_θ(x) = 0.5·x against the training points; right, the point (0.5, J(0.5)) on the J(θ_1) curve.]

J(0.5) = (1/6) [(0.5 − 1)² + (1 − 2)² + (1.5 − 3)²]
       = 3.5/6 ≈ 0.58
(for fixed θ_1 = 0, h_θ(x) is a function of x)        (J(θ_1) is a function of the parameter θ_1)

[Figure: left, the horizontal line h_θ(x) = 0 against the training points; right, the point (0, J(0)) on the J(θ_1) curve.]

J(0) = (1/6) [1² + 2² + 3²]
     = 14/6 ≈ 2.3
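These two values can be checked numerically with the computeCost sketch given earlier, using the three training points (1,1), (2,2), (3,3) that the example implies:

x = [1; 2; 3];  y = [1; 2; 3];        % the toy training set implied by the example
X = [ones(3, 1), x];                  % prepend a column of ones for theta_0
computeCost(X, y, [0; 0.5])           % theta_0 = 0, theta_1 = 0.5  ->  about 0.58
computeCost(X, y, [0; 0])             % theta_0 = 0, theta_1 = 0    ->  about 2.33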
Question
Hypothesis:  h_θ(x) = θ_0 + θ_1 x

Parameters:  θ_0, θ_1

Cost function:  J(θ_0, θ_1) = (1/(2m)) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Goal:  minimize J(θ_0, θ_1) over θ_0, θ_1
(for fixed θ_0, θ_1, h_θ(x) is a function of x)        (J(θ_0, θ_1) is a function of the parameters)

[Figure: left, a straight-line hypothesis plotted against the housing data (price in $1000s, 0 to 500, vs. size in feet², 0 to 3000); right, the corresponding cost J(θ_0, θ_1) over the parameter space.]

[Subsequent slides repeat this view for several choices of (θ_0, θ_1): each straight-line fit on the left corresponds to one point on the contour plot of J(θ_0, θ_1) on the right.]
Linear regression with one variable

Gradient descent

Have some function J(θ_0, θ_1).
Want:  min over θ_0, θ_1 of J(θ_0, θ_1).

Outline:
• Start with some θ_0, θ_1.
• Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1), until we hopefully end up at a minimum.
[Figure: two views of the surface J(θ_0, θ_1) plotted over the (θ_0, θ_1) plane; starting from different initial points, gradient descent can end up at different local minima.]
Gradient descent algorithm

Repeat until convergence:
  θ_j := θ_j − α (∂/∂θ_j) J(θ_0, θ_1)    (simultaneously for j = 0 and j = 1)

Correct (simultaneous update):
  temp0 := θ_0 − α (∂/∂θ_0) J(θ_0, θ_1)
  temp1 := θ_1 − α (∂/∂θ_1) J(θ_0, θ_1)
  θ_0 := temp0
  θ_1 := temp1

Incorrect (sequential update):
  temp0 := θ_0 − α (∂/∂θ_0) J(θ_0, θ_1)
  θ_0 := temp0
  temp1 := θ_1 − α (∂/∂θ_1) J(θ_0, θ_1)    (this now uses the already-updated θ_0)
  θ_1 := temp1
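In Octave, the simultaneous update amounts to computing both new values before overwriting either parameter; a minimal sketch, where gradJ0 and gradJ1 stand for the two partial derivatives and are assumed to have been computed already:

temp0 = theta0 - alpha * gradJ0;   % new theta0, computed from the current (theta0, theta1)
temp1 = theta1 - alpha * gradJ1;   % new theta1, also from the current (theta0, theta1)
theta0 = temp0;                    % only now overwrite the parameters
theta1 = temp1;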


Question
Linear regression with one variable

Gradient descent intuition

Gradient descent update:  θ_1 := θ_1 − α (d/dθ_1) J(θ_1)

• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
Question
Gradient descent at local optima

[Figure: J(θ_1) with the current value of θ_1 sitting at a local optimum, where the derivative is zero.]

Gradient descent can converge to a local minimum, even with the learning rate α fixed: at a local optimum the derivative is zero, so the update leaves θ_1 unchanged.

As we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative shrinks. So there is no need to decrease α over time.
Linear regression with one variable

Gradient descent for linear regression
Gradient descent algorithm applied to the linear regression model:

∂/∂θ_j J(θ_0, θ_1) = ∂/∂θ_j [ (1/(2m)) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))² ]
                   = ∂/∂θ_j [ (1/(2m)) ∑_{i=1}^{m} (θ_0 + θ_1 x^(i) − y^(i))² ]

j = 0:  ∂/∂θ_0 J(θ_0, θ_1) = (1/m) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))

j = 1:  ∂/∂θ_1 J(θ_0, θ_1) = (1/m) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
Gradient descent algorithm

Repeat until convergence:
  θ_0 := θ_0 − α (1/m) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))
  θ_1 := θ_1 − α (1/m) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
  (update θ_0 and θ_1 simultaneously)
[Figure: the surface and contour plots of J(θ_0, θ_1) over the (θ_0, θ_1) plane, revisited for the linear regression cost function.]
[Sequence of slides: at each step of gradient descent, the left panel shows the current hypothesis h_θ(x) (a function of x for the current θ_0, θ_1) fitted to the housing data, and the right panel shows the corresponding point moving across the contour plot of J(θ_0, θ_1) toward its minimum.]
“Batch” Gradient Descent

“Batch”: each step of gradient descent uses all of the training examples.
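As a hedged Octave sketch (the function name, alpha, and iteration count are my own illustrative choices), the batch update rules above can be implemented as follows; the vectorized expression X' * errors computes both partial derivatives over all m training examples at once:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  % Batch gradient descent for h_theta(x) = theta_0 + theta_1 * x.
  % X: m-by-2 design matrix [ones(m,1), x]; y: m-by-1 targets; theta: [theta_0; theta_1].
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    errors = X * theta - y;                              % h_theta(x^(i)) - y^(i) over all m examples
    theta = theta - (alpha / m) * (X' * errors);         % simultaneous update of theta_0 and theta_1
    J_history(iter) = (1 / (2 * m)) * sum(errors .^ 2);  % cost at the parameters used in this step
  end
end

% Example call on the toy data used earlier (alpha and the iteration count are illustrative):
% theta = gradientDescent([ones(3,1), [1;2;3]], [1;2;3], [0; 0], 0.1, 1000);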
Question
The End

Thank you
