
Machine Learning 2016

Lecture 1: Introduction
Lecturer: Andrew Ng Scribe: Minsu Kim

1.1 What is Machine Learning?

Before studying machine learning algorithms, we should know what machine learning is. Two people have
defined it as follows:

Arthur Samuel. Machine learning: the field of study that gives computers the ability to learn
without being explicitly programmed.

Tom Mitchell. Well-posed learning problem: A computer program is said to learn from experience
E with respect to some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E.

1.2 Supervised Learning

In supervised learning, we are given a data set and already know the relationship between input and
output. There are two kinds of supervised learning.

1.2.1 Regression

If we have a data set of housing prices in which the "right answers" are given, how can we predict the housing
price for a particular size? In Figure 1.1, the red 'X' points are the data set and the two lines are output prediction lines.

Figure 1.1: Regression

From those lines, we can predict the housing price for some particular size. This process is regression. In other
words, regression predicts a continuous-valued output when you have a labeled data set.

1.2.2 Classification

Classification is similar to regression in that it predicts an output from a labeled data set. However, it does not
predict a continuous-valued output.

Figure 1.2: classification

We have a data set about tumors above. We are trying to predict whether a tumor is malignant or not according
to its size. Predicting a discrete-valued output (0, 1) from a labeled data set like this is classification.

1.3 Unsupervised learning

Unlike supervised learning, unsupervised learning addresses problems where the data set does not
tell us what the correct result should be. We instead derive clusters from data in which we don't know the
effect of the variables. In other words, nothing teaches us what the correct result is; we just obtain clusters that are
somehow similar or related.

For example, we might collect 1000 articles about the Greek economy and find a way to group these articles into
small clusters by similar topic, sentences, number of pages, and so on.

Figure 1.3: Unsupervised learning

Machine Learning 2016

Lecture 2: Linear Regression with One Variable


Lecturer: Andrew Ng Scribe: Minsu Kim

2.1 Model Representation

In the previous lecture, we studied the "regression problem". In a regression problem, we try to predict a
continuous-valued output from a data set in which the relationship between input and output is already known.
Linear regression with one variable is also called "univariate linear regression". If you are trying to predict
a single output value from a single input value and already know the relationship between input and
output, you should use univariate linear regression.

2.1.1 Hypothesis Function

Before studying the hypothesis function, we need a few pieces of notation.

m = number of training examples
x's = "input" variables / features
y's = "output" variables / "target" variables
(x, y) = one training example
(x^(i), y^(i)) = i-th training example

Our hypothesis function has the general form:

hθ(x) = θ0 + θ1 x

We should choose θ0, θ1 so that hθ(x) is close to the y's for our training examples, and hθ then maps from x's to
y's like the blue line below:

Figure 2.1: Hypothesis function

2.1.2 Cost Function

We can measure how close hθ is to the data with a cost function. We take the average of the squared errors
between hθ on every input and the actual output. That is why it is called the "squared error function" or
"mean squared error". The cost function is defined as

J(θ0, θ1) = (1/2m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )^2

The (1/2m) factor makes the gradient descent computation convenient: the derivative of the square
cancels out the (1/2).
With the cost function, we can concretely measure the accuracy of our hθ against the training examples.
The more accurate our hθ is, the closer J(θ0, θ1) is to zero.
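The cost function above can be sketched in a few lines of NumPy. The data set below is a made-up toy example (not from the lecture), chosen to lie exactly on the line y = 1 + 2x so that a perfect fit gives J = 0; the function name `compute_cost` is our own.

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    h = theta0 + theta1 * x          # hypothesis evaluated on every example
    return np.sum((h - y) ** 2) / (2 * m)

# Toy data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

print(compute_cost(1.0, 2.0, x, y))  # perfect fit, so J = 0
```

Any other choice of θ0, θ1 gives a strictly positive cost, which is what gradient descent (next section) will drive down.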

2.1.3 Gradient Descent

We have our hypothesis function and a way of measuring how accurate it is. Next we study a way of
automatically improving the hypothesis function: gradient descent.
The gradient descent update is defined as

repeat until convergence:

θj := θj − α (∂/∂θj) J(θ0, θ1)

(for j = 0 and j = 1)
(simultaneous update)

Our optimization objective is to fit θ0, θ1 so as to minimize J(θ0, θ1). That is
why we follow the partial derivatives (∂/∂θj) J(θ0, θ1). Here is the intuition of gradient descent.

Figure 2.2: Gradient descent

In Figure 2.2, we can see that gradient descent automatically takes smaller steps as we approach a
local minimum, so there is no need to decrease α over time.

2.1.4 Gradient Descent For Linear Regression

Substituting the derivatives of J into the general gradient descent definition above, gradient descent for linear regression is

repeat until convergence:

θ0 := θ0 − α (∂/∂θ0) J(θ0, θ1) = θ0 − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )

θ1 := θ1 − α (∂/∂θ1) J(θ0, θ1) = θ1 − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x^(i)

(simultaneous update)

In linear regression the cost function is convex, so gradient descent always reaches the global minimum; there
are no other local minima to get stuck in.
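The two update rules above can be implemented directly. This is a minimal sketch: the toy data and the choices α = 0.1 and 1000 iterations are our own illustrative assumptions, not values from the lecture.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=1000):
    """Run the simultaneous theta0/theta1 updates for univariate linear regression."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        h = theta0 + theta1 * x
        grad0 = np.sum(h - y) / m            # (1/m) sum(h(x_i) - y_i)
        grad1 = np.sum((h - y) * x) / m      # (1/m) sum((h(x_i) - y_i) * x_i)
        # Simultaneous update: both gradients use the same old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by y = 1 + 2x
t0, t1 = gradient_descent(x, y)      # converges near (1, 2)
```

The tuple assignment on one line is what makes the update simultaneous; updating θ0 first and then using the new θ0 inside θ1's gradient would be the incorrect sequential variant.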

Machine Learning 2016

Lecture 3: Linear Regression with Multiple Variables


Lecturer: Andrew Ng Scribe: Minsu Kim

3.1 Multiple Features

Unlike univariate linear regression, linear regression with multiple variables uses more features to predict the output
more accurately. It is also called "multivariate linear regression".

3.1.1 Hypothesis Function for Multiple Features

Now we introduce notation for equations with any number of variables.

m = number of training examples
n = number of features
x^(i) = input (features) of the i-th training example
x_j^(i) = value of feature j in the i-th training example

Then we can define the hypothesis function for multiple features as follows:

hθ(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3 + ... + θn xn

For convenience of notation, we define x0 = 1 and obtain the more compact form:

hθ(x) = [ θ0 θ1 ... θn ] [ x0 ; x1 ; ... ; xn ] = Θ^T x

Now we collect all m training examples, each with n features, and record them in an (n + 1) × m matrix,
as shown here:

X = [ x0^(1)  x0^(2)  ...  x0^(m)          [ 1       1       ...  1
      x1^(1)  x1^(2)  ...  x1^(m)      =     x1^(1)  x1^(2)  ...  x1^(m)
      ...                                    ...
      xn^(1)  xn^(2)  ...  xn^(m) ]          xn^(1)  xn^(2)  ...  xn^(m) ]

And then the hypothesis for all m examples at once is the 1 × m row vector

hθ = [ θ0 θ1 ... θn ] X = Θ^T X

3.1.2 Cost Function for Multiple Variables

For multiple variables hθ(x) = Θ^T x, and the parameter vector Θ is an (n+1)-dimensional vector. Then the cost
function is:

J(Θ) = (1/2m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )^2

And the vectorized version, with ~y the m-vector of training outputs and X^T Θ the m-vector of predictions, is:

J(Θ) = (1/2m) (X^T Θ − ~y)^T (X^T Θ − ~y)

3.1.3 Gradient Descent for Multiple Variables

From the general gradient descent form, gradient descent for multiple variables is:

repeat until convergence:

θj := θj − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x_j^(i)

(for j = 0, 1, 2, ..., n)
(simultaneous update)
And the vectorized version is:

Θ := Θ − α ∇J(Θ)

where ∇J(Θ) is the (n+1)-dimensional gradient vector

∇J(Θ) = [ ∂J(Θ)/∂θ0 ; ∂J(Θ)/∂θ1 ; ... ; ∂J(Θ)/∂θn ]

The j-th component of ∇J(Θ) can be vectorized as

∂J(Θ)/∂θj = (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x_j^(i)
          = (1/m) Σ_{i=1}^m x_j^(i) ( hθ(x^(i)) − y^(i) )
          = (1/m) ~x_j^T (X^T Θ − ~y)

where ~x_j is the m-vector of the j-th feature across all training examples (the j-th row of X, transposed).

Stacking all n + 1 components, we get

∇J(Θ) = (1/m) X (X^T Θ − ~y)

From those vectorized versions, we can express vectorized gradient descent as

Θ := Θ − α ∇J(Θ)
   = Θ − α (1/m) X (X^T Θ − ~y)

3.1.4 Feature Scaling

Previously, we learned how to choose Θ for predicting our output. Now we study a faster way to
choose Θ. The idea is to put the features on a similar scale. For example,

x1 = size (0 ~ 2000 feet^2)
x2 = number of bedrooms (1 ~ 5)

x1's range is much bigger than x2's, so gradient descent will take a long time to reach the global minimum. Therefore, we
should get every feature into approximately a −1 ≤ x ≤ 1 range.
- Mean normalization
Replace xi with xi − µi to make the features have approximately zero mean. [NOTE: do not apply to x0 = 1]

xi := (xi − µi) / si,   where µi = average of xi in the training set
                              si = range of xi (max − min)

3.1.5 Learning Rate

In gradient descent

θj := θj − α (∂/∂θj) J(θ)

the constant α is called the "learning rate". It affects the convergence of θ. We know that J(θ)
should decrease after every iteration, as below.

Figure 3.1: proper α

Figure 3.2: not proper α

However, if J(θ) increases or oscillates as in Figure 3.2, you should use a smaller α.
Therefore:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration and may not converge.
To choose α, try values like (..., 0.001, 0.01, 0.1, 1, ...).
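A small experiment, with made-up data, showing both behaviors described above: J decreases every iteration for a proper α but blows up when α is too large. The specific values 0.1 and 1.0 are assumptions chosen for this toy problem.

```python
import numpy as np

X = np.array([[1.0, 1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0, 3.0]])   # (n+1) x m layout, row 0 is x0 = 1
y = np.array([1.0, 3.0, 5.0, 7.0])
m = X.shape[1]

def cost(theta):
    r = X.T @ theta - y
    return r @ r / (2 * m)

def run(alpha, iters=50):
    """Return the history of J(theta) for one choice of learning rate."""
    theta = np.zeros(2)
    history = []
    for _ in range(iters):
        theta -= alpha * X @ (X.T @ theta - y) / m
        history.append(cost(theta))
    return history

good = run(0.1)   # J decreases on every iteration
bad = run(1.0)    # J grows: alpha is too large for this problem
```

Plotting (or just printing) the two histories reproduces Figures 3.1 and 3.2: one monotone decreasing curve, one diverging curve.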

3.1.6 Features and Polynomial Regression

From our features x1, x2 we can create a new feature x3 = x1 × x2. In some problems we should create new
features from the original features like this.
Suppose our hypothesis function is

hθ(x) = θ0 + θ1 x1

which is a linear function. We can change the behavior of its curve by adding a power of x1:

hθ(x) = θ0 + θ1 x1 + θ2 x1^2

In this function we have made a new feature x2 = x1^2.
By making hθ quadratic, cubic, or some other form, you can improve your hypothesis function.
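Creating polynomial features is just stacking powers of the original column; the helper name below is our own. Note that x1, x1^2, x1^3 have very different ranges, so the feature scaling of Section 3.1.4 becomes important here.

```python
import numpy as np

def add_polynomial_features(x1):
    """Build the feature columns [x1, x1^2, x1^3] from one original feature."""
    return np.column_stack([x1, x1 ** 2, x1 ** 3])

x1 = np.array([1.0, 2.0, 3.0])
features = add_polynomial_features(x1)   # shape (3, 3), one row per example
```

Feeding these columns into ordinary linear regression fits a cubic curve in x1 while the learning algorithm itself stays completely unchanged.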

3.1.7 Normal Equation

The normal equation is a method of solving for θ analytically.

We set every partial derivative of the cost function to zero,

∂J(θ)/∂θj = 0   (for all j),

and solve for θ. For m examples and n features, the solution is the normal equation

Θ = (X^T X)^{−1} X^T y   (Θ ∈ ℝ^{n+1})

where

x^(i) = [ x0^(i) ; x1^(i) ; x2^(i) ; ... ; xn^(i) ]   and   X = [ (x^(1))^T ; (x^(2))^T ; ... ; (x^(m))^T ]

Note that here X is the m × (n + 1) design matrix whose rows are the training examples, the transpose of the matrix used in Section 3.1.1.

If we use the normal equation, there is no need for feature scaling.
Now we compare gradient descent to the normal equation.

Gradient Descent:
- Need to choose α
- Needs many iterations
- Works well even when n is large

Normal Equation:
- No need to choose α
- No need to iterate
- Need to compute (X^T X)^{−1}
- Slow if n is very large
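The normal equation in NumPy, using this section's design-matrix layout (one example per row); the data is an invented toy set. `np.linalg.solve` is used instead of forming the explicit inverse, which is the numerically safer route to the same Θ.

```python
import numpy as np

# Design matrix with one example per row: columns are [x0, x1], x0 = 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by y = 1 + 2*x1

# Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

No learning rate, no iterations, no feature scaling: one linear solve recovers θ exactly, at the cost of an O(n^3) factorization that becomes slow when n is very large.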

Machine Learning 2016

Lecture 4: Logistic Regression


Lecturer: Andrew Ng Scribe: Minsu Kim

4.1 Classification

As we mentioned in Lecture 1, classification is predicting a discrete-valued output (0, 1) from a labeled data
set; that is, y ∈ {0, 1}. The hypothesis function for regression is not always between 0 and 1, so we
cannot use it and need a new function.
"Logistic regression" is the algorithm defined for classification.

4.2 Hypothesis representation for Classification

4.2.1 Logistic Regression Model

We want the condition 0 ≤ hθ(x) ≤ 1. Under this condition, the hypothesis function is defined as

hθ(x) = g(θ^T x),   where   g(z) = 1 / (1 + e^{−z})

Figure 4.1: Sigmoid function g(z)

The sigmoid function g(z) satisfies the condition on hθ(x). The threshold classifier output is at 0.5:
- If hθ(x) ≥ 0.5, predict "y = 1"
- If hθ(x) < 0.5, predict "y = 0"
Therefore, hθ(x) is the estimated probability that y = 1 on input x. In other words, we can write hθ(x) as

hθ(x) = P(y = 1 | x; θ)
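The sigmoid hypothesis and the 0.5 threshold rule can be sketched as follows; the function names and the example θ, x values are our own.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Threshold h_theta(x) = g(theta^T x) at 0.5."""
    return 1 if sigmoid(theta @ x) >= 0.5 else 0

theta = np.array([0.0, 1.0])    # illustrative parameters, x = [x0, x1]
x = np.array([1.0, 5.0])
label = predict(theta, x)       # theta^T x = 5 > 0, so predict y = 1
```

Since g(z) ≥ 0.5 exactly when z ≥ 0, thresholding the probability at 0.5 is the same as checking the sign of θ^T x, which is what the next section's decision boundary makes explicit.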

4.2.2 Decision Boundary

In logistic regression there is a decision boundary. Suppose we predict

"y = 1" if hθ(x) ≥ 0.5: g(z) ≥ 0.5 when z ≥ 0, i.e. when θ^T x ≥ 0

"y = 0" if hθ(x) < 0.5: g(z) < 0.5 when z < 0, i.e. when θ^T x < 0

4.2.3 Cost Function for Classification

For logistic regression we could first try the cost function from linear regression:

J(θ) = (1/m) cost(hθ(x), y)

where

cost(hθ(x), y) = Σ_{i=1}^m cost(hθ(x^(i)), y^(i)) = Σ_{i=1}^m (1/2) ( hθ(x^(i)) − y^(i) )^2

Unlike linear regression, however, hθ(x) now contains the nonlinear term 1/(1 + e^{−z}), so the graph of J(θ) is non-convex, as follows:

Figure 4.2: Non-convex cost function

A non-convex form does not guarantee finding the global minimum. Therefore, we redefine the logistic regression cost function
so that it becomes convex.

- Logistic regression cost function
It is redefined as

cost(hθ(x), y) = −log(hθ(x))       if y = 1
cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0

Figure 4.3: cost(hθ(x), y)

cost(hθ(x), y) = 0 if hθ(x) = y

cost(hθ(x), y) → ∞ if y = 0 and hθ(x) → 1
cost(hθ(x), y) → ∞ if y = 1 and hθ(x) → 0

4.2.4 Simplified Cost Function and Gradient Descent

We know the logistic regression cost function as

J(θ) = (1/m) Σ_{i=1}^m cost(hθ(x^(i)), y^(i))

cost(hθ(x), y) = −log(hθ(x))       if y = 1
cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0

[NOTE: y = 0 or 1, always]

Because of this NOTE, we can compress the two cases into one expression:

cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))

Therefore,

J(θ) = −(1/m) Σ_{i=1}^m [ y^(i) log(hθ(x^(i))) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

And then gradient descent can be written simply as

repeat for min_θ J(θ):

θj := θj − α (∂/∂θj) J(θ)

(simultaneous update)

with J(θ) as above.
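A sketch of the simplified cost and one gradient step in NumPy. Rows of X are examples here, and the tiny data set, α, and iteration count are invented illustrative choices; the label rule (y = 1 iff x1 ≥ 2) is our own toy construction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) sum[ y log h + (1-y) log(1-h) ]."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(y)
    h = sigmoid(X @ theta)
    return theta - alpha * X.T @ (h - y) / m

# Tiny separable toy set: label 1 iff x1 >= 2 (columns are [x0, x1]).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.5)
```

The gradient has exactly the same shape as in linear regression, X^T(h − y)/m; only h changed from θ^T x to the sigmoid of θ^T x, which is the point made in the next section.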

4.2.5 Multiclass Classification

This gradient descent algorithm looks identical to the one for linear regression, but hθ(x) is different in each case:

Linear regression:   hθ(x) = Θ^T x

Logistic regression: hθ(x) = 1 / (1 + e^{−Θ^T x})

Now we approach classification with more than two categories. In this case there are more than two outputs.
If a problem has n classes, we train one classifier hθ^(i)(x) per class and predict the probability of each:

hθ^(i)(x) = P(y = i | x; θ)   (i = 1, 2, ..., n)

On a new input x, to make a prediction, pick the class i that maximizes hθ^(i)(x).
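One-vs-all prediction is just an argmax over the per-class probabilities. The three "trained" parameter vectors below are invented for illustration, not the result of actual training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_multiclass(thetas, x):
    """Pick the class whose classifier reports the highest probability."""
    probs = [sigmoid(theta @ x) for theta in thetas]
    return int(np.argmax(probs))

# Three hypothetical already-trained classifiers, one row per class.
thetas = np.array([[ 2.0, -1.0],    # class 0
                   [-1.0,  0.5],    # class 1
                   [-4.0,  1.5]])   # class 2
x = np.array([1.0, 4.0])            # x0 = 1, x1 = 4
label = predict_multiclass(thetas, x)
```

Each classifier only answers "is it my class or not?"; the argmax resolves the n binary answers into a single predicted class.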

Machine Learning 2016

Lecture 5: Regularization
Lecturer: Andrew Ng Scribe: Minsu Kim

5.1 The Problem of Overfitting

Suppose we are trying to predict housing prices, and we have the three kinds of hypothesis functions shown in Figure 5.1.

Figure 5.1: Prediction

case1 : hθ (x) = θ0 + θ1 x
case2 : hθ (x) = θ0 + θ1 x + θ2 x2
case3 : hθ (x) = θ0 + θ1 x + θ2 x2 + · · · + θ6 x6

In case 1, the function does not fit the data very well ("underfitting").

In case 2, the function fits properly.
In case 3, the function fits the training data well, but too well ("overfitting").

- Overfitting: if we have too many features, the learned hypothesis may fit the training set very well but
fail to generalize to new examples.

5.2 Addressing Overfitting

There are two options:

1. Reduce the number of features.
- Manually select which features to keep
- Model selection algorithm
But this option throws away some of our information.
2. Regularization.
- Keep all the features, but reduce the magnitude/values of the parameters θj
- Works well when we have a lot of features, each of which contributes a bit to predicting y

5.3 Regularized Cost Function

Our hypothesis function is

hθ(x) = θ0 + θ1 x + θ2 x^2 + θ3 x^3 + θ4 x^4

Suppose hθ(x) is currently overfitting, and we penalize θ3, θ4 to make them really small. The result is that hθ(x)
becomes close to a quadratic function, and we have addressed the overfitting. Without actually eliminating these features, we can
just modify our cost function:

min_θ (1/2m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )^2 + 1000·θ3^2 + 1000·θ4^2

To minimize this, the values of θ3, θ4 must shrink toward zero.

We can also regularize all of our θ parameters in a single summation, except θ0:

min_θ (1/2m) [ Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )^2 + λ Σ_{j=1}^n θj^2 ]

(λ is the regularization parameter)

Because θ0 is the bias term, we exclude it from the regularization sum.

5.4 Regularized Linear Regression

5.4.1 Gradient Descent with Regularization

repeat until convergence:

θ0 := θ0 − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x0^(i)

θj := θj − α [ (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) xj^(i) + (λ/m) θj ]

(for j = 1, ..., n)

(simultaneous update)

The (λ/m)θj term performs the regularization. We can also rearrange the θj update as

θj := θj (1 − α λ/m) − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) xj^(i)

The factor (1 − α λ/m) is slightly less than 1, so it shrinks θj a little on every update.
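One regularized update step in NumPy for the linear-regression case, with invented data and arbitrary α, λ. Note the bias component of the penalty is zeroed out, matching the rule above that θ0 is not regularized.

```python
import numpy as np

def regularized_step(theta, X, y, alpha, lam):
    """One gradient step; theta_0 (the bias) is not regularized."""
    m = len(y)
    h = X @ theta                        # linear hypothesis, rows are examples
    grad = X.T @ (h - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                         # do not penalize theta_0
    return theta - alpha * (grad + reg)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])      # generated by y = 1 + 2*x1
theta = np.zeros(2)
for _ in range(2000):
    theta = regularized_step(theta, X, y, alpha=0.1, lam=1.0)
```

Because of the penalty, the converged θ1 ends up slightly below the unregularized value 2: the shrinkage factor (1 − αλ/m) pulls every penalized weight toward zero on each update.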

5.4.2 Regularized Normal Equation

The normal equation with regularization is defined as

θ = (X^T X + λ·L)^{−1} X^T y

where L is the (n + 1) × (n + 1) diagonal matrix

L = diag(0, 1, 1, ..., 1)
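The same regularized solution can be obtained in closed form; the toy data and λ = 1 are invented. L is the identity with the (0, 0) entry zeroed so that the bias term is left unpenalized.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by y = 1 + 2*x1
lam = 1.0

n_plus_1 = X.shape[1]
L = np.eye(n_plus_1)
L[0, 0] = 0.0                         # leave the bias term unpenalized

# theta = (X^T X + lambda * L)^{-1} X^T y, via a linear solve.
theta = np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

An extra benefit of the λ·L term: it makes the matrix being inverted strictly better conditioned, so the regularized normal equation can be solvable even when X^T X itself is singular (e.g. with redundant features or m ≤ n).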

5.5 Regularized Logistic Regression

The regularized logistic regression cost function is defined as

J(θ) = −(1/m) Σ_{i=1}^m [ y^(i) log(hθ(x^(i))) + (1 − y^(i)) log(1 − hθ(x^(i))) ] + (λ/2m) Σ_{j=1}^n θj^2

and with this J(θ) we can write gradient descent as

repeat until convergence:

θ0 := θ0 − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x0^(i)

θj := θj − α [ (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) xj^(i) + (λ/m) θj ]

(where hθ(x) = 1 / (1 + e^{−Θ^T x}), for j = 1, ..., n)
(simultaneous update)

This is identical to the gradient descent for linear regression except hθ (x).

