
Machine Learning 2016

Lecture 1: Introduction
Lecturer: Andrew Ng Scribe: Minsu Kim

1.1 What is Machine Learning?

Before studying machine learning algorithms, we should know what machine learning is. Two people have
defined it as follows:

Arthur Samuel. Machine learning: the field of study that gives computers the ability to learn
without being explicitly programmed.

Tom Mitchell. Well-posed learning problem: A computer program is said to learn from experience
E with respect to some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E.

1.2 Supervised Learning

In supervised learning, we are given a data set and already know the relationship between input and
output. There are two kinds of supervised learning.

1.2.1 Regression

If we have a data set of housing prices in which the "right answers" are given, how can we predict the housing
price for a particular size? In Figure 1.1, the red 'X' points are the data set and the two lines are output prediction lines.

Figure 1.1: Regression

From those lines, we can predict the housing price for some particular size. This process is regression. In other
words, regression predicts a continuous-valued output when you have a labeled data set.

1.2.2 Classification

Classification is similar to regression in that it predicts an output from a labeled data set. However, it does not
predict a continuous-valued output.

Figure 1.2: classification

We have a data set about tumors above. We are trying to predict whether a tumor is malignant or not according
to its size. Predicting a discrete-valued output (0, 1) from a labeled data set like this is classification.

1.3 Unsupervised learning

Unlike supervised learning, unsupervised learning addresses problems where the data set does not
tell us what the correct result should be. We instead derive clusters from data in which we don't know the
effect of the variables. In other words, nothing teaches us what the correct result is; we just obtain clusters that are
somehow similar or related.

For example, we might collect 1000 articles about the Greek economy and find a way to group these articles into
small clusters by similar topic, sentences, number of pages, and so on.

Figure 1.3: Unsupervised learning

Machine Learning 2016

Lecture 2: Linear Regression with One Variable


Lecturer: Andrew Ng Scribe: Minsu Kim

2.1 Model Representation

In the previous lecture, we studied the "regression problem". In a regression problem, we try to predict a
continuous-valued output from a data set in which the relationship between input and output is already known.
Linear regression with one variable is also called "univariate linear regression". If you are trying to predict
a single output value from a single input value and already know the relationship between input and
output, you should use univariate linear regression.

2.1.1 Hypothesis Function

Before studying the hypothesis function, we need a few pieces of notation.

m = number of training examples
x's = "input" variables / features
y's = "output" variables / "target" variables
(x, y) = one training example
(x^(i), y^(i)) = i-th training example

Our hypothesis function has the general form:

hθ(x) = θ0 + θ1 x

We should choose θ0, θ1 so that hθ(x) is close to the y's for our training examples, and hθ then maps from x's to
y's like the blue line below:

Figure 2.1: Hypothesis function

2.1.2 Cost Function

We can measure how close hθ is to the data with a cost function. We take the average of the squared errors
between hθ on every input and the actual output. That is why it is called the "squared error function" or
"mean squared error". The cost function is defined as

J(θ0, θ1) = (1/2m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )^2

The (1/2m) factor makes the gradient descent computation convenient: the derivative of the square
cancels out the (1/2).
With the cost function, we can concretely measure the accuracy of our hθ against the training examples.
The more accurate our hθ is, the closer J(θ0, θ1) is to zero.
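The cost function above can be sketched in a few lines of NumPy. The data set below is a made-up toy example (not from the lecture), chosen to lie exactly on the line y = 1 + 2x so that a perfect fit gives J = 0; the function name `compute_cost` is our own.

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    h = theta0 + theta1 * x          # hypothesis evaluated on every example
    return np.sum((h - y) ** 2) / (2 * m)

# Toy data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

print(compute_cost(1.0, 2.0, x, y))  # perfect fit, so J = 0
```

Any other choice of θ0, θ1 gives a strictly positive cost, which is what gradient descent (next section) will drive down.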

2.1.3 Gradient Descent

We have our hypothesis function and a way of measuring how accurate it is. Next we study a way of
automatically improving the hypothesis function: gradient descent.
The gradient descent update is defined as

repeat until convergence:

θj := θj − α (∂/∂θj) J(θ0, θ1)

(for j = 0 and j = 1)
(simultaneous update)

Our optimization objective is to fit θ0, θ1 so as to minimize J(θ0, θ1). That is
why we follow the partial derivatives (∂/∂θj) J(θ0, θ1). Here is the intuition of gradient descent.

Figure 2.2: Gradient descent

In Figure 2.2, we can see that gradient descent automatically takes smaller steps as we approach a
local minimum, so there is no need to decrease α over time.

2.1.4 Gradient Descent For Linear Regression

Substituting the derivatives of J into the general gradient descent definition above, gradient descent for linear regression is

repeat until convergence:

θ0 := θ0 − α (∂/∂θ0) J(θ0, θ1) = θ0 − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )

θ1 := θ1 − α (∂/∂θ1) J(θ0, θ1) = θ1 − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x^(i)

(simultaneous update)

In linear regression the cost function is convex, so gradient descent always reaches the global minimum; there
are no other local minima to get stuck in.
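The two update rules above can be implemented directly. This is a minimal sketch: the toy data and the choices α = 0.1 and 1000 iterations are our own illustrative assumptions, not values from the lecture.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=1000):
    """Run the simultaneous theta0/theta1 updates for univariate linear regression."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        h = theta0 + theta1 * x
        grad0 = np.sum(h - y) / m            # (1/m) sum(h(x_i) - y_i)
        grad1 = np.sum((h - y) * x) / m      # (1/m) sum((h(x_i) - y_i) * x_i)
        # Simultaneous update: both gradients use the same old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by y = 1 + 2x
t0, t1 = gradient_descent(x, y)      # converges near (1, 2)
```

The tuple assignment on one line is what makes the update simultaneous; updating θ0 first and then using the new θ0 inside θ1's gradient would be the incorrect sequential variant.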

Machine Learning 2016

Lecture 3: Linear Regression with Multiple Variables


Lecturer: Andrew Ng Scribe: Minsu Kim

3.1 Multiple Features

Unlike univariate linear regression, linear regression with multiple variables uses more features to predict the output
more accurately. It is also called "multivariate linear regression".

3.1.1 Hypothesis Function for Multiple Features

Now we introduce notation for equations with any number of variables.

m = number of training examples
n = number of features
x^(i) = input (features) of the i-th training example
x_j^(i) = value of feature j in the i-th training example

Then we can define the hypothesis function for multiple features as follows:

hθ(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3 + ... + θn xn

For convenience of notation, we define x0 = 1 and obtain the more compact form:

hθ(x) = [ θ0 θ1 ... θn ] [ x0 ; x1 ; ... ; xn ] = Θ^T x

Now we collect all m training examples, each with n features, and record them in an (n + 1) × m matrix,
as shown here:

X = [ x0^(1)  x0^(2)  ...  x0^(m)          [ 1       1       ...  1
      x1^(1)  x1^(2)  ...  x1^(m)      =     x1^(1)  x1^(2)  ...  x1^(m)
      ...                                    ...
      xn^(1)  xn^(2)  ...  xn^(m) ]          xn^(1)  xn^(2)  ...  xn^(m) ]

And then the hypothesis for all m examples at once is the 1 × m row vector

hθ = [ θ0 θ1 ... θn ] X = Θ^T X

3.1.2 Cost Function for Multiple Variables

For multiple variables hθ(x) = Θ^T x, and the parameter vector Θ is an (n+1)-dimensional vector. Then the cost
function is:

J(Θ) = (1/2m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )^2

And the vectorized version, with ~y the m-vector of training outputs and X^T Θ the m-vector of predictions, is:

J(Θ) = (1/2m) (X^T Θ − ~y)^T (X^T Θ − ~y)

3.1.3 Gradient Descent for Multiple Variables

From the general gradient descent form, gradient descent for multiple variables is:

repeat until convergence:

θj := θj − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x_j^(i)

(for j = 0, 1, 2, ..., n)
(simultaneous update)
And the vectorized version is:

Θ := Θ − α ∇J(Θ)

where ∇J(Θ) is the (n+1)-dimensional gradient vector

∇J(Θ) = [ ∂J(Θ)/∂θ0 ; ∂J(Θ)/∂θ1 ; ... ; ∂J(Θ)/∂θn ]

The j-th component of ∇J(Θ) can be vectorized as

∂J(Θ)/∂θj = (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x_j^(i)
          = (1/m) Σ_{i=1}^m x_j^(i) ( hθ(x^(i)) − y^(i) )
          = (1/m) ~x_j^T (X^T Θ − ~y)

where ~x_j is the m-vector of the j-th feature across all training examples (the j-th row of X, transposed).

Stacking all n + 1 components, we get

∇J(Θ) = (1/m) X (X^T Θ − ~y)

From those vectorized versions, we can express vectorized gradient descent as

Θ := Θ − α ∇J(Θ)
   = Θ − α (1/m) X (X^T Θ − ~y)

3.1.4 Feature Scaling

Previously, we learned how to choose Θ for predicting our output. Now we study a faster way to
choose Θ. The idea is to put the features on a similar scale. For example,

x1 = size (0 ~ 2000 feet^2)
x2 = number of bedrooms (1 ~ 5)

x1's range is much bigger than x2's, so gradient descent will take a long time to reach the global minimum. Therefore, we
should get every feature into approximately a −1 ≤ x ≤ 1 range.
- Mean normalization
Replace xi with xi − µi to make the features have approximately zero mean. [NOTE: do not apply to x0 = 1]

xi := (xi − µi) / si,   where µi = average of xi in the training set
                              si = range of xi (max − min)

3.1.5 Learning Rate

In gradient descent

θj := θj − α (∂/∂θj) J(θ)

the constant α is called the "learning rate". It affects the convergence of θ. We know that J(θ)
should decrease after every iteration, as below.

Figure 3.1: proper α

Figure 3.2: not proper α

However, if J(θ) increases or oscillates as in Figure 3.2, you should use a smaller α.
Therefore:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration and may not converge.
To choose α, try values like (..., 0.001, 0.01, 0.1, 1, ...).
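A small experiment, with made-up data, showing both behaviors described above: J decreases every iteration for a proper α but blows up when α is too large. The specific values 0.1 and 1.0 are assumptions chosen for this toy problem.

```python
import numpy as np

X = np.array([[1.0, 1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0, 3.0]])   # (n+1) x m layout, row 0 is x0 = 1
y = np.array([1.0, 3.0, 5.0, 7.0])
m = X.shape[1]

def cost(theta):
    r = X.T @ theta - y
    return r @ r / (2 * m)

def run(alpha, iters=50):
    """Return the history of J(theta) for one choice of learning rate."""
    theta = np.zeros(2)
    history = []
    for _ in range(iters):
        theta -= alpha * X @ (X.T @ theta - y) / m
        history.append(cost(theta))
    return history

good = run(0.1)   # J decreases on every iteration
bad = run(1.0)    # J grows: alpha is too large for this problem
```

Plotting (or just printing) the two histories reproduces Figures 3.1 and 3.2: one monotone decreasing curve, one diverging curve.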

3.1.6 Features and Polynomial Regression

From our features x1, x2 we can create a new feature x3 = x1 × x2. In some problems we should create new
features from the original features like this.
Suppose our hypothesis function is

hθ(x) = θ0 + θ1 x1

which is a linear function. We can change the behavior of its curve by adding a power of x1:

hθ(x) = θ0 + θ1 x1 + θ2 x1^2

In this function we have made a new feature x2 = x1^2.
By making hθ quadratic, cubic, or some other form, you can improve your hypothesis function.
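Creating polynomial features is just stacking powers of the original column; the helper name below is our own. Note that x1, x1^2, x1^3 have very different ranges, so the feature scaling of Section 3.1.4 becomes important here.

```python
import numpy as np

def add_polynomial_features(x1):
    """Build the feature columns [x1, x1^2, x1^3] from one original feature."""
    return np.column_stack([x1, x1 ** 2, x1 ** 3])

x1 = np.array([1.0, 2.0, 3.0])
features = add_polynomial_features(x1)   # shape (3, 3), one row per example
```

Feeding these columns into ordinary linear regression fits a cubic curve in x1 while the learning algorithm itself stays completely unchanged.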

3.1.7 Normal Equation

The normal equation is a method of solving for θ analytically.

We set every partial derivative of the cost function to zero,

∂J(θ)/∂θj = 0   (for all j),

and solve for θ. For m examples and n features, the solution is the normal equation

Θ = (X^T X)^{−1} X^T y   (Θ ∈ ℝ^{n+1})

where

x^(i) = [ x0^(i) ; x1^(i) ; x2^(i) ; ... ; xn^(i) ]   and   X = [ (x^(1))^T ; (x^(2))^T ; ... ; (x^(m))^T ]

Note that here X is the m × (n + 1) design matrix whose rows are the training examples, the transpose of the matrix used in Section 3.1.1.

If we use the normal equation, there is no need for feature scaling.
Now we compare gradient descent to the normal equation.

Gradient Descent:
- Need to choose α
- Needs many iterations
- Works well even when n is large

Normal Equation:
- No need to choose α
- No need to iterate
- Need to compute (X^T X)^{−1}
- Slow if n is very large
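The normal equation in NumPy, using this section's design-matrix layout (one example per row); the data is an invented toy set. `np.linalg.solve` is used instead of forming the explicit inverse, which is the numerically safer route to the same Θ.

```python
import numpy as np

# Design matrix with one example per row: columns are [x0, x1], x0 = 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by y = 1 + 2*x1

# Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

No learning rate, no iterations, no feature scaling: one linear solve recovers θ exactly, at the cost of an O(n^3) factorization that becomes slow when n is very large.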

Machine Learning 2016

Lecture 4: Logistic Regression


Lecturer: Andrew Ng Scribe: Minsu Kim

4.1 Classification

As we mentioned in Lecture 1, classification is predicting a discrete-valued output (0, 1) from a labeled data
set; that is, y ∈ {0, 1}. The hypothesis function for regression is not always between 0 and 1, so we
cannot use it and need a new function.
"Logistic regression" is the algorithm defined for classification.

4.2 Hypothesis representation for Classification

4.2.1 Logistic Regression Model

We want the condition 0 ≤ hθ(x) ≤ 1. Under this condition, the hypothesis function is defined as

hθ(x) = g(θ^T x),   where   g(z) = 1 / (1 + e^{−z})

Figure 4.1: Sigmoid function g(z)

The sigmoid function g(z) satisfies the condition on hθ(x). The threshold classifier output is at 0.5:
- If hθ(x) ≥ 0.5, predict "y = 1"
- If hθ(x) < 0.5, predict "y = 0"
Therefore, hθ(x) is the estimated probability that y = 1 on input x. In other words, we can write hθ(x) as

hθ(x) = P(y = 1 | x; θ)
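The sigmoid hypothesis and the 0.5 threshold rule can be sketched as follows; the function names and the example θ, x values are our own.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Threshold h_theta(x) = g(theta^T x) at 0.5."""
    return 1 if sigmoid(theta @ x) >= 0.5 else 0

theta = np.array([0.0, 1.0])    # illustrative parameters, x = [x0, x1]
x = np.array([1.0, 5.0])
label = predict(theta, x)       # theta^T x = 5 > 0, so predict y = 1
```

Since g(z) ≥ 0.5 exactly when z ≥ 0, thresholding the probability at 0.5 is the same as checking the sign of θ^T x, which is what the next section's decision boundary makes explicit.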

4.2.2 Decision Boundary

In logistic regression there is a decision boundary. Suppose we predict

"y = 1" if hθ(x) ≥ 0.5: g(z) ≥ 0.5 when z ≥ 0, i.e. when θ^T x ≥ 0

"y = 0" if hθ(x) < 0.5: g(z) < 0.5 when z < 0, i.e. when θ^T x < 0

4.2.3 Cost Function for Classification

For logistic regression we could first try the cost function from linear regression:

J(θ) = (1/m) cost(hθ(x), y)

where

cost(hθ(x), y) = Σ_{i=1}^m cost(hθ(x^(i)), y^(i)) = Σ_{i=1}^m (1/2) ( hθ(x^(i)) − y^(i) )^2

Unlike linear regression, however, hθ(x) now contains the nonlinear term 1/(1 + e^{−z}), so the graph of J(θ) is non-convex, as follows:

Figure 4.2: Non-convex cost function

A non-convex form does not guarantee finding the global minimum. Therefore, we redefine the logistic regression cost function
so that it becomes convex.

- Logistic regression cost function
It is redefined as

cost(hθ(x), y) = −log(hθ(x))       if y = 1
cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0

Figure 4.3: cost(hθ(x), y)

cost(hθ(x), y) = 0 if hθ(x) = y

cost(hθ(x), y) → ∞ if y = 0 and hθ(x) → 1
cost(hθ(x), y) → ∞ if y = 1 and hθ(x) → 0

4.2.4 Simplified Cost Function and Gradient Descent

We know the logistic regression cost function as

J(θ) = (1/m) Σ_{i=1}^m cost(hθ(x^(i)), y^(i))

cost(hθ(x), y) = −log(hθ(x))       if y = 1
cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0

[NOTE: y = 0 or 1, always]

Because of this NOTE, we can compress the two cases into one expression:

cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))

Therefore,

J(θ) = −(1/m) Σ_{i=1}^m [ y^(i) log(hθ(x^(i))) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

And then gradient descent can be written simply as

repeat for min_θ J(θ):

θj := θj − α (∂/∂θj) J(θ)

(simultaneous update)

with J(θ) as above.
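A sketch of the simplified cost and one gradient step in NumPy. Rows of X are examples here, and the tiny data set, α, and iteration count are invented illustrative choices; the label rule (y = 1 iff x1 ≥ 2) is our own toy construction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) sum[ y log h + (1-y) log(1-h) ]."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(y)
    h = sigmoid(X @ theta)
    return theta - alpha * X.T @ (h - y) / m

# Tiny separable toy set: label 1 iff x1 >= 2 (columns are [x0, x1]).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.5)
```

The gradient has exactly the same shape as in linear regression, X^T(h − y)/m; only h changed from θ^T x to the sigmoid of θ^T x, which is the point made in the next section.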

4.2.5 Multiclass Classification

This gradient descent algorithm looks identical to the one for linear regression, but hθ(x) is different in each case:

Linear regression:   hθ(x) = Θ^T x

Logistic regression: hθ(x) = 1 / (1 + e^{−Θ^T x})

Now we approach classification with more than two categories. In this case there are more than two outputs.
If a problem has n classes, we train one classifier hθ^(i)(x) per class and predict the probability of each:

hθ^(i)(x) = P(y = i | x; θ)   (i = 1, 2, ..., n)

On a new input x, to make a prediction, pick the class i that maximizes hθ^(i)(x).
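One-vs-all prediction is just an argmax over the per-class probabilities. The three "trained" parameter vectors below are invented for illustration, not the result of actual training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_multiclass(thetas, x):
    """Pick the class whose classifier reports the highest probability."""
    probs = [sigmoid(theta @ x) for theta in thetas]
    return int(np.argmax(probs))

# Three hypothetical already-trained classifiers, one row per class.
thetas = np.array([[ 2.0, -1.0],    # class 0
                   [-1.0,  0.5],    # class 1
                   [-4.0,  1.5]])   # class 2
x = np.array([1.0, 4.0])            # x0 = 1, x1 = 4
label = predict_multiclass(thetas, x)
```

Each classifier only answers "is it my class or not?"; the argmax resolves the n binary answers into a single predicted class.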

Machine Learning 2016

Lecture 5: Regularization
Lecturer: Andrew Ng Scribe: Minsu Kim

5.1 The Problem of Overfitting

Suppose we are trying to predict housing prices, and we have the three kinds of hypothesis functions shown in Figure 5.1.

Figure 5.1: Prediction

case1 : hθ (x) = θ0 + θ1 x
case2 : hθ (x) = θ0 + θ1 x + θ2 x2
case3 : hθ (x) = θ0 + θ1 x + θ2 x2 + · · · + θ6 x6

In case 1, the function does not fit the data very well ("underfitting").

In case 2, the function fits properly.
In case 3, the function fits the training data well, but too well ("overfitting").

- Overfitting: if we have too many features, the learned hypothesis may fit the training set very well but
fail to generalize to new examples.

5.2 Addressing Overfitting

There are two options:

1. Reduce the number of features.
- Manually select which features to keep
- Model selection algorithm
But this option throws away some of our information.
2. Regularization.
- Keep all the features, but reduce the magnitude/values of the parameters θj
- Works well when we have a lot of features, each of which contributes a bit to predicting y

5.3 Regularized Cost Function

Our hypothesis function is

hθ(x) = θ0 + θ1 x + θ2 x^2 + θ3 x^3 + θ4 x^4

Suppose hθ(x) is currently overfitting, and we penalize θ3, θ4 to make them really small. The result is that hθ(x)
becomes close to a quadratic function, and we have addressed the overfitting. Without actually eliminating these features, we can
just modify our cost function:

min_θ (1/2m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )^2 + 1000·θ3^2 + 1000·θ4^2

To minimize this, the values of θ3, θ4 must shrink toward zero.

We can also regularize all of our θ parameters in a single summation, except θ0:

min_θ (1/2m) [ Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )^2 + λ Σ_{j=1}^n θj^2 ]

(λ is the regularization parameter)

Because θ0 is the bias term, we exclude it from the regularization sum.

5.4 Regularized Linear Regression

5.4.1 Gradient Descent with Regularization

repeat until convergence:

θ0 := θ0 − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x0^(i)

θj := θj − α [ (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) xj^(i) + (λ/m) θj ]

(for j = 1, ..., n)

(simultaneous update)

The (λ/m)θj term performs the regularization. We can also rearrange the θj update as

θj := θj (1 − α λ/m) − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) xj^(i)

The factor (1 − α λ/m) is slightly less than 1, so it shrinks θj a little on every update.
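One regularized update step in NumPy for the linear-regression case, with invented data and arbitrary α, λ. Note the bias component of the penalty is zeroed out, matching the rule above that θ0 is not regularized.

```python
import numpy as np

def regularized_step(theta, X, y, alpha, lam):
    """One gradient step; theta_0 (the bias) is not regularized."""
    m = len(y)
    h = X @ theta                        # linear hypothesis, rows are examples
    grad = X.T @ (h - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                         # do not penalize theta_0
    return theta - alpha * (grad + reg)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])      # generated by y = 1 + 2*x1
theta = np.zeros(2)
for _ in range(2000):
    theta = regularized_step(theta, X, y, alpha=0.1, lam=1.0)
```

Because of the penalty, the converged θ1 ends up slightly below the unregularized value 2: the shrinkage factor (1 − αλ/m) pulls every penalized weight toward zero on each update.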

5.4.2 Regularized Normal Equation

The normal equation with regularization is defined as

θ = (X^T X + λ·L)^{−1} X^T y

where L is the (n + 1) × (n + 1) diagonal matrix

L = diag(0, 1, 1, ..., 1)
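The same regularized solution can be obtained in closed form; the toy data and λ = 1 are invented. L is the identity with the (0, 0) entry zeroed so that the bias term is left unpenalized.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by y = 1 + 2*x1
lam = 1.0

n_plus_1 = X.shape[1]
L = np.eye(n_plus_1)
L[0, 0] = 0.0                         # leave the bias term unpenalized

# theta = (X^T X + lambda * L)^{-1} X^T y, via a linear solve.
theta = np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

An extra benefit of the λ·L term: it makes the matrix being inverted strictly better conditioned, so the regularized normal equation can be solvable even when X^T X itself is singular (e.g. with redundant features or m ≤ n).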

5.5 Regularized Logistic Regression

The regularized logistic regression cost function is defined as

J(θ) = −(1/m) Σ_{i=1}^m [ y^(i) log(hθ(x^(i))) + (1 − y^(i)) log(1 − hθ(x^(i))) ] + (λ/2m) Σ_{j=1}^n θj^2

and with this J(θ) we can write gradient descent as

repeat until convergence:

θ0 := θ0 − α (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) x0^(i)

θj := θj − α [ (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) xj^(i) + (λ/m) θj ]

(where hθ(x) = 1 / (1 + e^{−Θ^T x}), for j = 1, ..., n)
(simultaneous update)

This is identical to the gradient descent for linear regression except hθ (x).

