Lecture 1: Introduction
Lecturer: Andrew Ng Scribe: Minsu Kim
Before studying machine learning algorithms, we should know what machine learning is. Two people have tried to define it as follows:
Arthur Samuel. Machine Learning: the field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (well-posed learning problem). A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
In supervised learning, we are given a data set and already know the relationship between input and output. There are two kinds of supervised learning.
1.2.1 Regression
If we have a data set of housing prices in which the right answers are given, how could we predict the housing price for a particular size? In Figure 1.1, the red 'X' points are the data set and the two lines are the output prediction lines. From those lines, we could predict the housing price for some particular size. This process is regression. In other words, regression is predicting a continuous-valued output from a labeled data set.
1.2.2 Classification
Classification is similar to regression in that we predict an output from a labeled data set. However, the output is not continuous-valued.
Suppose we have the data set about tumors above. We are trying to predict whether a tumor is malignant or not according to its size. Classification, then, is predicting a discrete-valued output (0, 1) from a labeled data set.
Unlike supervised learning, unsupervised learning lets us address problems where the data set gives no indication of what the results should look like. We can instead derive clusters from data in which we don't know the effect of the variables. That is, nothing tells us what the correct result is; we just obtain clusters that are somehow similar or related.
For example, we could collect 1000 articles about the Greek economy and find a way to group them into small clusters by topic, sentence structure, number of pages, and so on.
Lecture 2: Linear Regression with One Variable
Lecturer: Andrew Ng Scribe: Minsu Kim
In the previous lecture, we studied the regression problem. In a regression problem, we try to predict a continuous-valued output from a data set in which the relationship between input and output is already known.
Linear regression with one variable is also called "univariate linear regression". If you are trying to predict a single output value from a single input value, and the relationship between input and output is already known, you should use univariate linear regression. The hypothesis is
hθ(x) = θ0 + θ1·x
We should choose θ0, θ1 so that hθ(x) is close to the y's for our training examples, i.e., so that hθ maps from the x's to the y's like the blue line in the figure below.
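A minimal sketch of this hypothesis in Python (the function name and the example parameter values are ours, not from the lecture):

def h(theta0, theta1, x):
    # Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

# Hypothetical parameters: predict the price of a 1500 square-foot house.
print(h(50.0, 0.1, 1500.0))  # -> 200.0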
2.1.2 Cost Function
We can measure how close hθ is with a cost function: we take the average of the squared error of hθ over all inputs against the actual outputs. That is why it is called the "squared error function" or "mean squared error". The cost function is defined as
J(θ0, θ1) = (1/(2m)) · Σ_{i=1}^{m} ( hθ(x^(i)) − y^(i) )²
The (1/(2m)) term is there for convenient computation of the gradient descent, as differentiating the squared term cancels the (1/2) factor.
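A minimal sketch of this cost function (the names and data are illustrative):

import numpy as np

def cost(theta0, theta1, x, y):
    # J(theta0, theta1) = (1/(2m)) * sum over i of (h_theta(x_i) - y_i)^2
    m = len(x)
    squared_errors = ((theta0 + theta1 * x) - y) ** 2
    return squared_errors.sum() / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(cost(0.0, 1.0, x, y))  # a perfect fit gives J = 0.0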
From the cost function, we can concretely measure the accuracy of our hθ against the training examples. The more accurate our hθ is, the closer J(θ0, θ1) is to zero.
2.1.3 Gradient Descent
We now have a hypothesis function and a way of measuring how accurate it is. Next, we study a way of automatically improving the hypothesis function: gradient descent.
The gradient descent update is defined as
θj := θj − α · (∂/∂θj) J(θ0, θ1)    (for j = 0 and j = 1, simultaneous update)
Our optimization objective for the learning algorithm is to fit θ0, θ1 so as to minimize J(θ0, θ1). That is why we take (∂/∂θj) J(θ0, θ1). Here is the intuition of gradient descent.
In Figure 2.2, we can see that gradient descent automatically takes smaller steps as we approach a local minimum, so there is no need to decrease α over time.
2.1.4 Gradient Descent For Linear Regression
From the general gradient descent definition above, we can write gradient descent for linear regression as
θ0 := θ0 − α · (1/m) · Σ_{i=1}^{m} ( hθ(x^(i)) − y^(i) )
θ1 := θ1 − α · (1/m) · Σ_{i=1}^{m} ( hθ(x^(i)) − y^(i) ) · x^(i)
(simultaneous update)
In linear regression, the cost function is convex, so gradient descent always finds the global minimum; it cannot get stuck in a local minimum.
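A minimal sketch of these updates for univariate linear regression (the learning rate, iteration count, and data are illustrative):

import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y
        # Compute both gradients first, then update simultaneously.
        grad0 = error.sum() / m
        grad1 = (error * x).sum() / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(gradient_descent(x, y))  # approaches (0.0, 2.0)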
Lecture 3: Linear Regression with Multiple Variables
Lecturer: Andrew Ng Scribe: Minsu Kim
Unlike univariate linear regression, linear regression with multiple variables uses more variables to predict the output more accurately. It is also called "multivariate linear regression". The hypothesis is
hθ(x) = θ0 + θ1·x1 + θ2·x2 + θ3·x3 + … + θn·xn
Now we collect all m training examples, each with n features, and record them in an (n + 1) × m matrix with one example per column, where the extra first row holds x0 = 1 for every example:

X = [ 1        1        …  1
      x1^(1)   x1^(2)   …  x1^(m)
      ⋮        ⋮            ⋮
      xn^(1)   xn^(2)   …  xn^(m) ]

And then, with the parameter row vector Θ^T = [θ0, θ1, …, θn],

hθ = Θ^T X
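A minimal NumPy sketch of this layout and the vectorized hypothesis (the feature values and parameters are illustrative):

import numpy as np

# m = 3 training examples with n = 2 features, one example per column,
# plus a leading row of ones for x0.
features = np.array([[2104.0, 1416.0, 1534.0],   # x1: size in square feet
                     [5.0,    3.0,    3.0]])     # x2: number of bedrooms
X = np.vstack([np.ones(features.shape[1]), features])  # shape (n+1) x m

Theta = np.array([50.0, 0.1, 10.0])  # hypothetical (n+1)-dimensional parameters

# h_theta = Theta^T X: all m predictions in one matrix product.
print(Theta @ X)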
3.1.2 Cost Function for Multiple Variables
For multiple variables, hθ = Θ^T X and the parameter vector Θ is an (n+1)-dimensional vector. Then the cost function is:
J(Θ) = (1/(2m)) · Σ_{i=1}^{m} ( hθ(x^(i)) − y^(i) )²
From the general gradient descent form, gradient descent for multiple variables is:
θj := θj − α · (1/m) · Σ_{i=1}^{m} ( hθ(x^(i)) − y^(i) ) · xj^(i)    (for j = 0, 1, 2, …, n, simultaneous update)
And the vectorized version is:
Θ := Θ − α·∇J(Θ)
where ∇J(Θ) is the vector of partial derivatives:
∇J(Θ) = [ ∂J(Θ)/∂θ0, ∂J(Θ)/∂θ1, …, ∂J(Θ)/∂θn ]^T
And then, writing ~y for the m-dimensional vector of training outputs,
∇J(Θ) = (1/m) · X (X^T Θ − ~y)
so
Θ := Θ − α·∇J(Θ)
  := Θ − α · (1/m) · X (X^T Θ − ~y)
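A minimal sketch of this vectorized update, keeping the (n + 1) × m layout above (α, the iteration count, and the data are illustrative):

import numpy as np

def gradient_descent_vec(X, y, alpha=0.1, iterations=1000):
    # X: (n+1) x m design matrix with a leading row of ones; y: m-vector.
    m = X.shape[1]
    theta = np.zeros(X.shape[0])
    for _ in range(iterations):
        gradient = X @ (X.T @ theta - y) / m
        theta = theta - alpha * gradient  # updates every theta_j at once
    return theta

X = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(gradient_descent_vec(X, y))  # approaches [1.0, 2.0]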
Previously, we learned how to choose Θ to predict our output. Now we study a way to make gradient descent converge faster. The idea is to bring the features onto a similar scale. For example,
x1 = size (0 ∼ 2000 feet²)
x2 = number of bedrooms (1 ∼ 5)
x1's range is much bigger than x2's, so it will take a long time to reach the global minimum. Therefore, we should get every feature into approximately a −1 ≤ x ≤ 1 range.
- Mean normalization
Replace xi with xi − µi to make the features have approximately zero mean. [NOTE: do not apply to x0 = 1]
xi := (xi − µi) / si,  where µi is the average of xi in the training set and si is the range of xi (max − min).
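A minimal sketch of mean normalization, one feature per row as in the matrix layout above (the data are illustrative):

import numpy as np

def mean_normalize(features):
    # features: n x m, excluding the x0 = 1 row (which must not be scaled).
    mu = features.mean(axis=1, keepdims=True)  # per-feature average
    s = features.max(axis=1, keepdims=True) - features.min(axis=1, keepdims=True)  # range
    return (features - mu) / s

features = np.array([[2104.0, 1416.0, 1534.0],  # size
                     [5.0,    3.0,    3.0]])    # bedrooms
print(mean_normalize(features))  # every entry now lies roughly in [-1, 1]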
In gradient descent
θj := θj − α · (∂/∂θj) J(θ)
we see α, which is called the "learning rate". It also affects the convergence of θ. We know that J(θ) should decrease after every iteration, as below.
Figure 3.2: an improperly chosen α
However, if J(θ) increases or oscillates as in Figure 3.2, you should use a smaller α.
Therefore,
- If α is too small : slow convergence.
- If α is too large : J(θ) may not decrease on every iteration, may not converge.
To choose α, try values such as …, 0.001, 0.01, 0.1, 1, …, as in the sketch below.
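A minimal sketch of such a sweep, monitoring whether J(θ) decreases at every iteration (the candidate values, data, and iteration count are illustrative):

import numpy as np

X = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
m = X.shape[1]

for alpha in [0.001, 0.01, 0.1, 1.0]:
    theta = np.zeros(X.shape[0])
    costs = []
    for _ in range(100):
        theta = theta - alpha * X @ (X.T @ theta - y) / m
        costs.append(np.sum((X.T @ theta - y) ** 2) / (2 * m))
    decreasing = all(a >= b for a, b in zip(costs, costs[1:]))
    print(f"alpha={alpha}: J decreased every iteration: {decreasing}")

On this data, alpha = 1.0 makes J(θ) blow up, which is exactly the symptom Figure 3.2 describes.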
From our features x1, x2, we can create a new feature x3 = x1 × x2. In some problems, we should create new features like this from the original ones.
If our hypothesis function is
hθ(x) = θ0 + θ1·x1
we can change the behavior of its curve; right now, hθ is a linear function. We can duplicate the variable x1 to get a new function:
hθ(x) = θ0 + θ1·x1 + θ2·x1²
In that function, we have created a new feature x2 = x1².
By making hθ quadratic, cubic, or any other form, you can improve your hypothesis function.
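A minimal sketch of building polynomial features (the degree and data are illustrative); note that feature scaling becomes important here, since x1² has a much larger range than x1:

import numpy as np

def polynomial_design_matrix(x1, degree):
    # Rows: x0 = 1, x1, x1^2, ..., x1^degree; one example per column.
    return np.vstack([x1 ** d for d in range(degree + 1)])

x1 = np.array([1.0, 2.0, 3.0])
print(polynomial_design_matrix(x1, degree=2))
# [[1. 1. 1.]
#  [1. 2. 3.]
#  [1. 4. 9.]]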
If we use the normal equation, feature scaling is not needed. With the (n + 1) × m matrix X above and the output vector ~y, the normal equation gives the optimal parameters in closed form:
Θ = (X X^T)⁻¹ X ~y
Now we compare gradient descent to the normal equation: gradient descent needs a choice of α and many iterations but scales well to large n, while the normal equation needs no α and no iterations but must solve an (n + 1) × (n + 1) system, which is slow when n is large.
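A minimal sketch of the normal equation under the same (n + 1) × m layout (the data are illustrative):

import numpy as np

def normal_equation(X, y):
    # Solve (X X^T) Theta = X y instead of forming the inverse explicitly.
    return np.linalg.solve(X @ X.T, X @ y)

X = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(normal_equation(X, y))  # exactly [1.0, 2.0]; no iterations, no alpha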
Lecture 4: Logistic Regression
Lecturer: Andrew Ng Scribe: Minsu Kim
4.1 Classification
As mentioned in Lecture 1, classification is predicting a discrete-valued output (0, 1) from a labeled data set. This means y ∈ {0, 1}. The hypothesis function for regression is not always within 0 to 1, so we cannot use it and need a new function.
"Logistic regression" is the classification algorithm defined for this purpose.
We want the condition 0 ≤ hθ(x) ≤ 1. With this condition, the hypothesis function is defined as
hθ(x) = g(θ^T x)
g(z) = 1 / (1 + e^(−z))
The sigmoid function g(z) satisfies the condition on hθ(x). The classifier output is thresholded at 0.5:
- If hθ(x) ≥ 0.5, predict "y = 1"
- If hθ(x) < 0.5, predict "y = 0"
Therefore, hθ(x) is the estimated probability that y = 1 on input x. In other words, we can write hθ(x) as
hθ(x) = P(y = 1 | x; θ)
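A minimal sketch of the sigmoid hypothesis and the 0.5 threshold (the parameter values are hypothetical):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)) maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # h_theta(x) = g(theta^T x) is the estimated probability that y = 1.
    return 1 if sigmoid(theta @ x) >= 0.5 else 0

theta = np.array([-3.0, 1.0])  # hypothetical parameters
x = np.array([1.0, 4.0])       # x0 = 1, x1 = 4
print(predict(theta, x))       # theta^T x = 1 >= 0, so predict y = 1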
4.2.2 Decision Boundary
In logistic regression, there is a decision boundary: the set of points where θ^T x = 0 (i.e., hθ(x) = 0.5), separating the region where we predict y = 1 from the region where we predict y = 0.
Unlike linear regression, hθ(x) is complicated by the 1/(1 + e^(−z)) term, and the graph of the squared-error J(θ) is non-convex, as follows:
A non-convex form does not guarantee finding the global minimum. Therefore, we redefine the logistic regression cost function to make it convex.
- Logistic regression cost function
It is redefined as
cost(hθ(x), y) = −log(hθ(x))        if y = 1
cost(hθ(x), y) = −log(1 − hθ(x))    if y = 0
[NOTE: y = 0 or 1 always]
Because of the NOTE, we can simplify the cost function to
cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))
Therefore,
J(θ) = −(1/m) · Σ_{i=1}^{m} [ y^(i)·log(hθ(x^(i))) + (1 − y^(i))·log(1 − hθ(x^(i))) ]
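A minimal sketch of this cost (the data are illustrative; it omits numerical safeguards against log(0)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # X: (n+1) x m design matrix; y: m-vector of 0/1 labels.
    m = X.shape[1]
    h = sigmoid(X.T @ theta)  # predicted probabilities for all m examples
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

X = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(logistic_cost(np.array([-5.0, 2.0]), X, y))  # about 0.18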
The gradient descent update is
θj := θj − α · (1/m) · Σ_{i=1}^{m} ( hθ(x^(i)) − y^(i) ) · xj^(i)    (simultaneous update)
where
J(θ) = −(1/m) · Σ_{i=1}^{m} [ y^(i)·log(hθ(x^(i))) + (1 − y^(i))·log(1 − hθ(x^(i))) ]
This algorithm looks identical to the one for linear regression, but hθ(x) differs between the two, as follows:
Lecture 5: Regularization
Lecturer: Andrew Ng Scribe: Minsu Kim
Suppose we are trying to predict housing prices, and we have the three kinds of hypothesis function in Figure 5.1:
case1 : hθ (x) = θ0 + θ1 x
case2 : hθ (x) = θ0 + θ1 x + θ2 x2
case3 : hθ (x) = θ0 + θ1 x + θ2 x2 + · · · + θ6 x6
- Overfitting: if we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples.
5.3 Regularized Cost Function
min_θ (1/(2m)) · [ Σ_{i=1}^{m} ( hθ(x^(i)) − y^(i) )² + λ · Σ_{j=1}^{n} θj² ]
(λ is the regularization parameter)
5.4.1 Regularized Gradient Descent
The (λ/m)·θj term in the resulting gradient performs the regularization. We can now write the update for θj as
θj := θj·(1 − α·λ/m) − α · (1/m) · Σ_{i=1}^{m} ( hθ(x^(i)) − y^(i) ) · xj^(i)
The factor (1 − α·λ/m) is less than 1, so each update also has the effect of shrinking θj.
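A minimal sketch of this regularized update for linear regression (λ, α, the iteration count, and the data are illustrative); θ0 is deliberately left unregularized:

import numpy as np

def regularized_step(theta, X, y, alpha, lam):
    # X: (n+1) x m design matrix; y: m-vector of outputs.
    m = X.shape[1]
    gradient = X @ (X.T @ theta - y) / m
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0  # convention: do not regularize theta_0
    return theta * shrink - alpha * gradient

X = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = regularized_step(theta, X, y, alpha=0.1, lam=1.0)
print(theta)  # theta_1 is shrunk toward 0 compared with the unregularized [1.0, 2.0]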
5.4.2 Regularized Normal Equation
With regularization, the normal equation becomes
Θ = (X X^T + λ·L)⁻¹ X ~y,  where L is the (n + 1) × (n + 1) identity matrix with its top-left entry set to 0, so that θ0 is not regularized.
The gradient descent update, in turn, is identical to the one for linear regression except for hθ(x).
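A minimal sketch of this closed form (λ and the data are illustrative):

import numpy as np

def regularized_normal_equation(X, y, lam):
    # X: (n+1) x m design matrix; y: m-vector of outputs.
    L = np.eye(X.shape[0])
    L[0, 0] = 0.0  # do not regularize theta_0
    return np.linalg.solve(X @ X.T + lam * L, X @ y)

X = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(regularized_normal_equation(X, y, lam=0.1))  # close to the unregularized [1.0, 2.0]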