TRAINING PROGRAM
INTRODUCTION: Machine Learning
KNOWLEDGE GRAPH
MACHINE LEARNING
Transformative power
Affected nearly all industries
Amount of data
Hardware
Algorithms
DOTA2
ALPHAGO
PROBLEM TYPES
SUPERVISED LEARNING
REGRESSION
CLASSIFICATION
TRADITIONAL APPROACH
MACHINE LEARNING APPROACH
WE CAN LEARN FROM MACHINES
BRAIN TUMOR DETECTION
VOICE ASSISTANTS
SALES PREDICTION
"Field of study that gives computers the ability to learn without being explicitly programmed."
Arthur Samuel, 1959
UNSUPERVISED LEARNING
ANOMALY DETECTION
REGRESSION OR CLASSIFICATION?
Predicting the amount of rainfall
Predicting the score of a team
Breed of an animal
Estimating sales and price of a product
Churn prediction
LINEAR REGRESSION
REGRESSION PROBLEM
HOUSE PRICE PREDICTION
NOTATION
x – input variables (e.g. area), features
y – output, target (e.g. price)
(x^(i), y^(i)) – training example
{(x^(i), y^(i)) : i = 1, …, n} – training set, a list of n training examples
PROBLEM FORMULATION
Learn a function h : X → Y so that h(x) is a good predictor for the corresponding value of y
HOUSE PRICE PREDICTION
x_1^(1) = 2104
x_2^(1) = 3
x_2^(3) = ?
HYPOTHESIS
h_w(x) = w_0 + w_1 x_1 + w_2 x_2
w – parameters, weights
COMPACT NOTATION
If x_0 = 1 (intercept term or bias term), the hypothesis becomes
h_w(x) = Σ_j w_j x_j = wᵀx
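As a quick numeric check (an illustrative sketch, not from the slides; the weights and features are made up), prepending x_0 = 1 turns the hypothesis into a plain dot product:

```python
# Illustrative check: with x0 = 1 prepended, the hypothesis
# h_w(x) = w0 + w1*x1 + w2*x2 equals the dot product w.x.
w = [2.0, 0.5, -1.0]   # w0 (bias), w1, w2
x = [1.0, 3.0, 4.0]    # x0 = 1 (intercept term), x1, x2

h = sum(wi * xi for wi, xi in zip(w, x))
print(h)  # 2.0 + 0.5*3.0 - 1.0*4.0 = -0.5
```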
HOW TO LEARN?
COST FUNCTION
J(w) = (1/2) Σ_{i=1}^{n} (h_w(x^(i)) − y^(i))²
LEARNING
Objective: minimize the cost function J(w)
Algorithm: gradient descent
Idea: start with some values of w, update them repeatedly to get to the minimum value of J(w)
HOW?
GRADIENT DESCENT
DERIVING COST FUNCTION
For one training example (x, y):
∂J(w)/∂w_j = (h_w(x) − y) x_j
UPDATE RULE
For one training example:
w_j := w_j − α (h_w(x) − y) x_j
BATCH GRADIENT DESCENT
For the whole training set:
w_j := w_j − α Σ_{i=1}^{n} (h_w(x^(i)) − y^(i)) x_j^(i)   (simultaneously for every j)
BATCH GRADIENT DESCENT
Vector form: w := w − α Xᵀ(Xw − y)
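The vector-form update can be sketched in a few lines. This is an illustrative version with made-up toy data (assuming the squared-error cost whose gradient is Xᵀ(Xw − y)), not the slides' own code:

```python
import numpy as np

# Minimal batch gradient descent for linear regression.
# X already contains a column of ones for the intercept term.
def batch_gd(X, y, alpha=0.01, iters=5000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= alpha * X.T @ (X @ w - y)   # w := w - alpha * X^T (Xw - y)
    return w

# Toy data generated from y = 1 + 2x, so an exact fit exists.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(batch_gd(X, y))  # approaches [1.0, 2.0]
```

With a sufficiently small α, each iteration moves w down the gradient of the convex cost, so the weights settle at the unique global minimum.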
OPTIMIZATION
For this optimization (cost minimization) problem:
J(w) is a convex quadratic function
Only one global minimum
No other local minima
Gradient descent always converges (given a suitable learning rate)
LEARNING VISUALIZATION
MACHINE LEARNT!
[Figure sequence: left, h_w(x) for fixed w (a function of x); right, the cost as a function of the parameters w]
LEARNING VISUALIZATION
STOCHASTIC GRADIENT DESCENT
STOCHASTIC GRADIENT DESCENT
Update the parameters using one training example at a time:
w_j := w_j − α (h_w(x^(i)) − y^(i)) x_j^(i)
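A minimal sketch of the per-example update on toy data (illustrative names and data, not the slides' code):

```python
import numpy as np

# Stochastic gradient descent: update w after each training
# example instead of after a full pass over the data.
def sgd(X, y, alpha=0.01, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):           # shuffle each epoch
            w -= alpha * (X[i] @ w - y[i]) * X[i]   # one-example update
    return w

# Same toy data as y = 1 + 2x with an intercept column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(sgd(X, y))  # wanders toward [1.0, 2.0]
```

Each update is cheap (one example), at the price of a noisier path toward the minimum than batch gradient descent.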
YET ANOTHER OPTION
LEARNING RATE
LOGISTIC REGRESSION (Lecture 3)
QUESTION
Suppose we set .
QUESTION
Assume that we are using the training set given below. If the cost function is defined as

J(w_0, w_1) = (1/2) Σ_{i=1}^{n} (h_w(x^(i)) − y^(i))²
h_w(x) = w_0 + w_1 x

x  y
3  4
2  1
4  3
0  1
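The cost on this training set can be checked directly. This is a worked sketch assuming the (1/2) Σ (h_w(x^(i)) − y^(i))² form with h_w(x) = w_0 + w_1 x; the example weights are arbitrary:

```python
# Worked check of the quiz cost function on the given training set.
data = [(3, 4), (2, 1), (4, 3), (0, 1)]

def cost(w0, w1):
    # J(w0, w1) = 1/2 * sum((h(x) - y)^2) with h(x) = w0 + w1*x
    return 0.5 * sum((w0 + w1 * x - y) ** 2 for x, y in data)

print(cost(0.0, 1.0))  # every prediction misses by 1: J = 0.5 * 4 = 2.0
```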
QUESTION
Even if the learning rate α is very large, every iteration of gradient descent will decrease the value of J(w).
Setting the learning rate α to be very small is not harmful and can only speed up the convergence of gradient descent.
QUESTION
Even if the learning rate α is very large, every iteration of gradient descent will decrease the value of J(w).
If the learning rate α is too large, one step of gradient descent can actually vastly "overshoot" and actually increase the value of J(w).
Setting the learning rate α to be very small is not harmful and can only speed up the convergence of gradient descent.
If the learning rate is small, gradient descent ends up taking an extremely small step on each iteration, so this would actually slow down (rather than speed up) the convergence of the algorithm.
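Both failure modes are easy to see on a toy problem (an illustrative sketch, not from the slides): gradient descent on J(w) = (w − 3)², whose gradient is 2(w − 3), run with a reasonable, a tiny, and an overly large learning rate.

```python
# Gradient descent on the 1-D quadratic J(w) = (w - 3)^2.
def run_gd(alpha, steps=50, w=0.0):
    for _ in range(steps):
        w -= alpha * 2 * (w - 3)   # gradient of J is 2*(w - 3)
    return w

print(run_gd(0.4))    # converges quickly to ~3
print(run_gd(0.001))  # barely moves: convergence is very slow
print(run_gd(1.1))    # overshoots: w blows up and J(w) increases
```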
QUESTION
If w_0 and w_1 are initialized at a local minimum, then one iteration will not change their values.
If w_0 and w_1 are initialized at the global minimum, then one iteration will not change their values.
If w_0 and w_1 are initialized so that w_0 = w_1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent, we will still have w_0 = w_1.
QUESTION
If w_0 and w_1 are initialized at a local minimum, then one iteration will not change their values.
At a local minimum, the derivative (gradient) is zero, so gradient descent will not change the parameters.
If w_0 and w_1 are initialized at the global minimum, then one iteration will not change their values.
At the global minimum, the derivative (gradient) is zero, so gradient descent will not change the parameters.
If w_0 and w_1 are initialized so that w_0 = w_1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent, we will still have w_0 = w_1.
The updates to w_0 and w_1 are different (even though we're doing simultaneous updates), so there's no particular reason to expect them to be the same after one iteration of gradient descent.
QUESTION
Assume that J(w_0, w_1) = 0. Which of the statements below must be true?
For this to be true, we must have w_0 = 0 and w_1 = 0 so that h_w(x) = 0
We can perfectly predict the value of y even for new examples that we have not yet seen. (e.g., we can perfectly predict prices of even new houses that we have not yet seen.)
This is not possible: By the definition of J, it is not possible for there to exist w_0 and w_1 so that J(w_0, w_1) = 0
QUESTION
For this to be true, we must have w_0 = 0 and w_1 = 0 so that h_w(x) = 0
If J(w_0, w_1) = 0, that means the line defined by the equation y = w_0 + w_1 x perfectly fits all of our data. There's no particular reason to expect that the values of w_0 and w_1 that achieve this are both 0 (unless y^(i) = 0 for all of our training examples).
We can perfectly predict the value of y even for new examples that we have not yet seen. (e.g., we can perfectly predict prices of even new houses that we have not yet seen.)
Even though we can fit our training set perfectly, this does not mean that we'll always make perfect predictions on houses in the future/on houses that we have not yet seen.
This is not possible: By the definition of J, it is not possible for there to exist w_0 and w_1 so that J(w_0, w_1) = 0
If all of our training examples lie perfectly on a line, then J(w_0, w_1) = 0 is possible.
CLASSIFICATION PROBLEM
A small number of discrete values
Binary classification problem – only two values, 0 and 1
0 – negative class, 1 – positive class
x – features, y – label
EXAMPLES
Spam filter
COVID-19 detection
Fraud detection
EXAMPLE
[Figure: malignant? (1 = yes, 0 = no) plotted against tumor size]
EXAMPLE
h_θ(x) = θᵀx
[Figure: malignant? (1 = yes, 0 = no) plotted against tumor size]
LINEAR HYPOTHESIS
EXAMPLE
[Figure sequence: malignant? (1 = yes, 0 = no) plotted against tumor size]
HYPOTHESIS
0 ≤ h_θ(x) ≤ 1
HYPOTHESIS
h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
SIGMOID FUNCTION
g(z) = 1 / (1 + e^(−z))
Threshold the classifier output at 0.5:
if h_θ(x) ≥ 0.5, predict y = 1
if h_θ(x) < 0.5, predict y = 0
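A minimal sketch of the sigmoid and the 0.5 threshold (illustrative code, not from the slides):

```python
import math

# The sigmoid g(z) = 1 / (1 + e^(-z)) squashes any real z into (0, 1);
# the classifier then thresholds it at 0.5.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(z):
    return 1 if sigmoid(z) >= 0.5 else 0

print(sigmoid(0.0))   # 0.5: the decision boundary is at z = 0
print(predict(2.0))   # 1
print(predict(-2.0))  # 0
```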
HYPOTHESIS
h_θ(x) – estimated probability that y = 1 on input x
Example:
If h_θ(x) = 0.7, tell the patient that there is a 70% chance of the tumor being malignant
QUESTION
Suppose we want to predict, from data x about a tumor, whether it is malignant (y=1) or benign (y=0). Our logistic regression classifier outputs, for a specific tumor, h_θ(x) = P(y=1 | x; θ) = 0.7, so we estimate that there is a 70% chance of this tumor being malignant. What should be our estimate for P(y=0 | x; θ), the probability the tumor is benign?
DERIVING SIGMOID FUNCTION
HOW TO LEARN THIS TIME?
PROBABILISTICALLY SPEAKING...
MAXIMUM LIKELIHOOD ESTIMATION
LOG LIKELIHOOD
ℓ(w) = Σ_{i=1}^{n} [ y^(i) log h_w(x^(i)) + (1 − y^(i)) log(1 − h_w(x^(i))) ]
DERIVING LOG LIKELIHOOD
UPDATE RULE
w_j := w_j + α (y^(i) − h_w(x^(i))) x_j^(i)
Looks familiar?
COST FUNCTION
J(w) = −(1/n) Σ_{i=1}^{n} [ y^(i) log h_w(x^(i)) + (1 − y^(i)) log(1 − h_w(x^(i))) ]
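The averaged cross-entropy cost J(w) and its gradient descent update can be sketched as follows (illustrative toy data and names, not the slides' code):

```python
import numpy as np

# Logistic regression: cost J(w) and its gradient descent update,
# which has the same form as the linear regression update.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def fit(X, y, alpha=0.1, iters=2000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ w)
        w -= alpha * X.T @ (h - y) / len(y)   # gradient of J(w)
    return w

# Toy 1-D data with an intercept column: label 1 iff x >= 2.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit(X, y)
print(cost(w, X, y))  # decreases toward 0 as the fit improves
```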
COST FUNCTION
If y = 1: Loss = −log h_w(x)
If h_w(x) = 1 then Loss = 0; as h_w(x) → 0, Loss → ∞
[Figure: Loss vs. h_w(x) on (0, 1] for y = 1]
COST FUNCTION
If y = 0: Loss = −log(1 − h_w(x))
If h_w(x) = 0 then Loss = 0; as h_w(x) → 1, Loss → ∞
[Figure: Loss vs. h_w(x) on [0, 1) for y = 0]
QUESTION
Suppose you are running gradient descent to fit a logistic regression model with parameter w. Which of the following is a reasonable way to make sure the learning rate α is set properly and that gradient descent is running correctly?
A. Plot as a function of the number of iterations and make sure J(w) is decreasing on every iteration.
B. Plot as a function of the number of iterations and make sure J(w) is decreasing on every iteration.
C. Plot J(w) as a function of w and make sure it is decreasing on every iteration.
D. Plot J(w) as a function of w and make sure it is convex.