
DATA SCIENCE CONCENTRATED TRAINING PROGRAM
Institute of Education of the Republic of Azerbaijan

INTRODUCTION: Machine Learning

2
3
4
5
6
KNOWLEDGE GRAPH

7
MACHINE
LEARNING

Transformative power
Has affected nearly all industries
Enabled by growth in data, hardware, and algorithms

8
DOTA 2

9
ALPHAGO
10
11
12
13
PROBLEM TYPES

14
SUPERVISED LEARNING

15
SUPERVISED LEARNING

16
REGRESSION

17
CLASSIFICATION

18
TRADITIONAL APPROACH

19
MACHINE LEARNING
APPROACH

20
21
WE CAN LEARN FROM
MACHINES

22
23
BRAIN TUMOR DETECTION

24
VOICE ASSISTANTS

25
26
27
28
29
SALES PREDICTION

30
"Field of study that gives computers the ability to learn without being explicitly programmed."
Arthur Samuel, 1959

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
Tom Mitchell, 1997

31
UNSUPERVISED LEARNING

32
ANOMALY DETECTION

33
REGRESSION OR
CLASSIFICATION?
 Predicting the amount of rainfall
 Predicting the score of a team
 Predicting the breed of an animal
 Estimating sales and price of a product
 Churn prediction

34
LINEAR REGRESSION
35
36
37
REGRESSION PROBLEM

38
HOUSE PRICE PREDICTION

39
HOUSE PRICE PREDICTION

40
NOTATION
$x$ - input variables (e.g. area), features
$y$ - output, target (e.g. price)
$(x^{(i)}, y^{(i)})$ - training example
$\{(x^{(i)}, y^{(i)})\; ; \; i = 1, \dots, n\}$ - training set, a list of n training examples

41
PROBLEM FORMULATION
Learn a function $h : X \to Y$

such that $h(x)$ is a good predictor for the corresponding value of $y$

42
HOUSE PRICE PREDICTION
$x_1^{(1)} = 2104$
$x_2^{(1)} = 3$
$x_2^{(3)} = ?$

43
HYPOTHESIS

$w$ - parameters, weights

44
COMPACT NOTATION
If $x_0 = 1$ (intercept term or bias term),

$h_w(x) = w_0 x_0 + w_1 x_1 + \dots + w_d x_d$ becomes $h_w(x) = w^T x$
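
A minimal NumPy sketch (not from the slides) of this compact notation: a column of ones is prepended so that $h_w(x) = w^T x$. The first row reuses the area/bedroom values from the house price slide; the remaining rows are made up for illustration.

```python
import numpy as np

# Hypothetical feature matrix: area (sq ft) and number of bedrooms per house.
# The first row matches the example on the house price slide; the rest are invented.
X = np.array([[2104, 3],
              [1600, 3],
              [2400, 3],
              [1416, 2]], dtype=float)

# Prepend x0 = 1 (intercept / bias term) to every training example.
X = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.zeros(X.shape[1])  # parameters (weights) w0, w1, w2

def h(w, X):
    """Compact linear hypothesis h_w(x) = w^T x, applied to every row of X."""
    return X @ w
```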

45
HOW TO LEARN?

46
HOW TO LEARN?

47
COST FUNCTION

$J(w) = \frac{1}{2} \sum_{i=1}^{n} \bigl( h_w(x^{(i)}) - y^{(i)} \bigr)^2$

Known as the least-squares cost function, used in the Ordinary Least Squares regression model
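
A short sketch of this cost in code, assuming the $\frac{1}{2}\sum$ form above and reusing `X`, `h`, and NumPy from the previous sketch:

```python
def cost(w, X, y):
    """Least-squares cost J(w) = 1/2 * sum_i (h_w(x^(i)) - y^(i))^2."""
    residuals = h(w, X) - y
    return 0.5 * np.sum(residuals ** 2)
```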

48
LEARNING
Objective: minimize cost function
Algorithm: gradient descent
Idea: start with some values of $w$, update them repeatedly
to get to the minimum value of $J(w)$

49
HOW?

50
GRADIENT DESCENT

$w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$

Simultaneously performed for all values of $j$

$\alpha$ – learning rate

51
DERIVING COST FUNCTION
For one training example
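
The derivation itself did not survive extraction; the standard single-example computation this slide presumably walks through is:

$\frac{\partial}{\partial w_j} J(w) = \frac{\partial}{\partial w_j} \frac{1}{2}\bigl(h_w(x) - y\bigr)^2 = \bigl(h_w(x) - y\bigr) \cdot \frac{\partial}{\partial w_j}\Bigl(\sum_{k=0}^{d} w_k x_k - y\Bigr) = \bigl(h_w(x) - y\bigr)\, x_j$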

52
UPDATE RULE
For one training example
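
The rule itself did not survive extraction; substituting the derivative above into the gradient descent step gives the usual least-mean-squares update, which is presumably what this slide shows:

$w_j := w_j + \alpha \bigl(y^{(i)} - h_w(x^{(i)})\bigr)\, x_j^{(i)}$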

53
BATCH GRADIENT DESCENT
For the whole training set

54
BATCH GRADIENT DESCENT
Vector form
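
A hedged NumPy sketch (not the slides' code) of batch gradient descent in vector form, reusing the hypothetical `X` and `h` from the earlier sketches; the target vector `y` below is likewise made up.

```python
# Hypothetical house prices (in $1000s) for the rows of X above.
y = np.array([400.0, 330.0, 369.0, 232.0])

def batch_gradient_descent(X, y, alpha=5e-8, n_iters=10000):
    """Batch gradient descent: every step uses the whole training set.
    Vector form of the update: w := w + alpha * X^T (y - X w)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = X.T @ (y - X @ w)   # summed over all training examples
        w = w + alpha * gradient       # simultaneous update of every w_j
    return w

w = batch_gradient_descent(X, y)
```

The tiny learning rate is only needed because the area feature is left unscaled in this toy example; in practice features would be normalized first.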

55
OPTIMIZATION
For this optimization (cost minimization) problem:
 $J(w)$ – convex quadratic function
 Only one global minimum
 No other local minima
 Gradient descent always converges (with a suitable learning rate)

56
LEARNING VISUALIZATION

57
MACHINE LEARNT!

58
MACHINE LEARNT!

59
[Figure slides: left panel plots $h_w(x)$ (for fixed $w$, this is a function of $x$); right panel plots $J(w)$ (a function of the parameters $w$)]

68
LEARNING VISUALIZATION

69
STOCHASTIC GRADIENT
DESCENT

70
STOCHASTIC GRADIENT
DESCENT
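
The SGD slides are image-only in this extraction; a minimal sketch of the idea (update on one randomly ordered training example at a time), using the same hypothetical data as above:

```python
def stochastic_gradient_descent(X, y, alpha=5e-8, n_epochs=200, seed=0):
    """Stochastic gradient descent: update w using a single example per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):      # shuffle the training set each epoch
            error = y[i] - X[i] @ w            # residual on one example
            w = w + alpha * error * X[i]       # single-example update
    return w
```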

71
YET ANOTHER OPTION

72
LEARNING RATE

73
LOGISTIC REGRESSION (Lecture 3)

74
QUESTION

Suppose we set .

What is  for the linear hypothesis ?

75
QUESTION

Suppose we set .

What is  for the linear hypothesis ?

76
QUESTION
Assume that we are using the training set given below.
If the cost function is defined as

$J(w_0, w_1) = \frac{1}{2} \sum_{i=1}^{n} \bigl( h_w(x^{(i)}) - y^{(i)} \bigr)^2$

$h_w(x) = w_0 + w_1 x$

x   y
3   4
2   1
4   3
0   1

What is  for the linear hypothesis?

77
QUESTION
Assume that we are using the training set given below.
If the cost function is defined as

$J(w_0, w_1) = \frac{1}{2} \sum_{i=1}^{n} \bigl( h_w(x^{(i)}) - y^{(i)} \bigr)^2$

$h_w(x) = w_0 + w_1 x$

x   y
3   4
2   1
4   3
0   1
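
The parameter values the question asks about are not recoverable from this extraction; as a purely hypothetical illustration, taking $w_0 = 0$, $w_1 = 1$ (so $h_w(x) = x$) on the table above gives:

$J(0, 1) = \tfrac{1}{2}\bigl[(3-4)^2 + (2-1)^2 + (4-3)^2 + (0-1)^2\bigr] = \tfrac{1}{2} \cdot 4 = 2$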

78
QUESTION
 Even if the learning rate α is very large, every iteration of gradient descent will decrease the value of J(w)
 Setting the learning rate α to be very small is not harmful and can only speed up the convergence of gradient descent

79
QUESTION
 Even if the learning rate α is very large, every iteration of gradient descent will decrease the value of J(w)
If the learning rate α is too large, one step of gradient descent can actually vastly "overshoot" and actually increase the value of J(w).

 Setting the learning rate α to be very small is not harmful and can only speed up the convergence of gradient descent
If the learning rate is small, gradient descent ends up taking an extremely small step on each iteration, so this would actually slow down (rather than speed up) the convergence of the algorithm.

80
QUESTION
 If w0 and w1 are initialized at a local minimum, then one iteration will not change their values.
 If w0 and w1 are initialized at the global minimum, then one iteration will not change their values.
 If w0 and w1 are initialized so that w0 = w1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent we will still have w0 = w1.

81
QUESTION
If w0 and w1 are initialized at a local minimum, then one iteration will not change their values.
 At a local minimum, the derivative (gradient) is zero, so gradient descent will not change the parameters.

If w0 and w1 are initialized at the global minimum, then one iteration will not change their values.
 At the global minimum, the derivative (gradient) is zero, so gradient descent will not change the parameters.

If w0 and w1 are initialized so that w0 = w1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent we will still have w0 = w1.
 The updates to w0 and w1 are different (even though we're doing simultaneous updates), so there's no particular reason to expect them to be the same after one iteration of gradient descent.

82
QUESTION
Assume that J(w0, w1) = 0.
Which of the statements below must be true?
 For this to be true, we must have w0 = 0 and w1 = 0 so that h_w(x) = 0
 We can perfectly predict the value of y even for new examples that we have not yet seen. (e.g., we can perfectly predict prices of even new houses that we have not yet seen.)
 This is not possible: by the definition of J(w0, w1), it is not possible for there to exist w0 and w1 so that J(w0, w1) = 0

83
QUESTION
For this to be true, we must have w0 = 0 and w1 = 0 so that h_w(x) = 0
If J(w0, w1) = 0, that means the line defined by the equation "y = w0 + w1 x" perfectly fits all of our data. There's no particular reason to expect that the values of w0 and w1 that achieve this are both 0 (unless y(i) = 0 for all of our training examples).

We can perfectly predict the value of y even for new examples that we have not yet seen. (e.g., we can perfectly predict prices of even new houses that we have not yet seen.)
Even though we can fit our training set perfectly, this does not mean that we'll always make perfect predictions on houses in the future / on houses that we have not yet seen.

This is not possible: by the definition of J(w0, w1), it is not possible for there to exist w0 and w1 so that J(w0, w1) = 0
If all of our training examples lie perfectly on a line, then J(w0, w1) = 0 is possible.

84
CLASSIFICATION PROBLEM
 y takes a small number of discrete values
 Binary classification problem – only two values, 0 and 1
 0 – negative class, 1 – positive class
 x – features, y – label

85
EXAMPLES
 Spam filter
 COVID-19 detection
 Fraud detection

86
EXAMPLE

[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

87
EXAMPLE
$h_w(x) = w^T x$
[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

88
LINEAR HYPOTHESIS

89
EXAMPLE

[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

90
EXAMPLE

[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

91
EXAMPLE

[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

92
HYPOTHESIS

$0 \le h_w(x) \le 1$

93
HYPOTHESIS
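
The body of this slide is image-only in the extraction; the standard logistic (sigmoid) hypothesis it presumably introduces is:

$h_w(x) = g(w^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}, \qquad 0 \le h_w(x) \le 1$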

94
SIGMOID FUNCTION
Threshold classifier output at 0.5:

if $h_w(x) \ge 0.5$, predict y = 1
if $h_w(x) < 0.5$, predict y = 0
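
A short NumPy sketch (an illustration, not the slides' code) of the sigmoid and the 0.5 threshold rule, assuming $h_w(x) = g(w^T x)$ as above:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, X, threshold=0.5):
    """Predict y = 1 when h_w(x) = sigmoid(w^T x) >= threshold, else y = 0."""
    return (sigmoid(X @ w) >= threshold).astype(int)
```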

95
HYPOTHESIS
$h_w(x)$ - estimated probability that y = 1 on input x
Example:
If $h_w(x)$ = 0.7, tell the patient that there is a 70% chance of the tumor being malignant

$h_w(x) = P(y = 1 \mid x; w)$ – "probability that y = 1, given x, parameterized by w"

$P(y = 0 \mid x; w) + P(y = 1 \mid x; w) = 1$
$P(y = 0 \mid x; w) = 1 - P(y = 1 \mid x; w)$

96
QUESTION
Suppose we want to predict, from data x about a tumor, whether it is malignant (y = 1)
or benign (y = 0). Our logistic regression classifier outputs, for a specific tumor,
$h_w(x) = P(y = 1 \mid x; w) = 0.7$, so we estimate that there is a 70% chance of this tumor being
malignant. What should be our estimate for $P(y = 0 \mid x; w)$, the probability that the tumor is benign?
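
Using the identity from the previous slide, the estimate follows directly:

$P(y = 0 \mid x; w) = 1 - P(y = 1 \mid x; w) = 1 - 0.7 = 0.3$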

97
DERIVING SIGMOID
FUNCTION

98
DERIVING SIGMOID
FUNCTION

99
DERIVING SIGMOID
FUNCTION

100
HOW TO LEARN THIS TIME?

101
PROBABILISTICALLY
SPEAKING...

102
MAXIMUM LIKELIHOOD
ESTIMATION

103
LOG LIKELIHOOD
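
The formula did not survive extraction; for logistic regression with $P(y = 1 \mid x; w) = h_w(x)$ and independent training examples, the likelihood and its log are presumably:

$L(w) = \prod_{i=1}^{n} h_w(x^{(i)})^{\,y^{(i)}} \bigl(1 - h_w(x^{(i)})\bigr)^{1 - y^{(i)}}$

$\ell(w) = \log L(w) = \sum_{i=1}^{n} \Bigl[ y^{(i)} \log h_w(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_w(x^{(i)})\bigr) \Bigr]$

This is the quantity whose negative mean appears as the cost function a few slides later.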

104
DERIVING LOG LIKELIHOOD

105
UPDATE RULE

Looks familiar?
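
The rule itself did not survive extraction; maximizing the log-likelihood above by (stochastic) gradient ascent yields an update of the same form as the linear regression rule, which is presumably why the slide asks "Looks familiar?":

$w_j := w_j + \alpha \bigl(y^{(i)} - h_w(x^{(i)})\bigr)\, x_j^{(i)}$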

106
COST FUNCTION

$J(w) = -\frac{1}{n} \sum_{i=1}^{n} \Bigl[ y^{(i)} \log h_w(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_w(x^{(i)})\bigr) \Bigr]$
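
A compact NumPy sketch (an illustration, not the slides' code) of this cost and a plain gradient descent loop that minimizes it, reusing the hypothetical `sigmoid` helper above:

```python
def logistic_cost(w, X, y):
    """J(w) = -1/n * sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]."""
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def logistic_gradient_descent(X, y, alpha=0.1, n_iters=5000):
    """Minimize the logistic cost with batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ w)
        gradient = X.T @ (h - y) / len(y)   # gradient of J(w)
        w = w - alpha * gradient
    return w
```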

107
COST FUNCTION
If y = 1: Loss = $-\log h_w(x)$
if $h_w(x) = 1$ then Loss = 0; as $h_w(x) \to 0$, Loss $\to \infty$
[Plot: Loss vs. $h_w(x)$ on [0, 1] for y = 1]

108
COST FUNCTION
If y = 0: Loss = $-\log(1 - h_w(x))$
if $h_w(x) = 0$ then Loss = 0; as $h_w(x) \to 1$, Loss $\to \infty$
[Plot: Loss vs. $h_w(x)$ on [0, 1] for y = 0]

109
QUESTION
Suppose you are running gradient descent to fit a logistic regression model with
parameter w. Which of the following is a reasonable way to make sure the learning rate α
is set properly and that gradient descent is running correctly?

A. Plot J(w) as a function of the number of iterations and make sure J(w) is decreasing
on every iteration.
B. Plot J(w) as a function of the number of iterations and make sure J(w) is decreasing
on every iteration.
C. Plot J(w) as a function of w and make sure it is decreasing on every iteration.
D. Plot J(w) as a function of w and make sure it is convex.

110
