
DATA SCIENCE CONCENTRATED TRAINING PROGRAM
Institute of Education of the Republic of Azerbaijan

INTRODUCTION: Machine Learning

2
3
4
5
6
KNOWLEDGE GRAPH

7
MACHINE
LEARNING

Transformative power
Has affected nearly all industries
Enabled by growth in data, hardware, and algorithms

8
DOTA 2

9
ALPHAGO
10
11
12
13
PROBLEM TYPES

14
SUPERVISED LEARNING

15
SUPERVISED LEARNING

16
REGRESSION

17
CLASSIFICATION

18
TRADITIONAL APPROACH

19
MACHINE LEARNING
APPROACH

20
21
WE CAN LEARN FROM
MACHINES

22
23
BRAIN TUMOR DETECTION

24
VOICE ASSISTANTS

25
26
27
28
29
SALES PREDICTION

30
"Field of study that gives computers the ability to learn without being explicitly programmed."
Arthur Samuel, 1959

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
Tom Mitchell, 1997

31
UNSUPERVISED LEARNING

32
ANOMALY DETECTION

33
REGRESSION OR
CLASSIFICATION?
 Predicting the amount of rainfall
 Predicting the score of a team
 Predicting the breed of an animal
 Estimating sales and price of a product
 Churn prediction

34
LINEAR REGRESSION
35
36
37
REGRESSION PROBLEM

38
HOUSE PRICE PREDICTION

39
HOUSE PRICE PREDICTION

40
NOTATION
$x$ - input variables (e.g. area), features
$y$ - output, target (e.g. price)
$(x^{(i)}, y^{(i)})$ - training example
$\{(x^{(i)}, y^{(i)})\; ; \; i = 1, \dots, n\}$ - training set, a list of n training examples

41
PROBLEM FORMULATION
Learn a function $h : X \to Y$

such that $h(x)$ is a good predictor for the corresponding value of $y$

42
HOUSE PRICE PREDICTION
$x_1^{(1)} = 2104$
$x_2^{(1)} = 3$
$x_2^{(3)} = ?$

43
HYPOTHESIS

$w$ - parameters, weights

44
COMPACT NOTATION
If $x_0 = 1$ (intercept term or bias term),

$h_w(x) = w_0 x_0 + w_1 x_1 + \dots + w_d x_d$ becomes $h_w(x) = w^T x$
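
A minimal NumPy sketch (not from the slides) of this compact notation: a column of ones is prepended so that $h_w(x) = w^T x$. The first row reuses the area/bedroom values from the house price slide; the remaining rows are made up for illustration.

```python
import numpy as np

# Hypothetical feature matrix: area (sq ft) and number of bedrooms per house.
# The first row matches the example on the house price slide; the rest are invented.
X = np.array([[2104, 3],
              [1600, 3],
              [2400, 3],
              [1416, 2]], dtype=float)

# Prepend x0 = 1 (intercept / bias term) to every training example.
X = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.zeros(X.shape[1])  # parameters (weights) w0, w1, w2

def h(w, X):
    """Compact linear hypothesis h_w(x) = w^T x, applied to every row of X."""
    return X @ w
```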

45
HOW TO LEARN?

46
HOW TO LEARN?

47
COST FUNCTION

$J(w) = \frac{1}{2} \sum_{i=1}^{n} \bigl( h_w(x^{(i)}) - y^{(i)} \bigr)^2$

Known as the least-squares cost function, used in the Ordinary Least Squares regression model
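
A short sketch of this cost in code, assuming the $\frac{1}{2}\sum$ form above and reusing `X`, `h`, and NumPy from the previous sketch:

```python
def cost(w, X, y):
    """Least-squares cost J(w) = 1/2 * sum_i (h_w(x^(i)) - y^(i))^2."""
    residuals = h(w, X) - y
    return 0.5 * np.sum(residuals ** 2)
```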

48
LEARNING
Objective: minimize cost function
Algorithm: gradient descent
Idea: start with some values of $w$, update them repeatedly
to get to the minimum value of $J(w)$

49
HOW?

50
GRADIENT DESCENT

$w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$

Simultaneously performed for all values of $j$

$\alpha$ – learning rate

51
DERIVING COST FUNCTION
For one training example
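
The derivation itself did not survive extraction; the standard single-example computation this slide presumably walks through is:

$\frac{\partial}{\partial w_j} J(w) = \frac{\partial}{\partial w_j} \frac{1}{2}\bigl(h_w(x) - y\bigr)^2 = \bigl(h_w(x) - y\bigr) \cdot \frac{\partial}{\partial w_j}\Bigl(\sum_{k=0}^{d} w_k x_k - y\Bigr) = \bigl(h_w(x) - y\bigr)\, x_j$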

52
UPDATE RULE
For one training example
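
The rule itself did not survive extraction; substituting the derivative above into the gradient descent step gives the usual least-mean-squares update, which is presumably what this slide shows:

$w_j := w_j + \alpha \bigl(y^{(i)} - h_w(x^{(i)})\bigr)\, x_j^{(i)}$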

53
BATCH GRADIENT DESCENT
For the whole training set

54
BATCH GRADIENT DESCENT
Vector form
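
A hedged NumPy sketch (not the slides' code) of batch gradient descent in vector form, reusing the hypothetical `X` and `h` from the earlier sketches; the target vector `y` below is likewise made up.

```python
# Hypothetical house prices (in $1000s) for the rows of X above.
y = np.array([400.0, 330.0, 369.0, 232.0])

def batch_gradient_descent(X, y, alpha=5e-8, n_iters=10000):
    """Batch gradient descent: every step uses the whole training set.
    Vector form of the update: w := w + alpha * X^T (y - X w)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = X.T @ (y - X @ w)   # summed over all training examples
        w = w + alpha * gradient       # simultaneous update of every w_j
    return w

w = batch_gradient_descent(X, y)
```

The tiny learning rate is only needed because the area feature is left unscaled in this toy example; in practice features would be normalized first.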

55
OPTIMIZATION
For this optimization (cost minimization) problem:
 $J(w)$ – convex quadratic function
 Only one global minimum
 No other local minima
 Gradient descent always converges (with a suitable learning rate)

56
LEARNING VISUALIZATION

57
MACHINE LEARNT!

58
MACHINE LEARNT!

59
[Figure slides: left panel plots $h_w(x)$ (for fixed $w$, this is a function of $x$); right panel plots $J(w)$ (a function of the parameters $w$)]

68
LEARNING VISUALIZATION

69
STOCHASTIC GRADIENT
DESCENT

70
STOCHASTIC GRADIENT
DESCENT
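
The SGD slides are image-only in this extraction; a minimal sketch of the idea (update on one randomly ordered training example at a time), using the same hypothetical data as above:

```python
def stochastic_gradient_descent(X, y, alpha=5e-8, n_epochs=200, seed=0):
    """Stochastic gradient descent: update w using a single example per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):      # shuffle the training set each epoch
            error = y[i] - X[i] @ w            # residual on one example
            w = w + alpha * error * X[i]       # single-example update
    return w
```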

71
YET ANOTHER OPTION

72
LEARNING RATE

73
LOGISTIC REGRESSION (Lecture 3)

74
QUESTION

Suppose we set .

What is  for the linear hypothesis ?

75
QUESTION

Suppose we set .

What is  for the linear hypothesis ?

76
QUESTION
Assume that we are using the training set given below.
If the cost function is defined as

$J(w_0, w_1) = \frac{1}{2} \sum_{i=1}^{n} \bigl( h_w(x^{(i)}) - y^{(i)} \bigr)^2$

$h_w(x) = w_0 + w_1 x$

x   y
3   4
2   1
4   3
0   1

What is  for the linear hypothesis?

77
QUESTION
Assume that we are using the training set given below.
If the cost function is defined as

$J(w_0, w_1) = \frac{1}{2} \sum_{i=1}^{n} \bigl( h_w(x^{(i)}) - y^{(i)} \bigr)^2$

$h_w(x) = w_0 + w_1 x$

x   y
3   4
2   1
4   3
0   1
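
The parameter values the question asks about are not recoverable from this extraction; as a purely hypothetical illustration, taking $w_0 = 0$, $w_1 = 1$ (so $h_w(x) = x$) on the table above gives:

$J(0, 1) = \tfrac{1}{2}\bigl[(3-4)^2 + (2-1)^2 + (4-3)^2 + (0-1)^2\bigr] = \tfrac{1}{2} \cdot 4 = 2$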

78
QUESTION
 Even if the learning rate α is very large, every iteration of gradient descent will decrease the value of J(w)
 Setting the learning rate α to be very small is not harmful and can only speed up the convergence of gradient descent

79
QUESTION
 Even if the learning rate α is very large, every iteration of gradient descent will decrease the value of J(w)
If the learning rate α is too large, one step of gradient descent can actually vastly "overshoot" and actually increase the value of J(w).

 Setting the learning rate α to be very small is not harmful and can only speed up the convergence of gradient descent
If the learning rate is small, gradient descent ends up taking an extremely small step on each iteration, so this would actually slow down (rather than speed up) the convergence of the algorithm.

80
QUESTION
 If w0 and w1 are initialized at a local minimum, then one iteration will not change their values.
 If w0 and w1 are initialized at the global minimum, then one iteration will not change their values.
 If w0 and w1 are initialized so that w0 = w1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent we will still have w0 = w1.

81
QUESTION
If w0 and w1 are initialized at a local minimum, then one iteration will not change their values.
 At a local minimum, the derivative (gradient) is zero, so gradient descent will not change the parameters.

If w0 and w1 are initialized at the global minimum, then one iteration will not change their values.
 At the global minimum, the derivative (gradient) is zero, so gradient descent will not change the parameters.

If w0 and w1 are initialized so that w0 = w1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent we will still have w0 = w1.
 The updates to w0 and w1 are different (even though we're doing simultaneous updates), so there's no particular reason to expect them to be the same after one iteration of gradient descent.

82
QUESTION
Assume that J(w0, w1) = 0.
Which of the statements below must be true?
 For this to be true, we must have w0 = 0 and w1 = 0 so that h_w(x) = 0
 We can perfectly predict the value of y even for new examples that we have not yet seen. (e.g., we can perfectly predict prices of even new houses that we have not yet seen.)
 This is not possible: by the definition of J(w0, w1), it is not possible for there to exist w0 and w1 so that J(w0, w1) = 0

83
QUESTION
For this to be true, we must have w0 = 0 and w1 = 0 so that h_w(x) = 0
If J(w0, w1) = 0, that means the line defined by the equation "y = w0 + w1 x" perfectly fits all of our data. There's no particular reason to expect that the values of w0 and w1 that achieve this are both 0 (unless y(i) = 0 for all of our training examples).

We can perfectly predict the value of y even for new examples that we have not yet seen. (e.g., we can perfectly predict prices of even new houses that we have not yet seen.)
Even though we can fit our training set perfectly, this does not mean that we'll always make perfect predictions on houses in the future / on houses that we have not yet seen.

This is not possible: by the definition of J(w0, w1), it is not possible for there to exist w0 and w1 so that J(w0, w1) = 0
If all of our training examples lie perfectly on a line, then J(w0, w1) = 0 is possible.

84
CLASSIFICATION PROBLEM
 y takes a small number of discrete values
 Binary classification problem – only two values, 0 and 1
 0 – negative class, 1 – positive class
 x – features, y – label

85
EXAMPLES
 Spam filter
 COVID-19 detection
 Fraud detection

86
EXAMPLE

[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

87
EXAMPLE
$h_w(x) = w^T x$
[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

88
LINEAR HYPOTHESIS

89
EXAMPLE

[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

90
EXAMPLE

[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

91
EXAMPLE

[Plot: Malignant, 1 (Yes) / 0 (No), vs. Tumor Size; two panels]

92
HYPOTHESIS

$0 \le h_w(x) \le 1$

93
HYPOTHESIS
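
The body of this slide is image-only in the extraction; the standard logistic (sigmoid) hypothesis it presumably introduces is:

$h_w(x) = g(w^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}, \qquad 0 \le h_w(x) \le 1$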

94
SIGMOID FUNCTION
Threshold classifier output at 0.5:

if $h_w(x) \ge 0.5$, predict y = 1
if $h_w(x) < 0.5$, predict y = 0
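
A short NumPy sketch (an illustration, not the slides' code) of the sigmoid and the 0.5 threshold rule, assuming $h_w(x) = g(w^T x)$ as above:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, X, threshold=0.5):
    """Predict y = 1 when h_w(x) = sigmoid(w^T x) >= threshold, else y = 0."""
    return (sigmoid(X @ w) >= threshold).astype(int)
```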

95
HYPOTHESIS
$h_w(x)$ - estimated probability that y = 1 on input x
Example:
If $h_w(x)$ = 0.7, tell the patient that there is a 70% chance of the tumor being malignant

$h_w(x) = P(y = 1 \mid x; w)$ – "probability that y = 1, given x, parameterized by w"

$P(y = 0 \mid x; w) + P(y = 1 \mid x; w) = 1$
$P(y = 0 \mid x; w) = 1 - P(y = 1 \mid x; w)$

96
QUESTION
Suppose we want to predict, from data x about a tumor, whether it is malignant (y = 1)
or benign (y = 0). Our logistic regression classifier outputs, for a specific tumor,
$h_w(x) = P(y = 1 \mid x; w) = 0.7$, so we estimate that there is a 70% chance of this tumor being
malignant. What should be our estimate for $P(y = 0 \mid x; w)$, the probability that the tumor is benign?
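
Using the identity from the previous slide, the estimate follows directly:

$P(y = 0 \mid x; w) = 1 - P(y = 1 \mid x; w) = 1 - 0.7 = 0.3$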

97
DERIVING SIGMOID
FUNCTION

98
DERIVING SIGMOID
FUNCTION

99
DERIVING SIGMOID
FUNCTION

100
HOW TO LEARN THIS TIME?

101
PROBABILISTICALLY
SPEAKING...

102
MAXIMUM LIKELIHOOD
ESTIMATION

103
LOG LIKELIHOOD
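
The formula did not survive extraction; for logistic regression with $P(y = 1 \mid x; w) = h_w(x)$ and independent training examples, the likelihood and its log are presumably:

$L(w) = \prod_{i=1}^{n} h_w(x^{(i)})^{\,y^{(i)}} \bigl(1 - h_w(x^{(i)})\bigr)^{1 - y^{(i)}}$

$\ell(w) = \log L(w) = \sum_{i=1}^{n} \Bigl[ y^{(i)} \log h_w(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_w(x^{(i)})\bigr) \Bigr]$

This is the quantity whose negative mean appears as the cost function a few slides later.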

104
DERIVING LOG LIKELIHOOD

105
UPDATE RULE

Looks familiar?
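
The rule itself did not survive extraction; maximizing the log-likelihood above by (stochastic) gradient ascent yields an update of the same form as the linear regression rule, which is presumably why the slide asks "Looks familiar?":

$w_j := w_j + \alpha \bigl(y^{(i)} - h_w(x^{(i)})\bigr)\, x_j^{(i)}$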

106
COST FUNCTION

$J(w) = -\frac{1}{n} \sum_{i=1}^{n} \Bigl[ y^{(i)} \log h_w(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_w(x^{(i)})\bigr) \Bigr]$
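
A compact NumPy sketch (an illustration, not the slides' code) of this cost and a plain gradient descent loop that minimizes it, reusing the hypothetical `sigmoid` helper above:

```python
def logistic_cost(w, X, y):
    """J(w) = -1/n * sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]."""
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def logistic_gradient_descent(X, y, alpha=0.1, n_iters=5000):
    """Minimize the logistic cost with batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ w)
        gradient = X.T @ (h - y) / len(y)   # gradient of J(w)
        w = w - alpha * gradient
    return w
```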

107
COST FUNCTION
If y = 1: Loss = $-\log h_w(x)$
if $h_w(x) = 1$ then Loss = 0; as $h_w(x) \to 0$, Loss $\to \infty$
[Plot: Loss vs. $h_w(x)$ on [0, 1] for y = 1]

108
COST FUNCTION
If y = 0: Loss = $-\log(1 - h_w(x))$
if $h_w(x) = 0$ then Loss = 0; as $h_w(x) \to 1$, Loss $\to \infty$
[Plot: Loss vs. $h_w(x)$ on [0, 1] for y = 0]

109
QUESTION
Suppose you are running gradient descent to fit a logistic regression model with
parameter w. Which of the following is a reasonable way to make sure the learning rate α
is set properly and that gradient descent is running correctly?

A. Plot J(w) as a function of the number of iterations and make sure J(w) is decreasing
on every iteration.
B. Plot J(w) as a function of the number of iterations and make sure J(w) is decreasing
on every iteration.
C. Plot J(w) as a function of w and make sure it is decreasing on every iteration.
D. Plot J(w) as a function of w and make sure it is convex.

110
