
© 2019 - WorldQuant University – All rights reserved.


Revised: 08/19/2019
MScFE 650 Machine Learning in Finance – Table of Contents

Unit 1: Gradient Descent

Unit 2: Classification

Unit 3: Logistic Regression

Unit 4: Softmax

Unit 5: Training

MScFE 650 Machine Learning in Finance - Module 3: Unit 1 Video Transcript

Unit 1: Gradient Descent

In this video we will be discussing optimization and an algorithm called gradient descent. There are
many optimization techniques but gradient descent is one of the simplest and most widely used
algorithms in machine learning.

Now, I realize that students may be a little confused as we introduce the topics, but not to worry: in the last video for this week we tie all of the concepts together. My recommendation is that students first watch all of the videos before going over the notes and, if necessary, watch the videos again afterwards.

Why do we need an optimization algorithm at all? In machine learning tasks we have a model and
we need to optimize the model parameters to get a good fit. In short we need to minimize a cost
function in order to estimate the model parameters.

For example, in regression you will use the mean squared error, and in classification the categorical cross entropy, as cost functions. By minimizing the cost function we fit our model to the data, in the hope that it will generalize out of sample.

In this course we will always refer to finding the minimum; if it is the maximum we are after, we will just minimize the negative of the function.

Optimization has a very intuitive explanation. Imagine you are hiking in the mountains and want to find the bottom of a cup-shaped valley: the fitness landscape. Your vision is blurred and you can't rely on your eyesight. How do you know whether you are at the bottom of this valley?

One solution is to take a step in every direction; if every step moves you up, then surely you are at the bottom of the valley. We do the same thing in mathematics.

Objective function: $f(\bar{x})$

The first thing we need is a real-valued objective function; this is the function that we want to minimize. Let's call this objective function $f$, and it will be a function of a vector $\bar{x}$. It must be real-valued so that we can tell when $f$ is increasing and when $f$ is decreasing.


Figure 1.

The illustration shows something that looks cup-shaped, and we indicate the bottom, or the minimum, with $\bar{x}^*$. If we are at $\bar{x}^*$ and take a small step away from it, we add a small quantity $\epsilon\bar{x}$ to $\bar{x}^*$. As we vary the direction of $\bar{x}$ we step in all possible directions. The point is that $f(\bar{x}^* + \epsilon\bar{x})$ should always be larger than $f(\bar{x}^*)$:

$$\text{For a minimum:}\quad f(\bar{x}^* + \epsilon\bar{x}) > f(\bar{x}^*) \quad \forall\, \bar{x}$$

For a minimum we know that $f(\bar{x}^* + \epsilon\bar{x})$ should be bigger than $f(\bar{x}^*)$. This ensures that we always go uphill if we move away from $\bar{x}^*$, for every possible direction away from $\bar{x}^*$. We need to find a mathematical condition for when this is true. For this we use Taylor's theorem:

$$f(\bar{x}^* + \epsilon\bar{x}) = f(\bar{x}^*) + \epsilon\, \nabla f(\bar{x}^*)^\top \bar{x} + \text{H.O.T. (higher-order terms)}$$

Up to first order in $\epsilon$, $f(\bar{x}^* + \epsilon\bar{x})$ equals $f(\bar{x}^*)$ plus the term with the gradient of $f$. Since the gradient of $f$ is a vector, we take its inner product with $\epsilon\bar{x}$.

As a quick reminder of what the gradient looks like, let's write it down for only two dimensions and say that the vector $\bar{x}$ has two components, $x$ and $y$.


$$\bar{x} = \begin{bmatrix} x \\ y \end{bmatrix}, \qquad \nabla f(\bar{x}) = \begin{bmatrix} \partial f/\partial x \\ \partial f/\partial y \end{bmatrix}$$

The gradient of $f$ is the vector of partial derivatives of $f$ with respect to $x$ and $y$.
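As a quick sanity check, an analytic gradient can be compared against a finite-difference approximation. This is a minimal sketch for an assumed toy function $f(x, y) = x^2 + 3y^2$, not anything from the notes:

```python
import numpy as np

# Hypothetical toy function f(x, y) = x**2 + 3*y**2 with gradient (2x, 6y).
def f(v):
    x, y = v
    return x ** 2 + 3 * y ** 2

def grad_f(v):
    x, y = v
    return np.array([2 * x, 6 * y])

def numerical_gradient(func, v, h=1e-6):
    """Approximate the gradient with central finite differences."""
    g = np.zeros_like(v, dtype=float)
    for i in range(len(v)):
        step = np.zeros_like(v, dtype=float)
        step[i] = h
        g[i] = (func(v + step) - func(v - step)) / (2 * h)
    return g

v = np.array([1.0, -2.0])
print(grad_f(v))                 # analytic gradient: [  2. -12.]
print(numerical_gradient(f, v))  # finite-difference approximation
```

The two results should agree to several decimal places.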

Now we must return to the original equation. We have the Taylor expansion of $f(\bar{x}^* + \epsilon\bar{x})$, and of course $f(\bar{x}^* + \epsilon\bar{x})$ must be larger than $f(\bar{x}^*)$ for any possible choice of $\bar{x}$.

Let us look at the term containing the epsilon and the gradient. If we choose the components of $\epsilon\bar{x}$ to have the opposite sign to the corresponding components of the gradient, the result of the inner product is a negative quantity, i.e. we subtract from $f(\bar{x}^*)$. This is incompatible with the requirement that $f(\bar{x}^* + \epsilon\bar{x})$ must be larger than $f(\bar{x}^*)$, and we conclude that the gradient at a minimum has to be zero:

$$\nabla f(\bar{x}^*) = \bar{0}$$

This will also be true if $\bar{x}^*$ is a maximum, and it is therefore only a necessary, not a sufficient, condition. For a sufficient condition we need to go to the next term of the Taylor expansion. The next, second-order, term contains the Hessian matrix, which must be positive definite at a minimum.
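To make the second-order condition concrete, here is a minimal numerical check, again assuming the toy function $f(x, y) = x^2 + 3y^2$, whose Hessian is constant:

```python
import numpy as np

# Hessian of the assumed toy function f(x, y) = x**2 + 3*y**2.
H = np.array([[2.0, 0.0],
              [0.0, 6.0]])

eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)  # [2. 6.]

# All eigenvalues positive -> the Hessian is positive definite,
# so a point with zero gradient is indeed a minimum.
print(bool(np.all(eigenvalues > 0)))  # True
```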

The next question is: How do we find that minimum? Again, imagine you are on the side of a hill
and you want to get down to the bottom of the nearby valley. The most natural thing is to walk
down the slope. You only need to determine which direction is down, and away you go.

In mathematics things are equally simple. The gradient of a function points uphill, so to go down you move in its negative direction. If you selected an initial condition $\bar{x}_0$, you go from there in the negative gradient direction:

$$\bar{x}_1 = \bar{x}_0 - \alpha\, \nabla f(\bar{x}_0)$$

where $\alpha$ is a parameter that determines how far you go in the negative gradient direction. It is also known as the learning rate.


One obvious way that we can improve on this algorithm is to not keep the learning rate constant. If
you imagine going downhill, maybe you are on a steep hill and you want to run down, taking very
large steps, but if you keep taking those very large steps you are in danger of overshooting the
minimum at the bottom. You will want to reduce your steps, reducing the learning rate as you
approach the minimum.
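The walk downhill, including a gradually decaying learning rate, can be sketched in a few lines. The function, starting point, and decay schedule below are all assumptions chosen for illustration:

```python
import numpy as np

def grad_f(v):
    # Gradient of the assumed toy function f(x, y) = x**2 + 3*y**2.
    x, y = v
    return np.array([2 * x, 6 * y])

x = np.array([4.0, -3.0])   # initial condition x_0
for step in range(200):
    alpha = 0.1 / (1 + 0.01 * step)   # slowly reduce the learning rate
    x = x - alpha * grad_f(x)         # x_{k+1} = x_k - alpha * grad f(x_k)

print(x)  # very close to the true minimum at (0, 0)
```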

A problem with all gradient methods is that they reach the minimum that is downhill from where you start. There may actually be another, better minimum on the other side of the ridge, but there is no way of knowing. Getting stuck in local minima can pose serious problems.

There are strategies to combat this. One obvious strategy is to reinitialize in a different position and see if we can reach a better optimum. In a very high-dimensional space it is not feasible to search the whole space, because it may simply be too big and there may be too many local optima to get stuck in.

We can also try different algorithms, like simulated annealing or genetic algorithms, which are designed to be good at finding global optima.

In areas like deep learning, however, gradient descent is often used because the problem with these other algorithms is that, while they may find a good optimum, they can take a very, very long time to do so.

In deep learning particular attention is given to finding better initializations, and speeding up the
basic gradient descent method.

This video looked at gradient descent for optimization. Study the provided notes to learn more about the subject.

MScFE 650 Machine Learning in Finance - Module 3: Unit 2 Video Transcript

Unit 2: Classification

In this video we will be discussing classification, one of the classic problems of machine learning. Humans are very good at identifying objects and attaching labels to them. Computers are much less so, or at least they were until fairly recently, when some significant advances were made with regard to classification. We'll get there when we talk about deep learning.

What is classification?

The idea of classification is that you have an observation, or a set of features, which you want to assign to one of several classes. For a concrete example, we can think of handwritten digits from 0 to 9: 10 different digits, and we have images of all the digits. This data set is famously known as the MNIST data.

Now the question is: Can we get the computer to assign an image to one of the 10 class labels?

We humans do that automatically. When we see a 0, we know it is a 0 and we assign it automatically to the class 0. For a computer it is not as straightforward. In this week's peer review you will get an opportunity to build a classifier, where you will have to classify handwritten digits. This problem is seen as the "Hello World" of classification and computer vision.

Another famous example is the ImageNet challenge. The challenge consists of photographs of objects - for example, any type of chair - from different angles, different views, different colors, different shapes and sizes. You then need to classify each image into one of a thousand classes. That is a significantly more difficult problem, mainly because the dimensions are much higher and we have many more classes.

In 2012, there was a significant breakthrough when people started using deep networks and were
able to improve the classification accuracy from around 71.8% to 97.3% (Quartz, 2017), which is
better than human performance. Many now consider the ImageNet challenge solved.


How do we think of this in terms of mathematics?

As always, we need to translate what we've just said into mathematics. The best way to do that is to think of it as a probability problem: given an observation or a set of features $\bar{x}$, where $\bar{x}$ can be a very long vector, we want to calculate the probabilities that it belongs to the different classes.

An image can be seen as a matrix where each value represents a pixel. For MNIST we will use grayscale images, 20 by 20, where each pixel value is between 0 and 255, with 0 being white and 255 black.

In RGB images the dimensions will be 20 by 20 by 3.

If we start with a grayscale image, we stack all the rows (or the columns) to make a very long vector. If you start with a 10 by 10 image you can stack it, giving 10 times 10, which is 100 features in your feature vector. Given the feature vector, the idea is to calculate the class probabilities.
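In code, the stacking step is a single reshape. This sketch assumes a made-up 10-by-10 grayscale image:

```python
import numpy as np

# Hypothetical 10-by-10 grayscale image with integer pixel values in [0, 255].
rng = np.random.default_rng(seed=0)
image = rng.integers(0, 256, size=(10, 10))

# Stack the rows into a single feature vector of length 100.
x = image.reshape(-1)
print(image.shape)  # (10, 10)
print(x.shape)      # (100,)
```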

Our classification model will return the probabilities with which the observation is assigned to all
the different classes. For the digit problem, the model returns 10 probabilities. To which class do
we assign that digit? The simplest way would be to assign it to the class with the largest
probability.

We need a mathematical model for $p(C|\bar{x})$, and we learn the model from observations. For training, each observation comes with a label that tells us to which one of the classes the observation belongs. The problem becomes much more difficult if we do not have labeled data for training.

The generative approach and the discriminative approach

How do we calculate the class probabilities given observation $\bar{x}$? There are two main approaches that one can follow. One is named the generative approach and the other the discriminative approach.

With the generative approach, we build a mathematical model for each class in such a way that we
can generate samples that belong to the class, hence the name, generative model. In the


discriminative approach, all that we are interested in is building a model that will separate the different classes in the best possible way.

The generative approach

Since we have labels we can select all the observations for each class and then build a model
specifically for that class. For the digit problem we will have a very large number of samples and
for each sample we will know which digit that is. We can now collect all the 0’s and then we can
build a 0 model. We then collect all the 1’s and build a 1 model, and so on.

Later in the course we will learn how we can build quite sophisticated models. For now, you need
to imagine that we have samples for all the 0’s and we can simply fit a Gaussian density function to
that, which becomes our model.
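A minimal one-dimensional sketch of this idea, with made-up samples for two classes; "building the model" here is just estimating a mean and standard deviation per class:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical one-dimensional samples for two classes.
class_0 = np.array([1.1, 0.9, 1.0, 1.2, 0.8])
class_1 = np.array([3.0, 2.8, 3.2, 3.1, 2.9])

# Fit one Gaussian per class: this is the generative model.
mu0, s0 = class_0.mean(), class_0.std()
mu1, s1 = class_1.mean(), class_1.std()

# A new observation is scored under each class model.
x_new = 1.05
print(gaussian_pdf(x_new, mu0, s0) > gaussian_pdf(x_new, mu1, s1))  # True
```

Because each class model is fitted from that class's samples alone, nothing about class 1 influences the class-0 model, which is exactly the limitation mentioned below.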

A Gaussian density function may not be appropriate, in which case we need to have something
more sophisticated. Later in the course we will talk about Gaussian mixture models. With
Gaussian mixture models we can approximate any underlying density function.

If we go this route, the advantage is that it is fairly easy to build a model. The disadvantage is that we build a model for each class, and each class is treated by itself and does not take the properties of the other classes into account.

The discriminative approach

The discriminative approach is more or less the opposite. For this approach we feed all the information of all the classes into the training stage. During training we build a model that will discriminate, as well as possible, between the different classes. The power of discriminative models is much greater than that of generative models, but they tend to be harder to train.

In this video we introduced classification using the example of classifying handwritten digits. Next we learnt about the generative and discriminative approaches to model building. Study the provided notes to learn more about the subject.

MScFE 650 Machine Learning in Finance - Module 3: Unit 3 Video Transcript

Unit 3: Logistic Regression

In this video we will be discussing logistic regression. Logistic regression is the simplest form of classification: binary classification. The choice is only between two classes, yes or no. If we want to do something like signature verification, the two classes would be a forgery or a genuine signature. Logistic regression always deals with a two-class problem.

What we want is the probability of class 1, $C_1$, given $\bar{x}$. Let us write down Bayes' theorem. Remember that there is not much one can do in terms of probability: you have the sum rule and the product rule, and Bayes' theorem derives from the product rule.

In this case, the probability of class $C_1$ given $\bar{x}$ is, using Bayes' rule, equal to $p$ of $\bar{x}$ given class $C_1$ times the prior probability of class $C_1$, divided by the normalizing factor, which is $p$ of $\bar{x}$ given $C_1$ times $p$ of $C_1$ plus $p$ of $\bar{x}$ given class $C_2$ times the prior probability of class $C_2$:
¯ ¯

$$p(C_1|\bar{x}) = \frac{p(\bar{x}|C_1)\,p(C_1)}{p(\bar{x}|C_1)\,p(C_1) + p(\bar{x}|C_2)\,p(C_2)}$$

This equation is also useful to demonstrate the difference between the generative and
discriminative approaches.

For the generative approach, we build the model for each class. This means that we estimate the quantity $p(\bar{x}|C_1)$. Then we must assign prior probabilities to the classes. Once we have that, we can calculate the normalizing factor if necessary, but since the normalizing factor does not depend on the classes it only changes the value of the class probability, not the relative magnitudes. If we want to assign $\bar{x}$ to a specific class, one can normally ignore this normalizing factor.

For the discriminative approach, we have an expression for the right-hand side that bypasses this class model, $p(\bar{x}|C_1)$, completely. If we look at this equation, we see that the numerator also appears as a term in the denominator, so we can divide through by it. Then we end up with 1 over 1 plus –


and then, since we have divided, we end up with $p(\bar{x}|C_2)\,p(C_2)$ divided by $p(\bar{x}|C_1)\,p(C_1)$:

$$= \frac{1}{1 + \dfrac{p(\bar{x}|C_2)\,p(C_2)}{p(\bar{x}|C_1)\,p(C_1)}}$$

This we can write as 1 over 1 plus $e^{-a(\bar{x})}$, where we call the exponent minus $a$ of $\bar{x}$. In the discriminative approach we are going to find an approximation for this $a(\bar{x})$ function. If you want a mathematical expression for $a(\bar{x})$, write it down: $a(\bar{x})$ is minus the logarithm of $p(\bar{x}|C_2)\,p(C_2)$ divided by $p(\bar{x}|C_1)\,p(C_1)$, or equivalently

$$a(\bar{x}) = \log \frac{p(\bar{x}|C_1)\,p(C_1)}{p(\bar{x}|C_2)\,p(C_2)}$$

We now have that the probability of $C_1$ given $\bar{x}$ is equal to a function of the right-hand side, and that function is the sigmoid function:

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

We have a function of $a$: 1 divided by 1 plus $e$ to the minus $a$. If we draw that function, we see that for large negative values of $a$, $\sigma(a)$ is very close to 0. It grows, passing through one half when $a$ equals 0, and for very large positive values of $a$ it saturates at 1.


Figure 5: The sigmoid function.

This means that any value of $a$, negative or positive, is mapped to a value between 0 and 1; any value of $a$ is now mapped to a probability. The question then is: how do we find $a$? Once we have $a$ we have a complete model, and remember $a$ is also a function of $\bar{x}$, the input vector.

For a linear classifier, we have input $\bar{x}$, and since it is linear we will, as we did before, simply take the inner product with a weight vector $\bar{w}$ and add a bias $b$, setting $a(\bar{x})$ equal to $\bar{w}^\top\bar{x} + b$. Our discriminative classifier then becomes $p(C_1|\bar{x})$ equal to 1 divided by 1 plus $e$ to the minus $\bar{w}^\top\bar{x}$ minus $b$:

$$\bar{x} \to \bar{w}^\top \bar{x} + b$$

$$a(\bar{x}) = \bar{w}^\top \bar{x} + b$$

$$p(C_1|\bar{x}) = \frac{1}{1 + e^{-\bar{w}^\top\bar{x} - b}}$$
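Putting the pieces together, a logistic-regression prediction is one inner product followed by the sigmoid. The weights and bias below are made-up stand-ins for values that training would produce:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical trained parameters for a two-feature problem.
w = np.array([1.5, -0.8])
b = 0.2

def p_class1(x):
    """p(C1 | x) = sigmoid(w^T x + b)."""
    return sigmoid(w @ x + b)

x = np.array([0.6, 0.1])
p = p_class1(x)
print(p)                    # a probability strictly between 0 and 1
print(1 if p > 0.5 else 0)  # assign the class with the larger probability
```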

It is the purpose of training to find the values of $\bar{w}$ and $b$. Remember that the training data consists of a number of feature vectors $\bar{x}_n$, each with a corresponding label $t_n$. That is what we use during training to learn the values of $\bar{w}$ and $b$. Once we have them, we have a model and we can do classification.


In this video we discussed logistic regression as a classification algorithm. Many people see the word "regression" and jump to the conclusion that it is used in regression-style problems; that is a mistake. It is important to note that logistic regression is an algorithm specifically used for binary classification. Study the notes provided to learn more about the subject.

MScFE 650 Machine Learning in Finance - Module 3: Unit 4 Video Transcript

Unit 4: Softmax

In this video I will show you how easy it is to do a multi-class classification. Hopefully you will find
the video quite intuitive.

What is softmax classification?

For the multi-class classifier, we will derive a function called the softmax function. Softmax classification is a linear classifier, which means that the different classes are separated linearly, i.e. by straight lines.

Given a feature vector $\bar{x}$ that we want to assign to one of the classes, we take the dot product with a weight vector $\bar{w}$ and add a bias $b$. This is what makes it a linear classifier. Since we have, let's say, $K$ classes, we need $K$ weight vectors as well as $K$ biases:

$$\text{Logit:}\quad \bar{w}_k^\top \bar{x} + b_k, \qquad k = 1, \dots, K$$

This function is also called the logit function; you may recall its use in logistic regression. The logit can have any value, positive or negative. Since we are working our way towards a probability, which is positive, we need to convert it into a positive quantity. For that we take the exponential, $e^{\bar{w}_k^\top\bar{x} + b_k}$.

The other main property of probabilities is that they are normalised, i.e. they add up to 1. This is
easily achieved by dividing by the correct quantity, which is the following

$$p(C_k|\bar{x}) = \frac{e^{\bar{w}_k^\top\bar{x} + b_k}}{\sum_{j=1}^{K} e^{\bar{w}_j^\top\bar{x} + b_j}}$$

This is the softmax function, the function that will then assign a probability.
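As a sketch, the softmax function for assumed logit values. Subtracting the maximum logit first is a standard numerical-stability trick; it does not change the result:

```python
import numpy as np

def softmax(logits):
    """Map a vector of logits w_k^T x + b_k to probabilities that sum to 1."""
    z = logits - np.max(logits)  # stability trick: shift by the largest logit
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for K = 3 classes.
logits = np.array([2.0, 1.0, -1.0])
p = softmax(logits)
print(p)                  # roughly [0.705 0.259 0.035]
print(int(np.argmax(p)))  # 0: assign the class with the largest probability
```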


In this short video I showed you how to derive the softmax function. We will be using this algorithm in the peer review assignment. In the next video we will tie together all the concepts and discuss training and one-hot encoding.

MScFE 650 Machine Learning in Finance - Module 3: Unit 5 Video Transcript

Unit 5: Training

This video ties together the concepts discussed in this module by introducing students to one-hot encoding and how to train a model for both classification and regression-style problems.

We want to find all the weights of the softmax as well as the offsets. Logistic regression is just a
simple case of softmax where we have only two classes. Things simplify quite a bit when we only
have two classes, but everything I say about softmax is also true about logistic regression, so we
don’t need to repeat anything.

Again, I am not going to follow the notes, because they are quite mathematical; they are the place to go if you are mathematically inclined. Instead I am going to follow the notebook that will be made available to you in the form of a peer review. You will find a full description of what I am talking about there, including some of the mathematical details that I omit here; otherwise things become a little too detailed and cluttered.

Training takes place via an optimization problem, so we need to formulate an objective function like the one we talked about when we discussed gradient descent. The weights and offsets that minimise the objective function are the values that we'll use in the model.

Before we talk about the objective function we need to decide how we are going to encode the target values. Remember that with each feature vector $\bar{x}$ we also have a target value $\bar{t}$. If we have 10 classes, like the digits, one possibility is to have 10 target values, from 0 to 9. But mathematically that is not a very convenient way of doing things. Instead of that encoding scheme we rather use something called the 1-of-$K$ encoding, also known as one-hot encoding.

One-hot encoding is very simple. The target becomes a vector $\bar{t}$ with exactly as many entries as we have classes. If we look at the digit example again, then $\bar{t}$ will have exactly 10 entries. All of those entries will be 0 except one of them.

If we are dealing with the digit 0, we want to indicate that it is the 0 class label.

© 2019 - WorldQuant University – All rights reserved.


16
MScFE 650 Machine Learning in Finance - Module 3: Unit 5 Video Transcript

In that case we put a 1 in the first position and 0's everywhere else. In the case of digit 5, we put zeros everywhere except at the sixth position, where we place a 1:

$$\bar{t}_0 = [1,0,0,0,0,0,0,0,0,0]$$

$$\bar{t}_5 = [0,0,0,0,0,1,0,0,0,0]$$
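One-hot encoding is a few lines of NumPy; a minimal sketch, with the 1 placed at the zero-based index of the label:

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Return the 1-of-K (one-hot) encoding of an integer class label."""
    t = np.zeros(num_classes)
    t[label] = 1.0
    return t

print(one_hot(0))  # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot(5))  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```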

Another way to look at this is to say that we assign a probability to each of our $K$ classes. Since we know to which class our observation $\bar{x}$ belongs, we assign a probability of 1 to that class, because we know with certainty that the observation belongs to it and not to any of the others, which are all assigned probability 0. We can think of one-hot encoding as just a probability distribution where the probability of the designated class is 1 and the probability of the rest of the classes is 0. Armed with this we can start building an objective function that we can use to find the weights and, by implication, the softmax function and our classifier.

We first assign values to all the weights and all the offsets, and that should be done in a random manner. Later in the course, when we talk about deep learning, we will say a little about how to choose these initial values. For now, you can assign random values to the weights and offsets.

Once we have a set of weights and offsets we can take an observation $\bar{x}$ and push it through our classifier. That means that we can evaluate the class probabilities of all the classes, given $\bar{x}$. Let us indicate the class probability of $\bar{x}$ by $p_k$; that is the output of the classifier for this given set of weights:

$$\text{Evaluate:}\quad p(C_k|\bar{x}) = p_k$$

We must now compare these class probabilities with the target probabilities. We can do that in
different ways.

One possibility is the mean squared error. This means that we take the difference between the class probabilities assigned by our classifier and the target values. Remember the target values will all be 0 except one of them. Since we are only interested in the magnitude we


square, and we sum over all the classes. That is the squared error for a single observation. We then take the mean over all the observations in our training set.

We then calculate the gradient of the mean squared error with respect to $\bar{w}$ and use the equation for gradient descent:

$$MSE(\bar{w}) = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left(p_{nk} - t_{nk}\right)^2$$
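A sketch of this objective for made-up predictions and one-hot targets, both of shape (N, K):

```python
import numpy as np

def mse(p, t):
    """Mean over observations of the summed squared error across classes."""
    return np.mean(np.sum((p - t) ** 2, axis=1))

# Hypothetical classifier outputs and one-hot targets for N = 2, K = 3.
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

print(mse(p, t))  # approximately 0.1, i.e. (0.14 + 0.06) / 2
```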

As we have explained already, we start with an initial guess, $\bar{w}_0$, then we specify a learning rate $\alpha$ and take a step in the negative gradient direction. The updated set of weights and offsets, the corrected value, we call $\bar{w}_1$, and we then repeat the whole procedure:

$$\bar{w}_1 = \bar{w}_0 - \alpha\, \nabla f(\bar{w}_0)$$

Cross entropy

The mean squared error is often used, but there is an even better metric: the cross entropy. For two probability distributions, $\bar{q}$ and $\bar{p}$, the cross entropy is given by the following:

$$C(\bar{q}, \bar{p}) = -\sum_{k=1}^{K} q_k \log p_k$$
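A sketch of the cross entropy for a one-hot target $\bar{q}$ and two assumed classifier outputs; the small eps is an added guard against taking log(0):

```python
import numpy as np

def cross_entropy(q, p, eps=1e-12):
    """C(q, p) = -sum_k q_k * log(p_k)."""
    return -np.sum(q * np.log(p + eps))

q = np.array([0.0, 1.0, 0.0])          # one-hot target distribution
good = np.array([0.05, 0.90, 0.05])    # nearly certain of the correct class
bad = np.array([0.70, 0.20, 0.10])     # favours the wrong class

print(cross_entropy(q, good))  # about 0.105: small loss
print(cross_entropy(q, bad))   # about 1.609: much larger loss
```

With a one-hot target, only the log-probability of the correct class contributes, so confident correct predictions are rewarded with a small loss.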

We can now minimize the mean of the cross entropy taken over all the samples.

If we make $\bar{q}$ our target probability distribution, that is, the one-hot encoded targets, then we get a minimum when $\bar{p}$, the classifier output probabilities, equals the target probabilities.

Now we have a good objective function, a way of finding its minimum via gradient descent, and all
that remains is how to compute the gradient. That becomes a little technical. We are, however,
very fortunate in that TensorFlow does all of that for us.


In this video we discussed how to train a model and how gradient descent, together with an objective function, allows us to do so. Next we introduced a way to encode categorical variables known as one-hot encoding. Please note that while one-hot encoding is used here for the target variable, it can also be used to encode features.
