
LECTURE 3 - FEATURES, NEURAL NETWORKS

Score: It is the dot product between the weight vector w and the feature vector φ(x).
The score drives the prediction: in linear regression we output the score as a number, whereas in binary classification we output the sign of the score.
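A minimal sketch of this in Python (the weight and feature values here are made up for illustration):

```python
# Score = w · φ(x); regression outputs the number, classification its sign.
import numpy as np

w = np.array([2.0, -1.0])       # weight vector w (made-up values)
phi_x = np.array([1.0, 3.0])    # feature vector φ(x) (made-up values)

score = np.dot(w, phi_x)        # dot product: 2*1 + (-1)*3 = -1
regression_output = score               # linear regression: the score itself
classification_output = np.sign(score)  # binary classification: sign(score) = -1
```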
We have already learnt to select w via optimization; now our interest will be on φ.
In real life, feature extraction (specifying φ(x) based on our knowledge) plays a key role. A feature extractor takes an input and outputs a bunch of features which are useful for our prediction. We need a principle to decide how many features to select and how to select them.
If a group of features are all computed in the same way, they are collectively known as a “feature template”. Describing multiple features in a unified way is highly helpful for clarity, especially when we are working with complex data, for example an image plus some metadata. A feature template can contain many features. The most efficient way to represent all of them is in the form of a vector. If only a few features have non-zero values, we can also use Python maps or dictionaries, which are common in NLP, as sketched below.
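Here is a toy feature extractor using a “last three characters” feature template, with the sparse features stored in a Python dictionary; the specific template and feature names are invented for illustration:

```python
# Toy feature extractor: one feature template ("last three characters")
# plus a length feature, stored sparsely as a dict (common in NLP).
def extract_features(x: str) -> dict:
    features = {}
    features["endsWith_" + x[-3:]] = 1   # only one of many possible features fires
    features["length"] = len(x)
    return features

print(extract_features("abcde"))  # {'endsWith_cde': 1, 'length': 5}
```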
A “Hypothesis class” is the set of all predictors we can get by varying the weight vector w. If we take φ(x) = x, then by varying w we get a large number of functions, which are nothing but straight lines passing through the origin with different slopes. All these lines together form the hypothesis class.
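A small sketch of that hypothesis class, with a few arbitrary values of w:

```python
# Hypothesis class F = {f_w : f_w(x) = w * x} for φ(x) = x:
# each choice of w is one line through the origin.
def make_predictor(w):
    return lambda x: w * x    # one member of the hypothesis class

hypothesis_class = [make_predictor(w) for w in (-1.0, 0.5, 2.0)]
print([f(3.0) for f in hypothesis_class])  # [-3.0, 1.5, 6.0]
```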
This gives the full pipeline of machine learning: from the set of all possible functions we select a hypothesis class, and then, using the available data, we select the required function from that class. Here we can face two problems: either we do not have enough features (so no function in the class predicts well), or we fail to optimize over the class efficiently. Either way, our predictor's score will not be up to the mark.

NEURAL NETWORKS:
When we face complex situations, we shift from linear classifiers to neural networks. Put simply, neural networks are a bunch of linear classifiers stitched together with some non-linearity.

A neural network tries to break the problem down into a set of subproblems (each one the output of a linear classifier), applies a linear classifier again on top of those outputs, and finally gets the score. The sigmoid function is a common choice for the in-between function, but it can be any non-linear function (e.g. the logistic function or ReLU).
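A numeric sketch of this two-layer score (all weights and shapes here are assumed for illustration):

```python
# Hidden layer h = σ(V φ(x)), final score = w · h.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

phi_x = np.array([1.0, 2.0])       # feature vector φ(x)
V = np.array([[0.5, -0.3],         # first layer: each row is one linear
              [0.8,  0.1]])        # classifier solving a subproblem
w = np.array([1.0, -2.0])          # second-layer weights

h = sigmoid(V @ phi_x)             # non-linearity applied to each subproblem
score = np.dot(w, h)               # final linear combination
```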
Now the training loss depends on both V and w: TrainLoss(V, w) = (1/|D_train|) ∑_{(x,y)∈D_train} Loss(x, y, V, w). We need to minimise TrainLoss by following its gradient, as we did with linear classifiers; a sketch follows below. In neural networks we have 5 basic building blocks - +, -, *, max and σ. With these 5 blocks we can build any number of layers and design any function.
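For concreteness, here is a sketch of TrainLoss itself, assuming the squared loss (y - score)² as the per-example loss; the loss choice and the toy data are assumptions, not from the lecture:

```python
# TrainLoss(V, w) = (1/|D_train|) * Σ Loss(x, y, V, w), with squared loss.
import numpy as np

def predict(V, w, phi_x):
    h = 1.0 / (1.0 + np.exp(-(V @ phi_x)))   # σ(V φ(x))
    return np.dot(w, h)                      # score = w · h

def train_loss(V, w, D_train):
    return sum((y - predict(V, w, phi_x)) ** 2
               for phi_x, y in D_train) / len(D_train)

# Toy dataset: (feature vector, label) pairs.
D_train = [(np.array([1.0, 2.0]), 1.0),
           (np.array([-1.0, 0.5]), -1.0)]
```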
Backpropagation is an algorithm that allows us to compute gradients for any computation graph. For each node it computes two values: a forward value and a backward value (the gradient). Nowadays PyTorch and TensorFlow calculate gradients automatically, which would otherwise become a tedious task when we have many layers.
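A minimal PyTorch sketch of this: the forward pass builds the computation graph, and loss.backward() runs backpropagation to fill in the gradients (the shapes and the squared loss are assumed for illustration):

```python
import torch

V = torch.randn(2, 2, requires_grad=True)   # first-layer weights
w = torch.randn(2, requires_grad=True)      # second-layer weights
phi_x = torch.tensor([1.0, 2.0])            # feature vector φ(x)
y = torch.tensor(1.0)                       # target

score = w @ torch.sigmoid(V @ phi_x)        # forward pass builds the graph
loss = (y - score) ** 2                     # squared loss (assumed)
loss.backward()                             # backward pass: backpropagation
print(V.grad, w.grad)                       # ∂loss/∂V and ∂loss/∂w
```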
