
Coding principles at the population level:

1. Localist representation: each entity is encoded by a single, dedicated
neuron at each time.
2. Distributed representation: each entity is encoded by a pattern of
activity distributed over many neurons.
3. Sparse distributed coding: each entity is encoded by the strong
activation of a relatively small set of neurons (the smaller the set, the
higher the sparsity; the larger the set, the lower the sparsity). One way
to measure sparseness is the kurtosis of the activation probability
distribution.

Sparse coding: we impose a penalty on the total number of active units
to encourage independence; we select a small number of active basis
functions from a large dictionary of possible functions. Sparse coding
forces the model to rely on very few basis functions at the same time, so
we have a dictionary of basis functions that we can combine.
We define sparsity as having few non-zero components, or having few
components that are not close to zero. The requirement that the
coefficients a_i be sparse means that, given an input vector, we would like
as few of our coefficients as possible to be far from zero.
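As a minimal illustration (not from the notes; function and variable names are made up), the sparse-coding objective can be written as a reconstruction error plus an L1 penalty on the coefficients, which pushes most coefficients toward zero:

```python
import numpy as np

def sparse_coding_loss(x, D, a, lam=0.1):
    """Reconstruction error plus an L1 sparsity penalty on the coefficients.

    x   : input vector (n,)
    D   : dictionary of basis functions, one per column (n, k)
    a   : coefficients (k,); the penalty pushes most of them toward zero
    lam : strength of the sparsity penalty
    """
    reconstruction = D @ a
    return 0.5 * np.sum((x - reconstruction) ** 2) + lam * np.sum(np.abs(a))
```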

Probabilistic Graphical Models:

Bayesian networks: directed graph; the Markov blanket of a node
consists of its parents, its children, and the other parents of its children.
Markov networks: undirected graph; the Markov blanket contains just the
directly connected nodes, i.e. the immediate neighbours.
Machine learning:
The loss function (or error function) defines how well we are learning (it
is the performance measure that we aim to optimize).
To find the minimum of the loss function we use gradient descent (an
optimizer that minimizes the error).
In the simple case (convex optimization) the loss function has only one
global minimum.
In the hard case (non-convex optimization) there are many local minima.
Underfitting: the training error is large.
Overfitting: the gap between training and test error is large.
We balance between them by tuning the complexity of the model.
The solutions are:
1. Regularization: simplify the model by selecting fewer parameters
2. Gather more training data, especially for complex models
3. Reduce the noise in the training data
1) Model regularization: regularization is any modification we make to our
learning algorithm in order to reduce the generalization error; it is a
technique for tuning the function by adding an additional penalty
term to the error function to prevent overfitting. Every parameter in a
deep network represents the weight of a connection between
neurons, and we prefer smaller weights that are close to zero.
a) There are two ways to regularize:
i) Weight decay: we add a penalty on the model parameters
ii) Sparse coding: we add a penalty on the model representation
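A minimal sketch of weight decay as a penalty on the parameters (illustrative names and values): an L2 penalty on the weights contributes a term proportional to w to the gradient, so every step also shrinks the weights toward zero.

```python
import numpy as np

def weight_decay_step(w, grad_loss, lr=0.01, lam=1e-4):
    """One gradient step with L2 weight decay.

    The effective gradient is grad_loss + lam * w, so each update
    also pulls the weights slightly toward zero.
    """
    return w - lr * (grad_loss + lam * w)
```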
Supervised learning:
Linear models for supervised learning: perceptron and delta rule.
The problems with the perceptron:
1. There are many possible lines separating the two classes!
2. The data might not be linearly separable!
Delta rule: the first step to improve the perceptron is to replace the step
activation function with a linear activation function, which is a
continuous, differentiable function. We can then use the gradient to
minimize the loss function; this is more informative, because we know
how close we are to the correct class and can adapt the weights of the
network accordingly.
Observation: the perceptron changes the inclination of the separating line
very abruptly, whereas the delta rule changes the weights slightly and
gradually finds the best border separating the data points.
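A minimal sketch of the delta rule for a single linear unit (illustrative names): the weight change is proportional to the error (target minus output) times the input, i.e. the gradient of the squared error.

```python
import numpy as np

def delta_rule_step(w, x, t, lr=0.1):
    """One delta-rule update for a linear unit.

    y = w . x, error = t - y, and the weights move slightly in the
    direction that reduces the squared error (gradual, unlike the perceptron).
    """
    y = np.dot(w, x)          # linear activation
    error = t - y
    return w + lr * error * x
```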
Error propagation: the magnitude of the gradient with respect to each weight
in the network tells us how sensitive the loss function is to changes in that
weight.
Back-propagation exploits the chain rule of calculus to compute the
derivative of composite functions. This allows us to recursively determine
the contribution of each variable to the final loss.
Vanishing gradient problem: if we add many hidden layers, in some
cases the gradient vanishes, and this is why the weights in the lower
layers cannot change their values. Since in feed-forward networks
learning is based on gradient descent and error back-propagation, we
just have to be careful about gradient vanishing and overfitting.
Exploding gradients: check and limit the size of the gradient during training
with
1. gradient clipping: if the norm of the gradient exceeds a given
threshold, set it to the maximum value
2. or gradient scaling: normalizing the gradient vector so that its
norm equals a defined value
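A minimal sketch of the two options (illustrative helper names), both based on the norm of the gradient vector:

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Gradient clipping: if the norm exceeds the threshold, rescale to it."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

def scale_gradient(grad, target_norm=1.0):
    """Gradient scaling: always normalize the gradient to a fixed norm."""
    norm = np.linalg.norm(grad)
    return grad * (target_norm / (norm + 1e-12))
```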
Advanced optimization methods:
1) Weight initialization is an important consideration in the design of a
neural network model. We almost always initialize all the weights in the
model randomly from a Gaussian or uniform distribution: if we assign
the same value to all the weights and training is deterministic, some
neurons may develop the same features. We can do better if we
employ some simple heuristics that improve SGD convergence
by avoiding numerical problems during gradient computation. If the
magnitude of the weights is larger we can learn faster, but too-large
weights can saturate the activation function, so we want to have
weights that allow us to exploit the activation function in its most
sensitive region. There are two ways of initializing the weights:
a) Normalized Xavier weight initialization: each weight is drawn as
a random number from a uniform probability distribution (U)
in the range -sqrt(6)/sqrt(n + m) to sqrt(6)/sqrt(n + m), where n is
the number of inputs to the node (e.g. the number of nodes in the
previous layer) and m is the number of outputs from the layer (e.g.
the number of nodes in the current layer).
b) Sparse initialization: we decide on a percentage of connections
that should be non-zero; for example, with 2000 connections coming
into one neuron, we allow only 200 of them to be different from zero
(possibly sampled using the Xavier formula) and set the rest to zero,
so most connections are initialized to zero.
2) Momentum: a technique that is used along with SGD. Instead of using
only the gradient of the current step to guide the search, momentum
also accumulates the gradients of past steps to determine the direction
to go, so adding a momentum term helps accelerate the gradient
vectors in the right directions, thus leading to faster convergence of
SGD. The update equation has two parts: the first term is the gradient
retained from previous iterations; this retained gradient is multiplied
by a value called the "coefficient of momentum", which is the fraction
of the gradient retained at every iteration. When gradients are aligned
and SGD is moving down a long valley, we keep increasing the step
size, so the velocity keeps increasing.

3) Adaptive learning rate: the learning rate is probably one of the most
critical hyperparameters in SGD, and we want to adapt it directly. The
learning rate must be carefully calibrated: with a very small
learning rate you are guaranteed to converge, but only after many
iterations, so you want to increase it as much as you can; however, if
you increase it too much you may jump out of the minimum.
(1) The simplest option is a constant learning rate.
(2) Another option is to decrease the learning rate as we
proceed: start with a large step size, because at the beginning
we may make large progress, and decrease the rate as we
approach the local minimum.
(3) Adaptive learning rates work better: we use the gradient as
a momentum to calibrate the learning rate, and most
importantly we use a different learning rate for
each connection weight! Some weights give you more
information about the gradient, while other weights are
not very important at some specific point during optimization,
so you can decrease or increase each weight more accurately
and strongly. So every connection weight has its own
learning rate.
(i) ADAM: the idea is the same as other adaptive learning
rates: we have a different learning rate for every
weight in the network, and these learning rates are
adapted according to the gradient, but not only
according to the average of the gradient; we also
consider the second moment of the distribution. We
want to estimate the first and second moments of the
distribution of the gradient.
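A minimal sketch of the Adam update described above, with per-weight step sizes derived from running estimates of the first and second moments of the gradient; the hyperparameter values shown are the common defaults, not prescribed by the notes.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: per-weight learning rates from moment estimates.

    m : running estimate of the first moment (mean) of the gradient
    v : running estimate of the second moment (uncentered variance)
    t : iteration counter (starting at 1), used for bias correction
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```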

Dropout is another way of regularizing: during training we
randomly remove some of the connections in the network (randomly
removing hidden or input units while processing each pattern). One
efficient implementation is to randomly sample a binary mask which
defines which units should be dropped out, according to a fixed probability.
Dropout regularizes each hidden unit to learn not just a good feature, but
a feature that is good in many contexts; therefore, switching features and
weights off makes the network more robust.
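A minimal sketch of the binary-mask implementation mentioned above (the "inverted dropout" variant, which rescales the surviving units so nothing changes at test time; names are illustrative):

```python
import numpy as np

def dropout(h, drop_prob=0.5, training=True):
    """Randomly zero out units with a binary mask during training."""
    if not training or drop_prob == 0.0:
        return h
    mask = (np.random.rand(*h.shape) >= drop_prob)
    return h * mask / (1.0 - drop_prob)   # rescale so the expected activation is unchanged
```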
Second-order optimization methods: second derivatives give
information about the curvature of the function and can be very useful to
speed up SGD. If we only have the first-order derivative and there is no
curvature, the gradient predicts the decrease correctly; if the curvature is
negative, the cost function decreases faster than the gradient predicts; if
the curvature is positive, the function decreases more slowly than
expected and eventually begins to increase. So the gradient tells us to go
in a certain direction, but this can be misleading, because the function may
suddenly increase, and if we follow this direction we cannot find the
minimum of the function. So we cannot trust this information if we consider
only the first derivative; we need to measure the curvature of the surface
with the second derivative. The Jacobian is the matrix of first derivatives;
the Hessian is a much larger matrix that contains the second-order
derivatives. The problem with the Hessian is that it is a huge matrix,
because it contains all the second-order partial derivatives for every
combination of the variables participating in the function. The idea is
that if we could compute the Hessian, we could decompose it and calculate
the condition number, which is the ratio between the largest and smallest
eigenvalue. The condition number of the Hessian measures how much the
second derivatives at a certain point differ from each other. If they are
very different, it means that the gradient goes down very fast in one
direction but not very fast in the other, and we should take this into
account during gradient descent, otherwise the optimizer becomes very
slow. When the Hessian has a very large condition number, gradient
descent performs poorly: in one direction the derivative increases rapidly,
while in another direction it increases slowly.
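A minimal sketch of the condition number as the ratio between the largest and smallest eigenvalue of the Hessian, illustrated on a toy quadratic whose Hessian is known:

```python
import numpy as np

def condition_number(H):
    """Ratio between the largest and smallest eigenvalue (in magnitude)."""
    eigvals = np.abs(np.linalg.eigvalsh(H))   # Hessian is symmetric
    return eigvals.max() / eigvals.min()

# Toy loss f(x, y) = 0.5 * (100 * x**2 + y**2): curvature differs a lot per direction
H = np.array([[100.0, 0.0], [0.0, 1.0]])
print(condition_number(H))   # 100 -> plain gradient descent will zig-zag and be slow
```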

With a filter we slide over the input (image) to detect edges and
generate feature maps.
The kernel moves over the input data by the stride value. If the stride is
2, the kernel moves by 2 columns of pixels in the input matrix. In short,
the kernel is used to extract features such as edges from the image.
Each hidden neuron has a local receptive field encoding a specific
feature; the number of hidden neurons defines how many features
(kernels) will be represented at each processing layer; the kernel size
defines the receptive field of the kernel; kernels generate a feature map
as output.
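A minimal sketch (naive loops, illustrative names) of a kernel sliding over an image with a given stride to produce a feature map:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and collect dot products into a feature map."""
    H, W = image.shape
    k = kernel.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            fmap[i, j] = np.sum(patch * kernel)   # local receptive field response
    return fmap
```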
Adversarial attack: we want to produce an adversarial image, so we
take the original image and add a small fraction of noise computed
using the gradient. We look at the gradient of the loss with respect to the
input and move the input in the direction that increases the
misclassification. So, using the gradient information, we change the input.
We fool the deep network by injecting a small percentage
of noise into the input; the noise must be created carefully: the image should
be modified in the direction of the gradient that maximizes the loss function
with respect to the input image.
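A minimal sketch of the attack just described, essentially the fast gradient sign method; grad_input is assumed to be the gradient of the loss with respect to the input, obtained by back-propagation:

```python
import numpy as np

def adversarial_example(image, grad_input, epsilon=0.01):
    """Perturb the image in the direction that increases the loss.

    grad_input : gradient of the loss with respect to the input pixels
    epsilon    : size of the perturbation (kept small so the change is imperceptible)
    """
    noise = epsilon * np.sign(grad_input)
    return np.clip(image + noise, 0.0, 1.0)   # keep pixel values valid
```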
Defenses against adversarial attacks:
1. Adversarial training: starting from our training database (e.g.
ImageNet), we create adversarial attacks against our own system and
train it again using the adversarial samples, i.e. we augment the training
set with adversarial samples. (We have our model, somebody attacks it
with adversarial samples, we train our model again with these
samples to get a better model; then we attack again, obtain
new samples, train again... and the model becomes more and more
robust.)
2. Randomized smoothing: we add some noise to the network, either
to the input or to the weights, so we have a smoothed version of the
network, where the prediction is obtained by averaging the predictions
over the neighbourhood. We perturb the image and check the prediction:
the prediction is robust if it is consistent across all the perturbations.
If we perturb it only a little and the prediction changes, it means that
the image is really near the classification boundary.
Back-propagation through time: a recurrent network can be seen as a very deep
network with parameter sharing, trained with back-propagation. Rather than
computing the gradient at the end, we can break the sequence down into
shorter pieces and update the weights after some steps to mitigate
vanishing gradients. We have a series of correlated inputs that we want to
process sequentially: we process the first time step and create the hidden
activation, then we process the next input and use the previous context
value to create the new hidden activation, and so on; the last state ends up
encoding the whole sequence.
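A minimal sketch of this sequential processing (illustrative names and shapes): each step combines the current input with the previous hidden state, and the final state summarizes the sequence.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Process a sequence one step at a time, carrying the hidden state forward."""
    h = np.zeros(W_hh.shape[0])
    for x in inputs:                              # inputs: list of vectors, one per time step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)    # new state depends on input and context
    return h                                      # final state encodes the whole sequence
```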
A variety of sequence learning settings:
• Sequence-to-sequence: each input is followed by a prediction
• Encoder-decoder: we try to encode all the information over the
sequence into a fixed-size vector and then decode it to produce
the output
• Sequence-to-output: we have an input sequence, we accumulate the
information in the context (hidden layer), and then only one output is
created
Learning long-term dependencies is practically very hard due to gradient
vanishing.
Long Short-Term Memory (LSTM) networks: we add to the basic
RNN architecture a set of gated units, which learn how much
information should be retained in temporary memory
and how far forward it should be propagated. (The idea of gating is
selecting which information should be deleted from or maintained in the
network state.) There are three gates with which we decide which
information will be deleted or maintained (a sketch of the gate equations
follows this list):
• Input gate: decides how much of the new input flows into the memory
cell, i.e. what new information we are going to store in the cell state
• Output gate: decides how much of the cell state is exposed as the
output (hidden state)
• Forget gate: decides what information we want to throw away from the
cell state
• Cell update: we alter the current state by deleting something from the
previous state via the forget factor and adding the new information
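A minimal sketch of one LSTM step with the gates listed above (standard formulation; parameter names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts with keys 'i', 'f', 'o', 'g'."""
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate cell update
    c = f * c_prev + i * g        # delete via the forget gate, add new information
    h = o * np.tanh(c)            # expose part of the cell state as output
    return h, c
```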
Attention mechanism in LSTM:
We can selectively search for useful past information during decoding
by taking a weighted sum of all the encoder hidden states and passing it
to the decoder hidden state; we select the most important hidden states.
In this mechanism we weight the inputs selectively: at every time step we
give importance to a different portion of the input. It is not mandatory that
the score gives the highest importance to the first portion of the input; it
depends on the structure of our system.
• With a plain encoder, the whole sequence is compressed into one single
vector, the last hidden state (with attention, the encoder saves all its
hidden states in a buffer and passes them to the decoder)
• The decoder looks back at the hidden states and gives a score to each
hidden state, which is used to amplify the most relevant hidden states;
every hidden state is then combined by a weighted sum, i.e. each
hidden state is weighted by its score and all the weighted hidden
states are summed
• The result is a new vector, the context vector, which gives us more
information about the most important parts of the original sequence and
is used to generate the output.
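A minimal sketch of the scoring and weighted-sum step (dot-product scores followed by a softmax; names are illustrative):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Score every encoder hidden state, softmax the scores, take a weighted sum."""
    scores = encoder_states @ decoder_state          # one score per encoder step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over the scores
    return weights @ encoder_states                  # context vector
```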
Autoencoders: a deterministic model (i.e. not probabilistic) that is
trained using error back-propagation, like feed-forward networks; the
only difference is the output layer. In this case the output layer does not
represent a class (we do not have to activate the correct class): the
output layer is simply a copy of the input. The goal of this architecture
is to take an input, compress it and then decompress it, recreating the
same input as the output. The loss function for the autoencoder is called the
reconstruction error; with MSE we just average over all the training patterns.
If our autoencoder is really good, the output is similar to the input and
this error will be close to zero.


• It can have an overcomplete code if the number of hidden units is
larger than or equal to the number of visible units; in this case it will not
learn any features, it just learns to memorize the input.
• It has an undercomplete code when the number of hidden units is
lower than the number of visible units, so the autoencoder is forced to
compress the data and learn new and useful features (the network will
extract relevant features).
If we use linear activation functions, an undercomplete autoencoder will
learn to span the same subspace as PCA. However, if we use non-linear
activation functions we have a more powerful mapping. We can
improve the feature extraction process by introducing additional
regularization, such as the sparse autoencoder (where we only want to
activate a few units at the same time), the denoising autoencoder, and
generative autoencoders.
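A minimal sketch of an undercomplete autoencoder with its MSE reconstruction error (a single hidden layer; names are illustrative):

```python
import numpy as np

def autoencoder_forward(x, W_enc, W_dec):
    """Compress the input into a smaller code, then try to reconstruct it."""
    code = np.tanh(W_enc @ x)        # undercomplete: the code is smaller than x
    x_hat = W_dec @ code             # decoder tries to copy the input to the output
    return x_hat, code

def reconstruction_error(x, x_hat):
    """Mean squared error between the input and its reconstruction."""
    return np.mean((x - x_hat) ** 2)
```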
Energy function, which depends on the system temperature T:
The energy function specifies which states of the network are more
likely to occur; the learning goal is to assign high probability (low
energy) to the configurations observed during training. For example, if you
have a pattern that you want to memorize, you want to minimize the
energy of this pattern during learning.

Hopfield network architecture:

The connections are bidirectional with a fully connected topology; there are
no self-connections; all the neurons are visible and each corresponds to
one input variable (there are no latent neurons); in the energy function
we consider all the possible pairwise interactions between neurons.
Hopfield network energy function: the energy value tells us how likely a
certain configuration of the system is, and the idea is to decrease the energy
of the system so that it explores more probable configurations.

There are two mechanisms for minimizing energy:

1. During inference: we change the activation of every neuron with
simple rules, to make it more similar to the stored patterns. Each
neuron x_i computes a weighted sum of the activities of all the
other neurons x_j and fires if the value is above a certain
threshold θ_i, and we keep updating until we reach a stable state.
2. During learning: we take the pattern and fix it in the
network (every pixel is clamped onto a neuron of the Hopfield network)
and we change the weight between every pair of neurons
using Hebbian learning. If two neurons are correlated
(both active or both inactive) we increase that connection; we
keep applying the update rule for each pair of neurons, and then
we change the image. In a new pattern we have different
correlations, so we change the weights according to the rule, and so
on; by the end of learning the weights encode the proper
values that allow us to recover all the image correlations.

We just want to store configurations in a dynamical system: during
learning we want to create energy minima for the correct
configurations, and during recall (inference) we want to start from a random,
incomplete or noisy configuration and let the system state move toward a low
energy minimum to recover the closest memory.
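A minimal sketch of both mechanisms for ±1 patterns (Hebbian storage and asynchronous recall), with the threshold set to zero for simplicity and illustrative names:

```python
import numpy as np

def hebbian_weights(patterns):
    """Learning: accumulate pairwise correlations over the stored patterns."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:                 # each pattern is a vector of +1/-1
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)             # no self-connections
    return W / patterns.shape[0]

def recall(W, x, steps=100):
    """Inference: update one neuron at a time, moving toward a stable low-energy state."""
    x = x.copy()
    for _ in range(steps):
        i = np.random.randint(len(x))
        x[i] = 1 if W[i] @ x >= 0 else -1   # weighted sum vs. threshold (here 0)
    return x
```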
Stochastic Hopfield networks:
If we increase the number of patterns, the network will also develop a
number of incorrect memories, called spurious attractors, which do not
correspond to correct local minima. To avoid getting trapped in bad local
minima we can replace the deterministic activation function with a
stochastic function, which uses a temperature parameter:
1. at low temperature, the network behaves deterministically
2. at medium temperature, the network tends toward low-energy states
3. at high temperature, the network randomly explores the state space

Simulated annealing: to reach the best or most stable energy minimum,
we start with a high temperature in our system and then gradually decrease
the temperature; the temperature is progressively reduced until the system
becomes deterministic.
Generative models:
The problem with the Hopfield network was that all the neurons were
visible: when we have a pattern, we have to clamp the whole image onto
the neurons of the Hopfield network and we try to store it by adjusting the
weights between neurons. In this way we can capture only pairwise
interactions, so we can only see how a certain unit interacts directly with
another one, and this is captured in the connection weight between the
two units.
- In a generative model we still have some visible patterns that are clamped
onto visible neurons; however, in this case we also add some
hidden neurons that are not forced onto the pixels of the image. They act as
latent variables used to capture the variability in the image
without being clamped to the pixels. This model is also fully
connected and has two different sets of neurons:
1. Visible neurons: the observed variables; they are the sensor layer of the
network that presents the input patterns to the network
2. Hidden neurons: they model the latent factors in the
distribution of the observed variables. Latent factors can explain the data
that you observe, so all the units in the hidden layer can observe
the activations of the data clamped onto the visible units, and may
discover correlations that are not easily captured by units that
only look directly at the pixels of the image.
• Inference: given some observed evidence (visible units), which are the
more probable hypotheses (hidden units) that could explain it?
• Learning: find the parameters (connection weights) of a generative
model that best describe the data distribution.
Generative models try to discover the latent structure of the input data
by building a probabilistic, internal model of the environment that could
have produced the data: Latents are the internal representation of the
environment.
A Markov chain is a process that performs a search over a set of possible
states according to a predefined set of constraints specified by a
transition matrix T .

Gibbs sampling: a form of Markov chain sampling where, rather than resampling
the whole network state, only one variable is resampled at each iteration,
conditioning on all the other variables; x_{-i} indicates the set
containing all the variables except x_i (the transition of one variable
depends on all the other variables). We can speed up the sampling by
exploiting the graph structure:
we can ignore variables outside the Markov blanket, and conditionally
independent variables can be sampled at the same time (block Gibbs sampling).

Boltzmann machines:
Boltzmann machines are a stochastic variant of Hopfield networks that
exploit hidden units to learn higher-order correlations ("features") from
the data distribution.
Data-driven inference: we observe some evidence, i.e. some variables
(input patterns) clamped onto the visible units, and try to estimate the best
configuration of the hidden units, i.e. exploit the hidden units that most
closely explain what we observe.
• start with an input pattern clamped on the visible units, while the
hidden units have a random initial activation
• iteratively update the activation value of each hidden neuron, one
at a time, until the network reaches equilibrium (Gibbs sampling);
we keep updating and sampling until we reach convergence.
Model-driven sampling:
• start with a random activation on all units
• iteratively update the activation value of each neuron, one at a
time, until the network reaches equilibrium (Gibbs sampling)
Boltzmann machines: Learning :
Learning is based on maximum-likelihood: we want to assign high
probability (low energy) to the observed data points, and low probability
to the remaining points

Our loss function is maximum likelihood function, we try to find out


what are the best weight that allows the network to settle down in a good
state that resample visible state are observing in training state (change a
wight in such way , next time boltzman machine produce the model
phase , some correlation are observed in data.)
Restricted Boltzmann Machines:
We have a bipartite graph, meaning that there are no intra-layer connections:
hidden neurons are not connected to each other, and visible neurons are not
connected to each other, but the two layers are fully connected to each other.
We have conditional independence (enabling block Gibbs sampling), because the
Markov blanket of each hidden unit consists only of the visible units, and
the same holds for the visible units.

Contrastive divergence: for each training pattern we have a
positive phase, a negative phase and a weight update.
The fully model-driven phase is very demanding, because we would have to
start randomly for all units; so, instead of initializing randomly, we look at
the correlations produced after the first phase (in the positive phase, the
correlations between pairs of units given the data): I start with the hidden
activations driven by the data, then resample the data, then sample the
hidden units, and so on until I reach equilibrium; at the end I have a
"fantasy" produced by the model, where the network is free to change the
visible units, so it is something similar to the model-driven phase, but it is
initially biased by the data, because I was not starting from a totally random
configuration: I was starting from the hidden activations given the data.
1) Positive phase:
a) The pattern is presented to the network
b) The activations of the hidden neurons h are computed in a single step
using the stochastic activation function
c) We compute the correlations <v_i h_j> between all visible and
hidden neurons
2) Negative phase:
i) Starting from the hidden neuron activations computed during
the positive phase, we generate activations on the visible layer
using the stochastic activation function
ii) Starting from these new visible activations ("reconstructed
data"), we compute again the activations of the hidden neurons
iii) We compute the correlations <v_i h_j> between all visible and
hidden neurons
3) Weights update: the weights are then updated according to the
difference between the data-driven and reconstruction-driven correlations,
Δw_ij ∝ <v_i h_j>_data − <v_i h_j>_reconstruction
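A minimal sketch of one CD-1 update for a binary RBM following the three phases above (biases omitted for brevity; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, lr=0.1):
    """One contrastive-divergence (CD-1) weight update for a binary RBM."""
    # Positive phase: hidden activations and data-driven correlations
    p_h = sigmoid(v_data @ W)
    h_data = (np.random.rand(*p_h.shape) < p_h).astype(float)
    pos_corr = np.outer(v_data, p_h)
    # Negative phase: reconstruct the data, recompute hidden activations
    p_v = sigmoid(h_data @ W.T)
    v_recon = (np.random.rand(*p_v.shape) < p_v).astype(float)
    p_h_recon = sigmoid(v_recon @ W)
    neg_corr = np.outer(v_recon, p_h_recon)
    # Weight update: difference between data-driven and reconstruction correlations
    return W + lr * (pos_corr - neg_corr)
```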
Deep belief networks: we train a hierarchical generative model to discover more
and more complex features and then fine-tune all the connection weights
using back-propagation.
a) The internal representations (real-valued probabilities over the hidden
units) of one RBM are used as input (visible layer) for the next RBM
b) Multiple levels of representation are encoded as a hierarchical
generative model
(hypotheses over hypotheses)
c) NB: each layer performs a non-linear transformation of the data
The idea is that we have some visible units which are clamped to our
training patterns (e.g. images), and we can compute the hidden unit
activations by establishing the probability of activation given the visible
units, so we activate these hidden units; we can also reconstruct the visible
layer from the top down using the same connections. This is the standard
training of an RBM: we use contrastive divergence and learn the final state
of these connections. After we have trained the first RBM we can simply
stack a second RBM on top of it, where the visible layer of the new RBM
is given by the hidden layer activations of the first RBM, and we can stack
together as many RBMs as we want.
Restricted Boltzmann Machines vs. Autoencoders
An RBM has bidirectional connections and uses an energy-based architecture;
it is a probabilistic model that tries to assign low energy to the good
configurations of the system, and you can sample new data by just exploring
the landscape of the energy function using Gibbs sampling.
An AE has a feed-forward architecture: starting from the visible layer, we
activate the hidden units and try to reconstruct the visible layer; this is very
similar to what we do in an RBM, where during the data-driven phase we create
the hidden activations and during the model-driven phase we try to reconstruct
the data. But the RBM does this probabilistically, whereas here we use a
deterministic feed-forward mapping and we train the autoencoder using
back-propagation, not contrastive divergence. If we want to generate data
with an RBM we just start from a random activation over the units and perform
Gibbs sampling until convergence, but an AE given no input produces
no output: it cannot generate data by itself.
Variational Autoencoders
In standard autoencoders, the latent space can be extremely irregular
(close points in latent space can produce very different, often
meaningless, patterns over the visible units), so usually we cannot
implement a generative process that simply samples a vector from the
latent space and passes it through the decoder. To fix this issue we
make the mapping probabilistic: the encoder returns a distribution over
the latent space instead of a single point, and the loss function has an
additional regularization term in order to ensure a "better organization"
of the latent space.
β-VAEs
weight the KL-divergence term with a
hyperparameter β > 1 that balances latent channel capacity and
independence constraints against reconstruction accuracy.
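A minimal sketch of the VAE loss under the usual Gaussian assumptions: a reconstruction term plus the KL divergence between the encoder distribution and a standard normal prior; β = 1 gives the standard VAE, β > 1 the β-VAE (names are illustrative).

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction term plus (beta-weighted) KL divergence to N(0, I).

    mu, log_var : mean and log-variance returned by the encoder for this input
    beta        : 1.0 for a standard VAE, > 1 for a beta-VAE
    """
    reconstruction = np.sum((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return reconstruction + beta * kl
```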

Generative Adversarial Networks


Game-theoretic perspective: learning is a minimax two-player game,
with a value function that one agent seeks to maximize and the other
seeks to minimize:
min_G max_D V(D, G) = E_x~p_data[ log D(x) ] + E_z~p_z[ log(1 - D(G(z))) ]
• D(x) represents the probability that x comes from the true data
distribution rather than from the Generator
• D wants to maximize the probability of assigning the correct label
to both x and G(z)
• G wants to minimize the probability that D assigns the correct
label to G(z)
• The Discriminator parameters are updated by ascending its
stochastic gradient
• The Generator parameters are instead updated by descending its
stochastic gradient
• The goal is to reach the equilibrium where the Generator produces
data indistinguishable from the training distribution, and the
Discriminator predicts "true" or "fake" with probability 0.5
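A minimal sketch of the two objectives (illustrative names): d_real and d_fake are assumed to be the Discriminator outputs on real samples x and on generated samples G(z); ascending the Discriminator objective is the same as descending the negated loss below.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """D maximizes log D(x) + log(1 - D(G(z))); return the quantity to minimize."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """G minimizes log(1 - D(G(z))) (often replaced in practice by maximizing log D(G(z)))."""
    return np.mean(np.log(1.0 - d_fake))
```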
Advanced GAN architectures
1) CGAN (Conditional GAN): add class information to the latent space in
order to allow the generation of patterns from a specific class
2) ACGAN (Auxiliary Classifier GAN): include the class information
also in the output in order to challenge the discriminator (whether the image
is fake or real, and also from which class it is sampled)
3) InfoGAN (Information Maximizing GAN): add specific semantic
meaning to the variables in the latent space, by adding a
regularization term to the GAN loss that maximizes the mutual
information between the latent variables and the Generator
distribution
4) WGAN (Wasserstein GAN ) The discriminator is replaced by a
“critic” that, rather than classifying, scores the realness or fakeness of
a given image
5) CycleGAN : Learn image-to-image translation without the need of
paired training images
6) StyleGAN: the architecture of the Generator is significantly changed
in order to allow more detailed control of the generation process
The aim of reinforcement learning is to maximize the accumulated
rewards. We need to introduce a discount rate, because we aim at
maximizing reward over the entire agent life: we have to balance between
immediate rewards and long-term rewards, focusing more on
immediate rewards.
Policies and Value Functions:
The value function estimates the value of every state. To calculate it
we consider the immediate reward plus an estimate of all the future return
that can be obtained from the next state, where the discount factor γ specifies
how much we should care about future rewards.
The policy (rule) specifies the action the agent will choose in each state.

Temporal-difference learning: learn an estimate of the value function by
exploiting a prediction-error signal between two successive states (we start
from a random estimate of the value of a state, then choose an action, obtain
the reward, and then decide whether my value estimate is correct or whether
I need to update it).

After the transition to the next state, I choose an action and then I update
the value function to bring it closer to the true value that I observed. In
the future, if the agent knows the environment well, the value function will
be perfect: you will always get the rewards you expect, so your
expectation over rewards is accurate and you do not need to change your
values.
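A minimal sketch of the TD(0) update: the prediction error is the difference between the observed reward plus the discounted value of the next state and the current estimate (names are illustrative).

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move V(s) toward the observed target r + gamma * V(s_next)."""
    td_error = r + gamma * V[s_next] - V[s]   # prediction-error signal
    V[s] += alpha * td_error
    return V
```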

Q-learning: with this value, we are not just estimating the value of each
state, we are estimating the value of each state-action pair. By relating
actions to states, Q-learning allows us to consider the policy used for
predicting future rewards (e.g. the greedy max policy). We usually start
from a default estimate of the value function for each state-action pair,
which we call the Q-value, and improve it over time.
Exploration vs. Exploitation: exploration is crucial in the initial
phases: the agent needs to find out as much as it can about the
environment. If the agent is too greedy during the first episodes, it might
get stuck in a local maximum. To better explore the environment,
the action is usually chosen probabilistically.
1) ε-greedy: a stochastic policy to encourage exploration:
with some probability we take a random action, and this
allows us to explore other actions that may give better outcomes
2) Softmax: if you have many possible actions, with the max greedy policy
you always choose the action with the highest value; with softmax you
sample from the distribution of action values. You can make your choice
more stochastic by adding a temperature: at high temperature all the actions
have almost the same probability of being sampled, so the behaviour is
totally random, while at low temperature the behaviour becomes more and more
deterministic.
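A minimal sketch of ε-greedy action selection (the behaviour policy) and the off-policy Q-learning update, which uses the greedy max over the next state's Q-values (names are illustrative; Q is a state-by-action table):

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    """Behaviour policy: random action with probability epsilon, else greedy."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Update policy: greedy max over the next state's Q-values (off-policy)."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```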
On-policy vs. Off-policy learning:
• We have an update policy for estimating the future rewards and a
behaviour policy for selecting the next action; if the two policies are the
same (the policy used for planning and the policy used for performing the
action), the learning algorithm is on-policy.
• SARSA is an on-policy algorithm: it uses the stochastic behaviour policy
also to update its estimates.
• Q-learning is off-policy: it uses a stochastic behaviour policy to
improve exploration, but a greedy update policy. When we have to
choose the present action we can use ε-greedy or softmax, but
during planning, when we update our Q-values, we use the
deterministic max policy.
The problem with using neural networks as Q-value estimators is
instability; two common remedies:
1. Maintain two networks: a prediction network that is trained at each
step, and a target network used for action selection, which is
updated only periodically.
2. Use a memory buffer to store previous learning episodes and sample
randomly from the buffer: this decorrelates the examples and stabilizes
learning; the replay memory can be prioritized to show more important
transitions more often.
Policy gradient methods: the policy parameters are learned by
ascending the gradient of some scalar performance measure, where the
performance is usually estimated by sampling possible trajectories
("roll-outs") containing sequences of states and actions (we need to
guess what the outcome of our actions could be, and this is specified by the
policy parameters; we sample some possible future actions in order to
estimate the possible long-term return).
Since we are optimizing a continuous function using gradient ascent, we can
use a lot of theory to improve the convergence, so these methods have
stronger convergence properties compared to traditional methods like
Q-learning.
