With a filter we slide over our input (image) to detect edges and generate feature maps.
The kernel moves over the input data by the stride value: if the stride is 2, the kernel moves by 2 columns of pixels in the input matrix. In short, the kernel is used to extract features such as edges from the image.
Each hidden neuron has a local receptive field encoding a specific feature. The number of hidden neurons defines how many features (kernels) will be represented at each processing layer. The kernel size defines the receptive field of the kernel, and each kernel generates a feature map as output.
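As a minimal sketch of this idea (single-channel image, one hand-made kernel, NumPy used only for illustration), a convolution with a given stride could look like:

    import numpy as np

    def conv2d(image, kernel, stride=1):
        # Slide the kernel over the image with the given stride and
        # produce one feature map (no padding, single channel).
        kh, kw = kernel.shape
        H, W = image.shape
        out_h = (H - kh) // stride + 1
        out_w = (W - kw) // stride + 1
        fmap = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                fmap[i, j] = np.sum(patch * kernel)  # local receptive field
        return fmap

    # a simple vertical-edge kernel applied with stride 2
    edge_kernel = np.array([[1., 0., -1.],
                            [1., 0., -1.],
                            [1., 0., -1.]])
    feature_map = conv2d(np.random.rand(8, 8), edge_kernel, stride=2)  # shape (3, 3)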
Adversarial attack: to produce an adversarial image, we take the original image and add a small fraction of noise, computed using the gradient. We look at the direction of the gradient of the loss with respect to the input and perturb the image in the direction that increases the misclassification (the opposite of what we do with the weights during training, where we descend the gradient to reduce the loss). So, using only gradient information, we change the input. This fools a deep network by injecting a small percentage of noise into the input. The noise must be crafted carefully: the image should be modified in the direction of the gradient, maximizing the loss function with respect to the input image.
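A minimal sketch of this procedure (the fast gradient sign method; model, loss_fn and the epsilon value are placeholders, PyTorch is assumed):

    import torch

    def fgsm_attack(model, loss_fn, image, label, epsilon=0.01):
        # Perturb the input in the direction of the gradient of the loss
        # with respect to the image, so that the loss increases.
        image = image.clone().detach().requires_grad_(True)
        loss = loss_fn(model(image), label)
        loss.backward()
        adversarial = image + epsilon * image.grad.sign()
        return adversarial.clamp(0, 1).detach()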
Defenses against adversarial attacks:
1. Adversarial training: starting from our training database (e.g., ImageNet), we create adversarial examples against our system and train it again using those adversarial samples, i.e., we augment the training set with adversarial samples. (We have our model, somebody attacks it with an adversarial sample, we train the model again on this sample and get a better model; it gets attacked again, we get another sample and train again... so the model becomes more and more robust.)
2. Randomized smoothing: we add some noise to the network, either to the input or to the weights, so we get a smoothed version of the network whose prediction is obtained as the average of the predictions in a neighbourhood of the input. We perturb the image and look at the prediction: if the prediction is consistent across all perturbations, the model is quite robust; if a small perturbation already changes the prediction, the input is very close to the classification boundary (see the sketch after this list).
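A minimal sketch of the randomized-smoothing prediction (a classifier model returning logits and Gaussian input noise are assumed; this is only the averaging step, not a full certification procedure):

    import torch

    def smoothed_predict(model, image, sigma=0.1, n_samples=100):
        # Average the predictions over noisy copies of the input: if the
        # prediction stays consistent, the input is far from the boundary.
        probs = 0.0
        for _ in range(n_samples):
            noisy = image + sigma * torch.randn_like(image)
            probs = probs + torch.softmax(model(noisy), dim=-1)
        return (probs / n_samples).argmax(dim=-1)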
Backpropagation through time: a recurrent network can be seen as a very deep network with parameter sharing. Rather than computing the gradient only at the end of the sequence, we can break the sequence down into shorter pieces and update the weights after some steps, to avoid vanishing gradients (truncated backpropagation through time). We have a series of correlated inputs that we want to process sequentially: we process the first time step and create the hidden activation, then we process the next input using the previous context value to create the new hidden activation, and so on. The last hidden state acts as an encoding of the whole sequence.
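A minimal sketch of truncated backpropagation through time (rnn_cell, optimizer, loss_fn and the data are placeholders; PyTorch is assumed): the hidden state is detached every k steps so the gradient does not flow through the whole unrolled network.

    import torch

    def truncated_bptt(rnn_cell, optimizer, loss_fn, inputs, targets, k=10):
        # Process the sequence in chunks of k steps: backpropagate and update
        # the weights after every chunk, then detach the hidden state so the
        # gradient does not flow through the whole (very deep) unrolled network.
        hidden = torch.zeros(1, rnn_cell.hidden_size)
        for start in range(0, len(inputs), k):
            optimizer.zero_grad()
            loss = 0.0
            for x, y in zip(inputs[start:start + k], targets[start:start + k]):
                hidden = rnn_cell(x, hidden)      # reuse the previous context
                loss = loss + loss_fn(hidden, y)
            loss.backward()
            optimizer.step()
            hidden = hidden.detach()              # cut the graph here
        return hidden                             # the last state encodes the sequence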
A variety of sequence learning settings:
Sequence-to-sequence: each input is followed by a prediction
Encoder-decoder: we try to encode all the information over the sequence into a fixed-size vector and then decode it to produce the output
Sequence-to-output: we have an input sequence, we accumulate the information in the context (hidden layer), and then only one output is created
Learning long-term dependencies is very hard in practice due to vanishing gradients.
Long Short-Term Memory (LSTM) networks: we add to the basic RNN architecture a set of gated units, which learn how much information should be retained in the temporary memory and how far forward it should be propagated. (The idea of gating is selecting which information should be deleted from or maintained in the network state.) There are three gates with which we decide which information is deleted or maintained:
Input gate: decides how much of the new input flows into the memory cell, i.e., what new information we are going to store in the cell state
Output gate: decides what part of the cell state is exposed as the output (hidden state)
Forget gate: decides what information we want to throw away from the cell state
Cell update: we alter the current state by deleting something from the previous state according to the forget factor and adding the new information
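A minimal NumPy sketch of one LSTM step, showing the role of the gates (the weight matrices W, U and biases b are assumed to be given, one set per gate):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W, U, b: dicts of parameters for the gates 'f', 'i', 'o'
        # and the candidate cell content 'g'.
        f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate
        i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate
        o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate
        g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate content
        c = f * c_prev + i * g   # cell update: forget part of the old state, add the new information
        h = o * np.tanh(c)       # the output gate decides what is exposed
        return h, c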
Attention mechanism in LSTM:
We can selectively search for useful past information during decoding by taking a weighted sum of all the encoder hidden states and passing it to the decoder hidden state. We select the most important hidden states. In this mechanism the input is weighted selectively: at every time step we give importance to a different portion of the input. The scores do not necessarily give the highest importance to the first portion of the input; it depends on the structure of our system.
With the encoder, the whole sequence is compressed into a single vector, the hidden state (the encoder saves all its hidden states in a buffer and passes them to the decoder).
The decoder looks back at the hidden states and gives a score to each of them; the score is used to amplify the most relevant hidden states, and then every hidden state is combined through a weighted sum: each hidden state is weighted by its score and all the weighted hidden states are summed.
The resulting vector is the context vector, which gives us more information about the most important parts of the original sequence and is used to generate the output.
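A minimal sketch of this scoring and weighting step (dot-product scores are assumed; encoder_states is a matrix with one hidden state per row, decoder_state is the current decoder hidden state):

    import numpy as np

    def attention_context(decoder_state, encoder_states):
        # Score every encoder hidden state against the current decoder state,
        # turn the scores into weights (softmax), and build the context vector
        # as the weighted sum of the encoder hidden states.
        scores = encoder_states @ decoder_state     # one score per hidden state
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()           # softmax over the scores
        context = weights @ encoder_states          # weighted sum
        return context, weights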
Autoencoders: a deterministic model (i.e., not probabilistic) trained using error backpropagation, like feed-forward networks; the only difference is the output layer. In this case the output layer does not represent a class and we do not have to activate the correct class: the output layer is simply a copy of the input. The goal of this architecture is to take an input, compress it and then decompress it, recreating the same input as the output. The loss function for an autoencoder is called the reconstruction error; with MSE we simply average it over all the training patterns. So if our autoencoder is really good, the output is similar to the input and the reconstruction error is small.
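A minimal PyTorch sketch of this architecture (layer sizes are arbitrary), trained with MSE as the reconstruction error:

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, n_in=784, n_hidden=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
            self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

        def forward(self, x):
            return self.decoder(self.encoder(x))   # compress, then decompress

    model = Autoencoder()
    loss_fn = nn.MSELoss()                          # reconstruction error
    x = torch.rand(16, 784)                         # a batch of inputs
    loss = loss_fn(model(x), x)                     # the target is the input itself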
Gibbs sampling: a form of Markov chain sampling where, rather than resampling the whole network state, only one variable is resampled at each iteration, conditioning on all the other variables: we sample x_i from p(x_i | x_-i), where x_-i indicates the set containing all the variables except x_i (the transition of one variable depends on all the other variables). We can speed up the sampling by exploiting the graph structure:
we can ignore variables outside the Markov blanket, and conditionally independent variables can be sampled at the same time (block Gibbs sampling).
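A minimal sketch of one Gibbs-sampling sweep for binary units (a pairwise model with symmetric weights W, zero diagonal, and biases b is assumed), resampling one variable at a time conditioned on all the others:

    import numpy as np

    def gibbs_sweep(s, W, b, rng):
        # s: current binary state vector.
        # Each unit is resampled given the current value of all the others.
        for i in range(len(s)):
            p_on = 1.0 / (1.0 + np.exp(-(W[i] @ s + b[i])))
            s[i] = 1 if rng.random() < p_on else 0
        return s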
Boltzmann machines:
Boltzmann machines are a stochastic variant of Hopfield networks that exploit hidden units to learn higher-order correlations (“features”) from the data distribution.
Data-driven inference: we observe some evidence, i.e., some variables (the input pattern) are clamped to the visible units, and we try to estimate the best configuration for the hidden units, exploiting the hidden units that are most consistent with what we observed.
start with an input pattern clamped on the visible units, while the hidden units have a random initial activation
iteratively update the activation value of each hidden neuron, one at a time, until the network reaches equilibrium (Gibbs sampling); we keep resampling until we reach convergence
Model-driven sampling:
start with a random activation on all units
iteratively update the activation value of each neuron, one at a
time, until the network reaches equilibrium (Gibbs sampling)
Boltzmann machines: learning
Learning is based on maximum-likelihood: we want to assign high
probability (low energy) to the observed data points, and low probability
to the remaining points
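In its standard form (reported here as a reminder, since the notes do not give the formula), the resulting learning rule compares a data-driven statistic with a model-driven one:

    \Delta w_{ij} \propto \langle s_i s_j \rangle_{data} - \langle s_i s_j \rangle_{model}

i.e., the correlations measured with the data clamped on the visible units minus the correlations measured when the network runs freely.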
Generative Adversarial Networks (GANs):
D(x) represents the probability that x comes from the true data distribution rather than from the Generator
D wants to maximize the probability of assigning the correct label
to both x and G(z)
G wants to minimize the probability that D assigns the correct
label to G(z)
The Discriminator parameters are updated by ascending its
stochastic gradient
The Generator parameters are instead updated by descending its
stochastic gradient
the goal is to achieve the equilibrium where the Generator produces data indistinguishable from the training distribution, and the Discriminator can no longer tell “true” from “fake”, assigning probability 0.5 to every input
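This two-player game corresponds to the usual GAN minimax objective:

    \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]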
Advanced GAN architectures
1) CGAN (Conditional GAN): add class information to the latent space in order to allow the generation of patterns from a specific class
2) ACGAN (Auxiliary Classifier GAN): include the class information also in the output in order to challenge the Discriminator (it must say whether the image is fake or real and also from which class it is sampled)
3) InfoGAN (Information Maximizing GAN): add specific semantic
meaning to the variables in the latent space, by adding a
regularization term to the GAN loss that maximizes the mutual
information between the latent variables and the Generator
distribution
4) WGAN (Wasserstein GAN): the Discriminator is replaced by a “critic” that, rather than classifying, scores the realness or fakeness of a given image
5) CycleGAN : Learn image-to-image translation without the need of
paired training images
6) StyleGAN: the architecture of the Generator is significantly changed in order to allow for more detailed control of the generation process
The aim of reinforcement learning is maximizing the accumulated rewards. We need to introduce a discount rate because we aim at maximizing the reward over the entire life of the agent: we have to balance immediate rewards and long-term rewards, giving more weight to immediate rewards.
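In the usual notation, the quantity the agent maximizes is the discounted return

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma < 1

where values of γ below 1 make immediate rewards count more than distant ones.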
Policies and value functions:
The value function estimates the value of every state. To calculate it, we consider the immediate reward plus an estimate of all the future return that can be obtained from the next state, where the discount factor γ specifies how much we should care about future rewards.
A policy is the rule which specifies the action the agent will choose in each state.
After the transition to the next state, I choose an action and then I update the value function to bring it closer to the true value that I observed. In the future, if the agent knows the environment well, the value function will be accurate: you will always get the rewards that you expect, so your expectation over rewards is accurate and you do not need to change your value estimates.
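This is the standard temporal-difference update (α is the learning rate):

    V(s) \leftarrow V(s) + \alpha \, [\, r + \gamma V(s') - V(s) \,]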
Q-learning: with this method we are not just estimating the value of each state, we are estimating the value of each state-action pair. By relating actions to states, Q-learning allows us to take into account the policy used for predicting future rewards (e.g., the greedy max policy). We usually start from a default estimate of the value function for each state-action pair, which we call the Q-value, and improve it over time.
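The corresponding standard Q-learning update, applied to state-action pairs, is:

    Q(s, a) \leftarrow Q(s, a) + \alpha \, [\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,]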
Exploration vs. exploitation: exploration is crucial in the initial phases, because the agent needs to find out as much as it can about the environment. If the agent is too greedy during the first episodes, it might get stuck in a local maximum. To better explore the environment, the action is usually chosen probabilistically.
1) ε-greedy: a stochastic policy that encourages exploration, because with some probability we take a random action, and this allows us to explore other actions that may give better outcomes
2) Softmax: if you have many possible actions, with the max greedy policy you always choose the action with the highest value; with softmax you sample from the distribution over the action values. You can make your choice more or less stochastic by adding a temperature: with a high temperature all the actions have almost the same probability of being sampled, so the behaviour is almost random; with a low temperature the behaviour becomes more and more deterministic.
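A minimal sketch of the two action-selection rules (q_values is the row of Q-values for the current state; rng is a NumPy random generator):

    import numpy as np

    def epsilon_greedy(q_values, epsilon, rng):
        # With probability epsilon take a random action, otherwise the greedy one.
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))

    def softmax_action(q_values, temperature, rng):
        # High temperature -> almost uniform (random behaviour),
        # low temperature  -> almost deterministic (greedy behaviour).
        z = np.asarray(q_values) / temperature
        p = np.exp(z - z.max())
        p = p / p.sum()
        return int(rng.choice(len(q_values), p=p))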
On-policy vs. off-policy learning:
We have an update policy for estimating the future rewards and a behaviour policy for selecting the next action; if the two policies are the same (the policy used for planning and the policy used for performing the action), the learning algorithm is on-policy.
SARSA is an on-policy algorithm: it uses the stochastic behaviour policy also for updating its estimates.
Q-learning, instead, is off-policy: it uses a stochastic behaviour policy to improve exploration, but a greedy update policy. When we have to choose the current action we can use ε-greedy or softmax, but during planning, when we are updating our Q-values, we use the deterministic max policy.
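In formulas, SARSA replaces the max in the Q-learning update with the value of the action a' actually chosen by the behaviour policy:

    Q(s, a) \leftarrow Q(s, a) + \alpha \, [\, r + \gamma Q(s', a') - Q(s, a) \,]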
The problem of using neural networks as Q-value estimators is
instability
1. Maintain two networks: a prediction network that is trained at each step, and a target network used to compute the update targets, which is updated only every few iterations by copying the weights of the prediction network.
2. Use a memory buffer to store previous learning episodes; we sample randomly from the buffer, which decorrelates the examples and stabilizes learning. The replay memory can be prioritized to show more important transitions more often.
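A minimal sketch of a (non-prioritized) replay buffer as described in point 2 (capacity and batch size are arbitrary):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)    # oldest transitions are dropped

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # Random sampling decorrelates the examples and stabilizes learning.
            return random.sample(self.buffer, batch_size)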
Policy gradient methods: the policy parameters are learned by ascending the gradient of some scalar performance measure, where the performance is usually estimated by sampling possible trajectories (“roll-outs”) containing sequences of states and actions (we need to guess what the outcome of our actions could be, and this is specified by the policy parameters; we sample some possible future actions in order to estimate the possible long-term return).
Since we are optimizing a continuous function by gradient ascent, we can use a lot of theory to improve the convergence, so this method has stronger convergence properties compared to traditional methods like Q-learning.
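In the standard REINFORCE formulation, the performance gradient estimated from sampled roll-outs is

    \nabla_\theta J(\theta) \approx \sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

where G_t is the return collected from time t along the sampled trajectory.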