VIETNAM NATIONAL UNIVERSITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

PROJECT 7

1. Phạm Huy Thiên Phúc      Student ID: 2053346
2. Trương Huỳnh Đăng Khoa   Student ID: 2053145
3. Hoàng Huy Minh           Student ID: 2053212
4. Hoàng Thảo Lan Chi       Student ID: 2052337
Contents of the project
1. What is Deep Learning?
2. The history and the development of Deep Learning
3. Some prominent architectures of Deep Learning
4. Some recent outstanding achievements of Deep Learning
References
1. What is Deep Learning?
Deep learning is a subset of machine learning in which artificial neural networks with many layers learn representations directly from large amounts of data. Deep learning drives many artificial intelligence (AI) applications and services that improve automation, performing analytical and physical tasks without human intervention. Deep learning technology lies behind everyday products and services (such as digital assistants, voice-enabled TV remotes, and credit card fraud detection) as well as emerging technologies (such as self-driving cars).
2. The history and the development of Deep Learning:
ANNs started with the work of McCulloch and Pitts, who showed that sets of simple units (artificial neurons) could perform all possible logic operations and thus be capable of universal computation. This work was contemporaneous with that of Von Neumann and Turing, who first dealt with statistical aspects of the information processing of the brain and how to build a machine capable of reproducing them. Frank Rosenblatt invented the perceptron machine to perform simple pattern classification. However, this new learning machine was incapable of solving simple problems, such as the logical XOR function. In 1969, Minsky and Papert showed that perceptrons had intrinsic limitations that could not be transcended, which led to a fading enthusiasm for ANNs. In 1982, John Hopfield proposed a special type of ANN (the Hopfield network) and proved that it had powerful pattern-completion and memory properties. The backpropagation algorithm was first described by Linnainmaa, S. (1970) as the representation of the cumulative rounding error of an algorithm (as a Taylor expansion of the local rounding errors), without reference to neural networks. In 1985, Rumelhart, McClelland, and Hinton rediscovered this powerful learning rule, which allowed them to train ANNs with several hidden units, thus overcoming Minsky and Papert's criticism.
Year   Contributor                                    Contribution
2008   Andrew Ng's group                              GPUs for training deep neural networks
2019   Yoshua Bengio, Geoffrey Hinton, Yann LeCun     Turing Award 2018 for their immense contributions to advancements in deep learning
3. Some prominent architectures of Deep Learning:
3.1. Supervised Learning:
3.1.1. Convolutional neural networks:
A CNN is a multilayer neural network that was biologically inspired by the animal visual cortex. The architecture is particularly useful in image-processing applications. The first CNN was created by Yann LeCun; at the time, the architecture focused on handwritten character recognition, such as postal code interpretation. As a deep network, early layers recognize features (such as edges), and later layers recombine these features into higher-level attributes of the input.
The LeNet CNN architecture is made up of several layers that implement feature extraction and then classification (see figure 3.1.1). The image is divided into receptive fields that feed into a convolutional layer, which then extracts features from the input image. The next step is pooling, which reduces the dimensionality of the extracted features (through down-sampling) while retaining the most important information (typically, through max pooling). Another convolution and pooling step is then performed that feeds into a fully connected multilayer perceptron. The final output layer of this network is a set of nodes that identify features of the image (in this case, a node per identified number). You train the network by using backpropagation.
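As a concrete illustration of that convolution, pooling, and fully connected pipeline, the sketch below builds a LeNet-style network. It is a minimal example assuming PyTorch and the classic 32x32 grayscale input; the layer sizes are illustrative, not taken from the original LeNet work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNetLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)    # feature extraction
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.pool = nn.MaxPool2d(2)                     # down-sampling (max pooling)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)           # fully connected classifier
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)           # one node per identified number

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))            # convolution -> pooling
        x = self.pool(F.relu(self.conv2(x)))            # convolution -> pooling again
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                              # trained end to end with backpropagation

model = LeNetLike()
logits = model(torch.randn(1, 1, 32, 32))               # one 32x32 grayscale image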
3.1.2. Recurrent neural networks:
The RNN is one of the foundational network architectures from which other deep learning architectures are built. The primary difference between a typical multilayer network and a recurrent network is that rather than completely feed-forward connections, a recurrent network might have connections that feed back into prior layers (or into the same layer). This feedback allows RNNs to maintain memory of past inputs and model problems that unfold over time.
RNNs consist of a rich set of architectures (we'll look at one popular topology called LSTM next). The key differentiator is feedback within the network, which could manifest itself from a hidden layer, the output layer, or some combination thereof.
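The feedback loop can be shown in a few lines. The following sketch (assuming PyTorch tensors; the weight names W_xh and W_hh are our own, not from the report) updates a hidden state step by step, feeding the previous state back in with each new input.

import torch

torch.manual_seed(0)
inputs = torch.randn(5, 8)          # a sequence of 5 time steps, 8 features each
W_xh = torch.randn(8, 16) * 0.1     # input-to-hidden weights
W_hh = torch.randn(16, 16) * 0.1    # hidden-to-hidden (feedback) weights
b_h = torch.zeros(16)

h = torch.zeros(16)                  # the network's "memory" of past inputs
for x_t in inputs:
    # the previous hidden state h is fed back in alongside the current input
    h = torch.tanh(x_t @ W_xh + h @ W_hh + b_h)
print(h.shape)                       # torch.Size([16])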
3.1.3. LSTM networks:
The LSTM memory cell contains three gates that control how information flows into or out of the cell. The input gate controls when new information can flow into the memory. The forget gate controls when an existing piece of information is forgotten, allowing the cell to remember new data. Finally, the output gate controls when the information that is contained in the cell is used in the output from the cell. The cell also contains weights, which control each gate. The training algorithm, commonly backpropagation through time (BPTT), optimizes these weights based on the resulting network output error.
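A single LSTM cell update with the three gates can be sketched as follows (an illustrative PyTorch fragment with assumed sizes and weight names; bias terms are omitted for brevity).

import torch

torch.manual_seed(0)
n_in, n_hidden = 8, 16
# one weight matrix per gate plus the candidate update (sizes are illustrative)
W_i, W_f, W_o, W_g = (torch.randn(n_in + n_hidden, n_hidden) * 0.1 for _ in range(4))

def lstm_cell(x_t, h_prev, c_prev):
    z = torch.cat([x_t, h_prev])                 # current input plus previous output
    i = torch.sigmoid(z @ W_i)                   # input gate: let new information in
    f = torch.sigmoid(z @ W_f)                   # forget gate: drop old cell contents
    o = torch.sigmoid(z @ W_o)                   # output gate: expose the cell state
    g = torch.tanh(z @ W_g)                      # candidate new information
    c = f * c_prev + i * g                       # updated cell memory
    h = o * torch.tanh(c)                        # what the cell outputs
    return h, c

h = c = torch.zeros(n_hidden)
for x_t in torch.randn(5, n_in):                 # run the cell over a short sequence
    h, c = lstm_cell(x_t, h, c)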
Recent applications of CNNs and LSTMs produced image and video captioning systems in which an image or video is captioned in natural language. The CNN implements the image or video processing, and the LSTM is trained to convert the CNN output into natural language.
3.1.4. GRU networks:
In 2014, a simplification of the LSTM was introduced called the gated recurrent unit (GRU). This model has two gates, getting rid of the output gate present in the LSTM model. These gates are an update gate and a reset gate. The update gate indicates how much of the previous cell contents to maintain. The reset gate defines how to incorporate the new input with the previous cell contents. A GRU can model a standard RNN simply by setting the reset gate to 1 and the update gate to 0.
The GRU is simpler than the LSTM, can be trained more quickly, and can be more efficient in its execution. However, the LSTM can be more expressive and, with more data, can lead to better results.
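A single GRU update can be sketched similarly (illustrative PyTorch code with assumed sizes and names; it follows the report's convention that the update gate says how much of the previous contents to keep, so reset = 1 and update = 0 reduces to a plain RNN).

import torch

torch.manual_seed(0)
n_in, n_hidden = 8, 16
W_z, W_r, W_h = (torch.randn(n_in + n_hidden, n_hidden) * 0.1 for _ in range(3))

def gru_cell(x_t, h_prev):
    z = torch.sigmoid(torch.cat([x_t, h_prev]) @ W_z)          # update gate: keep previous contents
    r = torch.sigmoid(torch.cat([x_t, h_prev]) @ W_r)          # reset gate: how to mix in the new input
    h_cand = torch.tanh(torch.cat([x_t, r * h_prev]) @ W_h)    # candidate contents
    return z * h_prev + (1 - z) * h_cand                       # reset = 1, update = 0 gives a plain RNN

h = torch.zeros(n_hidden)
for x_t in torch.randn(5, n_in):
    h = gru_cell(x_t, h)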
Example applications: Natural language text compression, handwriting recognition, speech recognition, gesture recognition, image captioning.
3.2. Unsupervised Learning:
3.2.1. Self-organized maps:
The self-organized map (SOM), also popularly known as the Kohonen map, is an unsupervised neural network that creates clusters of the input data set by reducing the dimensionality of the input. SOMs differ from traditional artificial neural networks in several ways: they are trained with competitive learning rather than error-correction learning such as backpropagation, and they preserve the topological properties of the input space on a low-dimensional grid of units.
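One common way a SOM is trained, sketched below in NumPy under our own simplifying assumptions (grid size, learning rate, and neighbourhood function are illustrative), is to find the best-matching unit for an input vector and pull that unit and its grid neighbours toward the input.

import numpy as np

grid_h, grid_w, dim = 10, 10, 3                       # 10x10 map of 3-dimensional weight vectors
weights = np.random.rand(grid_h, grid_w, dim)

def train_step(x, lr=0.5, radius=2.0):
    dists = np.linalg.norm(weights - x, axis=2)       # distance of x to every unit
    bmu = np.unravel_index(np.argmin(dists), dists.shape)   # best-matching unit
    rows, cols = np.indices((grid_h, grid_w))
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2 * radius ** 2))     # neighbourhood function on the grid
    weights[:] = weights + lr * influence[..., None] * (x - weights)

train_step(np.random.rand(3))                         # one update for one random input vector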
3.2.2. Autoencoders:
Though the history of when autoencoders were invented is hazy, the basic variant of an autoencoder is composed of three layers: input, hidden, and output layers.
First, the input layer is encoded into the hidden layer using an appropriate encoding function. The number of nodes in the hidden layer is much smaller than the number of nodes in the input layer. This hidden layer contains the compressed representation of the original input. The output layer aims to reconstruct the input layer by using a decoder function.
During the training phase, the difference between the input and the output layer is calculated using an error function, and the weights are adjusted to minimize the error. Unlike traditional unsupervised learning techniques, where there is no data to compare the outputs against, autoencoders learn continuously using backpropagation. For this reason, autoencoders are classified as self-supervised algorithms.
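A minimal three-layer autoencoder of the kind described above might look like the following PyTorch sketch (layer sizes such as 784 inputs and 32 hidden nodes are illustrative assumptions, not from the report).

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=32):
        super().__init__()
        self.encoder = nn.Linear(n_in, n_hidden)    # far fewer nodes than the input layer
        self.decoder = nn.Linear(n_hidden, n_in)    # reconstructs the input

    def forward(self, x):
        code = torch.sigmoid(self.encoder(x))       # compressed representation
        return torch.sigmoid(self.decoder(code))

model = Autoencoder()
x = torch.rand(64, 784)
loss = nn.MSELoss()(model(x), x)   # error between the input and its reconstruction
loss.backward()                    # weights adjusted by backpropagation to minimize the error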
Example applications: Dimensionality reduction, data interpolation, and data compression/decompression.
3.2.3. Restricted Boltzmann Machines:
An RBM is a two-layered neural network. The layers are the input (visible) and hidden layers. As shown in the following figure, in an RBM every node in the hidden layer is connected to every node in the visible layer. In a traditional Boltzmann Machine, nodes within the input and hidden layers are also connected to each other. Because of the computational complexity this causes, nodes within a layer are not connected in a Restricted Boltzmann Machine.
During the training phase, RBMs calculate the probability distribution of the training set using a stochastic approach. When the training begins, each neuron gets activated at random. The model also contains hidden and visible biases. While the hidden bias is used in the forward pass to build the activation, the visible bias helps in reconstructing the input.
Because in an RBM the reconstructed input is always different from the original input, RBMs are also known as generative models.
Also, because of the built-in randomness, the same input can produce different outputs. In fact, this is the most significant difference from an autoencoder, which is a deterministic model.
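The stochastic forward pass and reconstruction described above are commonly trained with contrastive divergence; the report does not name the algorithm, so the NumPy sketch below, showing one CD-1 step for a tiny RBM, is our own illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)          # visible bias, used when reconstructing the input
b_h = np.zeros(n_hidden)           # hidden bias, used in the forward pass

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, lr=0.1):
    # forward pass: stochastically activate the hidden units
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    # reconstruction: the rebuilt visible layer differs from the original input
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # nudge the weights toward making the training data more probable
    W[:] += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))

cd1_step(rng.integers(0, 2, n_visible).astype(float))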
Example applications: Dimensionality reduction and collaborative filtering
4. Some recent outstanding achievements of Deep Learning:
4.1. Video-to-video synthesis:
In 2018, Ting-Chun Wang and others announced a new video-to-video synthesis approach. In this approach, they aim to turn a segmented input source video into an output photorealistic video that precisely depicts the content of the source video. The result is high-resolution, photorealistic, temporally coherent video on a diverse set of input formats, including segmentation masks, sketches, and poses. They achieve this by using a generator network to create the images, one discriminator network that checks whether each generated image looks realistic on its own, and another discriminator that judges whether the sequence of images is coherent enough to pass as a video.
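At a purely conceptual level, the generator-plus-two-discriminators idea can be sketched as below. This is our own illustration, not the authors' vid2vid code: generator, image_disc, and video_disc are hypothetical stand-in modules, and the loss is a generic GAN-style objective.

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def generator_loss(generator, image_disc, video_disc, seg_frames):
    # The generator maps segmentation frames to photorealistic frames and is
    # rewarded when both discriminators judge its output to be real.
    fake_frames = generator(seg_frames)
    frame_scores = image_disc(fake_frames)                     # per-frame realism scores
    video_scores = video_disc(fake_frames)                     # whole-sequence realism score
    return (bce(frame_scores, torch.ones_like(frame_scores))
            + bce(video_scores, torch.ones_like(video_scores)))

# Tiny stand-in modules so the sketch runs end to end; the real networks are deep CNNs.
generator = nn.Conv3d(1, 3, kernel_size=1)                               # label channel -> RGB frames
image_disc = nn.Conv3d(3, 1, kernel_size=1)                              # a score per location and frame
video_disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 4 * 8 * 8, 1))    # a score per clip
seg_frames = torch.rand(2, 1, 4, 8, 8)                                   # 2 clips, 4 frames of 8x8 labels
loss = generator_loss(generator, image_disc, video_disc, seg_frames)
loss.backward()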
4.1.1. Semantic Labels → Cityscapes Street Views:
Starting from a video in some source domain, they synthesize a new video in a target domain using a learned network. Semantic labels allow them to edit or create content in a convenient input domain and generate a video in an output domain that is harder to edit or create directly.
The network can synthesize multiple results given the same input, or it can be manipulated to generate the desired output video. In the semantic label map, each color corresponds to an object class, and we can change the meaning of a label. Some examples of this are transforming trees into buildings or vice versa and changing the styles of buildings or roads.
4.1.2. Face → Edge → Face:
They train a sketch-to-face video synthesis model by using the real face videos in the FaceForensics dataset. The network learns to transfer an edge-map video to a video of a human face. It can also generate different faces from the same input edge map. On the other hand, the model can change the facial appearance of the original face videos. The resulting video is temporally consistent from frame to frame.
4.1.4. Frame Prediction:
To predict future video given a few observed frames, the team decomposed the task into two sub-tasks:
• Synthesizing future semantic segmentation masks using the observed frames.
• Converting the synthesized segmentation masks into videos.
In practice, after extracting the segmentation masks from the observed frames, they trained a generator to predict future semantic masks. They then used the proposed video-to-video synthesis approach to convert the predicted segmentation masks into a future video.
4.2. Language models: Google’s BERT representation:
In Natural Language Processing (NLP), a language model is a model that can estimate the probability distribution of a set of linguistic units, typically a sequence of words. These are interesting models because they can be built at little cost and have significantly improved several NLP tasks such as machine translation, speech recognition, and parsing.
Historically, one of the best-known approaches is based on Markov models and n-grams. With the emergence of deep learning, more powerful models generally based on long short-term memory networks (LSTMs) appeared. Although highly effective, existing models are usually unidirectional, meaning that only the left (or right) context of a word ends up being considered.
In October 2018, the Google AI Language team published a paper that caused a stir in the community. BERT (Bidirectional Encoder Representations from Transformers) is a new bidirectional language model that has achieved state-of-the-art results for 11 complex NLP tasks, including sentiment analysis, question answering, and paraphrase detection.
The strategy for pre-training BERT differs from the traditional left-to-right or right-to-left options. The novelty consists of:
• Masking some percentage of the input tokens at random, then predicting only those masked tokens; this keeps, in a multi-layered context, the words from indirectly “seeing themselves”.
• Building a binary classification task to predict if sentence B follows immediately after sentence A, which allows the model to determine the relationship between sentences, a phenomenon not directly captured by classical language modeling (see the sketch after this list).
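For the second task, training pairs can be built directly from running text: half the time sentence B really follows sentence A, and half the time it is replaced by a random sentence. The snippet below is a simplified illustration in plain Python, not Google's actual preprocessing code.

import random

random.seed(0)
sentences = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]

def make_nsp_pair(i):
    # Return (sentence A, sentence B, label); label 1 means B really follows A.
    a = sentences[i]
    if random.random() < 0.5:
        return a, sentences[i + 1], 1          # the true next sentence
    # a real pipeline would make sure the random sentence is not the true next one
    return a, random.choice(sentences), 0

print(make_nsp_pair(0))
print(make_nsp_pair(2))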
4.2.1. Masked Language Model (MLM):
BERT masks out some of the words in the input and then conditions on the surrounding context bidirectionally to predict the masked words. Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
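The masking step itself can be illustrated in a few lines of plain Python (a simplified sketch, not Google's implementation; real BERT also sometimes keeps the original token or substitutes a random one instead of [MASK]).

import random

random.seed(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:     # roughly 15% of positions get masked
        masked.append("[MASK]")
        targets[i] = tok           # the model must recover these from the unmasked context
    else:
        masked.append(tok)
print(masked)
print(targets)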
When training the BERT model, both techniques (masked language modeling and next-sentence prediction) are trained together, thus minimizing the combined loss function of the two strategies.
Figure 4.2 (b): SQuAD 2.0 leaderboard.
References:
- Websites:
[1] ibm.com
https://www.ibm.com/cloud/learn/deep-learning
https://developer.ibm.com/technologies/artificial-intelligence/articles/cc-machine-learning-deep-learning-architectures/
[2] machinelearningknowledge.ai
https://machinelearningknowledge.ai/brief-history-of-deep-learning/
[3] nvlabs.github.io
https://nvlabs.github.io/few-shot-vid2vid/
[4] tryolabs.com
https://tryolabs.com/blog/2018/12/19/major-advancements-deep-learning-2018/
[5] towardsdatascience.com
https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad
[6] web.stanford.edu
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15812785.pdf
- Science Journals:
[7] Sanskruti Patel, Atul Patel. Deep Learning Architectures and its Applications: A Survey, 2018.
[8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro. Video-to-Video Synthesis, 2018.
- Books:
[9] Nikhil Buduma, with contributions by Nicholas Locascio. Fundamentals of Deep Learning. (pages 85-109)
[10] Armando Vieira, Bernardete Ribeiro. Introduction to Deep Learning Business Applications for Developers: From Conversational Bots in Customer Service to Medical Image Processing. (pages 38-40)