VIETNAM NATIONAL UNIVERSITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

PROJECT 7

1. Phạm Huy Thiên Phúc      Student ID: 2053346
2. Trương Huỳnh Đăng Khoa   Student ID: 2053145
3. Hoàng Huy Minh           Student ID: 2053212
4. Hoàng Thảo Lan Chi       Student ID: 2052337

Contents of the project
1 What is Deep Learning? ....................................................... 3
2 The history and the development of Deep Learning ............................. 4
3 Discuss some prominent architectures of Deep Learning ........................ 7
  3.1. Supervised Learning ..................................................... 8
       3.1.1. Convolutional neural networks .................................... 9
       3.1.2. Recurrent neural networks ........................................ 10
       3.1.3. LSTM networks .................................................... 11
       3.1.4. GRU networks ..................................................... 12
  3.2. Unsupervised Learning ................................................... 13
       3.2.1. Self-organized maps .............................................. 14
       3.2.2. Autoencoders ..................................................... 15
       3.2.3. Restricted Boltzmann Machines .................................... 16
4 Introduce some recent outstanding achievements of Deep Learning .............. 17
  4.1. Video-to-video synthesis ................................................ 17
       4.1.1. Semantic Labels → Cityscapes Street Views ........................ 18
       4.1.2. Face → Edge → Face ............................................... 19
       4.1.3. Body → Pose → Body ............................................... 19
       4.1.4. Frame Prediction ................................................. 20
  4.2. Language models: Google’s BERT representation ........................... 21
       4.2.1. Masked Language Model (MLM) ...................................... 22
       4.2.2. Next Sentence Prediction (NSP) ................................... 23
References ..................................................................... 24

1. What is Deep Learning?

Deep learning is a subset of machine learning [Figure 1], which is essentially a neural network with three or more layers. These neural networks attempt to simulate the behavior of the human brain (albeit far from matching its ability), allowing them to “learn” from large amounts of data that is unstructured or unlabeled. While a neural network with a single layer can still make approximate predictions, additional hidden layers can help to optimize and refine for accuracy.

Figure 1: AI, ML and DL

Deep learning drives many artificial intelligence (AI) applications and services
that improve automation, performing analytical and physical tasks without human
intervention. Deep learning technology lies behind everyday products and services
(such as digital assistants, voice-enabled TV remotes, and credit card fraud detec-
tion) as well as emerging technologies (such as self-driving cars).

2. The history and the development of Deep Learning:

ANNs started with a work by McCulloch and Pitts, who showed that sets of simple units (artificial neurons) could perform all possible logic operations and thus be capable of universal computation. This work was concomitant with that of Von Neumann and Turing, who first dealt with statistical aspects of the information processing of the brain and how to build a machine capable of reproducing them. Frank Rosenblatt invented the perceptron machine to perform simple pattern classification. However, this new learning machine was incapable of solving problems that are not linearly separable, such as the logical XOR. In 1969 Minsky and Papert showed that perceptrons had intrinsic limitations that could not be transcended, thus leading to a fading enthusiasm for ANNs. In 1982 John Hopfield proposed a special type of ANN (the Hopfield network) and proved that it had powerful pattern completion and memory properties. The backpropagation algorithm was first described by Linnainmaa, S. (1970) as the representation of the cumulative rounding error of an algorithm (as a Taylor expansion of the local rounding errors), without reference to neural networks. In 1986, Rumelhart, Hinton, and Williams rediscovered this powerful learning rule, which allowed them to train ANNs with several hidden units, thus overcoming Minsky's criticism.

Figure 2: Brief history of Deep learning

Year  Contributor                                    Contribution
1943  Walter Pitts and Warren McCulloch              McCulloch-Pitts Neuron
1957  Frank Rosenblatt                               Perceptron
1960  Henry J. Kelley                                The First Backpropagation Model
1962  Stuart Dreyfus                                 Backpropagation With Chain Rule
1965  Alexey Grigoryevich Ivakhnenko and             Multilayer Neural Network
      Valentin Grigorevich Lapa
1969  Marvin Minsky and Seymour Papert               XOR Problem
1970  Seppo Linnainmaa                               Automatic differentiation for backpropagation,
                                                     implemented in computer code
1971  Alexey Grigoryevich Ivakhnenko                 Deep Neural Network
1980  Kunihiko Fukushima                             Neocognitron – First CNN Architecture
1982  John Hopfield                                  Hopfield Network
      Paul Werbos                                    Backpropagation In ANNs
1985  David H. Ackley, Geoffrey Hinton and           Boltzmann Machine
      Terrence Sejnowski
1986  Terry Sejnowski                                NETtalk – ANN Learns Speech
      Geoffrey Hinton, Rumelhart and Williams        Implementation Of Backpropagation
1991  Sepp Hochreiter                                Vanishing Gradient Problem
1997  Sepp Hochreiter and Jürgen Schmidhuber         The Milestone Of LSTM
2006  Geoffrey Hinton, Ruslan Salakhutdinov,         Deep Belief Network
      Osindero and Teh
2008  Andrew Ng’s group                              GPUs For Training Deep Neural Networks
2011  Yoshua Bengio, Antoine Bordes, Xavier Glorot   Rectified activations (ReLU) to address the
                                                     vanishing gradient
2012  Alex Krizhevsky                                AlexNet
2014  Ian Goodfellow                                 Generative Adversarial Network
2016  DeepMind                                       Deep reinforcement learning model beats the
                                                     human champion in the complex game of Go
2019  Yoshua Bengio, Geoffrey Hinton and             Turing Award 2018 for their immense contribution
      Yann LeCun                                     to advancements in deep learning

3. Discuss some prominent architectures of Deep Learning:

Figure 3: Deep Learning Architecture

3.1. Supervised Learning:

Figure 3.1: Supervised learning

Supervised learning (SL) is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
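
As a minimal illustration of this idea (a toy example of our own, not taken from any of the referenced sources), the Python sketch below infers a simple linear function from labeled input-output pairs by gradient descent; the data-generating rule and all parameter values are assumptions made purely for demonstration.

```python
import numpy as np

# Toy labeled training set: each example is an (input vector, target) pair.
# The rule y = 2*x1 - 3*x2 + 1 is only used to generate the supervisory signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # input objects (vectors)
y = 2 * X[:, 0] - 3 * X[:, 1] + 1      # desired output values

# Infer a linear function f(x) = w.x + b by minimizing the mean squared error.
w, b = np.zeros(2), 0.0
lr = 0.1
for _ in range(500):
    err = X @ w + b - y
    w -= lr * (X.T @ err) / len(y)     # gradient step on the weights
    b -= lr * err.mean()               # gradient step on the bias

# The inferred function can now be used to map new, unseen examples.
print(np.round(w, 2), round(b, 2))     # approximately [ 2. -3.] and 1.0
```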

3.1.1. Convolutional neural networks:
A CNN is a multilayer neural network that was biologically inspired by the an-
imal visual cortex. The architecture is particularly useful in image-processing ap-
plications. The first CNN was created by Yann LeCun; at the time, the architecture
focused on handwritten character recognition, such as postal code interpretation. As
a deep network, early layers recognize features (such as edges), and later layers re-
combine these features into higher-level attributes of the input.
The LeNet CNN architecture is made up of several layers that implement feature
extraction and then classification (see figure 3.1.1). The image is divided into re-
ceptive fields that feed into a convolutional layer, which then extracts features from
the input image. The next step is pooling, which reduces the dimensionality of the
extracted features (through down-sampling) while retaining the most important in-
formation (typically, through max pooling). Another convolution and pooling step
is then performed that feeds into a fully connected multilayer perceptron. The fi-
nal output layer of this network is a set of nodes that identify features of the image
(in this case, a node per identified number). You train the network by using back-
propagation.
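
The following is a sketch of such a LeNet-style stack written with the Keras API (assuming TensorFlow is installed); the filter counts, kernel sizes, and activations are illustrative choices rather than the exact values of the original LeNet. Calling fit() on labeled images would then train the whole stack end to end with back-propagation.

```python
import tensorflow as tf

# A LeNet-style stack: two convolution + pooling stages for feature extraction,
# followed by fully connected layers for classification. Sizes are illustrative.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                      # grayscale input image
    tf.keras.layers.Conv2D(6, kernel_size=5, activation="tanh"),   # convolution over receptive fields
    tf.keras.layers.MaxPooling2D(pool_size=2),              # down-sampling (pooling)
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="tanh"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="tanh"),          # fully connected multilayer perceptron
    tf.keras.layers.Dense(10, activation="softmax"),        # one output node per identified digit
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```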

Figure 3.1.1: CNN architecture in extraction and classification image


The use of deep layers of processing, convolutions, pooling, and a fully connected
classification layer opened the door to various new applications of deep learning neu-
ral networks. In addition to image processing, the CNN has been successfully applied
to video recognition and various tasks within natural language processing.
Example applications: Image recognition, video analysis, and natural language processing

3.1.2. Recurrent neural networks:
The RNN is one of the foundational network architectures from which other deep
learning architectures are built. The primary difference between a typical multilayer
network and a recurrent network is that rather than completely feed-forward connec-
tions, a recurrent network might have connections that feed back into prior layers (or
into the same layer). This feedback allows RNNs to maintain memory of past inputs
and model problems in time.
RNNs consist of a rich set of architectures (we’ll look at one popular topology
called LSTM next). The key differentiator is feedback within the network, which
could manifest itself from a hidden layer, the output layer, or some combination
thereof.

Figure 3.1.2: RNN architecture and connections between layers


RNNs can be unfolded in time and trained with standard back-propagation or by using a variant of back-propagation called back-propagation through time (BPTT).
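
A minimal NumPy sketch of this feedback loop is shown below (a plain Elman-style recurrence, with all sizes and weights chosen arbitrarily for illustration); BPTT would back-propagate through exactly this unrolled loop.

```python
import numpy as np

# A minimal Elman-style recurrent step: the hidden state h is fed back into the
# network at every time step, which gives the RNN its memory of past inputs.
rng = np.random.default_rng(1)
input_dim, hidden_dim = 4, 8
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))    # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # hidden -> hidden (feedback)
b_h = np.zeros(hidden_dim)

def rnn_forward(x_seq):
    """Unfold the recurrence over time and return all hidden states."""
    h = np.zeros(hidden_dim)
    states = []
    for x_t in x_seq:                                # one step per time step
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

# A toy sequence of 5 time steps.
x_seq = rng.normal(size=(5, input_dim))
print(rnn_forward(x_seq).shape)                      # (5, 8)
```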

Example applications: Speech recognition and handwriting recognition

3.1.3. LSTM networks:

The LSTM departed from typical neuron-based neural network architectures


and instead introduced the concept of a memory cell. The memory cell can retain
its value for a short or long time as a function of its inputs, which allows the cell
to remember what’s important and not just its last computed value.

The LSTM memory cell contains three gates that control how information flows
into or out of the cell. The input gate controls when new information can flow
into the memory. The forget gate controls when an existing piece of information is
forgotten, allowing the cell to remember new data. Finally, the output gate controls
when the information that is contained in the cell is used in the output from the cell.
The cell also contains weights, which control each gate. The training algorithm,
commonly BPTT, optimizes these weights based on the resulting network output
error.
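
The following NumPy sketch implements one such memory-cell update with the three gates, in a common textbook formulation; the dimensions and parameters are illustrative assumptions, not the internals of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM memory-cell update with input, forget, and output gates."""
    z = x @ W + h_prev @ U + b                     # all gate pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate new cell contents
    c = f * c_prev + i * g                         # forget old information, admit new
    h = o * np.tanh(c)                             # output gate decides what is exposed
    return h, c

# Toy dimensions and random parameters, purely for demonstration.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W = rng.normal(scale=0.1, size=(n_in, 4 * n_hidden))
U = rng.normal(scale=0.1, size=(n_hidden, 4 * n_hidden))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape, c.shape)   # (5,) (5,)
```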

Figure 3.1.3: LSTM memory cell

Recent applications of CNNs and LSTMs produced image and video caption-
ing systems in which an image or video is captioned in natural language. The CNN
implements the image or video processing, and the LSTM is trained to convert the
CNN output into natural language.

Example applications: Image and video captioning systems

3.1.4. GRU networks:

In 2014, a simplification of the LSTM was introduced called the gated recurrent
unit. This model has two gates, getting rid of the output gate present in the LSTM
model. These gates are an update gate and a reset gate. The update gate indicates
how much of the previous cell contents to maintain. The reset gate defines how
to incorporate the new input with the previous cell contents. A GRU can model a
standard RNN simply by setting the reset gate to 1 and the update gate to 0.
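
A single GRU update step can be sketched in NumPy as follows (again a textbook-style formulation with illustrative sizes); note how setting the reset gate to 1 and the update gate to 0 recovers a plain RNN step, as described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update with only two gates: update (z) and reset (r)."""
    z = sigmoid(x @ Wz + h_prev @ Uz)          # update gate: how much old state to keep
    r = sigmoid(x @ Wr + h_prev @ Ur)          # reset gate: how much old state enters the candidate
    h_cand = np.tanh(x @ Wh + (r * h_prev) @ Uh)
    return z * h_prev + (1.0 - z) * h_cand     # with r = 1 and z = 0 this is a plain RNN step

# Toy parameters, purely for demonstration.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
Wz, Wr, Wh = [rng.normal(scale=0.1, size=(n_in, n_hidden)) for _ in range(3)]
Uz, Ur, Uh = [rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for _ in range(3)]
h = np.zeros(n_hidden)
h = gru_step(rng.normal(size=n_in), h, Wz, Uz, Wr, Ur, Wh, Uh)
print(h.shape)  # (5,)
```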

Figure 3.1.4: GRU cell

The GRU is simpler than the LSTM, can be trained more quickly, and can be
more efficient in its execution. However, the LSTM can be more expressive and
with more data can lead to better results.

Example applications: Natural language text compression, handwriting recognition, speech recognition, gesture recognition, image captioning

3.2. Unsupervised Learning:

Figure 3.2: Unsupervised learning

Unsupervised learning (UL) is a type of algorithm that learns patterns


from untagged data. The hope is that, through mimicry, the machine is
forced to build a compact internal representation of its world and then gen-
erate imaginative content. In contrast to supervised learning (SL) where
data is tagged by a human, e.g. as "car" or "fish" etc, UL exhibits self-
organization that captures patterns as neuronal predilections or probability
densities. The other levels in the supervision spectrum are reinforcement
learning where the machine is given only a numerical performance score
as its guidance, and semi-supervised learning where a smaller portion of
the data is tagged. Two broad methods in UL are Neural Networks and
Probabilistic Methods.

3.2.1. Self-organized maps:

The self-organizing map (SOM) is popularly known as the Kohonen map. SOM is an unsupervised neural network that creates clusters of the input data set by reducing the dimensionality of the input. SOMs vary from the traditional artificial neural network in quite a few ways.

Figure 3.2.1: Self-organizing map

The first significant variation is that weights serve as a characteristic of the node. After the inputs are normalized, a random input is first chosen. Random weights close to zero are initialized for each feature of the input record; these weights now represent the input node, and several combinations of these random weights represent variations of the input node. The Euclidean distance between each of these output nodes and the input node is calculated. The node with the least distance is declared the most accurate representation of the input and is marked as the best matching unit, or BMU. With the BMU as a center point, the other units are assigned to the cluster of the BMU they are closest to. The weights of nodes within a radius around the BMU are then updated based on their proximity to it, and the radius is shrunk over time.

Next, in an SOM, no activation function is applied, and because there are no target labels to compare against, there is no concept of calculating error and backpropagation.
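
The compact NumPy sketch below follows this procedure on a small grid of nodes; the grid size, learning rate, and radius schedule are assumptions chosen only for illustration.

```python
import numpy as np

# A compact self-organizing-map sketch on a small 2-D grid of nodes.
rng = np.random.default_rng(0)
grid_h, grid_w, n_features = 10, 10, 3
weights = rng.random((grid_h, grid_w, n_features))         # random weights close to zero/one
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

data = rng.random((200, n_features))                       # normalized inputs
n_steps, lr0, radius0 = 1000, 0.5, max(grid_h, grid_w) / 2

for t in range(n_steps):
    x = data[rng.integers(len(data))]                      # pick a random input record
    # Best matching unit (BMU): the node whose weights are closest in Euclidean distance.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Learning rate and neighbourhood radius both shrink over time.
    lr = lr0 * np.exp(-t / n_steps)
    radius = radius0 * np.exp(-t / n_steps)
    # Nodes within the radius of the BMU are pulled toward the input, more so if closer.
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
    influence = np.where(grid_dist <= radius,
                         np.exp(-(grid_dist ** 2) / (2 * radius ** 2)),
                         0.0)[..., None]
    weights += lr * influence * (x - weights)

print(weights.shape)  # (10, 10, 3): each node's weights now characterize a cluster
```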

Example applications: Dimensionality reduction, clustering high-dimensional inputs to 2-dimensional output, radiant grade result, and cluster visualization

3.2.2. Autoencoders:

Though the history of when autoencoders were invented is hazy, the basic variant of an autoencoder is composed of three layers: input, hidden, and output layers.

Figure 3.2.2: Autoencoders layers

First, the input layer is encoded into the hidden layer using an appropriate en-
coding function. The number of nodes in the hidden layer is much less than the
number of nodes in the input layer. This hidden layer contains the compressed
representation of the original input. The output layer aims to reconstruct the input
layer by using a decoder function.

During the training phase, the difference between the input and the output layer is calculated using an error function, and the weights are adjusted to minimize the error. Unlike traditional unsupervised learning techniques, where there is no data to compare the outputs against, autoencoders learn continuously using backward propagation. For this reason, autoencoders are classified as self-supervised algorithms.
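
A minimal three-layer autoencoder of this kind can be sketched with the Keras API as follows (assuming TensorFlow is installed); the layer sizes and the 784-dimensional input are illustrative assumptions.

```python
import tensorflow as tf

# A three-layer autoencoder: a small hidden (bottleneck) layer encodes the input,
# and the output layer tries to reconstruct it.
input_dim, hidden_dim = 784, 32

inputs = tf.keras.Input(shape=(input_dim,))
encoded = tf.keras.layers.Dense(hidden_dim, activation="relu")(inputs)     # compressed representation
decoded = tf.keras.layers.Dense(input_dim, activation="sigmoid")(encoded)  # reconstruction
autoencoder = tf.keras.Model(inputs, decoded)

# The error function compares the output with the original input itself,
# so training uses (x, x) pairs: the data provides its own supervision.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # x_train: your own data
autoencoder.summary()
```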

Example applications: Dimensionality reduction, data interpolation, and data compression/decompression

3.2.3. Restricted Boltzmann Machines:

An RBM is a 2-layered neural network. The layers are input and hidden layers.
As shown in the following figure, in RBMs every node in a hidden layer is con-
nected to every node in a visible layer. In a traditional Boltzmann Machine, nodes
within the input and hidden layer are also connected. Due to computational com-
plexity, nodes within a layer are not connected in a Restricted Boltzmann Machine.

Figure 3.2.3: Restricted Boltzmann Machines

During the training phase, RBMs calculate the probability distribution of the training set using a stochastic approach. When the training begins, each neuron gets activated at random. The model also contains a hidden bias and a visible bias. While the hidden bias is used in the forward pass to build the activation, the visible bias helps in reconstructing the input.
Because in an RBM the reconstructed input is always different from the original input, RBMs are also known as generative models.
Also, because of the built-in randomness, the same input can produce different outputs across runs. In fact, this is the most significant difference from an autoencoder, which is a deterministic model.
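
The NumPy sketch below illustrates one stochastic forward pass and reconstruction in a tiny RBM (training, for example by contrastive divergence, is omitted); all sizes and weights are assumptions for demonstration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny RBM: visible and hidden layers fully connected to each other,
# with no connections between nodes inside the same layer.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_visible = np.zeros(n_visible)   # visible bias, used when reconstructing the input
b_hidden = np.zeros(n_hidden)     # hidden bias, used in the forward pass

v = rng.integers(0, 2, size=n_visible).astype(float)        # a binary input vector

# Forward pass: hidden units are sampled stochastically, which is why the same
# input can produce different hidden states (and reconstructions) across runs.
p_h = sigmoid(v @ W + b_hidden)
h = (rng.random(n_hidden) < p_h).astype(float)

# Backward pass: reconstruct the visible layer from the sampled hidden state.
p_v = sigmoid(h @ W.T + b_visible)
v_reconstructed = (rng.random(n_visible) < p_v).astype(float)

print(v, v_reconstructed)   # the reconstruction generally differs from the input
```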
Example applications: Dimensionality reduction and collaborative filtering

4. Introduce some recent outstanding achievements of Deep Learning:
4.1. Video-to-video synthesis:
In 2018, Ting-Chun Wang and others announced a new video-to-video synthesis approach. In this approach, they aim to turn a segmented input source video into a photorealistic output video that precisely depicts the content of the source video. The results are high-resolution, photorealistic, temporally coherent videos on a diverse set of input formats, including segmentation masks, sketches, and poses. They achieve this by using a neural generator network to create the images, together with one discriminator network that checks whether each image looks realistic on its own and another discriminator that watches the sequence of images to judge whether it would pass as a video.

Figure 4.1: Network architecture of video-to-video synthesis

The network architecture of the generator is as follows: a residual network G1 is first trained on lower-resolution images. Then, another network G2 is appended to G1, and the two networks are trained jointly on high-resolution images. Specifically, the input to the residual blocks in G2 is the element-wise sum of the feature map from G2 and the last feature map from G1.
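
The rough Keras sketch below illustrates only this coarse-to-fine composition idea, not the authors' actual implementation; every layer choice, size, and filter count is an assumption made for the example.

```python
import tensorflow as tf

def conv_block(x, filters):
    """A plain conv + ReLU block standing in for the illustrative front-ends."""
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    return tf.keras.layers.ReLU()(x)

def residual_block(x, filters):
    """A simple residual block standing in for the residual blocks of G1/G2."""
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    return tf.keras.layers.Add()([x, y])

filters = 64
hi_res = tf.keras.Input(shape=(256, 256, 3))                 # high-resolution input

# G1 operates on a downsampled (lower-resolution) version of the input.
lo_res = tf.keras.layers.AveragePooling2D(2)(hi_res)
g1 = conv_block(lo_res, filters)
g1 = residual_block(g1, filters)                             # last feature map of G1
g1_up = tf.keras.layers.UpSampling2D(2)(g1)                  # back to the high resolution

# G2's front-end feature map is summed element-wise with G1's last feature map
# before entering G2's residual blocks, mirroring the description above.
g2_front = conv_block(hi_res, filters)
fused = tf.keras.layers.Add()([g2_front, g1_up])
g2 = residual_block(fused, filters)
out = tf.keras.layers.Conv2D(3, 3, padding="same", activation="tanh")(g2)

generator = tf.keras.Model(hi_res, out)
generator.summary()
```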

4.1.1. Semantic Labels → Cityscapes Street Views:

Figure 4.1.1 a): Semantic Labels results

Starting from a video in some source domain, they synthesize a new video in a target domain using a learned network. Semantic labels allow them to edit or create content in a convenient input domain and generate a video in an output domain that is harder to edit or create.

Figure 4.1.1 b): Semantic Labels results

The network can synthesize multiple results given the same input, or it can be manipulated to generate a desired output video. In the label map, each color corresponds to an object class, and we can change the meaning of a label. Some examples of this are transforming trees into buildings or vice versa and changing the styles of buildings or roads.

4.1.2. Face → Edge → Face:
They train a sketch-to-face video synthesis model by using the real face videos in the FaceForensics dataset. The network learns to transfer edge-map video to video of a human face. It can also generate different faces from the same input edge map. On the other hand, the model can change the facial appearance of the original face videos. The resulting video is temporally consistent from frame to frame.

Figure 4.1.2: Face → Edge → Face Results

4.1.3. Body → Pose → Body:


Wang’s model can synthesize videos of humans moving given pose information, and it outputs high-resolution photorealistic dance videos that contain unseen body shapes and motions. The method can change the clothing of the same dancer or transfer poses from one person to another with consistent shadows.

Figure 4.1.3: Body → Pose → Body Results

4.1.4. Frame Prediction:

Figure 4.1.4: Frame Prediction Results

To predict the future video given a few observed frames, the team decomposed the task into two sub-tasks:

• Synthesizing future semantic segmentation masks using the observed frames.
• Converting the synthesized segmentation masks into videos.

In practice, after extracting the segmentation masks from the observed frames, they trained a generator to predict future semantic masks. They then use the proposed video-to-video synthesis approach to convert the predicted segmentation masks into a future video.

4.2. Language models: Google’s BERT representation:
In Natural Language Processing (NLP), a language model is a model that can es-
timate the probability distribution of a set of linguistic units, typically a sequence
of words. These are interesting models since they can be built at little cost and have
significantly improved several NLP tasks such as machine translation, speech recog-
nition, and parsing.
Historically, one of the best-known approaches is based on Markov models and n-
grams. With the emergence of deep learning, more powerful models generally based
on long short-term memory networks (LSTM) appeared. Although highly effective,
existing models are usually unidirectional, meaning that only the left (or right) con-
text of a word ends up being considered.
In October 2018, the Google AI Language team published a paper that caused a stir in the community. BERT (Bidirectional Encoder Representations from Transformers) is a new bidirectional language model that has achieved state-of-the-art results on 11 complex NLP tasks, including sentiment analysis, question answering, and paraphrase detection.

Figure 4.2: a) Comparative results for the GLUE Benchmark

The strategy for pre-training BERT differs from the traditional left-to-right or
right-to-left options. The novelty consists of:

• Masking some percentage of the input tokens at random, then predicting only
those masked tokens; this keeps, in a multi-layered context, the words from
indirectly “seeing themselves”.
• Building a binary classification task to predict if sentence B follows immedi-
ately after sentence A, which allows the model to determine the relationship
between sentences, a phenomenon not directly captured by classical language
modeling.

4.2.1. Masked Language Model (MLM):
BERT masks out some of the words in the input and then conditions on the surrounding context in both directions to predict the masked words. Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence.
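
The toy Python sketch below shows the basic input preparation for this objective; the whitespace tokenization and plain [MASK] replacement are simplifications (the real BERT recipe also sometimes keeps or randomly replaces the selected words).

```python
import random

# Roughly 15% of the tokens are replaced by [MASK], and only those
# positions become prediction targets for the MLM loss.
random.seed(1)
tokens = "the quick brown fox jumps over the lazy dog".split()

masked_tokens, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:          # ~15% of positions are masked
        masked_tokens.append("[MASK]")
        targets[i] = tok                # the model must recover the original word here
    else:
        masked_tokens.append(tok)

print(masked_tokens)   # the input fed to the model
print(targets)         # positions and words the MLM loss is computed on
```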

Figure 4.2.1: Mask Language Model

4.2.2. Next Sentence Prediction (NSP):


In Next Sentence Prediction (NSP), BERT learns to model relationships between sentences. In the training process, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. Consider two sentences A and B: is B the actual sentence that comes after A in the corpus, or just a random sentence? For example:

Figure 4.2.2: Next Sentence Prediction

When training the BERT model, both techniques are trained together, minimizing the combined loss function of the two strategies.
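
The toy Python sketch below shows how NSP training pairs can be constructed from a document, with roughly half of the pairs being true continuations; the sentences and the 50/50 split are illustrative assumptions.

```python
import random

# Build NSP training pairs: half the time sentence B really follows A (IsNext),
# half the time B is a random sentence from elsewhere in the document (NotNext).
random.seed(0)
document = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live mostly in the Southern Hemisphere.",
]

def make_nsp_pair(doc):
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"      # the actual next sentence
    # Pick a distractor that is not the true continuation.
    distractor = random.choice([s for j, s in enumerate(doc) if j != i + 1])
    return sentence_a, distractor, "NotNext"

for _ in range(3):
    print(make_nsp_pair(document))
```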

Figure 4.2: b) SQuAD 2.0 Leaderboard

On SQuAD v2.0, BERT achieves an 89.474% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 87.147% and exceeding human performance by 0.316%.
BERT is undoubtedly a milestone in the use of Deep Learning for Natural Lan-
guage Processing.

References:

Websites:
[1] ibm.com
    https://www.ibm.com/cloud/learn/deep-learning
    https://developer.ibm.com/technologies/artificial-intelligence/articles/cc-machine-learning-deep-learning-architectures/
[2] machinelearningknowledge.ai
    https://machinelearningknowledge.ai/brief-history-of-deep-learning/
[3] nvlabs.github.io
    https://nvlabs.github.io/few-shot-vid2vid/
[4] tryolabs.com
    https://tryolabs.com/blog/2018/12/19/major-advancements-deep-learning-2018/
[5] towardsdatascience.com
    https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad
[6] web.stanford.edu
    https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15812785.pdf

Science journals:
[7] Sanskruti Patel and Atul Patel, "Deep Learning Architectures and its Applications: A Survey", 2018.
[8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz and Bryan Catanzaro, "Video-to-Video Synthesis", 2018.

Books:
[9] Nikhil Buduma, with contributions by Nicholas Locascio, "Fundamentals of Deep Learning" (pages 85-109).
[10] Armando Vieira and Bernardete Ribeiro, "Introduction to Deep Learning Business Applications for Developers: From Conversational Bots in Customer Service to Medical Image Processing" (pages 38-40).
