Authorized licensed use limited to: Heriot-Watt University. Downloaded on September 20,2020 at 19:21:18 UTC from IEEE Xplore. Restrictions apply.
V. NEURAL NETWORKS

Human beings are blessed with the ability to think and memorize, and artificial intelligence tries to mimic this behaviour; this is what forms the basis of neural networks. They can be considered a series of algorithms that attempt to mimic the functioning of the human brain. The brain is built on networks of neurons that act as messengers, carrying signals; similarly, a neural network is organized into several layers that try to discover the underlying relationships in the input data. Each node in a neural network is called a perceptron. A perceptron is a single-neuron model that was a precursor to larger neural networks: it feeds the signal produced by a multiple linear regression into an activation function that may be nonlinear.

VI. CONVOLUTIONAL NEURAL NETWORK

Fig. 2. Convolutional Neural Network [8]

A convolutional neural network, also known as a CNN or ConvNet, is a kind of artificial neural network that is widely used for the analysis of images. It can also be applied to classification and data-analysis problems. A CNN can be thought of as an artificial neural network that detects patterns and tries to extract meaning from them.

More formally, a CNN is a deep-learning algorithm that takes an image as input, assigns learnable weights and biases to the various objects or aspects present in the image, and is thereby able to differentiate one object from another.

A. Important concepts in Convolutional neural network

1) Filters

Fig. 3. Filters [1]

In a convolutional neural network, every layer contains filters. Filters are responsible for finding specific patterns or features in the input data, and the number of filters each layer should have is specified per layer. At the start of the network the filters are quite simple and detect patterns such as edges and circles, but in deeper layers they become more advanced and can identify complete figures such as a mouse or a cat. A filter can be imagined as a matrix with a specified number of rows and columns, whose entries can be initialized with random numbers.

The input is larger than the filter. A dot product is taken between the filter and a patch of the input image of the same size as the filter, which gives a single value. Repeating this operation at multiple positions over the input produces a 2-D array of filtered values called a feature map.

2) Padding

Padding is used to preserve the height and width of the input image when going from one layer to the next, which makes it possible to design deeper networks. Performance is also said to improve because information at the borders is retained. Padding is of two types: valid padding and same padding. With valid padding, the feature map produced by a layer is smaller than the input image; with same padding, it is the same size as, or larger than, the input image.

3) Activation function

An activation function is a node placed in the middle or at the end of a neural network; it helps us decide whether a neuron should fire or not. It is a non-linear function applied to the input signal, whose transformed output is sent on to the next layer of neurons, which treats it as input. We have used the ReLU and softmax activation functions.

4) Stride

Stride is the number of pixels of the input matrix by which the filter is moved at each step. If the stride is one, the filter moves one pixel at a time; if the stride is two or three, the filter moves two or three pixels at a time. Stride thus decides how far our filter slides over the input.

B. Layers in Convolutional Neural Network

The input image in a convolutional neural network passes through different layers, each responsible for its own assigned function. The layers are discussed as follows:

1) Input Layer

The very first layer in a convolutional neural network is the input layer. It contains the artificial input neurons and is responsible for bringing the initial data into the system. The input layer has a characteristic the other layers do not: its neurons are passive. Being the first layer, it does not take information from previous layers; its neurons have no weighted inputs, or rather their weights are handled differently from those of the other layers, since the input is arriving for the very first time. The input layer can take any three-dimensional image, either coloured or black and white.

2) Convolutional Layer

The convolutional layer is the most important layer, as most of the computation takes place here, and a CNN can contain more than one convolutional layer. The purpose of the convolution operation, after which the network is named, is to extract features as well as minute details from the input image. The layer contains a specified number of filters that are applied to the input image; each object in the input image is identified by applying these filters in a series of mathematical steps, each producing an individual value in the output map.

In the convolutional layer, the filter is slid across the whole image, and a dot product between the filter and the part of the image it overlaps is computed, giving a scalar value that is stored in a matrix. This output matrix helps us detect patterns in the image. It is then given as input to the following convolutional layers for the detection of more complex patterns.
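As a concrete illustration, the dot-product convolution described above can be sketched in pure Python. The helper below is our own illustrative sketch, not code from the paper: it combines the filter, stride, and zero-padding concepts from the previous subsections, with ReLU applied to the resulting feature map.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image`, taking a dot product with the
    overlapped patch at each position to build the feature map."""
    if padding:  # zero-pad the borders so height/width can be preserved
        width = len(image[0])
        blank = [0] * (width + 2 * padding)
        image = ([blank[:] for _ in range(padding)]
                 + [[0] * padding + row + [0] * padding for row in image]
                 + [blank[:] for _ in range(padding)])
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1    # stride controls how far
    out_w = (len(image[0]) - k) // stride + 1  # the filter slides
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r, c = i * stride, j * stride
            row.append(sum(image[r + a][c + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        feature_map.append(row)
    return feature_map

def relu(feature_map):
    # ReLU activation: negative responses are zeroed, the rest pass through
    return [[max(0, v) for v in row] for row in feature_map]

# A 2x2 vertical-edge filter responds strongly at the boundary
# between the bright left half and the dark right half of the image.
image = [[1, 1, 0, 0]] * 4
kernel = [[1, -1], [1, -1]]
print(relu(conv2d(image, kernel)))  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

Note how the non-zero column in the output feature map marks exactly where the edge pattern occurs, which is the behaviour the filter discussion above describes.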
3) Pooling Layer

One of the best-known limitations of feature maps is that they record the positions of the input's features very precisely, so small variations in the input image produce a different feature map. Such variations can arise for various reasons, such as rotation, cropping, or shifting. Pooling layers address this by summarizing the feature map in patches: the result is a lower-resolution image that retains the important major elements while ignoring minute, useless details. This technique is referred to as down-sampling. A pooling layer is used between two convolutional layers. The two methods of pooling are:

• Average pooling – each block of the matrix overlapped by the filter is replaced by its average value. Average pooling helps in reducing the dimensions.

• Max pooling – each block of the input overlapped by the filter is replaced by its highest value. Max pooling also removes noisy activations, suppressing noise in the image while reducing the dimensions. Max pooling generally performs better than average pooling.

4) Fully Connected Layer

This layer is mainly present at the end of the convolutional network and is also known as the feed-forward or FC layer. A fully connected layer is a dense layer: each node in it is connected to every node in the previous layer. It is responsible for the final classification of the features obtained from the previous layers, such as the pooling or convolutional layers.

The input to the fully connected layer is the output received from the final convolutional or pooling layer, and this output is flattened first. Flattening means converting a matrix into a vector before it enters the fully connected layer. After the image passes through the fully connected layer, the softmax activation function, rather than ReLU, is used to obtain the probability that the input belongs to some definite class.

5) Output Layer

The outputs are obtained from the fully connected layer on the basis of the highest probability it computes. The outputs are the objects present in our input image, such as a dog, cat, or mouse.

C. Work Flow of CNN

Fig. 4. Workflow of CNN [1]

All the concepts required in a convolutional neural network have been explained above. Firstly, an input is given to our network through the input layer; the input should be a three-dimensional image, either coloured or black and white.

The image is then passed to the convolutional layer, where a specified number of filters are applied to it; this process is called convolution. An activation function is then applied; ReLU is used here. The output then acts as input to the pooling layer, which helps decrease the resolution of the image. The same process is repeated through further convolutional layers, of which there can be more than one; in the deeper convolutional layers the filters detect more complex patterns such as eyes, faces, or birds. Next, the matrix is flattened and enters the fully connected layer, which classifies the objects in the input image into classes; the softmax function is used here to find the class probabilities. Finally, the output is obtained at the output layer, and it consists of the objects present in the image.

VII. NATURAL LANGUAGE PROCESSING

Natural Language Processing, abbreviated NLP, is a stream of artificial intelligence that helps us communicate with computers. Thanks to NLP, computers can now fulfil various tasks that were not possible before: they can read text, hear and understand what we are trying to say, and even interpret the parts that are important. NLP can be used to translate text from one language to another; in applications such as Grammarly to correct grammatical mistakes; in call centres to respond to customers; and in personal assistant applications such as Alexa and Siri.

In our research on image captioning, natural language processing is used after the CNN. Using NLP, we apply algorithms that identify natural-language rules so that language can be converted into a form that computers can understand. Thus, NLP is used here to help us communicate with the computer.

A. Techniques involved

The two main techniques used in NLP are semantic analysis and syntactic analysis.
1) Syntax Analysis

Syntactic analysis checks whether the grammatical rules of the language are being followed. Some frequently used syntax techniques are parsing, stemming, and lemmatization.
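To make the stemming idea concrete, a deliberately naive suffix-stripping stemmer is sketched below. This is an illustrative toy of our own (the function name and suffix list are assumptions), not the Porter stemmer or any NLP library's implementation.

```python
def naive_stem(word):
    """Strip a few common English suffixes to approximate a word stem.
    Real systems use algorithms such as the Porter stemmer, which
    applies ordered rewrite rules rather than a flat suffix list."""
    for suffix in ("ing", "edly", "ed", "ly", "es", "s"):
        # keep at least three characters so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["playing", "played", "plays"]])
# ['play', 'play', 'play']
```

Lemmatization, by contrast, maps a word to its dictionary form using vocabulary and morphology (for example, "better" to "good"), which simple suffix rules like these cannot do.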
2) Semantic Analysis

Here, algorithms are applied in order to understand the meaning the text is trying to convey. Some of the techniques involved are word sense disambiguation, named entity recognition, and natural language generation (NLG).
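As a minimal sketch of one semantic-analysis task, the toy function below spots candidate named entities by flagging capitalised tokens that are not sentence-initial. Real named entity recognition relies on trained sequence models; the function name and heuristic here are purely our own illustration.

```python
def naive_ner(sentence):
    """Flag capitalised, non-sentence-initial tokens as candidate
    named entities. A crude heuristic standing in for a trained model."""
    tokens = sentence.split()
    return [t.strip(".,") for i, t in enumerate(tokens)
            if i > 0 and t[:1].isupper()]

print(naive_ner("Ask Alexa or Siri to caption photos from Facebook"))
# ['Alexa', 'Siri', 'Facebook']
```

A heuristic like this fails on lower-cased entities and on capitalised common nouns, which is precisely why practical NER is treated as a learning problem.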
VIII. RECURRENT NEURAL NETWORK

Fig. 5. Workflow of RNN [10]

In an RNN, persisting information is used in conjunction with the input data at each step, and the output is predicted from both. As the figure above shows, information from timestamp 't-1' is passed along with the input value at timestamp 't' to predict the output at timestamp 't'. Backpropagation through time (BPTT) is the algorithm used to train the model. Thus, an RNN's functionality is required whenever we wish the model to make use of persistence.

B. Long Short Term Memory (LSTM)

LSTM is a type of recurrent neural network used in cases where memory must persist for a longer period of time. It is often considered an enhanced version of the RNN: it can perform almost every task that an RNN can, and it can understand long-term dependencies. LSTMs were designed precisely to tackle the long-term dependency problem.

C. Work Flow of LSTM

Fig. 6. Work flow of LSTM [10]

LSTMs consist of numerous layers, where each layer takes the form of a chain of repeating modules. The basic idea behind LSTM is that each layer stores its output and passes it as an input to the next layer. Here there are four neural network layers, each interacting in a unique manner.

The first layer the input data passes through is the "forget gate layer", a sigmoid layer that decides which information should be discarded from the cell state [10]. The second layer is the "input gate layer", in which we decide which new information is to be added to the cell state. This layer has two components: the first is a sigmoid function and the second is a tanh layer.

Fig. 8. Input gate layer [10]

IX. APPLICATIONS OF IMAGE CAPTIONING

• Helps in automating the task of labelling images.

• Can be useful for social media sites such as Facebook, which can make use of these
automatically generated captions to infer data about its users solely on the basis of images.

• In web development, to make websites dynamic by simply adding images without worrying about describing them.

• Can be of aid to visually impaired people, who can simply read the text in a much larger font.

• Helps in organizing files without dwelling on the task of captioning the images present in the files.

• Google Photos can be helped in organizing images based on the objects present in them.

X. CONCLUSION

An image caption generator is thus an effective tool that can be used for multiple purposes. It helps automate the task of generating labels for images in an organized manner and thus aids in making lives easier. It can be used to describe images to people who are blind or have low vision and who rely on sounds and text to understand a scene. In web development, it is good practice to provide a description for any image that appears on the page, so that the image can be read or heard as opposed to just seen.

REFERENCES

[1] Md. Zakir Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, "A Comprehensive Survey of Deep Learning for Image Captioning," ACM Comput. Surv., 2018.
[2] J. Brownlee, "A Gentle Introduction to Deep Learning Caption Generation Models," Deep Learning for Natural Language Processing, November 22, 2017.
[3] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[4] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher, "Ask Me Anything: Dynamic Memory Networks for Natural Language Processing," CoRR, abs/1506.07285, 2015.
[5] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image Captioning with Semantic Attention," CoRR, abs/1603.03925, 2016.
[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every Picture Tells a Story: Generating Sentences from Images," in Proceedings of the 11th European Conference on Computer Vision: Part IV (ECCV'10), Berlin, Heidelberg: Springer-Verlag, 2010, pp. 15–29.
[7] B. Hu, Z. Lu, H. Li, and Q. Chen, "Convolutional Neural Network Architectures for Matching Natural Language Sentences," in Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014, pp. 2042–2050.
[8] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images," in International Conference on Computer Vision, 2015.
[9] X. Chen and C. L. Zitnick, "Mind's Eye: A Recurrent Visual Representation for Image Caption Generation," in CVPR, 2015.
[10] C. Olah, "Understanding LSTM Networks," online blog, 2015.