Chapter III

THE HISTORY OF DIFFERENT LANGUAGE MODELS

This chapter provides an overview of the strategies for detecting offensive language
that are based on LSTM and on the language models presented in Chapter 2, applied to
the benchmark Dravidian text data.
The prevalence of offensive content online, and the ease with which it spreads through
social media, puts users of many websites in a difficult position. Widespread interest
in this problem has prompted the development, around the world, of numerous systems
that recognize potentially harmful messages written in a wide variety of languages.
Once identified, these posts can be quarantined to a human-moderated portion of the
network or removed outright, making the platform safer for its users [61].
Recurrent neural networks (RNNs) are neural networks that reuse the same units
repeatedly along a sequence. Our RNN-based approach uses the Long Short-Term Memory
(LSTM) model and the Gated Recurrent Unit (GRU) model. The LSTM model starts with a
100-unit LSTM layer, followed by a dropout layer, and ends with a classification layer
that predicts the class of each tweet. The GRU model has the same structure, except
that a GRU layer replaces the LSTM layer. Several models were examined, and this
architecture was chosen because it gave the best overall results. Both LSTM and GRU
attempt to address the most fundamental problem of the plain RNN, vanishing gradients.
Neither is a perfect solution, but both are steps in the right direction. We also
construct a Bidirectional LSTM (BiLSTM) from one BiLSTM layer with one hundred hidden
units, which reads the input sequence in both the forward and the backward direction.
The resulting vectors are flattened and then supplied to the classification layer. The
BiGRU network is configured in the same way, with a BiGRU layer in place of the BiLSTM
layer.
In a standard RNN the repeating module is very simple, typically a single tanh layer.
The LSTM has a similar chain-like structure, but the internal structure of its
repeating module is more complex. The cell state, the most critical part of the LSTM,
is the horizontal line that runs through the top of the cell diagram. Information
flows along it from one step to the next with only minimal transformation. The LSTM
can remove information from the cell state or add information to it, and these changes
are regulated by structures called gates.
3.1 Long Short-Term Memory Based Architecture

Tokenization is the first stage of natural language processing (NLP): the text is
split into individual words or phrases for subsequent analysis, and extra blank space
in the text is cleaned up at the same time [62]. The next step is to examine the word
order of each sentence, to determine whether the words are arranged in the sequence
that best conveys the intended meaning. We also added checks for grammatical errors
and simplified difficult terms.
A recurrent neural network (RNN) forms connections between its units across time. One
of its main foundations is that each unit can use what its predecessors learned: the
computation at a given time step depends on the result computed at the previous time
step (for example, the computation at t2 depends on what was computed at t1). A
critical problem that can occur in RNNs is the vanishing gradient. During training,
the weights of a neural network are repeatedly adjusted using the computed error and
back-propagation. In an RNN, however, the error must also be propagated back through
time to reach these units, which makes training considerably harder.
The difficulty comes from how the weight updates are computed. At every step back in
time the gradient is multiplied by the network weights, so it shrinks (vanishes) as
the computation moves further back. If the gradient becomes too small, it contributes
almost nothing to learning [63]. Suppose the error is computed at time t3. To update
the weights of all the neurons that took part in producing that output, the error has
to be propagated back from time t3 to time t0.
Figure 3.1: Long Short-Term Memory Process
In short, the vanishing gradient problem makes it very difficult for an RNN to carry
information across many time steps when the sequence is long [64]. Long Short-Term
Memory (LSTM), an improved version of the recurrent neural network, was designed to
address this problem. Its structure is described in the following paragraphs. Weights
are updated with the rule below, so a vanishing gradient leaves them almost unchanged:
New weight = Old weight – (learning rate * gradient)
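The effect can be illustrated with a few lines of Python. This is a simplified sketch, not the training code used in this work: it assumes a single recurrent weight of 0.5, ignores the derivative of the activation function, and uses hypothetical function names. It only shows how repeated multiplication shrinks the gradient and how the update rule above then barely moves the weight.

```python
# Illustrative sketch only: the numbers are assumptions, not values from this work.

def backpropagated_gradient(initial_gradient, recurrent_weight, steps):
    """Multiply the gradient by the same recurrent weight once per time step."""
    gradient = initial_gradient
    for _ in range(steps):
        gradient *= recurrent_weight
    return gradient


def update_weight(old_weight, learning_rate, gradient):
    """New weight = old weight - (learning rate * gradient)."""
    return old_weight - learning_rate * gradient


near = backpropagated_gradient(1.0, 0.5, steps=3)   # error propagated 3 steps back
far = backpropagated_gradient(1.0, 0.5, steps=30)   # error propagated 30 steps back

print(near, far)
print(update_weight(0.8, 0.1, near), update_weight(0.8, 0.1, far))
```

With a recurrent weight below 1, the gradient that reaches the early time steps is many orders of magnitude smaller than the one that reaches recent steps, so the early steps contribute almost nothing to the update.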
Figure 3.2: Back-Propagation in LSTM
The LSTM transports information effectively from one time step to another through a
memory cell that runs along the top of the unit. It can therefore retain more
information from previous states and, unlike a plain RNN, is far less hampered by
vanishing gradients. Gates control whether information in the memory cell is written
or erased.
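As a rough illustration of how gates write to and erase from the memory cell, the sketch below implements one time step of a single-unit LSTM using the standard gate equations. The scalar weights in `params`, the input values, and the helper names are assumptions made for illustration; they are not parameters or code from this work.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell (scalar input and state)."""
    # Every gate sees the current input x and the previous hidden output h_prev.
    f = sigmoid(params["wf"] * x + params["uf"] * h_prev + params["bf"])    # forget gate
    i = sigmoid(params["wi"] * x + params["ui"] * h_prev + params["bi"])    # input gate
    o = sigmoid(params["wo"] * x + params["uo"] * h_prev + params["bo"])    # output gate
    g = math.tanh(params["wg"] * x + params["ug"] * h_prev + params["bg"])  # candidate
    c = f * c_prev + i * g   # erase part of the old memory, then write new content
    h = o * math.tanh(c)     # expose a gated view of the memory as the hidden output
    return h, c


# Hypothetical parameters: all weights 0.5, all biases 0.
params = {name: 0.5 for name in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
params.update({"bf": 0.0, "bi": 0.0, "bo": 0.0, "bg": 0.0})

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through the step almost unchanged, which is what lets information survive over long sequences.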
Figure 3.3: Gated Recurrent Unit in LSTM
An LSTM unit needs two inputs to operate: the hidden-layer output from the previous
time step and the input data for the current time step. Both pass through several
nodes, each with its own gates and activation functions, before reaching the output of
the network. In typical text modelling, most of the preprocessing and modelling effort
is concerned with preparing the data as sequences [65]. Examples include
part-of-speech tagging, stop-word removal, and reordering of the text. These steps
help the model comprehend the data more easily and in a form consistent with known
patterns, with the goal of improving the accuracy of its predictions. LSTM networks
bring a feature that is particularly useful in this setting: as discussed above, an
LSTM can remember the order in which the data is presented. It also has a second
property: it tries to discard information that is not needed. Text data almost always
contains a sizeable amount of information that is not used, and by omitting it the
LSTM reduces computation time and cost. Because it can both discard unused information
and remember the order of the information it keeps, the LSTM is a useful tool for text
categorization and other text-dependent tasks.
The embedding layer is one of the many layers that make up our
network. It is the first layer and provides the foundation for the rest of the model.
Its input is a sequence of token indices (each sentence is converted into an index
sequence by the tokenizer) [66]. The layer learns an embedding table, and this table
is used to convert the index sequences into sequences of dense vectors for the
subsequent layer. The embedding dimension was set to 128, and the tokenizer was
limited to the 1,000 most frequently used words.
The second layer is a SpatialDropout1D layer, which promotes independence between
feature maps by dropping entire one-dimensional feature maps rather than individual
elements [67]. We used a rate of 0.1, that is, the fraction of input units that is
dropped.
The third layer is an LSTM layer in which both the dropout and the recurrent dropout
parameters are set to 0.2, a value chosen from the range 0.1 to 0.3. As the recurrent
layer of the network, it is responsible for handling long-term dependencies in the
sequence. The dimension of the LSTM hidden state in our method is 200, which is the
number of units in the layer.
Word embeddings are a class of methods that represent words and documents using dense
vectors.
This is an improvement over the more traditional bag-of-words encoding schemes, in
which large sparse vectors were used either to represent each word or to score each
word within a vector that represents an entire vocabulary [68]. Those representations
were sparse because the vocabularies were vast, so a given word or document was
represented by a large vector composed mostly of zero values.
In an embedding, by contrast, words are represented by dense vectors, where each
vector is the projection of the word into a continuous vector space.
The position of a word within the vector space is learned from text and is based on
the words that surround it when it is used, that is, on its immediate context.
The position of a word in the learned vector space is referred to as its embedding
[69]. The Keras deep learning framework includes an Embedding layer that can be used
for neural networks on text data.
The input data must first be integer encoded, so that each word is represented by a
unique integer. This data preparation step can be performed with the Tokenizer API,
which is also provided with Keras.
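The integer-encoding step can be sketched in plain Python. This is not the Keras Tokenizer itself but a minimal stand-in, assuming whitespace tokenization, a vocabulary limited to the most frequent words, and index 0 reserved for padding. The function names and the example texts are hypothetical.

```python
from collections import Counter


def build_vocabulary(texts, max_words):
    """Map the max_words most frequent words to integers 1..max_words (0 = padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, breaking ties alphabetically for reproducibility.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: index for index, (word, _) in enumerate(ranked[:max_words], start=1)}


def encode(text, vocabulary, length):
    """Integer-encode one text, drop out-of-vocabulary words, then pad or truncate."""
    ids = [vocabulary[word] for word in text.lower().split() if word in vocabulary]
    return (ids + [0] * length)[:length]


texts = ["the movie was good", "the movie was bad", "good good movie"]
vocabulary = build_vocabulary(texts, max_words=4)
encoded = [encode(text, vocabulary, length=5) for text in texts]

print(vocabulary)
print(encoded)
```

Limiting the vocabulary in this way mirrors the choice of keeping only the most frequent words: rare words such as "bad" in this toy example fall outside the vocabulary and are dropped from the encoded sequence.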
The Embedding layer is initialized with random weights and then learns an embedding
for every word in the training dataset. It is a flexible layer that can be used in
several ways [70]. It can be used on its own to learn a word embedding that is saved
and used in another model later. It can also be used as part of a deep learning model,
in which case the embedding is learned together with the model itself.
Finally, it can be used to load a pre-trained word embedding model, which is a form of
transfer learning. The Embedding layer is defined as the first hidden layer of a
network, and three arguments must be specified:
● input_dim: the size of the vocabulary in the text data. For example, if the data is
integer encoded to values from 0 to 10, the size of the vocabulary is 11 words.
● output_dim: the size of the vector space in which the words are embedded. It
determines the size of the output vector produced by this layer for each word. It
could be 32, 100, or even larger, and different values should be tested for the
problem at hand.
● input_length: the length of the input sequences, defined as for any input layer of
a Keras model. For example, if every input document consists of 1,000 words, this
value is 1,000.
For example, an Embedding layer can be defined with a vocabulary of 200 words (integer
encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which the
words are embedded, and input documents of 50 words each. The weights of the Embedding
layer are learned during training. When the model is saved to a file, the weights of
the Embedding layer are saved with it.
The output of the Embedding layer is a two-dimensional matrix that contains one
embedding for each word in the input sequence (the input document). To connect a Dense
layer directly to an Embedding layer, the 2D output matrix must first be transformed
into a 1D vector with the Flatten layer.
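The shapes involved can be checked with a small plain-Python sketch of what an embedding lookup followed by flattening does. It is a stand-in for illustration, not the Keras layer: the table is filled with random values instead of learned weights, and the function names and the example document are hypothetical. The sizes (vocabulary of 200, 32 dimensions, 50 words per document) follow the example in the text.

```python
import random


def make_embedding_table(input_dim, output_dim, seed=0):
    """One row of output_dim random weights per vocabulary index (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]


def embed(sequence, table):
    """Replace every integer index by its row in the table: length x output_dim."""
    return [table[index] for index in sequence]


def flatten(matrix):
    """Concatenate the rows of a 2D matrix into a single 1D vector."""
    return [value for row in matrix for value in row]


table = make_embedding_table(input_dim=200, output_dim=32)
document = [i % 200 for i in range(50)]   # hypothetical 50-word encoded document
embedded = embed(document, table)         # 50 rows of 32 values
flat = flatten(embedded)                  # 50 * 32 = 1600 values

print(len(embedded), len(embedded[0]), len(flat))
```

A Dense layer connected after the flattening step would therefore receive a vector of 1,600 values for each document.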
3.2 Bi-Directional LSTM

Bidirectional long short-term memory (BiLSTM) is a neural network that retains
sequence information in both directions: backward (from the future to the past) and
forward (from the past to the future). A BiLSTM differs from a standard LSTM in that
its input flows in both directions. With a regular LSTM the input flows in one
direction only, either forward or backward. With a bidirectional network the input
flows both ways, so information from both the past and the future is preserved [71].
Some research indicates that bidirectional LSTMs perform better than traditional LSTMs
on sequence classification problems. When all time steps of the input sequence are
available, a bidirectional LSTM trains two LSTMs on the input sequence instead of one:
the first on the input sequence as it is, and the second on a reversed copy. This can
give the network additional context and lead to faster and fuller learning of the
problem. The idea behind bidirectional recurrent neural networks (RNNs) is
straightforward.
The first recurrent layer of the network is duplicated so that two layers sit side by
side. The input sequence is passed unchanged to the first layer, and a reversed copy
of the input sequence is passed to the second. The approach was first adopted in
speech recognition, where the context of the whole utterance, rather than a strictly
left-to-right reading, is used to interpret what is being said. Natural language
processing is just one area where this approach works well; it is also useful for
other practical purposes. Each element of an input sequence can carry information
about both what came before and what comes after it. BiLSTM improves results because
it combines LSTM layers from both directions [71]. This means that BiLSTM produces,
for each element (word) of the sequence (sentence), an output that draws on both
directions. In natural language processing, the BiLSTM model is used for a wide
variety of tasks, such as sentence classification, translation, and entity
recognition. It has also shown promise in protein structure prediction, automatic
speech recognition, and handwriting recognition. Finally, BiLSTM is a slower model
that needs more time to train than LSTM, so it should be used only where the extra
context is needed.
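The two-layer arrangement can be sketched as follows. To keep the example short, a simple tanh recurrent layer stands in for the LSTM, and the weights and input values are assumptions chosen for illustration. What the sketch shows is the bidirectional wiring: one pass over the sequence as it is, one pass over a reversed copy, and the two outputs paired per position.

```python
import math


def run_recurrent(sequence, weight_in, weight_rec):
    """Simple tanh recurrent layer (stand-in for an LSTM); one state per time step."""
    h = 0.0
    states = []
    for x in sequence:
        h = math.tanh(weight_in * x + weight_rec * h)
        states.append(h)
    return states


def run_bidirectional(sequence, forward_weights, backward_weights):
    """Pair, for every position, a forward state with a backward state."""
    forward = run_recurrent(sequence, *forward_weights)
    # Process a reversed copy, then flip the result back to the original order.
    backward = run_recurrent(sequence[::-1], *backward_weights)[::-1]
    return list(zip(forward, backward))


outputs = run_bidirectional([1.0, 0.0, -1.0], (0.5, 0.8), (0.5, 0.8))
for position, (fwd, bwd) in enumerate(outputs):
    print(position, fwd, bwd)
```

At the first position the forward state has seen only the first input, while the backward state has already seen the whole sequence. This is how a bidirectional layer makes later context available to earlier positions.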
Figure 3.4: Bi-Directional LSTM
Consider an incomplete sentence such as "boys go to ...". From the preceding words
alone, nothing helpful comes to mind to fill the blank. However, if we know that a
later sentence is "boys come out of school," we can easily predict the missing word. A
bidirectional LSTM allows our neural network to do the same thing. The figure shows
the forward and backward layers through which the data flows. BiLSTM is usually the
method of choice for sequence-to-sequence tasks. Forecasting models, speech
recognition, and text classification are some of the possible applications of such a
network.
3.2.1 Hyperparameter Tuning of LSTM

Hyperparameters are values that control the behaviour of the model, and adjusting them
lets us fine-tune and optimize its performance [72]. They are not learned during
training; they must be set explicitly by the practitioner. Careful selection and
tuning of the hyperparameters can significantly improve the accuracy of the model. In
neural networks, examples of hyperparameters include the number of hidden layers, the
number of neurons in each hidden layer, the activation function, the learning rate,
the dropout ratio, and the number of epochs.
Deep learning is a rapidly evolving subfield of machine learning. It aims to mimic the
human brain as a means of recognizing patterns and making reliable predictions, and
models based on neural network architectures are useful in a very wide range of
fields. Of all the steps involved, training the model to predict reliably on new test
data is probably the most important. It is therefore crucial to choose the
hyperparameters (the values used to control the learning process) so that training is
efficient in terms of both time and fit, that is, so that the model fits the training
data neither too closely nor too poorly (limiting overfitting and underfitting).
Recurrent neural networks are a type of neural network specifically designed to deal
with sequential data. Long short-term memory (LSTM) processes data and passes
information forward in a manner similar to the control flow of a recurrent neural
network. The fundamental difference lies in the operations performed inside the LSTM
cells. These operations are what allow the LSTM to keep or forget information. LSTMs
also help to preserve the error that is back-propagated through time and layers. LSTM
models are a subset of RNN architectures that, thanks to their feedback connections,
can process whole sequences of data as well as single data points. The main
hyperparameters of an LSTM model, together with commonly used values, are covered in
this section. The focus is on how to set the most important LSTM hyperparameters by
hand, although automated optimization tools can also be used to select them. Such
tools are useful because they spare us the laborious task of determining and
fine-tuning hyperparameters manually.
● Number of nodes and hidden layers

The layers between the input and the output are the ones that are "hidden," as the
term suggests [72]. Because these layers cannot be observed directly, the predictions
of deep learning networks are often described as "black box" predictions that are hard
for humans to explain. There is no universal rule for deciding how many nodes (hidden
neurons) or hidden layers to use, and the best choice differs from problem to problem.
For a problem of moderate complexity one hidden layer is usually enough, and two
layers are generally sufficient for more complex problems. Increasing the number of
nodes in a layer may improve accuracy (with regularization to keep overfitting in
check), whereas too few nodes is likely to result in underfitting.
● Number of units in a dense layer

The number of units in a dense layer is set with model.add(Dense(10, ...)). A dense
layer is one of the most common layer types: every neuron in it receives input from
all the neurons of the previous layer. As a baseline, dense layers with between 5 and
10 units (nodes) are a reasonable starting point. The output shape of the final dense
layer is determined by the number of neurons (units) specified.
● Dropout
Dropout can be set on the layer itself with model.add(LSTM(..., dropout=0.5)) or
added as a separate layer with model.add(Dropout(0.5)). Every LSTM layer should be
followed by a dropout layer. This layer helps to avoid overfitting by randomly
ignoring some neurons during training, which reduces the sensitivity of the network to
the specific weights of individual neurons [72]. Dropout layers are useful for
training models, but they should not be used on output layers because they can distort
the model's output and the error calculation. The likelihood of overfitting grows with
the number of dense layers and the number of nodes inside them, but this can be
mitigated by adding dropout. A rate of 20% is a sensible starting point, and the rate
should be kept relatively low (up to 50%). A value of 20% is widely regarded as a good
compromise between preventing overfitting and retaining model accuracy.
● Weight initialization
Ideally, a different weight initialization scheme is used for each activation
function. In practice, the initial weight values are most often drawn from a uniform
distribution. The weights cannot all be initialized to 0.0, because the optimization
algorithm needs some asymmetry in the error gradient in order to start searching
effectively. Different initial weights give different starting points for the
optimization process and can therefore lead to different final solutions with
different performance [73]. Finally, the weights should be initialized to small random
values, which introduces the randomness expected by stochastic optimization
algorithms such as stochastic gradient descent.
● Decay rate
Weight decay can be added to the weight update rule. It causes the weights to decay
exponentially towards zero if no other update is scheduled. After each update, the
weights are multiplied by a factor slightly less than 1, which prevents them from
growing too large. This acts as regularization in the network. The default value of
0.97 is usually a sufficient starting point.
● Activation Function
Activation functions determine the output of a node, for example whether it is
active (ON) or inactive (OFF). Adding them to a model allows deep learning to go
beyond linear prediction. Although activation functions can technically be
included in the dense layers, it is more practical to split them into their own
layers so that the reduced output of the dense layer can still be retrieved. Once
again, the best activation function depends on the application, but the rectifier
activation function is the most common. Different situations call for different
functions [74]. Output layers use sigmoid activation for binary predictions and
softmax activation for multi-class predictions (which allows the outputs to be
read as probabilities). An activation function can also be written as a
user-defined function that returns the corresponding output. For example, the
sigmoid (with NumPy imported as np) can be defined as:
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
The sigmoid (log-sigmoid) and the hyperbolic tangent are two popular activation
functions for use in LSTM blocks.
● Learning Rate
This hyperparameter controls how strongly the network's parameters are updated at
each step. If the learning rate is too high, the model may fail to converge
(convergence being the training stage at which the loss settles to within an
error range around its final value) or may even diverge. If the learning rate is
too low, the steps towards the minimum of the loss function become very small, so
learning is noticeably slower and the model converges sluggishly. This
hyperparameter is commonly set to a value between 0.0 and 0.1, often combined
with a learning rate that decays during training.
● Momentum
The momentum hyperparameter has been studied for use with RNNs and LSTMs. Instead
of using only the gradient of the current step to guide the search, momentum
accumulates the gradients of past steps to determine the direction in which to
move. Typical values range from about 0.5 to 0.9.
● Epochs
This hyperparameter sets the number of complete passes made over the training
dataset. It can be any positive integer. In practice it is increased until the
validation accuracy begins to drop even as the training accuracy keeps growing,
which is a sign of overfitting. With early stopping, a large number of epochs is
specified and training is stopped once the improvement in the model's performance
on the validation dataset falls below a chosen threshold.
● Batch Size
This hyperparameter controls how many samples are processed before the model's
internal parameters are updated. For the same number of samples "seen", a larger
batch size means fewer parameter updates, each computed from more samples. A
batch size of 32 is usually accepted as a good default starting point. It is also
worth experimenting with multiples of 32, such as 64, 128, and 256.
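The relationship between batch size and the number of parameter updates per epoch can
be sketched as below. The dataset is a hypothetical array of 1,000 zero-valued samples,
and iterate_minibatches is an illustrative helper, not part of any library.

```python
import numpy as np


def iterate_minibatches(features, labels, batch_size=32, seed=0):
    """Yield shuffled (features, labels) mini-batches covering the data once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]


X = np.zeros((1000, 100))   # hypothetical data: 1,000 samples, 100 features
y = np.zeros(1000)

# One parameter update would be made per mini-batch.
updates_per_epoch = {
    bs: sum(1 for _ in iterate_minibatches(X, y, batch_size=bs))
    for bs in (32, 64, 128, 256)
}
```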
3.2.1.1 Additional Hyper parameters Tuning of LSTM

A Dropout layer should be placed after each LSTM layer. Dropout reduces the
likelihood of overfitting by making the layer less sensitive to the weights of
individual neurons. A dropout rate of 20% is widely used as a compromise between
retaining model accuracy and preventing overfitting. After the LSTM layer(s) have
transformed the input, the shape must be reduced (or, in rare cases, extended) by a
dense layer to match the target output before predictions can be made. Our output
carries two types of labels, so the dense layer needs two output units[75]. The
activation layer comes next. The activation can be folded into the dense layer, but
keeping the two separate makes it possible to retrieve the dense layer's output before
activation, although that is not needed here. As before, the choice of activation
function depends on the application. Here there are several classes, but only one of
them can apply to a given input at a time. For problems like this, the softmax
activation function is usually the most effective because we (and the model) can
interpret the outputs as probabilities. The loss function is commonly chosen together
with the activation function. With softmax, cross-entropy is the natural loss function,
and more precisely binary cross-entropy for a binary classification problem.
Cross-entropy avoids the learning plateaus that softmax outputs can otherwise produce
and speeds up learning. For the optimizer, Adam (short for "adaptive moment
estimation") is a good default that needs only small adjustments to its hyperparameters
and can still produce substantial gains. Finally, we must settle on a metric by which
the model will be judged. Keras provides a selection of accuracy metrics. Comparing
the models by how accurately they predict outcomes is quick and straightforward.
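To make the pairing of softmax and cross-entropy concrete, the following NumPy sketch
computes both for a hypothetical batch of two samples and two classes. It is an
illustration under assumed values and is not the Keras code used for the experiments.

```python
import numpy as np


def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy, where targets are integer class indices."""
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked + eps)))


logits = np.array([[2.0, 0.5],    # hypothetical dense-layer outputs, sample 1
                   [0.1, 1.5]])   # sample 2
probs = softmax(logits)
loss_correct = cross_entropy(probs, np.array([0, 1]))   # labels agree with the scores
loss_wrong = cross_entropy(probs, np.array([1, 0]))     # labels disagree with the scores
```

The loss is lower when the labels agree with the larger scores, which is the property
the optimizer exploits during training.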
3.3 CNN: Convolutional Neural Networks

Neural networks are a subset of machine learning. A network consists of node layers:
an input layer, one or more hidden layers, and an output layer. Each node connects to
other nodes and has an associated weight and threshold. If the output of a node is
greater than its threshold value, the node is activated and sends data to the next
layer of the network. Otherwise, no data is passed on to the next layer[76].
Feedforward networks are only one kind of neural network; different types are used
for different tasks and kinds of data. Recurrent neural networks (RNNs) are commonly
used for natural language processing and speech recognition, whereas convolutional
neural networks (CNNs) are used more often for classification and computer vision
tasks. Before CNNs, identifying objects in images relied on manual, time-consuming
feature extraction. CNNs instead provide a scalable approach to image classification
and object recognition. They apply principles from linear algebra, specifically matrix
multiplication, to identify patterns within an image. They can, however, be
computationally demanding, and training typically requires graphics processing units
(GPUs).
Figure 3.5: CNN in Image
Convolutional neural networks are distinguished from other neural networks by their
superior performance on image, speech, or audio signal inputs. They have three main
types of layers:
● Convolutional layer
● Pooling layer
● Fully connected (FC) layer
The convolutional layer is the first layer of a convolutional network. It can be
followed by additional convolutional layers or pooling layers, and the fully connected
layer is the final layer. With each layer, the CNN increases in complexity and
identifies greater portions of the image. Earlier layers focus on simple features,
such as colors and edges. As the image data progresses through the layers of the CNN,
it starts to recognize larger elements or shapes of the object until it finally
identifies the intended object.
Convolutional Layer
The convolutional layer is the core building block of a CNN, and it is where the
majority of the computation occurs. It requires three components: input data, a
filter, and a feature map. Assume the input is a color image, which is a
three-dimensional matrix of pixels. The input therefore has three dimensions (height,
width, and depth), where the depth corresponds to the RGB channels of the image. A
feature detector, also known as a kernel or a filter, moves across the receptive fields
of the image and checks whether the feature is present. This process is known as a
convolution. The feature detector is a two-dimensional array of weights that represents
part of the image[77]. Although the filter size can vary, it is typically a 3x3 matrix,
which also determines the size of the receptive field. The filter is applied to an area
of the image, and a dot product is calculated between the input pixels and the filter.
This dot product is then written into an output array. The filter then shifts by a
stride, and the process repeats until the kernel has swept across the entire image. The
final output of this series of dot products between the input and the filter is known
as a feature map, activation map, or convolved feature.
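The sliding dot product described above can be sketched in a few lines of NumPy. This
is a simplified illustration: it assumes a single-channel input rather than a
three-channel color image, uses no padding, and the 5x5 input and all-ones 3x3 filter
are hypothetical values. As in most deep learning libraries, the filter is not flipped.

```python
import numpy as np


def conv2d(image, kernel, stride=1):
    """Slide the kernel over a 2-D image and return the feature map (no padding)."""
    kernel_h, kernel_w = kernel.shape
    out_h = (image.shape[0] - kernel_h) // stride + 1
    out_w = (image.shape[1] - kernel_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            row, col = i * stride, j * stride
            patch = image[row:row + kernel_h, col:col + kernel_w]  # receptive field
            feature_map[i, j] = np.sum(patch * kernel)             # dot product
    return feature_map


image = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical 5x5 input
kernel = np.ones((3, 3))                           # hypothetical 3x3 filter
feature_map = conv2d(image, kernel)                # stride 1 gives a 3x3 feature map
strided_map = conv2d(image, kernel, stride=2)      # stride 2 gives a 2x2 feature map
```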
Each output value in the feature map does not have to connect to every pixel value in
the input image. It only needs to connect to the receptive field where the filter is
being applied. Because the output array does not need to map directly to each input
value, convolutional (and pooling) layers are commonly referred to as "partially
connected" layers. This characteristic is also described as local connectivity. The
feature detector's weights stay fixed as it moves across the image, which is known as
parameter sharing. Some parameters, such as the weight values, are adjusted
automatically during training through backpropagation and gradient descent. However,
three hyperparameters that affect the volume size of the output must be set before
training begins[78]:
● The number of filters affects the depth of the output. Three distinct filters, for
instance, yield three distinct feature maps, giving an output depth of three.
● The stride is the distance, in pixels, that the kernel moves over the input matrix.
Stride values of two or more are rare, but a larger stride yields a smaller output.
● Zero-padding is commonly used when the filters do not fit the input image. It sets
all elements that fall outside the input matrix to zero, producing an output that is
larger than or the same size as the input.
There are three types of padding:
● Valid padding, also known as no padding. In this case, the last convolution is
dropped if the dimensions do not align.
● Same padding ensures that the output layer has the same size as the input layer.
● Full padding increases the size of the output by adding zeros to the border of the
input.
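How stride and padding determine the output size can be sketched with a small helper.
The function name and the 32-pixel example are hypothetical, and the "same" case
assumes an odd filter size with stride 1.

```python
def conv_output_size(input_size, filter_size, stride=1, padding="valid"):
    """Output length along one dimension for the three padding types."""
    if padding == "valid":
        pad = 0                          # no padding
    elif padding == "same":
        pad = (filter_size - 1) // 2     # assumes odd filter size and stride 1
    elif padding == "full":
        pad = filter_size - 1            # every partial overlap is kept
    else:
        raise ValueError(f"unknown padding type: {padding}")
    return (input_size + 2 * pad - filter_size) // stride + 1


valid_size = conv_output_size(32, 3, padding="valid")   # smaller than the input
same_size = conv_output_size(32, 3, padding="same")     # equal to the input
full_size = conv_output_size(32, 3, padding="full")     # larger than the input
strided_size = conv_output_size(32, 3, stride=2)        # larger stride, smaller output
```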
After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU)
transformation to the feature map, which introduces nonlinearity to the model. This
nonlinearity lets the model capture more complex patterns.
Another convolutional layer can follow the initial one. When this happens, the
structure of the CNN can become hierarchical, because later layers can see the pixels
within the receptive fields of earlier layers. As an example, suppose we want to
determine whether an image contains a bicycle. The bicycle can be thought of as a sum
of parts: a frame, handlebars, wheels, and pedals. Each individual part of the bicycle
makes up a lower-level pattern in the CNN, and the combination of its parts represents
a higher-level pattern, creating a feature hierarchy within the CNN[77]. Ultimately,
the convolutional layer converts the image into numerical values, allowing the neural
network to interpret it and extract relevant patterns.
Pooling Layer
Pooling layers, also known as downsampling, perform dimensionality reduction, reducing
the number of parameters in the input[79]. Like the convolutional layer, the pooling
operation sweeps a filter across the entire input, but this filter has no weights.
Instead, the kernel applies an aggregation function to the values within the receptive
field and populates the output array with the result. There are two main types of
pooling:
● Max pooling: as the filter moves across the input, it selects the pixel with the
maximum value to send to the output array. This approach is used more often than
average pooling.
● Average pooling: as the filter moves across the input, it calculates the average
value within the receptive field to send to the output array.
Although a lot of information is lost in the pooling layer, it brings several benefits
to the CNN: it helps reduce complexity, improve efficiency, and limit the risk of
overfitting.
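Both pooling types can be sketched with NumPy as below. The sketch assumes
non-overlapping windows (stride equal to the window size), and the 4x4 feature map
holds hypothetical values.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Pool over non-overlapping size x size windows of a 2-D feature map."""
    out_h = feature_map.shape[0] // size
    out_w = feature_map.shape[1] // size
    trimmed = feature_map[:out_h * size, :out_w * size]
    # After reshaping, axes 1 and 3 index the positions inside each window.
    windows = trimmed.reshape(out_h, size, out_w, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "average":
        return windows.mean(axis=(1, 3))
    raise ValueError(f"unknown pooling mode: {mode}")


feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]], dtype=float)   # hypothetical values
max_pooled = pool2d(feature_map, mode="max")          # [[6, 4], [8, 9]]
avg_pooled = pool2d(feature_map, mode="average")      # [[3.75, 2.25], [4.5, 4.0]]
```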
Fully Connected Layer
The name of the fully connected layer describes it well. As mentioned earlier, in
partially connected layers the pixel values of the input image are not directly
connected to the output layer. In the fully connected layer, by contrast, each node in
the output layer connects directly to a node in the previous layer[80]. This layer
performs classification based on the features extracted by the previous layers and
their different filters. While convolutional and pooling layers tend to use ReLU
functions, FC layers usually use a softmax activation function to classify inputs,
producing a probability between 0 and 1.
Convolutional neural networks are most closely associated with computer vision. In
recent years they have been at the center of many advances in the field, such as
automatic photo tagging on Facebook and self-driving vehicles. More recently, CNNs
have also been applied to problems in natural language processing (NLP), with
encouraging results. The remainder of this section explains how CNNs are used in NLP.
The discussion started with computer vision because the underlying intuitions are
easier to understand in that context[81]. Instead of image pixels, the input to most
NLP tasks is a sentence or document represented as a matrix. Each row of the matrix
corresponds to one token, typically a word, although it could also be a character. In
other words, each row is a vector that represents a word. These vectors are usually
word embeddings (low-dimensional representations) such as word2vec or GloVe, but they
can also be one-hot vectors that index the word into a vocabulary. For a 10-word
sentence with a 100-dimensional embedding, the input is a 10x100 matrix. That matrix is
our "image". In vision, filters slide over local patches of an image, but in NLP we
typically use filters that slide over full rows of the matrix (words)[82]. The "width"
of the filters is therefore usually the same as the width of the input matrix. The
height, or region size, may vary, but sliding windows over two to five words at a time
are typical. Putting all of this together, a convolutional neural network for NLP may
look like the architecture in Figure 3.6. It is worth working through how the
dimensions in the figure are computed.
Figure 3.6: Convolutional Neural Network (CNN) architecture for sentence classification.
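The matrix view of a sentence can be sketched as follows for a 10-word sentence with
100-dimensional embeddings. The embeddings and filter weights are random stand-ins
rather than trained values, and conv_over_words is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = rng.normal(size=(10, 100))    # 10 words x 100 dimensions (random stand-ins)
text_filter = rng.normal(size=(3, 100))  # covers 3 words at a time, full embedding width


def conv_over_words(sentence_matrix, word_filter):
    """Slide a full-width filter down the rows (words) of the sentence matrix."""
    height = word_filter.shape[0]
    n_positions = sentence_matrix.shape[0] - height + 1
    return np.array([
        np.sum(sentence_matrix[i:i + height] * word_filter)
        for i in range(n_positions)
    ])


features = conv_over_words(sentence, text_filter)   # one value per window of 3 words
```

Because the filter spans the full width of the matrix, it can only move in one
direction, and the result is a vector with 10 - 3 + 1 = 8 entries.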
Narrow vs. Wide Convolution
The earlier explanation of convolutions left out a small but important detail of how
the filter is applied. Applying a 3x3 filter at the center of the matrix works fine,
but what about the edges? How is the filter applied to the first element of a matrix,
which has no neighboring elements above it or to its left? The answer is zero-padding:
all elements that would fall outside the matrix are taken to be zero. This makes it
possible to apply the filter to every element of the input matrix and to obtain an
output that is larger than or the same size as the input. Using zero-padding is also
called wide convolution, and not using zero-padding is called narrow convolution.
Figure 3.7 shows a one-dimensional example:
Figure 3.7: Narrow vs. Wide Convolution
When the filter is large relative to the input, it becomes evident why wide
convolution is useful or even necessary. In Figure 3.7, the narrow convolution yields
an output of size (7 - 5) + 1 = 3, and the wide convolution yields an output of size
(7 + 2*4 - 5) + 1 = 11. More generally, the output size is
n_out = (n_in + 2 * n_padding - n_filter) + 1
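The two output sizes can be checked with NumPy's one-dimensional convolution, where
mode="valid" corresponds to narrow convolution and mode="full" to wide convolution.
The input and filter values are hypothetical. Note that np.convolve flips the filter,
which makes no difference for the symmetric filter assumed here.

```python
import numpy as np

signal = np.arange(1.0, 8.0)   # hypothetical input of length 7
filt = np.ones(5)              # hypothetical filter of length 5

narrow = np.convolve(signal, filt, mode="valid")   # no padding: length (7 - 5) + 1
wide = np.convolve(signal, filt, mode="full")      # padded: length (7 + 2*4 - 5) + 1

# Wide convolution equals narrow convolution on an input zero-padded by 4 on each side.
padded = np.pad(signal, 4)
wide_by_padding = np.convolve(padded, filt, mode="valid")
```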
Another convolution hyperparameter is the stride size, which defines how far the
filter shifts at each step. In all of the examples above the stride size was 1, and
consecutive applications of the filter overlapped[83]. A larger stride leads to fewer
applications of the filter and a smaller output size. Figure 3.8, taken from the
Stanford cs231 website, shows stride sizes of 1 and 2 applied to a one-dimensional
input.
Figure 3.8: Convolution Stride Size
Pooling layers are commonly applied after the convolutional layers. Pooling layers
subsample their input. The most common way to do pooling is to apply a max operation
to the result of each filter. It is not necessary to pool over the entire matrix;
pooling can also be done over a window. Figure 3.9 shows max pooling for a 2x2 window.
In NLP, pooling is typically applied over the complete output, which yields a single
number for each filter.
Figure 3.9: Max pooling in CNN
Why use pooling? There are two main reasons. First, pooling provides a fixed-size
output matrix, which is typically required for classification. For example, if you
have 1,000 filters and apply max pooling to each of them, the output has 1,000
dimensions regardless of the size of the filters or the size of the input. This makes
it possible to use sentences of variable length and filters of variable size while
always obtaining the same output dimensions to feed into a classifier.
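The fixed-size property can be sketched as below: four filters of different heights
are applied to a short and a long sentence, and max pooling over all positions leaves
one value per filter in both cases. All embeddings and filter weights are random
stand-ins, and the sizes are hypothetical.

```python
import numpy as np


def max_over_time(sentence_matrix, filters):
    """Apply each full-width filter over the sentence and keep its maximum response."""
    pooled = []
    for word_filter in filters:
        height = word_filter.shape[0]
        responses = [
            np.sum(sentence_matrix[i:i + height] * word_filter)
            for i in range(sentence_matrix.shape[0] - height + 1)
        ]
        pooled.append(max(responses))
    return np.array(pooled)


rng = np.random.default_rng(0)
filters = [rng.normal(size=(h, 50)) for h in (2, 3, 4, 5)]   # four filters, varied heights
short_sentence = rng.normal(size=(6, 50))    # 6 words, 50-dimensional stand-in embeddings
long_sentence = rng.normal(size=(40, 50))    # 40 words

short_vector = max_over_time(short_sentence, filters)
long_vector = max_over_time(long_sentence, filters)
```

Both vectors have one entry per filter, so the same classifier can consume either. The
sketch assumes each sentence is at least as long as the tallest filter.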
Second, pooling reduces the dimensionality of the output while (hopefully) keeping the
most salient information. Each filter can be thought of as detecting a specific
feature, for instance whether the sentence contains a negation such as "not
remarkable". If the phrase occurs somewhere in the sentence, applying the filter to
that region yields a large value, while other regions yield small values. The max
operation keeps the information about whether the feature appeared in the sentence but
discards the information about where it appeared. Is that positional information not
useful? It is, and the situation is somewhat similar to a bag of n-grams model: global
information about location (where in a sentence something happens) is lost, but the
local information captured by the filters is kept, such as the difference between "not
amazing" and "amazing not"[84]. In image recognition, pooling also provides basic
invariance to translation (shifting) and rotation. When pooling over a region, the
output stays approximately the same even if the image is shifted or rotated by a few
pixels, because the max operation picks out the same value regardless.
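The trade-off between keeping a feature and losing its position can be seen in a tiny
sketch. The two arrays are hypothetical responses of one filter to two sentences in
which the same phrase occurs at different positions.

```python
import numpy as np

# Hypothetical responses of a single filter along two sentences.
phrase_early = np.array([0.1, 9.0, 0.2, 0.0, 0.1, 0.3])   # strong response at position 1
phrase_late = np.array([0.1, 0.0, 0.2, 0.3, 9.0, 0.1])    # strong response at position 4

pooled_early = phrase_early.max()
pooled_late = phrase_late.max()

# The positions differ, but only the pooled maxima are passed on to the classifier.
position_early = int(phrase_early.argmax())
position_late = int(phrase_late.argmax())
```

The pooled values are identical, so the classifier learns that the phrase occurred but
not where it occurred.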
Channels are different "views" of the same input data. In image recognition, for
example, the standard channels are red, green, and blue (RGB). Convolutions can be
applied across channels, with either different or shared weights. NLP inputs can have
several channels as well: there could be separate channels for different word
embeddings (word2vec and GloVe, for example), or a channel for the same sentence
expressed in different languages or phrased in different ways. CNNs are a natural fit
for classification tasks such as sentiment analysis, spam detection, and topic
categorization. Convolutions and pooling operations lose information about the local
order of words, which makes sequence tagging, such as part-of-speech tagging or entity
extraction, harder to fit into a pure CNN architecture (though not impossible, since
positional features can be added to the input).
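A two-channel text input can be sketched by stacking two embedding matrices for the
same sentence. The matrices below are random stand-ins for, say, word2vec and GloVe
vectors, and the filter uses separate weights for each channel; all sizes are
hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim, height = 10, 100, 3

first_channel = rng.normal(size=(n_words, dim))    # stand-in for one embedding type
second_channel = rng.normal(size=(n_words, dim))   # stand-in for a second embedding type
sentence = np.stack([first_channel, second_channel])   # shape: (channels, words, dim)

channel_filter = rng.normal(size=(2, height, dim))     # separate weights per channel

# Each output value sums the filter responses over both channels for one word window.
features = np.array([
    np.sum(sentence[:, i:i + height, :] * channel_filter)
    for i in range(n_words - height + 1)
])
```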
The study in [101] evaluates a CNN architecture on a number of classification
datasets, most of which involve sentiment analysis and topic categorization. The
architecture performs strongly across these datasets and outperforms the state of the
art on a subset of them. The network used in that study is strikingly simple, which is
one of its greatest strengths. The input layer consists of sentences represented as
concatenated word2vec word embeddings. It is followed by a convolutional layer with
multiple filters, then a max-pooling layer, and finally a softmax classifier. The study
also experiments with two channels in the form of static and dynamic word embeddings,
where one channel is adjusted during training and the other is not. A comparable but
somewhat more complex architecture had been proposed previously, and a further variant
adds a "semantic clustering" layer to this network structure.
Figure 3.10: Convolutional Neural Networks for Sentence Classification
A CNN can also be trained from scratch, without pre-trained word vectors such as
word2vec or GloVe. In this case, convolutions are applied directly to one-hot vectors.
The authors also propose a space-efficient, bag-of-words-like representation for the
input data, which reduces the number of parameters the network has to learn. In [105],
the model is extended with an additional unsupervised "region embedding" that is
learned by a CNN predicting the context of text regions. The approach in these
publications appears to work well for longer texts, such as movie reviews, but its
performance on short texts, such as tweets, is less clear. Intuitively, pre-trained
word embeddings should yield larger gains for short texts than for long ones. Building
a CNN architecture means choosing among many hyperparameters, several of which were
introduced above: the input representation (word2vec, GloVe, or one-hot), the size of
the convolution filters, the pooling strategy (max or average), and the activation
functions (ReLU, tanh)[85]. An empirical study has evaluated the effect of varying
these hyperparameters in CNN architectures, investigating their impact on performance
and on the variance over multiple runs.
The findings of that study are a useful starting point for building your own CNN for
text classification. Among the results: max pooling always outperformed average
pooling, the ideal filter sizes were important but task-dependent, and regularization
did not seem to make a big difference in the NLP tasks considered. A caveat of this
study is that all the datasets were quite similar in terms of document length, so the
same guidelines may not apply to data that looks considerably different. CNNs have
also been explored for relation extraction and relation classification. In addition to
the word vectors, the authors feed the relative positions of the words with respect to
the entities of interest into the convolutional layer. The model assumes that the
positions of the entities are given and that each input sample contains exactly one
relation. Another application described in the literature is recommending documents to
users based on what they are currently reading; there, the sentence representations
are trained on search engine log data. Most CNN architectures learn embeddings
(low-dimensional representations) for words and sentences in one way or another as
part of their training[86]. However, not all works study how meaningful the learned
embeddings are, and some do not focus on this aspect of training at all. One proposed
CNN architecture predicts hashtags for Facebook posts while also producing meaningful
embeddings for words and sentences. These learned embeddings are then applied
successfully to another task: recommending potentially interesting documents to users,
with the model trained on clickstream data.
Character-Level CNNs
All of the models presented so far operate on words. However, there is also a body of
research that applies CNNs directly to characters. One approach learns character-level
embeddings, joins them with pre-trained word embeddings, and uses a CNN for
part-of-speech tagging. Other work explores whether CNNs can learn directly from
characters without any pre-trained embeddings[86]. Notably, the authors use a
relatively deep network with a total of 9 layers and apply it to sentiment analysis
and text categorization. The results show that learning directly from character-level
input works very well on large datasets (millions of examples) but underperforms
simpler models on smaller datasets (hundreds of thousands of examples). A further
study explores character-level convolutions for language modeling, using the output of
a character-level CNN as the input to an LSTM at each time step. The same model is
applied to several languages.
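Learning directly from characters requires turning raw text into a matrix without any
pre-trained embeddings. The sketch below shows one way to do this with one-hot rows.
The alphabet (lowercase ASCII letters, digits, and space) and the maximum length are
hypothetical choices; Dravidian scripts would need a different alphabet.

```python
import string

import numpy as np

# Hypothetical alphabet of 37 symbols: lowercase letters, digits, and the space.
ALPHABET = string.ascii_lowercase + string.digits + " "
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_characters(text, max_length=16):
    """Encode text as a (max_length x alphabet size) matrix of one-hot rows."""
    matrix = np.zeros((max_length, len(ALPHABET)))
    for position, ch in enumerate(text.lower()[:max_length]):
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:          # characters outside the alphabet stay all-zero
            matrix[position, index] = 1.0
    return matrix


encoded = one_hot_characters("Not amazing!")
```

The resulting matrix plays the same role as the word-embedding matrix in the earlier
sections: filters slide over its rows, which now represent characters instead of words.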