A Degree Thesis
Submitted to the Faculty of the
Escola Tècnica d'Enginyeria de Telecomunicació de
Barcelona
Universitat Politècnica de Catalunya
by
Berta Viñas Redondo
In partial fulfilment
of the requirements for the degree in
TELECOMMUNICATIONS TECHNOLOGIES AND SERVICES
ENGINEERING
Abstract
The aim of this thesis is the development of a system for identifying and categorizing offensive
language in tweets using machine learning techniques. The project is based on Task 12 of the
SemEval 2020 competition. This task consists of identifying offensive tweets and classifying
the type and target of the offense.
For this task, the Offensive Language Identification Dataset (OLID) is used. The dataset
contains annotated English tweets. The task is divided into three subtasks depending on the
type and target of the offense.
Different machine learning models are applied for the development of the project. The thesis
provides a detailed analysis and evaluation of the results obtained with the different models
and a comparison with the results in last year’s competition.
It is demonstrated that one of the best models for this task consists of an ensemble of different
deep learning models, resulting in a final macro F1 score of 0.7807 for subtask A, 0.6634 for
subtask B and 0.6062 for subtask C.
Resum
Per a aquesta tasca, s’utilitza el conjunt de dades Offensive Language Identification Dataset
(OLID). El conjunt de dades conté tuits en anglès anotats. La tasca es divideix en tres
subtasques, segons el tipus i l’objectiu de l’ofensa.
Es demostra que un dels millors models per a aquesta tasca consisteix en una combinació de
diferents models d’aprenentatge profund, resultant una puntuació final de macro F1 de 0,7807
per a la subtasca A, 0,6634 per a la subtasca B i 0,6062 per a la subtasca C.
Resumen
Para esta tarea, se utiliza el conjunto de datos Offensive Language Identification Dataset
(OLID). El conjunto de datos contiene tuits en inglés anotados. La tarea se divide en tres
subtareas, según el tipo y el objetivo de la ofensa.
Se demuestra que uno de los mejores modelos para esta tarea consiste en una combinación
de distintos modelos de aprendizaje profundo, resultando una puntuación final de macro F1
de 0,7807 para la subtarea A, 0,6634 para la subtarea B y 0,6062 para la subtarea C.
This thesis is dedicated to my family, who have always been by my side, supporting me and
giving me strength throughout the whole degree.
To the friends in Lund, who shared their advice and encouragement while I was writing my thesis,
especially to Miquel, a loyal workmate during the last weeks.
Acknowledgements
I wish to express my gratitude to my supervisor in Lund, Pierre Nugues, who guided and
supported me during my study. His knowledge and advice helped me in the research and writing
of my thesis.
Finally, I would like to thank Jose Antonio Lázaro, my supervisor in Barcelona, for giving me
the opportunity to write my thesis at Lund University and for his help during the process.
Revision history and approval record
Table of contents
Abstract
Resum
Resumen
Acknowledgements
Revision history and approval record
Table of contents
List of Figures
List of Tables:
1. Introduction
1.1. Project background
1.2. Statement of purpose
1.3. Project requirements
1.4. Project specifications
1.5. Updated work plan
1.6. Deviations and incidences
2. State of the art of the technology used or applied in this thesis:
2.1. Introduction to machine learning
2.1.1. Logistic Regression classifier
2.1.2. Evaluating machine learning models
2.1.3. Scikit-learn
2.2. Introduction to deep learning
2.2.1. Introduction to neural networks
2.2.2. Data representation for neural networks
2.2.2.1. Vectors
2.2.2.2. Matrices
2.2.3. Overfitting
2.2.3.1. Dropout
2.2.4. Keras
2.3. Deep learning for text sequences
2.3.1. Preparing the text data
2.3.2. Working with text data
2.3.2.1. One-hot encoding
2.3.2.2. Word embeddings
2.3.3. Recurrent Neural Networks
2.3.3.1. SimpleRNN
2.3.3.2. LSTM and GRU
2.3.4. Convolutional Neural Networks
2.3.5. Ensembling models
3. Methodology / project development:
3.1. Task description
3.2. Dataset
3.3. Data pre-processing
3.4. Measurement
3.5. Creating the baseline
3.6. Training deep learning models
3.6.1. Used parameters
3.6.2. Deep learning models (I)
3.6.2.1. Simple RNN
3.6.2.2. Simple LSTM
3.6.2.3. LSTM
3.6.2.4. BiLSTM + Dropout
3.6.3. Deep learning models (II)
3.6.3.1. BiLSTM + BiGRU
3.6.3.2. CNN + Global Max Pooling
3.6.3.3. CNN (3 filters)
3.6.3.4. Ensemble
4. Results
4.1. Baseline results
4.2. Deep learning results
4.2.1. Subtask A
4.2.2. Subtask B
4.2.3. Subtask C
4.3. Comparison with OffensEval 2019
5. Budget
6. Conclusions and future development:
Bibliography:
Appendices:
Appendix I: Detailed results of the machine learning models.
Results for Logistic Regression
Subtask A:
Subtask B:
Subtask C:
Results for deep learning models
Subtask A:
Subtask B:
Subtask C:
Appendix II: Previous project
Appendix III: Installation guide
Appendix IV: Code
Glossary
List of Figures
List of Tables:
1. Introduction
1.1. Project background
The project is carried out at the Language Technology department in Lunds Tekniska
Högskola, Lund University. This project is based on the previous theoretical course Language
Technology, which introduces theories and techniques of language technology and natural
language processing. This course attempts to cover the whole field from speech recognition
and synthesis to semantics and dialogue.
Nowadays, due to the huge increase in the use of social media, a large amount of offensive language can be seen on these platforms. As manual filtering is very hard, much research has aimed at automating this process.
This topic is one of the tasks proposed in the SemEval 2020 competition. Semantic Evaluation (SemEval) is a series of workshops focused on the evaluation and comparison of systems that can analyze diverse semantic phenomena in text, with the aim of extending the current state of the art in semantic analysis and creating high-quality datasets. It provides a forum for researchers to propose challenging research problems in the field of semantics and to build new techniques to solve them. Many tasks are proposed every year; this project is based on Task 12 of SemEval 2020: Identifying and Categorizing Offensive Language in Social Media.
The main goal of this task is to identify offensive language on Twitter and categorize the type
and target of the offense. This task is the second iteration of OffensEval, which was first proposed in
SemEval 2019. As this year’s task, in SemEval 2020, is still ongoing, the project takes into
account the results obtained in last year’s competition.
The offensive content is broken down into the following three subtasks:
- Subtask A - Offensive language identification.
- Subtask B - Automatic categorization of offense types.
- Subtask C - Offense target identification.
1.2. Statement of purpose
The purpose of this project is to automate the process of detecting offensive language in
social media platforms, in this case, Twitter, using language technology techniques such as
machine learning.
The main goals of the project are:
1) Automate the detection of offensive language in social media platforms:
a. Discriminate between offensive and non-offensive posts.
b. Identify the type of offensive content in the posts.
c. Detect the target of the offensive posts.
2) Use machine learning techniques for language analysis.
3) Analyze and evaluate the machine learning models.
4) Compare the results with OffensEval 2019.
1.3. Project requirements
The principal requirement of the project is to be able to detect offensive language on Twitter,
which can be divided into the following points:
- Identify offensive language.
- Categorize offense types.
- Identify the target of the offense.
1.5. Updated work plan
Figure 1 details the structure of the work packages, whereas Figure 2, Figure 3 and Figure 4 show
the details of WP6 for each subtask.
Figure 2: Internal structure of WP6 for subtask A.
Figure 3: Internal structure of WP6 for subtask B.
The final Gantt diagram is presented in the figure below:
The final distribution of the hours dedicated to each of the main tasks of the project is shown
in Figure 6.
1.6. Deviations and incidences
At the beginning of the project, I had some difficulties installing the Keras Python module, which was needed for the code implementation. My computer could not support the Keras installation, so in the end I used Google Colab, a free cloud service that allows developing deep learning applications with Python libraries such as Keras and TensorFlow. It made both the implementation and the evaluation much easier. Some of the models are computationally expensive to train, so Colab also reduced the training times.
Regarding the work plan, the main changes with respect to the previous version are the structure and the order of execution of the tasks. The development of the project follows the agile methodology, which is iterative. For the first step, WP2: Creating the baseline, the three subtasks were implemented and then analyzed. For the remaining steps, WP3 to WP5, subtask A was implemented first. Once the results for the first subtask were available for all the models, it was easier to extend the models to subtasks B and C.
2. State of the art of the technology used or applied in this thesis:
This section gives a brief overview of machine learning and deep learning concepts, and the
different related models that are applied in the project. It also describes how neural networks
work and the types that are used in the project. The final section presents the concrete case
of how to work with text data using deep learning techniques.
The different Python tools and libraries for machine learning are outlined in some of the
following sections.
2.1. Introduction to machine learning
In the past few years, artificial intelligence (AI) has garnered a lot of attention. AI can be
defined as the capability of a computer system to automate intellectual tasks that normally
require human intelligence, such as speech recognition, visual perception, decision-making
and language translation. The AI field comprises machine learning and deep learning
approaches, as well as many more techniques that don’t involve any learning.
In classical AI, humans input rules and data, and the data is processed according to these rules to produce answers. In machine learning, by contrast, the system is trained: given input data together with the expected answers, it outputs the rules. A machine learning system can modify itself without human intervention each time new data is fed to it.
The goal of machine learning is to transform the input data into meaningful outputs, and its central problem is to learn useful representations of the input data. All the different algorithms consist of automatically finding representations that bring the data closer to the expected output.
A machine learning approach that will be used in this project is logistic regression, explained
in the next section.
2.1.1. Logistic Regression classifier
Logistic Regression is a machine learning algorithm from the field of statistics, used for classification problems. It is a predictive analysis algorithm based on the concept of probability, and it is particularly useful in binary classification problems.
Its name comes from the logistic function, also known as the sigmoid function. Logistic Regression uses this function to predict the probability that a given point belongs to a class. The function maps any real value to a value between 0 and 1.
In Logistic Regression, the input values (x) are combined linearly using coefficient values (β), called weights, to predict an output value (y).
These weights are estimated from the training data using maximum-likelihood estimation, a method used by several machine learning algorithms. The goal of maximum-likelihood estimation is to find the weights that make the classes observed in the data most probable under the model, that is, the weights that minimize the discrepancy between the probabilities predicted by the model and the classes in the data.
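As a minimal sketch of the ideas above (not the implementation used in the project), the logistic function, the linear combination of inputs and weights, and the quantity that maximum-likelihood estimation minimizes can be written in plain Python. The function names, the intercept `beta0` and the example weights are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, beta, beta0=0.0):
    """Probability of the positive class for one feature vector x."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(z)


def negative_log_likelihood(X, y, beta, beta0=0.0):
    """Quantity minimized by maximum-likelihood estimation (lower is better)."""
    total = 0.0
    for x, label in zip(X, y):
        p = predict_proba(x, beta, beta0)
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total


# Hypothetical weights and a single two-feature observation.
beta = [2.0, -1.0]
print(predict_proba([1.0, 0.5], beta))  # a probability between 0 and 1
```

Weights that explain the labels well give a lower negative log-likelihood than weights that contradict them, which is what the estimation procedure exploits.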
2.1.2. Evaluating machine learning models
Machine learning models should not be evaluated on the same data they were trained on. The goal of a machine learning model is to perform well on never-before-seen data, so the main objective is to be able to generalize. A score computed on the training data is overly optimistic and cannot reveal overfitting, that is, the point at which the model starts performing worse on new data.
This is the reason why it is necessary to split the data into different sets to evaluate the
models:
- Train data.
- Validation data.
- Test data.
The model is trained using the training data and evaluated on the validation data. The
performance of the model on the validation data is used as feedback to tune the configuration
of the model. Once the model is ready, the final evaluation is done using the test data.
There are different techniques to split the data into the three sets needed. In this project,
simple hold-out validation is used, which consists of setting apart a fraction of the data as a test
set. An example is shown in the figure below.
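The hold-out idea can be sketched in a few lines of plain Python. This is only an illustration: the function name, the 10 % validation and test fractions and the fixed seed are assumptions, not the values used in the project.

```python
import random


def hold_out_split(samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the samples and split them into train, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical data: 100 (tweet, label) pairs.
data = [("tweet %d" % i, i % 2) for i in range(100)]
train, val, test = hold_out_split(data)
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before splitting avoids a test set made only of the last examples in the file, which may not be representative.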
2.1.3. Scikit-learn
Scikit-learn is a machine learning library, which provides a large set of algorithms that can be
used with Python.
The two main functions in the library used for the classifier are fit(), to train a model and
predict(), to predict a class.
The input data for machine learning models that use this library has the following structure:
- Features: x denotes a feature vector containing information of one observation,
whereas X is a feature matrix representing the whole dataset.
- Labels: y denotes a vector, which contains the classes for the dataset.
Both X and y must be in the numpy array format, where numpy is the numerical computation
library on which scikit-learn is built.
For example, in our project, the input data has the structure shown in Figure 10.
In this case, the observations are the different tweets, and the features of each observation
are the words in each tweet. Finally, the vector y contains the corresponding classes.
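To make the structure of X and y concrete, the following sketch builds a word-count feature matrix from two invented tweets. It uses plain Python lists in place of numpy arrays, and the helper name `build_feature_matrix`, the tweets and the label convention are assumptions made for illustration.

```python
def build_feature_matrix(tweets):
    """Return (X, vocabulary): X[i][j] counts word j of the vocabulary in tweet i."""
    vocabulary = sorted({word for tweet in tweets for word in tweet.lower().split()})
    index = {word: j for j, word in enumerate(vocabulary)}
    X = []
    for tweet in tweets:
        row = [0] * len(vocabulary)
        for word in tweet.lower().split():
            row[index[word]] += 1
        X.append(row)
    return X, vocabulary


# Hypothetical observations and labels (1 = offensive, 0 = not offensive).
tweets = ["you are great", "you are awful"]
y = [0, 1]
X, vocabulary = build_feature_matrix(tweets)
print(vocabulary)  # ['are', 'awful', 'great', 'you']
print(X)           # [[1, 0, 1, 1], [1, 1, 0, 1]]
```

Each row of X is the feature vector x of one tweet, and y holds one class per row; arrays of this shape are what fit() and predict() expect.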
2.2. Introduction to deep learning
Deep learning is a specific subfield of machine learning. The word ‘deep’ in its name refers to the idea of successive layers of representations: multiple layers are used to extract higher-level features from the input data. This is the main difference from other machine learning approaches, which tend to focus on learning only one or two layers of representations. In deep learning, these layered representations are learned via models called neural networks.
2.2.1. Introduction to neural networks
Neural networks are composed of layers where each layer contains a set of nodes, called
neurons or units. The input layer corresponds to the input features and the output layer
produces the classification result. Each layer has an activation function, which produces the
output of the neurons in the layer.
Every layer represents a data transformation, and these transformations are learned by exposure to input data. They are stored in the layer’s weights, a set of numbers that parameterize the transformation in each layer. The goal of a neural network is to find the weight values in all the layers that, together, best map the inputs to their associated targets.
Initially, the weights of the network are set randomly, and they are updated with each example that the network processes. The process by which the network finds the best weight values is called training. It consists of feeding the network with the input data as many times as wanted, updating the weights in each iteration, with the aim of finding the weight values that minimize the difference between the output value and the expected one.
To measure the difference between the network’s output and the expected output, the network uses a loss function. This function gives a score that is used as feedback to adjust the weight values so as to lower the loss score. This adjustment is made by the optimizer, which implements the backpropagation algorithm.
To summarize, a network is composed of different layers that are chained together. Given input data, it returns a prediction for its target. The predictions are compared with the real targets using the loss function. The resulting value, the loss score, is used by the optimizer to update the network’s weights. The stack of layers is called the model.
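The training loop summarized above can be illustrated with the smallest possible "network": a single linear unit trained by gradient descent. This is a sketch under stated assumptions; the mean squared error loss, the learning rate, the zero initialization and the toy data are chosen for illustration, and with a single unit the gradient is written by hand, whereas in a multi-layer network it would be obtained by backpropagation.

```python
def train_linear_unit(inputs, targets, epochs=500, learning_rate=0.05):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # fixed start for reproducibility; real networks start randomly
    n = len(inputs)
    history = []
    for _ in range(epochs):
        predictions = [w * x + b for x in inputs]
        errors = [p - t for p, t in zip(predictions, targets)]
        loss = sum(e * e for e in errors) / n              # loss score
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, inputs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w                        # optimizer step
        b -= learning_rate * grad_b
        history.append(loss)
    return w, b, history


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, history = train_linear_unit(xs, ys)
print(w, b, history[0], history[-1])
```

The loss recorded in `history` decreases from one iteration to the next, and the weights approach the values that generated the data.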
2.2.2. Data representation for neural networks
In the previous section on Scikit-learn, numpy arrays were presented. They can also be called tensors. Most current machine learning systems use tensors as their basic data structure. They generally contain numerical data.
The following sections describe two different types of tensors that can be used in machine learning algorithms.
2.2.2.1. Vectors
Vectors are arrays of numbers, also called 1D tensors. A 1D tensor has exactly one axis. It is important to understand the difference between axis and dimension: dimensionality can denote either the number of entries along a specific axis or the number of axes in a tensor. For example, a 3D vector has only one axis with three entries along it, whereas a 3D tensor has three axes and can have any number of entries along each axis.
2.2.2.2. Matrices
Matrices are arrays of vectors, also called 2D tensors. A matrix has two axes: the entries along the first one are called rows and the entries along the second one are called columns. Packing 2D tensors in an array yields a 3D tensor, and so on.
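The difference between the number of axes and the number of entries along each axis can be checked with a small helper. The sketch below uses nested Python lists and a hypothetical `shape` function; numpy arrays expose the same information through their `ndim` and `shape` attributes.

```python
def shape(tensor):
    """Return the shape of a regular nested list, one entry per axis."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(dims)


vector = [12, 3, 6]                  # 1D tensor: one axis, three entries
matrix = [[1, 2, 3], [4, 5, 6]]      # 2D tensor: two rows, three columns
cube = [matrix, matrix]              # 3D tensor: an array of matrices

for name, tensor in [("vector", vector), ("matrix", matrix), ("cube", cube)]:
    print(name, "axes:", len(shape(tensor)), "shape:", shape(tensor))
# vector axes: 1 shape: (3,)
# matrix axes: 2 shape: (2, 3)
# cube axes: 3 shape: (2, 2, 3)
```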
2.2.3. Overfitting
As stated in the section Evaluating machine learning models, overfitting occurs when the model begins to learn patterns that are specific to the training data but have no relevance for data it has never seen before. The performance of the model on new data then begins to worsen. This happens in every machine learning problem.
To deal with overfitting, the best solution is to get more training data. A model trained on more data will naturally generalize better. Some other techniques can also be applied to fight overfitting.
2.2.3.1. Dropout
Dropout is one of the most common and effective techniques to deal with overfitting in neural networks. It is applied to a layer and consists of randomly setting to zero a number of the layer’s output features during training.
The dropout rate is the fraction of features that are zeroed out; it is normally set between 0.2 and 0.5. At test time, no units are dropped out.
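A minimal sketch of dropout in plain Python is given below. The function name and arguments are hypothetical. The rescaling of the kept features by 1 / (1 - rate) is not described in the text above; it is the "inverted dropout" convention, which the Keras Dropout layer also follows, so that the expected sum of the features stays the same during training and testing.

```python
import random


def dropout(features, rate=0.5, training=True, rng=None):
    """Zero each feature with probability `rate`, during training only."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(features)            # at test time no units are dropped
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)      # assumption: inverted dropout scaling
    return [0.0 if rng.random() < rate else value * keep_scale
            for value in features]


layer_output = [0.2, 0.5, 1.3, 0.8, 1.1]
print(dropout(layer_output, rate=0.4, rng=random.Random(0)))  # some entries zeroed
print(dropout(layer_output, rate=0.4, training=False))        # unchanged
```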
2.2.4. Keras
Keras is a deep learning framework for Python that provides a way to define and train almost any kind of deep learning model. It is used for the development of the project.
It contains different modules to create new deep learning models, such as neural layers, optimizers and activation functions.
Keras relies on a back-end engine such as TensorFlow for its low-level operations. TensorFlow is a machine learning framework created by Google, used to design, build and train deep learning models. It also provides a library for complex numerical computations.
2.3. Deep learning for text sequences
This section describes how deep learning models can process text and sequence data. The two main deep learning algorithms for this purpose are recurrent neural networks and 1D convnets, which are presented in the following sections.
2.3.1. Preparing the text data
In most cases, text data has to be stored and modified before it is fed into the neural network. Some of the Python libraries used in the project are the following:
- Pandas DataFrame: a two-dimensional data structure with labeled axes (rows and columns). It has three main components: data, rows and columns. A Pandas DataFrame can be created from existing datasets stored in SQL databases, CSV files, Excel files, etc. An example can be observed in the figure below:
- Regular expressions: a notation to define and search for patterns in text. They can be used for text processing tasks such as translating characters, matching sequences of characters, substituting words, or counting them. Python has a module that provides support for regular expressions.
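As an illustration of the regular-expression bullet above, the sketch below uses the standard `re` module of Python to normalize a raw tweet. The cleaning steps, the placeholder tokens `URL` and `@USER` and the function name are assumptions for the example, not the pre-processing pipeline of the project.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_tweet(text):
    """Apply a few hypothetical normalization steps to a raw tweet."""
    text = text.lower()
    text = URL_PATTERN.sub("URL", text)          # substitute links
    text = MENTION_PATTERN.sub("@USER", text)    # substitute user mentions
    text = WHITESPACE_PATTERN.sub(" ", text)     # collapse repeated whitespace
    return text.strip()


print(clean_tweet("@Someone  look at THIS https://example.com/page  now"))
# @USER look at this URL now
```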
2.3.2. Working with text data
Like all other neural networks, these models only work with numeric tensors, so the text has to be transformed into numeric tensors. This process is called vectorization. There are several ways to implement it:
- Split the text into words and transform each word into a vector.
- Split the text into characters and transform each character into a vector.
- Extract N-grams of words or characters, and transform each N-gram into a vector. N-grams are groups of multiple consecutive words or characters.
The units into which the text is split (characters, words or N-grams) are called tokens, and the process of breaking the text down into such tokens is called tokenization. Two main techniques to associate a vector with each token are one-hot encoding and word embeddings.
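The three ways of splitting a text listed above can be sketched as follows. The function names and the example sentence are illustrative assumptions, and splitting on whitespace is the simplest possible word tokenizer.

```python
def word_tokens(text):
    """Split a text into lowercase word tokens on whitespace."""
    return text.lower().split()


def char_tokens(text):
    """Split a text into character tokens."""
    return list(text)


def ngrams(tokens, n):
    """Return all groups of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = word_tokens("The cat sat on the mat")
print(words)             # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ngrams(words, 2))  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
print(ngrams(char_tokens("cat"), 2))  # [('c', 'a'), ('a', 't')]
```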
2.3.2.1. One-hot encoding
One-hot encoding is the most common and basic way to transform a token into a vector. It uses a vocabulary, which is the set of unique words in the whole corpus. It associates a unique integer index with every word in the vocabulary and then turns this integer index into a binary vector of size N (the size of the vocabulary). The vector representing the token with index i is all zeros except for the i-th position, which has value one.
Figure 13 shows an example of one-hot encoding. The corpus contains two sentences, sentence1 and sentence2. The token_index represents the vocabulary, which associates each unique word in the corpus with an integer. Finally, the binary vector representation of both sentences is shown.
Figure 13: One-hot encoding example.
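A word-level one-hot encoding along these lines can be sketched in plain Python. The two sentences below are invented for the illustration and are not necessarily those of the original figure; the helper names are also assumptions.

```python
def build_token_index(sentences):
    """Associate each unique word in the corpus with an integer index."""
    token_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in token_index:
                token_index[word] = len(token_index)
    return token_index


def one_hot(word, token_index):
    """Binary vector of vocabulary size with a single one at the word's index."""
    vector = [0] * len(token_index)
    vector[token_index[word]] = 1
    return vector


# Hypothetical two-sentence corpus.
sentence1 = "the cat sat"
sentence2 = "the dog sat down"
token_index = build_token_index([sentence1, sentence2])
print(token_index)  # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}
print([one_hot(w, token_index) for w in sentence1.split()])
# [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
```

Even in this tiny corpus, each vector has as many positions as there are unique words, which is why one-hot vectors become very large and sparse on real data.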
2.3.2.2. Word embeddings
Another powerful way to associate a vector with a token is to use dense word vectors, also called word embeddings. The vectors resulting from one-hot encoding are binary, mostly made of zeros, and very high-dimensional (their size is the number of unique words in the corpus). Word embeddings are low-dimensional floating-point vectors, which are more useful because they pack more information into fewer dimensions.
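Conceptually, an embedding is a lookup table with one dense row per token index. The sketch below only shows the lookup and the difference in size with respect to one-hot vectors: the table is filled with random values, whereas real embeddings are learned during training or loaded from pre-trained vectors. The vocabulary size, the embedding dimension and the function names are assumptions.

```python
import random


def build_embedding_table(vocabulary_size, embedding_dim, seed=0):
    """Random dense vectors, one row per token index (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocabulary_size)]


def embed(token_indices, table):
    """Look up the dense vector of each token index in a sequence."""
    return [table[i] for i in token_indices]


# A one-hot vector would need 10000 positions per token; here each token uses 8.
table = build_embedding_table(vocabulary_size=10000, embedding_dim=8)
sequence = embed([3, 17, 256], table)
print(len(sequence), len(sequence[0]))  # 3 8
```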
2.3.3. Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a type of neural network that processes sequences by iterating through the elements of the sequence and keeping a state that contains information about what the network has seen so far. This state is updated each time the network processes an element of the input sequence. To simplify, RNNs are networks that have an internal loop.
Figure 14 shows an example of an RNN, where the output at each timestep is the result of the loop at time t. W and U represent the weight matrices and Input represents the different input features.
Figure 14: Basic example of an RNN.
To implement recurrent neural networks in the project, we will use different Keras layers.
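Before turning to the Keras layers, the internal loop of a basic RNN can be sketched in plain Python. W and U play the role of the weight matrices mentioned above; the bias vector b, the tanh activation, the zero initial state and all the numerical values are assumptions made for the illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def simple_rnn_forward(inputs, W, U, b):
    """Run a basic RNN over a sequence and return the output at every timestep."""
    state = [0.0] * len(b)                 # initial state: all zeros
    outputs = []
    for x_t in inputs:                     # the internal loop over timesteps
        wx = matvec(W, x_t)                # contribution of the current input
        us = matvec(U, state)              # contribution of the previous state
        state = [math.tanh(a + c + d) for a, c, d in zip(wx, us, b)]
        outputs.append(state)
    return outputs


# Hypothetical sizes: 2 input features, 3 state units, 4 timesteps.
W = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
U = [[0.2, 0.0, -0.1], [0.1, 0.1, 0.0], [0.0, -0.3, 0.2]]
b = [0.0, 0.1, -0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs = simple_rnn_forward(sequence, W, U, b)
print(len(outputs), len(outputs[0]))  # 4 3
```

The output at each timestep becomes the state used at the next one, which is how information about earlier elements of the sequence reaches later ones.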
2.3.3.1. SimpleRNN
Keras has a layer that implements a simple recurrent neural network, the SimpleRNN layer.
2.3.3.2. LSTM and GRU
LSTM and GRU are other recurrent layers in Keras. The Long Short-Term Memory (LSTM) layer is a variant of the recurrent layer that adds a way to carry information across many timesteps: information from the sequence can be carried forward and reintroduced at a later timestep. To simplify, it saves information for later.
The figure below details an example of an LSTM. A new data flow is added that carries information across timesteps. This information is combined with the input connection and the recurrent connection, and it affects the state being sent to the next timestep.
Gated recurrent unit (GRU) layers use the same principle as LSTM, but they are more efficient and cheaper to run.
The main characteristic of Convolutional Neural Networks (CNNs) is that they learn local
patterns instead of global patterns. A pattern learned at a certain position in a sentence can
later be recognized at a different position. They have been shown to perform well on Natural
Language Processing (NLP) tasks.
As their name suggests, they apply convolution filters to local features. The filters are
applied to windows of words, which can have different sizes. Applying one filter to one window
of words produces a new feature, and the set of features obtained by applying this filter to
all possible windows of words is called a feature map.
Convolutional networks use pooling operations, such as average and max pooling, to
downsample the feature maps. The idea is to keep, for each feature map, the most important
information (the highest value, or the average of all values).
A simple example of how a CNN works can be observed in the figure below.
For the example in Figure 16, two different filters are applied for each of the window sizes
two and three. Applying the filters to each window size gives two feature maps per window
size. Finally, a max pooling operation is applied and the resulting features are
concatenated to form the final feature vector.
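The procedure in this example (filters over word windows, one feature map per filter, max pooling, concatenation) can be sketched as follows. The word vectors and filter weights are arbitrary toy values, and the ReLU activation is an assumption; the sketch only mirrors the structure of the figure, with two filters for window size two and two for window size three.

```python
def feature_map(sentence, filt):
    """Slide one filter over every window of len(filt) consecutive word vectors."""
    window = len(filt)
    features = []
    for start in range(len(sentence) - window + 1):
        total = 0.0
        for offset in range(window):
            word_vec = sentence[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        features.append(max(0.0, total))  # ReLU activation (assumption)
    return features


def conv_max_pool(sentence, filters):
    """Apply every filter, max-pool each feature map and concatenate the results."""
    return [max(feature_map(sentence, filt)) for filt in filters]


# Toy sentence: 4 words, each represented by a 2-dimensional vector.
sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]
filters = [
    [[1, 0], [0, 1]],          # window size 2
    [[0, 1], [1, 0]],          # window size 2
    [[1, 0], [0, 1], [1, 0]],  # window size 3
    [[0, 1], [0, 1], [0, 1]],  # window size 3
]
final_vector = conv_max_pool(sentence, filters)
```

The final vector has one entry per filter, so its length does not depend on the length of the sentence.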
Another powerful technique for complex problems in deep learning is model ensembling.
Ensembling consists of combining the predictions of a set of different models to produce
better predictions. It is important that every model performs well on its own. Different
models generally look at slightly different aspects of the data, which is why combining them
yields better predictions.
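One simple way to combine predictions, and the one used by the notebook in Appendix IV, is to sum the class probabilities produced by each model and pick the class with the highest total. A minimal sketch in plain Python follows; the probability values are invented for illustration.

```python
def ensemble_predict(model_probs):
    """Sum class probabilities across models and return the winning class per sample.

    model_probs[m][i][c] is the probability that model m assigns to class c
    for sample i.
    """
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    predictions = []
    for i in range(n_samples):
        summed = [sum(probs[i][c] for probs in model_probs) for c in range(n_classes)]
        predictions.append(max(range(n_classes), key=summed.__getitem__))
    return predictions


# Two hypothetical models, two samples, two classes (0 = NOT, 1 = OFF).
model_a = [[0.6, 0.4], [0.8, 0.2]]
model_b = [[0.3, 0.7], [0.55, 0.45]]
print(ensemble_predict([model_a, model_b]))  # [1, 0]
```

For the first sample the two models disagree, and the more confident one decides the outcome.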
3. Methodology / project development:
The following section presents the steps of the project development and a description of the
different trained models.
In previous sections it was mentioned that the project is broken down into three different
subtasks, depending on the level of offense. The three subtasks are described next.
The first subtask, subtask A, consists of classifying the tweets as Offensive or Not
Offensive. Offensive posts include insults, threats, and posts containing any form of
untargeted profanity. The labels for this task are the following:
- NOT, Not Offensive
- OFF, Offensive
In subtask B, the goal is to predict the type of offense, that is, whether the offense is
targeted (at an individual, a group or others) or untargeted. Only posts labeled as Offensive
in subtask A are included in this task. The classes in this task are:
- TIN, Targeted Insult
- UNT, Untargeted
Finally, subtask C aims to classify the target of the offense: for each Targeted Insult post,
the corresponding target must be identified. The possible labels are:
- IND, Individual
- GRP, Group
- OTH, Others
3.2. Dataset
The dataset used in the task OffensEval is the Offensive Language Identification Dataset
(OLID) [3]. This dataset was created specifically for this task. It contains 14,100 English
tweets, 13,240 provided as the training data and 860 as the testing data. Each tweet in the
dataset has an id and the corresponding labels for each subtask. The first two tweets of the
training dataset are shown in Figure 17.
Figure 17: Example of the first two tweets in the training dataset.
Figure 18 provides an analysis of the data distribution in the train and test datasets.
OLID is annotated following the hierarchical three-level annotation schema that takes both the
target and the type of offensive content into account. The dataset was annotated using the
crowdsourcing platform Figure Eight [10].
The training dataset is split into a training set and a validation set. The validation set
comprises 10% of the original training data.
Before feeding the data into the machine learning models, some pre-processing was applied.
Most of the steps were done using the Python libraries mentioned in the section Preparing the
text data.
The pre-processing steps used are the following:
- Cleaning:
Remove all the special characters and all the instances of USER and URL.
- Hashtags:
The hashtags are split into words. For example, #DeepStateCorruption is split into
Deep State Corruption.
- Lowercasing:
Lowercase all the tweets.
- Tokenization:
The text is broken into words and each word is associated with a numeric vector.
- Embeddings:
Pre-trained word embeddings are used. These are word embeddings that were
pre-computed on a different machine learning task than the one we want to solve. There
are many different pre-trained word embeddings. In our project, we use GloVe
pre-trained word vectors [11].
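The cleaning, hashtag and lowercasing steps above can be sketched with the standard `re` module. This is an illustrative approximation rather than the exact code of the project (which is listed in Appendix IV): the regular expressions, the function names and the assumption that mentions appear as `@USER` and links as `URL` are choices made for this sketch.

```python
import re


def split_hashtag(hashtag):
    """Split a CamelCase hashtag body into words; return it unchanged if no split is found."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", hashtag)
    return " ".join(parts) if parts else hashtag


def preprocess(tweet):
    """Apply hashtag splitting, USER/URL removal, lowercasing and cleaning."""
    tweet = re.sub(r"#(\w+)", lambda m: split_hashtag(m.group(1)), tweet)
    tweet = re.sub(r"@USER|\bURL\b", " ", tweet)  # anonymised mentions and links
    tweet = tweet.lower()
    tweet = re.sub(r"[^a-z0-9\s]", "", tweet)     # drop special characters
    return " ".join(tweet.split())                # normalise whitespace


print(preprocess("@USER Stop the #DeepStateCorruption now! URL"))
# stop the deep state corruption now
```

Placeholders are removed before lowercasing so that ordinary words containing "user" or "url" are left untouched.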
3.4. Measurement
The official measure used to evaluate the performance of the different models established in
the competition is macro-averaged F1-score, due to the imbalance between the number of
instances in the different classes in the tasks.
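To see why macro-averaging matters with imbalanced classes, the metric can be computed by hand. The sketch below is a plain-Python illustration (in practice a library routine such as scikit-learn's `f1_score` with `average='macro'` would normally be used); the labels in the example are invented.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# A classifier that always answers NOT on a set with 8 NOT and 2 OFF tweets.
y_true = ["NOT"] * 8 + ["OFF"] * 2
y_pred = ["NOT"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

Here the accuracy is 0.8, but the macro F1 score is only about 0.44, because the minority class OFF contributes an F1 of zero with the same weight as the majority class.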
The initial phase consists of creating the baseline. The first model implemented is based on
Logistic Regression. In this case, our data is encoded using one-hot encoding (see the
example in the section One-hot encoding). To encode the tweets, they are split into words
and each tweet is represented as a vector whose length is the number of unique words in the
whole corpus. This vector is initialized to 0 and then set to 1 at every position whose word
appears in the tweet. This requires an index of all the unique words in the corpus.
To encode the labels, a binary vector was obtained in which the value was '0' in case of 'NOT'
and '1' in case of 'OFF'.
Finally, the result was a binary matrix of size: #tweets x #tokens, and a label vector of size:
#tweets, which were the input data to the Logistic Regression classifier.
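The encoding described above can be sketched in plain Python. The two tweets and the helper names are hypothetical; the resulting matrix and label vector are what would then be passed to a classifier such as scikit-learn's `LogisticRegression`.

```python
def build_index(tweets):
    """Map every unique word in the corpus to a column position."""
    index = {}
    for tweet in tweets:
        for word in tweet.split():
            index.setdefault(word, len(index))
    return index


def encode_tweets(tweets, index):
    """Return a binary matrix of size #tweets x #tokens."""
    matrix = []
    for tweet in tweets:
        row = [0] * len(index)
        for word in tweet.split():
            if word in index:
                row[index[word]] = 1
        matrix.append(row)
    return matrix


def encode_labels(labels):
    """Encode 'NOT' as 0 and 'OFF' as 1."""
    return [1 if label == "OFF" else 0 for label in labels]


# Hypothetical, already pre-processed tweets.
tweets = ["you are great", "you are awful awful"]
index = build_index(tweets)
X = encode_tweets(tweets, index)
y = encode_labels(["NOT", "OFF"])
```

Note that a repeated word still produces a single 1: the encoding records presence, not frequency.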
3.6. Training deep learning models
The next step after creating the baseline is to implement and train different deep learning
models to compare their performance with the baseline. The trained deep learning models are
outlined in this section.
After trying several parameter variations for all the trained models, we observed that the
following settings gave the best results, so we applied the same settings to all the models:
- The maximum length of the tweets is fixed to 200, so the input sequences are padded
in order to make them all have the same length.
- The tweets are encoded using pre-trained word embeddings. GloVe provides pre-trained
word vectors in several dimensions. For our project, we used the 200-dimensional
vectors. This means that each word is represented as a vector with 200 elements.
- Finally, all the models are trained for 10 epochs, that is, 10 full passes over the
training data.
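The padding step in the first item can be illustrated with a small function that mimics the default behaviour of the Keras `pad_sequences` utility used in the notebook: zeros are added at the beginning of short sequences, and long sequences are truncated from the beginning. The function below is a stand-in written for this sketch, not the Keras implementation.

```python
def pad_sequence(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen items."""
    sequence = sequence[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(sequence)) + sequence


print(pad_sequence([5, 3, 8], 5))           # [0, 0, 5, 3, 8]
print(pad_sequence([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

With a maximum length of 200, almost every tweet is shorter than the limit, so most input sequences consist mainly of leading zeros.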
For all the models, an Embedding layer is applied at the beginning, and the final layer consists
of a Dense layer with 2 units and 'softmax' as the activation function. The trained deep
learning models are presented below:
Recurrent Neural Networks (RNNs) perform well on language data because each unit can use
its internal memory to retain information about previous inputs. As mentioned in the section
Recurrent Neural Networks, RNNs contain loops that allow information to be carried from one
timestep to the next while reading an input. This gives the network context from the
beginning of a sentence, which allows more accurate predictions for words at the end of the
sentence.
This model consisted of a simple Recurrent Neural Network layer using 32 units.
Long Short-Term Memory (LSTM) networks are capable of learning long-term dependencies,
remembering information for long periods of time. More detailed information about LSTM is
given in the section LSTM and GRU.
3.6.2.3. LSTM
This model is more complex than the previous one. After the Embedding layer, we add a
Dropout layer with a dropout rate of 0.4. As mentioned in the section Dropout, the goal of a
Dropout layer is to prevent the model from overfitting. It randomly sets a fraction of its
input units to 0 at each update of the training phase.
In this case, the LSTM layer has 196 units, and a Dropout layer with a rate of 0.25 is applied
before the final layer.
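What a Dropout layer does during training can be sketched in plain Python. The sketch assumes the "inverted dropout" convention, in which the surviving values are scaled by 1 / (1 - rate) so that their expected sum is unchanged; the function name and the explicit random generator argument are choices made for this illustration.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, 0.4, rng)
zero_fraction = dropped.count(0.0) / len(dropped)
```

With a rate of 0.4, roughly 40% of the values are zeroed on each call. At prediction time dropout is disabled and the values pass through unchanged.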
This model combines Bidirectional LSTM (BiLSTM) layers with Dropout layers. BiLSTMs are
bidirectional variants of LSTM: the learning algorithm is fed the original data once from
beginning to end and once from end to beginning.
We use 196 LSTM units wrapped by a Bidirectional layer with a dropout rate of 0.25,
followed by another 196 LSTM units wrapped by a Bidirectional layer, also with a dropout
rate of 0.25.
A Flatten layer is applied before the final layer. The goal of the Flatten layer is to reshape
the tensor, collapsing all the dimensions of each sample into a single one.
3.6.3. Deep learning models (II)
In order to improve the performance of our models, we investigated what the teams in last
year's competition (OffensEval 2019) implemented. Most of the top-scoring teams used the
BERT model [9]. BERT is a technique for Natural Language Processing (NLP) that was
open-sourced by researchers at Google AI Language at the end of 2018. Because this technique
is quite new and complex, we set it aside and look at the teams that obtained the best results
without using BERT. These teams used the models presented below, so we implemented
approximations of them in order to compare and evaluate our results.
This model consists of a BiLSTM layer and a BiGRU layer. BiGRUs are Bidirectional Gated
Recurrent Units, which can be considered a variation on the BiLSTM. GRUs are described in
the section LSTM and GRU.
For this model, we use a Bidirectional LSTM layer with 196 units with dropout rate 0.3,
followed by a Bidirectional GRU layer with 64 GRU units also with 0.3 as dropout rate.
Then, Max Pooling and Average Pooling layers are applied and their results are concatenated.
Pooling layers provide an approach to downsampling feature maps by summarizing the
presence of features in the feature map.
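The pooling-and-concatenation step can be sketched in plain Python. The sketch assumes global pooling over the time dimension, which is one common way to summarize the output sequence of a recurrent layer; the input values are invented.

```python
def global_max_avg_pool(sequence):
    """Concatenate the per-feature maximum and average over all timesteps.

    sequence: list of timesteps, each a list with one value per feature.
    """
    n_features = len(sequence[0])
    max_pooled = [max(step[j] for step in sequence) for j in range(n_features)]
    avg_pooled = [sum(step[j] for step in sequence) / len(sequence)
                  for j in range(n_features)]
    return max_pooled + avg_pooled


# Hypothetical recurrent output: 2 timesteps, 2 features each.
pooled = global_max_avg_pool([[1.0, 4.0], [3.0, 2.0]])
print(pooled)  # [3.0, 4.0, 2.0, 3.0]
```

The result has twice as many entries as there are features and no longer depends on the sequence length, so it can be fed directly into a Dense layer.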
This model is based on a Convolutional Neural Network (CNN), presented in the section
Convolutional Neural Networks, with a Global Max Pooling layer.
Before the final layer, a hidden Dense layer is applied with 256 units and 'relu' as the
activation function.
In this model, we use a Dropout layer after the embedding with a dropout rate of 0.3. The
convolutional network uses three different window sizes: 2, 3 and 4. For each window size,
256 filters are used and a Max Pooling layer is then applied to the feature maps. The
resulting vectors are concatenated.
A Dropout layer with dropout rate 0.3 is applied before the input to the final Dense layer with
256 neurons for classification.
3.6.3.4. Ensemble
4. Results
As reported in the previous section Measurement, the evaluation of the final results is done
with the values of accuracy and macro F1 score. The final evaluation is done with the results
of the model using the test dataset.
This section only shows the best results for each subtask. The complete results for all the
models in each subtask are presented in Appendix I: Detailed results of the machine learning
models.
The results obtained using the Logistic Regression classifier in the different subtasks are
displayed in the table below.
4.2.1. Subtask A
The models that perform best on the test dataset for the first subtask are the following:
LSTM, BiLSTM with Dropout and BiLSTM combined with BiGRU. The best one is BiLSTM with
Dropout, which gives a macro F1 score of 0.7739.
Aiming to obtain even better results, we tried to ensemble these three models. After trying
some different combinations of ensembles, the best result was given by ensembling LSTM
and BiLSTM + Dropout.
Model                               Validation dataset              Test dataset
                                    Accuracy    Macro F1 score      Accuracy    Macro F1 score
BiLSTM + Dropout, LSTM (ensemble)   0.7689      0.7446              0.8337      0.7807
Table 3: Results for Ensemble (BiLSTM+Dropout and LSTM) on the train and test dataset, for subtask A.
4.2.2. Subtask B
For subtask B, the models based on RNN, LSTM and CNN combining three different filters
give the best results on the test dataset. The best macro F1 score is 0.6634, obtained by the
CNN with three different filters.
In this case, the best ensemble is the one that combines the three models with the best
performance on the test set: Simple RNN, LSTM and CNN (3 filters). However, this ensemble
reaches a macro F1 score of 0.5953, which is worse than the performance of the CNN (3 filters)
model alone. In conclusion, the best model for subtask B is the CNN with three different
filters.
4.2.3. Subtask C
Finally, for the last subtask, the models that provide the best performance on the test dataset
are LSTM, CNN with a Global Max Pooling layer and CNN combining three different filters.
The second model, CNN with Global Max Pooling, gives the best result and it reaches a
macro F1 score of 0.5943.
By ensembling the two models that performed best, CNN + Global Max Pooling and CNN (3
filters), even better results are obtained.
4.3. Comparison with OffensEval 2019
In last year's competition, most of the best results were obtained using the BERT model [9].
For subtask A, the team that performed best without using BERT used an ensemble of three
different models based on CNN, BiGRU and BiLSTM with Attention. This team was ranked
sixth out of 104 teams, submitting a final macro F1 score of 0.806 [5].
For the second subtask, 76 teams participated and the best teams used rule-based
approaches and ensembling of deep learning (including BERT) and non-neural machine
learning models. The best team reached a macro F1 score of 0.755 [7].
Finally, in subtask C, where 66 teams participated, ensembles were used by five of the top-10
teams. The best team obtained a macro F1 score of 0.660, using the BERT model [6].
The final best results obtained with our models, for each subtask are the following:
Subtask   Model                               Macro F1 score   Position
A         Ensemble (BiLSTM + Dropout, LSTM)   0.7807           24th out of 104 teams
As expected, it is very difficult to reproduce the exact models that the best teams
implemented, since many factors affect model performance. These include the parameters
used in the deep learning models and the pre-processing techniques applied. Some of the
teams in the competition also applied techniques such as normalization and lemmatization to
the initial tweets.
5. Budget
To carry out this project, only an engineer with Python skills and a computer are required.
The cost breakdown for the project is shown in Table 7.
Concept
TOTAL 4,696 €
Table 7: Cost breakdown of the project.
6. Conclusions and future development:
To sum up, we have demonstrated that recurrent and convolutional neural networks have a
good performance working with text data and that ensembling different deep learning models
is a successful technique to solve classification problems.
The most important limitation is how slow the training process of each model can be. To find
the optimal parameters for a model, it has to be trained once for every possible combination
of parameters. As it is impossible to try all the possible combinations, we tried a number of
parameter sets that were expected to perform well.
However, given the number of times we trained our models, we have managed to obtain
satisfactory results that can be compared with the ones obtained by the professional teams
that participated in last year’s competitions. We believe our work could be a starting point for
future implementations.
As future work, we plan to train and test the system with bigger datasets and to implement
more pre-processing techniques, such as normalization of the tokens, conversion of emojis to
text and lemmatization.
Bibliography:
[1] P. Nugues, An Introduction to Language Processing with Perl and Prolog. An Outline of Theories,
Implementation, and Application with Special Consideration of English, French, and German. Series:
Cognitive Technologies, Springer, 2006, ISBN: 3-540-25031-X.
[2] F. Chollet, Deep learning with Python. Manning, 2017, ISBN: 9781617294433.
[3] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar. “Predicting the Type and Target of
Offensive Posts in Social Media”. In Proceedings of NAACL. 2019a.
[4] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar. “SemEval-2019 Task 6: Identifying
and Categorizing Offensive Language in Social Media (OffensEval)”. In Proceedings of The 13th
International Workshop on Semantic Evaluation (SemEval). 2019b.
[5] D. Mahata, H. Zhang, K. Uppal, Y. Kumar, R. Ratn Shah, S. Shahid, L. Mehnaz, S. Anand. “MIDAS at
SemEval-2019 Task 6: Identifying offensive posts and targeted offense from Twitter”. In Proceedings of
The 13th International Workshop on Semantic Evaluation (SemEval). 2019.
[6] A. Nikolov and V. Radivchev. “NikolovRadivchev at SemEval-2019 Task 6: Offensive tweet classification
with BERT and ensembles”. In Proceedings of The 13th International Workshop on Semantic Evaluation
(SemEval). 2019.
[7] J. Han, X. Liu, S. Wu. “jhan014 at SemEval-2019 Task 6: Identifying and categorizing offensive language
in social media”. In Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval).
2019.
[8] Y. Kim. “Convolutional neural networks for sentence classification”. In Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
[9] J. Devlin, M. Chang, K. Lee, K. Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding”. 2018.
[10] Figure Eight, The Essential High-Quality Data Annotation Platform. [Online] Available:
https://www.figure-eight.com/
[11] J. Pennington, R. Socher, C.D. Manning. “Glove: Global vectors for word representation”. In EMNLP. 2014.
Appendices:
Subtask A:
Figure 19: Confusion matrix of Logistic Regression on the test dataset, for subtask A.
Subtask B:
Figure 20: Confusion matrix of Logistic Regression on the test dataset, for subtask B.
Subtask C:
Figure 21: Confusion matrix of Logistic Regression on the test dataset, for subtask C.
Results for deep learning models
Subtask A:
Model                               Validation dataset              Test dataset
                                    Accuracy    Macro F1 score      Accuracy    Macro F1 score
BiLSTM + Dropout, LSTM (ensemble)   0.7689      0.7446              0.8337      0.7807
Table 12: Results for Ensemble (BiLSTM+Dropout and LSTM) on the train and test dataset, for subtask A.
Figure 22: Confusion matrix of Ensemble (BiLSTM+Dropout and LSTM models) on the test dataset, for subtask A.
Subtask B:
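Model (ensemble)                            Validation dataset              Test dataset
                                            Accuracy    Macro F1 score      Accuracy    Macro F1 score
Simple RNN, Simple LSTM, CNN (3 filters)    0.9131      0.5382              0.8792      0.5953
Simple RNN, CNN (3 filters)                 0.8860      0.5228              0.8625      0.5785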
Figure 23: Confusion matrix of CNN (3 filters) on the test dataset, for subtask B.
Subtask C:
Figure 24: Confusion matrix of Ensemble (CNN+GlobalMaxPooling and CNN (3 filters)) on the test dataset, for
subtask C.
Appendix II: Previous project
This thesis is the continuation of a previous project carried out in the Project in Language
Technology course at Lund University. The final report for this course is included in this
appendix. Notice that the paper is based on subtask A and that the results reported are
slightly different than the ones obtained in the thesis development.
Identifying and Categorizing Offensive Language in Social Media
(OffensEval)
Appendix III: Installation guide
The baseline of the project consists of implementing a Logistic Regression classifier, which
corresponds to the file logistic_regression.py. The rest of the project consists of implementing
different deep learning models, which corresponds to the jupyter
notebook deep_learning.ipynb.
Getting started
- Logistic Regression:
To run the program it is necessary to install the sklearn Python module and install the
Anaconda distribution.
Prerequisites
Both files are programmed with Python 3.
- Logistic Regression:
It is necessary to install:
o Sklearn Python module.
o Anaconda distribution.
Running the program
- Logistic Regression:
Run the Python file. Note that it is necessary to set the paths for all the needed files.
Example of execution:
The steps to train and evaluate one deep learning model for one subtask are the following. As
an example, we show how to try the model CNN (3 filters) for subtask B.
a) Run all the cells in the section 1. First steps.
b) Run all the cells in the section 2. Setting the environment.
c) Run all the cells in the section 3. Initializing deep learning models.
d) Run the cells in the section 5. SUBTASK B. Note that in the second cell of the section
you need to uncomment the line of the model you want to try, in this case the line
that calls the function initialize_CNN_3().
As the previous example shows, sections 1-3 always need to be run first in order to
initialize all the required functions.
Appendix IV: Code
The code used for the project is detailed in this appendix and it is organized as follows:
Identifying and Categorizing Offensive Language in tweets using Deep Learning
models
1. First steps
The next cell imports the keras module, which is essential to run the code.
import keras
keras.__version__
For the project, the code was run using Google Colab. Running the code on Colab is recommended because of the high computational cost of
the program. The drive directory where the dataset is stored must then be mounted.
from google.colab import drive
drive.mount('/content/drive')
The next cell imports all the necessary modules to run the program.
# IMPORT MODULES
import numpy as np
import re
import pandas as pd
from keras.models import Sequential, Model
from keras.layers import Dense, SimpleRNN, Embedding, concatenate, Input, LSTM, SpatialDropout1D, Dropout, GRU, Bidirectional, Flatten
from keras.utils import plot_model
from keras.utils.np_utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
In the cell below, all the necessary constants of the program are initialized. NOTE that the file paths should be set.
# INITIALIZATION
train_file = 'drive/My Drive/Dataset/olid-training-v1.0.tsv' # set corresponding path for the train dataset file
test_file_a = 'drive/My Drive/Dataset/testset-levela.tsv' # set corresponding path for the test dataset file
test_file_b = 'drive/My Drive/Dataset/testset-levelb.tsv' # set corresponding path for the test dataset file
test_file_c = 'drive/My Drive/Dataset/testset-levelc.tsv' # set corresponding path for the test dataset file
test_labels_a = 'drive/My Drive/Dataset/labels-levela.csv' # set corresponding path for the labels of the test dataset file
test_labels_b = 'drive/My Drive/Dataset/labels-levelb.csv' # set corresponding path for the labels of the test dataset file
test_labels_c = 'drive/My Drive/Dataset/labels-levelc.csv' # set corresponding path for the labels of the test dataset file
maxlen = 200
max_fatures = 10000
# GloVe Pretrained Embeddings
embedding_dim = 200
models_a = list()
models_b = list()
models_c = list()
ensemble_models_a = list()
ensemble_models_b = list()
ensemble_models_c = list()
In this section you can find all the functions that have to be initialized before training the models.
Data Preprocessing
The data preprocessing is applied at the cell below. The techniques applied are:
Remove all the special characters and instances of USER and URL.
Split the hashtag into words.
Lowercase all the tweets.
Tokenization.
NOTE that in this cell, the train dataset is split into a train and a validation dataset. The validation dataset comprises 10% of the original train dataset.
# UTIL FUNCTIONS
# SPLIT HASHTAG INTO WORDS UTIL FUNCTIONS
def replace_hashtag(tweet):
    # find_hashtag returns the hashtag bodies without the leading '#'
    hashtags = find_hashtag(tweet)
    for hashtag in hashtags:
        split = split_hashtag(hashtag)
        tweet = tweet.replace(hashtag, split)
    tweet = tweet.replace('#', '')
    return tweet

def find_hashtag(tweet):
    hashtags = re.findall(r"#(\w+)", tweet)
    return hashtags

def split_hashtag(hashtag):
    # split on acronyms and capitalized words, e.g. DeepState -> 'Deep State '
    fo = re.compile(r'#[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
    fi = fo.findall(hashtag)
    if not fi:
        # no capitalized word found (e.g. an all-lowercase hashtag): keep the text as it is
        return hashtag
    result = ''
    for var in fi:
        result += var + ' '
    return result

# DATA PREPROCESSING STEPS
def data_preprocessing(data):
    # the DataFrame is modified in place
    for i, j in data.iterrows():
        data.at[i, 'tweet'] = replace_hashtag(j['tweet'])  # split hashtags into words
    data['tweet'] = data['tweet'].apply(lambda x: x.lower())  # lowercase
    data['tweet'] = data['tweet'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))  # remove special characters
    data['tweet'] = data['tweet'].str.replace('user', '')  # remove 'user' tokens
    data['tweet'] = data['tweet'].str.replace('url', '')  # remove 'url' tokens
# TRAINING DATA (FOR ALL THE SUBTASKS)
def import_data(train_file, test_file, test_labels, subtask_name):
    # IMPORT DATASET
    data = pd.read_csv(train_file, sep='\t', header=0)
    data = data[['id', 'tweet', 'subtask_a', 'subtask_b', 'subtask_c']]
    # DATA PREPROCESSING
    data_preprocessing(data)
    # TOKENIZER
    tokenizer = Tokenizer(num_words=max_fatures, split=' ')
    tokenizer.fit_on_texts(data['tweet'].values)
    X = tokenizer.texts_to_sequences(data['tweet'].values)
    X = pad_sequences(X, maxlen=maxlen)
    # REAL TEST DATASET
    data_test = pd.read_csv(test_file, sep='\t', header=0)
    # DATA PREPROCESSING (same steps as for the training data)
    data_preprocessing(data_test)
    labels_test = pd.read_csv(test_labels, sep=',', header=0)
    labels_test = labels_test[['id', subtask_name]]
    data_test = pd.merge(data_test, labels_test, on='id')
    Y = pd.get_dummies(data[subtask_name]).values
    # Testing with validation dataset
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10, random_state=42)
    # Testing with original test dataset
    X_test_real = tokenizer.texts_to_sequences(data_test['tweet'].values)
    X_test_real = pad_sequences(X_test_real, maxlen=maxlen)
    Y_test_real = pd.get_dummies(data_test[subtask_name]).values
    return X_train, X_test, Y_train, Y_test, X_test_real, Y_test_real, tokenizer
In the next cell, the pre-trained word embeddings are loaded, and the embedding matrix is obtained. NOTE that the file path for the embeddings
has to be set.
# PRETRAINED EMBEDDINGS
def pretrained_embeddings(tokenizer):
    #embedding_file = open('drive/My Drive/embeddings/glove.6B.100d.txt') # 100-dimensional pre-trained word embeddings - set the file path
    embedding_file = open('drive/My Drive/embeddings/glove.6B.200d.txt') # 200-dimensional pre-trained word embeddings - set the file path
    embeddings_index = {}
    for line in embedding_file:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    embedding_file.close()
    print('Found %s word vectors.' % len(embeddings_index))
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    embedding_matrix = np.zeros((max_fatures, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if i < max_fatures:
            if embedding_vector is not None:
                # Words not found in embedding index will be all-zeros.
                embedding_matrix[i] = embedding_vector
    return embedding_matrix
The next functions predict and plot the results of the prediction for the models using the validation and the test dataset.
def prediction_and_results(model, X_test, Y_test, X_test_real, Y_test_real):
    # PREDICTION FOR ONE MODEL
    # model_name and batch_size are read from the global scope
    # PREDICTION USING VALIDATION DATASET
    print('Model:', model_name)
    print('Model prediction with validation dataset:')
    # Prediction for class Model
    Y_pred = model.predict(X_test, batch_size=batch_size)
    Y_pred = np.argmax(Y_pred, axis=1)
    df_test = pd.DataFrame({'true': Y_test.tolist(), 'pred': Y_pred})
    df_test['true'] = df_test['true'].apply(lambda x: np.argmax(x))
    print("Confusion matrix:", confusion_matrix(df_test.true, df_test.pred))
    print(classification_report(df_test.true, df_test.pred, digits=4))
    # PREDICTION USING REAL TEST DATASET
    print('Model:', model_name)
    print('Model prediction with test dataset:')
    # Prediction for class Model
    Y_pred_real = model.predict(X_test_real, batch_size=batch_size)
    Y_pred_real = np.argmax(Y_pred_real, axis=1)
    df_test_real = pd.DataFrame({'true': Y_test_real.tolist(), 'pred': Y_pred_real})
    df_test_real['true'] = df_test_real['true'].apply(lambda x: np.argmax(x))
    print("Confusion matrix", confusion_matrix(df_test_real.true, df_test_real.pred))
    print(classification_report(df_test_real.true, df_test_real.pred, digits=4))
def prediction_and_results_ensemble(ensemble_model, X_test, Y_test, X_test_real, Y_test_real):
    # PREDICTION FOR ENSEMBLE MODELS
    # batch_size is read from the global scope
    # PREDICTION USING VALIDATION DATASET
    y_combine = [model.predict(X_test, batch_size=batch_size) for model in ensemble_model]
    y_combine = np.array(y_combine)
    # sum across ensembles
    summed = np.sum(y_combine, axis=0)
    # argmax across classes
    Y_pred = np.argmax(summed, axis=1)
    df_test = pd.DataFrame({'true': Y_test.tolist(), 'pred': Y_pred})
    df_test['true'] = df_test['true'].apply(lambda x: np.argmax(x))
    print("Confusion matrix:", confusion_matrix(df_test.true, df_test.pred))
    print(classification_report(df_test.true, df_test.pred, digits=4))
    # PREDICTION USING REAL TEST DATASET
    y_combine = [model.predict(X_test_real, batch_size=batch_size) for model in ensemble_model]
    y_combine = np.array(y_combine)
    # sum across ensembles
    summed = np.sum(y_combine, axis=0)
    # argmax across classes
    Y_pred_real = np.argmax(summed, axis=1)
    df_test_real = pd.DataFrame({'true': Y_test_real.tolist(), 'pred': Y_pred_real})
    df_test_real['true'] = df_test_real['true'].apply(lambda x: np.argmax(x))
    print("Confusion matrix", confusion_matrix(df_test_real.true, df_test_real.pred))
    print(classification_report(df_test_real.true, df_test_real.pred, digits=4))
The following cells are divided into the different functions that initialize the deep learning models (each cell initializes a different model).
SimpleRNN
def initialize_SimpleRNN(output):
    # Simple RNN
    model_name = 'Simple RNN'
    model = Sequential()
    #model.add(Embedding(max_fatures, embedding_dim,input_length = maxlen)) # without pretrained embeddings
    model.add(Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)) # with pretrained embeddings
    model.add(SimpleRNN(32))
    model.add(Dense(output, activation='softmax'))
    return model, model_name
Simple LSTM
def initialize_SimpleLSTM(output):
    # Simple LSTM
    model_name = 'Simple LSTM'
    model = Sequential()
    #model.add(Embedding(max_features, 32)) # without pretrained embeddings
    model.add(Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)) # with pretrained embeddings
    model.add(LSTM(32))
    model.add(Dense(output, activation='softmax'))
    return model, model_name
LSTM
def initialize_LSTM(output):
    # LSTM
    model_name = 'LSTM'
    lstm_out = 196
    model = Sequential()
    #model.add(Embedding(max_fatures, embedding_dim,input_length = maxlen)) # without pretrained embeddings
    model.add(Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)) # with pretrained embeddings
    model.add(SpatialDropout1D(0.4))
    model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(output, activation='softmax'))
    return model, model_name
BiLSTM + DROPOUT
def initialize_BiLSTM_Dropout(output):
    # BILSTM + DROPOUT
    model_name = 'BILSTM + DROPOUT'
    lstm_out = 196
    model = Sequential()
    #model.add(Embedding(max_fatures, embedding_dim, input_length = maxlen, trainable = True)) # without pretrained embeddings
    model.add(Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)) # with pretrained embeddings
    model.add(Dropout(0.25))
    model.add(Bidirectional(LSTM(lstm_out, return_sequences=True, recurrent_dropout=0.25)))
    model.add(Dropout(0.25))
    model.add(Bidirectional(LSTM(lstm_out, return_sequences=True, recurrent_dropout=0.25)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(output, activation='softmax'))
    return model, model_name
CNN + Global Max Pooling
def initialize_CNN_GMP(output):
    # CNN + GLOBAL MAX POOLING
    model_name = 'CNN + GLOBAL MAX POOLING'
    model = Sequential()
    #model.add(Embedding(max_fatures, embedding_dim, input_length = maxlen, trainable = False)) # without pretrained embeddings
    model.add(Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)) # with pretrained embeddings
    model.add(Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(256, activation='relu'))
    model.add(Dense(output, activation='softmax'))
    return model, model_name
CNN (3 filters)
def initialize_CNN_3(output):
    # CNN (3 filters)
    model_name = 'CNN'
    embedding_dim = 200
    i = Input(shape=(maxlen,), dtype='int32', name='main_input')
    x = Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)(i)
    x = Dropout(0.4)(x)

    def get_conv_pool(x_input, max_len, sufix, n_grams=[2,3,4], feature_maps=256):
        # One convolution + max-pooling branch per n-gram size
        branches = []
        for n in n_grams:
            branch = Conv1D(filters=feature_maps, kernel_size=n, activation='relu', name='Conv_'+sufix+'_'+str(n))(x_input)
            branch = MaxPooling1D(pool_size=max_len-n+1, strides=None, padding='valid', name='MaxPooling_'+sufix+'_'+str(n))(branch)
            branch = Flatten(name='Flatten_'+sufix+'_'+str(n))(branch)
            branches.append(branch)
        return branches

    branches = get_conv_pool(x, maxlen, 'dynamic')
    z = concatenate(branches, axis=-1)
    z1 = Dropout(0.3)(z)
    z2 = Dense(256, activation='relu')(z1)
    # use the requested number of output classes instead of a hard-coded 2
    o = Dense(output, activation='softmax')(z2)
    model = Model(inputs=i, outputs=o)
    return model, model_name
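In `get_conv_pool`, each branch pools over `max_len - n + 1` positions. With the default `'valid'` padding and stride 1, a kernel of size `n` produces exactly that many positions on a sequence of length `max_len`, so every branch reduces to a single value per feature map. The arithmetic can be checked with a short sketch. The sequence length of 50 is an assumption for illustration only (the notebook sets `maxlen` elsewhere), and the two helper functions are hypothetical.

```python
def conv1d_valid_length(seq_len, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (seq_len - kernel_size) // stride + 1


def pooled_length(conv_len, pool_size):
    """Output length of 'valid' max pooling whose stride equals pool_size."""
    return (conv_len - pool_size) // pool_size + 1


seq_len = 50          # assumed sequence length, for illustration only
feature_maps = 256    # same value as the default in get_conv_pool
n_grams = [2, 3, 4]

total_features = 0
for n in n_grams:
    conv_len = conv1d_valid_length(seq_len, n)
    pool_len = pooled_length(conv_len, pool_size=seq_len - n + 1)
    print(f"kernel={n}: conv length={conv_len}, pooled length={pool_len}")
    total_features += pool_len * feature_maps

print("concatenated feature vector size:", total_features)
```

Under these assumptions the concatenated vector has one entry per feature map and branch, regardless of the sequence length.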
BiLSTM + BiGRU
def initialize_BiLSTM_BiGRU(output):
    # BILSTM + BIGRU
    model_name = 'BILSTM + BIGRU'
    lstm_units = 196
    gru_units = 64
    i = Input(shape=(maxlen,), dtype='int32', name='main_input')
    x = Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)(i)
    x = Dropout(0.4)(x)
    x1 = Bidirectional(LSTM(lstm_units, return_sequences=True, recurrent_dropout=0.3))(x)
    x2 = Dropout(0.3)(x1)
    x3 = Bidirectional(GRU(gru_units, return_sequences=True))(x2)
    x4 = Dropout(0.3)(x3)
    max_pooling = MaxPooling1D()(x4)
    max_pooling = Flatten()(max_pooling)
    average_pooling = AveragePooling1D()(x4)
    average_pooling = Flatten()(average_pooling)
    z1 = concatenate([max_pooling, average_pooling], axis=-1)
    z2 = Dense(128, activation='relu')(z1)
    # use the requested number of output classes instead of a hard-coded 2
    o = Dense(output, activation='softmax')(z2)
    model = Model(inputs=i, outputs=o)
    return model, model_name
4. SUBTASK A
This section contains the code that implements Subtask A.
# SUBTASK A
# IMPORT DATA
print('IMPORT DATA...')
X_train, X_test_a, Y_train_a, Y_test_a, X_test_real_a, Y_test_real_a, tokenizer = import_data(train_file, test_file_a, test_label
embedding_matrix = pretrained_embeddings(tokenizer)
The following cell initializes, compiles, plots and trains the model. Only one model can be tried at a time.
NOTE: comment out the lines that initialize the models you are not using, and uncomment the line for the model you want to train.
# INITIALIZE MODELS
#model, model_name = initialize_SimpleRNN(2)
#model, model_name = initialize_SimpleLSTM(2)
#model, model_name = initialize_LSTM(2)
#model, model_name = initialize_BiLSTM_Dropout(2)
#model, model_name = initialize_CNN_GMP(2)
#model, model_name = initialize_CNN_3(2)
#model, model_name = initialize_BiLSTM_BiGRU(2)
# COMPILE MODEL
print('COMPILE THE MODEL...')
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
# MODEL SCHEME
print('PLOT THE MODEL...')
print('Model:', model_name)
print(model.summary())
plot_model(model, show_shapes=True, show_layer_names=True)
# SUBTASK A: FIT THE MODEL
print('FIT THE MODEL...')
print('Model:', model_name)
batch_size = 128
model.fit(X_train, Y_train_a, epochs = 10, batch_size=batch_size)
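As an alternative to commenting and uncommenting initializer lines, the initializer can be looked up by name. The following sketch is only an illustration and is not part of the notebook: the stub initializers and the names `MODEL_INITIALIZERS` and `build_model` are hypothetical. In the notebook, the dictionary values would be the `initialize_*` functions defined earlier.

```python
# Stand-in initializers so the sketch runs on its own. In the notebook these
# would be the real initialize_* functions, which return (model, model_name).
def stub_simple_rnn(output):
    return {"units": 32, "output": output}, "Simple RNN"


def stub_lstm(output):
    return {"units": 196, "output": output}, "LSTM"


# Hypothetical registry that maps a short key to an initializer
MODEL_INITIALIZERS = {
    "simple_rnn": stub_simple_rnn,
    "lstm": stub_lstm,
}


def build_model(key, output):
    """Return (model, model_name) for the requested key."""
    try:
        initializer = MODEL_INITIALIZERS[key]
    except KeyError:
        raise ValueError(
            f"Unknown model key {key!r}; choose from {sorted(MODEL_INITIALIZERS)}"
        ) from None
    return initializer(output)


model, model_name = build_model("lstm", 2)
print("Model:", model_name)
```

With this layout, trying a different architecture means changing one string, and a misspelled key fails with an explicit error.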
When training several deep learning models, it is useful to keep them; in this case we store them in a list. The list is also used later to implement model ensembling.
# SAVE THE MODEL TO A LIST
models_a.append(model) # save all the trained models
ensemble_models_a.append(model) # save models for later ensembling
The following cell executes the prediction and evaluation for the model that was trained last, which is saved in the variable model.
# PREDICTION FOR ONE MODEL
print('PREDICT FOR ONE MODEL...')
prediction_and_results(model, X_test_a, Y_test_a, X_test_real_a, Y_test_real_a)
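The thesis reports macro-averaged F1, which `classification_report` prints in its `macro avg` row. As a sketch of what that number means, the following plain-Python function computes the F1 score of each class from the true and predicted labels and averages them with equal weight. The function name and the labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels: four samples of class 0 and two of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class counts equally, a model that neglects the minority class is penalized even when its accuracy is high.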
The next cell only needs to be run if you want to try model ensembling. To ensemble different deep learning models, train them separately and save each one in the list ensemble_models_a.
# PREDICTION FOR ENSEMBLES
print('PREDICT FOR ENSEMBLE...')
prediction_and_results_ensemble(ensemble_models_a, X_test_a, Y_test_a, X_test_real_a, Y_test_real_a)
5. SUBTASK B
This section contains the code that implements Subtask B.
# SUBTASK B
# IMPORT DATA
print('IMPORT DATA...')
X_train, X_test_b, Y_train_b, Y_test_b, X_test_real_b, Y_test_real_b, tokenizer = import_data(train_file, test_file_b, test_label
embedding_matrix = pretrained_embeddings(tokenizer)
The following cell initializes, compiles, plots and trains the model. Only one model can be tried at a time.
NOTE: comment out the lines that initialize the models you are not using, and uncomment the line for the model you want to train.
# INITIALIZE MODELS
#model, model_name = initialize_SimpleRNN(2)
#model, model_name = initialize_SimpleLSTM(2)
#model, model_name = initialize_LSTM(2)
#model, model_name = initialize_BiLSTM_Dropout(2)
#model, model_name = initialize_CNN_GMP(2)
#model, model_name = initialize_CNN_3(2)
#model, model_name = initialize_BiLSTM_BiGRU(2)
# COMPILE MODEL
print('COMPILE THE MODEL...')
model.add(Dense(2, activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
# MODEL SCHEME
print('PLOT THE MODEL...')
print('Model:', model_name)
print(model.summary())
plot_model(model, show_shapes=True, show_layer_names=True)
# SUBTASK B: FIT THE MODEL
print('FIT THE MODEL...')
print('Model:', model_name)
batch_size = 128
model.fit(X_train, Y_train_b, epochs = 10, batch_size=batch_size)
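The models are compiled with `categorical_crossentropy`, which expects one-hot label vectors, and the evaluation helpers recover the class index by taking an argmax over each label vector. The following plain-Python sketch shows that round trip. The helper names are hypothetical and the labels are made up for illustration.

```python
def to_one_hot(labels, num_classes):
    """Encode integer class indices as one-hot vectors."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels]


def from_one_hot(vectors):
    """Recover class indices: position of the largest entry in each vector."""
    return [max(range(len(v)), key=v.__getitem__) for v in vectors]


labels = [0, 1, 1, 0]
encoded = to_one_hot(labels, num_classes=2)
print(encoded)
print(from_one_hot(encoded))
```

The same argmax step turns the softmax output of a model into a predicted class index.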
When training several deep learning models, it is useful to keep them; in this case we store them in a list. The list is also used later to implement model ensembling.
# SAVE THE MODEL TO A LIST
models_b.append(model) # save all the trained models
ensemble_models_b.append(model) # save models for later ensembling
The following cell executes the prediction and evaluation for the model that was trained last, which is saved in the variable model.
# PREDICTION FOR ONE MODEL
print('PREDICT FOR ONE MODEL...')
prediction_and_results(model, X_test_b, Y_test_b, X_test_real_b, Y_test_real_b)
The next cell only needs to be run if you want to try model ensembling. To ensemble different deep learning models, train them separately and save each one in the list ensemble_models_b.
# PREDICTION FOR ENSEMBLES
print('PREDICT FOR ENSEMBLE...')
prediction_and_results_ensemble(ensemble_models_b, X_test_b, Y_test_b, X_test_real_b, Y_test_real_b)
6. SUBTASK C
This section contains the code that implements Subtask C.
# SUBTASK C
# IMPORT DATA
print('IMPORT DATA...')
X_train, X_test_c, Y_train_c, Y_test_c, X_test_real_c, Y_test_real_c, tokenizer = import_data(train_file, test_file_c, test_label
embedding_matrix = pretrained_embeddings(tokenizer)
The following cell initializes, compiles, plots and trains the model. Only one model can be tried at a time.
NOTE: comment out the lines that initialize the models you are not using, and uncomment the line for the model you want to train.
# INITIALIZE MODELS
#model, model_name = initialize_SimpleRNN(2)
#model, model_name = initialize_SimpleLSTM(2)
#model, model_name = initialize_LSTM(2)
#model, model_name = initialize_BiLSTM_Dropout(2)
#model, model_name = initialize_CNN_GMP(2)
#model, model_name = initialize_CNN_3(2)
#model, model_name = initialize_BiLSTM_BiGRU(2)
# COMPILE MODEL
print('COMPILE THE MODEL...')
model.add(Dense(3, activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
# MODEL SCHEME
print('PLOT THE MODEL...')
print('Model:', model_name)
print(model.summary())
plot_model(model, show_shapes=True, show_layer_names=True)
# SUBTASK C: FIT THE MODEL
print('FIT THE MODEL...')
print('Model:', model_name)
batch_size = 128
model.fit(X_train, Y_train_c, epochs = 10, batch_size=batch_size)
When training several deep learning models, it is useful to keep them; in this case we store them in a list. The list is also used later to implement model ensembling.
# SAVE THE MODEL TO A LIST
models_c.append(model) # save all the trained models
ensemble_models_c.append(model) # save models for later ensembling
The following cell executes the prediction and evaluation for the model that was trained last, which is saved in the variable model.
# PREDICTION FOR ONE MODEL
print('PREDICT FOR ONE MODEL...')
prediction_and_results(model, X_test_c, Y_test_c, X_test_real_c, Y_test_real_c)
The next cell only needs to be run if you want to try model ensembling. To ensemble different deep learning models, train them separately and save each one in the list ensemble_models_c.
# PREDICTION FOR ENSEMBLES
print('PREDICT FOR ENSEMBLE...')
prediction_and_results_ensemble(ensemble_models_c, X_test_c, Y_test_c, X_test_real_c, Y_test_real_c)
Glossary
A list of all acronyms and what they stand for.