Degree Thesis BertaVinas.V0.5

Identifying and Categorizing Offensive Language in tweets
using Machine Learning
A Degree Thesis
Submitted to the Faculty of the
Escola Tècnica d'Enginyeria de Telecomunicació de
Barcelona
Universitat Politècnica de Catalunya
by
Berta Viñas Redondo
In partial fulfilment
of the requirements for the degree in
TELECOMMUNICATIONS TECHNOLOGIES AND SERVICES
ENGINEERING
Advisor: Jose Antonio Lázaro Villa
Barcelona, January 2020

Abstract
The aim of this thesis is the development of a system for identifying and categorizing offensive
language in tweets using machine learning techniques. The project is based on Task 12 of the
SemEval 2020 competition. This task consists of identifying offensive tweets and classifying
the type and target of the offense.
For this task, the Offensive Language Identification dataset (OLID) is used. The dataset
contains English tweets annotated. The task is divided into three subtasks depending on the
type and target of the offense.
Different machine learning models are applied for the development of the project. The thesis
provides a detailed analysis and evaluation of the results obtained with the different models
and a comparison with the results in last year’s competition.
It is demonstrated that one of the best models for this task consists of an ensemble of different
deep learning models, resulting a final macro F1 score of 0.7807 for subtask A, 0.6634 for
subtask B and 0.6062 for subtask C.
1
Resum
L’objectiu d’aquesta tesi és el desenvolupament d’un sistema per identificar i categoritzar el

llenguatge ofensiu en tuits mitjançant tècniques d’aprenentatge automàtic. El projecte es basa
en la tasca número 12 de la competició SemEval 2020. Aquesta tasca consisteix a identificar
els tuits ofensius i classificar el tipus i l’objectiu de l’ofensa.
Per a aquesta tasca, s’utilitza el conjunt de dades Offensive Language Identification Dataset
(OLID). El conjunt de dades conté tuits en anglès anotats. La tasca es divideix en tres
subtasques, segons el tipus i l’objectiu de l’ofensa.
Per al desenvolupament del projecte, s'apliquen diferents models d'aprenentatge automàtic.

La tesi proporciona una anàlisi i avaluació detallades dels resultats obtinguts amb els
diferents models i una comparació amb els resultats de la competició de l'any passat.
Es demostra que un dels millors models per a aquesta tasca consisteix en una combinació de
diferents models d’aprenentatge profund, resultant una puntuació final de macro F1 de 0,7807
per a la subtasca A, 0,6634 per a la subtasca B i 0,6062 per a la subtasca C.
2
Resumen
El objetivo de esta tesis es el desarrollo de un sistema para identificar y categorizar el

lenguaje ofensivo en tuits mediante técnicas de aprendizaje automático. El proyecto se basa
en la tarea número 12 de la competición SemEval 2020. Esta tarea consiste en identificar los
tuits ofensivos y clasificar el tipo y el objetivo de la ofensa.
Para esta tarea, se utiliza el conjunto de datos Offensive Language Identification Dataset
(OLID). El conjunto de datos contiene tuits en inglés anotados. La tarea se divide en tres
subtareas, según el tipo y el objetivo de la ofensa.
Para el desarrollo del proyecto, se aplican diferentes modelos de aprendizaje automático. La

tesis proporciona un análisis y evaluación detalladas de los resultados obtenidos con los
diferentes modelos y una comparación con los resultados de la competición del año pasado.
Se demuestra que uno de los mejores modelos para esta tarea consiste en una combinación
de distintos modelos de aprendizaje profundo, resultando una puntuación final de macro F1
de 0,7807 para la subtarea A, 0,6634 para la subtarea B y 0,6062 para la subtarea C.
3
This thesis is dedicated to my family, who have been always by my side, supporting me and
giving me strength during the whole degree.
To the friends in Lund, who shared their advice and encouragement while writing my thesis,
especially to Miquel, loyal workmate during the last weeks.
4
Acknowledgements
I wish to express my gratitude to my supervisor in Lund, Pierre Nugues, who guided and
support me during my study. His knowledge and advice helped me in the research and writing
of my thesis.
Finally, I would like to thank Jose Antonio Lázaro, my supervisor in Barcelona, for giving me
the opportunity to write my thesis at Lund University and for his help during the process.
5
Revision history and approval record
Revision Date Purpose
0 10/12/2019 Document creation
1 22/12/2019 Document revision
DOCUMENT DISTRIBUTION LIST
Name e-mail
Berta Viñas Redondo bertavr@hotmail.com
Jose Antonio Lázaro Villa subdir.internacional@etsetb.upc.edu
Pierre Nugues pierre.nugues@cs.lth.se
Written by: Reviewed and approved by:
Date 10/12/2019 Date 21/01/2020
Name Berta Viñas Redondo Name Jose Antonio Lázaro Villa
Position Project Author Position Project Supervisor
6
Table of contents
Abstract ......................................................................................................................................1
Resum ........................................................................................................................................2
Resumen ....................................................................................................................................3
Acknowledgements ....................................................................................................................5
Revision history and approval record .........................................................................................6
Table of contents........................................................................................................................7
List of Figures...........................................................................................................................10
List of Tables: ...........................................................................................................................11
1. Introduction........................................................................................................................12
1.1. Project background .....................................................................................................12
1.2. Statement of purpose .................................................................................................12
1.3. Project requirements...................................................................................................13
1.4. Project specifications ..................................................................................................13
1.5. Updated work plan ......................................................................................................14
1.6. Deviations and incidences ..........................................................................................17
2. State of the art of the technology used or applied in this thesis: .......................................18
2.1. Introduction to machine learning.................................................................................18
2.1.1. Logistic Regression classifier ...............................................................................19
2.1.2. Evaluating machine learning models ...................................................................20
2.1.3. Scikit-learn............................................................................................................20
2.2. Introduction to deep learning ......................................................................................21
2.2.1. Introduction to neural networks ............................................................................21
2.2.2. Data representation for neural networks ..............................................................22
2.2.2.1. Vectors .............................................................................................................23
2.2.2.2. Matrices ............................................................................................................23
2.2.3. Overfitting .............................................................................................................23
2.2.3.1. Dropout.............................................................................................................23
2.2.4. Keras ....................................................................................................................24
2.3. Deep learning for text sequences ...............................................................................24
2.3.1. Preparing the text data .........................................................................................24
2.3.2. Working with text data ..........................................................................................25
7
2.3.2.1. One-hot encoding .............................................................................................25
2.3.2.2. Word embeddings ............................................................................................26
2.3.3. Recurrent Neural Networks ..................................................................................26
2.3.3.1. SimpleRNN.......................................................................................................27
2.3.3.2. LSTM and GRU ................................................................................................27
2.3.4. Convolutional Neural Networks ............................................................................28
2.3.5. Ensembling models ..............................................................................................29
3. Methodology / project development: .................................................................................30
3.1. Task description..........................................................................................................30
3.2. Dataset .......................................................................................................................30
3.3. Data pre-processing ...................................................................................................31
3.4. Measurement ..............................................................................................................32
3.5. Creating the baseline ..................................................................................................32
3.6. Training deep learning models ...................................................................................33
3.6.1. Used parameters..................................................................................................33
3.6.2. Deep learning models (I) ......................................................................................33
3.6.2.1. Simple RNN......................................................................................................33
3.6.2.2. Simple LSTM ....................................................................................................34
3.6.2.3. LSTM ................................................................................................................34
3.6.2.4. BiLSTM + Dropout ............................................................................................34
3.6.3. Deep learning models (II) .....................................................................................35
3.6.3.1. BiLSTM + BiGRU..............................................................................................35
3.6.3.2. CNN + Global Max Pooling...............................................................................35
3.6.3.3. CNN (3 filters)...................................................................................................35
3.6.3.4. Ensemble..........................................................................................................36
4. Results...............................................................................................................................37
4.1. Baseline results ..........................................................................................................37
4.2. Deep learning results ..................................................................................................37
4.2.1. Subtask A .............................................................................................................37
4.2.2. Subtask B .............................................................................................................38
4.2.3. Subtask C.............................................................................................................38
4.3. Comparison with OffensEval 2019 .............................................................................39
5. Budget ...............................................................................................................................40
6. Conclusions and future development: ...............................................................................41
Bibliography:.............................................................................................................................42
8
Appendices:..............................................................................................................................43
Appendix I: Detailed results of the machine learning models. ..............................................43
Results for Logistic Regression.........................................................................................43
Subtask A: .....................................................................................................................43
Subtask B: .....................................................................................................................43
Subtask C: .....................................................................................................................44
Results for deep learning models......................................................................................45
Subtask A: .....................................................................................................................45
Subtask B: .....................................................................................................................46
Subtask C: .....................................................................................................................47
Appendix II: Previous project ................................................................................................49
Appendix III: Installation guide..............................................................................................50
Appendix IV: Code................................................................................................................52
Glossary ...................................................................................................................................53
9
List of Figures
Figure 1: Work packages structure. .........................................................................................15

Figure 2: Internal structure of WP6 for subtask A. ...................................................................15
Figure 3: Internal structure of WP6 for subtask B. ...................................................................15
Figure 4: Internal structure of WP6 for subtask C. ...................................................................15
Figure 5: Final Gantt diagram...................................................................................................16
Figure 6: Distribution of the hours dedicated to the project......................................................16
Figure 7: AI, Machine learning and Deep learning ...................................................................18
Figure 8: Logistic function. .......................................................................................................19
Figure 9: Data split. ..................................................................................................................20
Figure 10: Input data structure used in the project...................................................................21
Figure 11: Basic scheme of machine learning. [2] ...................................................................22
Figure 12: Pandas DataFrame example. .................................................................................24
Figure 13: One-hot encoding example. ....................................................................................26
Figure 14: Basic example of a RNN. ........................................................................................27
Figure 15: Basic example of LSTM. .........................................................................................27
Figure 16: Basic example of convolution..................................................................................28
Figure 17: Example of the first two tweets in the training dataset. ...........................................31
Figure 18: Data distribution in OLID dataset. ...........................................................................31
Figure 19: Confusion matrix of Logistic Regression on the test dataset, for subtask A. ..........43
Figure 20: Confusion matrix of Logistic Regression on the test dataset, for subtask B. ..........44
Figure 21: Confusion matrix of Logistic Regression on the test dataset, for subtask C. ..........44
Figure 22: Confusion matrix of Ensemble (BiLSTM+Dropout and LSTM models) on the test
dataset, for subtask A. ......................................................................................................45
Figure 23: Confusion matrix of CNN (3 filters) on the test dataset, for subtask B....................47
Figure 24: Confusion matrix of Ensemble (CNN+GlobalMaxPooling and CNN (3 filters)) on the
test dataset, for subtask C. ...............................................................................................48
10
List of Tables:
Table 1: Final work plan. ..........................................................................................................14

Table 2: Results for the Logistic Regression model on the test dataset, for the different
subtasks. ...........................................................................................................................37
Table 3: Results for Ensemble (BiLSTM+Dropout and LSTM) on the train and test dataset, for
subtask A. .........................................................................................................................38
Table 4: Results for CNN (3 filters) on the train and test dataset, for subtask B......................38
Table 5: Results for Ensemble (CNN+GlobalMaxPooling and CNN (3 filters)) on the train and
Table 6: Final best results for each subtask on the test dataset. .............................................39
Table 7: Cost breakdown of the project....................................................................................40
Table 8: Logistic Regression results on the test dataset, for subtask A...................................43
Table 9: Logistic Regression results on the test dataset, for subtask B...................................43
Table 10: Logistic Regression results on the test dataset, for subtask C.................................44
Table 11: Results for all the trained models, for subtask A. .....................................................45
Table 12: Results for Ensemble (BiLSTM+Dropout and LSTM) on the train and test dataset,
for subtask A. ....................................................................................................................45
Table 13: Results for all the trained models, for subtask B. .....................................................46
Table 14: Results for Ensembles and CNN (3 filters) on the train and test dataset, for subtask
B........................................................................................................................................46
Table 15: Results for all the trained models, for subtask C......................................................47
Table 16: Results for Ensemble (CNN+GlobalMaxPooling and CNN (3 filters)) on the train and
11
1. Introduction
1.1. Project background
The project is carried out at the Language Technology department in Lunds Tekniska
Högskola, Lund University. This project is based on the previous theoretical course Language
Technology, which introduces theories and techniques of language technology and natural
language processing. This course attempts to cover the whole field from speech recognition
and synthesis to semantics and dialogue.
Nowadays, due to the huge increase of the use of social media, lots of offensive language can
be seen on these platforms. As manual filtering is very hard, there have been many
researches aiming at automating this process.
This topic is one of the tasks proposed in the SemEval 2020 competition. Semantic Evaluation
(SemEval) is a series of workshops focused on the evaluation and comparison of systems that
can analyze diverse semantic phenomena in text aiming to extend the current state of the art
in semantic analysis and creating high quality datasets. This organization provides a really
interesting forum for researches to propose challenging research problems in the field of
semantics and to build new techniques to solve such research problems. Every year, many
tasks are proposed, so this project is based on the task 12 of the SemEval 2020: Identifying
and Categorizing Offensive Language in Social Media.
The main goal of this task is to identify offensive language on Twitter and categorize the type
and target of offense. This task is the second iteration of OffensEval, which was proposed in
SemEval 2019. The task this year, SemEval 2020, is still going on, so the project takes into
account the results obtained in last year’s competition.
The offensive content is broken down into the following three subtasks:
- Subtask A - Offensive language identification.
- Subtask B - Automatic categorization of offense types.
- Subtask C - Offense target identification.
1.2. Statement of purpose
The purpose of this project is to automate the process of detecting offensive language in
social media platforms, in this case, Twitter, using language technology techniques such as
machine learning.
12
The project main goals are:
1) Automate the detection of offensive language in social media platforms:
a. Discriminate between offensive and non-offensive posts.
b. Identify the type of offensive content in the posts.
c. Detect the target of the offensive posts.
2) Use machine learning techniques for language analysis.
3) Analyze and evaluate the machine learning models.
4) Compare the results with OffensEval 2019.
1.3. Project requirements
The principal requirement of the project is to be able to detect offensive language on Twitter,
which can be divided into the following points:
- Identify offensive language.
- Categorize offense types.
- Identify target of offense.
1.4. Project specifications
The project specifications are the following:

- Training and testing dataset:
Offensive Language Identification Dataset (OLID), which contains a collection
of 14.200 annotated English tweets.
- Python 3:
Anaconda Distribution.
Machine learning based program to detect offensive language.
Python libraries:
Scikit-learn.
Keras.
Numpy.
Pandas.
Regex.
- Evaluation:
F1-score.
13
1.5. Updated work plan
The final work plan is distributed in the following work packages:
WP# Short title Milestone / deliverable Date (week)

1 Research and data study Theoretical information to start 20/10/2019
with the project.
2 Creating the baseline Baseline to be improved 08/10/2019
during the project.
3 Using RNN Obtain the best results using 25/10/2019
RNN.
4 Using LSTM Obtain the best results using 10/11/2019
LSTM.
5 Improvements Obtain the state of the art 04/12/2019
results.
6 System Analysis Results for each different 15/10/2019
technique used.
7 Documentation Final report of the 23/09/2019
development and evaluation
of the project.
8 Offensive language System that identifies 15/10/2019
identification offensive and non-offensive
language.
9 Automatic categorization of System that classifies the 20/10/2019
offense types offense posts.
10 Offense target identification System that identifies a target 25/10/2019
given an offensive post.
11 Training Trained system. 15/10/2019

12 Testing Tested system. 17/10/2019
13 Evaluation Final documentation of the 19/10/2019
behavior of the system.
14 Results comparison Comparison of the results 10/01/2020
obtained with the ones in
OffensEval 2019.
Table 1: Final work plan.
14
Figure 1 details the work packages structure, whereas Figure 2, Figure 3 and Figure 4 show
the detail of the WP6 for each subtask.
Figure 1: Work packages structure.
Figure 2: Internal structure of WP6 for subtask A. Figure 3: Internal structure of WP6 for subtask B.
Figure 4: Internal structure of WP6 for subtask C.
15
The final Gantt diagram is presented in the figure below:
Figure 5: Final Gantt diagram.
The final distribution of the hours dedicated to each of the main tasks of the project is shown
in Figure 6.
Figure 6: Distribution of the hours dedicated to the project.
16
1.6. Deviations and incidences
At the beginning of the project, I had some difficulties installing Keras Python module, which
was needed for the code implementation. My computer couldn’t support to install Keras so in
the end, I started using Google Colab, which is a free cloud service that allows developing
deep learning applications using the Python libraries such as Keras, TensorFlow, etc. It made
the implementation process much easier and also the evaluation process. Some of the
models required high computational cost to train, so Colab also provided less computing times
in the training process.
Regarding the Work Plan, what has changed between the last one is mainly the structure and
the order of execution of each task. The development of the project follows the agile
methodology. It consists of iteration. For the first step WP2: Creating the baseline, the three
different subtasks were implemented and then analyzed. But for the rest of the steps WP3 –
WP5, the implementation of subtask A was done first. Once we had the results for the first
subtask with all the models, then it was easier to expand the models to be applied to the rest
of subtasks B and C.
17
2. State of the art of the technology used or applied in this thesis:
This section gives a brief overview of machine learning and deep learning concepts, and the
different related models that are applied in the project. It also describes how neural networks
work and the types that are used in the project. The final section presents the concrete case
of how to work with text data using deep learning techniques.
The different Python tools and libraries for machine learning are outlined in some of the next
chapters.
2.1. Introduction to machine learning
In the past few years, artificial intelligence (AI) has garnered a lot of attention. AI can be
defined as the capability of a computer system to automate intellectual tasks that normally
require human intelligence, such as speech recognition, visual perception, decision-making
and language translation. The AI field comprises machine learning and deep learning
approaches, as well as many more techniques that don’t involve any learning.
Figure 7: AI, Machine learning and Deep learning
In classical AI, humans input rules and data. This data is processed according to these rules
and outcome answers, whereas machine learning is a trained system that outcomes rules
given input data as well as the answers for the input data. Machine learning has the ability to
modify itself without human intervention each time that new data is fed.
Machine learning algorithms need three essential things:

- Input data.
- Expected output data.
- Measure for the performance of the system, which is used as feedback to adjust the
used algorithm; this is what is called learning.
18
The machine learning goal is to transform the input data into significant outputs, but the main
objective is to learn useful representations of the input data. All the different algorithms consist
of automatically find representations that get closer to the expected output.
A machine learning approach that will be used in this project is logistic regression, explained
in the next section.
2.1.1. Logistic Regression classifier
Logistic Regression is a machine learning algorithm from the field of statistics, used for
classification problems. It is a predictive analysis algorithm based on the concept of probability.
It is useful in binary classification problems.
Its name comes from the logistic function, also known as sigmoid function. Logistic
Regression uses this function to predict the probability of a given point to belong to a class.
The function maps any real value into a value between 0 and 1.
Figure 8: Logistic function.
In Logistic Regression, the input values (x) are combined linearly using coefficient values (β),
called weights to predict an output value (y).
These weights are estimated from the training data using maximum-likelihood estimation, a
learning algorithm used by several machine learning algorithms. The goal of maximum-
likelihood is to find the weights that minimize the error in the probabilities predicted by the
model to the ones in the data.
19
2.1.2. Evaluating machine learning models
Machine learning models cannot be evaluated on the same data as they were trained. The
goal of machine learning models is to perform well on never-before-seen data, so the main
objective is to be able to generalize. If a model is evaluated on the same data as it was trained,
it overfits, which means that the performance of the model starts worsening.
This is the reason why it is necessary to split the data into different sets to evaluate the
models:
- Train data.
- Validation data.
- Test data.
The model is trained using the training data and evaluated on the validation data. The
performance of the model on the validation data is used as feedback to tune the configuration
of the model. Once the model is ready, the final evaluation is done using the test data.
There are different techniques to split the data into the three needed sets. In this project,
simple hold-out validation is used, which consists of set apart a fraction of the data as a test
set. An example is shown in the figure below.
Figure 9: Data split.
2.1.3. Scikit-learn
Scikit-learn is a machine learning library, which provides a large set of algorithms that can be
used with Python.
The two main functions in the library used for the classifier are fit(), to train a model and
predict(), to predict a class.
The input data for machine learning models that use this library follows the next structure:
- Features: x denotes a feature vector containing information of one observation,
whereas X is a feature matrix representing the whole dataset.
- Labels: y denotes a vector, which contains the classes for the dataset.
20
Both X and y must be in the numpy array format, where numpy is the numerical computation
library on which scikit-learn is built.
For example, in our project, the input data has the structure shown in Figure 10.
Figure 10: Input data structure used in the project.
In this case, the observations are the different tweets, and the features of each observation
are the words in each tweet. Finally, the vector y contains the corresponding classes.
2.2. Introduction to deep learning
Deep learning is a specific subfield of machine learning. The meaning of the word ‘deep’ in its
name refers to the idea of having successive layers of representations. It uses multiple layers
to extract higher-level features from the input data. This is the main difference with other
machine learning approaches, which tend to focus on learning only one or two layers of
representations. In deep learning, these layered representations are learned via models called
neural networks.
2.2.1. Introduction to neural networks
Neural networks are composed of layers where each layer contains a set of nodes, called
neurons or units. The input layer corresponds to the input features and the output layer
produces the classification result. Each layer has an activation function, which produces the
output of the neurons in the layer.
Every single layer represents a data transformation, and these data transformations are
learned by exposure to input data. These data transformations are represented in the layer’s
21
weights, which are a set of numbers that parameterize the transformation in each layer. The
goal of a neural network is to find the weights’ values in all the layers that together, provide
the best approximation for the input to the associated target.
Initially, the weights of the network are set randomly, but with each example that the network
processes, the weights are updated. The process for the network to find the best weights’
values is called the training of the network. It consists of feeding the network with the input
data as many times as wanted, updating the weights of the network in each iteration, aiming
to yield the weights’ values that minimize the difference between the output value and the
expected one.
To measure the difference between the given output from the expected output, the network
uses the loss function. This function will give a score that will be used as feedback to adjust
the weights’ values aiming to lower the loss score. This is made by the optimizer, which
implements the backpropagation algorithm.
Figure 11: Basic scheme of machine learning. [2]
To summarize, a network is composed of different layers that are chained together. Given
input data, it returns a prediction for its target. These predictions are compared to the real
targets using the loss function. The resulting value, loss score, is used by the optimizer to
update the network’s weights. The stack of layers is called model.
2.2.2. Data representation for neural networks
In the previous section Scikit-learn, the numpy arrays were presented. They can also be
called tensors. Most of the actual machine learning systems use tensors as the data structure.
They generally contain numerical data.
22
The following sections describe two different types of tensors that can be used in machine
learning algorithms.
2.2.2.1. Vectors
Vectors are arrays of numbers and they can also be called 1D tensors. It is said that a 1D
tensor has exactly one axis. It is important to understand the difference between axis and
dimension. Dimensionality denotes the number of entries for a specific axis or the number of
axis in a tensor. For example, a 3D vector has only one axis and has three dimensions along
its axis, whereas a 3D tensor has three axes and can have any number of dimensions for
each axis.
2.2.2.2. Matrices
Matrices are arrays of vectors, or also called 2D tensors. In this case, a matrix has two axes,
the entries from the first one are called rows and the entries from the second one are called
columns. The result of packing 2D tensors in an array is a 3D tensor, and so on.
2.2.3. Overfitting
As stated in the section Evaluating machine learning models, overfit occurs when the model is
beginning to learn patterns that are specific to the training data but they don’t have any
relevance when it comes to data that has never seen before. Then, the performance of the
model begins to be worse. This happens in every machine learning problem.
To deal with overfitting, the best solution is to get more training data. A model trained on more
data will naturally generalize better. Some other techniques can also be applied to fight
overfitting.
2.2.3.1. Dropout
Dropout is one of the most common and effective techniques to deal with overfitting in neural
networks. It is applied to the layers and it consists of randomly set to zero several output
features of the layer during the training process.
When using dropout, a dropout rate is applied, it is defined as the fraction of features that are
zeroed out. It normally has values between 0.2 and 0.5. In the testing process, no units are
dropped out.
23
2.2.4. Keras
Keras is a deep learning framework for Python that provides a way to define and train almost
any kind of deep learning model. It will be used for the development of the project.
It contains different modules to create new deep learning models, such as neural layers,
optimizers, activation functions, etc.
Keras relies on back-end techniques such as TensorFlow for doing its own low-level
operations. TensorFlow is another machine learning framework created by Google used to
design, build and train deep learning models. It also has a library to do complex numerical
computations.
2.3. Deep learning for text sequences
How deep learning models can process text and sequence data is described in this chapter.
The two main deep learning algorithms are recurrent neural networks and 1D convents, which
will be presented in the following sections.
2.3.1. Preparing the text data
In most of the cases, text data has to be stored and modified before feeding it into the neural
network. Some of the Python libraries used in the project are the following:
- Pandas DataFrame: it consists of a two-dimensional data structure with labeled axes
(rows and columns). It has three main components: data, rows and columns. Pandas
DataFrame can be created from existing datasets storage in SQL Databases, CSV
files, Excel files, etc. An example can be observed in the figure below:
Figure 12: Pandas DataFrame example.
24
- Regular expressions: are devices to define and search patterns in texts. They can be
used for text processing such as translating characters, match sequences of
characters, substituting words, or counting them. Python has a module that provides
support for regular expressions.
2.3.2. Working with text data
Text can be understood as either a sequence of characters or a sequence of words, although

is more common to work at word’s level. Deep learning models for processing sequences can
be used for different applications such as sentiment analysis. It is basically pattern recognition
applied to words or sentences.
Like all other neural networks, the models only work with numeric tensors. This is why the text
has to be transformed into numeric tensors. This process is called vectorization. There are
several techniques used to implement it:
- Split the text into words and transform each word into a vector.
- Split the text into characters and transform each character into a vector.
- Extract N-grams of words or characters, and transform each N-gram into a vector. N-
grams are groups of multiple consecutive words or characters.
The resulting units into which the text is split (characters, words or N-grams) are called tokens,
which names the process of breaking down the text into such tokens, tokenization. Some
techniques to implement tokenization are one-hot encoding and word embeddings.
2.3.2.1. One-hot encoding
One-hot encoding is the most common and basic way to transform a token into a vector. It
uses a vocabulary, which is a set of unique words in the whole corpus. It associates a unique
integer index with every word in the vocabulary and then turns this integer index into a binary
vector of size N (the size of the vocabulary). The vector representing the token with the
corresponding index i is all zeros except for the i th position, which has value one.
Figure 13 introduces an example of one-hot encoding. The corpus contains two different
sentences, sentence1 and sentence2 respectively. The token_index represents the
vocabulary, which associates each unique word in the corpus to an integer. And finally, the
binary vector representation for both sentences.
25
Figure 13: One-hot encoding example.
2.3.2.2. Word embeddings
Another powerful way to tokenize is using the dense of word vectors, also called word
embeddings. The resulting vectors from one-hot encoding are binary, mostly made of zeros
and very high-dimensional (their size is the number of unique words in the corpus). Word
embeddings provide us low-dimensional floating-point vectors, which are more useful due to
that they contain more information into fewer dimensions.
There are two different ways to use word embeddings:

- Start with random word vectors and learn them in the same way the network learns its
weights in each iteration of the training process.
- Load word embeddings that were pre-computed using a different machine learning
task. These are called pre-trained word embeddings. This method is the one used for
the project.
2.3.3. Recurrent Neural Networks
Recurrent Neural Networks (RNN) are a type of neural network that processes sequences by
iterating through the elements of the sequence and keep a state containing information
relative to what the network has seen until the moment. This state is updated each time the
network process an input sequence. To simplify, we can say that RNN are networks that have
an internal loop.
Figure 14 shows an example of RNN, where each timestep is the output of the loop at time t.
W and U represent the weight matrices and Input represents the different input features.
26
Figure 14: Basic example of a RNN.
To implement the recurrent neural networks in the project we will use different Keras layers.
2.3.3.1. SimpleRNN
Keras has a layer that implements a simple recurrent neural network, the SimpleRNN layer.
2.3.3.2. LSTM and GRU
LSTM and GRU are other recurrent layers in Keras. Long Short-Term Memory (LSTM) layer is
a variant of recurrent neural networks. This layer, adds a way to carry information across
many timesteps, this means that information from a sequence can jump onto any point of the
network and be transported to a later timestep. To simplify, what it does is to save information
for later.
The figure below details an example of a LSTM. It can be observed that a new data flow is
added, it carries information across timesteps. This information will be combined with the input
connection and the recurrent connection and it will affect the state being sent to the next
timestep.
Figure 15: Basic example of LSTM.
27
Gated recurrent unit (GRU) layers use the same principle as LSTM, but they are more efficient
and cheaper to run.
2.3.4. Convolutional Neural Networks
The main characteristic of Convolutional Neural Networks is that they learn local patterns
instead of global patterns. A pattern learned at a certain position in a sentence can later be
recognized in a different position. It has been observed that they have a successful
performance for Natural Language Processing (NLP).
As their name says, they use convolution filters that are applied to the local features. The
filters are applied to windows of words, which can have different sizes. The result of applying
one filter to one window of words produces a new feature, and the resulting features from
applying this filter to all the possible windows of words is called feature map.
Convolutional networks use pooling operations, such as average and max pooling in order to
spatially downsample the feature maps. The idea is to keep the most important features (the
ones with the highest values or an average of all of them), for each feature map.
A simple example of how CNN work can be observed in the figure below.
Figure 16: Basic example of convolution.
For the example in Figure 16, two different filters are applied to the window sizes two and
three. The result of applying a filter to each window size gives two feature maps for each
28
window size. Finally, max pooling operation is applied and the resulting features are
concatenated forming the final feature vector.
2.3.5. Ensembling models
Another powerful technique for complex problems in deep learning models is ensembling
models. Ensembling consists of combining the predictions of a set of different models to
produce better predictions. It is important that every single model has good performance
independently. Generally, different models look at slightly different features of the data, so this
is why combining them provides knowledge for a better prediction.
29
3. Methodology / project development:
The following section presents the steps of the project development and a description of the
different trained models.
3.1. Task description
In previous sections it was mentioned that the project is broken down into three different
subtasks, depending on the level of offense. The three subtasks are described next.
The first subtask, subtask A, consists of classifying the tweets between Offensive and Not
Offensive. Offensive posts include insults, threats, and posts containing any form of
untargeted profanity. The labels for this task are the following:
- NOT, Not Offensive
- OFF, Offensive
In subtask B, the goal is to predict the type of offense, meaning if the offense is to an
individual, group or others or if it is non-targeted. Only posts labeled as Offensive in subtask A
are included in this task. The different classes in the task are:
- TIN, Targeted Insult
- UNT, Untargeted
Finally, the subtask C aims to classify the type of target offense. For the Targeted Insult posts,
identify the corresponding target. The possible labels are:
- IND, Individual
- GRP, Group
- OTH, Others
3.2. Dataset
The dataset used in the task OffensEval is the Offensive Language Identification Dataset
(OLID) [3]. This dataset was created specifically for this task. It contains 14,100 English
tweets, 13,240 provided as the training data and 860 as the testing data. Each tweet in the
dataset has an id and the corresponding labels for each subtask. An example of the first and
second tweets on the training dataset is the following:
30
Figure 17: Example of the first two tweets in the training dataset.
Figure 18 provides an analysis of the data distribution in the train and test datasets.
Figure 18: Data distribution in OLID dataset.
OLID is annotated following the hierarchical three-level annotation schema that takes both the
target and the type of offensive content into account. The dataset was annotated using the
crowdsourcing platform Figure Eight [10].
The training dataset is split into a training and validation set. The validation set is the 10% of
the original training data.
3.3. Data pre-processing
Before feeding the data into the machine learning models, some pre-processing was applied.
Most of the steps were done using the Python libraries mentioned in the section Preparing the
text data.
31
The pre-processing steps used are the following:
- Cleaning:
Remove all the special characters and all the instances of USER and URL.
- Hashtags:
The hashtags are split into words. For example, #DeepStateCorruption is split into
Deep State Corruption.
- Lowercasing:
Lowercase all the tweets.
- Tokenization:
The text is broken into words and associates each word with a numeric vector.
- Embeddings:
Pre-trained word embeddings are used. They are word embeddings that were pre-
computed using a different machine learning task that the one we want to solve. There
are many different pre-trained word embeddings. In our project, we use GloVe pre-
trained word vectors [11].
3.4. Measurement
The official measure used to evaluate the performance of the different models established in
the competition is macro-averaged F1-score, due to the imbalance between the number of
instances in the different classes in the tasks.
3.5. Creating the baseline
The initial phase consists of creating the baseline. The first model implemented, is based on
Logistic Regression. In this case, our data is encoded using one-hot encoding, see an
example in the section One-hot encoding. To encode the tweets, these are split into words
and each tweet is represented as a vector of length the total number of words in the whole
corpus. This vector was initialized to 0 and then set to 1's in all the positions in which the word
is in the tweet. For this, it’s necessary to create an index of the whole unique words in the
corpus.
To encode the labels, a binary vector was obtained in which the value was '0' in case of 'NOT'
and '1' in case of 'OFF'.
Finally, the result was a binary matrix of size: #tweets x #tokens, and a label vector of size:
#tweets, which were the input data to the Logistic Regression classifier.
32
3.6. Training deep learning models
The next step after creating the baseline is to implement and train different deep learning
models to compare their performance with the baseline. The trained deep learning models are
outlined in this section.
3.6.1. Used parameters
After trying some variations of different parameters in all the trained models, it has been
observed that the ones that give the best results are the following, so for all the models used
we applied these same parameters:
- The maximum length of the tweets is fixed to 200, so the input sequences are padded
in order to make them all have the same length.
- 10,000 words are considered as features.
- The tweets are encoded using pre-trained word embeddings. There are several pre-
trained word vector dimensions in GloVe. For our project, we used the 200-
dimensional vectors. This means that one word is represented as a vector with 200
elements.
- And finally, all the models are trained with 10 epochs, which are the iterations in the
training process.
3.6.2. Deep learning models (I)
For all the models, an Embedding layer is applied at the beginning, and the final layer consists
of a Dense layer with 2 units and 'softmax' as the activation function. The trained deep
learning models are presented below:
3.6.2.1. Simple RNN
Recurrent Neural Networks (RNN) have a good performance in cases of language because
each neuron or unit can use its internal memory to maintain information about the previous
input. As mentioned in the section Recurrent Neural Networks, RNN have loops in them that
allow information to be carried across neurons while reading an input. This allows the network
to have context from the beginning of a sentence, which will allow more accurate predictions
of a word at the end of a sentence.
33
This model consisted of a simple Recurrent Neural Network layer using 32 units.
3.6.2.2. Simple LSTM
Long Short Term Memory (LSTM) are capable of learning long-term dependencies,
remembering information for long periods of time. More detailed information of LSTM was
noted in the section LSTM and GRU.
This model is build of a LSTM layer with 32 neurons.
3.6.2.3. LSTM
This model is more complex than the previous one. After the Embedding layer, we add a
Dropout layer with a dropout rate of 0.4. As mentioned in the section Dropout, The goal of a
Dropout layer is to prevent the model from overfitting. It randomly sets the outgoing neurons to
0 at each update of the training phase.
In this case, the LSTM layer has 196 units, and a Dropout layer with a rate of 0.25 is applied
before the final layer.
3.6.2.4. BiLSTM + Dropout
This model consisted of mixing Bidirectional LSTM (BiLSTM) with Dropout layers. BiLSTM are
bidirectional variants of LSTM. With them, the learning algorithm is fed with the original data
once from the beginning to the end and once from the end to the beginning.
We use 196 LSTM units wrapped by a Bidirectional layer, and 0.25 being the dropout rate,
followed by another 196 LSTM units wrapped by a Bidirectional layer, 0.25 being, as well, the
dropout rate.
A Flatten layer is applied before the final layer. The goal of the Flatten layer is to reshape the
tensor, removing all the dimensions except for one.
34
3.6.3. Deep learning models (II)
In order to improve the performance of our models, we decided to investigate what the teams
in the competition last year (OffensEval 2019) implemented. Most of the best-score teams in
the competition used the BERT model [9]. BERT is a new technique for Natural Language
Processing (NLP) that was open-sourced by the researchers at Google AI Language at the
end of 2018. This technique is quite new and complex, so we skip it and look at the teams that
obtained the best results without using BERT. These teams used the models presented below,
so we tried an approximation of them, to compare and evaluate our results.
3.6.3.1. BiLSTM + BiGRU
This model consists of a BiLSTM layer and a BiGRU layer. BiGRU are Bidirectional Gated
Recurrent Units, they can also be considered as a variation on the BiLSTM. GRU are
mentioned in the chapter LSTM and GRU.
For this model, we use a Bidirectional LSTM layer with 196 units with dropout rate 0.3,
followed by a Bidirectional GRU layer with 64 GRU units also with 0.3 as dropout rate.
Then, a Max Pooling and Average Pooling layers are used and the results are concatenated.
Pooling layers provide an approach to downsampling feature maps by summarizing the
presence of features in the feature map.
3.6.3.2. CNN + Global Max Pooling
This model is based on Convolutional Neural Network (CNN), which are presented in the
section Convolutional Neural Networks, with a Global Max Pooling layer.
Before the final layer, a hidden Dense layer is applied with 256 units and 'relu' as the
activation function.
3.6.3.3. CNN (3 filters)
In this model, we use a Dropout layer after the embedding with a dropout rate of 0.3. The
convolutional network applied consists of three different filters with window sizes 2, 3 and 4.
For each different window size, 256 filters are used and a Max Pooling layer is then applied to
the feature maps. The resulting vectors are concatenated.
35
A Dropout layer with dropout rate 0.3 is applied before the input to the final Dense layer with
256 neurons for classification.
3.6.3.4. Ensemble
A useful option to solve complicated problems in deep learning is Ensembling models,

mentioned in the section Ensembling models. Aiming to improve the performance of our
system, the models that give us the best results in each subtask are combined building a new
ensemble model.
36
4. Results
As reported in the previous section Measurement, the evaluation of the final results is done
with the values of accuracy and macro F1 score. The final evaluation is done with the results
of the model using the test dataset.
This section only shows the best results for each subtask. The complete results for all the
models in each subtask are presented in the appendix Appendix I: Detailed results of the
machine learning models.
4.1. Baseline results
The results obtained using the Logistic Regression classifier in the different subtasks are
displayed in the table below.
Subtask Accuracy Macro F1 score
Subtask A 0.7640 0.6616
Subtask B 0.8917 0.5885
Subtask C 0.6432 0.5043

Table 2: Results for the Logistic Regression model on the test dataset, for the different subtasks.
4.2. Deep learning results
4.2.1. Subtask A
The models that perform better on the test dataset for the first subtask are the following:
LSTM, BiLSTM with Dropout and BiLSTM combined with BiGRU. The best one being BiLSTM
with Dropout, which gives us the maximum macro F1 score of 0.7739.
Aiming to obtain even better results, we tried to ensemble these three models. After trying
some different combinations of ensembles, the best result was given by ensembling LSTM
and BiLSTM + Dropout.
37
Validation dataset Test dataset
Model
Accuracy Macro F1 score Accuracy Macro F1 score
BiLSTM + Dropout
0.7689 0.7446 0.8337 0.7807
LSTM
Table 3: Results for Ensemble (BiLSTM+Dropout and LSTM) on the train and test dataset, for subtask A.
4.2.2. Subtask B
For the subtask B, the models based in RNN, LSTM and CNN combining three different filters
give the best results on the test dataset. The best macro F1 score is 0.6634, given by the last
named model.
In this case, the best ensemble model results for combining the three models that give the
best performance on the test set; Simple RNN, LSTM and CNN (3 filters). However, this
ensemble reach a macro F1 score of 0.5953, which is worse than the performance of the CNN
(3 filters) model. In conclusion, the best model for the subtask B is based on CNN with three
different filters.

Model
CNN (3 filters) 0.8708 0.5222 0.8833 0.6634

Table 4: Results for CNN (3 filters) on the train and test dataset, for subtask B.
4.2.3. Subtask C
Finally, for the last subtask, the models that provide the best performance on the test dataset
are LSTM, CNN with a Global Max Pooling layer and CNN combining three different filters.
The second model, CNN with Global Max Pooling, gives the best result and it reaches a
macro F1 score of 0.5943.
Ensembling the two models that performed best; CNN + Global Max Pooling and CNN (3
filters) even better results are obtained.

Model
CNN + Global Max Pooling

0.4139 0.3347 0.6901 0.6062
CNN (3 filters)
Table 5: Results for Ensemble (CNN+GlobalMaxPooling and CNN (3 filters)) on the train and test dataset, for
subtask C.
38
4.3. Comparison with OffensEval 2019
In last year’s competition, most of the best results were obtained using BERT model [9].
For subtask A, the team that performed best without using BERT used an ensemble of three
different models based on CNN, BiGRU and BiLSTM with Attention. This team was ranked
sixth out of 104 teams, submitting a final macro F1 score of 0.806 [5].
For the second subtask, 76 teams participated and the best teams used rule-based
approaches and ensembling of deep learning (including BERT) and non-neural machine
learning models. The best team reached a macro F1 score of 0.755 [7].
Finally, in subtask C, where 66 teams participated, ensembles were used by five of the top-10
teams. The best team obtained a macro F1 score of 0.660, using BERT model [6].
The final best results obtained with our models, for each subtask are the following:
Subtask Model Macro F1 score Competition ranking
BiLSTM + Dropout
A 0.7807 24th out of 104 teams
LSTM
B CNN (3 filters) 0.6634 between 15-24 th out of

76 teams

C 0.6062 7th out of 66 teams
CNN (3 filters)
Table 6: Final best results for each subtask on the test dataset.
As expected, it is very complicated to simulate the same models that the best teams
implemented, since there are many facts that affect the model performance. This can be
influenced by the different parameters applied in the deep learning models, or also the pre-
processing techniques used. Some of the teams in the competition also applied techniques
such as normalization and lemmatization to the initial tweets.
39
5. Budget
To carry out this project it is only necessary an engineer with Python skills and a computer.
The cost breakdown for the project is shown in Table 7.
Concept
Project duration (weeks) 20
Average dedication hours (hours/week) 23
Total dedication (hours) 460
Junior engineer (€/hour) 10
Engineer cost (€) 4,600
Computer price (€) 1000
Computer duration (years) 4
Computer cost (€/year) 250
Computer cost for the project (€) 96
TOTAL 4,696 €
Table 7: Cost breakdown of the project.
40
6. Conclusions and future development:
To sum up, we have demonstrated that recurrent and convolutional neural networks have a
good performance working with text data and that ensembling different deep learning models
is a successful technique to solve classification problems.
The most important limitation lies in the fact of how slow can be the training process of each
model. To find the optimal parameters for each model, the model has to be trained one time
for every possible combination of parameters. As it is impossible to try all the possible
combinations, we proceed trying some sets that should perform well.
However, given the number of times we trained our models, we have managed to obtain
satisfactory results that can be compared with the ones obtained by the professional teams
that participated in last year’s competitions. We believe our work could be a starting point for
future implementations.
To further research, we plan to train and test the system with bigger datasets and implement
more pre-processing techniques such as normalization of the tokens, converting emojis to text
and lemmatization, etc.
Finally, it would be really interesting to implement the BERT model. Nowadays, it is

considered a new era in NLP, it broke several records for difficult language-based tasks. This
could potentially lead to a huge improvement in the results of our project.
41
Bibliography:
[1] P. Nugues, An Introduction to Language Processing with Perl and Prolog. An Outline of Theories,
Implementation, and Application with Special Consideration of English, French, and German. Series:
Cognitive Technologies, Springer, 2006, ISBN: 3-540-25031-X.
[2] F. Chollet, Deep learning with Python. Manning, 2017, ISBN: 9781617294433.
[3] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar. “Predicting the Type and Target of
Offensive Posts in Social Media”. In Proceedings of NAACL. 2019a.
[4] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar. “SemEval-2019 Task 6: Identifying
and Categorizing Offensive Language in Social Media (OffensEval)”. In Proceedings of The 13th
International Workshop on Semantic Evaluation (SemEval). 2019b.
[5] D. Mahata, H. Zhang, K. Uppal, Y. Kumar, R. Ratn Shah, S. Shahid, L. Mehnaz, S. Anand. “MIDAS at
SemEval-2019 Task 6: Identifying offensive posts and targeted offense from Twitter”. In Proceedings of
The 13th International Workshop on Semantic Evaluation (SemEval). 2019.
[6] A. Nikolov and V. Radivchev. “NikolovRadivchev at SemEval-2019 Task 6: Offensive tweet classification
with BERT and ensembles”. In Proceedings of The 13th International Workshop on Semantic Evaluation
(SemEval). 2019.
[7] J. Han, X. Liu, S. Wu. “jhan014 at SemEval-2019 Task 6: Identifying and categorizing offensive language
in social media”. In Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval).
2019.
[8] Y. Kim. “Convolutional neural networks for sentence classification”. In Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
[9] J. Delvin, M. Chang, K. Lee, K. Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding”. 2018.
[10] Figure Eight, The Essential High-Quality Data Annotation Platform. [Online] Available: https://www.figure-
eight.com/
[11] J. Pennington, R. Socher, C.D. Manning. “Glove: Global vectors for word representation”. 2014. In EMNLP.
42
Appendices:
Appendix I: Detailed results of the machine learning models.
Results for Logistic Regression
Subtask A:
Model Accuracy Macro F1 score
Logistic Regression 0.7640 0.6616

Table 8: Logistic Regression results on the test dataset, for subtask A.
Figure 19: Confusion matrix of Logistic Regression on the test dataset, for subtask A.
Subtask B:

Table 9: Logistic Regression results on the test dataset, for subtask B.
43
Figure 20: Confusion matrix of Logistic Regression on the test dataset, for subtask B.
Subtask C:

Table 10: Logistic Regression results on the test dataset, for subtask C.
Figure 21: Confusion matrix of Logistic Regression on the test dataset, for subtask C.
44
Results for deep learning models
Subtask A:

Model Accuracy Macro F1 score Accuracy Macro F1 score
Simple RNN 0.7221 0.6906 0.7512 0.6876
Simple LSTM 0.7440 0.7039 0.7767 0.6873
LSTM 0.7591 0.7299 0.8267 0.7693
BiLSTM + Dropout 0.7666 0.7453 0.8221 0.7739
BiLSTM + BiGRU 0.7787 0.7525 0.8244 0.7698
CNN + Global Max Pooling 0.7515 0.7104 0.8081 0.7345
CNN (3 filters) 0.7372 0.7229 0.7977 0.7570

Table 11: Results for all the trained models, for subtask A.

Model
BiLSTM + Dropout
0.7689 0.7446 0.8337 0.7807
LSTM
Table 12: Results for Ensemble (BiLSTM+Dropout and LSTM) on the train and test dataset, for subtask A.
Figure 22: Confusion matrix of Ensemble (BiLSTM+Dropout and LSTM models) on the test dataset, for subtask A.
45
Subtask B:

Simple RNN 0.8225 0.4829 0.8500 0.6470
Simple LSTM 0.8595 0.5061 0.8708 0.5866
LSTM 0.9207 0.5305 0.8667 0.5636
BiLSTM + Dropout 0.8965 0.5189 0.8625 0.5600
BiLSTM + BiGRU 0.9381 0.5383 0.8875 0.5607
CNN (3 filters) 0.8708 0.5222 0.8833 0.6634

Table 13: Results for all the trained models, for subtask B.

Model
Simple RNN
0.9131 0.5382 0.8792 0.5953
Simple LSTM
CNN (3 filters)
Simple RNN
0.8860 0.5228 0.8625 0.5785
CNN (3 filters)
CNN (3 filters) 0.8708 0.5222 0.8833 0.6634

Table 14: Results for Ensembles and CNN (3 filters) on the train and test dataset, for subtask B.
46
Figure 23: Confusion matrix of CNN (3 filters) on the test dataset, for subtask B.
Subtask C:

Simple RNN 0.3369 0.2536 0.5869 0.4661
Simple LSTM 0.4116 0.3188 0.6385 0.5140
LSTM 0.4048 0.3233 0.6714 0.5301
BiLSTM + Dropout 0.3754 0.2833 0.6385 0.4866
BiLSTM + BiGRU 0.4366 0.3119 0.6526 0.5032
CNN (3 filters) 0.3814 0.2992 0.6714 0.5860

Table 15: Results for all the trained models, for subtask C.

Model

0.4139 0.3347 0.6901 0.6062
CNN (3 filters)
Table 16: Results for Ensemble (CNN+GlobalMaxPooling and CNN (3 filters)) on the train and test dataset, for
subtask C.
47
Figure 24: Confusion matrix of Ensemble (CNN+GlobalMaxPooling and CNN (3 filters)) on the test dataset, for
subtask C.
48
Appendix II: Previous project
This thesis is the continuation of a previous project carried out in the Project in Language
Technology course at Lund University. The final report for this course is included in this
appendix. Notice that the paper is based on subtask A and that the results reported are
slightly different than the ones obtained in the thesis development.
49
Identifying and Categorizing Offensive Language in Social Media
(OffensEval)
Berta Viñas Redondo

Lunds Tekniska Högskola, LTH
bertavr@hotmail.com
Abstract The offensive content is broken down into the

following three sub-tasks taking the type and tar-
In this paper, we present the development of get of offenses into account. The different sub-
a system for identifying and categorizing of-
tasks are:
fensive language in social media, specifically
Twitter. The project is based on SemEval 2020
Task 6, which is the second iteration of the • Sub-task A - Offensive language identifica-
OffensEval task organized at SemEval 2019. tion.
For this task, the Offensive Language Identi-
fication dataset (OLID) is used. The dataset • Sub-task B - Automatic categorization of of-
contains English tweets annotated using a hi- fense types.
erarchical three-level annotation model. For
the development of the project, some different
models were trained. We provide a detailed • Sub-task C - Offense target identification.
analysis and evaluation of the different mod-
els. The best results were obtained using an This project is based on the Sub-task A, classi-
ensemble of two different deep learning mod- fying tweets in Offensive and Not Offensive.
els, resulting a final macro F1 score of 0.7626, During the development of the project, different
which would be ranked 32th out of 104 teams deep learning models were trained and evaluated.
in last year’s competition OffensEval 2019.
The results will be compared with the results ob-
1 Introduction tained in the last competition (OffensEval 2019).
The remainder of this paper is organized as fol-
Nowadays, due to the huge increasing of the use lows. Section 2 discusses similar work in the
of social media, lots of offensive language can be field. Section 3 describes the dataset used in the
seen in social media platforms. As manual filter- project. Section 4 presents the description of the
ing is very hard, there have been many researches project, and explains the important steps and dif-
aiming at automating this process. ferent trained models. Section 5 analyzes the re-
This topic is one of the tasks proposed in the sults and performances of the models. Section 6
SemEval 2020 competition. Semantic Evaluation concludes and considers possible ways to improve
(SemEval) is a series of workshops focused on and continue working in the future.
the evaluation and comparison of systems that can
analyze diverse semantic phenomena in text with 2 Related Work
the aim of extending the current state of art in se-
mantic analysis and creating high quality datasets. Different problems related to identify offensive
This organization provides a really interesting fo- language have been explored, such as aggression,
rum for researches. It proposes challenging re- cyber bullying, hate speech, toxic comment and
search problems in the field of semantics and to offensive language.
build new techniques to solve such research prob- This topic has attracted significant attention last
lems. Every year, many tasks are proposed, so this years, due to the increase of the use of social plat-
project will be based on the task 12 of the SemEval forms. Some work in the field of sentiment anal-
2020: OffensEval 2, Identifying and Categorizing ysis is also related to categorizing reviews, like
Offensive Language in Social Media. movie or customer service reviews.
Additional related work is presented in work- data. For this and all the project development, the
shops such as Kaggle1 , TRAC 2 (Kumar et al., Keras library was used.
2018) and related shared tasks such as GermEval The preprocessing steps used are the following:
(Wiegand et al., 2018) and SemEval. Cleaning: Remove all the special characters
As this project is based on the SemEval 2020 and all the instances of USER and URL.
task 12, OffensEval 2, some of the directly related Hashtags: The hashtags were split into words.
works, are the papers of some teams participating For example, #DeepStateCorruption was split into
in the last iteration of this task, SemEval 2019 task Deep State Corruption.
6, OffenEval, (Mahata et al., 2019; Nikolov and Lower Casing: Lowercase all the tweets.
Radivchev, 2019; Seganti et al., 2019), and the Tokenization: The text is broken into words
final report of the competition in Zampieri et al. and associate to each word a numeric vector.
(2019b). Embeddings: Pre-trained word embeddings
The dataset for this competition is explained in will be used. They are word embeddings that were
Zampieri et al. (2019a) and different approaches to precomputed using a different machine-learning
the same problem were reported in Zampieri et al. task that the one we want to solve. There are
(2019b). many different pre-trained word embeddings. In
our project we will use GloVe pre-trained word
3 Dataset
vectors (Pennington et al., 2014).
The dataset used in the task OffensEval is the Of-
fensive Language Identification Dataset (OLID) 4.2 Creating the baseline
(Zampieri et al., 2019a). This dataset was created The initial phase consists of creating the baseline.
specifically for this task. It contains 14,100 En- The first model implemented, consisted on a Lo-
glish tweets, 13,240 provided as the training data gistic Regression classifier. In this case, our data
and 860 as the testing data. were encoded using one-hot encoding. To encode
OLID is annotated following the hierarchical the tweets, these were split into words and each
three-level annotation schema that takes both the tweet was represented as a vector of length the to-
target and the type of offensive content into ac- tal number of words in the whole corpus. This
count. The dataset was annotated using the crowd- vector was initialized to 0 and then set to 1’s in all
sourcing platform Figure Eight. the positions which the word is in the tweet. For
The training dataset was split into a training and this, its necessary to create an index of the whole
validation set. The validation set was the 10% of unique words in the corpus.
the original training data. The models with the To encode the labels, a binary vector was ob-
best results on the validation test, will be evalu- tained in which the value was ’0’ in case of ’NOT’
ated on the test set. and ’1’ in case of ’OFF’.
Finally, the result was a binary matrix of size:
4 Task Description #tweets x #tokens, and a label vector of size:
The Sub-task A consists of classifying the tweets #tweets, which were the input data to the Logis-
between Offensive and Not Offensive. Offensive tic Regression classifier.
posts include insults, threats, and posts containing
any form of untargeted profanity. The labels for 4.3 Training Deep Learning Models
this task will be the following: After trying some variations of different parame-
ters in all the trained models, it has been observed
• NOT, Not Offensive that the ones that gave the best results are the fol-
lowing, so for all the models used we applied these
• OFF, Offensive
same parameters:
4.1 Data preprocessing
• The maximum length of the tweets was fixed
Before feeding the dataset to the machine learn-
to 200, so the input sequences were padded in
ing models, some preprocessing was applied to the
order to make them all have the same length.
1
https://www.kaggle.com/
2
https://sites.google.com/view/trac1/home • We considered as features 10,000 words.
• The tweets were encoded using pre-trained layer is to reshape the tensor, removing all the di-
word embeddings. There are several pre- mensions except for one.
trained word vector dimensions in GloVe. At some point, we decided to investigate about
For our project we used the 200-dimensional what the teams in the competition last year (Offen-
vectors. This means that one word is repre- sEval 2019) implemented. The best-score teams
sented as a vector with 200 elements. in the competition for the sub-task A, were using
BERT model. BERT is a new technique for Nat-
• And finally, all the models were trained with ural Language Processing (NLP) that was open-
10 epochs. sourced by the researchers at Google AI Language
at the end of 2018. This technique is quite com-
All the models used an Embedding layer at the
plex, so we skip it and look at the teams that ob-
beginning, and a final Dense layer with 2 units and
tained the best results without using BERT (Ma-
’softmax’ as the activation function. The trained
hata et al., 2019). The team that obtained the best
deep learning models are explained bellow:
results without using BERT, also used these mod-
Simple Recurrent Neural Network: Recur-
els explained next. So we decided to try an ap-
rent Neural Networks (RNN) have a good perfor-
proximation of them:
mance in cases of language, because each neuron
BiLSTM + BiGRU: This model uses BiLSTM
or unit can use its internal memory to maintain
and BiGRU. BiGRU are Bidirectional Gated Re-
information about the previous input. RNN have
current Units, they can also be considered as a
loops in them that allow information to be carried
variation on the BiLSTM. For this model, we used
across neurons while reading an input. This allows
a Bidirectional LSTM layer with 196 units with
the network to have context from the beginning of
dropout rate 0.3, followed by a Bidirectional GRU
a sentence, which will allow more accurate pre-
layer with 64 GRU units also with 0.3 as dropout
dictions of a word at the end of a sentence. This
rate. Then, a Max Pooling and Average Pool-
model consisted of a simple Recurrent Neural Net-
ing layers were used and the results were con-
work layer using 32 units.
catenated. Pooling layers provide an approach to
Simple LSTM: Long Short Term Memory
down sampling feature maps by summarizing the
(LSTM) are a special kind of RNN. They are ca-
presence of features in the feature map. We used
pable of learning long-term dependencies, remem-
two different types of Pooling: Average and Max-
bering information for long periods of time. We
imum.
specified the output dimensionality of the LSTM
layer in 32 units. CNN + Global Max Pooling: Convolutional
LSTM: This model was more complex than Neural Network (CNN) with a Global Max Pool-
the previous one. After the Embedding layer we ing layer and a hidden Dense layer with 256 units
added a Dropout layer with rate 0.4. The goal of and ’relu’ as the activation function.
a Dropout layer is to prevent the model from over- CNN (3 filters): Convolutional Neural Net-
fitting. It randomly sets the outgoing neurons to 0 works use convolution filters that generates differ-
at each update of the training phase. In this case, ent feature maps. They are effective in text clas-
the LSTM layer had 196 units, and a dropout with sification because they are able to pick out the
rate 0.25 was applied before the final layer. salient features in a way that is invariant to their
BiLSTM + Dropout: This model consisted position within the input sequence of words. In
of mixing Bidirectional LSTM (BiLSTM) with this model, we used a Dropout layer after the em-
Dropout layers. BiLSTM are bidirectional vari- bedding with dropout rate 0.3. It consisted in three
ants of LSTM. With them, the learning algo- different filters with sizes 2, 3 and 4. For each dif-
rithm is fed with the original data once from ferent filter size, 256 filters are used and a Max
the beginning to the end and once from the end Pooling layer was then applied. The obtained vec-
to the beginning. We used 196 LSTM units tors were concatenated. A Dropout layer with
wrapped by a Bidirectional layer, and 0.25 being dropout rate 0.3 was applied before the input to
the dropout rate, followed by another 196 LSTM the Multi Layer Perceptron with 256 neurons for
units wrapped by a Bidirectional layer, 0.25 being, classification.
as well, the dropout rate. A Flatten layer was ap- A useful solution for complex problems in deep
plied before the final layer. The goal of the Flatten learning models is Ensembling models. This is
a powerful technique which consists in combin- Model Accuracy macro F1
ing different deep learning models, with the aim of Simple RNN 0.7183 0.6866
improving the performance. A new model resulted Simple LSTM 0.7258 0.6978
LSTM 0.7500 0.7338
from ensembling two of the best models explained
BiLSTM+Dropout 0.7417 0.7112
before. In this case, our ensemble combined the BiLSTM+BiGRU 0.7636 0.7338
trained deep learning models: CNN (3 filters) and CNN+GlobalMaxPooling 0.7470 0.7206
BiLSTM+BiGRU. CNN(3 filters) 0.7644 0.7385
Ensemble 0.7644 0.7439
5 Results
Table 2: Results for all the trained models on the vali-
The official measure used to evaluate the perfor-
dation dataset.
mance of the different models established in the
competition is macro-averaged F1-score, due to Model Accuracy macro F1
the imbalance between the number of instances in BiLSTM+BiGRU 0.7884 0.7422
the different classes in the tasks. CNN(3 filters) 0.8093 0.7593
Ensemble 0.8140 0.7626
5.1 Baseline results
The results obtained using the Logistic Regression Table 3: Results for the three best models on the test
classifier are displayed below. We also provided dataset.
the resulted confusion matrix.
The scores of accuracy and F1-score of the 5.3 Comparison with OffensEval 2019
model are the following:
The best results obtained in the competition with-
Model Accuracy macro F1
out using the BERT model were the following:
Model Accuracy macro F1
CNN(3 filters) 0.8395 0.7964
Table 1: Results for Logistic Regression classifier on Ensemble 0.8407 0.8066
the test dataset.
Table 4: Results obtained by MIDAS team on the test
dataset.
This model was ranked 6th out of 104 teams in

OffensEval 2019 (Mahata et al., 2019), as men-
tioned before, it was the best model without using
BERT model.
Our results, would be ranked 32th in last year’s
competition, but this can be due to the data prepro-
cessing. Some of these teams apart from the tech-
niques we used, they also applied normalization
and lemmatization to the initial tweets. Other facts
that can influence on the performance of the sys-
Figure 1: Confusion matrix of Logistic Regression on tem are the different parameters in the deep learn-
test dataset. ing models.
6 Conclusion and Future Work

5.2 Deep learning models results
To conclude, this was an interesting beginning
The results obtained with each model using the and introduction of learning and experiencing how
validation dataset were the following: deep learning models work.
We used the three best models to evaluate each Furthermore, it is important to highlight the dif-
model on the test dataset. The performance of ficulty of configuring all the parameters of the neu-
these models on the testing dataset was the follow- ral networks, and how slow the training process
ing: can be.
Krumholc, and Krystian Koziel. 2019. Nlpr@srpol
at semeval-2019 task 6 and task 5: Linguisyically
enhaced deep learning offensive sentence classifier.
Proceedings of The 13th International Workshop on
Semantic Evaluation (SemEval).
Michael Wiegand, Melanie Siegel, and Josef Ruppen-
hofer. 2018. Overview of the germeval 2018 shared
task on the identification of offensive language. Pro-
ceedings of GermEval.
Marcos Zampieri, Shervin Malmasi, Preslav Nakov,
Sara Rosenthal, Noura Farra, and Ritesh Kumar.
2019a. Predicting the type and target of offensive
Figure 2: Confusion matrix of Ensemble model on the posts in social media. Proceedings of NAACL.
test dataset.
Marcos Zampieri, Shervin Malmasi, Preslav Nakov,
Sara Rosenthal, Noura Farra, and Ritesh Kumar.
2019b. Semeval-2019 task 6: Identyfying and cat-
And facing the future, the project can be im-
egorizing offensive language in social media (offen-
proved in many ways, such as trying the system seval). Proceedings of the 13th International Work-
with a bigger dataset and implementing more pre- shop on Semanti Evaluation (SemEval).
processing techniques, such as normalization of
the tokens, converting emojis to text, and lemma-
tization, etc.
Specially, it would be also interesting to use the
BERT model, learn how does it work and com-
pare the results with the ones obtained using deep
learning models.
Finally, we would like to solve the rest of sub-
tasks (Sub-task B and Sub-task C) proposed in
the competition to evaluate the performance of the
trained models for the different sub-tasks.
References
Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, and
Marcos Zampieri. 2018. Benchmarking aggression
identification in social media. Proceeding of the
First Workshop on Trolling, Aggression and Cyber-
bulling (TRAC).
Debanjan Mahata, Haimin Zhang, Karan Uppal, Ya-
man Kumar, Rajiv Ratn Shah, Simra Shahid, Laiba
Mehnaz, and Sarthak Anand. 2019. Midas at
semeval-2019 task 6: Identifying offensive posts and
targeted offense from twitter. Proceedings of The
13th International Workshop on Semantic Evalua-
tion (SemEval).
Alex Nikolov and Victor Radivchev. 2019. Nikolov-
radivchev at semeval-2019 task 6: Offensive tweet
classification with bert and ensembles. Proceed-
ings of The 13th International Workshop on Seman-
tic Evaluation (SemEval).
Jefffrey Pennington, Richard Socher, and Christofer D.
Manning. 2014. Glove: Global vectors for word
representation. In EMNLP.
Alessandro Seganti, Helena Sobol, Iryna Orlova,
Hannam Kim, Jakub Staniszewski, Tymoteusz
Appendix III: Installation guide
The baseline of the project consists of implementing a Logistic Regression classifier, which
corresponds to the file logistic_regression.py. The rest of the project consists of implementing
different deep learning models, which corresponds to the jupyter
notebook deep_learning.ipynb.
Getting started
- Logistic Regression:
To run the program it is necessary to install the sklearn Python module and install the
Anaconda distribution.
- Deep Learning models:

To run the program, it is necessary to run the different sections in the jupyter notebook.
It imports all the needed modules to run the system.
This code was run in Google CoLab, during the implementation phase. It is
recommended to run the code in CoLab due to the high computational cost.
Prerequisites
Both files are programmed with Python 3.
It is necessary to install:
o Sklearn Python module.
o Anaconda distribution.

It is necessary to install:
o Keras Python module.
o Sklearn Python module.
o You also need to download the dataset and the embedding files. To download
the embedding used in the project you need to download them from this
webpage: https://nlp.stanford.edu/projects/glove/. Then underneath “Download
pre-trained word vectors”, you can choose any of the four options for different
sizes or training datasets. In this project, we use the Wikipedia 2014 +
Gigaword 5 vectors.
50
Running the program
Run the Python file. Note that it is necessary to set the paths for all the needed files.

The Jupyter notebook is organized in different sections:
1. First steps:
It imports all the need modules for the program. Sets the Google Drive directory for
the required files and initializes the main variables.
2. Setting the environment:
It defines all the functions that implement data pre-processing, importing data
(vectorization and tokenization), load pre-trained word embeddings and prediction
and evaluation of the models.
3. Initializing deep learning models:
It contains all the functions that initialize the different deep learning models.
4. SUBTASK A:
It contains all the code that has to be run to try one of the deep learning models for
the subtask A.
5. SUBTASK B:
the subtask B.
6. SUBTASK C:
the subtask C.
Example of execution:
The steps to train and evaluate one deep learning for one subtask are the following. Let’s
show an example for trying the model CNN (3 filters) for the subtask B.
a) Run all the cells in the section 1. First steps.
b) Run all the cells in the section 2. Setting the environment.
c) Run all the cells in the section 3. Initializing deep learning models.
d) Run the cells in the section 5. SUBTASK B. Note that in the second cell of the section
you need to discomment the line of the model you want to try, in this case, the line
calling to the function initialize_CNN_3().
As you can see in the previous example, the first three sections 1-3 need to be run always in
order to initialize all the functions required.
51
Appendix IV: Code
The code used for the project is detailed in this appendix and it is organized as follows:
- Code for the baseline: Logistic Regression -> Python file.

o Code used for Subtask A.
o Code used for Subtask B.
o Code used for Subtask C.
- Code for the deep learning models. -> Jupyter notebook.
o Code used for Subtask A.
o Code used for Subtask B.
o Code used for Subtask C.
52
Identifying and Categorizing Offensive Language in tweets using Deep Learning
models
1. First steps
The next cell, imports the keras module that will be essential to run the code.
import keras
keras.__version__
For the project, the code was run using Google CoLab. It is recommended to run the code using CoLab, due to the high computational cost that
the program requires. Then, it is necessary to import the directory where the dataset is stored on the drive.
from google.colab import drive
drive.mount('/content/drive')
The next cell imports all the necessary modules to run the program.
# IMPORT MODULES
import numpy as np
import re
import pandas as pd

from keras.models import Sequential, Model Fl
from keras.layers import Dense, SimpleRNN, Embedding, concatenate, Input, LSTM, SpatialDropout1D, Dropout, GRU, Bidirectional, Fl
from keras.utils import plot_model
from keras.utils.np_utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from sklearn.metrics import confusion_matrix,classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
In the cell below, all the necessary constants of the program are initialized. NOTE that the les path should be set.
# INITIALIZATION

train_file = 'drive/My Drive/Dataset/olid-training-v1.0.tsv' # set corresponding path for the train dataset file

test_file_a = 'drive/My Drive/Dataset/testset-levela.tsv' # set corresponding path for the test dataset file
test_file_b = 'drive/My Drive/Dataset/testset-levelb.tsv' # set corresponding path for the test dataset file
test_file_c = 'drive/My Drive/Dataset/testset-levelc.tsv' # set corresponding path for the test dataset file

test_labels_a = 'drive/My Drive/Dataset/labels-levela.csv' # set corresponding path for the labels of the test dataset file
test_labels_b = 'drive/My Drive/Dataset/labels-levelb.csv' # set corresponding path for the labels of the test dataset file
test_labels_c = 'drive/My Drive/Dataset/labels-levelc.csv' # set corresponding path for the labels of the test dataset file

maxlen = 200
max_fatures = 10000

# GloVe Pretrained Embeddings
embedding_dim = 200

models_a = list()
models_b = list()
models_c = list()
ensemble_models_a = list()
ensemble_models_b = list()
c =
ensemble_models_c = list()
2. Seting the environment
In this section you can nd all the functions that have to be initialized before training and the models.
Data Preprocessing
The data preprocessing is applied at the cell below. The techniques applied are:
Remove all the special characters and instances of USER and URL.
Split the hashtag into words.
Lowercase all the tweets.
Tokenization.
NOTE that in this cell, the train dataset is split into a train and validation dataset. The validation dataset is the 10% of the original train dataset.
# UTIL FUNCTIONS

# SPLIT HASHTAG INTO WORDS UTIL FUNCTIONS
def replace_hashtag(tweet):
hashtags = find_hashtag(tweet)
for hashtag in hashtags:
split = split_hashtag(hashtag)
tweet = tweet.replace(hashtag, split)
tweet = tweet.replace('#', '')
return tweet

def find_hashtag(tweet):
hashtags = re.findall(r"#(\w+)", tweet)
return hashtags

def split_hashtag(hashtag):
fo = re.compile(r'#[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
fi = fo.findall(hashtag)
result = ''
for var in fi:
result += var + ' '
return result

# DATA PREPROCESSING STEPS
def data_preprocessing(data):
for i, j in data.iterrows():
data.at[i,'tweet'] = replace_hashtag(j['tweet']) # split hashtags into words
data['tweet'] = data['tweet'].apply(lambda x: x.lower()) # lowercase
data['tweet'] = data['tweet'].apply((lambda x: re.sub('[^a-zA-z0-9\s]','',x))) # remove special characters
data['tweet'] = data['tweet'].str.replace('user','') # remove 'user' tokens
data['tweet'] = data['tweet'].str.replace('url','') # remove 'url' tokens

# TRAINING DATA (FOR ALL THE SUBTASKS)
def import_data(train_file, test_file, test_labels, subtask_name):
# IMPORT DATASET
data = pd.read_csv(train_file, sep='\t', header=0)
data = data[['id','tweet', 'subtask_a', 'subtask_b', 'subtask_c']]

# DATA PREPROCESSING
data_preprocessing(data)

# TOKENIZER
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['tweet'].values)
X = tokenizer.texts_to_sequences(data['tweet'].values)
X = pad_sequences(X, maxlen=maxlen)

# REAL TEST DATASET
data_test = pd.read_csv(test_file, sep='\t', header=0)

# DATA PREPROCESSING
# split hashtags into words
for i, j in data_test.iterrows():
data_test.at[i,'tweet'] = replace_hashtag(j['tweet'])
data_test['tweet'] = data_test['tweet'].apply(lambda x: x.lower()) # lowercase
data_test['tweet'] = data_test['tweet'].apply((lambda x: re.sub('[^a-zA-z0-9\s]','',x))) # remove special characters
data_test['tweet'] = data_test['tweet'].str.replace('user','') # remove 'user' tokens
data_test['tweet'] = data_test['tweet'].str.replace('url','') # remove 'url' tokens

labels_test = pd.read_csv(test_labels, sep=',', header=0)
labels_test = labels_test[['id', subtask_name]]
data_test = pd.merge(data_test, labels_test, on='id')

Y = pd.get_dummies(data[subtask_name]).values

# Testing with validation dataset
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10, random_state = 42)

# Testing with original test dataset
X_test_real = tokenizer.texts_to_sequences(data_test['tweet'].values)
X_test_real = pad_sequences(X_test_real, maxlen=maxlen)
Y_test_real = pd.get_dummies(data_test[subtask_name]).values

return X train X test Y train Y test X test real Y test real tokenizer
return X_train, X_test, Y_train, Y_test, X_test_real, Y_test_real, tokenizer
Pretrained word embeddings
In the next cell, the pre-trained word embeddings are loaded, and the embedding matrix is obtained. NOTE that the le path for the embeddings
has to be set.
# PRETRAINED EMBEDDINGS

def pretrained_embeddings(tokenizer):
#embedding_file = open('drive/My Drive/embeddings/glove.6B.100d.txt') # 100-dimensional pre-trained word embeddings - set the f
embedding_file = open('drive/My Drive/embeddings/glove.6B.200d.txt') # 200-dimensional pre-trained word embeddings - set the fi

embeddings_index = {}

for line in embedding_file:
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
embedding_file.close()

print('Found %s word vectors.' % len(embeddings_index))

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

embedding_matrix = np.zeros((max_fatures, embedding_dim))
for word, i in word_index.items():
embedding_vector = embeddings_index.get(word)
if i < max_fatures:
if embedding_vector is not None:
# Words not found in embedding index will be all-zeros.
r =
embedding_matrix[i] = embedding_vector
return embedding_matrix
Prediction and evaluation functions
The next functions predict and plot the results of the prediction for the models using the validation and the test dataset.
Prediction for only one model:
def prediction_and_results(model, X_test, Y_test, X_test_real, Y_test_real):

# PREDICTION FOR ONE MODEL

# PREDICTION USING VALIDATION DATASET
print('Model:', model_name)
print('Model prediction with validation dataset:')

# Prediction for class Model
Y_pred = model.predict(X_test,batch_size = batch_size)
Y_pred = np.argmax(Y_pred,axis=1)

df_test = pd.DataFrame({'true': Y_test.tolist(), 'pred': Y_pred})
df_test['true'] = df_test['true'].apply(lambda x: np.argmax(x))

print("Confusion matrix:", confusion_matrix(df_test.true, df_test.pred))
print(classification_report(df_test.true, df_test.pred, digits=4))

# PREDICTION USING REAL TEST DATASET
print('Model prediction with test dataset:')

# Prediction for class Model
Y_pred_real = model.predict(X_test_real,batch_size = batch_size)
Y_pred_real = np.argmax(Y_pred_real,axis=1)

df_test_real = pd.DataFrame({'true': Y_test_real.tolist(), 'pred':Y_pred_real})
df_test_real['true'] = df_test_real['true'].apply(lambda x: np.argmax(x))

print("Confusion matrix", confusion_matrix(df_test_real.true, df_test_real.pred))
print(classification_report(df_test_real.true, df_test_real.pred, digits=4))
Prediction for ensembling models.
def prediction_and_results_ensemble(ensemble_model, X_test, Y_test, X_test_real, Y_test_real):
# PREDICTION FOR ENSEMBLE MODELS

y_combine = [model.predict(X_test, batch_size = batch_size) for model in ensemble_model]
y_combine = np.array(y_combine)
# sum across ensembles
summed = np.sum(y_combine, axis=0)
# argmax across classes
Y_pred = np.argmax(summed, axis=1)

df_test = pd.DataFrame({'true': Y_test.tolist(), 'pred': Y_pred})
df_test['true'] = df_test['true'].apply(lambda x: np.argmax(x))

print("Confusion matrix:", confusion_matrix(df_test.true, df_test.pred))
print(classification_report(df_test.true, df_test.pred, digits=4))

# PREDICTION USING REAL TEST DATASET
y_combine = [model.predict(X_test_real, batch_size = batch_size) for model in ensemble_model]
y_combine = np.array(y_combine)
# sum across ensembles
summed = np.sum(y_combine, axis=0)
# argmax across classes
Y_pred_real = np.argmax(summed, axis=1)

df_test_real = pd.DataFrame({'true': Y_test_real.tolist(), 'pred':Y_pred_real})
df_test_real['true'] = df_test_real['true'].apply(lambda x: np.argmax(x))

print("Confusion matrix", confusion_matrix(df_test_real.true, df_test_real.pred))
print(classification_report(df_test_real.true, df_test_real.pred, digits=4))

3. Initializing Deep Learning models
The following cells are divided into the different functions that initialize the deep learning models (each cell initializes a different model).
SimpleRNN
def initialize_SimpleRNN(output):
# Simple RNN
model_name = 'Simple RNN'
de

model = Sequential()
#model.add(Embedding(max_fatures, embedding_dim,input_length = maxlen)) # without pretrained embeddings
model.add(Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)) # with pret
model.add(SimpleRNN(32))
model.add(Dense(output, activation='softmax'))
return model, model_name
Simple LSTM
def initialize_SimpleLSTM(output):
# Simple LSTM
model_name = 'Simple LSTM'

#model.add(Embedding(max_features, 32)) # without pretrained embeddings
model.add(LSTM(32))
me
LSTM
def initialize_LSTM(output):
# LSTM
model_name = 'LSTM'

lstm_out = 196

#model.add(Embedding(max_fatures, embedding_dim,input_length = maxlen)) # without pretrained embeddings
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
a
BiLSTM + DROPOUT
def initialize_BiLSTM_Dropout(output):
# BILSTM + DROPOUT
model_name = 'BILSTM + DROPOUT'

lstm_out = 196

#model.add(Embedding(max_fatures, embedding_dim, input_length = maxlen, trainable = True)) #without pretrained embeddings
model.add(Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)) #with pretr
model.add(Dropout(0.25))
model.add(Bidirectional(LSTM(lstm_out, return_sequences=True, recurrent_dropout=0.25)))
model.add(Bidirectional(LSTM(lstm_out, return_sequences=True, recurrent_dropout=0.25)))
model.add(Flatten())
(L
CNN + GLOBAL MAX POOLING
def initialize_CNN_GMP(output):
# CNN + GLOBAL MAX POOLING
model_name = 'CNN + GLOBAL MAX POOLING'

#model.add(Embedding(max_fatures, embedding_dim, input_length = maxlen, trainable = False)) #without pretrained embeddings
model.add(Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)) #with pretr
model.add(Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(256, activation='relu'))
a
CNN (3 lters)
def initialize_CNN_3(output):
# CNN (3 filters)
model_name = 'CNN'

embedding_dim = 200

i = Input(shape=(maxlen,), dtype='int32', name='main_input')
x = Embedding(max_fatures, embedding_dim ,weights = [embedding_matrix], input_length=maxlen, trainable=True)(i)
x = Dropout(0.4)(x)

def get_conv_pool(x_input, max_len, sufix, n_grams=[2,3,4], feature_maps=256):
branches = []
for n in n_grams:
branch = Conv1D(filters=feature_maps, kernel_size=n, activation='relu', name='Conv_'+sufix+'_'+str(n))(x_input)
branch = MaxPooling1D(pool_size=max_len-n+1, strides=None, padding='valid', name='MaxPooling_'+sufix+'_'+str(n))(bran
branch = Flatten(name='Flatten_'+sufix+'_'+str(n))(branch)
branches.append(branch)
return branches

branches = get_conv_pool(x, maxlen, 'dynamic')
z = concatenate(branches, axis=-1)
z1 = Dropout(0.3)(z)
z2 = Dense(256, activation='relu')(z1)
o = Dense(2, activation='softmax')(z2)

model = Model(inputs=i, outputs=o)
BiLSTM + BiGRU
def initialize_BiLSTM_BiGRU(output):
# BILSTM + BIGRU
model_name = 'BILSTM + BIGRU'

lstm_units = 196
gru_units = 64

i = Input(shape=(maxlen,), dtype='int32', name='main_input')
x = Embedding(max_fatures, embedding_dim, weights = [embedding_matrix], input_length=maxlen, trainable=True)(i)
x = Dropout(0.4)(x)
x1 = Bidirectional(LSTM(lstm_units, return_sequences=True, recurrent_dropout=0.3))(x)
x2 = Dropout(0.3)(x1)
x3 = Bidirectional(GRU(gru units, return sequences=True))(x2)
_ _
x4 = Dropout(0.3)(x3)

max_pooling = MaxPooling1D()(x4)
max_pooling = Flatten()(max_pooling)

average_pooling = AveragePooling1D()(x4)
average_pooling = Flatten()(average_pooling)

z1 = concatenate([max_pooling, average_pooling], axis=-1)
z2 = Dense(128, activation='relu')(z1)
o = Dense(2, activation='softmax')(z2)

model = Model(inputs=i, outputs=o)
me
4. SUBTASK A
This section has the code for the implementation of the Subtask A.
# SUBTASK A
# IMPORT DATA
print('IMPORT DATA...')
X_train, X_test_a, Y_train_a, Y_test_a, X_test_real_a, Y_test_real_a, tokenizer = import_data(train_file, test_file_a, test_label
embedding_matrix = pretrained_embeddings(tokenizer)
The following cell initializes, compiles, plots and trains the model. You can only try one model each time.
NOTE that you need to comment the lines that initialize the models that are not used.
# INITIALIZE MODELS
mod
#model, model_name = initialize_SimpleRNN(2)
#model, model_name = initialize_SimpleLSTM(2)
#model, model_name = initialize_LSTM(2)
#model, model_name = initialize_BiLSTM_Dropout(2)
#model, model_name = initialize_CNN_GMP(2)
#model, model_name = initialize_CNN_3(2)
#model, model_name = initialize_BiLSTM_BiGRU(2)

# COMPILE MODEL
print('COMPILE THE MODEL...')
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])

# MODEL SCHEME
print('PLOT THE MODEL...')
print(model.summary())
plot_model(model, show_shapes=True, show_layer_names=True)

# SUBTASK A: FIT THE MODEL
print('FIT THE MODEL...')
batch_size = 128
model.fit(X_train, Y_train_a, epochs = 10, batch_size=batch_size)
When training different deep learning models is useful to save them, in this case we use a list. This will be used also to implement the model
ensembling.
# SAVE THE MODEL TO A LIST
models_a.append(model) # save all the trained models
ensemble_models_a.append(model) # save models for later ensembling
The following cell executes the prediction and evaluation for the model that was trained last, which is saved in the variable model.
print('PREDICT FOR ONE MODEL...')
prediction_and_results(model, X_test_a, Y_test_a, X_test_real_a, Y_test_real_a)
The next cell only needs to be run, if you want to try ensembling models. To ensemble different deep learning models is necessary to train them
separately and save them in the list ensemble_models.
# PREDICTION FOR ENSEMBLES
print('PREDICT FOR ENSEMBLE...')
prediction_and_results_ensemble(ensemble_models_a, X_test_a, Y_test_a, X_test_real_a, Y_test_real_a)
5. SUBTASK B
This section has the code for the implementation of the Subtask B.
# SUBTASK B
# IMPORT DATA
X_train, X_test_b, Y_train_b, Y_test_b, X_test_real_b, Y_test_real_b, tokenizer = import_data(train_file, test_file_b, test_label
# INITIALIZE MODELS
#model, model_name = initialize_SimpleRNN(2)
mod

# COMPILE MODEL
model.add(Dense(2, activation='softmax'))

# MODEL SCHEME

# SUBTASK B: FIT THE MODEL
batch_size = 128
model.fit(X_train, Y_train_b, epochs = 10, batch_size=batch_size)
ensembling.
models_b.append(model) # save all the trained models
ensemble_models_b.append(model) # save models for later ensembling
prediction_and_results(model, X_test_b, Y_test_b, X_test_real_b, Y_test_real_b)
prediction_and_results_ensemble(ensemble_models_b, X_test_b, Y_test_b, X_test_real_b, Y_test_real_b)
6. SUBTASK C
This section has the code for the implementation of the Subtask C.
# SUBTASK C
# IMPORT DATA
X_train, X_test_c, Y_train_c, Y_test_c, X_test_real_c, Y_test_real_c, tokenizer = import_data(train_file, test_file_c, test_label
# INITIALIZE MODELS
#model, model name = initialize SimpleRNN(2)
mod

# COMPILE MODEL
model.add(Dense(3, activation='softmax'))

# MODEL SCHEME

# SUBTASK C: FIT THE MODEL
batch_size = 128
model.fit(X_train, Y_train_c, epochs = 10, batch_size=batch_size)
ensembling.
models_c.append(model) # save all the trained models
ensemble_models_c.append(model) # save models for later ensembling
prediction_and_results(model, X_test_c, Y_test_c, X_test_real_c, Y_test_real_c)
prediction_and_results_ensemble(ensemble_models_c, X_test_c, Y_test_c, X_test_real_c, Y_test_real_c)
Glossary
A list of all acronyms and the meaning they stand for.
AI: Artificial intelligence

SemEval: International Workshop on Semantic Evaluation
OLID: Offensive Language Identification Dataset
CSV: Comma Separated Values
RNN: Recurrent Neural Networks
LSTM: Long-Short Term Memory
BiLSTM: Bidirectional Long-Short Term Memory
GRU: Gated Recurrent Unit
BiGRU: Bidirectional Gated Recurrent Unit
CNN: Convolutional Neural Networks
NLP: Natural Language Processing
53

Degree Thesis BertaVinas.V0.5

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Degree Thesis BertaVinas.V0.5

Uploaded by

Copyright:

Available Formats

Identifying and Categorizing Offensive Language in tweets

using Machine Learning

Advisor: Jose Antonio Lázaro Villa

Barcelona, January 2020

L’objectiu d’aquesta tesi és el desenvolupament d’un sistema per identificar i categoritzar el

Per al desenvolupament del projecte, s'apliquen diferents models d'aprenentatge automàtic.

El objetivo de esta tesis es el desarrollo de un sistema para identificar y categorizar el

Para el desarrollo del proyecto, se aplican diferentes modelos de aprendizaje automático. La

Revision Date Purpose

0 10/12/2019 Document creation

1 22/12/2019 Document revision

2 10/01/2019 Document revision

3 14/01/2020 Document revision

4 25/01/2020 Document revision

DOCUMENT DISTRIBUTION LIST

Berta Viñas Redondo bertavr@hotmail.com

Jose Antonio Lázaro Villa subdir.internacional@etsetb.upc.edu

Pierre Nugues pierre.nugues@cs.lth.se

Written by: Reviewed and approved by:

Date 10/12/2019 Date 21/01/2020

Name Berta Viñas Redondo Name Jose Antonio Lázaro Villa

Position Project Author Position Project Supervisor

Figure 1: Work packages structure. .........................................................................................15

Table 1: Final work plan. ..........................................................................................................14

1.1. Project background

1.2. Statement of purpose

1.3. Project requirements

1.4. Project specifications

The project specifications are the following:

The final work plan is distributed in the following work packages:

WP# Short title Milestone / deliverable Date (week)

11 Training Trained system. 15/10/2019

Figure 1: Work packages structure.

Figure 4: Internal structure of WP6 for subtask C.

Figure 5: Final Gantt diagram.

Figure 6: Distribution of the hours dedicated to the project.

2.1. Introduction to machine learning

Figure 7: AI, Machine learning and Deep learning

Machine learning algorithms need three essential things:

2.1.1. Logistic Regression classifier

Figure 8: Logistic function.

Figure 9: Data split.

Figure 10: Input data structure used in the project.

2.2. Introduction to deep learning

2.2.1. Introduction to neural networks

Figure 11: Basic scheme of machine learning. [2]

2.2.2. Data representation for neural networks

2.3. Deep learning for text sequences

2.3.1. Preparing the text data

Figure 12: Pandas DataFrame example.

2.3.2. Working with text data

Text can be understood as either a sequence of characters or a sequence of words, although

2.3.2.1. One-hot encoding

2.3.2.2. Word embeddings

There are two different ways to use word embeddings:

2.3.3. Recurrent Neural Networks

2.3.3.2. LSTM and GRU

Figure 15: Basic example of LSTM.

2.3.4. Convolutional Neural Networks

Figure 16: Basic example of convolution.

2.3.5. Ensembling models

3.1. Task description

Figure 18: Data distribution in OLID dataset.