
Universität Hamburg

Department Informatik
Knowledge Technology, WTM

Exploring Homoscedastic Uncertainty for
Weighing Losses in Multitask Learning

Seminar Paper

Neural Networks

Anirban Bhowmick
Matr.Nr. 7251843
anirban.bhowmick@informatik.uni-hamburg.de

08.05.2020

Abstract
Multitask learning (MTL) covers a broad spectrum of neural network applications, ranging from general machine learning tasks to Natural Language Processing (NLP) problems such as part-of-speech (POS) tagging, and Computer Vision problems such as road object detection for autonomous driving systems. In this paper we discuss popular MTL methodologies and recent advances, giving an idea of how MTL works and how it can be implemented. To demonstrate MTL, we take a sentiment analysis model from our previous implementation seminar paper 1. We then perform part-of-speech (POS) tagging in parallel with sentiment analysis to determine how carrying out simultaneous tasks with a shared neural model improves the model's performance.
Keywords: Sentiment Analysis, Multitask Learning (MTL), Natural Language Processing (NLP), Computer Vision

Contents
1 Introduction
2 Related Work
3 MTL Methods Descriptions
  3.1 Hard parameter sharing
  3.2 Soft parameter sharing
4 Dataset
  4.1 Stanford Sentiment Treebank (SST)
  4.2 Pre-Processing
5 Word Embeddings
  5.1 FastText
6 Long Short Term Memory (LSTM)
7 Experiment
8 Conclusion
Bibliography

1 Anirban Bhowmick and Syed Saif Hasan. Investigating effect of different pre-trained word embeddings on Sentiment Analysis.


1 Introduction
When solving a task (for example, classification) with a neural network, we train a model to detect objects of a given class. What if we need to identify multiple classes? Instead of using a separate neural network for each class, we can train a single neural network and use what it learns on one task to solve other tasks (detecting the other classes in this example). This can improve the overall performance of the network. This approach is called multitask learning (MTL).
In traditional Machine Learning (ML), we typically focus on optimizing a particular metric by training a single model or a group of models and then fine-tuning them until their performance reaches an optimum. While we may generally achieve considerable performance this way, by focusing on a single task we ignore information that might help us improve the metric we care about. Specifically, this information comes from the training signals of related tasks. In MTL, by sharing representations between related tasks, we enable our model to generalize better on our original task. MTL is used in all areas of ML, such as Natural Language Processing [4], speech recognition [5], and computer vision [6]. MTL is also known by different names: joint learning, learning to learn, and learning with auxiliary tasks. In our study, POS tagging is the auxiliary task and sentiment analysis is the main task. Whenever we optimize more than one loss function, we are technically doing MTL under the hood (although MTL encompasses more than just optimizing losses). Throughout this paper, I will discuss MTL with respect to deep neural networks and some frequently used MTL methodologies. Subsequently, I will describe my previous sentiment analysis task, the integration of the auxiliary POS tagging task, and how it improves the overall learning process.

2 Related Work
Multitask Learning (MTL) has the objective of improving the performance of a model for a set of tasks when compared to using a separate model for each task [2]. Performance can be defined in terms of learning efficiency and prediction accuracy. MTL can be considered an approach of inductive knowledge transfer which improves generalization by sharing domain information between complementary tasks. It does this by using a shared representation to learn multiple tasks [8]. MTL-based computer vision approaches often share the convolutional layers, while the task-specific fully-connected layers are learned separately. [9] improved these models by proposing Deep Relationship Networks. Many other computer vision approaches focus on semantic tasks, such as classification and semantic segmentation [12] or classification and detection [13]. MultiNet [14] is an architecture for detection, classification and semantic segmentation, while Cross-Stitch networks [15] combine multi-task neural activations. Instead of learning the structure of the shared layers, Kendall et al., 2017 [11] took a radical approach by considering the uncertainty of each task. They adjusted each task's relative weight in the cost function by deriving a multi-task loss function based on maximizing the Gaussian likelihood with task-dependent uncertainty. In NLP, recent work has focused on finding better task hierarchies for multi-task learning: [10] demonstrated that low-level NLP tasks, such as part-of-speech tagging and named entity recognition, should be supervised at lower layers when used as auxiliary tasks. Our model learns to analyze movie review sentiments on the fine-grained Stanford Sentiment Treebank (SST) and combines that with POS tagging of each review document. Finally, our goal is to demonstrate that using an auxiliary task like POS tagging improves the overall learning accuracy compared to the original task of predicting sentiments alone, as depicted in 1.

3 MTL Methods Descriptions


Here we discuss the methodologies used to implement MTL. They include hard parameter sharing and soft parameter sharing.

3.1 Hard parameter sharing


Hard parameter sharing is the most common approach to MTL in neural networks and dates back to the work of Caruana [16]. It involves sharing the hidden layers across all tasks while keeping separate task-specific output layers. Hard parameter sharing significantly reduces the risk of over-fitting: [17] showed that the risk of over-fitting the shared parameters is an order N (where N is the number of tasks) smaller than that of over-fitting the task-specific output layers. This implies that the more tasks we learn simultaneously, the more our model has to find a representation that captures all of them, and the smaller our chance of over-fitting on our original task.
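To make this concrete, the following is a minimal tf.keras sketch of hard parameter sharing (not the exact model used later in this paper): two hypothetical classification tasks share the hidden layers, and only the output heads are task-specific. The input size, layer widths, head sizes and loss weights are illustrative assumptions.

```python
# Minimal sketch of hard parameter sharing with tf.keras (assumed setup:
# two classification tasks over the same 100-dimensional input features).
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(100,))

# Shared hidden layers: every task updates these weights.
shared = layers.Dense(128, activation="relu")(inputs)
shared = layers.Dense(64, activation="relu")(shared)

# Task-specific output layers ("heads").
task_a = layers.Dense(5, activation="softmax", name="task_a")(shared)
task_b = layers.Dense(3, activation="softmax", name="task_b")(shared)

model = Model(inputs=inputs, outputs=[task_a, task_b])

# One loss per head; the optimizer minimizes their (optionally weighted) sum.
model.compile(
    optimizer="adam",
    loss={"task_a": "categorical_crossentropy",
          "task_b": "categorical_crossentropy"},
    loss_weights={"task_a": 1.0, "task_b": 0.5},
)
```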

3.2 Soft parameter sharing


In soft parameter sharing, each task has its own model with its own dedicated layers. The parameters of the models are then regularized to encourage them to be similar. For instance, [18] used the l2 distance between parameters for regularization, while [19] used the trace norm.
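For comparison, here is a minimal sketch of soft parameter sharing under similarly assumed toy settings (it is not the code of [18] or [19]): each task keeps its own model, and the squared l2 distance between their first hidden layers is added to the joint loss so the parameters are pulled towards each other.

```python
# Minimal sketch of soft parameter sharing: two task models keep their own
# layers; the l2 distance between their hidden-layer weights is a regularizer.
import tensorflow as tf
from tensorflow.keras import layers

def make_task_model(num_classes):
    return tf.keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(100,)),
        layers.Dense(num_classes, activation="softmax"),
    ])

model_a, model_b = make_task_model(5), make_task_model(3)
cce = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()
coupling = 1e-3  # strength of the similarity penalty (assumed value)
all_vars = model_a.trainable_variables + model_b.trainable_variables

@tf.function
def train_step(x_a, y_a, x_b, y_b):
    with tf.GradientTape() as tape:
        loss = cce(y_a, model_a(x_a)) + cce(y_b, model_b(x_b))
        # Regularize the hidden-layer weights of the two models towards each
        # other (their output layers differ in shape and remain free).
        penalty = tf.reduce_sum(
            tf.square(model_a.layers[0].kernel - model_b.layers[0].kernel))
        loss += coupling * penalty
    grads = tape.gradient(loss, all_vars)
    optimizer.apply_gradients(zip(grads, all_vars))
    return loss
```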

Figure 1: Hard parameter sharing and Soft parameter sharing block diagram for
multi-task learning in deep neural networks as mentioned in [1]


4 Dataset
4.1 Stanford Sentiment Treebank (SST)
In this paper, we use the Stanford Sentiment Treebank (SST) dataset [20]. This dataset is one of the most popular datasets for evaluating NLP models due to its novelty and granularity.2
This dataset was prepared from the Rotten Tomatoes dataset of [21] and contains 11,855 single sentences extracted from movie reviews. The sentences were further parsed into 215,154 unique phrases using the Stanford Parser of Klein and Manning, 2003, and were annotated by 3 human judges via Amazon Mechanical Turk. Instead of binary sentiment classes (positive/negative), there are a total of 5 classes: 1 - very bad, 2 - bad, 3 - neutral, 4 - good, 5 - very good, where 1 is the most negative sentiment and 5 the most positive.

Figure 2: Categories of Labels from negative to positive

4.2 Pre-Processing
The pre-processing starts with the extraction of the root sentences from the dataset. The sentences are then encoded into numerical values and the labels are converted into a binary one-hot representation.
The dataset was split into three subsets comprising 8,544 training samples, 1,101 validation samples, and 2,210 test samples.
The data was then processed into its encodings. First, we tokenized the text into a sequence of words, since the embedding models read individual words instead of complete sentences. The tokens were then encoded into numbers that are easily readable by our model. We defined a standard encoding length equal to the longest text encoding, which is 50 in our case, so that every review has a standardized encoding representation. Shorter encoded sequences were padded with zeros until they reached this standard length. The dataset contains 5 output labels, as shown in figure 2, which were converted into one-hot encodings. We also created a dictionary of words from the training set to fetch their respective embeddings from the pre-trained embedding model. There were a total of 17,386 unique words in the SST training set dictionary.
For the POS task, we generate POS tags for each word token of a review. A vocabulary of tags is built based on their frequencies (counts), and each tag is then represented as a one-hot vector.
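The following is a minimal sketch of this pre-processing pipeline, assuming the raw sentences and their 5-class labels are already available as Python lists; the variable names, the two toy sentences and the use of NLTK for POS tags are illustrative assumptions.

```python
# Minimal sketch of the pre-processing described above (toy data).
import numpy as np
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

MAX_LEN = 50  # standardized sequence length used in this paper

sentences = ["a gorgeous and deeply moving film", "flat and unconvincing"]
labels = [4, 1]  # 0..4 after shifting the 1..5 sentiment scale

# Tokenize the text and encode each word as an integer index.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
encoded = tokenizer.texts_to_sequences(sentences)

# Pad every review with zeros up to the standard length.
x = pad_sequences(encoded, maxlen=MAX_LEN, padding="post")

# One-hot encode the 5 sentiment labels.
y_sentiment = to_categorical(labels, num_classes=5)

# POS tags for the auxiliary task, one tag per token (NLTK is one option).
# nltk.download("averaged_perceptron_tagger")  # required once
pos_tags = [[tag for _, tag in nltk.pos_tag(s.split())] for s in sentences]
```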
2 https://nlp.stanford.edu/sentiment/code.html


5 Word Embeddings
Word embeddings are a popular way of representing a document's vocabulary in a machine-readable format. A network cannot understand the semantic and syntactic relationships between words from their textual representation alone, so we need to provide it with numerical representations of the word vocabulary. Word embeddings serve that purpose by creating a vector representation of each word. Many popular word embeddings exist, such as GloVe (Global Vectors) [26] and fastText [27]. In this task, we use fastText embeddings pre-trained on Wikipedia (including subword information).

5.1 FastText
FastText was developed by Facebook. It is an efficient library for word representations and sentence classification. It is an extension of the word2vec model in which each word is represented as a bag of character n-grams, and these character n-gram vectors are summed to form the overall word embedding [22]. It also performs well on rare words, since their character n-grams are shared with other words, and it can produce an embedding for a word that was not present in the training vocabulary. FastText is built upon the word2vec model [23], which implements a skip-gram neural network model or a continuous bag of words (CBoW) model. FastText uses either the skip-gram model or CBoW for building word representations, and each has its own way of doing so. The skip-gram model predicts a target word from a nearby word, whereas the CBoW model predicts the target word according to its context; the context is represented as a bag of the words contained in a fixed-size window around the target word. For example, consider the sentence "The salesman was selling fine leather jackets" with the target word "fine". A skip-gram model tries to predict the target using a random close-by word like "salesman" or "leather", while the CBoW model predicts the target word considering all the surrounding words like "was", "selling", "leather", "jackets".
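As an illustration of how such pre-trained vectors can be queried, here is a minimal sketch using gensim's fastText loader, assuming the pre-trained Wikipedia binary model has been downloaded locally (the file name is an assumption); subword information lets it return a vector even for out-of-vocabulary words.

```python
# Minimal sketch of loading pre-trained fastText vectors with gensim and
# querying a word that may be out of vocabulary (file path is an assumption).
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors("wiki.en.bin")  # pre-trained Wikipedia model

vec_known = wv["salesman"]       # regular lookup, 300-dimensional vector
vec_oov = wv["salesmanship123"]  # built from character n-grams (subwords)

print(vec_known.shape, vec_oov.shape)  # (300,) (300,)
```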

6 Long Short Term Memory (LSTM)


LSTM is a variant of the RNN (figure 3) 1 and has been chosen over a classical RNN for this problem because it mitigates the vanishing gradient problem through its gated cell structure. To understand the motivation for choosing an LSTM, we have to begin with how neural networks deal with problems where previous knowledge is required for the prediction at the next stage.
Let us take a real-life example: when we humans read a book up to a certain page, put it down, and later continue reading, we start from the line where we left off, relate it to the knowledge we gathered when we last read, and use this knowledge to connect with the ideas in the next line. Unfortunately, classical neural networks cannot remember what they have come across. For that, we have Recurrent Neural Networks (RNNs) [24], which have loops in them.


At each timestep, an RNN passes its output as input to the next stage. But suppose we have the passage "Ben is from Berlin and he currently lives in Hamburg. He has a friend Paul whom he met at the international movie festival in Berlin. Both Ben and Paul grew up in Berlin, studying at the same school. Paul told him that he has a craze for ?". Here the network can determine that the missing word refers to some hobby (in this case "movies"), but it cannot determine which hobby specifically. For this, it needs the context of the movie festival from a certain number of stages back. Thus, as the gap between the required relevant information and the position where it is needed increases, the RNN cannot remember it. For that, we have the LSTM (Long Short-Term Memory) [25]. An LSTM is a type of RNN capable of learning long-term dependencies, i.e. remembering information for a long time.

Figure 3: A typical RNN sequence and the long-term dependency problem. The inside of an RNN unit is shown as a simple structure with just a Tanh function 1

As shown in figure 4, each cell in an LSTM can add or remove certain information via structures called gates. They are formed of a sigmoid layer, a tanh layer, and some element-wise arithmetic operations. First, the sigmoid layer looks at the current input xt and the previous output ht−1 and decides which information to keep and which to throw away via an output between 0 and 1 (0 means discard and 1 means keep). Owing to this, the sigmoid layer is called the 'forget gate layer'. This can be represented mathematically as

ft = σ(Wf · [ht−1, xt] + bf)    (1)

So, from the example above, the sigmoid layer can ignore words such as 'Berlin' and 'school' and keep the word 'movies'. The tanh layer then creates a vector of candidate values from the input.


Figure 4: An LSTM unit with a much more complex configuration (Sigmoid (Si) and Tanh functions) compared to a typical RNN 1

It is then multiplied with the output of a sigmoid function to determine which values to keep and which to discard. The result of the product is then added to the memory of the previous state to generate the memory of the current state.
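A minimal NumPy sketch of the forget gate in equation (1) may help make the computation concrete; the vector sizes and random initialization are assumptions purely for illustration.

```python
# Minimal NumPy sketch of the forget gate in equation (1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.standard_normal((hidden_size, hidden_size + input_size))  # W_f
b_f = np.zeros(hidden_size)                                         # b_f

h_prev = rng.standard_normal(hidden_size)  # h_{t-1}: previous output
x_t = rng.standard_normal(input_size)      # x_t: current input

# f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f): values near 0 forget,
# values near 1 keep the corresponding entry of the cell memory.
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)  # four values in (0, 1)
```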

7 Experiment
The main goal of our experiment is to analyze homoscedastic uncertainty for weighing the losses of two tasks (this is still to be implemented; at this point in time we are performing the multiple tasks in parallel). For this purpose, we perform sentiment analysis alone as task 1, and sentiment analysis together with POS tagging (the auxiliary task) as task 2. We then compare task 1 and task 2 to analyze whether multitask learning improves the accuracy of the original task. Finally, we plan to integrate the uncertainty weighting.
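For reference, a minimal sketch of how such uncertainty-based weighting could look once integrated (it is not part of our current implementation): following the practical formulation of Kendall et al. [11], each task receives a learnable log-variance, each task loss is scaled by the exponential of its negative log-variance, and the log-variances themselves are added as regularization terms.

```python
# Sketch of homoscedastic-uncertainty loss weighting in the spirit of
# Kendall et al. [11]; loss_sentiment and loss_pos are assumed to be the
# per-batch scalar task losses.
import tensorflow as tf

# One learnable log-variance (s_i = log sigma_i^2) per task.
log_var_sentiment = tf.Variable(0.0, trainable=True)
log_var_pos = tf.Variable(0.0, trainable=True)

def combined_loss(loss_sentiment, loss_pos):
    # L = sum_i exp(-s_i) * L_i + s_i: tasks with high estimated uncertainty
    # are down-weighted, while the +s_i term keeps sigma from growing forever.
    return (tf.exp(-log_var_sentiment) * loss_sentiment + log_var_sentiment
            + tf.exp(-log_var_pos) * loss_pos + log_var_pos)
```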
For task 1, we initialized a Keras embedding layer with the word vectors of the training dictionary. The dictionary has a total of 15,334 words and each word is represented by a 300-dimensional vector (given by 'n dim model'). The embedding layer feeds the data into an LSTM layer with 50 timesteps, each accepting a 300-dimensional vector; the 50 timesteps correspond to the standardized review length of 50 tokens. The output dimension of the LSTM layer is 300. To handle overfitting, a dropout ratio of 20% is used. The final dense layer uses a softmax classifier to classify among the 5 review categories.
For task 2, the sentiment analysis part is built in the same way as in task 1. For POS tagging, a dictionary of POS tags is created and their one-hot encodings are passed to the model as training targets. At each timestep of the LSTM output, the POS tagging layer takes the word representation and predicts the POS tag for that word. The experiment architecture is shown in figure 5.
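The following is a hedged sketch of how the task-2 architecture described above could be expressed in Keras: a shared embedding and LSTM, a sentiment head over the final LSTM state, and a per-token POS head over the per-timestep outputs. Hyper-parameters follow the text where given; the POS tag dictionary size and other details are assumptions.

```python
# Sketch of the task-2 multitask model (shared embedding + LSTM, sentiment
# head and per-token POS head); not a verified reproduction of our code.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 15334 + 1   # +1 for the zero padding index
EMBED_DIM = 300
MAX_LEN = 50
NUM_SENTIMENTS = 5
NUM_POS_TAGS = 45        # assumed size of the POS tag dictionary

# embedding_matrix would hold the pre-trained fastText vectors (zeros here).
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype="float32")

tokens = tf.keras.Input(shape=(MAX_LEN,), name="tokens")
emb = layers.Embedding(
    VOCAB_SIZE, EMBED_DIM, mask_zero=True, trainable=False,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
)(tokens)

# Shared LSTM: per-timestep outputs feed the POS head, the final hidden
# state feeds the sentiment head.
seq, state_h, _ = layers.LSTM(300, return_sequences=True,
                              return_state=True)(emb)
drop = layers.Dropout(0.2)(state_h)

sentiment = layers.Dense(NUM_SENTIMENTS, activation="softmax",
                         name="sentiment")(drop)
pos = layers.TimeDistributed(layers.Dense(NUM_POS_TAGS, activation="softmax"),
                             name="pos")(seq)

model = Model(inputs=tokens, outputs=[sentiment, pos])
model.compile(optimizer="adam",
              loss={"sentiment": "categorical_crossentropy",
                    "pos": "categorical_crossentropy"})
model.summary()
```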


Figure 5: Proposed diagram of the model

8 Conclusion
We have shown that MTL is a powerful and promising approach. The automatic detection of objects for autonomous driving can benefit from it just as much as email spam filters, both in quality and efficiency. However, the technology should also be viewed with caution, as it is not yet mature. In many examples, we noticed that a poor implementation can quickly lead to more failures than successes.


References
[1] Sebastian Ruder. An Overview of Multi-Task Learning in Deep Neural Networks. https://arxiv.org/pdf/1706.05098.pdf

[2] S. Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, pages 640–646. Morgan Kaufmann, 1996.

[3] J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research (JAIR), 12:149–198, 2000.

[4] Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. Proceedings of the 25th International Conference on Machine Learning - ICML '08, pages 160–167.

[5] Deng, L., Hinton, G. E., and Kingsbury, B. (2013). New types of deep neural network learning for speech recognition and related applications: An overview. International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603.

[6] Deng, L., Hinton, G. E., and Kingsbury, B. (2013). New types of deep neural network learning for speech recognition and related applications: An overview. International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603.

[7] Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448.

[8] R. Caruana. Multitask learning. In Learning to Learn, pages 95–133. Springer, 1998.

[9] Long, M. and Wang, J. (2015). Learning Multiple Tasks with Deep Relationship Networks. arXiv preprint arXiv:1506.02117.

[10] Søgaard, A. and Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 231–235.

[11] Kendall, A., Gal, Y., and Cipolla, R. (2017). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. https://arxiv.org/pdf/1705.07115.pdf

[12] Y. Liao, S. Kodagoda, Y. Wang, L. Shi, and Y. Liu. Understand scene categories by objects: A semantic regularized scene classifier using convolutional neural networks. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 2318–2325. IEEE, 2016.

[13] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. International Conference on Learning Representations (ICLR), 2014.

[14] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. MultiNet: Real-time joint semantic reasoning for autonomous driving. arXiv preprint arXiv:1612.07695, 2016.

[15] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.

[16] Caruana, R. (1993). Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning.

[17] Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28:7–39.

[18] Duong, L., Cohn, T., Bird, S., and Cook, P. (2015). Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pages 845–850.

[19] Yang, Y. and Hospedales, T. (2017). Deep Multi-task Representation Learning: A Tensor Factorisation Approach. In Proceedings of ICLR 2017.

[20] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Association for Computational Linguistics (ACL 2013).

[21] B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pages 115–124.

[22] Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. Enriching Word Vectors with Subword Information. https://arxiv.org/pdf/1607.04606.pdf

[23] Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/pdf/1301.3781.pdf

[24] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. How to Construct Deep Recurrent Neural Networks. https://arxiv.org/pdf/1312.6026.pdf

[25] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.

[26] Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation. Association for Computational Linguistics (ACL), 2014.

[27] Armand Joulin, Edouard Grave, Piotr Bojanowski and Tomas Mikolov. Bag of Tricks for Efficient Text Classification. Facebook AI Research, 2016.

