
Text Classification
by TV Harshawardhan (COE17B005)
Problem Statement
Using different pretrained word embeddings, we perform text classification for both binary-class and
multi-class problems.
Datasets used
The SMS Spam Collection is a set of SMS messages that have been collected for SMS spam research. It contains one set of
5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

This corpus has been collected from free or free-for-research sources on the Internet:
● A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell
phone users make public claims about SMS spam messages, most of them without reporting the very spam message received.
The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully
scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].

● A subset of 3,375 SMS randomly chosen ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000
legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The
messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected
from volunteers who were made aware that their contributions were going to be made publicly available.
Datasets used
The BBC datasets are made available for non-commercial and research purposes only, and all data is
provided in pre-processed matrix format. This corpus has been collected from free or free-for-research
sources on the Internet:
⇒ Consists of 2,225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
⇒ Class labels: 5 (business, entertainment, politics, sport, tech)
Datasets used
The IMDB dataset for binary sentiment classification contains substantially more data than previous
benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000
for testing.
Datasets used
EmoBank comprises 5k sentences balancing multiple genres. It is special for having two kinds of
double annotation: each sentence was annotated according to both the emotion expressed by the
writer and the emotion perceived by the readers. Also, a subset of the corpus has previously been
annotated according to Ekman's 6 Basic Emotions (Strapparava and Mihalcea, 2007), so that mappings
between both representation formats become possible. The data has the following classes:
⇒ Anger
⇒ Joy
⇒ Sad
⇒ Fear
⇒ Love
⇒ Surprise
Embeddings Used
BERT

GPT-2

XLNET

T5

FLAIR
Transformer Model
BERT
BERT makes use of Transformer, an attention mechanism that learns contextual relations between
words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms
— an encoder that reads the text input and a decoder that produces a prediction for the task. Since
BERT’s goal is to generate a language model, only the encoder mechanism is necessary.

As opposed to directional models, which read the text input sequentially (left-to-right or
right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is
considered bidirectional, though it would be more accurate to say that it’s non-directional. This
characteristic allows the model to learn the context of a word based on all of its surroundings (left
and right of the word).

BERT is trained on Wikipedia and BooksCorpus. BERT is better suited to small datasets or short
texts.
BERT
In technical terms, the prediction of the output words requires:

1. Adding a classification layer on top of the encoder output.


2. Multiplying the output vectors by the embedding matrix, transforming them into the
vocabulary dimension.
3. Calculating the probability of each word in the vocabulary with softmax (sketched below).
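The three steps above can be sketched at the tensor level as follows (a minimal illustration using random tensors and the BERT-base dimensions, not the code used in this work; 30,522 is the BERT-base WordPiece vocabulary size):

    import torch

    # Encoder output for a batch of 1 sequence of 10 sub-words (BERT-base hidden size 768)
    hidden_states = torch.randn(1, 10, 768)
    # WordPiece embedding matrix: 30,522 vocabulary entries x 768 dimensions
    embedding_matrix = torch.randn(30522, 768)

    # 1. Classification layer on top of the encoder output
    transform = torch.nn.Linear(768, 768)
    transformed = transform(hidden_states)        # (1, 10, 768)

    # 2. Multiply by the embedding matrix to reach the vocabulary dimension
    logits = transformed @ embedding_matrix.T     # (1, 10, 30522)

    # 3. Softmax gives the probability of each vocabulary word at every position
    probs = torch.softmax(logits, dim=-1)

The results obtained with BERT on the two tasks are summarized in the table below.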
Task                         Dataset                Accuracy   Class   Precision   Recall   F1-Score   Distribution
Binary Classification        Spam Classification    99%        0       1.00        1.00     1.00       13%
                                                               1       0.99        1.00     1.00       87%
Multi-Class Classification   BBC-text               97.52%     0       0.97        0.96     0.97       22%
                                                               1       1.00        0.99     0.99       22%
                                                               2       0.94        0.98     0.96       18%
                                                               3       0.98        0.99     0.98       18%
                                                               4       0.99        0.96     0.98       20%
GPT-2
The GPT-2 architecture is very similar to the decoder-only Transformer. GPT-2 was, however, a very large
transformer-based language model trained on a massive dataset. Whereas BERT, as we saw, is an
encoder-only model, GPT-2 is a decoder-only model, so it is used less often for classification, where it
gives lower accuracy, and much more often for next-word prediction.

Since GPT-2 is a decoder model, it is better suited to tasks such as next-word prediction and machine translation.
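As a quick contrast with the classification setting, next-word prediction with GPT-2 can be sketched using the Hugging Face transformers pipeline API (the prompt here is only an illustrative SMS-style sentence, not taken from the dataset):

    from transformers import pipeline

    # Load the pretrained GPT-2 decoder as a text-generation pipeline
    generator = pipeline("text-generation", model="gpt2")

    # GPT-2 continues the prompt one token at a time (next-word prediction)
    print(generator("The SMS said you have won a free", max_new_tokens=5))

The classification results obtained with GPT-2 are summarized in the table below.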
Task                         Dataset                Accuracy   Class   Precision   Recall   F1-Score   Distribution
Binary Classification        Spam Classification    99%        0       0.99        0.99     0.99       13%
                                                               1       0.96        0.95     0.95       87%
Multi-Class Classification   BBC-text               97%        0       0.94        0.92     0.93       22%
                                                               1       0.99        0.99     0.99       22%
                                                               2       0.93        0.98     0.95       18%
                                                               3       1.00        0.99     1.00       18%
                                                               4       0.97        0.95     0.96       20%
XLNET
XLNet is a generalized autoregressive model in which the next token depends on all previous
tokens. XLNet is "generalized" because it captures bidirectional context by means of a
mechanism called "permutation language modeling". It integrates the idea of autoregressive
models and bidirectional context modeling while overcoming the disadvantages of BERT. It
outperforms BERT on 20 tasks, often by a large margin, including question answering,
natural language inference, sentiment analysis, and document ranking.

XLNet is trained on BooksCorpus [40] and English Wikipedia.
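A minimal inference sketch with a sequence-classification head on top of XLNet (assuming the Hugging Face transformers and sentencepiece packages; the five labels mirror the BBC-text setup above and the example headline is invented):

    import torch
    from transformers import XLNetTokenizer, XLNetForSequenceClassification

    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    # 5 output labels for the BBC-text categories
    model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=5)

    inputs = tokenizer("Shares rallied after the quarterly earnings report.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # one logit per category
    predicted_class = logits.argmax(dim=-1).item()   # index into the 5 classes (0-4)

The classification head is randomly initialized until fine-tuned on the task. The results obtained with XLNet are summarized in the table below.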


Task                         Dataset                Accuracy   Class   Precision   Recall   F1-Score   Distribution
Binary Classification        Spam Classification    99%        0       0.99        0.99     0.99       13%
                                                               1       0.99        0.99     0.99       87%
Multi-Class Classification   BBC-text               97%        0       1.00        0.89     0.94       22%
                                                               1       1.00        1.00     1.00       22%
                                                               2       1.00        1.00     1.00       18%
                                                               3       1.00        0.97     0.98       18%
                                                               4       1.00        1.00     1.00       20%
Text-to-Text Transfer Transformer (T5)
Consider the example of a BERT-style architecture that is pre-trained on a Masked LM and Next Sentence
Prediction objective and then fine-tuned on downstream tasks (for example, predicting a class label in
classification or the span of the input in QnA). Here, we separately fine-tune different instances of the
pre-trained model on different downstream tasks.

The text-to-text framework, by contrast, suggests using the same model, the same loss function, and the
same hyperparameters for all NLP tasks. In this approach, the inputs are modeled in such a way that
the model recognizes the task, and the output is simply the "text" version of the expected output.

If you wish to train the model from scratch using a large dataset, T5 can be used.
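A minimal sketch of this text-to-text framing (assuming the Hugging Face transformers library and the t5-small checkpoint; the "sst2 sentence:" prefix is one of the task prefixes T5 saw during its multi-task pre-training, not part of this report):

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The task is encoded in the input text itself; the label comes back as text
    inputs = tokenizer("sst2 sentence: This movie was a complete waste of time.",
                       return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=3)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "negative"

The results obtained with T5 are summarized in the table below.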
Task                         Dataset                Accuracy   Class   Precision   Recall   F1-Score   Distribution
Binary Classification        IMDB                   97%        0       0.95        0.94     0.95       13%
                                                               1       0.94        0.95     0.95       87%
Multi-Class Classification   Emotion Dataset        94%        0       0.40        0.92     0.93       13%
                                                               1       1.00        0.90     0.88       11%
                                                               2       0.71        0.94     0.95       34%
                                                               3       0.94        0.86     0.84       8%
                                                               4       0.96        0.97     0.97       29%
                                                               5       0.81        0.97     0.72       3.3%
Flair
Context is so vital when working on NLP tasks. Learning to predict the next character based on previous characters forms
the basis of sequence modeling.

Contextual String Embeddings leverage the internal states of a trained character language model to produce a novel type of
word embedding. In simple terms, they use certain internal states of a trained character model so that words can have
different meanings in different sentences.

Note: A language or character model is a probability distribution over words or characters such that every new word or
character depends on the words or characters that came before it.

There are two primary factors powering contextual string embeddings:

1. The words are trained as characters (without any notion of words), i.e., it works similarly to character embeddings.
2. The embeddings are contextualised by their surrounding text. This implies that the same word can have different
embeddings depending on the context, much like natural human language: the same word may have
different meanings in different situations (see the sketch below).
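A short sketch of the second point (assuming the flair package; the two example sentences are invented to show the word "bank" in different contexts):

    from flair.data import Sentence
    from flair.embeddings import FlairEmbeddings

    # Forward character-level language model trained on news text
    embedding = FlairEmbeddings("news-forward")

    s1 = Sentence("I deposited cash at the bank.")
    s2 = Sentence("We sat on the river bank.")
    embedding.embed(s1)
    embedding.embed(s2)

    # The same surface word "bank" gets a different vector in each sentence
    bank1 = next(t for t in s1.tokens if t.text == "bank")
    bank2 = next(t for t in s2.tokens if t.text == "bank")
    print(bank1.embedding[:5])
    print(bank2.embedding[:5])

The results obtained with Flair embeddings are summarized in the table below.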
Task                         Dataset                Accuracy   Class   Precision   Recall   F1-Score   Distribution
Binary Classification        Spam Classification    99%        0       0.99        0.98     0.99       13%
                                                               1       0.98        0.92     0.90       87%
Multi-Class Classification   BBC-text               98%        0       0.97        0.92     0.94       22%
                                                               1       0.97        1.00     0.98       22%
                                                               2       0.94        0.98     0.96       18%
                                                               3       1.00        0.98     0.99       18%
                                                               4       0.95        0.96     0.95       20%
