Text Classification
- by TV Harshawardhan (COE17B005)
Problem Statement
Using the different pretrained word embeddings available, we perform text
classification for both binary-class and multi-class tasks.
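As a rough sketch of this pipeline (not the project's exact code), one can embed each text with a pretrained model and fit a simple classifier on top; the model name, the mean pooling, and the toy messages below are illustrative assumptions.

```python
# Sketch: pretrained embeddings + a linear classifier on top.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool the last hidden states into one vector per text."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state        # (batch, seq, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)      # zero out padding tokens
    return ((out * mask).sum(1) / mask.sum(1)).numpy()

# Toy data; the real experiments use the datasets described below.
X_train = embed(["free prize, claim now!", "see you at lunch"])
clf = LogisticRegression().fit(X_train, ["spam", "ham"])
print(clf.predict(embed(["win cash today"])))
```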
Datasets used
The SMS Spam Collection is a set of SMS messages tagged for SMS spam research. It contains 5,574 messages in English, each tagged as ham (legitimate) or spam.
This corpus was collected from free or free-for-research sources on the Internet:
● A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of the spam messages in these claims is a hard and time-consuming task that involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
● A subset of 3,375 ham messages randomly chosen from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university, and were collected from volunteers who were made aware that their contributions would be made publicly available.
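A minimal loading sketch, assuming the UCI release of the corpus as a single tab-separated file named SMSSpamCollection (label, then message); the split ratio and random seed are illustrative choices.

```python
# Sketch: load the SMS Spam Collection and make a stratified split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("SMSSpamCollection", sep="\t", header=None,
                 names=["label", "text"])            # 5,574 rows: ham / spam
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2,
    stratify=df["label"], random_state=42)           # preserve class ratio
print(df["label"].value_counts())                    # class distribution
```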
Datasets used
The BBC datasets are made available for non-commercial and research purposes only, and all data is provided in a pre-processed matrix format.
⇒ Consists of 2,225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005.
⇒ Class labels: 5 (business, entertainment, politics, sport, tech)
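A loading sketch assuming the raw-text release of the corpus, unpacked as one folder per class (bbc/business, bbc/sport, and so on); the pre-processed matrix release would instead be read with a Matrix Market loader such as scipy.io.mmread.

```python
# Sketch: load the raw-text BBC corpus, one subfolder per class label.
from sklearn.datasets import load_files

bbc = load_files("bbc", encoding="utf-8", decode_error="ignore")
print(bbc.target_names)   # ['business', 'entertainment', 'politics', 'sport', 'tech']
print(len(bbc.data))      # 2225 documents
```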
Datasets used
The IMDB dataset for binary sentiment classification contains substantially more data than previous
benchmark datasets: a set of 25,000 highly polar movie reviews for training and 25,000
for testing.
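One convenient way to obtain exactly this split is the Hugging Face datasets package, sketched below; the authors also distribute the reviews as raw text files.

```python
# Sketch: fetch the IMDB train/test split (25,000 reviews each).
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb["train"].num_rows, imdb["test"].num_rows)   # 25000 25000
print(imdb["train"][0]["label"])                       # 0 = negative, 1 = positive
```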
Datasets used
EmoBank comprises 5k sentences balancing multiple genres. It is special for having two kinds of
double annotations: each sentence was annotated according to both the emotion expressed by the
writer and the emotion perceived by the readers. Also, a subset of the corpus has been previously
annotated according to Ekman's six Basic Emotions (Strapparava and Mihalcea, 2007), so that
mappings between both representation formats become possible. The data has the following classes:
⇒ Anger
⇒ Joy
⇒ Sad
⇒ Fear
⇒ Love
⇒ Surprise
Embeddings Used
BERT
GPT-2
XLNet
T5
FLAIR
Transformer Model
BERT
BERT makes use of the Transformer, an attention mechanism that learns contextual relations between
words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms:
an encoder that reads the text input and a decoder that produces a prediction for the task. Since
BERT's goal is to generate a language model, only the encoder mechanism is necessary.
As opposed to directional models, which read the text input sequentially (left-to-right or
right-to-left), the Transformer encoder reads the entire sequence of words at once. It is therefore
considered bidirectional, though it would be more accurate to say that it is non-directional. This
characteristic allows the model to learn the context of a word from all of its surroundings (left
and right of the word).
BERT is trained on Wikipedia and the BooksCorpus. BERT works best when fine-tuned on small
datasets or short texts.
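A minimal fine-tuning sketch with the Hugging Face Transformers library; the optimizer, learning rate, and the toy ham/spam batch are illustrative assumptions, not the settings behind the reported results.

```python
# Sketch: one fine-tuning step of BERT for binary classification.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)               # e.g. ham vs. spam
model.train()

batch = tokenizer(["free prize!!!", "meeting at 5pm"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])                        # 1 = spam, 0 = ham

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
out = model(**batch, labels=labels)                  # loss computed internally
out.loss.backward()
optimizer.step()
print(out.logits.argmax(dim=-1))                     # predicted classes
```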
BERT
In technical terms, the prediction of the output words requires:
1. Adding a classification layer on top of the encoder output.
2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
3. Calculating the probability of each word in the vocabulary with softmax.
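These three steps are bundled inside Transformers' BertForMaskedLM head; the sketch below shows them at work for a single masked position (the example sentence is made up).

```python
# Sketch: predict a masked word with BERT's masked-language-model head.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

enc = tokenizer("The SMS said you won a [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                     # (1, seq, vocab_size)
mask_pos = (enc["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(dim=-1)          # step 3: softmax over vocab
print(tokenizer.convert_ids_to_tokens(int(probs.argmax())))
```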
GPT-2
GPT-2, being a decoder model, is better suited to generative tasks such as next-word prediction and machine translation.
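A sketch of this left-to-right, decoder-style usage: GPT-2 scores the next token given only the preceding context (the prompt is a made-up example).

```python
# Sketch: next-token prediction with GPT-2's language-model head.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

enc = tokenizer("The weather today is", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                     # (1, seq, vocab)
next_id = int(logits[0, -1].argmax())                # most likely continuation
print(tokenizer.decode([next_id]))
```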
[Results table: Tasks x Datasets with Accuracy, Precision, Recall, F1-Score, and class Distribution; values not recoverable.]
T5
The text-to-text framework, on the contrary, suggests using the same model, the same loss function, and the
same hyperparameters for all NLP tasks. In this approach, the inputs are modeled in such a way that
the model recognizes the task, and the output is simply the "text" version of the expected output.
If you wish to train a model from scratch using a large dataset, T5 can be used.
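A sketch of the text-to-text idea with a public T5 checkpoint; "sst2 sentence:" is one of the task prefixes from T5's pretraining mixture, used here only to show that the label comes back as plain text.

```python
# Sketch: task named in the input string, label returned as text.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tokenizer("sst2 sentence: this movie was wonderful",
                return_tensors="pt")
out = model.generate(**enc, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # e.g. "positive"
```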
[Results table: Tasks x Datasets with Accuracy, Precision, Recall, F1-Score, and class Distribution; values not recoverable.]
FLAIR
Contextual string embeddings leverage the internal states of a trained character language model to produce a novel type of
word embedding. In simple terms, they use the internal states of a trained character model, so that words can have
different meanings in different sentences.
Note: a language (or character) model is a probability distribution over words (or characters) such that every new word or
character depends on the words or characters that came before it.
1. The words are trained as characters (without any notion of words), so it works similarly to character embeddings.
2. The embeddings are contextualised by their surrounding text. This implies that the same word can have different
embeddings depending on the context, quite similar to natural human language: the same word may have
different meanings in different situations (see the sketch below).
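A sketch with the flair library showing the context dependence from point 2: the same word receives different vectors in different sentences ("news-forward" is one of the library's published character LMs).

```python
# Sketch: the word "bank" gets different contextual string embeddings.
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embedding = FlairEmbeddings("news-forward")    # pretrained character LM

s1 = Sentence("deposit the money in the bank")
s2 = Sentence("we sat on the river bank")
embedding.embed(s1)
embedding.embed(s2)

v1 = s1[5].embedding                           # vector for "bank" in s1
v2 = s2[5].embedding                           # vector for "bank" in s2
print((v1 - v2).norm())                        # non-zero: context-dependent
```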
[Results table: Tasks x Datasets with Accuracy, Precision, Recall, F1-Score, and class Distribution; values not recoverable.]