
NLP-Issues and Challenges

“If it is true that there is always more than one way of construing a text, it is not true that all interpretations are equal.”

Dr. Saptarsi Goswami
Assistant Professor, Computer Science, Bangabasi Morning College
 Importance of Text Processing
 Document Vectorization
 Word Vectorization
 Sequence Models
Text Processing
Text (natural language) is the most natural way of encoding human knowledge. Text is by far the most common type of information encountered by people. Text is the most expressive form of information.

Some Statistics (as of 2017)


 656 million tweets per day!
 4.3 BILLION Facebook messages posted daily
 5.2 BILLION daily Google Searches
 22 billion texts sent every day.
The Initial Casual Talks

 Sentiment analysis: “Best roast chicken in San Francisco!” / “The waiter ignored us for 20 minutes.”
 Question answering (QA): Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?
 Spam detection: “Let’s go to Agra!” ✓  “Buy a Sports Cycle” ✗
 Coreference resolution: Carter told Mubarak he shouldn’t run again.
 Word sense disambiguation (WSD): I need new batteries for my mouse.
 Paraphrase: “XYZ acquired ABC yesterday” ≈ “ABC has been taken over by XYZ”
 Part-of-speech (POS) tagging: Colorless green ideas sleep furiously. → ADJ ADJ NOUN VERB ADV
 Parsing: I can see Alcatraz from the window!
 Summarization: “The Dow Jones is up”, “The S&P500 jumped”, “Housing prices rose” → “Economy is good”
 Named entity recognition (NER): Einstein [PERSON] met with UN [ORG] officials in Princeton [LOC]
 Machine translation (MT): 第 13 届上海国际电影节开幕… → The 13th Shanghai International Film Festival opens…
 Dialog: “Where is Citizen Kane playing in SF?” → “Castro Theatre at 7:30. Do you want a ticket?”
 Information extraction (IE): “You’re invited to our dinner party, Friday May 27 at 8:30” → add “Party, May 27” to calendar
The not-so-rosy but important work

 Regular expressions: regular expressions can be used to specify strings we might want to extract from a document, from transforming “I need X” in ELIZA, to defining strings like $199 or $24.99 for extracting tables of prices from a document.

 Text normalization, in which regular expressions play an important part. Normalizing text means converting it to a more convenient, standard form.

 Tokenization

 Hashtags, Emoticons

 Lemmatization: the task of determining that two words have the same root, despite their surface differences. For example, the words sang, sung, and sings are forms of the verb sing.

 Stemming: a simpler, cruder form of lemmatization.
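A minimal sketch of these steps using Python's re module and NLTK (this assumes NLTK is installed with its punkt and wordnet resources downloaded; the sample text is made up for illustration):

import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "I need new batteries; the chargers cost $24.99 and $199."

# Regular expression: extract price-like strings such as $199 or $24.99
prices = re.findall(r"\$\d+(?:\.\d{2})?", text)

# Normalization: lowercasing is the simplest standardization step
normalized = text.lower()

# Tokenization: split the normalized text into word tokens
tokens = nltk.word_tokenize(normalized)      # requires nltk.download('punkt')

# Lemmatization: map surface forms to a common root (batteries -> battery)
lemmatizer = WordNetLemmatizer()             # requires nltk.download('wordnet')
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

# Stemming: a cruder, rule-based alternative (chargers -> charger)
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

print(prices, lemmas, stems, sep="\n")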


Text Classification

 Is this spam?
 What is the subject of this article?
 Male or female author?

“By 1925 present-day Vietnam was divided into three parts under French colonial rule.”
“Clara never failed to be astonished by the extraordinary felicity of her own name.”
Language Models
• Models that assign probabilities to sequences of words are called language models or LMs.
• Probability of a Sentence
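A bigram model is one simple way to assign such probabilities; below is a minimal sketch over a toy corpus (the corpus and the <s>/</s> boundary markers are assumptions for illustration):

from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries
corpus = [
    "<s> the quick brown fox </s>",
    "<s> the lazy dog </s>",
    "<s> the quick dog </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def sentence_probability(sentence):
    """P(w1..wn) approximated as the product of bigram probabilities
    P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    words = sentence.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

print(sentence_probability("<s> the quick dog </s>"))   # about 0.33 on this toy corpus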
Document Vectorization
 What is the structure of the data?

 Is text structured?

 How can we get structure out of it?
Bag of Words
 A word is just a single token, often known as a unigram or 1-gram
 We already know that the Bag of Words model doesn’t consider order of words
 An N-gram is basically a collection of word tokens from a text document such
that these tokens are contiguous and occur in a sequence.
 Bi-grams indicate n-grams of order 2 (two words), Tri-grams indicate n-grams of
order 3 (three words), and so on.
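A short sketch of the Bag of Words and bi-gram representations using scikit-learn's CountVectorizer (the toy corpus is an assumption; get_feature_names_out requires scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Bag of Words: unigram counts, word order is discarded
bow = CountVectorizer(ngram_range=(1, 1))
X_bow = bow.fit_transform(docs)
print(bow.get_feature_names_out())

# Bi-grams: contiguous pairs of tokens partially recover word order
bigram = CountVectorizer(ngram_range=(2, 2))
X_bi = bigram.fit_transform(docs)
print(bigram.get_feature_names_out())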
The tf-idf weighting (the ‘-’ here is a hyphen, not a minus sign) is the product of two terms, each capturing one of two intuitions: the term frequency tf(t, d), which rewards terms that occur often in the document, and the inverse document frequency idf(t) = log(N / df(t)), which down-weights terms that occur in many documents.

Example:

Term       Raw value   tf-idf
#Python    3           7.972732
#And       7           7.670779

tf-idf boosts distinctive terms such as #Python relative to common ones such as #And, which appear across many documents.
TF-IDF Vectorizer

class sklearn.feature_extraction.text.TfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
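A minimal usage sketch with the default parameters above (the toy documents are an assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "python is great for nlp and data science",
    "nlp and machine learning with python",
    "cooking recipes and kitchen tips",
]

vectorizer = TfidfVectorizer()          # l2-normalized tf-idf, smoothed idf by default
X = vectorizer.fit_transform(docs)      # sparse matrix: documents x vocabulary

# Inspect the weight of a particular term in the first document
vocab = vectorizer.vocabulary_
print(X[0, vocab["python"]])            # terms rare across documents get higher weights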
 Let’s talk about news articles. Based on what topic the news is about, the words contained in the article will be different.

 An article on sports will have some words which will be different from one on entertainment.

• Each document is a mixture of topics.
  In a 3-topic model we could assert that a document is 70% about topic A, 30% about topic B, and 0% about topic C.

• Every topic is a mixture of words.
  A topic is considered a probabilistic distribution over multiple words.
Term-to-Term Co-occurrence

The ratio P(w1, w2) / (P(w1) P(w2)) gives us an estimate of how much more the two words co-occur than we would expect by chance; its logarithm is the pointwise mutual information (PMI).
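A minimal sketch of computing this quantity from co-occurrence counts; all the counts below are purely illustrative assumptions:

import math

# Assumed corpus statistics (illustrative numbers only)
total_pairs = 100000          # total number of co-occurrence pairs observed
count_w1 = 500                # occurrences of word 1
count_w2 = 800                # occurrences of word 2
count_w1_w2 = 40              # co-occurrences of word 1 and word 2 within a window

p_w1 = count_w1 / total_pairs
p_w2 = count_w2 / total_pairs
p_w1_w2 = count_w1_w2 / total_pairs

# PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ):
# how much more often the words co-occur than expected by chance
pmi = math.log2(p_w1_w2 / (p_w1 * p_w2))
print(pmi)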

How can we find the document vector now?

 Non-negative matrix factorization (NMF): the latent representation and latent features are non-negative.
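One way to obtain such document vectors is to factorize the document-term matrix; here is a sketch with scikit-learn's NMF (the corpus and number of topics are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the team won the football match",
    "the actor starred in a new movie",
    "the film festival screened the movie",
    "the player scored in the match",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Factorize X into W @ H with non-negative factors:
# W gives each document's weights over latent topics (the document vector),
# H gives each topic's weights over the vocabulary
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_
print(W)   # document-topic matrix: one low-dimensional vector per document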
LDA - Under the Hood

 Latent Dirichlet Allocation (LDA) is a probabilistic model used in natural language processing (NLP) for topic modeling, developed by David Blei, Andrew Ng, and Michael Jordan in 2003.
 LDA assumes that documents are mixtures of topics, and topics are mixtures of words.
 The goal of LDA: given a collection of documents, the algorithm tries to infer the likely topics and the distribution of topics for each document.
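A minimal sketch with scikit-learn's LatentDirichletAllocation (the corpus and number of topics are assumptions; gensim's LdaModel is another common choice):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match",
    "the player scored a goal in the match",
    "the actor starred in the new movie",
    "the film festival screened the movie",
]

# LDA works on raw term counts, not tf-idf
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic distribution (rows sum to 1)

# Top words per topic: each topic is a distribution over the vocabulary
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")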
(Source: Dipanjan Sarkar)
 Topic coherence is a measure used to evaluate topic models: methods that automatically generate topics from a collection of documents using latent variable models.
 Each generated topic consists of words, and topic coherence is applied to the top N words from the topic. It is defined as the average / median of the pairwise word-similarity scores of the words in the topic (e.g., PMI).
 A good model will generate coherent topics, i.e., topics with high topic coherence scores. Good topics are topics that can be described by a short label; this is what the topic coherence measure should capture.
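A sketch of computing coherence with gensim's CoherenceModel (the toy texts and the choice of the 'c_v' measure are assumptions; gensim must be installed):

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [
    ["team", "won", "football", "match"],
    ["player", "scored", "goal", "match"],
    ["actor", "starred", "new", "movie"],
    ["film", "festival", "screened", "movie"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Coherence over the top-N words of each topic; 'c_v' uses a sliding-window,
# PMI-style word-similarity score averaged over word pairs
cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())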
Word Vectorization

 To compare words, we also need to represent a word in the form of a vector.
 This process is known as embedding.
 One of the simplest schemes to do this is one-hot encoding.
 For a toy example, assume that our corpus has only 8 words; the third word is Apple, the fifth word is Orange and the eighth word is School:

 Apple   0 0 1 0 0 0 0 0
 Orange  0 0 0 0 1 0 0 0
 School  0 0 0 0 0 0 0 1

Drawbacks:
 Length of the vector (grows with the vocabulary)
 Redundancy
 No notion of similarity or distance
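A minimal numpy sketch of this one-hot scheme (the 0-based index mapping is an assumption that mirrors the word positions above):

import numpy as np

vocab_size = 8
# Positions are 1-based on the slide: Apple is the 3rd word, Orange the 5th, School the 8th
word_index = {"Apple": 2, "Orange": 4, "School": 7}   # 0-based indices

def one_hot(word):
    vec = np.zeros(vocab_size, dtype=int)
    vec[word_index[word]] = 1
    return vec

print(one_hot("Apple"))    # [0 0 1 0 0 0 0 0]
print(one_hot("Orange"))   # [0 0 0 0 1 0 0 0]

# The drawbacks are visible directly: the vector length grows with the vocabulary,
# and every pair of distinct words is equally far apart (no notion of similarity)
print(np.dot(one_hot("Apple"), one_hot("Orange")))   # 0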
Using TF-IDF & PMI
New Representation of Data (PCA)
The Neural World
The full story
Neural World at a Glance
vector[Queen] = vector[King] - vector[Man] + vector[Woman]
the quick brown fox jumps over the lazy dog

Context          Target
[quick, fox]     brown
[brown, jumps]   fox
[fox, over]      jumps
Softmax and probability distribution
The skip-gram model’s aim is to predict the context from the target word.

the quick brown fox jumps over the lazy dog

Target   Context
quick    the
quick    brown
brown    quick
brown    fox
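A sketch of training a skip-gram model with gensim's Word2Vec on this example sentence (all hyperparameters are illustrative assumptions):

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

# sg=1 selects skip-gram (predict context from target); window=2 yields
# (target, context) pairs like those in the table above
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["fox"][:5])            # learned embedding (first 5 dimensions)
print(model.wv.most_similar("fox"))   # nearest neighbours in embedding space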
 It turns out that treating it as a multi-class classification problem makes it very computationally expensive to train.

 So it is converted to a binary classification problem.

 Valid (target, context) pairs belong to class 1 and invalid ones belong to class 0.

Target   Context   Label
quick    the       1
quick    brown     1
brown    quick     1
brown    fox       1
brown    sky       0
quick    biscuit   0
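A sketch of how such labelled pairs could be generated (window size and number of negatives are assumptions; real word2vec draws negatives from a smoothed unigram distribution rather than uniformly):

import random

sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(sentence))
window = 1
negatives_per_pair = 1

pairs = []
for i, target in enumerate(sentence):
    # Positive pairs: (target, context) within the window, labelled 1
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j == i:
            continue
        pairs.append((target, sentence[j], 1))
        # Negative pairs: (target, random word), labelled 0
        for _ in range(negatives_per_pair):
            noise = random.choice(vocab)
            if noise != sentence[j]:
                pairs.append((target, noise, 0))

print(pairs[:6])   # e.g. ('the', 'quick', 1), ('the', 'dog', 0), ...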
 Creating a general corpus is extremely difficult.

 Running the model to learn the embeddings is a mammoth task.

 Once these embeddings are learned, they are typically ready to be used in some downstream application like sentiment analysis or question answering.

 One solution is using pre-trained models.

 The embeddings are learned on a large generic corpus.

 https://nlp.stanford.edu/projects/glove/
 glove-wiki-gigaword-50 (65 MB)
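A sketch of loading these pre-trained vectors through gensim's downloader (requires gensim and a network connection on first use):

import gensim.downloader as api

# Downloads and caches the 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")

print(glove["king"][:5])                                  # 50-d vector, first 5 values
print(glove.most_similar(positive=["king", "woman"],
                         negative=["man"], topn=3))       # analogy: expect 'queen' near the top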
 LSA-based models utilize global statistical information but do poorly at the analogy task.

 Prediction-based models focus on local context.


 The FastText model was first introduced by Facebook in 2016 as an extension and supposed improvement of the vanilla Word2Vec model.

 The problem with traditional models is that they do not give good embeddings for rare words and ignore the morphological structure of a word.

 The FastText model considers each word as a bag of character n-grams. This is also called a subword model.

 Taking the word where and n = 3 (tri-grams) as an example, it will be represented by the character n-grams <wh, whe, her, ere, re> and the special sequence <where> representing the whole word. Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where.
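A minimal sketch of this subword decomposition (the helper function is hypothetical, not FastText's actual implementation):

def char_ngrams(word, n=3):
    """Return the character n-grams of a word plus the whole-word sequence,
    using '<' and '>' as boundary markers."""
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    grams.append(marked)          # special sequence for the whole word
    return grams

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
print(char_ngrams("her"))     # ['<he', 'her', 'er>', '<her>']  -- '<her>' differs from the tri-gram 'her'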

(Source: Dipanjan Sarkar)
The movie is not good.
(Figure source: Géron)
 As some of the previous information is being stored, this brings in the concept of memory.
 A more generalized structure will allow us to differentiate between what is being stored and what is being fed to the next layer.

 A new component Ct is introduced, which is the long-term memory.

 Ht is the short-term memory part, and it is used in computing Ct.

 Xt is the current input.

 Ct-1 and Ht-1 are the inputs carried over from the previous time step.

 There are a total of 4 operations happening: g works similarly to a vanilla RNN cell.

 The other three work as a kind of controller (gates).
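A numpy sketch of one LSTM cell step showing the four operations; all weights here are random placeholders, not a trained model:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the four operations:
    'g' (candidate, like a vanilla RNN cell) and the forget/input/output gates."""
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])    # candidate memory
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])    # forget gate: what to keep of c_{t-1}
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])    # input gate: how much of g to add
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # output gate: what to expose as h_t
    c_t = f * c_prev + i * g          # long-term memory update
    h_t = o * np.tanh(c_t)            # short-term memory, computed from c_t
    return h_t, c_t

# Toy dimensions: 4-dimensional input, 3-dimensional state
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((3, 4)) for k in "gfio"}
U = {k: rng.standard_normal((3, 3)) for k in "gfio"}
b = {k: np.zeros(3) for k in "gfio"}
h, c = np.zeros(3), np.zeros(3)
h, c = lstm_step(rng.standard_normal(4), h, c, W, U, b)
print(h, c)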


• A classic example is machine translation.
• Given the following two sentences, how do you determine if Teddy is a person or not? “Teddy bears are on sale!” and “Teddy Roosevelt was a great President!”
History of LLM
Foundational Models

WHAT IS A FOUNDATION MODEL?

In recent years, a new successful paradigm for building AI systems has emerged: train one model on a huge amount of data and adapt it to many applications. We call such a model a foundation model.

 Model architecture (standard blocks, generic, reusable, parallel)
 Data (huge amounts of data)
 Computing power (FLOPs)
 Prompt engineering (a meta-language)


Foundational Model Continued

• A foundation model is thus a transformer model that has been trained on supercomputers on billions of records of data and billions of parameters.

• The model can then perform a wide range of tasks with no further fine-tuning.

• Thus, the scale of foundation models is unique.

• These fully trained models are often called engines. Only GPT-3, Google BERT, and a handful of transformer engines can thus qualify as foundation models.
Typical NLP Applications

Sentiment Analysis
NER
Question Answering
Summarization
Translation
Text Generation
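A sketch of trying a few of these applications with the Hugging Face transformers pipeline API (the default checkpoints downloaded by pipeline() are assumptions, and models are fetched on first use):

from transformers import pipeline

# Sentiment analysis
sentiment = pipeline("sentiment-analysis")
print(sentiment("Best roast chicken in San Francisco!"))

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Einstein met with UN officials in Princeton"))

# Summarization
summarizer = pipeline("summarization")
print(summarizer("The Dow Jones is up, the S&P 500 jumped and housing prices rose.",
                 max_length=20, min_length=5))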
LLM Leaderboard
Few of our NLP Papers
Ansar W, Goswami S, Chakrabarti A, Chakraborty B. A novel selective learning based transformer encoder
architecture with enhanced word representation. Applied Intelligence. 2023 Apr;53(8):9424-43.
Ansar, W., Goswami, S., Chakrabarti, A. and Chakraborty, B., 2023. TexIm: A Novel Text-to-Image
Encoding Technique Using BERT. In Computer Vision and Machine Intelligence: Proceedings of CVMI
2022 (pp. 123-139). Singapore: Springer Nature Singapore.
Roy, P., Brahma, A., Goswami, S. and Sen, S., 2022, January. Effective Sentiment Analysis of Bengali
Corpus by Using the Machine Learning Approach. In International Conference on Data Management,
Analytics & Innovation (pp. 61-74). Singapore: Springer Nature Singapore.
Ansar, W. and Goswami, S., 2021. Combating the menace: A survey on characterization and detection of
fake news from a data science perspective. International Journal of Information Management Data
Insights, 1(2), p.100052.
Ansar, W., Goswami, S., Chakrabarti, A. and Chakraborty, B., 2021. An efficient methodology for aspect-
based sentiment analysis using BERT through refined aspect extraction. Journal of Intelligent & Fuzzy
Systems, 40(5), pp.9627-9644.
 Dey Sarkar, S., Goswami, S., Agarwal, A. and Aktar, J., 2014. A novel feature selection technique for text
classification using Naive Bayes. International scholarly research notices, 2014.
