Let’s go to Agra! ✓
Buy a Sports Cycle ✗
Carter told Mubarak he shouldn’t run again.
Word sense disambiguation (WSD): I need new batteries for my mouse.
Part-of-speech (POS) tagging: Colorless/ADJ green/ADJ ideas/NOUN sleep/VERB furiously/ADV.
Parsing: I can see Alcatraz from the window!
Named entity recognition (NER): Einstein (PERSON) met with UN (ORG) officials in Princeton (LOC).
Information extraction (IE): You’re invited to our dinner party, Friday May 27 at 8:30. → Party, May 27, add
Paraphrase: XYZ acquired ABC yesterday. / ABC has been taken over by XYZ.
Summarization: The Dow Jones is up. The S&P500 jumped. Housing prices rose. → Economy is good.
Machine translation (MT): 第 13 届上海国际电影节开幕… → The 13th Shanghai International Film Festival…
Dialog: Where is Citizen Kane playing in SF? / Castro Theatre at 7:30. Do you want a ticket?
The not-so-rosy but important work
Regular expression: Regular expressions can be used to specify strings we might want to
extract from a document, from transforming “I need X” in ELIZA above, to defining strings
like $199 or $24.99 for extracting tables of prices from a document.
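Both uses mentioned above can be sketched with Python's `re` module. This is a minimal illustration, not a full ELIZA or price parser: the price pattern, the function names, and the reply template are assumptions made for the example.

```python
import re

# Assumed price pattern: a dollar sign, digits, and an optional two-digit cents part.
PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")

def extract_prices(text):
    """Return every price-like string found in the text."""
    return PRICE_RE.findall(text)

def eliza_reply(utterance):
    """Rewrite 'I need X' with a hypothetical ELIZA-style reply template."""
    return re.sub(r"^I need (.+?)\.?$", r"Why do you need \1?", utterance,
                  flags=re.IGNORECASE)

print(extract_prices("The basic plan is $199 and the add-on costs $24.99."))
print(eliza_reply("I need a holiday."))
```

The same idea scales to other extraction targets (dates, phone numbers) by swapping the pattern.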
Text normalization, in which regular expressions play an important part.
Normalizing text means converting it to a more convenient, standard form.
Tokenization
Hashtags, Emoticons
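A rough sketch of a regex tokenizer that keeps hashtags and a few emoticons as single tokens instead of splitting them into punctuation. The pattern is an assumption for illustration and covers only a handful of emoticon shapes.

```python
import re

# Assumed token pattern; alternatives are tried left to right.
TOKEN_RE = re.compile(
    r"""
    \#\w+                  # hashtags
    | [:;=][-']?[)(DPp]    # a few common emoticons
    | \w+(?:'\w+)?         # words, optionally with an internal apostrophe
    | [^\w\s]              # any other single punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Split text into tokens, keeping hashtags and emoticons whole."""
    return TOKEN_RE.findall(text)

print(tokenize("Loving #NLP class :) can't wait!"))
```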
Lemmatization: the task of determining that two words have the same root, despite their
surface differences. For example, the words sang, sung, and sings are forms of the verb
sing.
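As a toy illustration of the idea, the sketch below maps the three forms from the example to their lemma with a lookup table. The table is hypothetical; real lemmatizers rely on a full dictionary plus morphological rules.

```python
# Hypothetical lookup table from surface form to lemma.
LEMMA_TABLE = {"sang": "sing", "sung": "sing", "sings": "sing"}

def lemmatize(word):
    """Return the lemma if the form is known, otherwise the lowercased word."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

print([lemmatize(w) for w in ["sang", "sung", "sings"]])
```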
Is text Structured?
#Python 3
#And 7
tf-idf
#Python 7.972732
#And 7.670779
TF-IDF Vectorizer
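The contrast between raw counts and tf-idf weights can be sketched in plain Python. This assumes the textbook weighting tf × log(N / df); library vectorizers typically use smoothed variants, so their numbers (including the values listed above) will differ. The three-document corpus is made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus.
docs = ["python and java", "python and data", "cricket and football"]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A word such as "and" that occurs in every document gets weight zero, while rarer words keep a positive weight.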
An article on sports will contain some words that differ from those in an
entertainment article.
• Each document is a mixture of topics.
The ratio gives us an estimate of how much more the two words co-occur than we
expect by chance.
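The sentence above matches the usual description of pointwise mutual information (PMI). Assuming that is the measure intended, a minimal sketch from raw counts looks like this; the counts are invented for illustration.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (in bits) from raw co-occurrence counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    # Ratio of observed co-occurrence to what independence would predict.
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: the pair co-occurs 20 times in 1000 observations.
print(pmi(count_xy=20, count_x=50, count_y=40, total=1000))
```

A value above zero means the words co-occur more often than chance would predict; a value near zero means they look independent.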
Apple 00100000
Orange 00001000
School 00000001
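These vectors are one-hot encodings. A minimal sketch follows; the 8-word vocabulary is hypothetical, and only the positions of Apple, Orange, and School are chosen to match the vectors shown above.

```python
def one_hot(word, vocab):
    """Return a list with a single 1 at the word's index in the vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

# Hypothetical vocabulary; the filler words are made up.
vocab = ["Car", "Dog", "Apple", "Tree", "Orange", "River", "Book", "School"]
for word in ["Apple", "Orange", "School"]:
    print(word, "".join(map(str, one_hot(word, vocab))))
```

Every pair of distinct one-hot vectors is equally far apart, so the encoding carries no notion of similarity between words.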
Context           Target
[quick, fox]      brown
[brown, jumps]    fox
[fox, over]       jumps
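Context/target pairs like these can be generated with a sliding window. The sketch assumes a window of one word on each side and assumes the example sentence is "the quick brown fox jumps over the lazy dog".

```python
def cbow_pairs(tokens, window=1):
    """Return (context_words, target) pairs for every token with a full window."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

# Assumed source sentence for the table above.
tokens = "the quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(tokens)[:4]:
    print(context, "->", target)
```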
Softmax and probability distribution
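A minimal sketch of softmax, which turns a list of raw scores into a probability distribution. The three scores are hypothetical stand-ins for the model's outputs over a tiny vocabulary.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three vocabulary words.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```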
The skip-gram model’s aim is to predict the context from the target word.
Target    Context
quick     the
quick     brown
brown     quick
brown     fox
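The pairs in the table can be reproduced by emitting one (target, context) pair per neighbor inside the window. A sketch, assuming a window of one word on each side and the phrase "the quick brown fox":

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every neighbor within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Assumed source phrase for the table above.
tokens = "the quick brown fox".split()
for target, context in skipgram_pairs(tokens):
    print(target, "->", context)
```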
It turns out that treating it as a multi-class classification problem makes it
very computationally inefficient to train, because the softmax has to be computed
over the entire vocabulary for every training example.
https://nlp.stanford.edu/projects/glove/.
glove-wiki-gigaword-50 (65 MB)
LSA-based models use global statistical information but do
poorly on the analogy task.
The problem with traditional models is that they do not give good embeddings for
rare words and they ignore the morphological structure of a word.
Dipanjan Sarkar
The movie is not good.
Geron
Because some of the previous information is stored, this introduces
the concept of memory.
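A minimal sketch of this idea, assuming a single scalar recurrent unit with a tanh activation. The weights are hypothetical; the point is only that the state keeps a trace of earlier inputs.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar recurrent unit: new state mixes input and old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

h = 0.0
for x in [1.0, 0.0, 0.0]:
    h = rnn_step(x, h)
    print(round(h, 4))
```

The state stays non-zero after the input has dropped to zero, which is the stored information the text refers to.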
A more general structure makes it possible to differentiate between what is
stored and what is fed to the next layer.
A new component, C_t, is introduced, which is the long-term memory.
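Assuming the structure being introduced is the standard LSTM cell, one step can be sketched with scalars as below. The parameter values are hypothetical and identical for all gates purely to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps each gate name to (w_x, w_h, b)."""
    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much old long-term memory to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    g = gate("candidate", math.tanh)   # candidate content
    o = gate("output", sigmoid)        # how much of the memory to expose
    c_t = f * c_prev + i * g           # long-term memory C_t
    h_t = o * math.tanh(c_t)           # state passed on to the next layer
    return h_t, c_t

# Hypothetical parameters for illustration only.
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

Keeping `c_t` separate from `h_t` is what lets the cell store one thing and pass on another.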
• The model can then perform a wide range of tasks with no further
fine-tuning.
• These fully trained models are often called engines. Only GPT-3,
Google BERT, and a handful of transformer engines can thus qualify
as foundation models.
Typical NLP Applications
Sentiment Analysis
NER
Question Answering
Summarization
Translation
Text Generation
LLM Leaderboard
A Few of Our NLP Papers
Ansar, W., Goswami, S., Chakrabarti, A. and Chakraborty, B., 2023. A novel selective learning based transformer encoder
architecture with enhanced word representation. Applied Intelligence, 53(8), pp.9424-9443.
Ansar, W., Goswami, S., Chakrabarti, A. and Chakraborty, B., 2023. TexIm: A Novel Text-to-Image
Encoding Technique Using BERT. In Computer Vision and Machine Intelligence: Proceedings of CVMI
2022 (pp. 123-139). Singapore: Springer Nature Singapore.
Roy, P., Brahma, A., Goswami, S. and Sen, S., 2022, January. Effective Sentiment Analysis of Bengali
Corpus by Using the Machine Learning Approach. In International Conference on Data Management,
Analytics & Innovation (pp. 61-74). Singapore: Springer Nature Singapore.
Ansar, W. and Goswami, S., 2021. Combating the menace: A survey on characterization and detection of
fake news from a data science perspective. International Journal of Information Management Data
Insights, 1(2), p.100052.
Ansar, W., Goswami, S., Chakrabarti, A. and Chakraborty, B., 2021. An efficient methodology for aspect-
based sentiment analysis using BERT through refined aspect extraction. Journal of Intelligent & Fuzzy
Systems, 40(5), pp.9627-9644.
Dey Sarkar, S., Goswami, S., Agarwal, A. and Aktar, J., 2014. A novel feature selection technique for text
classification using Naive Bayes. International Scholarly Research Notices, 2014.