
NATURAL LANGUAGE PROCESSING - IMPLEMENTING

- Ms. Akanksha Dhamija


Major tasks performed by an NLP agent
• Input data
• Segmentation
• Tokenization/Lemmatization
• Part of Speech Tagging
• Stemming
• Vectorization
Segmentation
• Sentence segmentation is the process of deciding where sentences
actually start or end, i.e. dividing a paragraph into its
individual sentences.

• “……..”, “……”, ………..
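A minimal sketch of rule-based sentence segmentation in Python (the regex heuristic and the sample text are illustrative, not from the slides; production segmenters such as nltk's sent_tokenize also handle abbreviations like "Dr." and "e.g."):

```python
import re

def segment(paragraph):
    # Split after ., !, or ? when followed by whitespace and a capital
    # letter. A rough heuristic, not a full segmenter.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', paragraph)
    return [s.strip() for s in parts if s.strip()]

sentences = segment("NLP is fun. It has many tasks! Do you agree?")
# -> ["NLP is fun.", "It has many tasks!", "Do you agree?"]
```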


Tokenize/Lemmatize
• Tokenization splits a piece of text into its individual words/tokens,
which are used to create the so-called vocabularies that the language
model you plan to build will rely on.

• Lemmatization is the process of taking individual tokens from a
sentence and reducing them to their base form. This is made possible
by having a vocabulary and performing morphological analysis.

• Un-planned → [“un”, “-”, “planned”, …], [………..]


• [“I”, “want”,……], […..],
• are, is → be
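A toy sketch of the two steps above (the regex tokenizer and the tiny lemma table are illustrative only; a real lemmatizer such as nltk's WordNetLemmatizer uses a full vocabulary plus morphological analysis):

```python
import re

def tokenize(text):
    # Words become tokens; punctuation becomes separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical toy lemma table, standing in for a real vocabulary.
LEMMAS = {"are": "be", "is": "be", "was": "be"}

def lemmatize(token):
    return LEMMAS.get(token.lower(), token.lower())

tokens = tokenize("Un-planned meetings are fun")
# -> ["Un", "-", "planned", "meetings", "are", "fun"]
lemmas = [lemmatize(t) for t in tokens]
```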
Stemming
• Stemming is the process of reducing inflected (or sometimes derived)
words to their stem, base, or root form.

• car, cars -> car


• run, ran, running -> run
• stemmer, stemming, stemmed -> stem
POS tagging
• Part-of-speech (POS) recognition – done using transition networks and
parse trees.

• “Today is a beautiful day.”

  Today   is     a        beautiful   day
  Noun    Verb   Article  Adjective   Noun
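A toy lexicon-lookup tagger reproducing the example above (the lexicon is illustrative; real taggers use transition networks, statistical models, or calls like nltk's pos_tag):

```python
# Hypothetical toy lexicon mapping words to their POS tags.
LEXICON = {
    "today": "Noun", "day": "Noun",
    "is": "Verb",
    "a": "Article", "the": "Article",
    "beautiful": "Adjective",
}

def tag(sentence):
    # Look each word up in the lexicon; unknown words get "Unknown".
    return [(w, LEXICON.get(w.lower(), "Unknown")) for w in sentence.split()]

tags = tag("Today is a beautiful day")
```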
Vectorization
• Word vectorization is a methodology in NLP for mapping words or phrases
from a vocabulary to corresponding vectors of real numbers, which are
used for word prediction and word similarity/semantics.
• Process of converting words into vectors to analyze frequency
• For this, tf-idf is used
• Tf — term frequency (how many times the word occurs in a document)
• Idf — inverse document frequency (importance of that word): log(D/d),
where D is the total number of documents and d is the number of
documents containing the word

• I am running in the running zone of the park.


• running → tf = 2
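A worked tf and idf computation on the example sentence (the two extra documents are made up for illustration, so that idf has a corpus to work with):

```python
import math

docs = [
    "I am running in the running zone of the park",
    "the park is open",          # hypothetical extra document
    "running is healthy",        # hypothetical extra document
]

def tf(word, doc):
    # Term frequency: how many times the word occurs in one document.
    return doc.lower().split().count(word)

def idf(word, docs):
    # Inverse document frequency: log(D/d), D = total documents,
    # d = documents containing the word.
    d = sum(1 for doc in docs if word in doc.lower().split())
    return math.log(len(docs) / d)

tf_running = tf("running", docs[0])   # -> 2, as on the slide
idf_running = idf("running", docs)    # -> log(3/2), word is in 2 of 3 docs
```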
Tf-Idf
• The CountVectorizer function converts a collection of text documents
to a matrix of token counts; TfidfVectorizer converts them to a matrix
of tf-idf features.

• Matrix:

           doc1   doc2   doc3   doc4   doc5
  fly        1      0      2     ..     ..
  word 2     2     ..     ..     ..     ..
  word 3    ..     ..     ..     ..     ..
  .
  .
  word N    ..     ..     ..     ..     ..
• from nltk.tokenize import word_tokenize
• from nltk.corpus import stopwords
• from nltk.stem import PorterStemmer, WordNetLemmatizer
• from sklearn.feature_extraction.text import TfidfVectorizer
• from sklearn.feature_extraction.text import CountVectorizer
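A short sketch putting the two sklearn vectorizers above to work (assumes scikit-learn is installed; the three sample documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the cat ran", "dogs ran fast"]

# Token-count matrix: one row per document, one column per vocabulary word.
counts = CountVectorizer().fit_transform(docs)

# Tf-idf weighted matrix over the same documents.
tfidf = TfidfVectorizer().fit_transform(docs)

# Both are sparse matrices of shape (3 documents, 6 vocabulary words).
```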

Other NLP toolkits: CoreNLP, spaCy, …
Applications of NLP
• Translation
• Spam detection
• Text summarization
• Question answering
• Sentiment analysis
