Major tasks performed by an NLP agent
• Input data
• Segmentation
• Tokenization/Lemmatization
• Part-of-speech tagging
• Stemming
• Vectorization

Segmentation
• Sentence segmentation is the process of deciding where sentences start and end in a text — in other words, dividing a paragraph into its individual sentences.
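Sentence segmentation can be sketched in plain Python. This is a naive punctuation-based splitter, not a real segmenter (tools like NLTK's `sent_tokenize` also handle abbreviations such as "Dr." that a bare regex would wrongly split on):

```python
import re

def segment_sentences(paragraph):
    """Split a paragraph into sentences at ., !, ? followed by whitespace.
    Naive sketch: abbreviations and decimal points are not handled."""
    parts = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [p for p in parts if p]

print(segment_sentences("Today is a beautiful day. I went to the park! Did you see me?"))
# → ['Today is a beautiful day.', 'I went to the park!', 'Did you see me?']
```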
Tokenize/Lemmatize
• Tokenization splits a piece of text into its individual words/tokens; these tokens are used to build the vocabulary for the language model you plan to train.
• Lemmatization is the process of reducing individual tokens of a sentence to their base form (lemma). It requires a vocabulary and morphological analysis of the words.
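Both steps can be sketched with the standard library alone. The lemma table below is a toy stand-in for the vocabulary + morphological analysis that a real lemmatizer (e.g. NLTK's `WordNetLemmatizer`) relies on:

```python
import re

# Toy lemma table -- for illustration only; a real lemmatizer uses a full
# vocabulary plus morphological analysis rather than a hand-written dict.
LEMMAS = {"are": "be", "is": "be", "am": "be", "running": "run", "ran": "run"}

def tokenize(text):
    """Split text into lowercase word tokens (naive regex split)."""
    return re.findall(r"[a-z']+", text.lower())

def lemmatize(tokens):
    """Map each token to its base form where the table knows one."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = tokenize("I am running in the park")
print(tokens)              # → ['i', 'am', 'running', 'in', 'the', 'park']
print(lemmatize(tokens))   # → ['i', 'be', 'run', 'in', 'the', 'park']
```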
• "I want …" -> ["I", "want", …], […]
• are, is -> be

Stemming
• Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form.
• car, cars -> car
• run, ran, running -> run
• stemmer, stemming, stemmed -> stem

POS tagging
• Part-of-speech (POS) recognition assigns a grammatical category to each token; classically it is done with transition networks and parse trees.
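The stemming examples above can be sketched with a naive suffix stripper — a toy stand-in for a real algorithm like Porter's (available as `nltk.stem.PorterStemmer`). Note that irregular forms such as ran -> run are beyond suffix stripping; that is lemmatization territory:

```python
def stem(word):
    """Naive suffix-stripping stemmer (toy sketch, not the Porter algorithm)."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # collapse a doubled final consonant: runn -> run, stemm -> stem
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word

print([stem(w) for w in ["car", "cars", "running", "stemmer", "stemming", "stemmed"]])
# → ['car', 'car', 'run', 'stem', 'stem', 'stem']
```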
• "Today is a beautiful day."
  Today/Noun  is/Verb  a/Article  beautiful/Adjective  day/Noun

Vectorization
• Word vectorization is a methodology in NLP for mapping words or phrases from the vocabulary to vectors of real numbers; these vectors are used for word prediction and for measuring word similarity/semantics.
• It converts words into vectors so that word frequencies can be analyzed.
• A common weighting scheme is tf-idf:
• tf (term frequency): how many times the word occurs in a document
• idf (inverse document frequency): the importance of the word, computed as log(D/d), where D is the total number of documents and d is the number of documents containing the word
• "I am running in the running zone of the park."
• tf("running") = 2

Tf-Idf
• CountVectorizer converts a collection of text documents to a matrix of token counts; TfidfVectorizer applies the tf-idf weighting on top of such counts.
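The textbook tf-idf formula above can be computed directly. This is a sketch of the plain formula; sklearn's `TfidfVectorizer` additionally applies smoothing and normalization, so its numbers differ slightly:

```python
import math

def tf_idf(term, doc, docs):
    """tf-idf for one term: tf = count of the term in the document,
    idf = log(D/d), D = total documents, d = documents containing the term."""
    tf = doc.count(term)
    d = sum(1 for other in docs if term in other)
    idf = math.log(len(docs) / d) if d else 0.0
    return tf * idf

docs = [
    "i am running in the running zone of the park".split(),
    "the park is closed".split(),
    "a running race".split(),
]
print(docs[0].count("running"))                   # tf of "running" in doc 1 → 2
print(round(tf_idf("running", docs[0], docs), 3)) # 2 * log(3/2) → 0.811
```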
• Example document-term matrix (rows = words, columns = documents):

          doc1  doc2  doc3  doc4  doc5
  fly       1     0     2     …     …
  word2     …     …     …     …     …
  wordN     …     …     …     …     …

• from nltk.tokenize import word_tokenize
• from nltk.corpus import stopwords
• from nltk.stem import PorterStemmer, WordNetLemmatizer
• from sklearn.feature_extraction.text import TfidfVectorizer
• from sklearn.feature_extraction.text import CountVectorizer
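A matrix like the one above can be built with the standard library alone — a minimal sketch of what `CountVectorizer` produces (no stop-word removal, n-grams, or sparse storage):

```python
from collections import Counter

def count_matrix(docs):
    """Build a document-term count matrix: one row per document, one column
    per vocabulary word (sorted). Sketch of CountVectorizer's core idea."""
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

vocab, rows = count_matrix(["the fly flew", "the the fly"])
print(vocab)   # → ['flew', 'fly', 'the']
print(rows)    # → [[1, 1, 1], [0, 1, 2]]
```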
• Libraries/toolkits: CoreNLP, spaCy, …

Applications of NLP
• Translation
• Spam detection
• Text summarization
• Question answering
• Sentiment analysis