NLP Cheat Sheet for Tokenization, Word2Vec, Stop Words & POS Tagging

NLP Cheat Sheet
by sree017 via cheatography.com/126402/cs/24446/
Tokenization Tokenization (cont) Bag Of Words & TF-IDF (cont) Bag Of Words & TF-IDF (cont)
Tokenization breaks the [word for word in doc] X = cv.fit_transform(c‐ A 2-gram (or bigram) is
raw text into words, # Keras ounters).toarray() a two-word sequence of
sentences called tokens. from keras.preprocessin‐ Term Frequency-Inverse words, like “I love”,
These tokens help in g.text import text_to_w‐ Document Frequency (TF- “love reading”, or
understanding the ord_sequence IDF): “Analytics Vidhya”.
context or developing text_to_word_sequence‐ Term frequency–in‐ And a 3-gram (or
the model for the NLP. (paragraph) verse document trigram) is a three-word
... If the text is split # genis frequency, is a sequence of words like
into words using some from gensim.summarizati‐ numerical statistic that “I love reading”, “about
separation technique it on.textcleaner import is intended to reflect data science” or “on
is called word split_sentences how important a word is Analytics Vidhya”.
tokenization and same split_sentences(parag‐ to a document in a
separation done for raph) collection or corpus. Stemming & Lemmatization
sentences is called from gensim.utils import T.F = No of rep of From Stemming we will
sentence tokenization. tokenize words in setence/No of process of getting the
# NLTK list(tokenize(para‐ words in sentence root form of a word. We
import nltk graph)) IDF = No of sentences would create the stem
nltk.download('punkt') / No of sentences words by removing the
paragraph = "write Bag Of Words & TF-IDF containing words prefix of suffix of a
paragaraph here to Bag of Words model is from sklearn.feature_ex‐ word. So, stemming a
convert into tokens." used to preprocess the traction.text import word may not result in
sentences = nltk.sent‐ text by converting it TfidfVectorizer actual words.
_tokenize(paragraph) into a bag of words, cv = TfidfVectorizer() paragraph = ""
words = nltk.word_token‐ which keeps a count of X = cv.fit_transform(c‐ # NLTK
ize(paragraph) the total occurrences of ounters).toarray() from nltk.stem import
# Spacy most frequently used N-gram Language Model: PorterStemmer
from spacy.lang.en words An N-gram is a sequence from nltk import sent_t‐
import English # counters = List of of N tokens (or words). okenize
nlp = English() stences after pre A 1-gram (or unigram) is from nltk import word_t‐
sbd = nlp.create_pipe‐ processing like tokeni‐ a one-word sequence.the okenize
('sentencizer') zation, stemming/lemmat‐ unigrams would simply stem = PorterStemmer()
nlp.add_pipe(sbd) ization, stopwords be: “I”, “love”, sentence = sent_tokeniz‐
doc = nlp(paragraph) from sklearn.feature_ex‐ “reading”, “blogs”, e(paragraph)[1]
[sent for sent in traction.text import “about”, “data”, words = word_tokenize(s‐
doc.sents] CountVectorizer “science”, “on”, “Analy‐ entence)
nlp = English() cv = CountVectorizer(ma‐ tics”, “Vidhya”. [stem.stem(word) for
doc = nlp(paragraph) x_features = 1500) word in words]
# Spacy
By sree017 Published 26th September, 2020. Sponsored by CrosswordCheats.com

cheatography.com/sree017/ Last updated 26th September, 2020. Learn to solve cryptic crosswords!
Page 1 of 3. http://crosswordcheats.com
NLP Cheat Sheet
Stemming & Lemmatization Word2Vec Stop Words Stop Words (cont)

(cont)
In BOW and TF-IDF Stopwords are the most for word in token_list:
No Stemming in spacy approach semantic common words in any lexeme = nlp.vocab‐
# Keras information not stored. natural language. For [word]
No Stemming in Keras TF-IDF gives the purpose of if lexeme.is_stop ==
Lemmatization: importance to uncommon analyzing text data and False:
As stemming, lemmat‐ words. There is building NLP models, filtered_senten‐
ization do the same definitely chance of these stopwords might ce.append(word)
but the only overfitting. not add much value to # Gensim
difference is that In W2v each word is the meaning of the from gensim.parsing.prepro‐
lemmatization ensures basically represented document. cessing import remove_st‐
that root word belongs as a vector of 32 or # NLTK opwords
to the language more dimension instead from nltk.corpus import remove_stopwords(paragraph)
# NLTK of a single number. stopwords
from nltk.stem import Here the semantic from nltk.tokenize Tokenization
WordNetLemmatizer information and import word_tokenize NLTK Spacy Keras Tensorlfow
lemma = WordNetLemma‐ relation between words stopwords = set(stopw‐
dfdfd
tizer() is also preserved. ords.words('english'))
sentence = sent_toke‐ Steps: word_tokens = word_t‐ Parts of Speech (POS) Tagging,
nize(paragraph)[1] 1. Tokenization of the okenize(paragraph) Chunking & NER
words = word_tokeniz‐ sentences [word for word in
The pos(parts of speech)
e(sentence) 2. Create Histograms word_tokens if word not
explain you how a word is
[lemma.lemmatize(word) 3. Take most frequent in stopwords]
used in a sentence. In the
for word in words] words # Spacy
sentence, a word have
# Spcay 4. Create a matrix with from spacy.lang.en
different contexts and
import spacy as spac all the unique words. import English
semantic meanings. The
sp = spac.load('en_c‐ It also represents the from spacy.lang.en.s‐
basic natural language
ore_web_sm') occurence relation top_words import
processing(NLP) models like
ch = sp(u'warning between the words STOP_WORDS
bag-of-words(bow) fails to
warned') from gensim.models nlp = English()
identify these relation
for x in ch: import Word2Vec my_doc = nlp(paragraph)
between the words. For that
print(ch.lemma_) model = Word2Vec(sen‐ # Create list of word
we use pos tagging to mark a
# Keras tences, min_count=1) tokens
word to its pos tag based on
No lemmatization or words = model.wv.vocab token_list =
its context in the data.
stemming vector = model.wv['fr‐ [token.text for token
Pos is also used to extract
eedom'] in my_doc]
rlationship between the
similar = model.wv.mos‐ # Create list of word
words
t_similar['freedom'] tokens after removing
# NLTK
stopwords
filtered_sentence =[]

NLP Cheat Sheet
Parts of Speech (POS) Tagging, Parts of Speech (POS) Tagging,

Chunking & NER (cont) Chunking & NER (cont)
from nltk.tokenize import word_pos = pos_tag(word‐

word_tokenize _tokens)
from nltk import pos_tag chunkParser = nltk.Rege‐
nltk.download('averaged_‐ xpParser(grammar)
perceptron_tagger') tree = chunkParser.par‐
word_tokens = word_toke‐ se(word_pos)
nize('Are you afraid of Named Entity Recogniza‐
something?') tion:
pos_tag(word_tokens) It is used to extract
# Spacy information from unstru‐
nlp = spacy.load("en_c‐ ctured text. It is used
ore_web_sm") to classy the entities
doc = nlp("Coronavirus: which is present in the
Delhi resident tests text into categories like
positive for coronavirus, a person, organization,
total 31 people infected event, places, etc. This
in India") will give you a detail
[token.pos_ for token in knowledge about the text
doc] and the relationship
Chunking: between the different
Chunking is the process entities.
of extracting phrases # Spacy
from the Unstructured import spacy
text and give them more nlp = spacy.load("en_c‐
structure to it. We also ore_web_sm")
called them shallow doc = nlp("Coronavirus:
parsing.We can do it on Delhi resident tests
top of pos tagging. It positive for coronavirus,
groups words into chunks total 31 people infected
mainly for noun phrases. in India")
chunking we do by using for ent in doc.ents:
regular expression. print(ent.text,
# NLTK ent.start_char, ent.en‐
word_tokens = word_toke‐ d_char, ent.label_)
nize(text)


NLP Cheat Sheet for Tokenization, Word2Vec, Stop Words & POS Tagging

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

NLP Cheat Sheet for Tokenization, Word2Vec, Stop Words & POS Tagging

Uploaded by

Copyright:

Available Formats

NLP Cheat Sheet

by sree017 via cheatography.com/126402/cs/24446/

By sree017 Published 26th September, 2020. Sponsored by CrosswordCheats.com

Stemming & Lemmatization Word2Vec Stop Words Stop Words (cont)

By sree017 Published 26th September, 2020. Sponsored by CrosswordCheats.com

Parts of Speech (POS) Tagging, Parts of Speech (POS) Tagging,

from nltk.tokenize import word_pos = pos_tag(word‐

By sree017 Published 26th September, 2020. Sponsored by CrosswordCheats.com

You might also like

NLP Cheat Sheet for Tokenization, Word2Vec, Stop Words & POS Tagging

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

NLP Cheat Sheet for Tokenization, Word2Vec, Stop Words & POS Tagging

Uploaded by

Copyright:

Available Formats

NLP Cheat Sheet

by sree017 via cheatography.com/126402/cs/24446/

By sree017 Published 26th September, 2020. Sponsored by CrosswordCheats.com

Stemming & Lemmat​ization Word2Vec Stop Words Stop Words (cont)

By sree017 Published 26th September, 2020. Sponsored by CrosswordCheats.com

Parts of Speech (POS) Tagging, Parts of Speech (POS) Tagging,

from nltk.t​okenize import word_pos = pos_ta​g(w​ord​‐

By sree017 Published 26th September, 2020. Sponsored by CrosswordCheats.com

You might also like

Stemming & Lemmatization Word2Vec Stop Words Stop Words (cont)

from nltk.tokenize import word_pos = pos_tag(word‐