NLP Cheat Sheet

by sree017 via cheatography.com/126402/cs/24446/

Tokenization

Tokenization breaks raw text into words or sentences, called tokens. These tokens help in understanding the context and in developing the model for the NLP task. If the text is split into words using some separation technique, it is called word tokenization; the same separation done for sentences is called sentence tokenization.

# NLTK
import nltk
nltk.download('punkt')
paragraph = "write paragraph here to convert into tokens."
sentences = nltk.sent_tokenize(paragraph)
words = nltk.word_tokenize(paragraph)
# Spacy
from spacy.lang.en import English
nlp = English()
sbd = nlp.create_pipe('sentencizer')  # spaCy 3.x: nlp.add_pipe('sentencizer')
nlp.add_pipe(sbd)
doc = nlp(paragraph)
[sent for sent in doc.sents]
nlp = English()
doc = nlp(paragraph)
[word for word in doc]
# Keras
from keras.preprocessing.text import text_to_word_sequence
text_to_word_sequence(paragraph)
# Gensim
from gensim.summarization.textcleaner import split_sentences
split_sentences(paragraph)
from gensim.utils import tokenize
list(tokenize(paragraph))

Bag Of Words & TF-IDF

The Bag of Words model is used to preprocess text by converting it into a bag of words, which keeps a count of the total occurrences of the most frequently used words.
# counters = list of sentences after preprocessing
# (tokenization, stemming/lemmatization, stopword removal)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(counters).toarray()
Term Frequency-Inverse Document Frequency (TF-IDF):
Term frequency-inverse document frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
TF = (number of repetitions of a word in a sentence) / (number of words in that sentence)
IDF = log((number of sentences) / (number of sentences containing the word))
TF-IDF = TF x IDF
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(counters).toarray()
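A minimal end-to-end sketch on a toy corpus (the corpus and variable names here are illustrative, not part of the original sheet), useful for checking what the vectorizer learned:
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["I love reading blogs", "I love data science"]  # toy corpus
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs).toarray()
print(tfidf.get_feature_names_out())  # learned vocabulary (get_feature_names() before scikit-learn 1.0)
print(tfidf.idf_)  # higher IDF = rarer, more informative word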
N-gram Language Model:
An N-gram is a sequence of N tokens (or words). For the sentence "I love reading blogs about data science on Analytics Vidhya", a 1-gram (or unigram) is a one-word sequence: the unigrams would simply be "I", "love", "reading", "blogs", "about", "data", "science", "on", "Analytics", "Vidhya". A 2-gram (or bigram) is a two-word sequence of words, like "I love", "love reading", or "Analytics Vidhya". And a 3-gram (or trigram) is a three-word sequence of words, like "I love reading", "about data science" or "on Analytics Vidhya".
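A short sketch of generating these n-grams with NLTK's ngrams helper:
from nltk import word_tokenize
from nltk.util import ngrams
tokens = word_tokenize("I love reading blogs about data science on Analytics Vidhya")
bigrams = list(ngrams(tokens, 2))   # [('I', 'love'), ('love', 'reading'), ...]
trigrams = list(ngrams(tokens, 3))  # [('I', 'love', 'reading'), ...]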
Stemming & Lemmatization

Stemming is the process of getting the root form of a word: we create the stem by removing the prefix or suffix of the word. Because of this, stemming a word may not result in an actual word.
paragraph = ""  # fill in your text here
# NLTK
from nltk.stem import PorterStemmer
from nltk import sent_tokenize
from nltk import word_tokenize
stem = PorterStemmer()
sentence = sent_tokenize(paragraph)[1]
words = word_tokenize(sentence)
[stem.stem(word) for word in words]

# Spacy
No stemming in spaCy
# Keras
No stemming in Keras

Lemmatization:
Lemmatization does the same as stemming, but the difference is that lemmatization ensures the root word belongs to the language.
# NLTK
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
sentence = sent_tokenize(paragraph)[1]
words = word_tokenize(sentence)
[lemma.lemmatize(word) for word in words]
# Spacy
import spacy as spac
sp = spac.load('en_core_web_sm')
ch = sp(u'warning warned')
for x in ch:
    print(x.lemma_)
# Keras
No lemmatization or stemming in Keras
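A quick illustrative comparison of the two (the word is an assumed example; the outputs in the comments are typical Porter/WordNet behaviour):
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
print(PorterStemmer().stem("histories"))          # "histori" - not a real word
print(WordNetLemmatizer().lemmatize("histories")) # "history" - a valid dictionary word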
Word2Vec

In the BOW and TF-IDF approaches, semantic information is not stored, and TF-IDF gives importance to uncommon words, so there is definitely a chance of overfitting. In Word2Vec, each word is basically represented as a vector of 32 or more dimensions instead of a single number. Here the semantic information and the relations between words are also preserved.
Steps:
1. Tokenize the sentences
2. Create histograms
3. Take the most frequent words
4. Create a matrix with all the unique words; it also represents the occurrence relations between the words
from gensim.models import Word2Vec
model = Word2Vec(sentences, min_count=1)
words = model.wv.vocab  # gensim 4+: model.wv.key_to_index
vector = model.wv['freedom']
similar = model.wv.most_similar('freedom')
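Word2Vec expects a list of tokenized sentences. A minimal sketch of preparing that input (the variable names are illustrative):
from nltk import sent_tokenize, word_tokenize
# each sentence becomes a list of lowercase tokens
sentences = [word_tokenize(s.lower()) for s in sent_tokenize(paragraph)]
model = Word2Vec(sentences, min_count=1)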
Stop Words

Stopwords are the most common words in any natural language. For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document.
# NLTK
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))  # renamed so the stopwords module is not shadowed
word_tokens = word_tokenize(paragraph)
[word for word in word_tokens if word not in stop_words]
# Spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
nlp = English()
my_doc = nlp(paragraph)
# Create list of word tokens
token_list = [token.text for token in my_doc]
# Create list of word tokens after removing stopwords
filtered_sentence = []
for word in token_list:
    lexeme = nlp.vocab[word]
    if lexeme.is_stop == False:
        filtered_sentence.append(word)
# Gensim
from gensim.parsing.preprocessing import remove_stopwords
remove_stopwords(paragraph)
Parts of Speech (POS) Tagging, Chunking & NER

POS (parts of speech) tags explain how a word is used in a sentence. In a sentence, a word can have different contexts and semantic meanings. Basic natural language processing (NLP) models like bag-of-words (BOW) fail to identify these relations between words. For that we use POS tagging to mark a word with its POS tag based on its context in the data. POS tags are also used to extract relationships between words.

# NLTK
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
word_tokens = word_tokenize('Are you afraid of something?')
pos_tag(word_tokens)
# Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
[token.pos_ for token in doc]
Chunking:
Chunking is the process of extracting phrases from unstructured text and giving them more structure; it is also called shallow parsing. It is done on top of POS tagging and groups words into chunks, mainly noun phrases, which we define using regular expressions over the POS tags.
# NLTK
word_tokens = word_tokenize(text)
word_pos = pos_tag(word_tokens)
chunkParser = nltk.RegexpParser(grammar)  # grammar = a chunk pattern over POS tags, see below
tree = chunkParser.parse(word_pos)
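A runnable sketch of the above with an example noun-phrase grammar (the pattern and sentence are illustrative; any regex over POS tags works):
import nltk
from nltk import pos_tag, word_tokenize
grammar = "NP: {<DT>?<JJ>*<NN>}"  # optional determiner, any adjectives, then a noun
word_pos = pos_tag(word_tokenize("The little yellow dog barked at the cat"))
tree = nltk.RegexpParser(grammar).parse(word_pos)
tree.pretty_print()  # or tree.draw() for a GUI window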
Named Entity Recognition:
It is used to extract information from unstructured text and to classify the entities present in the text into categories like person, organization, event, place, etc. This gives you detailed knowledge about the text and the relationships between the different entities.
# Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
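Optionally, the recognized entities can be visualized inline with spaCy's built-in displacy module (a small assumed addition, not in the original sheet):
from spacy import displacy
displacy.render(doc, style="ent")  # use displacy.serve(...) outside notebooks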
