Tokenization

Tokenization breaks the raw text into words or sentences, called tokens. These tokens help in understanding the context and in developing the model for the NLP task. ... If the text is split into words using some separation technique, it is called word tokenization; the same separation done for sentences is called sentence tokenization.

# NLTK
import nltk
nltk.download('punkt')
paragraph = "Write the paragraph to convert into tokens here."
sentences = nltk.sent_tokenize(paragraph)
words = nltk.word_tokenize(paragraph)

# Spacy
from spacy.lang.en import English
nlp = English()
# spaCy v2 API; in spaCy v3 use nlp.add_pipe("sentencizer") directly
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd)
doc = nlp(paragraph)
[sent for sent in doc.sents]
# Spacy word tokenization
nlp = English()
doc = nlp(paragraph)
[word for word in doc]
# Keras
from keras.preprocessing.text import text_to_word_sequence
text_to_word_sequence(paragraph)

# gensim (gensim.summarization was removed in gensim 4.x)
from gensim.summarization.textcleaner import split_sentences
split_sentences(paragraph)
from gensim.utils import tokenize
list(tokenize(paragraph))
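As a quick sanity check, this is what the NLTK tokenizers above return for a small two-sentence string (the sample text is illustrative):

import nltk
nltk.download('punkt')
text = "I love NLP. Tokenization is the first step."
print(nltk.sent_tokenize(text))
# ['I love NLP.', 'Tokenization is the first step.']
print(nltk.word_tokenize(text))
# ['I', 'love', 'NLP', '.', 'Tokenization', 'is', 'the', 'first', 'step', '.']

Note that word_tokenize keeps punctuation marks as separate tokens.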
paragraph = "write Bag Of Words & TF-IDF containing words prefix of suffix of a
paragaraph here to Bag of Words model is from sklearn.feature_ex‐ word. So, stemming a
convert into tokens." used to preprocess the traction.text import word may not result in
sentences = nltk.sent‐ text by converting it TfidfVectorizer actual words.
_tokenize(paragraph) into a bag of words, cv = TfidfVectorizer() paragraph = ""
words = nltk.word_token‐ which keeps a count of X = cv.fit_transform(c‐ # NLTK
ize(paragraph) the total occurrences of ounters).toarray() from nltk.stem import
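A minimal sketch of what CountVectorizer produces, assuming a tiny toy corpus (the sentences are made up; single-character tokens like "i" are dropped by the default token pattern):

from sklearn.feature_extraction.text import CountVectorizer
corpus = ["i love nlp", "i love reading blogs"]
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
print(cv.get_feature_names_out())
# ['blogs' 'love' 'nlp' 'reading']
print(X)
# [[0 1 1 0]
#  [1 1 0 1]]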
Term Frequency-Inverse Document Frequency (TF-IDF):
Term frequency-inverse document frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Treating each sentence as a document:
TF = (number of repetitions of a word in a sentence) / (number of words in the sentence)
IDF = log(number of sentences / number of sentences containing the word)

from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(counters).toarray()
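To make the formulas concrete, here is a small worked example that follows the sentence-level definitions above (the toy sentences are illustrative; scikit-learn's TfidfVectorizer uses a smoothed IDF, so its numbers will differ slightly):

import math
sentences = ["i love nlp", "i love reading", "nlp is fun"]
word = "love"
tf = sentences[0].split().count(word) / len(sentences[0].split())  # 1/3
df = sum(word in s.split() for s in sentences)  # "love" occurs in 2 sentences
idf = math.log(len(sentences) / df)  # log(3/2)
print(tf * idf)  # ~0.135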
N-gram Language Model:
An N-gram is a sequence of N tokens (or words). A 1-gram (or unigram) is a one-word sequence. For the sentence "I love reading blogs about data science on Analytics Vidhya", the unigrams would simply be: "I", "love", "reading", "blogs", "about", "data", "science", "on", "Analytics", "Vidhya". A 2-gram (or bigram) is a two-word sequence of words, like "I love", "love reading", or "Analytics Vidhya". And a 3-gram (or trigram) is a three-word sequence of words, like "I love reading", "about data science" or "on Analytics Vidhya".
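A minimal sketch for extracting N-grams by sliding a window over the token list (the helper ngrams below is illustrative; NLTK also ships nltk.util.ngrams with the same behaviour):

def ngrams(tokens, n):
    # slide a window of size n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love reading blogs about data science on Analytics Vidhya".split()
print(ngrams(tokens, 2)[:3])
# [('I', 'love'), ('love', 'reading'), ('reading', 'blogs')]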
Stemming & Lemmatization

Stemming is the process of getting the root form of a word. We create the stem by removing the prefix or suffix of a word, so stemming a word may not result in an actual word.

paragraph = ""  # fill with text containing at least two sentences
# NLTK
from nltk.stem import PorterStemmer
from nltk import sent_tokenize
from nltk import word_tokenize
stem = PorterStemmer()
sentence = sent_tokenize(paragraph)[1]  # take the second sentence
words = word_tokenize(sentence)
[stem.stem(word) for word in words]
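As a quick illustration of why stems may not be actual words (the sample words are chosen for illustration):

from nltk.stem import PorterStemmer
stem = PorterStemmer()
print([stem.stem(w) for w in ["history", "going", "studies", "finally"]])
# ['histori', 'go', 'studi', 'final']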