
Text Pre-Processing with NLTK
Text mining
 Text mining refers to data mining using text documents as data.
 There are many special techniques for pre-processing text documents to make them suitable for mining.
 Most of these techniques come from the field of “Information Retrieval”.
Information Retrieval (IR)
 Conceptually, information retrieval (IR) is the
study of finding needed information. I.e., IR helps
users find information that matches their
information needs.
 Historically, information retrieval is about
document retrieval, emphasizing document as
the basic unit.
 Technically, IR studies the acquisition,
organization, storage, retrieval, and distribution of
information.
 IR has become a central focus in the Web era.
Text Processing
 Word (token) extraction
 Stop words
 Stemming
 Frequency counts
Tokenization

 Split text into a list of words (tokens).

 NLTK provides a module called nltk.tokenize, which offers two main kinds of tokenizers:
 Word tokenize: we use the word_tokenize() method to split a sentence into tokens or words.
 Sentence tokenize: we use the sent_tokenize() method to split a document or paragraph into sentences.
Word Tokenization

import nltk
from nltk.tokenize import word_tokenize  # import the word tokenizer

# nltk.download('punkt')  # run once if the tokenizer data is not installed
word_tokenize("Hello, this is UET, CS and IT")
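Sentence tokenization works the same way; a minimal sketch, assuming the NLTK punkt tokenizer data has been downloaded (the example text is illustrative):

import nltk
from nltk.tokenize import sent_tokenize

# nltk.download('punkt')  # run once if the tokenizer data is not installed
text = "Tokenization splits text into units. Sentence tokenization splits a paragraph into sentences."
print(sent_tokenize(text))
# ['Tokenization splits text into units.', 'Sentence tokenization splits a paragraph into sentences.']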
Stop words
 The process of converting data to something a computer can understand is referred to as pre-processing. One of the major forms of pre-processing is to filter out useless data. In natural language processing, useless words (data) are referred to as stop words.

 Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

 We would not want these words to take up space in our database or take up valuable processing time. We can remove them easily by storing a list of the words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python ships a list of stop words for 16 different languages.

 Many of the most frequently used words in English are worthless in text mining – these words are called stop words.
 the, of, and, to, ….
 Typically about 400 to 500 such words.
 For an application, an additional domain-specific stop word list may be constructed.

Why do we need to remove stop words?
 Reduce indexing (or data) file size
 stop words account for 20-30% of total word counts.
 Improve efficiency
 stop words are not useful for searching or text mining.
 stop words always have a large number of hits.

Stop Word Cont..

To check the English stop word list, write the following code in a Python editor.

import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the stopwords corpus is missing
print(stopwords.words('english'))
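Combining tokenization with the stop word list, a minimal sketch of filtering stop words out of a sentence (the example sentence is illustrative):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('punkt'); nltk.download('stopwords')  # run once if the data is missing
stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a simple example of removing the stop words")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['simple', 'example', 'removing', 'stop', 'words']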
Stemming
 Techniques used to find the root/stem of a word.
 E.g.,
 user, users, used, using --> stem: use
 engineering, engineered, engineer --> stem: engineer
Usefulness
 improving effectiveness of IR and text mining
 matching similar words
 reducing indexing size
 combining words with the same roots can reduce the size of the corpus by as much as 40-50%.
Stemming Algorithms

 Porter Stemmer
 Snowball Stemmer
 Lancaster Stemmer
 Regex-based Stemmer
NLTK Code for Stemmer

import nltk
from nltk.stem import PorterStemmer

ps = PorterStemmer()  # create a stemmer object
print(ps.stem('coder'))
print(ps.stem('coding'))
print(ps.stem('code'))
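The other stemmers listed above are used in the same way; a small sketch comparing them on the same words (outputs differ between algorithms, since each uses its own rule set):

from nltk.stem import SnowballStemmer, LancasterStemmer

snowball = SnowballStemmer('english')   # Snowball needs the language name
lancaster = LancasterStemmer()
for word in ['coder', 'coding', 'code']:
    print(word, snowball.stem(word), lancaster.stem(word))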
Basic stemming methods
 remove ending
 if a word ends with a consonant other than s,
followed by an s, then delete s.
 if a word ends in es, drop the s.
 if a word ends in ing, delete the ing unless the
remaining word consists only of one letter or of th.
 If a word ends with ed, preceded by a consonant,
delete the ed unless this leaves only a single letter.
 …...
 transform words
 if a word ends with “ies” but not “eies” or “aies” then
“ies --> y.”
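Rules like the ones above can be approximated with NLTK's RegexpStemmer, which simply strips any suffix matching a regular expression; a minimal sketch (the suffix pattern is illustrative, not the full rule set):

from nltk.stem import RegexpStemmer

# Strip these endings from words that are at least 4 characters long.
st = RegexpStemmer('ies$|ing$|ed$|es$|s$', min=4)
for word in ['users', 'engineered', 'branches']:
    print(word, '->', st.stem(word))
# users -> user, engineered -> engineer, branches -> branch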
Term Frequency - Inverse Document Frequency (TF-IDF)

 TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
 It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).
 TF-IDF (term frequency-inverse document frequency) was invented for document search and information retrieval.
 So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don't mean much to that document in particular.
How is the TF-IDF of a document calculated?

The process to find the meaning of documents using TF-IDF is:

1. Clean data / preprocessing: clean data (standardise data), normalize data (all lower case), lemmatize data (reduce all words to root words).
2. Tokenize words with frequency
3. Find TF for words
4. Find IDF for words
5. Vectorize vocab

 To put it in more formal mathematical terms, the TF-IDF score for the word t in the document d from the document set D is calculated as follows:
 tf-idf(t, d, D) = tf(t, d) * idf(t, D)
 Where:
 tf(t, d) = frequency of term t in document d (see Step 1 below)
 idf(t, D) = log(N / df(t)), where N is the number of documents in D and df(t) is the number of documents that contain t (see Step 2 below)
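A minimal pure-Python sketch of these formulas (log base 2 is used to match the worked example below; the function names are illustrative):

import math

def tf(term, doc_tokens):
    # normalized term frequency: occurrences of the term / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, docs_tokens):
    # inverse document frequency: log(N / number of documents containing the term);
    # assumes the term occurs in at least one document
    df = sum(1 for doc in docs_tokens if term in doc)
    return math.log(len(docs_tokens) / df, 2)

def tf_idf(term, doc_tokens, docs_tokens):
    return tf(term, doc_tokens) * idf(term, docs_tokens)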

Example

 Doc 1: BEN studies about computers in Computer Lab.
 Doc 2: Steve teaches at Brown University.
 Doc 3: Data Scientists work on large datasets.
 Let's say we are doing a search on these documents with the following query: Data Scientists

Step 1: Computing the Term Frequency (tf)
Frequency indicates the number of occurrences of a particular term t in document d. Therefore,

 tf(t, d) = N(t, d), wherein:
 tf(t, d) = term frequency for a term t in document d
 N(t, d) = number of times a term t occurs in document d

Given below are the terms and their frequencies [N(t, d)] in each of the documents.
 tf for document 1:
 BEN -> 1, STUDIES -> 1, COMPUTER -> 2, LAB -> 1
 Vector Space Representation for Doc 1: [1, 1, 2, 1]
 Since term frequency relies on raw occurrence counts, longer documents will be favored more. To avoid this, normalize the term frequency:

 tf(t, d) = N(t, d) / ||D||

 Wherein: ||D|| = total number of terms in the document

 ||D|| is computed for each document (here ||D|| = 7 for Doc 1, 5 for Doc 2 and 6 for Doc 3).


 Given below are the normalized term frequencies for all the documents, i.e. [N(t, d) / ||D||].

 Normalized TF for document 1:
 BEN -> 0.143, STUDIES -> 0.143, COMPUTER -> 0.286, LAB -> 0.143
 Vector Space Representation for Document 1: [0.143, 0.143, 0.286, 0.143]

 Normalized TF for document 2:
 STEVE -> 0.2, TEACHES -> 0.2, BROWN -> 0.2, UNIVERSITY -> 0.2
 Vector Space Representation for Document 2: [0.2, 0.2, 0.2, 0.2]

 Normalized TF for document 3:
 DATA -> 0.167, SCIENTISTS -> 0.167, WORK -> 0.167, LARGE -> 0.167, DATASETS -> 0.167
 Vector Space Representation for Document 3: [0.167, 0.167, 0.167, 0.167, 0.167]
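A small sketch that reproduces these normalized term frequencies; a Porter stemmer is applied so that "computers" and "Computer" count as the same term, which is an assumption, since the preprocessing is not stated explicitly above:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
docs = ["BEN studies about computers in Computer Lab",
        "Steve teaches at Brown University",
        "Data Scientists work on large datasets"]

for doc in docs:
    tokens = [ps.stem(w) for w in doc.lower().split()]   # stem so computer/computers merge
    n = len(tokens)                                       # ||D|| for this document
    tf = {t: round(tokens.count(t) / n, 3) for t in set(tokens)}
    print(tf)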
Step 2: Compute the Inverse Document Frequency (idf)
 First of all, find the document frequency of a term t by counting the number of documents containing the term:
 df(t) = N(t), wherein:
 df(t) = document frequency of a term t
 N(t) = number of documents containing the term t
 Term frequency is the occurrence count of a term in one particular document only, while document frequency is the number of different documents the term appears in, so it depends on the whole corpus.
 The idf of a term is the logarithm of the number of documents in the corpus divided by the document frequency of the term:
 idf(t) = log(N / df(t))
 Let's compute IDF for the term Computer:
 idf(computer) = log(Total Number Of Documents / Number Of Documents with term Computer in it)
 There are 3 documents in all: Document1, Document2, Document3, and the term Computer appears only in Document1, so:
 idf(computer) = log(3 / 1) = 1.5849 (using log base 2)
 Given below is the idf for the terms occurring in the documents; since each term here appears in exactly one document, idf(t) = log(3 / 1) = 1.5849 for every term.
Step 3: tf-idf Scoring
 Now that we have defined both tf and idf, we can combine these to produce the ultimate score of a term t in document d. Therefore,
 tf-idf(t, d) = tf(t, d) * idf(t)
 For each term in the query, multiply its normalized term frequency with its IDF for each document.
 In Document3, for the term data, the normalized term frequency is 0.167 and its IDF is 1.5849. Multiplying them together we get 0.2646.
 Given below are the TF * IDF scores for data and Scientists in all the documents:
 data: Document1 = 0, Document2 = 0, Document3 = 0.2646
 Scientists: Document1 = 0, Document2 = 0, Document3 = 0.2646
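A compact sketch that reproduces these query-term scores for all three documents (log base 2; the small difference from 0.2646 is because the example above rounds tf to 0.167 before multiplying):

import math

docs = [
    "ben studies about computers in computer lab".split(),
    "steve teaches at brown university".split(),
    "data scientists work on large datasets".split(),
]
query = ["data", "scientists"]

for term in query:
    df = sum(1 for d in docs if term in d)                        # document frequency
    idf = math.log(len(docs) / df, 2)                             # idf(t) = log2(N / df(t))
    scores = [round(d.count(term) / len(d) * idf, 4) for d in docs]
    print(term, scores)  # e.g. data [0.0, 0.0, 0.2642]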
We will use one of the similarity measures (e.g., the Cosine Similarity method) to find the similarity between the query and each document. For example, if we use the Cosine Similarity method, then the smaller the angle, the greater the similarity.
Cosine similarity

 Cosine similarity is a metric that helps determine how similar two data objects are, irrespective of their size. In cosine similarity, the data objects in a dataset are treated as vectors. The formula to find the cosine similarity between two vectors is:
 Cos(x, y) = x . y / (||x|| * ||y||)

Where:
 x . y = dot product of the vectors ‘x’ and ‘y’
 ||x|| and ||y|| = lengths of the two vectors ‘x’ and ‘y’.
 Example:
 Consider an example to find the similarity between two vectors, ‘x’ and ‘y’, using Cosine Similarity.
 The ‘x’ vector has values x = { 3, 2, 0, 5 }
 The ‘y’ vector has values y = { 1, 0, 0, 0 }
 The formula for calculating the cosine similarity is: Cos(x, y) = x . y / (||x|| * ||y||)
 x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3
 ||x|| = √((3)^2 + (2)^2 + (0)^2 + (5)^2) = √38 ≈ 6.16
 ||y|| = √((1)^2 + (0)^2 + (0)^2 + (0)^2) = 1
 ∴ Cos(x, y) = 3 / (6.16 * 1) = 0.49
 The cosine similarity corresponds to the angle θ between the two vectors:
 If θ = 0°, the ‘x’ and ‘y’ vectors overlap, thus proving they are similar.
 If θ = 90°, the ‘x’ and ‘y’ vectors are dissimilar.
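A short sketch that reproduces this calculation in plain Python (no external libraries):

import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))        # x . y
    norm_x = math.sqrt(sum(a * a for a in x))     # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))     # ||y||
    return dot / (norm_x * norm_y)

print(round(cosine_similarity([3, 2, 0, 5], [1, 0, 0, 0]), 2))  # 0.49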
Cosine Similarity between two vectors
Some of the popular similarity measures are:
Euclidean Distance.

Manhattan Distance.

Jaccard Similarity.

Minkowski Distance.
Applications of TF-IDF

TF-IDF is useful in many ways, for example:

Information retrieval

TF-IDF was invented for document search and can be used to deliver results that are most relevant to what you're searching for. It's likely that every search engine you have ever encountered uses TF-IDF scores in its algorithm.

Keyword Extraction

TF-IDF is also useful for extracting keywords from text. How? The highest scoring words of a document are the most relevant to that document, and therefore they can be considered keywords for that document.
Vector Space Representation
 A document is represented as a vector:
 (W1, W2, ..., Wn)
 Binary:
 Wi = 1 if the corresponding term i (often a word) is in the document
 Wi = 0 if the term i is not in the document
 TF (Term Frequency):
 Wi = tfi, where tfi is the number of times the term occurred in the document
 TF*IDF (Term Frequency * Inverse Document Frequency):
 Wi = tfi * idfi = tfi * log(N / dfi), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.
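A small sketch of the three weighting schemes for one document over a fixed vocabulary (the vocabulary, document and corpus statistics below are illustrative):

import math

vocab = ["data", "scientists", "computer", "lab"]
doc = ["data", "scientists", "work", "on", "data"]
N = 3                                                         # documents in the collection
df = {"data": 1, "scientists": 1, "computer": 1, "lab": 1}    # toy document frequencies

binary = [1 if t in doc else 0 for t in vocab]
tf     = [doc.count(t) for t in vocab]
tfidf  = [doc.count(t) * math.log(N / df[t], 2) for t in vocab]

print(binary)  # [1, 1, 0, 0]
print(tf)      # [2, 1, 0, 0]
print(tfidf)   # roughly [3.17, 1.58, 0.0, 0.0]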
Vector Space and Document Similarity
 Each indexing term is a dimension. An indexing term is normally a word.
 Each document is a vector
 Di = (ti1, ti2, ti3, ti4, ... tin)
 Dj = (tj1, tj2, tj3, tj4, ..., tjn)
 Document similarity is defined as (cosine similarity):

 Similarity(Di, Dj) = Σk=1..n (tik * tjk) / ( √(Σk=1..n tik^2) * √(Σk=1..n tjk^2) )
Vector Space Representation
 Each doc j is a vector, one component for each
term (= word).
 Have a vector space
 terms are attributes
 n docs live in this space
 even with stop word removal and stemming, we may have 10,000+ dimensions, or even 1,000,000+
