
PREPROCESSING IN IR
RIDA HAFEEZ

IMPORTANCE

What is preprocessing?
Data preprocessing is a data mining
technique that transforms raw data
into a clean, understandable format.
Why preprocess?
Without preprocessing we are left with:
Noisy data
Irrelevant features
Inaccurate analysis
Inefficient results

TECHNIQUES

Tokenization
Lowercase conversion
Special character removal
Stop Word Removal
Stemming
Treating synonyms
Spell check
Noun Phrase Extraction

TOOLS

Stanford NLP:
If you want to work in Java
http://nlp.stanford.edu/software/
NLTK:
If you want to work in Python
http://www.nltk.org/

TOKENIZATION

Splitting text into tokens on a delimiter (space, comma, period, etc.).
How to do it?
In Java, you can use the following (the argument is treated as a regular expression):
String[] tokens = string.split(delimiter);
Or you can use the Stanford NLP tool in Java,
or NLTK in Python.
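
As a minimal sketch of delimiter-based tokenization in plain Python (no NLTK), the snippet below splits on any of a set of delimiter characters. The function name `tokenize` and the default delimiter set are assumptions chosen for illustration, not part of any library.

```python
import re

def tokenize(text, delimiters=" ,."):
    """Split text on any of the given delimiter characters, dropping empty tokens."""
    # Build a character class such as [ ,.]+ ; re.escape guards special characters.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [token for token in re.split(pattern, text) if token]

print(tokenize("Information retrieval, in short IR."))
# ['Information', 'retrieval', 'in', 'short', 'IR']
```

Splitting on a run of delimiters (the trailing `+`) avoids empty tokens between a comma and the space that follows it.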

LOWERCASE CONVERSION

Convert all letters to lowercase so that JAVA
and java are treated as the same word.

How to do it?

In Java:
String lower = yourString.toLowerCase();
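
The effect on term counts can be sketched in Python as follows. The helper name `count_terms` is hypothetical; `str.lower` and `collections.Counter` are standard library.

```python
from collections import Counter

def count_terms(tokens):
    """Count tokens after lowercasing so case variants share one entry."""
    return Counter(token.lower() for token in tokens)

counts = count_terms(["JAVA", "java", "Java", "Python"])
print(counts["java"])  # 3
```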

SPECIAL CHARACTER
REMOVAL

Remove non-alphanumeric characters (@, #, %, &) or numeric
characters (1234), depending on your requirements.
How?
In Java, you can use a regular expression to remove
unwanted characters as needed:
String result = yourString.replaceAll("[-+.^:,]","");

The following link explains how to write regular expressions:
http://www.regular-expressions.info/refunicode.html
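
A comparable sketch in Python uses `re.sub` with a negated character class, so everything outside an allowed set is dropped. The function name `remove_special_characters` and the `keep_digits` flag are assumptions for illustration, and the pattern only covers ASCII letters.

```python
import re

def remove_special_characters(text, keep_digits=True):
    """Drop every character that is not an ASCII letter, whitespace, or (optionally) a digit."""
    allowed = r"a-zA-Z0-9\s" if keep_digits else r"a-zA-Z\s"
    cleaned = re.sub("[^" + allowed + "]", "", text)
    # Collapse the runs of whitespace left behind by removed characters.
    return " ".join(cleaned.split())

print(remove_special_characters("C# & Java @ 100%!"))                     # C Java 100
print(remove_special_characters("C# & Java @ 100%!", keep_digits=False))  # C Java
```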
STOP WORD REMOVAL

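Removing words that carry little meaning from the text, e.g.,
in, of, but, on, at.

How?
You can do it in Java or in Python using NLTK:
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_word_list = [word for word in word_list if word not in stop_words]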

STEMMING

Normalize words to their root form.

This can be done in both Stanford NLP and NLTK. I
could not find an efficient standalone Java library for
this.
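
To make the idea concrete, here is a deliberately crude suffix-stripping sketch in plain Python. It is not the Porter algorithm that NLTK provides, and the name `naive_stem` and its suffix list are assumptions for illustration only. It over-stems some words (for example, "this" becomes "thi"), which is why a real stemmer is preferable in practice.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["jumping", "jumped", "jumps", "boxes", "is"]:
    print(word, "->", naive_stem(word))
```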
TREATING SYNONYMS

Replace one word with another if their meanings are the
same. This increases the frequency of the chosen word
in the document.
How?

Use the WordNet dictionary in Java or Python.
https://wordnet.princeton.edu/wordnet/documentation/
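
The replacement step itself can be sketched with a plain lookup table. The table `SYNONYMS` below is a hypothetical, hand-written stand-in; in practice the groups would be derived from a resource such as WordNet.

```python
from collections import Counter

# Hypothetical synonym table: each variant maps to one canonical term.
SYNONYMS = {"automobile": "car", "auto": "car", "film": "movie"}

def normalize_synonyms(tokens, synonyms=SYNONYMS):
    """Replace each token by its canonical term; unknown tokens pass through unchanged."""
    return [synonyms.get(token, token) for token in tokens]

tokens = ["car", "automobile", "film", "auto"]
print(Counter(normalize_synonyms(tokens)))  # "car" is now counted three times
```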

SPELL CHECK

Discard a word if it is not present in the dictionary,
i.e., treat it as a meaningless word.

How?
Again, you can use WordNet if you are working
with English text. If the word is present in
WordNet, keep it; otherwise discard it.
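
The keep-or-discard rule can be sketched as a set-membership filter. Here `vocabulary` is a small hypothetical word list standing in for a WordNet lookup, and `keep_known_words` is an illustrative name.

```python
def keep_known_words(tokens, vocabulary):
    """Keep only tokens found in the vocabulary, compared case-insensitively."""
    known = {word.lower() for word in vocabulary}
    return [token for token in tokens if token.lower() in known]

vocabulary = ["search", "engine", "index"]  # stand-in for a real dictionary
print(keep_known_words(["Search", "engin", "index", "xqzt"], vocabulary))
# ['Search', 'index']
```

Note that this discards misspelled words rather than correcting them, which matches the rule described on the slide.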

NOUN EXTRACTION

Extract individual nouns from the text.

How?
You can use a POS (part-of-speech) tagger in both
Stanford NLP and NLTK. Extract the words tagged as
nouns.

NOUN PHRASE EXTRACTION

Extracting bigrams or trigrams from the text.

How?
Do it using NLTK; Stanford NLP is not efficient for
this.
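
Generating bigrams and trigrams from a token list needs only a sliding window, sketched here in plain Python. The function name `ngrams` is chosen for illustration; this version returns all consecutive n-grams and does not filter them to noun phrases.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["information", "retrieval", "system"]
print(ngrams(tokens, 2))  # [('information', 'retrieval'), ('retrieval', 'system')]
print(ngrams(tokens, 3))  # [('information', 'retrieval', 'system')]
```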

THE END

Thanks for your kind attention
