IN IR
RIDA HAFEEZ
IMPORTANCE
What is Preprocessing?
Data preprocessing is a data mining
technique that involves transforming raw data
into an understandable/clean format.
Why Preprocessing?
Without preprocessing, we end up with:
Noisy data
Irrelevant features
Inaccurate analysis
Inefficient results
TECHNIQUES
Tokenization
Lowercase conversion
Special character removal
Stop Word Removal
Stemming
Treating synonyms
Spell check
Noun Phrase Extraction
TOOLS
Stanford NLP:
If you want to work in Java
http://nlp.stanford.edu/software/
NLTK:
If you want to work in Python
http://www.nltk.org/
TOKENIZATION
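A minimal sketch of tokenization in Python, assuming a simple regular-expression rule instead of any particular NLTK tokenizer. The names `TOKEN_PATTERN` and `tokenize` are hypothetical, chosen only for illustration. NLTK's `word_tokenize` is a common alternative, but it needs its tokenizer models downloaded first.

```python
import re

# Assumed token rule: a run of word characters, optionally with internal apostrophes.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*")

def tokenize(text):
    """Split raw text into word tokens, dropping punctuation and whitespace."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Information retrieval isn't magic, it's preprocessing!")
```

A hand-written rule like this is easy to adjust per collection, but it does not handle cases such as abbreviations or hyphenated terms the way a trained tokenizer might.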
LOWERCASE CONVERSION
How?
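One possible way to do this in Python is the built-in `str.lower` method applied to each token. The function name `to_lowercase` is hypothetical.

```python
def to_lowercase(tokens):
    """Return a new list with every token converted to lowercase."""
    return [token.lower() for token in tokens]

lowered = to_lowercase(["Information", "Retrieval", "IR"])
```

Lowercasing makes "Retrieval" and "retrieval" match the same index term, at the cost of losing case distinctions that sometimes carry meaning (for example, acronyms).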
SPECIAL CHARACTER REMOVAL
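A minimal sketch of special character removal, assuming that "special" means anything other than ASCII letters, digits, and whitespace. That definition, and the names `NON_ALNUM` and `remove_special_characters`, are assumptions made for illustration.

```python
import re

# Assumption: every character that is not an ASCII letter, digit, or whitespace is "special".
NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]")

def remove_special_characters(text):
    """Replace special characters with spaces, then collapse repeated whitespace."""
    cleaned = NON_ALNUM.sub(" ", text)
    return " ".join(cleaned.split())

cleaned_text = remove_special_characters("E-mail: user@example.com (urgent!!)")
```

Replacing with a space instead of deleting keeps "user@example" from fusing into a single token. A collection with non-English text would need a wider character class.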
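STOP WORD REMOVAL
How?
You can do it in Java, or in Python using NLTK.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
# Build a new list; removing items from a list while iterating over it skips elements.
filtered_word_list = [word for word in word_list if word not in stop_words]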
STEMMING
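A minimal sketch of stemming in Python, assuming NLTK's `PorterStemmer` as the default algorithm. The slides do not say which stemmer was intended, so that choice is an assumption, and `stem_tokens` is a hypothetical helper name. The stemmer is passed in as a parameter so that another one can be swapped in.

```python
def stem_tokens(tokens, stemmer=None):
    """Reduce each token to its stem using an object that has a .stem(word) method."""
    if stemmer is None:
        # Assumption: Porter stemming; needs the nltk package but no corpus download.
        from nltk.stem import PorterStemmer
        stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]
```

Calling `stem_tokens(["running", "connections"])` with NLTK installed applies the Porter rules to each token. Stems are not always dictionary words, which is fine for matching index terms but not for display.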
SPELL CHECK
How?
You can use WordNet if you are working with an English database. If a word is
present in WordNet, keep it; otherwise, discard it.
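The keep-or-discard rule above can be sketched as follows, assuming NLTK's WordNet interface and that the `wordnet` corpus has already been downloaded. The helper names `in_wordnet` and `keep_known_words` are hypothetical.

```python
def in_wordnet(word):
    """True if WordNet has at least one synset for the word."""
    # Imported lazily; assumes nltk is installed and nltk.download('wordnet') was run.
    from nltk.corpus import wordnet
    return bool(wordnet.synsets(word))

def keep_known_words(words, is_known=in_wordnet):
    """Keep only the words that the lookup function recognises."""
    return [word for word in words if is_known(word)]
```

Note that this filter drops anything WordNet does not list, including proper names and domain-specific terms, so it may be too aggressive for some collections.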
NOUN EXTRACTION
How?
You can use a part-of-speech (POS) tagger, available in both Stanford NLP and
NLTK. Extract the words tagged as nouns.
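A minimal sketch of the filtering step, assuming the tokens have already been tagged with Penn Treebank tags (the tag set that NLTK's `pos_tag` returns by default). The function name `extract_nouns` and the hand-tagged sample are hypothetical.

```python
def extract_nouns(tagged_tokens):
    """Keep words whose Penn Treebank tag starts with 'NN' (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

# Hand-tagged sample for illustration; in practice the (word, tag) pairs
# would come from a tagger such as nltk.pos_tag(tokens).
tagged = [("The", "DT"), ("crawler", "NN"), ("indexes", "VBZ"),
          ("web", "NN"), ("pages", "NNS")]
nouns = extract_nouns(tagged)
```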
NOUN PHRASE EXTRACTION
How?
Use NLTK for this; Stanford NLP is not efficient for it.
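A simplified sketch of noun phrase extraction over POS-tagged tokens, assuming a noun phrase is any run of adjectives followed by one or more nouns. This is a plain-Python stand-in written for illustration, not NLTK's own chunker. With NLTK, a similar rule could be given to `nltk.RegexpParser` as a chunk grammar such as `NP: {<JJ>*<NN.*>+}`. The function name `extract_noun_phrases` is hypothetical.

```python
def extract_noun_phrases(tagged_tokens):
    """Collect phrases matching the assumed pattern: adjectives (JJ*) then nouns (NN*)."""
    phrases, current, has_noun = [], [], False
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ") and not has_noun:
            current.append(word)
        else:
            # The pattern is broken: emit the phrase only if it contained a noun.
            if has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
            if tag.startswith("JJ"):
                current.append(word)  # an adjective may start the next phrase
    if has_noun:
        phrases.append(" ".join(current))
    return phrases

tagged = [("The", "DT"), ("fast", "JJ"), ("web", "NN"), ("crawler", "NN"),
          ("indexes", "VBZ"), ("new", "JJ"), ("pages", "NNS")]
noun_phrases = extract_noun_phrases(tagged)
```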
THE END