PREPROCESSING IN IR
RIDA HAFEEZ

IMPORTANCE

What is Preprocessing?
Data preprocessing is a data mining technique that transforms raw data
into a clean, understandable format.
Why Preprocessing?
Without preprocessing, we are left with:
Noisy data
Irrelevant features
Inaccurate analysis
Inefficient results

TECHNIQUES

Tokenization
Lowercase conversion
Special character removal
Stop Word Removal
Stemming
Treating synonyms
Spell check
Noun Phrase Extraction

TOOLS

Stanford NLP:
If you want to work in Java
http://nlp.stanford.edu/software/
NLTK:
If you want to work in Python
http://www.nltk.org/

TOKENIZATION

Splitting text into tokens on the basis of some delimiter (space, comma, period, etc.).
How to do it?
You can do it in Java as:
String[] tokens = text.split(delimiter);
Note that split() treats the delimiter as a regular expression.
Or you can use the Stanford NLP tool in Java,
or NLTK in Python.
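
As a minimal sketch of delimiter-based splitting, the following uses only Python's standard re module rather than NLTK or Stanford NLP. The delimiter set (whitespace, comma, semicolon, period) and the function name tokenize are assumptions chosen for illustration; a real tokenizer handles many more cases.

```python
import re

def tokenize(text):
    """Split text on runs of whitespace, commas, semicolons, and periods."""
    # re.split can return empty strings at the edges, so filter them out.
    return [token for token in re.split(r"[\s,;.]+", text) if token]

tokens = tokenize("Splitting text, on delimiters; is simple.")
print(tokens)
```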

LOWERCASE CONVERSION

Convert all letters to lower case so that "JAVA" and "java" are treated as the same word.

How to do it?

It can be done in Java as follows:

String lower = text.toLowerCase();
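
The same step in Python is sketched below, assuming the text has already been split into a list of tokens (the sample tokens are made up). After lowercasing, differently cased variants collapse into one term.

```python
# Sample tokens that differ only in letter case (illustrative data).
tokens = ["JAVA", "Java", "java"]

# str.lower() returns a lowercased copy of each token.
lowered = [token.lower() for token in tokens]

# All three variants now map to the same term.
unique_terms = set(lowered)
print(unique_terms)
```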

SPECIAL CHARACTER
REMOVAL

Remove non-alphanumeric characters (@, #, %, &) or numeric characters (1234), according to your requirements.
How?
You can use regular expressions in Java to remove unwanted characters as needed:
String result = yourString.replaceAll("[-+.^:,]","");

The following link can help you write regular expressions:
http://www.regular-expressions.info/refunicode.html
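
A comparable sketch in Python uses re.sub. The function name strip_special and the keep_digits flag are hypothetical, and the patterns assume plain ASCII text; which characters count as "useless" depends on the application.

```python
import re

def strip_special(text, keep_digits=True):
    """Remove every character that is not a letter, whitespace, or (optionally) a digit."""
    # A negated character class matches exactly the characters to delete.
    pattern = r"[^A-Za-z0-9\s]" if keep_digits else r"[^A-Za-z\s]"
    return re.sub(pattern, "", text)

cleaned = strip_special("e-mail: user@example.com #1234")
print(cleaned)
```

Passing keep_digits=False also drops numeric characters, which matches the second option described above.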
STOP WORD REMOVAL

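Removing words that carry little meaning from the text, e.g., in, of, but, on, at.

How?
You can do it in Java, or in Python using NLTK:
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once
stop_words = set(stopwords.words('english'))  # build the set once, not per word
# keep only the words that are not stop words
filtered_word_list = [word for word in word_list if word not in stop_words]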
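
For a version that runs without downloading any corpus, the sketch below substitutes a tiny hand-written stop list (just the five example words from this slide) for NLTK's English list. The names STOP_WORDS and remove_stop_words are assumptions for illustration.

```python
# Hand-written stop list taken from the slide's examples; a real list is much longer.
STOP_WORDS = {"in", "of", "but", "on", "at"}

def remove_stop_words(tokens):
    """Return the tokens that are not stop words (comparison is case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["retrieval", "of", "documents", "in", "Java"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Building a new list avoids modifying a list while iterating over it, which can silently skip elements.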

STEMMING

Normalize words to their root form, e.g., "connected" and "connecting" both become "connect".

It can be done with both Stanford NLP and NLTK. I could not find any
efficient Java library for it.
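
To make the idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter algorithm or any stemmer shipped with NLTK or Stanford NLP; the suffix list, the minimum stem length, and the name naive_stem are all assumptions, and the function will mis-stem many real words.

```python
# Toy suffix list, checked in this order (an assumption, not a standard rule set).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connected", "connecting", "connects", "connect"]
stems = [naive_stem(word) for word in words]
print(stems)
```

For real use, an established stemmer such as the ones provided by the tools above is the safer choice.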
TREATING SYNONYMS

Replace one word with another if their meanings are the same. This increases
the frequency of the retained word in the document.
How?

Use the WordNet dictionary in Java or Python.

https://wordnet.princeton.edu/wordnet/documentation/
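
The effect on term frequency can be sketched with a small hand-written synonym table standing in for a WordNet lookup. The table CANONICAL, its entries, and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical synonym table: each variant maps to one canonical term.
CANONICAL = {"automobile": "car", "auto": "car", "film": "movie"}

def normalize_synonyms(tokens):
    """Replace each token by its canonical synonym, if one is listed."""
    return [CANONICAL.get(token, token) for token in tokens]

tokens = ["car", "automobile", "film", "auto", "movie"]
counts = Counter(normalize_synonyms(tokens))
print(counts)
```

Before normalization every token occurs once; afterwards "car" is counted three times and "movie" twice.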

SPELL CHECK

Discard a word if it is not present in the dictionary, i.e., it is a meaningless word.

How?
Again, you can use WordNet if you are working with an English dataset.
If a word is present in WordNet, keep it; otherwise, discard it.
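
The keep-or-discard rule amounts to a set membership test. In the sketch below a three-word set stands in for the WordNet vocabulary; VOCABULARY and keep_known_words are hypothetical names.

```python
# Tiny stand-in for a real dictionary such as WordNet (illustrative only).
VOCABULARY = {"information", "retrieval", "system"}

def keep_known_words(tokens, vocabulary=VOCABULARY):
    """Keep only tokens found in the vocabulary (comparison is case-insensitive)."""
    return [token for token in tokens if token.lower() in vocabulary]

tokens = ["information", "retreival", "system", "xqzt"]
known = keep_known_words(tokens)
print(known)
```

Note that this discards misspelled words such as "retreival" rather than correcting them.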

NOUN EXTRACTION

Extract single-word nouns from the text data.

How?
You can use a POS (part-of-speech) tagger in both Stanford NLP and NLTK.
Extract the words tagged as nouns.
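
The filtering step can be sketched without running a tagger, assuming the tagger output is a list of (word, tag) pairs with Penn Treebank tags, which is the shape nltk.pos_tag returns. The tagged sentence below is written by hand for illustration, and extract_nouns is a hypothetical helper.

```python
# Hand-tagged sample in (word, Penn Treebank tag) form; not produced by a real tagger.
tagged = [
    ("the", "DT"),
    ("crawler", "NN"),
    ("indexes", "VBZ"),
    ("documents", "NNS"),
    ("quickly", "RB"),
]

def extract_nouns(tagged_tokens):
    """Keep words whose tag starts with NN (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

nouns = extract_nouns(tagged)
print(nouns)
```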

NOUN PHRASE EXTRACTION

Extract bi-grams or tri-grams from the text.

How?
Do it using NLTK; Stanford NLP is not efficient for this.
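
Bi-grams and tri-grams can also be produced with plain Python, as in this sketch. The helper name ngrams and the sample tokens are assumptions; it returns every run of n consecutive tokens and does not check whether a run is actually a noun phrase.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest shifted copy, so only full n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["information", "retrieval", "system", "design"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```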

THE END

Thanks for your kind attention
