
Department of Computer Engineering                              Exp. No. 1

Semester                      B.E. Semester VIII – Computer Engineering
Subject                       NLP
Subject Professor In-charge   Prof. Mohini Kamat
Assisting Teachers            Prof. Mohini Kamat
Laboratory                    M312B
Student Name                  Ashish Patil
Roll Number                   16102B0010
Grade and Subject Teacher’s Signature
Experiment Number             1
Experiment Title              Preprocessing of Text
Resources Required            Hardware: Computer system; Programming Language: Python

Theory Text pre-processing is traditionally an important step for
natural language processing (NLP) tasks. It transforms text
into a more digestible form so that machine learning
algorithms can perform better.
Generally, there are 3 main components:
 Tokenization
 Normalization
 Noise removal

Tokenization is about splitting strings of text into smaller
pieces, or “tokens”. Paragraphs can be tokenized into
sentences and sentences can be tokenized into words.
Normalization aims to put all text on a level playing field,
e.g., converting all characters to lowercase.
Noise removal cleans up the text, e.g., removing extra
whitespaces.
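For example, a minimal tokenization sketch using NLTK (this assumes the
nltk package and its ‘punkt’ tokenizer data are installed; the sample
sentence is made up):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Text preprocessing is important. It helps models perform better."
sentences = sent_tokenize(text)                # paragraph -> sentences
words = [word_tokenize(s) for s in sentences]  # each sentence -> words
print(sentences)
print(words)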

List of Text Preprocessing Steps:


 Filtration
 Script Validation
 Stop Word Removal
 Expand contractions
 Remove HTML tags
 Remove extra whitespaces
 Convert number words to numeric form
 Remove numbers
 Lemmatization
 Lowercase all texts
 Remove special characters
 Convert accented characters to ASCII characters

Filtration is about eliminating the punctuation marks of a
sentence.
Script Validation refers to the removal of lines written in another
language or script from a passage, e.g., deleting all the
English statements from a Marathi passage or vice-versa.
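As a rough sketch of script validation (the Unicode range and the sample
lines below are illustrative assumptions), lines containing Devanagari
characters can be dropped from an otherwise English passage:

import re

DEVANAGARI = re.compile(r'[\u0900-\u097F]')   # Devanagari block used by Marathi
lines = ["This line is in English.", "ही ओळ मराठीत आहे."]
english_only = [line for line in lines if not DEVANAGARI.search(line)]
print(english_only)   # ['This line is in English.']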
Stop Word Removal: Stop-words are very common words.
Words like “we” and “are” probably do not help at all in NLP
tasks such as sentiment analysis or text classification.
Hence, we can remove stop-words to save computing time
and effort in processing large volumes of text.
Expanding Contractions: Contractions are shortened words,
e.g., don’t and can’t. Expanding such words to “do not” and
“cannot” helps to standardize text.
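One possible sketch, assuming the third-party contractions package is
installed (a hand-made dictionary, as used in the program below, works
equally well for a fixed set of contractions):

import contractions   # assumption: pip install contractions

print(contractions.fix("don't"))          # do not
print(contractions.fix("you're happy"))   # you are happy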
Remove HTML tags: If the reviews or texts are web scraped,
chances are they will contain some HTML tags. Since these
tags are not useful for our NLP tasks, it is better to remove
them.
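A simple sketch that strips tags with a regular expression (for messy
markup an HTML parser such as BeautifulSoup and its get_text() method is
more robust; the sample string is made up):

import re

html = "<p>This <b>review</b> was scraped from the web.</p>"
clean = re.sub(r'<[^>]+>', '', html)   # drop anything that looks like a tag
print(clean)                           # This review was scraped from the web.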
Convert accented characters to ASCII characters: Words with
accent marks like “latté” and “café” can be converted and
standardized to just “latte” and “cafe”, else our NLP model
will treat “latté” and “latte” as different words even though
they are referring to the same thing. To do this, we use the
module unidecode.
Convert number words into numeric form: This step
involves the conversion of number words to numeric form,
e.g., seven to 7, to standardize text.
Remove numbers: Removing numbers may make sense for
sentiment analysis since numbers contain no information
about sentiments. However, if our NLP task is to extract the
number of tickets ordered in a message to our chatbot, we
will definitely not want to remove numbers.
Lemmatization: Lemmatization is the process of converting
a word to its base form, e.g., “caring” to “care”.
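With NLTK’s WordNetLemmatizer the part of speech matters; a small sketch
(assuming the WordNet corpus has been downloaded with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("caring"))            # 'caring' (POS defaults to noun)
print(lemmatizer.lemmatize("caring", pos="v"))   # 'care'   (treated as a verb)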

Program from nltk.tokenize import sent_tokenize, word_tokenize
import string
import re
import nltk
from nltk.corpus import stopwords
import unidecode
from word2number import w2n
from nltk.stem import WordNetLemmatizer
s = input("Enter a sentence: ")

print("TOKENIZATION ")
token = [w for w in s.split()]
print(token,end='\n\n')

contractions = {
    "can't": "cannot",
    "It's": "It is",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "he'd": "he had",
    "he'll": "he shall",
    "how'd": "how did",
    "I'd": "I would",
    "you've": "you have"
}

l = []
print("REMOVING CONTRACTIONS")
for word in token:
    if word in contractions.keys():
        l.append(contractions[word])
    else:
        l.append(word)

s = ' '.join(l)
print(s.split(),end='\n\n')

print("FILTRATION")
s = s.translate(str.maketrans('', '', string.punctuation))
print(s.split(),end='\n\n')

valid=''
print("SCRIPT VALIDATION")
for word in s.split():
    if re.match('[A-Za-z]',word):
        valid=valid+word+' '
print(valid.split(),end='\n\n')
s=valid
print("REMOVING ACCENTED WORDS")
s = unidecode.unidecode(s)
print(s.split(),end='\n\n')

print("LOWER CASE WORDS")
s = s.lower()
print(s.split(),end='\n\n')

print("CONVERTING NUMBER WORDS TO NUMERIC FORM")
processed_text = []
for word in s.split():
    try:
        processed_text.append(str(w2n.word_to_num(word)))
    except ValueError:
        processed_text.append(word)

s = ' '.join(processed_text)
print(s.split(),end='\n\n')

print("REMOVING NUMBERS : ")
processed_text = [word for word in processed_text
                  if not word.isnumeric()]
s = ' '.join(processed_text)
print(s.split(),end='\n\n')

print("REMOVING STOP WORDS")
stop_words = set(stopwords.words('english'))
s = ' '.join([w for w in s.split() if not w in stop_words])
print(s.split(),end='\n\n')

print("LEMMATIZED WORDS")
lema = []
lemmatizer = WordNetLemmatizer()
for word in s.split():
    lema.append(lemmatizer.lemmatize(word))

print(lema)
Output

Conclusion 1) A regular expression is a sequence of characters mainly used
to find and replace patterns in a string or file.

2) The most common uses of regular expressions are:

 Search a string (search and match)
 Find all occurrences of a pattern (findall)
 Break a string into substrings (split)
 Replace part of a string (sub)

3) The ‘re’ module provides multiple methods to perform
queries on an input string. Here are the most commonly used
methods (a short usage sketch follows the list):

1. re.match()
2. re.search()
3. re.findall()
4. re.split()
5. re.sub()
6. re.compile()
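
A short usage sketch of these methods (the sample text and patterns are
illustrative only):

import re

text = "Exp 1: Preprocessing of Text, Exp 2: POS Tagging"
pattern = re.compile(r'\d+')             # re.compile(): build a reusable pattern object

print(re.match(r'Exp', text))            # re.match(): matches only at the start of the string
print(re.search(r'\d+', text).group())   # re.search(): first match anywhere -> '1'
print(pattern.findall(text))             # re.findall(): all matches -> ['1', '2']
print(re.split(r',\s*', text))           # re.split(): split the string on a pattern
print(re.sub(r'\d+', '#', text))         # re.sub(): replace every match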
