
Department of Computer Engineering                              Exp. No. 1

Semester                      B.E. Semester VIII – Computer Engineering
Subject                       NLP
Subject Professor In-charge   Prof. Mohini Kamat
Assisting Teachers            Prof. Mohini Kamat
Laboratory                    M312B
Student Name                  Ashish Patil
Roll Number                   16102B0010
Grade and Subject Teacher’s Signature
Experiment Number             1
Experiment Title              Preprocessing of Text
Resources Required            Hardware: Computer system; Programming Language: Python

Theory Text pre-processing is traditionally an important step for
natural language processing (NLP) tasks. It transforms text
into a more digestible form so that machine learning
algorithms can perform better.
Generally, there are 3 main components:
 Tokenization
 Normalization
 Noise removal

Tokenization is about splitting strings of text into smaller
pieces, or “tokens”. Paragraphs can be tokenized into
sentences and sentences can be tokenized into words.
Normalization aims to put all text on a level playing field,
e.g., converting all characters to lowercase.
Noise removal cleans up the text, e.g., removing extra
whitespaces.
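For example, a minimal tokenization sketch using NLTK (this assumes the
nltk package and its ‘punkt’ tokenizer data are installed; the sample
sentence is made up):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Text preprocessing is important. It helps models perform better."
sentences = sent_tokenize(text)                # paragraph -> sentences
words = [word_tokenize(s) for s in sentences]  # each sentence -> words
print(sentences)
print(words)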

List of Text Preprocessing Steps:


 Filtration
 Script Validation
 Stop Word Removal
 Expand contractions
 Remove HTML tags
 Remove extra whitespaces
 Convert number words to numeric form
 Remove numbers
 Lemmatization
 Lowercase all texts
 Remove special characters
 Convert accented characters to ASCII characters

Filtration is about eliminating the punctuation marks of a
sentence.
Script Validation refers to the removal of lines written in another
language or script from a passage, e.g., deleting all the
English statements from a Marathi passage or vice-versa.
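As a rough sketch of script validation (the Unicode range and the sample
lines below are illustrative assumptions), lines containing Devanagari
characters can be dropped from an otherwise English passage:

import re

DEVANAGARI = re.compile(r'[\u0900-\u097F]')   # Devanagari block used by Marathi
lines = ["This line is in English.", "ही ओळ मराठीत आहे."]
english_only = [line for line in lines if not DEVANAGARI.search(line)]
print(english_only)   # ['This line is in English.']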
Stop Word Removal: Stop-words are very common words.
Words like “we” and “are” probably do not help at all in NLP
tasks such as sentiment analysis or text classification.
Hence, we can remove stop-words to save computing time
and effort in processing large volumes of text.
Expanding Contractions: Contractions are shortened words,
e.g., don’t and can’t. Expanding such words to “do not” and
“cannot” helps to standardize text.
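One possible sketch, assuming the third-party contractions package is
installed (a hand-made dictionary, as used in the program below, works
equally well for a fixed set of contractions):

import contractions   # assumption: pip install contractions

print(contractions.fix("don't"))          # do not
print(contractions.fix("you're happy"))   # you are happy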
Remove HTML tags: If the reviews or texts are web scraped,
chances are they will contain some HTML tags. Since these
tags are not useful for our NLP tasks, it is better to remove
them.
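A simple sketch that strips tags with a regular expression (for messy
markup an HTML parser such as BeautifulSoup and its get_text() method is
more robust; the sample string is made up):

import re

html = "<p>This <b>review</b> was scraped from the web.</p>"
clean = re.sub(r'<[^>]+>', '', html)   # drop anything that looks like a tag
print(clean)                           # This review was scraped from the web.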
Convert accented characters to ASCII characters: Words with
accent marks like “latté” and “café” can be converted and
standardized to just “latte” and “cafe”, else our NLP model
will treat “latté” and “latte” as different words even though
they are referring to the same thing. To do this, we use the
module unidecode.
Convert number words into numeric form: This step
involves the conversion of number words to numeric form,
e.g., seven to 7, to standardize text.
Remove numbers: Removing numbers may make sense for
sentiment analysis since numbers contain no information
about sentiments. However, if our NLP task is to extract the
number of tickets ordered in a message to our chatbot, we
will definitely not want to remove numbers.
Lemmatization: Lemmatization is the process of converting
a word to its base form, e.g., “caring” to “care”.
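With NLTK’s WordNetLemmatizer the part of speech matters; a small sketch
(assuming the WordNet corpus has been downloaded with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("caring"))            # 'caring' (POS defaults to noun)
print(lemmatizer.lemmatize("caring", pos="v"))   # 'care'   (treated as a verb)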

Program from nltk.tokenize import sent_tokenize, word_tokenize
import string
import re
import nltk
from nltk.corpus import stopwords
import unidecode
from word2number import w2n
from nltk.stem import WordNetLemmatizer
s = input("Enter a sentence: ")

print("TOKENIZATION ")
token = [w for w in s.split()]
print(token,end='\n\n')

contractions = {
    "can't": "cannot",
    "It's": "It is",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "he'd": "he had",
    "he'll": "he shall",
    "how'd": "how did",
    "I'd": "I would",
    "you've": "you have"
}

l = []
print("REMOVING CONTRACTIONS")
for word in token:
    if word in contractions.keys():
        l.append(contractions[word])
    else:
        l.append(word)

s = ' '.join(l)
print(s.split(),end='\n\n')

print("FILTRATION")
s = s.translate(str.maketrans('', '', string.punctuation))
print(s.split(),end='\n\n')

valid=''
print("SCRIPT VALIDATION")
for word in s.split():
    if re.match('[A-Za-z]',word):
        valid=valid+word+' '
print(valid.split(),end='\n\n')
s=valid
print("REMOVING ACCENTED WORDS")
s = unidecode.unidecode(s)
print(s.split(),end='\n\n')

print("LOWER CASE WORDS")
s = s.lower()
print(s.split(),end='\n\n')

print("CONVERTING NUMBER WORDS TO NUMERIC FORM")
processed_text = []
for word in s.split():
    try:
        processed_text.append(str(w2n.word_to_num(word)))
    except ValueError:
        processed_text.append(word)

s = ' '.join(processed_text)
print(s.split(),end='\n\n')

print("REMOVING NUMBERS : ")
processed_text = [word for word in processed_text
                  if not word.isnumeric()]
s = ' '.join(processed_text)
print(s.split(),end='\n\n')

print("REMOVING STOP WORDS")
stop_words = set(stopwords.words('english'))
s = ' '.join([w for w in s.split() if not w in stop_words])
print(s.split(),end='\n\n')

print("LEMMATIZED WORDS")
lema = []
lemmatizer = WordNetLemmatizer()
for word in s.split():
    lema.append(lemmatizer.lemmatize(word))

print(lema)
Output

Conclusion 1) A regular expression is a sequence of characters mainly used
to find and replace patterns in a string or file.

2) The most common uses of regular expressions are:

 Search a string (search and match)
 Find all occurrences of a pattern (findall)
 Break a string into substrings (split)
 Replace part of a string (sub)

3) The ‘re’ module provides multiple methods to perform
queries on an input string. Here are the most commonly used
methods (a short usage sketch follows the list):

1. re.match()
2. re.search()
3. re.findall()
4. re.split()
5. re.sub()
6. re.compile()
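
A short usage sketch of these methods (the sample text and patterns are
illustrative only):

import re

text = "Exp 1: Preprocessing of Text, Exp 2: POS Tagging"
pattern = re.compile(r'\d+')             # re.compile(): build a reusable pattern object

print(re.match(r'Exp', text))            # re.match(): matches only at the start of the string
print(re.search(r'\d+', text).group())   # re.search(): first match anywhere -> '1'
print(pattern.findall(text))             # re.findall(): all matches -> ['1', '2']
print(re.split(r',\s*', text))           # re.split(): split the string on a pattern
print(re.sub(r'\d+', '#', text))         # re.sub(): replace every match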
