Experiment Number 1
Filtration: Filtration is about eliminating the punctuation marks of a sentence.
Script Validation: Script validation refers to the removal of lines written in another language from a passage, e.g., deleting all the English statements from a Marathi passage or vice versa.
Stop Word Removal: Stop words are very common words. Words like “we” and “are” probably do not help at all in NLP tasks such as sentiment analysis or text classification. Hence, we can remove stop words to save computing time and effort in processing large volumes of text.
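Since this step does not appear in the listing below, here is a minimal sketch using NLTK's English stop-word list (assumes the stopwords corpus has been downloaded):

# A minimal sketch of stop-word removal with NLTK's stop-word list
# (run nltk.download('stopwords') once beforehand).
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
text = "we are learning how stop words are removed"
filtered = [w for w in text.split() if w.lower() not in stop_words]
print(filtered)   # ['learning', 'stop', 'words', 'removed']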
Expanding Contractions: Contractions are shortened words,
e.g., don’t and can’t. Expanding such words to “do not” and
“cannot” helps to standardize text.
Remove HTML tags: If the reviews or texts are web scraped, chances are they will contain some HTML tags. Since these tags are not useful for our NLP tasks, it is better to remove them.
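This step is also not in the listing below; a minimal sketch with a regular expression (for real-world HTML, a parser such as BeautifulSoup's get_text() is more robust):

# A minimal sketch: stripping HTML tags with a regular expression.
import re

html = "<p>The movie was <b>great</b>!</p>"
clean = re.sub(r'<[^>]+>', '', html)
print(clean)   # The movie was great!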
Remove accented characters: Words with accent marks like “latté” and “café” can be converted and standardized to just “latte” and “cafe”; otherwise our NLP model will treat “latté” and “latte” as different words even though they refer to the same thing. To do this, we use the unidecode module.
Convert number words into numeric form: This step involves the conversion of number words to numeric form, e.g., “seven” to 7, to standardize text.
Remove numbers: Removing numbers may make sense for sentiment analysis, since numbers contain no information about sentiment. However, if our NLP task is to extract the number of tickets ordered in a message to our chatbot, we will definitely not want to remove numbers.
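A minimal sketch of this step using re.sub (the sample message is assumed):

# Delete digit runs, then collapse the leftover whitespace.
import re

text = "I ordered 2 tickets at 7 pm"
no_numbers = re.sub(r'\d+', '', text)
no_numbers = re.sub(r'\s+', ' ', no_numbers).strip()
print(no_numbers)   # I ordered tickets at pm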
Lemmatization: Lemmatization is the process of converting a word to its base form, e.g., “caring” to “care”.
print("TOKENIZATION ")
token = [w for w in s.split()]
print(token,end='\n\n')
contractions = {
    "can't": "cannot",
    "It's": "It is",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "he'd": "he had",
    "he'll": "he shall",
    "how'd": "how did",
    "I'd": "I would",
    "you've": "you have"
}
l = []
print("REMOVING CONTRACTIONS")
for word in token:
    if word in contractions:
        l.append(contractions[word])
    else:
        l.append(word)
s = ' '.join(l)
print(s.split(), end='\n\n')
print("FILTERATION")
s = s.translate(str.maketrans('', '',
string.punctuation))
print(s.split(),end='\n\n')
valid = ''
print("SCRIPT VALIDATION")
for word in s.split():
    if re.match('[A-Za-z]', word):
        valid = valid + word + ' '
print(valid.split(), end='\n\n')
s = valid
print("REMOVING ACCENTED WORDS")
s = unidecode.unidecode(s)
print(s.split(),end='\n\n')
print("CONVERTING NUMBER WORDS")
processed_text = []
for word in s.split():
    try:
        processed_text.append(str(w2n.word_to_num(word)))
    except ValueError:
        processed_text.append(word)
s = ' '.join(processed_text)
print(s.split(), end='\n\n')
print("LEMMATIZED WORDS")
lema = []
lemmatizer = WordNetLemmatizer()
for word in s.split():
lema.append(lemmatizer.lemmatize(word))
lemmatize(word))
print(lema)
Output
1. re.match()
2. re.search()
3. re.findall()
4. re.split()
5. re.sub()
6. re.compile()
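A short sketch (with an assumed sample string) showing what each of these re functions does:

import re

text = "Exp 1: Text Preprocessing, 2023"

print(re.match(r'Exp', text))       # matches only at the start of the string
print(re.search(r'\d+', text))      # first match anywhere in the string
print(re.findall(r'\d+', text))     # every match: ['1', '2023']
print(re.split(r'[:,]\s*', text))   # ['Exp 1', 'Text Preprocessing', '2023']
print(re.sub(r'\d+', '#', text))    # 'Exp #: Text Preprocessing, #'
pattern = re.compile(r'\d+')        # precompile a pattern for reuse
print(pattern.findall(text))        # ['1', '2023']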