Language Engineering

Prepared by: Abdelrahman M. Safwat
Section (4) – Mini Project

Idea
• We want to create a Python program that takes a text file and checks whether its text expresses a positive or a negative sentiment.
Defining the lists for positive and negative words
• First, we need to define a list of positive words and a list of negative words to compare the words from our text against.
positive_words = ["well", "good", "great", "like", "better", "enough", "happy", "love", "pleasure", "happiness"]
negative_words = ["miss", "poor", "doubt", "object", "sorry", "impossible", "afraid", "scarcely", "bad", "anxious"]
Opening and reading the file
• Next, we need to open the text file and store its content in a variable. Make sure the text file is in the same directory as the Python script.
# "with" closes the file automatically once its content has been read
with open("1.txt") as file:
    text = file.read()
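Tokenizing the text
• Next, we need to tokenize the text, that is, split it into individual words.
• word_tokenize relies on NLTK's Punkt tokenizer data; if it is missing, NLTK raises a LookupError that names the resource to fetch with nltk.download().
from nltk.tokenize import word_tokenize
words = word_tokenize(text)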
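To get a feel for what tokenizing produces without installing any NLTK data, here is a minimal sketch that uses only the standard library. The helper name simple_tokenize and its regular expression are assumptions made for illustration; it is a rough stand-in for word_tokenize, which treats punctuation and contractions differently.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens (rough stand-in for word_tokenize)."""
    # \w+ matches a run of letters, digits or underscores; the optional
    # ('\w+)* part keeps contractions such as "isn't" in one token.
    return re.findall(r"\w+(?:'\w+)*", text.lower())

sample = "I love this book. It's great, isn't it?"
tokens = simple_tokenize(sample)
print(tokens)
# ['i', 'love', 'this', 'book', "it's", 'great', "isn't", 'it']
```

Lowercasing the tokens matters here because the positive and negative word lists are all lowercase, so "Good" would otherwise not match "good".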
Checking if the text contains positive or negative words
• Next, we need to loop through all the words and check if they're included in either of the lists we created.
for word in words:
    if word in positive_words:
        print("The text is positive")
        break
    elif word in negative_words:
        print("The text is negative")
        break
Improving the program
• The code in the previous slide has a problem: it stops at the first positive or negative word it finds, so a text that mixes good and bad words is judged by whichever one appears first.
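The problem is easy to reproduce. The sketch below wraps the first-match loop in a function (the name first_match_sentiment, the shortened word lists and the sample sentence are all made up for illustration) and runs it on a text with one positive word followed by three negative ones.

```python
positive_words = ["good", "great", "love"]
negative_words = ["bad", "poor", "sorry"]

def first_match_sentiment(words):
    """Return the sentiment of the first listed word found."""
    for word in words:
        if word in positive_words:
            return "positive"
        elif word in negative_words:
            return "negative"
    return "unknown"  # no listed word was found at all

review = "good start but bad plot bad acting and a poor ending".split()
print(first_match_sentiment(review))
# positive
```

The text is labelled positive only because "good" happens to come first, even though negative words outnumber it three to one.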
Keeping positive and negative scores
• To solve the problem we mentioned, we can keep score of how many positive and how many negative words there are in the text.
positive_score = 0
negative_score = 0
for word in words:
    if word in positive_words:
        positive_score += 1
    elif word in negative_words:
        negative_score += 1
# Compare the two counts once the whole text has been scanned
if positive_score > negative_score:
    print("The text is positive")
else:
    print("The text is negative")
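The same counting idea can be packaged as a function, which makes it easier to try on different texts. This is only a sketch: the function name count_sentiment is hypothetical, the lowercase comparison is an addition, and the "neutral" label for a tie is an assumption (the code above reports a tie as negative).

```python
positive_words = {"well", "good", "great", "like", "better", "enough",
                  "happy", "love", "pleasure", "happiness"}
negative_words = {"miss", "poor", "doubt", "object", "sorry", "impossible",
                  "afraid", "scarcely", "bad", "anxious"}

def count_sentiment(words):
    """Count positive and negative words and return (label, positives, negatives)."""
    # Lowercase each word so that "Good" matches "good"
    positive_score = sum(1 for w in words if w.lower() in positive_words)
    negative_score = sum(1 for w in words if w.lower() in negative_words)
    if positive_score > negative_score:
        label = "positive"
    elif negative_score > positive_score:
        label = "negative"
    else:
        label = "neutral"  # assumption: equal counts are reported as neutral
    return label, positive_score, negative_score

review = "Good start but bad plot bad acting and a poor ending".split()
print(count_sentiment(review))
# ('negative', 1, 3)
```

Sets are used instead of lists because membership tests on a set do not slow down as the word lists grow.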
Improving the program even further
• The code in the previous slide still has a problem: a word in the text may be positive or negative without appearing in either of our lists, in which case it is not counted.
Using word similarity
• For each word in our text, we'll check how similar it is to each word in the positive and negative lists.
• Before doing so, we'll remove all irrelevant words (stop words), since they carry no sentiment.
Stop Words
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
In Natural Language Processing (NLP), stop words are commonly removed because they occur so often that they carry very little useful information.
Example for stop word
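As a small illustration, the sketch below removes stop words from one sentence. The stop-word set and the sentence are made up for this example; NLTK's English list, used on the next slide, is much longer.

```python
# A tiny hand-written stop-word set, for illustration only
demo_stop_words = {"the", "a", "an", "in", "on", "is", "and"}

sentence = "The cat is sleeping on a mat in the garden"
tokens = sentence.lower().split()

# Keep only the tokens that are not stop words
content_words = [t for t in tokens if t not in demo_stop_words]
print(content_words)
# ['cat', 'sleeping', 'mat', 'garden']
```

Six of the ten tokens disappear, and the four that remain are the ones that carry the meaning of the sentence.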
Removing stop words
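from nltk.corpus import stopwords
# NLTK's list of English stop words (all lowercase)
stop_words = stopwords.words("english")
filtered_words = []
for word in words:
    # Compare in lowercase so that "The" is removed as well as "the"
    if word.lower() not in stop_words:
        filtered_words.append(word)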
• For each word in our text, we'll check how similar it is to each word in the positive list, and then again for each word in the negative list.
• We'll keep the similarity scores in a list and add the maximum score to the running total.
positive_score = 0
negative_score = 0
positive_similarity = []
negative_similarity = []
from nltk.corpus import wordnet
for word in filtered_words:
    # Start with empty lists so each word is scored on its own
    positive_similarity = []
    negative_similarity = []
    for positive_word in positive_words:
        # Use the first WordNet sense of each word
        word1 = wordnet.synsets(word)[0]
        word2 = wordnet.synsets(positive_word)[0]
        positive_similarity.append(word1.wup_similarity(word2))
    for negative_word in negative_words:
        word2 = wordnet.synsets(negative_word)[0]
        negative_similarity.append(word1.wup_similarity(word2))
    positive_score += max(positive_similarity)
    negative_score += max(negative_similarity)
Fixing problems in our code
• Running this code raises an error that complains about a “None” value in our list.
• The error comes from the “max()” function: “wup_similarity()” returns “None” when it cannot relate two senses, and “max()” can only compare numbers.
• To fix this, we simply need to use the “filter()” function to remove all “None” values.
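The error and the fix can be reproduced without WordNet. In the sketch below the list of scores is made up; it shows that max() fails on a list containing None, that filter(None, ...) removes None together with every other falsy value such as 0.0, and that a list comprehension is an alternative when zeros should be kept.

```python
similarities = [0.5, None, 0.8, 0.0]  # made-up scores for illustration

try:
    max(similarities)
except TypeError as error:
    # Python cannot order None against a float
    print("max() failed:", error)

# filter(None, ...) keeps only truthy values: None and 0.0 are both dropped
truthy_only = list(filter(None, similarities))
print(truthy_only)   # [0.5, 0.8]

# Alternative that removes None but keeps 0.0
not_none = [s for s in similarities if s is not None]
print(not_none)      # [0.5, 0.8, 0.0]

print(max(truthy_only), max(not_none))   # 0.8 0.8
```

Dropping zeros does not change the maximum here, which is why filter(None, ...) is good enough for this program.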
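for word in filtered_words:
    # ... fill positive_similarity and negative_similarity as before ...
    # filter(None, ...) drops every None (and any 0) from the list
    positive_similarity = list(filter(None, positive_similarity))
    negative_similarity = list(filter(None, negative_similarity))
    positive_score += max(positive_similarity)
    negative_score += max(negative_similarity)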
• Running the code again raises another error, this time about an invalid index (an IndexError).
• That's because some of the words we're comparing have no entry in WordNet, so “wordnet.synsets()” returns an empty list and “[0]” fails.
• We can simply check first whether both words have entries.
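for word in filtered_words:
    positive_similarity = []
    negative_similarity = []
    for positive_word in positive_words:
        # Compare only when both words have a WordNet entry
        if wordnet.synsets(word) and wordnet.synsets(positive_word):
            word1 = wordnet.synsets(word)[0]
            word2 = wordnet.synsets(positive_word)[0]
            positive_similarity.append(word1.wup_similarity(word2))
    positive_similarity = list(filter(None, positive_similarity))
    for negative_word in negative_words:
        if wordnet.synsets(word) and wordnet.synsets(negative_word):
            word1 = wordnet.synsets(word)[0]
            word2 = wordnet.synsets(negative_word)[0]
            negative_similarity.append(word1.wup_similarity(word2))
    negative_similarity = list(filter(None, negative_similarity))
    # default=0 covers words for which no similarity could be computed
    positive_score += max(positive_similarity, default=0)
    negative_score += max(negative_similarity, default=0)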
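The scoring logic can also be written as one function that receives the similarity measure as a parameter, which lets it be tried without WordNet. This is a sketch under stated assumptions: score_text, toy_similarity and the hand-picked scores are hypothetical. In the real program the similarity callable would wrap the WordNet lookup and return None whenever a word has no synsets.

```python
def score_text(words, positive_words, negative_words, similarity):
    """Sum, over all words, the best similarity to each sentiment list.

    similarity(a, b) must return a number, or None when the two words
    cannot be compared (for example, when one has no dictionary entry).
    """
    positive_score = 0
    negative_score = 0
    for word in words:
        # Fresh lists for every word, so one word's best score is not reused
        positive_similarity = [similarity(word, p) for p in positive_words]
        negative_similarity = [similarity(word, n) for n in negative_words]
        # Skip None before calling max(); default=0 covers an empty result
        positive_score += max((s for s in positive_similarity if s is not None), default=0)
        negative_score += max((s for s in negative_similarity if s is not None), default=0)
    return positive_score, negative_score


# Toy stand-in for WordNet: hand-picked scores, purely illustrative
toy_scores = {
    ("joyful", "happy"): 0.9, ("joyful", "bad"): 0.2,
    ("awful", "happy"): 0.3, ("awful", "bad"): 0.8,
}

def toy_similarity(a, b):
    return toy_scores.get((a, b))  # None when the pair is unknown

pos, neg = score_text(["joyful", "awful", "xyzzy"], ["happy"], ["bad"], toy_similarity)
print(round(pos, 2), round(neg, 2))
# 1.2 1.0
```

The unknown word "xyzzy" contributes 0 to both totals instead of raising an error, which is the behaviour the two fixes above aim for.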
Checking our results
if positive_score > negative_score:
    print("The text is positive")
else:
    print("The text is negative")
Wu-Palmer Similarity
The name wup_similarity is short for Wu-Palmer Similarity, a scoring method based on how similar two word senses are and on where their Synsets sit relative to each other in the hypernym tree.
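The Wu-Palmer score of two senses is 2 × depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor that both senses share. The sketch below applies that formula to a tiny hand-made tree. The tree, the concept names and the function names are assumptions for illustration; real WordNet synsets can have several hypernyms, which NLTK handles internally.

```python
# Toy hypernym tree: each concept maps to its parent (None marks the root)
parent = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}

def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def depth(node):
    # The root has depth 1, its children depth 2, and so on
    return len(path_to_root(node))

def wu_palmer(a, b):
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the deepest common ancestor
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("dog", "cat"))  # share "animal" (depth 2): 4 / 6
print(wu_palmer("dog", "car"))  # share only "entity" (depth 1): 2 / 6
```

Concepts that meet lower in the tree score higher: "dog" and "cat" come out twice as similar as "dog" and "car".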
Code #1: Introducing Synsets
from nltk.corpus import wordnet

# Take the first WordNet sense (Synset) of each word
syn1 = wordnet.synsets('hello')[0]
syn2 = wordnet.synsets('selling')[0]
print("hello name :", syn1.name())
print("selling name :", syn2.name())
Output:
hello name : hello.n.01
selling name : selling.n.01
Code #2: Wu-Palmer Similarity
print(syn1.wup_similarity(syn2))
Output:
0.26666666666666666
• “hello” and “selling” are apparently 27% similar!
Try it out yourself
• Code: https://colab.research.google.com/drive/19sLiFnHyDzi1M99yRjeHlB7ekCSYXrmD
Task #1
• Use the mini project we did to loop through all the text files in a directory and print each document's name and whether it contains positive or negative text.
• Extra: See if you can improve the mini project even further.
Task #2
• Write a Python program to print the list of Arabic stop words.
Thank you for your attention!
