Defining the lists for positive and negative words
negative_words = ["miss", "poor", "doubt", "object", "sorry", "impossible", "afraid", "scarcely", "bad", "anxious"]
Opening and reading the file
Next, we need to open the text file and store its content in a
variable. Make sure the text file is in the same directory as the
Python script.
# Open the file, read its whole content, and close it automatically
with open("1.txt") as file:
    text = file.read()
Tokenizing the text
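from nltk.tokenize import word_tokenize
# Split the text into a list of word and punctuation tokens
words = word_tokenize(text)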
Checking if the text contains positive or negative words
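As a minimal sketch of such a check, assuming a short hypothetical positive_words list and a hard-coded word list in place of the tokenized file (the function name naive_label is an illustration, not the slide's own code):

```python
# Hypothetical positive list; the negative words are three entries from the list above.
positive_words = ["good", "great", "happy"]
negative_words = ["poor", "bad", "sorry"]


def naive_label(words):
    """Return "positive" if any positive word occurs, else "negative" if any negative word occurs."""
    if any(word in positive_words for word in words):
        return "positive"
    if any(word in negative_words for word in words):
        return "negative"
    return "neither"


print(naive_label(["the", "food", "was", "good"]))
print(naive_label(["good", "food", "but", "poor", "and", "bad", "service"]))
```

Both calls report a positive text, even though the second one contains more negative words than positive ones.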
Improving the program
The code on the previous slide has a problem: if the text contains a
mix of both positive and negative words, the result is inaccurate.
Keeping positive and negative scores
positive_score = 0
negative_score = 0
# Count how many words of the text appear in each list
for word in words:
    if word in positive_words:
        positive_score += 1
    elif word in negative_words:
        negative_score += 1
# Compare the two counts (a tie is reported as negative)
if positive_score > negative_score:
    print("The text is positive")
else:
    print("The text is negative")
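The same counting can be wrapped in a function so it is easy to try on different inputs. This is a sketch under assumptions: the positive_words entries and the function name count_scores are hypothetical, and str.split() stands in for word_tokenize so the block runs without NLTK.

```python
positive_words = ["good", "great", "happy"]   # hypothetical entries
negative_words = ["poor", "bad", "sorry"]     # three words from the negative list


def count_scores(words):
    """Return (positive_score, negative_score) for a list of words."""
    positive_score = 0
    negative_score = 0
    for word in words:
        if word in positive_words:
            positive_score += 1
        elif word in negative_words:
            negative_score += 1
    return positive_score, negative_score


mixed = "good food but poor service and bad parking".split()
print(count_scores(mixed))                    # (1, 2)
print(count_scores("good but bad".split()))   # (1, 1)
```

With a tie such as (1, 1), the comparison positive_score > negative_score is false, so the program above reports the text as negative.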
Improving the program even further
The code on the previous slide still has a problem: the text may
contain positive or negative words that we haven’t put in our lists.
Using word similarity
For each word in our text, we’ll check how similar it is to each
word in the positive and negative lists.
Before doing so, we’ll also need to remove all irrelevant words.
Stop Words
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”)
that a search engine has been programmed to ignore, both when
indexing entries for searching and when retrieving them as the result of
a search query.
In Natural Language Processing (NLP), stop word lists are commonly
used to eliminate words that are so common that they carry very
little useful information.
Example for stop word
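As a minimal sketch of the idea, assuming a small hand-picked stop list (hypothetical, and far shorter than NLTK's English list) and a hypothetical helper named remove_stop_words:

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "in", "is", "and", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The service in the hotel is poor".split()
print(remove_stop_words(tokens))  # ['service', 'hotel', 'poor']
```

The comparison is done on the lowercase form because stop lists, including NLTK's, store their entries in lowercase, so a capitalized "The" would otherwise slip through.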
Removing stop words
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
filtered_words = []
for word in words:
    if word not in stop_words:
        filtered_words.append(word)
Using word similarity
For each word in our text, we’ll check how similar it is to each word
in the positive list, and then again for the negative list.
We’ll keep each similarity score in a list and take the maximum score.
positive_score = 0
negative_score = 0
positive_similarity = []
negative_similarity = []
from nltk.corpus import wordnet
for word in words:
    for positive_word in positive_words:
        word1 = wordnet.synsets(word)[0]
        word2 = wordnet.synsets(positive_word)[0]
        positive_similarity.append(word1.wup_similarity(word2))
    for negative_word in negative_words:
        word1 = wordnet.synsets(word)[0]
        word2 = wordnet.synsets(negative_word)[0]
        negative_similarity.append(word1.wup_similarity(word2))
    positive_score += max(positive_similarity)
    negative_score += max(negative_similarity)
Fixing problems in our code
Running this code raises an error that complains about a “None” value in our list.
It comes from the “max()” function, which expects numbers only, while
“wup_similarity()” returns “None” for some pairs of words.
To fix this, we use the “filter()” function to remove all “None” values.
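The behaviour of max() and filter() on such a list can be tried in isolation. The following sketch uses a made-up list of scores (the values are hypothetical, not real WordNet output) and only Python built-ins:

```python
# Made-up similarity scores; None stands for a pair with no similarity value.
scores = [0.4, None, 0.25, 0.0]

# max() cannot compare None with a number, so this raises TypeError.
try:
    max(scores)
except TypeError as error:
    print("max() failed:", error)

# filter(None, ...) keeps only truthy items, so it drops None and also 0.0.
cleaned = list(filter(None, scores))
print(cleaned)   # [0.4, 0.25]

# Alternative that removes only None and keeps zero scores.
not_none = [score for score in scores if score is not None]
print(not_none)  # [0.4, 0.25, 0.0]

# max() on an empty list raises ValueError unless a default is given.
print(max([], default=0))  # 0
```

Dropping a zero score does not change the maximum of a list of non-negative scores, but a list can end up empty after filtering, which is why the default argument of max() is worth knowing.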
for word in words:
    for positive_word in positive_words:
        word1 = wordnet.synsets(word)[0]
        word2 = wordnet.synsets(positive_word)[0]
        positive_similarity.append(word1.wup_similarity(word2))
    positive_similarity = list(filter(None, positive_similarity))
    for negative_word in negative_words:
        word1 = wordnet.synsets(word)[0]
        word2 = wordnet.synsets(negative_word)[0]
        negative_similarity.append(word1.wup_similarity(word2))
    negative_similarity = list(filter(None, negative_similarity))
    positive_score += max(positive_similarity)
    negative_score += max(negative_similarity)
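Fixing problems in our code
Some words have no synsets in WordNet, so “wordnet.synsets(word)[0]” raises an
IndexError. The next version checks that both words have synsets before
comparing them.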
for word in filtered_words:  # use the text with stop words removed
    # Start with empty lists so that each word gets its own maximum
    positive_similarity = []
    negative_similarity = []
    for positive_word in positive_words:
        # Skip pairs where either word has no synsets in WordNet
        if wordnet.synsets(word) and wordnet.synsets(positive_word):
            word1 = wordnet.synsets(word)[0]
            word2 = wordnet.synsets(positive_word)[0]
            positive_similarity.append(word1.wup_similarity(word2))
    positive_similarity = list(filter(None, positive_similarity))
    for negative_word in negative_words:
        if wordnet.synsets(word) and wordnet.synsets(negative_word):
            word1 = wordnet.synsets(word)[0]
            word2 = wordnet.synsets(negative_word)[0]
            negative_similarity.append(word1.wup_similarity(word2))
    negative_similarity = list(filter(None, negative_similarity))
    # default=0 avoids an error when a word produced no similarity scores
    positive_score += max(positive_similarity, default=0)
    negative_score += max(negative_similarity, default=0)
Checking our results
if positive_score > negative_score:
    print("The text is positive")
else:
    print("The text is negative")
Wu-Palmer Similarity
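The Wu-Palmer score of two concepts is twice the depth of their deepest common ancestor divided by the sum of their own depths. The sketch below computes it on a tiny hand-made taxonomy; the taxonomy, the PARENT table and the function names are hypothetical and much simpler than WordNet, and depth is counted in nodes so that the root has depth 1.

```python
# Toy taxonomy (hypothetical): each node maps to its parent, the root has no parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, both included."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Return 2 * depth(lcs) / (depth(a) + depth(b)) for the deepest shared ancestor lcs."""
    ancestors_of_b = set(path_to_root(b))
    # Walking up from a, the first node that is also above b is the deepest shared one.
    lcs = next(node for node in path_to_root(a) if node in ancestors_of_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("dog", "cat"))  # shared ancestor "animal": 2*2 / (3+3)
print(wu_palmer("dog", "car"))  # shared ancestor "entity": 2*1 / (3+3)
```

Concepts that meet lower in the tree ("dog" and "cat") score higher than concepts that only share the root ("dog" and "car"), and a concept compared with itself scores 1.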
Code #1: Introducing Synsets
Code #2: Wu-Palmer Similarity
syn1.wup_similarity(syn2)
Output:
0.26666666666666666
“hello” and “selling” are apparently about 27% similar!
Try it out yourself
Code:
https://colab.research.google.com/drive/19sLiFnHyDzi1M99yRjeHlB7ekCSYXrmD
Task #1
Use the mini project we did to loop through all text files in a
directory and print the document name and whether it contains
positive or negative text.
Extra: See if you can improve the mini project even further.
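One possible starting point for the directory loop, offered as a sketch rather than a full solution: it uses pathlib from the standard library, a simple word count instead of word similarity, and hypothetical names (POSITIVE_WORDS, classify_text, classify_directory) and positive words that are not part of the mini project.

```python
from pathlib import Path

POSITIVE_WORDS = {"good", "great", "happy"}   # hypothetical entries
NEGATIVE_WORDS = {"poor", "bad", "sorry"}     # three words from the negative list


def classify_text(text):
    """Count-based label with the same rule as the scoring slide (a tie is negative)."""
    words = text.lower().split()
    positive_score = sum(word in POSITIVE_WORDS for word in words)
    negative_score = sum(word in NEGATIVE_WORDS for word in words)
    return "positive" if positive_score > negative_score else "negative"


def classify_directory(directory):
    """Return {file name: label} for every .txt file directly inside `directory`."""
    results = {}
    for path in sorted(Path(directory).glob("*.txt")):
        results[path.name] = classify_text(path.read_text(encoding="utf-8"))
    return results
```

Calling classify_directory(".") and printing each name with its label covers the first part of the task; replacing classify_text with the similarity-based scoring is one way to approach the extra part.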
Task #2
Thank you for your attention!
References
https://www.tidytextmining.com/sentiment.html