
Language Engineering

Prepared by: Abdelrahman M. Safwat

Section (4) – Mini Project


Idea

• We want to create a Python program that reads a text file and checks whether the text expresses a positive or a negative sentiment.

Defining the lists for positive and negative words

• First, we need to define a list of positive words and a list of negative words to compare the words from our text against.

positive_words = ["well", "good", "great", "like", "better", "enough", "happy", "love", "pleasure", "happiness"]

negative_words = ["miss", "poor", "doubt", "object", "sorry", "impossible", "afraid", "scarcely", "bad", "anxious"]

Opening and reading the file

• Next, we need to open the text file and store its content in a variable. Make sure the text file is in the same directory as the Python script.

# Open the file, read all of its text, and close it automatically
with open("1.txt") as file:
  text = file.read()
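
As a slightly more defensive variant, the sketch below wraps the reading step in a helper. The name read_text_file and the UTF-8 default are assumptions made for illustration; they are not part of the original slides.

```python
from pathlib import Path


def read_text_file(path, encoding="utf-8"):
    """Return the full content of a text file as one string."""
    file_path = Path(path)
    if not file_path.is_file():
        # Fail early with a clear message if the file is not where we expect it
        raise FileNotFoundError(f"Cannot find text file: {file_path}")
    return file_path.read_text(encoding=encoding)
```

With this helper, the reading step would become text = read_text_file("1.txt").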

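Tokenizing the text

• Next, we need to tokenize the text, that is, split it into individual words.

from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing,
# download it once with nltk.download("punkt")
words = word_tokenize(text)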
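
To get a feel for what tokenizing produces without installing anything, here is a rough stand-in that uses only the standard library. The function simple_tokenize is a hypothetical helper, and its regular expression is a simplification: NLTK's word_tokenize behaves differently, for example it keeps punctuation as separate tokens and splits contractions such as "It's".

```python
import re


def simple_tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


sample = "I love this film. It's great!"
print(simple_tokenize(sample))
# ['i', 'love', 'this', 'film', "it's", 'great']
```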

Checking if the text contains positive or negative words

• Next, we need to loop through all the words and check whether they’re included in either of the lists we created.
for word in words:
  if word in positive_words:
    print("The text is positive")
    break
  elif word in negative_words:
    print("The text is negative")
    break

Improving the program

• You’ll notice that there’s a problem with the code on the previous slide: it stops at the first word it finds in either list, so if the text contains a mix of both good and bad words, the result is inaccurate.
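
The small sketch below reproduces the problem. The function first_match_sentiment is a hypothetical rewrite of the loop with break, the word lists are shortened copies of the ones defined earlier, and the sample review is made up.

```python
positive_words = ["well", "good", "great", "like", "better"]
negative_words = ["miss", "poor", "doubt", "sorry", "bad"]


def first_match_sentiment(words):
    """Classify by the first listed word found, like the loop with break."""
    for word in words:
        if word in positive_words:
            return "positive"
        if word in negative_words:
            return "negative"
    return "unknown"  # no listed word found at all


# One positive word followed by three negative ones
review = ["good", "idea", "but", "poor", "acting", "and", "bad", "sound", "sorry"]
print(first_match_sentiment(review))  # positive
```

The review is labelled positive even though negative words outnumber positive ones three to one, because only the first match counts.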

Keeping positive and negative scores

• To solve the problem we mentioned, we can keep score of how many positive and how many negative words there are in the text.

positive_score = 0
negative_score = 0

for word in words:
  if word in positive_words:
    positive_score += 1
  elif word in negative_words:
    negative_score += 1

if positive_score > negative_score:
  print("The text is positive")
else:
  print("The text is negative")
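
Note that the comparison above reports "negative" whenever the two scores are equal, including when no listed word appears at all. One possible refinement, sketched below, is to return a separate label for ties. The function count_sentiment, the "neutral" label, and the lowercase comparison are assumptions added for illustration.

```python
positive_words = {"well", "good", "great", "like", "better", "enough",
                  "happy", "love", "pleasure", "happiness"}
negative_words = {"miss", "poor", "doubt", "object", "sorry", "impossible",
                  "afraid", "scarcely", "bad", "anxious"}


def count_sentiment(words):
    """Count listed words and return (positive_score, negative_score, label)."""
    positive_score = sum(1 for word in words if word.lower() in positive_words)
    negative_score = sum(1 for word in words if word.lower() in negative_words)
    if positive_score > negative_score:
        label = "positive"
    elif negative_score > positive_score:
        label = "negative"
    else:
        label = "neutral"  # tie, including the case where nothing matched
    return positive_score, negative_score, label


print(count_sentiment(["Good", "idea", "but", "poor", "acting", "and", "bad", "sound"]))
# (1, 2, 'negative')
print(count_sentiment(["nothing", "special"]))
# (0, 0, 'neutral')
```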

Improving the program even further

• You’ll notice that there’s still a problem with the code on the previous slide: some words in the text might be positive or negative even though we haven’t put them in our lists, and those words are not counted.

Using word similarity

• For each word in our text, we’ll check how similar it is to each word in the positive and negative lists.
• To do so, we’ll first need to remove all irrelevant words.

Stop Words

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

In Natural Language Processing (NLP), stop words are commonly removed because they are used so often that they carry very little useful information.

Example of stop words

[Figure: stop word example, image not included]

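
As a minimal illustration of the idea, the sketch below filters a sentence against a tiny hand-written stop word list. Both the list and the sentence are made up for this example; NLTK's English list is much longer.

```python
# A tiny hand-written stop word list, for illustration only
stop_words = {"the", "a", "an", "in", "is", "this", "of"}

sentence = ["This", "is", "a", "good", "example", "of", "the", "idea"]

# Keep only the words that are not stop words (compared in lowercase)
filtered = [word for word in sentence if word.lower() not in stop_words]
print(filtered)  # ['good', 'example', 'idea']
```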
Removing stop words

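from nltk.corpus import stopwords

# The stop word corpus must be available; if it is missing,
# download it once with nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
filtered_words = []

for word in words:
  # The stop word list is lowercase, so compare in lowercase
  if word.lower() not in stop_words:
    filtered_words.append(word)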

Using word similarity

• For each word in our text, we’ll check how similar it is to each word in the positive list, and then again for each word in the negative list.
• We’ll keep the similarity scores for each word in a list and add the maximum score to the running total.

positive_score = 0
negative_score = 0
positive_similarity = []
negative_similarity = []

from nltk.corpus import wordnet

# WordNet must be available; if it is missing, download it once with nltk.download("wordnet")
for word in filtered_words:
  # Start with empty lists so the maximum reflects the current word only
  positive_similarity = []
  negative_similarity = []
  for positive_word in positive_words:
    word1 = wordnet.synsets(word)[0]
    word2 = wordnet.synsets(positive_word)[0]
    positive_similarity.append(word1.wup_similarity(word2))
  for negative_word in negative_words:
    word1 = wordnet.synsets(word)[0]
    word2 = wordnet.synsets(negative_word)[0]
    negative_similarity.append(word1.wup_similarity(word2))

  positive_score += max(positive_similarity)
  negative_score += max(negative_similarity)

Fixing problems in our code

• You’ll notice that there’s an error that complains about a “None” value in our list.
• That’s coming from the “max()” function, which can only compare numbers: “wup_similarity()” can return “None” when it finds no connection between two word senses.
• To fix this, we simply need to use the “filter()” function to remove all “None” values.
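
The behaviour can be reproduced without WordNet. The sketch below uses a made-up list of scores to show the error and the effect of filter(); the default=0 argument at the end is an extra safeguard that is not part of the original slides.

```python
# Similarity scores as they might come back, with None for "no score"
scores = [0.5, None, 0.25, None]

try:
    max(scores)
except TypeError as error:
    print("max() failed:", error)

# filter(None, ...) keeps only truthy items, so None is dropped.
# Note that it would also drop a score of exactly 0.
cleaned = list(filter(None, scores))
print(cleaned)       # [0.5, 0.25]
print(max(cleaned))  # 0.5

# default=0 avoids a ValueError when nothing is left after filtering
print(max([], default=0))  # 0
```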

for word in filtered_words:
  # Start with empty lists so the maximum reflects the current word only
  positive_similarity = []
  negative_similarity = []
  for positive_word in positive_words:
    word1 = wordnet.synsets(word)[0]
    word2 = wordnet.synsets(positive_word)[0]
    positive_similarity.append(word1.wup_similarity(word2))
  # Remove the None values before calling max()
  positive_similarity = list(filter(None, positive_similarity))
  for negative_word in negative_words:
    word1 = wordnet.synsets(word)[0]
    word2 = wordnet.synsets(negative_word)[0]
    negative_similarity.append(word1.wup_similarity(word2))
  negative_similarity = list(filter(None, negative_similarity))

  positive_score += max(positive_similarity)
  negative_score += max(negative_similarity)

Fixing problems in our code

• You’ll notice that there’s another error that complains about an invalid index.
• That’s because some of the words we’re comparing don’t have an entry in WordNet, so “wordnet.synsets()” returns an empty list for them.
• We can simply check first whether there are entries or not.

for word in filtered_words:
  # Start with empty lists so the maximum reflects the current word only
  positive_similarity = []
  negative_similarity = []
  for positive_word in positive_words:
    # Only compare words that have at least one WordNet entry
    if wordnet.synsets(word) and wordnet.synsets(positive_word):
      word1 = wordnet.synsets(word)[0]
      word2 = wordnet.synsets(positive_word)[0]
      positive_similarity.append(word1.wup_similarity(word2))
  positive_similarity = list(filter(None, positive_similarity))
  for negative_word in negative_words:
    if wordnet.synsets(word) and wordnet.synsets(negative_word):
      word1 = wordnet.synsets(word)[0]
      word2 = wordnet.synsets(negative_word)[0]
      negative_similarity.append(word1.wup_similarity(word2))
  negative_similarity = list(filter(None, negative_similarity))

  # default=0 covers words that produced no similarity score at all
  positive_score += max(positive_similarity, default=0)
  negative_score += max(negative_similarity, default=0)

Checking our results

if positive_score > negative_score:
  print("The text is positive")
else:
  print("The text is negative")
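
To see the whole scoring scheme in one place, the sketch below restates it as two small functions that accept any similarity function. Everything here is illustrative: best_similarity, classify, and toy_similarity are hypothetical names, and the toy score table contains made-up numbers that stand in for WordNet so the example runs on its own.

```python
def best_similarity(word, reference_words, similarity):
    """Return the highest similarity between word and any reference word.

    similarity(a, b) may return None when it has no score for the pair.
    """
    scores = [similarity(word, reference) for reference in reference_words]
    scores = [score for score in scores if score is not None]
    return max(scores, default=0)


def classify(words, positive_words, negative_words, similarity):
    """Sum each word's best positive and best negative score, then compare."""
    positive_score = sum(best_similarity(w, positive_words, similarity) for w in words)
    negative_score = sum(best_similarity(w, negative_words, similarity) for w in words)
    return "positive" if positive_score > negative_score else "negative"


# Toy similarity table standing in for WordNet (made-up numbers)
TOY_SCORES = {
    ("wonderful", "good"): 0.9,
    ("wonderful", "bad"): 0.3,
    ("terrible", "good"): 0.2,
    ("terrible", "bad"): 0.8,
}


def toy_similarity(a, b):
    return TOY_SCORES.get((a, b))  # None when the pair is unknown


print(classify(["wonderful", "film"], ["good"], ["bad"], toy_similarity))  # positive
print(classify(["terrible", "film"], ["good"], ["bad"], toy_similarity))   # negative
```

With NLTK installed, toy_similarity could be swapped for a function that looks up the first synset of each word and calls wup_similarity, as in the code above.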

Wu-Palmer Similarity

The wup_similarity method is short for Wu-Palmer Similarity, a scoring method based on how similar the word senses are and where the Synsets occur relative to each other in the hypernym tree.
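
The usual Wu-Palmer formula is 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor the two senses share. The sketch below applies it to a small made-up tree, so it illustrates the idea rather than NLTK's actual implementation; PARENT, depth, and toy_wup_similarity are hypothetical names.

```python
# Toy hypernym tree: each word maps to its parent (None marks the root).
# The tree is made up for illustration; WordNet's real hierarchy is far larger.
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Number of nodes from the root down to node (the root has depth 1)."""
    return len(path_to_root(node))


def toy_wup_similarity(a, b):
    """2 * depth(lowest common ancestor) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # The first node on b's path that is also an ancestor of a is the deepest shared one
    common = next(node for node in path_to_root(b) if node in ancestors_of_a)
    return 2 * depth(common) / (depth(a) + depth(b))


print(toy_wup_similarity("dog", "cat"))   # 0.6666666666666666
print(toy_wup_similarity("dog", "tree"))  # 0.3333333333333333
```

"dog" and "cat" share the deep ancestor "animal" and score higher than "dog" and "tree", which only meet at the root.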

Code #1: Introducing Synsets

from nltk.corpus import wordnet

# Take the first sense that WordNet lists for each word
syn1 = wordnet.synsets("hello")[0]
syn2 = wordnet.synsets("selling")[0]
print("hello name:", syn1.name())
print("selling name:", syn2.name())

• Output:
hello name: hello.n.01
selling name: selling.n.01

Code #2: Wu-Palmer Similarity

print(syn1.wup_similarity(syn2))

• Output:
0.26666666666666666

• “hello” and “selling” are apparently about 27% similar!

Try it out yourself

• Code: https://colab.research.google.com/drive/19sLiFnHyDzi1M99yRjeHlB7ekCSYXrmD

Task #1

• Use the mini project we did to loop through all the text files in a directory and print each document’s name and whether it contains positive or negative text.
• Extra: See if you can improve the mini project even further.

Task #2

• Write a Python program to check the list of stop words in the Arabic language.

Thank you for your attention!

References

• https://www.tidytextmining.com/sentiment.html
