
Language Engineering

Prepared by: Abdelrahman M. Safwat

Section (5) – Mini Project


Idea

 We want to create a Python program that acts as a small search engine.


 It should accept a search term and return which documents contain that term.

Extracting all text from text files

 First, we need to extract all text from the text files in the directory
we’re in.
import os

# List the files in the current directory in sorted order
dir_list = sorted(os.listdir("."))
text_list = []

# Read the contents of every .txt file into text_list
for file in dir_list:
  if file.endswith(".txt"):
    text_list.append(open(file).read())
Extracting all text from text files

 Once we’ve extracted all the text, we’ll need to tokenize it and make a list of
all the words we have, so that we can start recording which documents each
word occurs in.

Extracting all text from text files

 Since we’re making a search engine, we can’t just extract the text and
make a list of words: many of those words will be different forms
of the same word (e.g. go, going, went).
 That’s why we need to lemmatize all words.
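
For example, WordNet’s lemmatizer only collapses inflected forms when it knows the part of speech. A minimal sketch (it assumes the wordnet corpus has been downloaded with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS tag the lemmatizer assumes a noun and leaves the verb unchanged
print(lemmatizer.lemmatize("went"))        # went
print(lemmatizer.lemmatize("going"))       # going

# With the verb tag, the different forms map to the same lemma
print(lemmatizer.lemmatize("went", "v"))   # go
print(lemmatizer.lemmatize("going", "v"))  # go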

Lemmatizing words

 Lemmatizing words requires that we first get the POS tag of each word,
and then pass the word and its tag to the lemmatizer.
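
As a quick illustration (a sketch; it assumes the punkt tokenizer and the averaged_perceptron_tagger have been downloaded with nltk.download()), nltk.pos_tag() returns Penn Treebank tags such as 'NNS' or 'VBD', which is why the mapping function on the next slide only checks the first letter of each tag:

import nltk

tokens = nltk.word_tokenize("The cats were sleeping")
print(nltk.pos_tag(tokens))
# Typically something like:
# [('The', 'DT'), ('cats', 'NNS'), ('were', 'VBD'), ('sleeping', 'VBG')]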

Lemmatizing words

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Map a Penn Treebank POS tag to the corresponding WordNet POS constant
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
Lemmatizing words

# Lemmatize every word in a sentence, using its POS tag when one is available
def lemmatize_sentence(sentence):
    lemmatizer = WordNetLemmatizer()
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            lemmatized_sentence.append(word)
        else:        
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

Lemmatizing words

# Lemmatize the full text of every document
new_text_list = []

for text in text_list:
  new_text_list.append(lemmatize_sentence(text))

Tokenizing all text

 Next, we need to create a list of all words.

 As word_tokenize() returns a list, this will create a list of lists, and that’s
why we’ll use itertools’ chain() function to unpack the list of lists into one list.
 As the list will contain duplicate words, we’ll turn the list into a set, which
removes all duplicates.

from itertools import chain

all_words = []

for text in new_text_list:
  all_words.append(nltk.word_tokenize(text))

# Flatten the list of token lists into one list, then remove duplicates with a set
all_words = list(chain(*all_words))
all_words_set = set(all_words)
Creating word index

 Next, for every word, we’ll loop through all the texts, check whether the word
occurs in each text, and if so store the index of that text.
# Map each word to the list of document numbers (1-based) in which it appears
words_index = {}

counter = 1

for word in all_words_set:
  words_index[word] = []
  for text in new_text_list:
    if word in text:
      words_index[word].append(counter)
    counter += 1
  counter = 1
Testing our search engine

search = "from"

words_index[search]
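
Note that looking up a word that never occurs in any document would raise a KeyError. A small optional safeguard (a sketch, not part of the original code) is to use dict.get():

search = "from"

# Returns the list of document numbers, or an empty list if the word was never indexed
print(words_index.get(search, []))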

Graphing how many times a word occurs
in all the text

 We stored the indexes of the word in the documents in list_Frequency.

 We used matplotlib.pyplot to show list_Frequency as a graph.
 scatter() takes two parameters (x-axis values, y-axis values).
 scatter() shows the data as points.
 bar() shows the data as bars.
 We also showed the size of list_Frequency (a plotting sketch follows below).
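
The plotting code for this slide is not included here, so the following is only a minimal sketch. It assumes we want to plot, for each document, how many times the searched word occurs in it; the counting step and the doc_numbers variable are assumptions, not the original implementation.

import matplotlib.pyplot as plt

search = "from"

# Assumption: count how many times the searched word occurs in each document
list_Frequency = [text.split().count(search) for text in new_text_list]
doc_numbers = list(range(1, len(list_Frequency) + 1))

# scatter() shows the data as points
plt.scatter(doc_numbers, list_Frequency)
plt.show()

# bar() shows the same data as bars
plt.bar(doc_numbers, list_Frequency)
plt.show()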

Graphing how many times a word occurs
in all the text

[This slide showed the resulting graph.]
Try with me

 Use the mini project to show a graph of how many times a word
occurs in all the text.

Task #1

 Use the mini project to show the closest word in all the text if the
searched word is not found.
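
One possible starting point (only a sketch, not the required solution) is difflib.get_close_matches from the standard library, which suggests the most similar indexed word:

from difflib import get_close_matches

search = "lenguage"  # hypothetical misspelled query

if search in words_index:
    print(words_index[search])
else:
    # Suggest the closest word we have indexed, if any
    suggestions = get_close_matches(search, list(all_words_set), n=1)
    if suggestions:
        print("Did you mean:", suggestions[0])
    else:
        print("No similar word found.")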

Task #2

 Use the mini project to show a list of words that have the same meaning.
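
One possible starting point (only a sketch, not the required solution) is to collect synonyms from WordNet’s synsets and keep the ones that actually occur in our documents:

from nltk.corpus import wordnet

search = "go"

# Gather the lemma names from every synset of the searched word
synonyms = set()
for synset in wordnet.synsets(search):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())

# Keep only the synonyms that appear somewhere in our indexed words
print(synonyms & all_words_set)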

Try it out yourself

 Code:
https://colab.research.google.com/drive/17g3Co7YGZuhufZPZ99rCR
SH_84zpUNLo?usp=sharing

Thank you for your attention!

References

 https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258

