Extracting all text from text files
First, we need to extract all text from the text files in the directory
we’re in.
import os

# Collect the contents of every .txt file in the current directory.
dir_list = sorted(os.listdir("."))
text_list = []
for file in dir_list:
    if file.endswith(".txt"):
        with open(file) as f:
            text_list.append(f.read())
Once we’ve extracted all the text, we’ll need to tokenize it and build
a list of all the words we have, so that we can record which
documents each word occurs in.
Since we’re making a search engine, we can’t just extract the text and
make a list of words: many of those words will be different forms
of the same word (e.g. go, going, went).
That’s why we need to lemmatize all the words.
Lemmatizing words
Lemmatizing words requires that we get the POS tag of each word,
and then pass the word and its tag to the lemmatizer.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
def nltk_tag_to_wordnet_tag(nltk_tag):
    # Map the Penn Treebank tag prefix to the matching WordNet POS constant.
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
def lemmatize_sentence(sentence):
    lemmatizer = WordNetLemmatizer()
    # Tag each token, then convert the tags to WordNet's tag set.
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # No usable tag: keep the word unchanged.
            lemmatized_sentence.append(word)
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)
new_text_list = []
for text in text_list:
    new_text_list.append(lemmatize_sentence(text))
Tokenizing all text
from itertools import chain

all_words = []
for text in new_text_list:
    all_words.append(nltk.word_tokenize(text))
# Flatten the per-text token lists and deduplicate.
all_words = list(chain(*all_words))
all_words_set = set(all_words)
Creating word index
Next, we’ll loop through all the texts and check whether each word
occurs in that text; if so, we store the index of that text.
words_index = {}
for word in all_words_set:
    words_index[word] = []
    # Record the 1-based index of every text the word occurs in.
    for counter, text in enumerate(new_text_list, start=1):
        # Note: this is a substring check; "word in text.split()"
        # would match whole tokens only.
        if word in text:
            words_index[word].append(counter)
Testing our search engine
search = "from"
words_index[search]
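A single-word lookup like the one above extends naturally to multi-word queries by intersecting the posting lists. A minimal sketch (the small `words_index` below is a hypothetical example, built the same way as ours):

```python
def search(query, words_index):
    """Return indices of documents that contain every word of the query."""
    results = None
    for word in query.split():
        postings = set(words_index.get(word, []))
        # Intersect with the postings seen so far.
        results = postings if results is None else results & postings
    return sorted(results) if results else []

# Hypothetical index of three documents.
words_index_demo = {"search": [1, 3], "engine": [1, 2, 3], "from": [2]}
print(search("search engine", words_index_demo))  # → [1, 3]
```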
Using a graph of how many times a word occurs in all the texts.
Try with me:
Use the mini project to show a graph of how many times a word
occurs in all the texts.
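One way to approach this (a sketch, not the project's official solution): count occurrences with `collections.Counter` over the tokenized texts, then plot the counts with something like `matplotlib.pyplot.bar`. The `docs` list below is a hypothetical stand-in for our tokenized texts.

```python
from collections import Counter

def word_frequencies(token_lists):
    """Count how often each word occurs across all tokenized texts."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts

# Hypothetical tokenized documents.
docs = [["go", "search", "go"], ["search", "engine"]]
freqs = word_frequencies(docs)
print(freqs.most_common(2))  # → [('go', 2), ('search', 2)]
```

`freqs.keys()` and `freqs.values()` can then feed a bar chart directly.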
Task #1
Use the mini project to show the closest word in all the texts if the
word is not found.
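One possible approach (an assumption, not the official solution): when the query is missing from `words_index`, suggest the most similar known word. The standard library's `difflib.get_close_matches` ranks candidates by similarity; `nltk.edit_distance` would be an alternative.

```python
import difflib

def closest_word(query, vocabulary):
    """Return the known word most similar to the query, or None."""
    matches = difflib.get_close_matches(query, vocabulary, n=1)
    return matches[0] if matches else None

# Hypothetical vocabulary, standing in for all_words_set.
vocab = ["search", "engine", "lemmatize", "token"]
print(closest_word("serch", vocab))  # → 'search'
```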
Task #2
Use the mini project to show a list of words that have the same meaning.
Try it out yourself
Code:
https://colab.research.google.com/drive/17g3Co7YGZuhufZPZ99rCRSH_84zpUNLo?usp=sharing
Thank you for your attention!
References
https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258