Extracting all text from text files
First, we need to extract all text from the text files in the directory
we’re in.
import os

# Collect the contents of every .txt file in the current directory.
dir_list = sorted(os.listdir("."))
text_list = []
for file in dir_list:
    if file.endswith(".txt"):
        with open(file) as f:
            text_list.append(f.read())
Once we’ve extracted all the text, we’ll need to tokenize it and build
a list of all the words we have, so that we can record which
documents each word occurs in.
Since we’re making a search engine, we can’t just extract the text and
make a list of words: many of those words will be different forms
of the same word (e.g. go, going, went).
That’s why we need to lemmatize all the words.
Lemmatizing words
Lemmatizing words requires that we get the POS tag of each word,
and then pass the word and its tag to the lemmatizer.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
def nltk_tag_to_wordnet_tag(nltk_tag):
    # Map the Penn Treebank tag prefix to the matching WordNet POS constant.
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
def lemmatize_sentence(sentence):
    lemmatizer = WordNetLemmatizer()
    # Tag each token, then convert the tags to WordNet's tag set.
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # No usable tag: keep the word unchanged.
            lemmatized_sentence.append(word)
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)
new_text_list = []
for text in text_list:
    new_text_list.append(lemmatize_sentence(text))
Tokenizing all text
from itertools import chain

all_words = []
for text in new_text_list:
    all_words.append(nltk.word_tokenize(text))
# Flatten the per-text token lists and deduplicate.
all_words = list(chain(*all_words))
all_words_set = set(all_words)
Creating word index
Next, we’ll loop through all the texts and check whether each word
occurs in that text; if so, we store the index of that text.
words_index = {}
for word in all_words_set:
    words_index[word] = []
    # Record the 1-based index of every text the word occurs in.
    for counter, text in enumerate(new_text_list, start=1):
        # Note: this is a substring check; "word in text.split()"
        # would match whole tokens only.
        if word in text:
            words_index[word].append(counter)
Testing our search engine
search = "from"
words_index[search]
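A single-word lookup like the one above extends naturally to multi-word queries by intersecting the posting lists. A minimal sketch (the small `words_index` below is a hypothetical example, built the same way as ours):

```python
def search(query, words_index):
    """Return indices of documents that contain every word of the query."""
    results = None
    for word in query.split():
        postings = set(words_index.get(word, []))
        # Intersect with the postings seen so far.
        results = postings if results is None else results & postings
    return sorted(results) if results else []

# Hypothetical index of three documents.
words_index_demo = {"search": [1, 3], "engine": [1, 2, 3], "from": [2]}
print(search("search engine", words_index_demo))  # → [1, 3]
```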
Using a graph of how many times a word occurs in all the texts.
Try with me:
Use the mini project to show a graph of how many times a word
occurs in all the texts.
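One way to approach this (a sketch, not the project's official solution): count occurrences with `collections.Counter` over the tokenized texts, then plot the counts with something like `matplotlib.pyplot.bar`. The `docs` list below is a hypothetical stand-in for our tokenized texts.

```python
from collections import Counter

def word_frequencies(token_lists):
    """Count how often each word occurs across all tokenized texts."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts

# Hypothetical tokenized documents.
docs = [["go", "search", "go"], ["search", "engine"]]
freqs = word_frequencies(docs)
print(freqs.most_common(2))  # → [('go', 2), ('search', 2)]
```

`freqs.keys()` and `freqs.values()` can then feed a bar chart directly.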
Task #1
Use the mini project to show the closest word in all the texts if the
word is not found.
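One possible approach (an assumption, not the official solution): when the query is missing from `words_index`, suggest the most similar known word. The standard library's `difflib.get_close_matches` ranks candidates by similarity; `nltk.edit_distance` would be an alternative.

```python
import difflib

def closest_word(query, vocabulary):
    """Return the known word most similar to the query, or None."""
    matches = difflib.get_close_matches(query, vocabulary, n=1)
    return matches[0] if matches else None

# Hypothetical vocabulary, standing in for all_words_set.
vocab = ["search", "engine", "lemmatize", "token"]
print(closest_word("serch", vocab))  # → 'search'
```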
Task #2
Use the mini project to show a list of words that have the same meaning.
Try it out yourself
Code:
https://colab.research.google.com/drive/17g3Co7YGZuhufZPZ99rCRSH_84zpUNLo?usp=sharing
Thank you for your attention!
References
https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258