Submitted by:
Muhammad Muneeb Iftikhar 2021-CS-648
Muhammad Awais Nasir 2021-CS-627
Hira Ahmad 2020-R/2021-CS-709
Supervised by:
Dr. Irfan Yousaf
Contents
List of Figures
List of Tables
References

List of Figures
1 Proposed Methodology
2 Gantt chart

List of Tables
1 Summary of Projects
2 Work Division
Proposal Synopsis
1.2 Introduction
Nowadays it can be hard to find relevant information online simply because there is so much of it. Search engines help us navigate this information overload, but they do not always understand exactly what we are looking for. The goal of this project is to develop a research tool that uses graph structures and machine learning algorithms to find the best answers to long, detailed questions.
Imagine you have a difficult question and need an in-depth answer. The system will work like a personal research assistant: you ask a long, detailed question, it extracts the key points, and it searches the web for relevant websites. But it will not just match pages containing similar terms; it will use graph representations to understand what your query means and what the retrieved pages are actually about. This way you get answers that are truly specific to your needs, not just pages that happen to contain the right keywords.
The project is not only about retrieving information from the Internet; it is about ranking the websites that best match your detailed question, so that the answer you get is exactly what you are looking for. The final system will be easy to use and will show which websites best answer your question, making it much easier to find the right answers to difficult, detailed questions on the web.
1.4 Objectives
The major aim of this graph-based document ranking model is to improve both the efficiency and the accuracy of document retrieval and ranking. The model aims to:
• Deliver personalized search results tailored to the user's specific queries, enhancing both the search experience and user satisfaction.
• Enable recommendation systems to efficiently identify and suggest relevant documents according to user preferences, budgets, and requirements.
1.5 Features/Scope
Scope:
• Develop a document ranking model based on semantic graphs.
• Use advanced ranking algorithms to make search results more accurate and relevant.
• Present the list of results to users in a simple, user-friendly format.
• Evaluate and test system performance.
• Provide comprehensive documentation of the methodology and results.
Features:
• Extracts keywords from user queries effectively.
• Collects data from a variety of sources, including search engines and web scraping.
• Cleans and structures collected data for analysis.
• Ranks documents using advanced algorithms based on graph theory, machine learning, and natural language processing.
• Provides an intuitive interface for query input and result display.
• Designed to handle large-scale document collections from the Internet efficiently.
• Aims to improve the relevance and accuracy of search results using state-of-the-art techniques.
• Ensures scalability, efficiency, and robustness of the system.
Universal Sentence Encoder to measure similarity between texts. They evaluated these methods on a dataset and found that their new model outperformed the other methods. Finally, they developed an application for finding similarities between documents using their new models [5].
A Keyphrase Graph-Based Method for Document Similarity Measurement
This paper describes a method for measuring the similarity between two documents. It aims at a richer understanding of the concepts and relationships in texts by integrating knowledge from large knowledge bases such as DBpedia and Wikipedia, proposing graph-based semantic models that consider the meaning and structure of texts rather than relying solely on term statistics. The method computes similarity by comparing the semantic information represented by the keyphrase graphs of two documents, and performs comparably to state-of-the-art methods on standard datasets [6].
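To make the idea concrete, a toy version of keyphrase-graph comparison can be sketched in Python. The clique construction and the equal node/edge Jaccard weights below are illustrative assumptions, not the paper's actual model:

```python
# Toy keyphrase-graph similarity: nodes are keyphrases, edges connect
# keyphrases that co-occur in the same document (here: all pairs).
from itertools import combinations

def keyphrase_graph(keyphrases):
    """Build a trivial graph: every pair of keyphrases in one
    document is connected (a clique over the keyphrase set)."""
    nodes = set(keyphrases)
    edges = {frozenset(p) for p in combinations(sorted(nodes), 2)}
    return nodes, edges

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def graph_similarity(doc_a, doc_b, w_nodes=0.5, w_edges=0.5):
    """Blend node overlap and edge overlap into one score in [0, 1]."""
    na, ea = keyphrase_graph(doc_a)
    nb, eb = keyphrase_graph(doc_b)
    return w_nodes * jaccard(na, nb) + w_edges * jaccard(ea, eb)
```

Two documents sharing all keyphrases score 1.0; disjoint keyphrase sets score 0.0.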
A Personalized Graph-Based Document Ranking Model Using a Semantic
User Profile
This paper presents a personalized document ranking model that uses an extended graph-based distance measure and semantic user profiles derived from the Open Directory Project (ODP) ontology. The model combines the minimum common supergraph (MCS) and the maximum common subgraph (mcs) to align the user's interests with relevant documents, and extends this graph distance with semantic relatedness based on shared concepts and links. Comparative results show the effectiveness of the model against Yahoo search results [7].
Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT
This work addresses semantic textual similarity (STS) in the Japanese clinical domain. Two Japanese clinical STS datasets (case reports and electronic medical records) were created, and bidirectional encoder representations from transformers (BERT) models were used to handle STS tasks in a language other than English. The results show that although both general and clinical Japanese BERT models performed well, the general model demonstrated superior performance, possibly due to greater pretraining text compared with the clinical Japanese BERT [8].
Efficient Graph-based Document Similarity
This paper presents an efficient approach to document similarity based on graph representations that incorporate relational knowledge from knowledge graphs. In contrast to traditional vocabulary-based approaches, the graph model handles differences in language and vocabulary effectively. Experimental results show that the method outperforms comparable measures at relating document similarity to human judgments, even for small documents, while being computationally efficient compared with other graph-based methods [9].
Research on Document Similarity Calculation and Detection Based on Deep
Learning
This work introduces a new approach to document similarity calculation and detection using deep learning. It uses efficient subtree matching to estimate document feature sequences and extracts keyword frequencies, then classifies documents with a deep learning model to facilitate similarity detection. The empirical findings show low computational error and reliable results, demonstrating the effectiveness and feasibility of the proposed method [10].
1. Query Processing:
Receive long and detailed queries from the user. Extract keywords from the query
using natural language processing (NLP) techniques such as tokenization, stop-word
removal, and stemming.
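As a rough sketch of this step, keyword extraction can be written in plain Python. The stop-word list and suffix rules below are simplified placeholders for a real NLP toolkit such as NLTK:

```python
import re

# A tiny illustrative stop-word list; a real system would use a full one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "for", "and", "to", "in", "what", "how"}

def stem(word):
    """Naive suffix stripping -- a stand-in for a real stemmer
    such as NLTK's PorterStemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_keywords(query):
    tokens = re.findall(r"[a-z0-9]+", query.lower())      # tokenization
    content = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in content]                     # stemming
```

For example, a long question reduces to a handful of content stems that can be fed to the search engines in the next step.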
2. Data Collection:
Query multiple search engines (SE1, SE2, SE3) using the extracted keywords. Col-
lect URLs and relevant metadata (title, snippet, etc.) of search results from each
search engine.
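The merging of results from several engines might look as follows; the fetcher callables here stand in for real search-engine APIs, which are not specified in this proposal:

```python
def collect_results(keywords, engines):
    """Merge results from several search engines, de-duplicating by URL.
    `engines` maps an engine name (e.g. "SE1") to a callable returning
    [{'url': ..., 'title': ..., 'snippet': ...}, ...]; each callable
    would wrap a real search API in practice."""
    seen, merged = set(), []
    for name, search in engines.items():
        for hit in search(keywords):
            if hit["url"] not in seen:       # keep first occurrence only
                seen.add(hit["url"])
                hit["source"] = name         # remember which engine found it
                merged.append(hit)
    return merged
```

De-duplicating by URL at collection time avoids scraping the same page twice in the next step.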
3. Web Scraping:
Visit each URL obtained from the search engines. Scrape the content of the web pages using libraries such as BeautifulSoup or Scrapy. Extract text, images, and other relevant information from the web pages.
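A minimal text extractor using only the standard library can illustrate this step; it is a stand-in for BeautifulSoup or Scrapy, which the plan above actually names:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

In the real system the HTML would come from an HTTP fetch of each collected URL; here only the extraction is shown.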
4. Data Preprocessing:
Clean the scraped data by removing HTML tags, boilerplate content, and other
noise. Normalize the text data by converting it to lowercase and removing punctua-
tion. Tokenize the text into words or phrases. Perform lemmatization or stemming
to reduce words to their base form.
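A compact sketch of this cleaning pipeline follows; the plural-stripping rule is a crude placeholder for real lemmatization or stemming:

```python
import re
from collections import Counter

def preprocess(raw_text):
    """Normalize scraped text: drop leftover HTML tags, lowercase,
    strip punctuation, tokenize, and crudely reduce plurals.
    Returns term frequencies for the later graph/ranking stages."""
    text = re.sub(r"<[^>]+>", " ", raw_text)       # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())   # lowercase + tokenize
    base = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return Counter(base)
```

Returning a frequency table rather than raw text makes the later keyword-overlap and relevance computations cheap.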
5. Graph Construction:
Represent the preprocessed data as a graph structure. Nodes in the graph repre-
sent web pages or documents. Edges between nodes represent relationships such as
hyperlinks, co-occurrence of keywords, or semantic similarity.
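One simple realization of this step, assuming keyword co-occurrence as the edge criterion (hyperlinks or semantic similarity would be handled analogously):

```python
def build_graph(doc_keywords):
    """doc_keywords: {doc_id: set of keywords}. Connect two documents
    when they share at least one keyword; the edge weight is the
    number of shared keywords."""
    graph = {doc: {} for doc in doc_keywords}
    docs = list(doc_keywords)
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            shared = len(doc_keywords[a] & doc_keywords[b])
            if shared:
                graph[a][b] = shared
                graph[b][a] = shared
    return graph
```

The adjacency-dict representation ({node: {neighbor: weight}}) feeds directly into the ranking step.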
6. Ranking Algorithm:
Develop a ranking algorithm based on graph theory, machine learning, or deep learn-
ing techniques. Incorporate features such as node centrality, edge weights, document
relevance, and user query relevance. Train the ranking model using labeled data if
available, or unsupervised learning techniques otherwise.
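As one concrete choice for the node-centrality feature mentioned above, a plain weighted PageRank can be sketched (dangling nodes leak rank mass in this simplified version):

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank over an adjacency dict {node: {nbr: weight}}.
    Each node spreads its rank to neighbors in proportion to edge weight."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            inflow = sum(
                rank[m] * graph[m][n] / sum(graph[m].values())
                for m in nodes if n in graph[m]
            )
            new[n] = (1 - damping) / len(nodes) + damping * inflow
        rank = new
    return rank
```

On a symmetric graph (e.g. a triangle of equal weights) all nodes end up with equal rank, which is a quick sanity check for the implementation.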
7. Ranking and Result Generation:
Apply the ranking algorithm to the constructed graph to rank the web pages/documents.
Generate a ranked list of web pages/documents based on their relevance to the user
query. Select the top ranked results for display to the user.
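A minimal blending scheme for this step, assuming centrality and query-relevance scores have already been computed and `alpha` is a tunable mixing weight:

```python
def top_results(centrality, query_relevance, k=3, alpha=0.5):
    """Blend graph centrality with query relevance (both {doc: score})
    and return the k best documents, highest score first."""
    score = {
        d: alpha * centrality.get(d, 0.0) + (1 - alpha) * query_relevance.get(d, 0.0)
        for d in set(centrality) | set(query_relevance)
    }
    return sorted(score, key=score.get, reverse=True)[:k]
```

Tuning `alpha` trades off a page's global importance in the graph against its match to the specific query.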
8. User Interface:
Design and develop a user-friendly interface for query input and result display. Provide features such as filtering, sorting, and pagination for result navigation. Ensure responsiveness and accessibility across different devices and screen sizes.
9. Testing and Evaluation:
Conduct thorough testing of the search engine system, including unit testing, in-
tegration testing, and system testing. Evaluate the performance of the system in
terms of search accuracy, relevance, speed, and user satisfaction. Gather feedback
from users through surveys, interviews, or usability studies.
10. Optimization:
Identify areas for optimization based on testing and evaluation results. Optimize
the system for performance, scalability, and resource efficiency.
References
[1] https://arxiv.org/pdf/2204.07182.pdf. [Accessed 28-03-2024].
[2] arxiv.org. https://arxiv.org/pdf/2010.06395.pdf. [Accessed 28-03-2024].
[3] arxiv.org. https://arxiv.org/pdf/1902.03402.pdf. [Accessed 28-03-2024].
[4] arxiv.org. https://arxiv.org/pdf/2107.04771.pdf. [Accessed 28-03-2024].
[5] Detecting Semantic Similarity of Documents Using Natural Language Processing. https://www.sciencedirect.com/science/article/pii/S1877050921011716. [Accessed 28-03-2024].
[6] EBSCOhost. https://eds.p.ebscohost.com/eds/pdfviewer/pdfviewer?vid=1&sid=f436834e-0858-4565-8f4a-8604f74bd736%40redis. [Accessed 28-03-2024].
[7] Mariam Daoud, Lynda Tamine, and Mohand Boughanem. A personalized graph-
based document ranking model using a semantic user profile. In User Modeling,
Adaptation, and Personalization: 18th International Conference, UMAP 2010, Big
Island, HI, USA, June 20-24, 2010. Proceedings 18, pages 171–182. Springer, 2010.
[8] Faith Wavinya Mutinda, Shuntaro Yada, Shoko Wakamiya, and Eiji Aramaki. Semantic textual similarity in Japanese clinical domain texts using BERT. Methods of Information in Medicine, 60(S 01):e56–e64, 2021.
[9] Christian Paul, Achim Rettinger, Aditya Mogadala, Craig A Knoblock, and Pedro
Szekely. Efficient graph-based document similarity. In The Semantic Web. Latest
Advances and New Domains: 13th International Conference, ESWC 2016, Herak-
lion, Crete, Greece, May 29–June 2, 2016, Proceedings 13, pages 334–349. Springer,
2016.
[10] Cui Xing, Yan Yang, and Jian Luo. Research on document similarity calculation and
detection based on deep learning. In Journal of Physics: Conference Series, volume
1757, page 012007. IOP Publishing, 2021.