Quora Question Pairs Similarity: Bachelor of Engineering IN Computer Science & Engineering

Quora Question Pairs Similarity
Submitted in partial fulfillment of the requirements for the award of degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING
Submitted to: Er. Mradula
Submitted by: Harish Malik – 18BCS1188
Mentor Signature
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Chandigarh University, Gharuan

March 2022
INTRODUCTION
Quora is a question-and-answer (Q&A) platform, much like StackOverflow. Unlike StackOverflow,
however, Quora is a general-purpose platform, so its questions and answers contain far less code.
One of the many problems Quora faces is the duplication of questions. Duplicate questions
degrade the experience for both the asker and the answerer. If a new question duplicates an
existing one, the asker can simply be shown the answers to the earlier question, and answerers
do not have to repeat their answers to what is essentially the same question.
For example, suppose the question “How can I be a good geologist?” already has some answers.
Later, someone else asks “What should I do to be a great geologist?”.
Both questions ask the same thing: the wording differs, but the intent is the same, so the same
answers apply to both. We can therefore show the answers to the first question. The person
asking gets answers immediately, and the people who already answered the first question do not
have to repeat themselves.
This problem is available on Kaggle as a competition: https://www.kaggle.com/c/quora-question-pairs
FEASIBILITY STUDY
1) Financial feasibility: -
The project uses the Python programming language together with machine learning, deep
learning and NLP techniques. No costly software needs to be purchased: all the software
required for this project is available for free. The project can be carried out on a basic
computer with minimal specifications.
2) Technological feasibility: -
Techniques that combine machine learning clustering and classification algorithms have
been shown to perform well on benchmark tasks. In this project the data is first divided
into clusters, and then, for each cluster, the algorithm that performs best on that
cluster is used.
3) Resource and time feasibility: -
The resources required for this project are:
• Programming device (laptop)
• Programming tools (easily available)
• A programmer
4) Users of the product: -
This is an end-to-end project, intended to meet industry standards, that platforms like
Quora can use to identify duplicate questions. It provides facilities such as logging
and is easy to implement.
METHODOLOGY/PLANNING OF WORK
The following diagram shows the workflow of the project:

MODULES AND THE DISTRIBUTION OF WORK
The project has been divided into three modules as given below:
1. Data Preprocessing and EDA

The first and foremost task is to preprocess the data. In this module we handle null
values, clean the question text (removing emoji, URLs, web links, etc.), drop features
that are not needed, and vectorize the questions using techniques such as TF-IDF and
Word2Vec. We also perform exploratory data analysis (EDA) in this module.
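The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the helper names (clean_question, tokenize, tfidf_vectors, cosine), the emoji ranges, and the smoothed IDF formula are all assumptions made for the example, and Word2Vec is not covered.

```python
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
# Assumption: emoji are approximated by two common Unicode ranges (not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_question(text):
    """Lowercase a question and strip URLs and emoji; nulls become ''."""
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return ""
    text = URL_RE.sub(" ", str(text))
    text = EMOJI_RE.sub(" ", text)
    return " ".join(text.lower().split())


def tokenize(text):
    """Split cleaned text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text)


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption of this sketch): ln((1 + n) / (1 + df)) + 1
        vectors.append({
            tok: (count / len(doc)) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


questions = [
    None,  # a null question, as may occur in the raw data
    "How can I be a good geologist?",
    "What should I do to be a great geologist? https://example.com 😀",
]
cleaned = [clean_question(q) for q in questions]
vectors = tfidf_vectors([tokenize(q) for q in cleaned])
similarity = cosine(vectors[1], vectors[2])
```

Here similarity falls strictly between 0 and 1 for the two geologist questions, while the null question yields an empty vector and a similarity of 0.0 with everything else.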
2. Model Training
In model training we first divide the data into clusters. Each cluster is then trained
separately, using the model that is most appropriate for that cluster.
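The cluster-then-train idea can be illustrated with a small sketch. Everything concrete here is an assumption made for the example: the centroids are taken as given (for instance from a clustering step such as k-means), the only clustering feature is question length in words, and a simple similarity threshold stands in for whatever model would really be chosen per cluster.

```python
def nearest_cluster(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
    )


def fit_threshold(scores, labels):
    """Pick the score threshold with the highest training accuracy.

    A stand-in for a real per-cluster model.
    """
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def train_per_cluster(rows, centroids):
    """rows: (features, similarity_score, is_duplicate) triples.

    Returns {cluster_id: threshold}, one fitted model per cluster.
    """
    buckets = {}
    for x, score, label in rows:
        buckets.setdefault(nearest_cluster(x, centroids), []).append((score, label))
    return {
        k: fit_threshold([s for s, _ in pairs], [y for _, y in pairs])
        for k, pairs in buckets.items()
    }


def predict(x, score, centroids, models, default_threshold=0.5):
    """Route the pair to its cluster's model and return 1 for duplicate, else 0."""
    k = nearest_cluster(x, centroids)
    return int(score >= models.get(k, default_threshold))


# Hypothetical data: feature = question length in words, score = text similarity.
centroids = [(5,), (20,)]
rows = [
    ((4,), 0.2, 0), ((5,), 0.6, 1), ((6,), 0.7, 1), ((5,), 0.3, 0),      # short questions
    ((18,), 0.5, 0), ((22,), 0.8, 1), ((20,), 0.9, 1), ((21,), 0.4, 0),  # long questions
]
models = train_per_cluster(rows, centroids)  # {0: 0.6, 1: 0.8}
```

With these toy rows, the same similarity score of 0.7 is classified as a duplicate in the short-question cluster but not in the long-question cluster, which is the motivation for training each cluster separately.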
3. Final model prediction and deployment
After training, the model is saved to a pickle file, which is a binary file. The binary
format keeps the saved model compact, and the model can be loaded later to detect
duplicate question pairs. The finalised model is hosted using Streamlit. Maintenance
and logging are also handled from the cloud.
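Saving and reloading a model with pickle might look like the sketch below. The dict used as the "model", the file name model.pkl, and the helper names save_model and load_model are placeholders for illustration; the Streamlit hosting step is not shown.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize the trained model to a binary pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load a model back from disk. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Placeholder standing in for a trained model object.
model = {"threshold": 0.6, "features": ["common_words", "word_share"]}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(model, path)
restored = load_model(path)
```

The restored object compares equal to the original, so the prediction code can load it once at start-up and reuse it for every incoming question pair.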
INNOVATION IN THE PROJECT
We can try different machine learning and deep learning algorithms. Advanced models such as
LSTM, GRU, BERT and other neural networks can give better accuracy. Model selection in machine
learning is largely trial and error: we do not know in advance whether a given model will
perform better or not, so several advanced models can be tried to find the best results. The
project also includes hand-engineered features, which are intended to improve the accuracy of
the model.
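As an illustration of hand-engineered features, the sketch below computes a few simple pairwise features for two questions. The specific features (common_words, word_share, and so on) and the function name pair_features are assumptions chosen for the example; the source does not list the project's actual features.

```python
import re


def tokenize(text):
    """Lowercase a question and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pair_features(q1, q2):
    """Compute simple hand-crafted features for one question pair."""
    t1, t2 = tokenize(q1), tokenize(q2)
    s1, s2 = set(t1), set(t2)
    common = len(s1 & s2)
    total = len(s1) + len(s2)
    return {
        "common_words": common,                        # unique words shared by both
        "word_share": common / total if total else 0.0,
        "token_len_diff": abs(len(t1) - len(t2)),      # difference in word counts
        "same_first_word": bool(t1 and t2 and t1[0] == t2[0]),
        "same_last_word": bool(t1 and t2 and t1[-1] == t2[-1]),
    }


features = pair_features(
    "How can I be a good geologist?",
    "What should I do to be a great geologist?",
)
# features["common_words"] == 4 and features["word_share"] == 0.25
```

Features like these can be appended to the vectorized text before training, giving the model direct signals about word overlap that it would otherwise have to learn.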
SOFTWARE AND HARDWARE REQUIREMENTS:
SOFTWARE REQUIREMENTS:
• Windows 8
• Python version 3.7.4
• NLTK
• Keras
• TensorFlow
HARDWARE REQUIREMENTS:
• RAM: 8 GB
• Hard disk: 128 GB SSD
• Processor: i5 8th generation
BIBLIOGRAPHY
[1] https://appliedroots.com/
[2] https://towardsdatascience.com/the-quora-question-pair-similarity-problem-3598477af172
