
Quora Question Pairs Similarity

Submitted in partial fulfillment of the requirements for the award of degree of

BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING

Submitted to: Er. Mradula

Submitted by: Harish Malik – 18BCS1188

Mentor Signature

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Chandigarh University, Gharuan


March 2022
INTRODUCTION

Quora is a question-and-answer (Q&A) platform, much like Stack Overflow. Unlike Stack Overflow, however, Quora is general-purpose, so its questions rarely involve code.

One of the many problems Quora faces is question duplication. Duplicate questions spoil the experience for both the asker and the answerer. If a new question duplicates an existing one, the asker can simply be shown the answers to the earlier question, and answerers do not have to repeat their answers to what is essentially the same question.

For example, suppose someone asks “How can I be a good geologist?” and the question receives some answers. Later, someone else asks “What should I do to be a great geologist?”.

Both questions ask the same thing. Although the wording differs, the intent of the two questions is the same.

The answers will therefore be the same for both questions, so we can simply show the answers to the first question. The person asking gets answers immediately, and the people who already answered the first question do not have to repeat themselves.

This problem is available on Kaggle as a competition: https://www.kaggle.com/c/quora-question-pairs

FEASIBILITY STUDY

1) Financial feasibility: -

The project uses the Python programming language together with machine learning, deep learning and NLP techniques. No costly software needs to be purchased: all the software required for this project is available for free. The project can be completed on a basic computer with minimum specifications.

2) Technological feasibility: -

Techniques based on machine learning clustering and classification algorithms have been shown to perform well on benchmark tasks. In this project, the data is first divided into clusters, and each cluster is then handled by the algorithm that performs best on it.

3) Resource and time feasibility: -

The resources required for this project include:

• Programming device (laptop)
• Programming tools (easily available)
• A programmer

4) Users of the product: -

This is an end-to-end project built to industry standards. Platforms like Quora can use it to identify duplicate questions. It provides facilities such as logging and is easy to implement.

METHODOLOGY/PLANNING OF WORK

The following diagram shows the workflow of the project:


MODULES AND THE DISTRIBUTION OF WORK

The project has been divided into three modules as given below:

1. Data Preprocessing and EDA

The first and foremost task is to preprocess the data. In this module we handle null values, clean the text of each question (removing emoji, URLs, web links and similar noise), and drop features that are not needed. The cleaned questions are then vectorized using techniques such as TF-IDF and Word2Vec. Exploratory data analysis (EDA) is also performed in this module.
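
The cleaning and vectorization steps could look roughly like the sketch below. It is an illustration under assumptions, not the project's actual code: the function names `clean_question` and `tfidf_vectors` are hypothetical, the cleaning rules are one plausible choice, and the smoothed IDF formula is an assumed variant. It uses only the Python standard library.

```python
import math
import re
from collections import Counter

# Matches http(s) links and bare "www." links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Anything that is not a lowercase ASCII letter, digit or whitespace.
# Emoji and punctuation fall into this class (an assumed cleaning rule).
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_question(text):
    """Lowercase a question and strip URLs, emoji and punctuation."""
    if text is None:  # treat a null question as an empty string
        return ""
    text = URL_RE.sub(" ", str(text).lower())
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def tfidf_vectors(questions):
    """Return one {token: tf-idf weight} dict per cleaned question."""
    docs = [clean_question(q).split() for q in questions]
    n_docs = len(docs)
    # Document frequency: in how many questions each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            token: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return vectors
```

For instance, `clean_question("How can I be a good geologist? https://example.com")` yields `"how can i be a good geologist"`. Tokens shared by every question receive a lower weight than tokens that distinguish one question from another.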

2. Model Training

In model training, we first divide the data into clusters. Each cluster is then trained separately, using the model that is most appropriate for that cluster.
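
The text above does not name the clustering method or the per-cluster models, so the following is only a sketch of the idea. It assumes each question pair is summarised by a single numeric feature, uses a tiny one-dimensional k-means as the clustering step, and fits a placeholder majority-label model per cluster. All names (`kmeans_1d`, `assign_cluster`, `MajorityModel`, `train_per_cluster`) are hypothetical.

```python
from collections import defaultdict


def assign_cluster(feature, centroids):
    """Index of the centroid nearest to a 1-D feature value."""
    return min(range(len(centroids)), key=lambda i: abs(feature - centroids[i]))


def kmeans_1d(values, k, n_iter=20):
    """Tiny 1-D k-means; centroids start evenly spread over the value range."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(n_iter):
        groups = defaultdict(list)
        for v in values:
            groups[assign_cluster(v, centroids)].append(v)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            sum(groups[i]) / len(groups[i]) if groups[i] else centroids[i]
            for i in range(k)
        ]
    return centroids


class MajorityModel:
    """Stand-in model: always predicts the most common training label."""

    def fit(self, labels):
        self.label = max(set(labels), key=labels.count)
        return self

    def predict(self):
        return self.label


def train_per_cluster(features, labels, k=2):
    """Cluster the data, then fit one separate model per cluster."""
    centroids = kmeans_1d(features, k)
    by_cluster = defaultdict(list)
    for f, y in zip(features, labels):
        by_cluster[assign_cluster(f, centroids)].append(y)
    models = {c: MajorityModel().fit(ys) for c, ys in by_cluster.items()}
    return centroids, models
```

At prediction time, a new pair is first routed with `assign_cluster` and then scored by that cluster's model. In a real pipeline the placeholder model would be replaced by whichever classifier performs best on each cluster.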

3. Final model prediction and deployment

After training, the model is saved to a pickle file, which is a binary file. The binary format keeps the saved model compact, and the saved model can be loaded later to detect duplicate question pairs. The finalised model is hosted using Streamlit. Maintenance and logging are also handled from the cloud.
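
Saving and reloading a model with Python's standard `pickle` module might look like the sketch below. The helper names `save_model` and `load_model`, the file name, and the plain dictionary standing in for a trained model are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a trained model to a binary pickle file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled model back into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a stand-in "model" (a plain dict, purely for illustration).
model = {"threshold": 0.5}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "duplicate_model.pkl")
    save_model(model, path)
    restored = load_model(path)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.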

INNOVATION IN THE PROJECT

We can try different machine learning and deep learning algorithms. Advanced models such as LSTM, GRU and BERT neural networks can give better accuracy. Model selection in machine learning is largely trial and error: we do not know in advance whether a given model will perform better. Various advanced models can therefore be tried in this project to obtain better results. The project also includes hand-engineered features, which are expected to improve the accuracy of the model.
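
As an illustration of hand-engineered features, the sketch below computes a few simple word-overlap measures for a question pair. The specific features and the name `pair_features` are assumptions chosen for the example, not the project's actual feature set.

```python
def pair_features(q1, q2):
    """Compute simple overlap features for a pair of questions."""
    w1 = set(q1.lower().split())
    w2 = set(q2.lower().split())
    common = w1 & w2
    union = w1 | w2
    return {
        # Number of distinct words the two questions share.
        "common_words": len(common),
        # Shared words as a fraction of all distinct words (Jaccard similarity).
        "word_share": len(common) / len(union) if union else 0.0,
        # Absolute difference in character length.
        "length_diff": abs(len(q1) - len(q2)),
    }


features = pair_features(
    "How can I be a good geologist?",
    "What should I do to be a great geologist?",
)
```

For the two geologist questions from the introduction, four words are shared ("i", "be", "a", "geologist?") out of twelve distinct words. A classifier can take such features as input alongside the vectorized text.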

SOFTWARE AND HARDWARE REQUIREMENTS

SOFTWARE REQUIREMENTS:
• Windows 8
• Python 3.7.4
• NLTK
• Keras
• TensorFlow

HARDWARE REQUIREMENTS:
• RAM: 8 GB
• Hard disk: 128 GB SSD
• Processor: i5, 8th generation

