
Kaggle Problem 5

Quora Insincere Questions Classification
Detect toxic content to improve online conversations

Description

An existential problem for any major website today is how to handle toxic and divisive content. Quora
wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their
knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions
and connect with others who contribute unique insights and quality answers. A key challenge is to weed
out insincere questions -- those founded upon false premises, or that intend to make a statement rather than
look for helpful answers.
In this competition, Kagglers will develop models that identify and flag insincere questions. To date,
Quora has employed both machine learning and manual review to address this problem. With your help,
they can develop more scalable methods to detect toxic and misleading content.

Here's your chance to combat online trolls at scale. Help Quora uphold their policy of “Be Nice, Be
Respectful” and continue to be a place for sharing and growing the world’s knowledge.

Important Note
Be aware that this is being run as a Kernels Only Competition, requiring that all submissions be made via a
Kernel output. Please read the Kernels FAQ and the data page very carefully to fully understand how this
is designed.

Evaluation
Submissions are evaluated on F1 Score between the predicted and the observed targets.
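Since ranking depends on F1 rather than accuracy, it can help to see the metric written out. A minimal pure-Python sketch, using invented labels for illustration:

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall on the positive (insincere) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented labels for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(round(f1_score(y_true, y_pred), 4))  # 0.6667
```

In practice `sklearn.metrics.f1_score` computes the same quantity. Because submissions must contain hard 0/1 labels, the threshold used to binarize your model's scores is itself worth tuning against F1 on a validation split.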
Submission File
For each qid in the test set, you must predict whether the corresponding question_text is insincere (1) or
not (0). Predictions should only be the integers 0 or 1. The file should contain a header and have the
following format:
qid,prediction
0000163e3ea7c7a74cd7,0
00002bd4fb5d505b9161,0
00007756b4a147d2b0b3,0
...
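A file in this format can be produced with the standard csv module. A minimal sketch, where the qid values are taken from the example above and the labels are placeholders rather than real model output:

```python
import csv

# Placeholder predictions: in a real kernel these come from your model.
predictions = {
    "0000163e3ea7c7a74cd7": 0,
    "00002bd4fb5d505b9161": 0,
    "00007756b4a147d2b0b3": 0,
}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["qid", "prediction"])  # required header
    for qid, label in predictions.items():
        writer.writerow([qid, int(label)])  # predictions must be integer 0 or 1
```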

Kernel Submissions
For this competition, you will make submissions directly from Kaggle Kernels. By adding your teammates
as collaborators on a kernel, you can share and edit code privately with them. For more details, please visit
the Kernels-FAQ for this competition.

Timeline
- January 29, 2019 - Entry deadline. You must accept the competition rules before this date in order to compete.
- January 29, 2019 - Team Merger deadline. This is the last day participants may join or merge teams.
- February 5, 2019 - Final submission deadline. After this date, we will not be taking any more submissions. Remember to select your two best submissions to be rescored during the re-run period. In this competition we will not auto-select your two submissions.
- February 6-13, 2019 - Selected Kernel Re-runs on Private Test Set. Review the Discussion Forum, particularly this post, for details.
All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition
organizers reserve the right to update the contest timeline if they deem it necessary.

Prizes
- 1st Place - $12,000
- 2nd Place - $8,000
- 3rd Place - $5,000

Kernels FAQ
How do I submit using Kernels?
1. Write the predictions generated by your code to a .csv file. Ensure your submission file contains the same format and number of rows expected in the sample_submission file on the competition's Data page. For this competition, your submission must be named submission.csv.

2. Commit your Kernel.

3. Navigate to the Output tab of the Kernel and click "Submit to Competition".

How do I upload external data?

This competition does not allow external data. If your kernel makes use of external data sources, the "Submit to Competition" button will not be active after committing. A set of whitelisted pre-trained models is available on the data page, which you are permitted to use.

What are the compute limits of Kernels?

Both your training and prediction should fit in a single Kernel. That means ensembles will need to be done
in a single Kernel, and not from uploaded external data.

GPUs are enabled for this competition. If you use GPUs, you will be limited to 2 hours of run time. If you
do not use GPUs, you will be limited to 6 hours of run time. If you attempt to make a submission whose
kernel exceeds these limits, you will receive an error.

What Kernel features are not available for this competition?

In order for the "Submit to Competition" button to be active after the Kernel commit, the following
conditions must be met:
- CPU Kernel <= 6 hours run-time
- GPU Kernel <= 2 hours run-time
- No internet access enabled
- No multiple data sources enabled
- No custom packages
- Submission file must be named "submission.csv"

How do I team up in a Kernels-only competition?

The competition is set up the same as a normal competition, except that submissions are made only through Kernels. So to team up, go to the "Team" tab and invite others.

How will winners be determined?

During the competition, you will create your models in kernels, and make submissions based on the test set
provided on the Data page. You will make submissions from your kernels using the above steps. This will
give you feedback on the public leaderboard about your model's performance.
Following the final submission deadline for the competition, your kernel code will be re-run on a privately held test set that is not provided to you. Your model's score against this private test set will determine your ranking on the private leaderboard and your final standing in the competition.

Citation
Alex Ellis, inversion, Julia Elliott, Paula Griffin, William Chen. (2018). Quora Insincere Questions
Classification. Kaggle. https://kaggle.com/competitions/quora-insincere-questions-classification

Dataset Description
General Description
In this competition you will be predicting whether a question asked on Quora is sincere or not.

An insincere question is defined as a question intended to make a statement rather than look for helpful
answers. Some characteristics that can signify that a question is insincere:

- Has a non-neutral tone
  o Has an exaggerated tone to underscore a point about a group of people
  o Is rhetorical and meant to imply a statement about a group of people
- Is disparaging or inflammatory
  o Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
  o Makes disparaging attacks/insults against a specific person or group of people
  o Is based on an outlandish premise about a group of people
  o Disparages a characteristic that is not fixable and not measurable
- Isn't grounded in reality
  o Is based on false information, or contains absurd assumptions
- Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The training data includes the question that was asked, and whether it was identified as insincere (target =
1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.
Note that the distribution of questions in the dataset should not be taken to be representative of the
distribution of questions asked on Quora. This is, in part, because of the combination of sampling
procedures and sanitization measures that have been applied to the final dataset.

File descriptions
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission in the correct format
- embeddings/ - (see below)

Data fields
- qid - unique question identifier
- question_text - Quora question text
- target - a question labeled "insincere" has a value of 1, otherwise 0
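To make the schema concrete, here is a sketch that parses a tiny in-memory sample with the same three fields; the rows are invented, not taken from the dataset:

```python
import csv
import io

# Invented rows mimicking train.csv's qid,question_text,target schema
sample = io.StringIO(
    "qid,question_text,target\n"
    "abc123,How do plants make their own food?,0\n"
    "def456,Why do some people always ruin everything?,1\n"
)

rows = list(csv.DictReader(sample))
# csv.DictReader yields strings, so compare target against "1", not 1
insincere_ids = [r["qid"] for r in rows if r["target"] == "1"]
print(insincere_ids)  # ['def456']
```

In a real kernel you would read the actual file (e.g. with `pandas.read_csv("train.csv")`), but the field structure is the same.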
This is a Kernels-only competition. The files in this Data section are downloadable for reference in Stage
1. Stage 2 files will only be available in Kernels and not available for download.

What will be available in the 2nd stage of the competition?


In the second stage of the competition, we will re-run your selected Kernels. The following files will be
swapped with new data:

- test.csv - This will be swapped with the complete public and private test dataset. This file will have ~56k rows in stage 1 and ~376k rows in stage 2. The public leaderboard data remains the same for both versions. The file name will be the same (both test.csv) to ensure that your code will run.
- sample_submission.csv - Similar to test.csv, this will be changed from ~56k rows in stage 1 to ~376k rows in stage 2. The file name will remain the same.

Embeddings
External data sources are not allowed for this competition. We are, however, providing a number of word embeddings along with the dataset that can be used in your models. These are as follows:

- GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
- glove.840B.300d - https://nlp.stanford.edu/projects/glove/
- paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
- wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html
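The GloVe and fastText files are plain text: each line is a token followed by its vector components. A minimal loading sketch over a 3-dimensional in-memory sample (the real files are 300-dimensional; splitting from the right guards against the occasional token that itself contains spaces):

```python
import io

# In-memory stand-in for an embedding file; the provided files are 300-d.
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "question 0.4 0.5 0.6\n"
)

dim = 3  # would be 300 for the provided embeddings
embeddings = {}
for line in sample:
    # rsplit from the right so a token containing spaces stays intact
    parts = line.rstrip().rsplit(" ", dim)
    embeddings[parts[0]] = [float(x) for x in parts[1:]]

print(len(embeddings))  # 2
```

Note that the GoogleNews file is a binary word2vec file, so it needs a different loader (e.g. gensim's `KeyedVectors.load_word2vec_format` with `binary=True`).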
