Kaggle Problem 5
Description
An existential problem for any major website today is how to handle toxic and divisive content. Quora
wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their
knowledge with the world.
Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions
and connect with others who contribute unique insights and quality answers. A key challenge is to weed
out insincere questions -- those founded upon false premises, or that intend to make a statement rather than
look for helpful answers.
In this competition, Kagglers will develop models that identify and flag insincere questions. To date,
Quora has employed both machine learning and manual review to address this problem. With your help,
they can develop more scalable methods to detect toxic and misleading content.
Here's your chance to combat online trolls at scale. Help Quora uphold their policy of “Be Nice, Be
Respectful” and continue to be a place for sharing and growing the world’s knowledge.
Important Note
Be aware that this is being run as a Kernels Only Competition, requiring that all submissions be made via a
Kernel output. Please read the Kernels FAQ and the data page very carefully to fully understand how this
competition is designed.
Evaluation
Submissions are evaluated on F1 Score between the predicted and the observed targets.
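The F1 score is the harmonic mean of precision and recall over the positive (insincere) class, so it rewards models that balance the two rather than excel at only one. A minimal sketch of the computation in plain Python (no external libraries):

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall), positive class = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives means precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 2 true positives, 1 false positive, 1 false negative,
# so precision = recall = 2/3 and F1 = 2/3.
print(f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

Because F1 ignores true negatives, a model that predicts 0 everywhere scores 0.0 on this metric, which matters given how imbalanced the sincere/insincere classes are.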
Submission File
For each qid in the test set, you must predict whether the corresponding question_text is insincere (1) or
not (0). Predictions should only be the integers 0 or 1. The file should contain a header and have the
following format:
qid,prediction
0000163e3ea7c7a74cd7,0
00002bd4fb5d505b9161,0
00007756b4a147d2b0b3,0
...
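The format above can be produced with Python's standard csv module. A minimal sketch, where the (qid, prediction) pairs are hypothetical placeholders standing in for your model's output:

```python
import csv

# Hypothetical (qid, prediction) pairs; in a real kernel these come from
# running your model over test.csv.
predictions = [
    ("0000163e3ea7c7a74cd7", 0),
    ("00002bd4fb5d505b9161", 0),
    ("00007756b4a147d2b0b3", 1),
]

# The file must be named submission.csv and start with a qid,prediction header.
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["qid", "prediction"])
    for qid, pred in predictions:
        writer.writerow([qid, int(pred)])  # predictions must be the integers 0 or 1
```

Writing with `newline=""` avoids blank rows on some platforms; casting to `int` guards against accidentally emitting floats like `1.0`, which would not match the required format.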
Kernel Submissions
For this competition, you will make submissions directly from Kaggle Kernels. By adding your teammates
as collaborators on a kernel, you can share and edit code privately with them. For more details, please visit
the Kernels FAQ for this competition.
Timeline
January 29, 2019 - Entry deadline. You must accept the competition rules before this date in
order to compete.
January 29, 2019 - Team Merger deadline. This is the last day participants may join or merge
teams.
February 5, 2019 - Final submission deadline. After this date, we will not be taking any more
submissions. Remember to select your two best submissions to be rescored during the re-run
period. In this competition we will not auto-select your two submissions.
February 6 - 13, 2019 - Selected Kernel Re-runs on Private Test Set. Review the Discussion
Forum, particularly this post for details.
All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition
organizers reserve the right to update the contest timeline if they deem it necessary.
Prizes
1st Place - $12,000
Kernels FAQ
How do I submit using Kernels?
1. Write the predictions generated by your code to a .csv file. Ensure
your submission file contains the same format and number of rows
expected in the sample_submission file on the competition's Data
page. For this competition, your submission must be named submission.csv.
2. Commit your Kernel.
3. Navigate to the Output tab of the committed Kernel and click "Submit to Competition".
This competition does not allow external data. If you attempt to make a submission whose kernel makes
use of external data sources, the "Submit to Competition" button will not be active after committing. There
are a set of whitelisted pre-trained models available on the data page, which you are permitted to use.
Both your training and prediction should fit in a single Kernel. That means ensembles will need to be done
in a single Kernel, and not from uploaded external data.
GPUs are enabled for this competition. If you use GPUs, you will be limited to 2 hours of run time. If you
do not use GPUs, you will be limited to 6 hours of run time. If you attempt to make a submission whose
kernel exceeds these limits, you will receive an error.
In order for the "Submit to Competition" button to be active after the Kernel commit, the following
conditions must be met:
GPU Kernel <= 2 hours run-time
CPU Kernel <= 6 hours run-time
No custom packages
All of the competition setup is the same as for normal competitions, except that submissions are only made
through Kernels. To team up, go to the "Team" tab and invite others.
During the competition, you will create your models in kernels, and make submissions based on the test set
provided on the Data page. You will make submissions from your kernels using the above steps. This will
give you feedback on the public leaderboard about your model's performance.
Following the final submission deadline for the competition, your kernel code will be re-run on a privately
held test set that is not provided to you. It is your model's score against this private test set that will
determine your ranking on the private leaderboard and final standing in the competition.
Citation
Alex Ellis, inversion, Julia Elliott, Paula Griffin, William Chen. (2018). Quora Insincere Questions
Classification. Kaggle. https://kaggle.com/competitions/quora-insincere-questions-classification
Dataset Description
General Description
In this competition you will be predicting whether a question asked on Quora is sincere or not.
An insincere question is defined as a question intended to make a statement rather than look for helpful
answers. Some characteristics that can signify that a question is insincere:
Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine
answers
The training data includes the question that was asked, and whether it was identified as insincere (target =
1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.
Note that the distribution of questions in the dataset should not be taken to be representative of the
distribution of questions asked on Quora. This is, in part, because of the combination of sampling
procedures and sanitization measures that have been applied to the final dataset.
File descriptions
train.csv - the training set
test.csv - the test set
sample_submission.csv - A sample submission in the correct format
embeddings/ - (see below)
Data fields
qid - unique question identifier
question_text - Quora question text
target - a question labeled "insincere" has a value of 1, otherwise 0
This is a Kernels-only competition. The files in this Data section are downloadable for reference in Stage
1. Stage 2 files will only be available in Kernels and not available for download.
test.csv - This will be swapped with the complete public and private test dataset. This file will
have ~56k rows in stage 1 and ~376k rows in stage 2. The public leaderboard data remains the
same for both versions. The file name will be the same (both test.csv) to ensure that your code will
run.
sample_submission.csv - similar to test.csv, this will be changed from ~56k rows in stage 1 to ~376k
rows in stage 2. The file name will remain the same.
Embeddings
External data sources are not allowed in this competition. However, we are providing a number of word
embeddings along with the dataset that can be used in your models. These are as follows:
GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html
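Most of these files can be read as plain text: one token per line, followed by its 300 float components, space-separated. A minimal sketch of a parser for that layout, assuming that word2vec-style text format (the sample lines below are made-up 3-dimensional stand-ins, not real 300-d vectors; the GoogleNews file ships in a binary format and would typically need gensim's KeyedVectors instead):

```python
import numpy as np

def load_embeddings(lines):
    """Parse 'token v1 v2 ... vN' lines into a {token: np.ndarray} dict."""
    embeddings = {}
    for line in lines:
        token, *values = line.rstrip().split(" ")
        embeddings[token] = np.asarray(values, dtype="float32")
    return embeddings

# Made-up 3-d vectors purely for illustration; the provided files are 300-d.
sample = [
    "question 0.1 -0.2 0.3",
    "quora 0.4 0.5 -0.6",
]
vectors = load_embeddings(sample)
print(vectors["quora"].shape)  # (3,)
```

In a kernel you would pass `open(path)` instead of the sample list; since the file iterates lazily line by line, the multi-gigabyte embedding files never need to be held in memory as raw text.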