Professional Documents
Culture Documents
Project Paper
Project Paper
EXPLORATORY ANALYSIS
In the Goodreads dataset, each datum
includes the following features: reviewID,
userID, bookID, timestamp, rating (0-5),
review contains spoiler (yes/no), and review
text. The review text is given as a list of
sentences where each sentence is flagged
as containing a spoiler or not. The timespan
covered by the Goodreads dataset begins
on December 7, 2016 and ends November Across reviews containing spoilers, I
5, 2017. There are 1,378,033 total reviews calculated the “spoil percentage” - the
in this dataset, representing 18,892 users proportion of sentences flagged as spoilers
to sentences which are not spoilers - per
review. On average, 28.24% of spoiled DATASET CONSTRUCTION
reviews are spoiler-flagged (Fig 3). From the reviews dataset I extract only
reviews which contain spoilers. I then
generate a new dataset which separates
reviews into individual labeled sentences. I
create a new dataset of label-text pairs,
where the text is a single sentence from a
review and the label identifies whether it is a
spoiler, plus the userID and bookID. To
balance the new dataset, I populate it with
an equal number of spoiler and non-spoiler
text samples.
References