Abstract
This paper examines the power of deep learning techniques to distinguish samples of text drawn from Children's Literature from samples drawn from Advanced Literature. All of the samples come from Project Gutenberg, and we define children's literature to be any novel found in Project Gutenberg's Children's Literature section (and Advanced Literature as all other literature). We take two different approaches to classifying the text: the first represents the text as a Bag of Words; the second uses Mikolov and Le's Paragraph2Vec algorithm.
This work builds on that of Mikolov and Le and of Oliveira and Rodrigo, who have illustrated the power of the Paragraph2Vec algorithm in textual classification. Given the considerable success of Paragraph2Vec, we examine to what extent this algorithm can improve current solutions for analyzing textual difficulty.
1. Introduction
Children's literature continues to grow as an incredibly popular and lucrative genre. Since 2013, over 500,000 books of fiction have been published in the United States alone, and a focus of many publishers is to efficiently categorize these books by reading-level difficulty.1 The commercial applications are immediate: with books marked out by textual difficulty, publishers can more easily advertise their books to the correct audiences, and teachers become far better equipped to select texts appropriate for their students' reading level. There is currently great demand for tools to categorize literature by difficulty: the leading software in textual-difficulty classification, Lexile Analyzer, boasts over 100,000 users since its release in 1998, with its users having analyzed over 1,000,000 texts in the last two years alone. Lexile Analyzer assigns scores to text, indicating the grade level for which a certain text is appropriate. For reference, a kindergarten-level text ought to receive a score around 200, and a text meant for 12th graders ought to receive a
4.2 Evaluation
The Bag of Words results initially seemed very promising. While the Decision Tree based classifier fared rather poorly, the Naïve Bayes and SVM classifiers showed stronger performance:

Overall accuracy
  Naïve Bayes      60.2%
  SVM              58.8%
  Decision Trees   53.5%

Accuracy by class
                  Children's Literature   Advanced Literature
  Naïve Bayes     18.3%                   93.2%
  SVM             27.7%                   83.3%
  Decision Tree    6.8%                   90.2%
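The Bag-of-Words approach with a Naïve Bayes classifier can be sketched in a few lines of plain Python. The toy corpus and labels below are illustrative stand-ins for the Project Gutenberg samples, not the paper's data:

```python
import math
from collections import Counter

# Toy corpus standing in for the Gutenberg samples
# (labels: 0 = Children's Literature, 1 = Advanced Literature).
train = [
    ("the little bunny hopped home to the warm burrow", 0),
    ("the cat sat happily on the soft mat", 0),
    ("an ontological critique of hermeneutic discourse", 1),
    ("epistemology and the phenomenology of perception", 1),
]

def train_nb(samples, alpha=1.0):
    """Multinomial Naive Bayes over raw Bag-of-Words counts, Laplace-smoothed."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in samples:
        counts[label].update(text.split())
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab, alpha

def predict(model, text):
    counts, docs, vocab, alpha = model
    total = sum(docs.values())
    scores = {}
    for label in (0, 1):
        logp = math.log(docs[label] / total)  # log class prior
        denom = sum(counts[label].values()) + alpha * len(vocab)
        for w in text.split():
            if w in vocab:  # ignore out-of-vocabulary words
                logp += math.log((counts[label][w] + alpha) / denom)
        scores[label] = logp
    return max(scores, key=scores.get)

model = train_nb(train)
print(predict(model, "the bunny sat on the mat"))  # → 0 (Children's)
```

In practice the paper's classifiers would be trained on full word-count vectors over the Gutenberg corpus; the mechanics of counting words per class and scoring by smoothed log-likelihood are the same.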
                Children's Literature   Advanced Literature
  N = 5         74.0%                   94.5%
  N = 10        56.6%                   93.7%
  N = 100       41.8%                   95.9%
  N = 250       28.1%                   95.3%
  N = 500       20.5%                   95.5%
P(class 1) = s_1 / (s_1 + s_2)
P(class 2) = 1 − P(class 1)

where s_i denotes the classifier's output score for class i.
In this case, class 1 corresponds to Children's Literature, and class 2 corresponds to Advanced Literature.
To perform this classification, we create paragraph vectors from all the training and testing samples in Children's Literature and Advanced Literature. We then train the neural network classifier on the resulting training vectors.
After computing this probability for each sample in the testing set, the results revealed that the classifier was almost completely unable to distinguish between the Children's Literature samples and the Advanced Literature samples. Nearly every one of the 3,600 testing samples was classified as approximately 49% Children's Literature and 51% Advanced Literature, indicating that, after training, the neural network finds very little semantic distinction between the Children's Literature and Advanced Literature material.
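The two-class probability computation described above can be sketched as follows; the scores are illustrative assumptions chosen to mirror the reported ~49%/51% split, not values from the paper:

```python
def class_probabilities(s1, s2):
    # Normalize two non-negative class scores so they sum to 1.
    p1 = s1 / (s1 + s2)
    return p1, 1.0 - p1

# Illustrative scores only; a near-tie between the two classes
# reproduces the ~49%/51% split observed on the test samples.
p_childrens, p_advanced = class_probabilities(0.47, 0.49)
print(round(p_childrens, 2), round(p_advanced, 2))  # → 0.49 0.51
```

A classifier that has learned a real distinction would push these probabilities well away from 0.5; the near-tie across nearly all 3,600 test samples is what signals the failure.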
Representative per-sample probabilities:

  Children's Literature   Advanced Literature
  .493                    .507
  .488                    .512
  .484                    .516
  .503                    .497
  .497                    .503
  .491                    .509
  .494                    .506