Applying Deep Learning Techniques to Distinguish Children's Literature from Advanced Literature

Miles Hinson
Princeton University
mghinson@princeton.edu

Abstract

This paper examines the power of deep learning techniques to distinguish samples of text from Children's Literature from samples of text from Advanced Literature. All of the samples come from Project Gutenberg, and we define Children's Literature to be any novel found in Project Gutenberg's "Children's Literature" section (and Advanced Literature as all other literature). We take two different approaches to classifying the text: the first relies on representing the text as a Bag of Words; the second relies on using Mikolov and Le's Paragraph2Vec algorithm.

This work builds off of Mikolov and Le and Oliveira and Rodrigo, who have illustrated the power of the Paragraph2Vec algorithm in textual classification. Given the considerable success of Paragraph2Vec, we examine to what extent this algorithm can improve current solutions for analyzing textual difficulty.
1. Introduction
Children's literature continues to grow as an incredibly popular and lucrative genre. Since 2013, over 500,000 books of fiction have been published in the United States alone, and a focus of many publishers is to efficiently categorize these books by reading-level difficulty.¹ The commercial applications are immediate: with books marked out by textual difficulty, publishers can more easily advertise their books to the correct audiences, and teachers become far better equipped to select texts appropriate for their students' reading level. There is currently great demand for tools that categorize literature by difficulty: the leading software in textual difficulty classification, Lexile Analyzer, boasts over 100,000 users since its release in 1998, with its users having analyzed over 1,000,000 texts in the last two years alone. Lexile Analyzer assigns scores to texts, indicating the grade level for which a certain text is appropriate. For reference, a kindergarten-level text ought to receive a score around 200, and a text meant for 12th graders ought to receive a score around 1200.

¹ Source: Statista, the Statistics Portal. http://www.statista.com/statistics/194700/us-book-production-bysubject-since-2002-juveniles/


Despite its popularity, Lexile Analyzer suffers critical weaknesses that cause it to misjudge the difficulty of various texts. For example, when Lexile Analyzer examined two famous books of American literature, The Grapes of Wrath and Native Son, they received scores of 680 and 700, respectively. According to Lexile Analyzer, this indicates they could be suitable in a fourth or fifth grade classroom, when they are clearly meant for a more advanced audience. Moreover, the software's relative evaluation of many texts is flawed: for example, the children's book Harry Potter and the Sorcerer's Stone earned a score of 880, while Ernest Hemingway's A Farewell to Arms scored only 730.
The primary issue Lexile Analyzer faces stems from its rudimentary methodology for evaluating textual difficulty. While the code for the software is proprietary, the creators of Lexile Analyzer have indicated that a Lexile score depends on word frequency and sentence length. This means that a text using the same words repeatedly is more likely to receive a lower score, and texts with longer sentences are more likely to be given a higher score, regardless of the content of the sentences. With this in mind, the goal of this paper is to investigate what methods can be used to outperform the at-times-inconsistent Lexile Analyzer.
Previous research indicates how machine learning and natural language processing techniques can improve the classification of text by difficulty. However, these efforts still focus primarily on regressions and other classifiers based on pre-defined features of Advanced Literature (depth of parse tree, frequency of particular n-grams, etc.). This work intends to classify literature without setting a specified criterion or list of features to look for; the goal is to classify without any human intuition guiding us as to what features Children's Literature or Advanced Literature may have.
2. Related Work
The use of machine learning techniques to assess textual difficulty is a young but growing field, as scholars search for what features should be considered when classifying textual difficulty. Petersen and Ostendorf investigated the effectiveness of Support Vector Machines, using features of text such as perplexity, sentence length, average sentence length, and average number of syllables per word [1]. Kotani et al. analyze the breakdown of the parse tree of sentences, treating the syntactic structure of the text as the features in order to get a more accurate measure of textual difficulty [2]. The analyses have gone beyond English as well: Francois and Miltsakaki analyzed the power of measuring both "classic" features (sentence length, word frequency) and "non-classic" features (n-grams, latent semantic analysis) and feeding them into a Support Vector Machine classifier [3].
However, scholars have yet to investigate at length the power of Deep Learning for analyzing the difficulty of a text, a surprising fact given that Deep Learning is quickly becoming one of the most highly touted tools in the realm of Natural Language Processing and textual classification. In the paper introducing their algorithm, Mikolov and Le illustrated the power of combining the Paragraph2Vec model with a neural network to predict the sentiment of movie reviews (positive/negative) [4]. Building off of this work, Oliveira and Rodrigo [5] employed the Paragraph2Vec algorithm, combined with recurrent neural networks, to predict the level of humor in a review from Yelp (here, "humorous" is defined as having more than 3 upvotes on Yelp's webpage). The use of recurrent neural networks is extended in Bartle and Zheng [6], who use Windowed RNNs to classify various samples (blog posts, samples from books) by the gender of the author.


3. Data and Methodology

To obtain samples of literature to be classified as Children's Literature or Advanced Literature, we use Project Gutenberg, a website that makes available over 50,000 books in the public domain. Within Project Gutenberg, there are various "bookshelves," sections of the website pertaining to different types of literature. We treat all books found in Project Gutenberg's Children's Literature fiction section as our samples for Children's Literature, and fiction found in other bookshelves as Advanced Literature. From these books, we create a training data set of 12,000 samples: 6,000 of Children's Literature and 6,000 of Advanced Literature. The model is tested on a data set consisting of 1,600 Children's Literature samples and 2,000 Advanced Literature samples.

Each of the samples is approximately 3 sentences long. Samples of this length roughly correspond to the sampling method used by the Lexile Analyzer software, which divides its input text into slices of 125 words before analyzing sentence length and word frequency.
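To make the sampling step concrete, the following is a minimal sketch of how a book could be cut into such samples, assuming NLTK's sentence tokenizer; the paper does not publish its sampling code, so the function and its parameters are illustrative.

```python
# A sketch of the sampling step, assuming NLTK's sentence tokenizer; groups
# of three sentences approximate the ~125-word slices used by Lexile Analyzer.
from nltk.tokenize import sent_tokenize

def make_samples(book_text, sentences_per_sample=3):
    sentences = sent_tokenize(book_text)
    return [" ".join(sentences[i:i + sentences_per_sample])
            for i in range(0, len(sentences), sentences_per_sample)]

samples = make_samples("Once there was a rabbit. He hopped. He was happy. "
                       "Then it rained. The rabbit hid. The sun came out.")
```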
On the training set, we use two different textual representations and analyze their efficacy. The first is the Bag-Of-Words model, combined with various classical machine learning algorithms (Naïve Bayes, Support Vector Machines, and Decision Trees). The second employs the Paragraph Vector representation developed by Mikolov and Le. Each paragraph vector, labeled either "Children's Literature" or "Advanced Literature," is fed into a neural network, which ultimately predicts the labels of the paragraph vectors formed from the testing set.

Given the relative lack of literature around the topic of machine learning for examining textual difficulty, we decide not to benchmark the results at the level of Mikolov and Le, who used the Paragraph2Vec algorithm on their dataset of movie reviews to classify sentiment with 92.58% accuracy. Instead, we start at the lowest of benchmarks, 50%, in order to show that our approach significantly outperforms merely guessing at random.

What makes the approach novel is that, unlike previous work, this analysis of textual difficulty assumes no features inherent to an "easy" or "difficult" text. Petersen and Ostendorf and Francois and Miltsakaki pre-defined features they considered relevant to determining the difficulty of a text; this work makes no such assumptions. We allow the machine to learn for itself what the features of advanced literature may be, the hallmark of a Deep Learning approach.
4. Bag of Words Approach


4.1 Introduction to Bag of Words

The Bag of Words approach is a semantics-free representation of the text, keeping a count of the number of appearances of each word in our vocabulary for each sample in our training set. Since its introduction by Zellig Harris [7], it has proven to be an effective tool in textual classification. For this work, the Bag-Of-Words implementation removes "stop words," which are generally considered of little use as features of the text due to their commonness (words such as "the," "and," "a"). The implementation of Bag-Of-Words also uses a stemming algorithm, to ensure that English words of the same lexeme but different morphological forms (e.g. "run" and "running") are treated as the same vocabulary item.
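As an illustration, the following sketch shows one way such a pipeline could be assembled; the paper does not publish its implementation, so the use of scikit-learn and NLTK, the Naïve Bayes classifier, and the toy data below are all assumptions.

```python
# A minimal Bag-Of-Words sketch with stop-word removal and stemming.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

stemmer = PorterStemmer()
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(doc):
    # Tokenize and drop stop words, then stem so that "run" and "running"
    # count as the same vocabulary item.
    return [stemmer.stem(token) for token in base_analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_analyzer)

# Hypothetical samples: 1 = Children's Literature, 0 = Advanced Literature.
train_texts = ["The little rabbit hopped over the hill.",
               "It was a truth universally acknowledged."]
train_labels = [1, 0]

X_train = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X_train, train_labels)
print(classifier.predict(vectorizer.transform(["The rabbit ran home."])))
```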
4.2 Evaluation
The Bag of Words results initially seemed very promising. While the Decision Tree based classifier fared rather poorly, the Naïve Bayes and SVM classifiers showed significant improvement over the benchmark, as shown in Figure 1. However, further investigation showed that the above results were not true indicators of the model's performance, as indicated in Figure 2.

            Naïve Bayes    SVM      Decision Trees
Accuracy    60.2%          58.8%    53.5%

Figure 1: Bag-Of-Words performance over testing data

                         Naïve Bayes    SVM      Decision Tree
Children's Literature    18.3%          27.7%    6.8%
Advanced Literature      93.2%          83.3%    90.2%

Figure 2: Bag-Of-Words performance on Children's and Advanced Literature testing data
The model appears biased towards predicting that a given sample is advanced text rather than children's text. Because the testing data set contains slightly more examples of advanced literature than children's literature, there are more opportunities for this bias to result in a correct guess than an incorrect one, leading to classification rates well above 50% for the Advanced Literature category.
4.3 Potential Improvements
After seeing the disparate rates of classification, we note that one weakness of this Bag-Of-Words implementation is that it is based on the sheer number of times each word appears in the text. The total number of words in the Advanced Literature training data set outnumbers the number of words in the Children's Literature training data set by approximately 4:1. To illustrate how such a disparity in the training set causes issues in classification, we provided the classifier with common words, such as "know," "love," and "I," to see whether, given those words alone, it would predict a Children's Literature text or an Advanced Literature text. The expected response was in the realm of .5, since given a common word such as "know," "made," or "I," the classifier would have little way of providing a meaningful classification. These words appeared in both Children's Literature and Advanced Literature at approximately equal rates (each at a rate of just above 0.0025). However, given just these words, the Naïve Bayes classifier predicted the label Advanced Literature with a probability of approximately 83%. This indicates a strong bias toward classifying samples as Advanced Literature despite very little data being given.
Future research into this topic that employs Bag-Of-Words models might look into different implementations than the one used in this study. Rather than using raw counts, a better method would be to use Term Frequency-Inverse Document Frequency (TF-IDF) weighting for the Bag-Of-Words, or a Bag-Of-Words representation where each dimension corresponds to the rate of the word's appearance in the document. With all values in the Bag-Of-Words represented as a value between 0 and 1, this method can correct the issue of the Advanced Literature samples containing far higher values in the Bag-Of-Words vector than the Children's Literature samples.
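A minimal sketch of this suggested weighting, assuming scikit-learn; the norm="l1" choice is one possible way to bound each dimension between 0 and 1.

```python
# A sketch of the suggested TF-IDF variant of the Bag-Of-Words model.
from sklearn.feature_extraction.text import TfidfVectorizer

# norm="l1" makes each document vector sum to 1, so each dimension reflects
# the (IDF-weighted) rate of a word's appearance rather than a raw count,
# removing the advantage of the much larger Advanced Literature corpus.
vectorizer = TfidfVectorizer(stop_words="english", norm="l1")
X = vectorizer.fit_transform(["a short children's sample",
                              "a longer, more advanced sample"])
```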
5. Paragraph Vector Approach
5.1 Introduction to Paragraph2Vec algorithm
On top of the issues posed by this particular implementation, the main criticism of the Bag-Of-Words representation of the text is that it is semantics-free. In other words, there is no consideration of the use of the words in context, but merely of the number of times they appear. Moreover, word order is completely lost in the Bag-Of-Words model. Because Bag-Of-Words by design does not consider many different features of the input text, a more sophisticated textual representation seems necessary for improving classification results.

With these shortcomings in mind, we turn to Mikolov and Le's Paragraph2Vec algorithm, the main aim of which is to address the high dimensionality and lack of semantic understanding inherent to the Bag-Of-Words model. Ultimately the goal of the Paragraph2Vec algorithm is to map different documents to vectors, such that similar documents have vectors with high cosine similarity. The Paragraph Vector representation is in large part inspired by a previous framework called Word2Vec, which maps words to vectors such that words with similar semantic meaning have high cosine similarity.
More formally, given a sequence of training words $w_1, w_2, w_3, \ldots, w_T$, the objective of the word vector model is to map each $w_i$ to a vector such that we maximize the average log probability

$$\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k}) \qquad (1)$$

That is, we wish to maximize our ability to predict the following word in a sentence, given the words around it.
The primary difference between the Paragraph2Vec and Word2Vec frameworks is that, in addition to using all of the word vectors across the corpus, the Paragraph2Vec framework uses an additional vector representing the current context, or the topic of the current text itself, as marked in orange in Figure 3 below.

Figure 3: The Paragraph2Vec framework. The Paragraph2Vec representation allows us to take the word vectors and the paragraph vector (here represented by the paragraph ID) to identify the vector of the word most likely to come next in context.

In short, one can use the trained word vectors, which capture the semantics of each word in a text sample, and the paragraph vector, which represents the context in which the words are used, to predict the words following said text sample.

Our implementation of the Paragraph2Vec algorithm comes from the Deep Learning NLP package gensim [8], which contains multiple useful methods for textual classification, as described below.
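To make the workflow concrete, here is a minimal sketch of training paragraph vectors with gensim's Doc2Vec class (its implementation of Paragraph2Vec); the hyperparameters and toy corpus are illustrative assumptions, not the settings used in this work.

```python
# A minimal Doc2Vec training sketch using the gensim v4 API.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

samples = [("once upon a time there was a rabbit", "childrens"),
           ("it was a truth universally acknowledged", "advanced")]
# Tag each sample with its label and an index so its paragraph vector can
# be looked up (and its label recovered) after training.
corpus = [TaggedDocument(words=text.split(), tags=["%s_%d" % (label, i)])
          for i, (text, label) in enumerate(samples)]

model = Doc2Vec(vector_size=100, window=5, min_count=1, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a paragraph vector for an unseen test sample.
test_vector = model.infer_vector("the little rabbit hopped away".split())
```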
5.2 Metrics
5.2.1 N_most_similar

One of the functions within the gensim package is the n_most_similar function. After being given a certain number of text samples (and converting them into paragraph vectors), the function returns the N vectors with the highest cosine similarity to a given paragraph vector. We perform the classification based on the labels of the most similar documents as determined by the model: for a document $d$, we determine the readability score $s_d$ from the $n$ documents with highest cosine similarity to $d$:

$$s_d = \frac{1}{n} \sum_{i=1}^{n} t_i$$

where $t_i$ is 1 if document $i$ has the label Children's Literature and 0 otherwise. A document with $s_d$ greater than .5 is classified as Children's Literature, and a document with $s_d$ below .5 is classified as Advanced Literature.
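A sketch of this decision rule, assuming the Doc2Vec model and label-bearing tags from the sketch in Section 5.1; gensim's most_similar method is used here to play the role of the n_most_similar function.

```python
# A sketch of the readability-score rule above, assuming a trained gensim
# Doc2Vec model whose document tags begin with "childrens" or "advanced".
def readability_score(model, words, n=10):
    vector = model.infer_vector(words)
    neighbours = model.dv.most_similar([vector], topn=n)  # [(tag, cosine)]
    # t_i = 1 when neighbour i carries the Children's Literature label.
    return sum(tag.startswith("childrens") for tag, _ in neighbours) / n

score = readability_score(model, "the rabbit ran home".split(), n=10)
label = "Children's Literature" if score > 0.5 else "Advanced Literature"
```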
           Children's Literature    Advanced Literature
N = 5             74.0%                   94.5%
N = 10            56.6%                   93.7%
N = 100           41.8%                   95.9%
N = 250           28.1%                   95.3%
N = 500           20.5%                   95.5%

Figure 4: Paragraph Vector n_most_similar classification


Running tests with various values of N yielded inconsistent results, as illustrated in Figure 4. With low N, the classification rates are extremely high, far beating those of the Bag-Of-Words model. However, with higher N, the classification rate drops significantly. Moreover, this classification model continues to exhibit the same bias that the Bag-Of-Words model did, classifying the Advanced Literature samples at far higher rates of success than the Children's Literature samples. This suggests that, in general, this classification method finds little topical and semantic similarity overall between samples of children's literature and samples of advanced literature.
5.2.2 Classifying via Neural Networks
The second testing method employing the Paragraph2Vec method is to train a neural network on our data. We feed the training data through a neural network with one hidden layer of 50 units, identical to the type of neural network employed by Mikolov and Le in their previous testing of the Paragraph2Vec algorithm.
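As a sketch of this setup, the following uses scikit-learn's MLPClassifier with a single 50-unit hidden layer over 100-dimensional paragraph vectors; the random stand-in data is purely illustrative, and the original experiments may have used a different framework.

```python
# A sketch of the classifier above: one hidden layer of 50 units fed with
# paragraph vectors, here replaced by random stand-in data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 100))    # stand-in paragraph vectors
y_train = rng.integers(0, 2, size=200)   # 1 = Children's, 0 = Advanced

clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=300)
clf.fit(X_train, y_train)
probabilities = clf.predict_proba(X_train[:5])  # per-class probabilities
```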
For this classification, we use the score function of the gensim package, which returns the log probability of a sentence appearing under a given label. After training our neural network, we use this function to determine the probability with which a sample can be classified under a given label, via the following equation adapted from Taddy [9]: given a sample containing $k$ sentences, where the $i$-th sentence receives scores $a_i$ and $b_i$ corresponding to the log probabilities of its occurrence in class 1 and class 2, the probability of the sample as a whole occurring in class 1 is computed by:

$$p_1 = \frac{1}{k} \sum_{i=1}^{k} \frac{e^{a_i}}{e^{a_i} + e^{b_i}}$$

Since there are two possible classes, $p_2 = 1 - p_1$. In this case, class 1 corresponds to Children's Literature, and class 2 corresponds to Advanced Literature.
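A sketch of this computation, assuming the per-sentence log probabilities $a_i$ and $b_i$ have already been obtained (for example, from gensim's scoring of each sentence under class-specific models); the numeric inputs are hypothetical.

```python
# A sketch of the sample-level probability above. The identity
# e^{a_i} / (e^{a_i} + e^{b_i}) = 1 / (1 + e^{b_i - a_i}) is used for
# numerical stability before averaging over the k sentences.
import numpy as np

def prob_class_1(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(1.0 / (1.0 + np.exp(b - a))))

p1 = prob_class_1([-9.2, -8.7, -9.0], [-9.1, -8.9, -8.8])
p2 = 1.0 - p1  # two classes, so the probabilities sum to one
```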
To perform this classification, we create paragraph vectors from all the training and testing samples in Children's Literature and Advanced Literature, and we then train the neural network on the labeled training vectors. After computing this probability for each sample in the testing set, the results revealed that the classifier was almost completely unable to distinguish between the Children's Literature samples and the Advanced Literature samples. Nearly every one of the 3,600 testing samples was classified as approximately 49% Children's Literature and 51% Advanced Literature, indicating that the neural network, after training, finds very little semantic distinction between the Children's Literature and Advanced Literature material.
Children's Literature    Advanced Literature
       .493                     .507
       .488                     .512
       .484                     .516
       .503                     .497
       .497                     .503
       .491                     .509
       .494                     .506

Figure 5: Paragraph Vector based classification using Neural Networks. The above samples are a snapshot of the results found by using the neural network based classification method on Children's Literature vs. Advanced Literature.
6. Discussion
As neither the Bag-Of-Words representation nor the Paragraph Vector representation yielded strong results, we are prompted to go back and consider how we can improve the experiment protocols. For the Bag-Of-Words model, we have already discussed a change in implementation (i.e., not basing the Bag-Of-Words model purely on counts of each word). This aside, it still comes as rather surprising that the classification rates on the Children's Literature testing data would be so poor.
Surprisingly, despite the usage of more semantics-driven tools, the above classifiers present almost as much ambiguity in the detection of Children's Literature versus Advanced Literature as the semantics-free representations. A hypothesis to investigate in further research is that there is a lack of feature words for both the Children's Literature dataset and the Advanced Literature dataset. Of crucial importance are the implications of Equation (1): both the Word Vector and the Paragraph Vector frameworks are designed to predict the next word in a text sample, given the previous words in a certain context window. These classification frameworks (and the Bag-Of-Words based classification as well) presuppose that there is a significant difference in the lexica of the labels on which the model is predicting. The above experiments suggest that perhaps the difference between the lexica of what we considered Children's Literature and Advanced Literature is not quite as distinct as first thought.
Moreover, with respect to the dataset itself, the distinction between Children's and Advanced Literature isn't as apparent to a casual reader (unlike the difference between positive and negative reviews in sentiment analysis, as with previous applications of the Paragraph2Vec framework). Upon close inspection, many of the Children's Literature samples we obtained from Project Gutenberg were not easily recognizable to the human eye as Children's Literature. There is certainly a noticeable difference between what we call Children's Literature today and the Children's Literature of the 19th century. The data provided to the classifier wasn't as instinctively binary in terms of "easiness" as we might have hoped; framing the problem as a multi-class categorization problem may have better captured more subtle differences between text difficulties, and might have led to more consistent classification results overall.
7. Conclusion
We classify text samples as Children's Literature and Advanced Literature using Bag-Of-Words and Paragraph Vector representations. We discover that a Bag-Of-Words representation based on raw counts of the different words in the lexicon is insufficient, as it biases the classifier toward labeling all unseen texts as Advanced Literature. Using Paragraph Vectors, in conjunction with the gensim package's various classification methods, we find that distinguishing between Children's Literature and Advanced Literature is no straightforward task, as the model could not label text with great certainty (no more than 51% certainty).
References
[1] Sarah E. Petersen and Mari Ostendorf. A Machine Learning Approach to Reading Level Assessment. University of Washington CSE Technical Report, 2006.
[2] Katsunori Kotani, Takehiko Yoshimi, and Hitoshi Isahara. A Machine Learning Approach to Measurement of Text Difficulty for EFL Learners Using Various Linguistic Features. US-China Education Review B, 767-777, 2011.
[3] Thomas Francois and Eleni Miltsakaki. Do NLP and Machine Learning Improve Traditional Readability Formulas? NAACL-HLT 2012 Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49-57, 2012.
[4] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, 2014.
[5] Luke de Oliveira and Alfredo Lainez Rodrigo. Humor Detection in Yelp Reviews, 2015. http://cs224d.stanford.edu/reports/OliveiraLuke.pdf
[6] Aric Bartle and Jim Zheng. Gender Classification with Deep Learning, 2015. http://cs224d.stanford.edu/reports/BartleAric.pdf
[7] Zellig Harris. Distributional Structure. Word, Vol. 10, 146-162, 1954.
[8] Radim Rehurek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45-50, 2010.
[9] Matt Taddy. Document Classification by Inversion of Distributed Language Representations. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 45-49, 2015.
