A Corpus Analysis of Children’s Literature

Yiwen Ou

Penn State University

Introduction

As an important point of access to the literate world, children’s literature has a considerable impact on children. Knowles and Malmkjaer (2003) pointed out that children’s literature has received little serious linguistic analysis despite its widely acknowledged influence on the development and socialization of young people. There is still little research on children’s literature, especially quantitative research. This study therefore investigates the linguistic features of children’s literature with quantitative tools and discusses how those features could inform language teaching for children, especially the selection and design of materials.

To obtain comparatively comprehensive results, I investigated the lexical, syntactic and semantic features of this genre. The tools used were AntConc, TreeTagger, the Lexical Complexity Analyzer, the Syntactic Complexity Analyzer and the USAS online tagger; only part of the results are presented in this report.

Two corpora, one of children’s literature and one of adults’ literature, were collected and analyzed, with the adults’ corpus serving as a comparison group. Researchers of children’s literature tend to divide it into children’s books and children’s stories, and Knowles and Malmkjaer (2003) even analyzed fantasy fiction separately. As a result, I set up two groups within the children’s literature corpus, which contains 18 texts in total. The first group consists of nine excerpts from famous and popular children’s books (A Wrinkle in Time, Alice in Wonderland, Anne of Green Gables, Charlotte’s Web, Harry Potter, Percy Jackson, Narnia, The Wonderful Wizard of Oz, Winnie the Pooh). All of these books except Alice in Wonderland were published in or after 1900, and most of them after 1950. The language of Alice in Wonderland may be more complicated because of its earlier publication date; however, the book created many classic plots and characters that children still love today, so I kept it in the corpus. The second group consists of nine famous and popular children’s stories. I collected these texts from a children’s stories website, so they have been edited to be easy for children to read. For the comparison corpus of adults’ literature, I chose 18 famous short stories published after 1900. Each text contains roughly 1,800 to 2,300 words.

Because the language of children’s books and children’s stories differs clearly in lexical and syntactic complexity, I analyzed the two groups both separately and as a whole, and compared the data with the adults’ group. For the semantic field analysis, I analyzed only the children’s literature groups, because the semantic features of the adults’ group are not part of my research goals.

The rest of the paper is organized as follows. I first present and interpret the results of the lexical and syntactic analyses. I then discuss the findings of the semantic analysis and identify some popular themes in children’s literature. Finally, an implications section considers how the results apply to language teaching for children.
Results and Analysis

1. Lexical complexity

Table1

Group ls1 ls2 vs1 ttr lv vv1 nv adjv advv

Cbooks 0.381 0.339 0.129 0.263 0.442 0.422 0.440 0.076 0.037

Cstories 0.309 0.230 0.053 0.194 0.309 0.313 0.298 0.059 0.021

Children 0.347 0.288 0.093 0.231 0.379 0.371 0.373 0.068 0.029

Adults 0.406 0.417 0.217 0.301 0.546 0.570 0.532 0.110 0.034

Codes: ls1 (lexical sophistication 1), ls2 (lexical sophistication 2), vs1 (verb sophistication 1), ttr (type-token ratio), lv (lexical word variation), vv1 (verb variation 1), nv (noun variation), adjv (adjective variation), advv (adverbial variation).
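
Of the measures listed above, the type-token ratio is the simplest to illustrate: it divides the number of distinct word forms (types) by the number of running words (tokens). The Python sketch below is only an illustration under a simple assumption about tokenization (lowercased alphabetic strings, with an optional internal apostrophe); the function name `type_token_ratio` is hypothetical, and the Lexical Complexity Analyzer may tokenize and lemmatize differently.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return the number of distinct word forms divided by the number of running words."""
    # Assumption: a token is a lowercased run of letters, optionally with one
    # internal apostrophe (e.g. "don't"). This is not the analyzer's own tokenizer.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Made-up sentence, used only to show the calculation.
sample = "The cat sat on the mat and the dog sat too."
print(round(type_token_ratio(sample), 3))
```

Because the type-token ratio tends to fall as a text gets longer, it is most meaningful when the texts being compared are of similar length, as they are in this corpus (roughly 1,800 to 2,300 words each).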

Table1 shows the results for lexical complexity; the four rows present the mean value of each measure across all the texts in the given group. In general, adults’ literature uses the most complex words, children’s books come second, and children’s stories use the easiest words. For most measures, the differences between children’s books and children’s stories are larger than the differences between children’s books and adults’ literature. One reason is that children’s books and children’s stories are not aimed at children of the same age: the readers of children’s stories tend to be younger, while the books may suit older children. Another reason might be that some children’s books are not written only for children, so their authors did not want them to be too “childish”, whereas adults are unlikely to read short fairy tales on a children’s stories website.

The table also indicates that children’s literature uses considerably easier verbs than adults’ literature (vs1). The lexical variation values show that adults’ literature draws on a wider range of nouns, adjectives and verbs. However, the adverbial variation values do not differ much among the Cbooks, Cstories and Adults groups, which suggests that children’s literature also uses a wide range of adverbials. The reason may be that adverbials make the characters’ actions more dynamic.
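
The group comparisons drawn from Table1 can be checked with a few lines of Python. The sketch below is illustrative only: it copies the group means for Cbooks, Cstories and Adults from Table1 and reports, for each measure, the absolute gap between children’s books and children’s stories next to the gap between children’s books and adults’ literature. The helper name `gaps` is hypothetical, and because the inputs are group means the output says nothing about statistical significance.

```python
measures = ["ls1", "ls2", "vs1", "ttr", "lv", "vv1", "nv", "adjv", "advv"]

# Group means copied from Table1.
table1 = {
    "Cbooks":   [0.381, 0.339, 0.129, 0.263, 0.442, 0.422, 0.440, 0.076, 0.037],
    "Cstories": [0.309, 0.230, 0.053, 0.194, 0.309, 0.313, 0.298, 0.059, 0.021],
    "Adults":   [0.406, 0.417, 0.217, 0.301, 0.546, 0.570, 0.532, 0.110, 0.034],
}


def gaps(group_a: str, group_b: str) -> dict:
    """Absolute difference between two groups' means, per measure."""
    return {
        m: round(abs(a - b), 3)
        for m, a, b in zip(measures, table1[group_a], table1[group_b])
    }


books_vs_stories = gaps("Cbooks", "Cstories")
books_vs_adults = gaps("Cbooks", "Adults")

# Measures on which the two children's groups differ more from each other
# than children's books differ from adults' literature.
larger_within_children = [
    m for m in measures if books_vs_stories[m] > books_vs_adults[m]
]

for m in measures:
    print(f"{m:5s} books-stories: {books_vs_stories[m]:.3f}  "
          f"books-adults: {books_vs_adults[m]:.3f}")
print("Larger gap between books and stories:", larger_within_children)
```

The same dictionary can be extended with the Children row if the pooled children’s group is to be compared with the adults’ group instead.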

2. Syntactic complexity

Table2

Group MLS MLT MLC C/S C/T DC/T CT/T

Cbooks 13.70 11.10 7.18 1.914 1.550 0.454 0.364

Cstories 8.34 8.12 6.38 1.299 1.269 0.261 0.216

Children 10.89 9.53 6.78 1.586 1.397 0.347 0.282

Adults 19.25 14.31 9.35 2.065 1.528 0.479 0.355

Codes: MLS (mean length of sentence), MLT (mean length of T-unit), MLC (mean length of clause), C/S (sentence complexity ratio), C/T (T-unit complexity ratio), DC/T (dependent clauses per T-unit), CT/T (complex T-unit ratio).
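
Each of the seven indices above is a ratio of two counts: the three length measures divide the number of words by the number of sentences, T-units or clauses, and the remaining four divide one structure count by another. The sketch below shows the arithmetic in Python. The counts are invented round numbers chosen for illustration, not values from the corpus, and the names `Counts` and `syntactic_indices` are hypothetical; in this study the counts themselves were produced by the Syntactic Complexity Analyzer.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Raw structure counts for one text (hypothetical container)."""
    words: int
    sentences: int
    t_units: int
    clauses: int
    dependent_clauses: int
    complex_t_units: int


def syntactic_indices(c: Counts) -> dict:
    """Compute the seven indices reported in Table2 from raw counts."""
    return {
        "MLS": c.words / c.sentences,            # mean length of sentence
        "MLT": c.words / c.t_units,              # mean length of T-unit
        "MLC": c.words / c.clauses,              # mean length of clause
        "C/S": c.clauses / c.sentences,          # clauses per sentence
        "C/T": c.clauses / c.t_units,            # clauses per T-unit
        "DC/T": c.dependent_clauses / c.t_units, # dependent clauses per T-unit
        "CT/T": c.complex_t_units / c.t_units,   # complex T-units per T-unit
    }


# Invented counts for a single 2,000-word text.
example = Counts(words=2000, sentences=100, t_units=125,
                 clauses=200, dependent_clauses=50, complex_t_units=40)
indices = syntactic_indices(example)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```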

Table2 shows the results for syntactic complexity. The values of MLS, MLT and MLC indicate that adults’ literature has longer sentences, T-units and clauses. The MLS of children’s stories is far lower than that of the other two groups. As for MLC, the value for children’s books resembles that for children’s stories, which means that children’s literature tends to have shorter clauses than adults’ literature. As for the sentence complexity ratio (C/S), the value for children’s books is a little lower than that for adults’ literature, and both are much higher than the value for children’s stories. C/T, DC/T and CT/T are all measures of subordination. The results show that the degree of subordination in children’s books and adults’ literature is similar, whereas children’s stories contain fewer complex T-units and dependent clauses.

3. Semantic analysis

In this part, I analyze the semantic features of the two types of children’s literature in order to identify typical themes of the genre. I did not analyze the features text by text, but I used the Concordance Plot tool to make sure that the semantic fields discussed here occur in most or all of the texts and that the hits are not concentrated in a few texts. I first tagged all the texts with the USAS online tagger and then imported the children’s books and the children’s stories separately into AntConc to analyze the semantic fields.
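
The counting step described above can also be sketched in Python as a rough cross-check on the concordance hits. The sketch assumes, purely for illustration, that the tagged text is a sequence of whitespace-separated `word_TAG` tokens and that polarity is marked by `+` or `-` directly after the field code (for example `cried_E4.1-`). The actual output of the USAS online tagger may be laid out differently, so the pattern would need adjusting. The function name `polarity_counts` and the sample sentence, including its tag assignments, are hypothetical.

```python
import re
from collections import Counter


def polarity_counts(tagged_text: str, field: str):
    """Count hits for one semantic field, split by polarity marker.

    Assumes tokens of the form word_TAG; tokens tagged with any other
    field are ignored.
    """
    pattern = re.compile(r"(?P<word>.+)_" + re.escape(field) + r"(?P<sign>[+-]*)")
    totals = Counter()
    words = {"+": Counter(), "-": Counter(), "unmarked": Counter()}
    for token in tagged_text.split():
        match = pattern.fullmatch(token)
        if match is None:
            continue
        # Use only the first marker, so "++" counts as positive.
        polarity = match.group("sign")[:1] or "unmarked"
        totals[polarity] += 1
        words[polarity][match.group("word").lower()] += 1
    return totals, words


# Invented tagged sentence in the assumed word_TAG format.
sample = ("She_Z8 cried_E4.1- then_N4 laughed_E4.1+ and_Z5 "
          "felt_X2.1 glad_E4.2+ but_Z5 sad_E4.1- again_N6")
totals, words = polarity_counts(sample, "E4.1")
print(dict(totals))
print(words["-"].most_common())
```

Running the function once per text, rather than on the whole group, would give the same kind of dispersion check that the Concordance Plot provides.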

(1) E4.1 (Happy/sad: Happy) & E4.2 (Happy/Sad: Contentment)

① E4.1 (Cbooks)

In AntConc, I found 79 hits in the children’s books texts: 26 are “+” (positive) and 53 are “-” (negative). The table below shows the most frequent words in this category. (For the remaining tables, I use a simpler format to present the totals: total: (total number), +: (number), -: (number).)


+                                 frequency   -                             frequency
Happy                             4           Cry (cried)                   13
Grin (grinning, grins, grinned)   5           Sob (sobs, sobbing, sobbed)   5
Laugh                             3           Sad (sadly)                   4
Cheer (cheerful)                  3           Unhappy (unhappiness)         3

② E4.2 (Cbooks)

Total: 29. +: 27. -: 2.

+                                  frequency   -                                frequency
Glad                               10          Disappointed (disappointment)    2
Content (contented, contentedly)   4
Please (pleased)                   5
Proud (proudly)                    2

① E4.1 (Cstories)

Total: 80. +: 45. -: 35.

+                            frequency   -                     frequency
Laugh (laughed, laughing)    18          Sad (sadly)           7
Happy (happiness, happily)   13          Cry (cried, crying)   18
Joy                          5           Grief                 2

② E4.2 (Cstories)

Total: 9. +: 7. -: 2.

These results show that the world of children’s literature is not filled only with happiness and laughter; on the contrary, there may be more sad moments. Although there are many sorrowful plots, the results for E4.2 show a low frequency of disappointment. Reading the results together with the content of these texts, we can see that the characters often experience sadness and difficulties, but they get through them and most reach a happy ending. If they hold on to their dreams and love their families and friends, life will not be disappointing. Therefore, “hope” is an important theme in children’s literature.


(2) E5: Fear/Bravery

Cbooks: total: 50 +: 12 -: 38
+                         frequency   -                         frequency
Courage                   7           Afraid                    10
Brave (bravely, braver)   4           Fear (fearsome, feared)   5
                                      Terror (terrified)        4

Cstories: total:19 +:5 -: 14


+              frequency   -                 frequency
Dare (dared)   3           Fear              11
Brave          2           Shock (shocked)   2

The results reflect a theme of “bravery” in children’s literature. Many works of children’s literature tell adventure stories in which the characters encounter difficulties and struggles. They may feel afraid at first, but in the end they grow brave enough to overcome the hardship. This kind of literature can teach students that life has difficulties but that they can overcome them if they are brave. At the same time, plots in which characters encounter something a little frightening tend to be exciting and engaging for children.

(3) E6: Worry, concern, confident

Cbooks: total: 32 +: 5 -: 28

+            frequency   -                     frequency
Confidence   1           Worry (worried)       9
                         Anxious (anxiously)   5

Cstories: total: 26 +:0 -:26


+            frequency   -                     frequency
                         Worry                 7
                         Cares (care, cared)   9

The results show that the authors build in a good deal of anxiety and worry, which can arouse readers’ curiosity about how events turn out.

(4) K6: Children’s games and toys

Cbooks: 1 (old dolls house, A Wrinkle in Time)

Cstories: 14 (13 puppets, Pinocchio; 1 toys, The Snow Queen)

Children’s games and toys rarely appear in children’s literature; children seem to be more interested in fictional things such as fairies, talking animals and flying houses, which they cannot see in real life.

(5) S9: Religion and the supernatural

Cbooks: total: 52 (witch, wizard, magic, dragon, goddess)

Cstories: total: 129 (wizard, witch, magic, fairy, godmother, Merlin)

These values indicate that fantasy and supernatural power are dominant themes in

children’s literature.

4. Animals in children’s literature

Implications

Several principles for designing and selecting understandable and engaging materials for children follow from the data analysis. In terms of vocabulary, stories with easier verbs and nouns are preferable for children, and teachers can use a variety of adverbials to make the stories more dynamic. In terms of syntax, shorter sentences and clauses suit children better. Materials for older children can use complex T-units and dependent clauses freely, whereas reading materials for younger children should use fewer and simpler subordinate structures.

As for plot, children show a preference for stories with themes of “bravery” and “hope”, and a touch of fear and anxiety can help to engage readers.
Reference list

Knowles, M., & Malmkjaer, K. (1996). Language and Control in Children's Literature. London: Routledge.
