You are on page 1of 1

Ling 190, Spring 2020

Assignment 8
Distributed Monday, March 30
Due Monday, April 6

Charles Dickens wrote 15 novels between 1836 and 1870. From Files: Week 9 on the web-
site, download the file charles dickens novels.txt, which contains Project Gutenberg
URLs for text files of the novels along with their titles and years.

1. Write a Python program to compile a complete corpus of Dickens’ novels. Your pro-
gram should read in the URLs from the file and then access the content of each URL
using a request object. Unlike in the last homework, your program should now strip
off the Project Gutenberg-generated header and footer from each text as it reads it
in. (You should, however, leave in any other material in the original books, includ-
ing prefaces, tables of contents, chapter headers, etc.) To do so, find a (more or less)
consistent cue to the start and end of the text across Gutenberg files, and keep the
slice of the text between those two delimiters. I say “more or less” because the cues
might be characterized by regex even if they’re not exactly the same in every file.
Do not attempt to download and strip the files by hand, as it would be quite tedious
(and infeasible for larger corpora). In the combined corpus of all novels, what is the
total number of word tokens (alphabetical strings only, excluding punctuation and
digits)?

2. What are the ten most frequent words in Dickens’ combined novels? (Exclude punc-
tuation and digits, and lowercase all words, but do not run a stemmer.)

3. Is there a significant correlation between the lexical diversity of each novel and its
year? For example, do novels tend to become more or less diverse over Dickens’ ca-
reer? Before computing lexical diversity, exclude punctuation and digits, lowercase
all words, and apply the Porter stemmer. Have your program write a tab-delimited
file for R that contains the year of each book, its title, and its lexical diversity. In R,
compute Pearson’s r for lexical diversity against year and make a scatterplot with
year on the x-axis, lexical diversity on the y-axis, and points labeled by the title of
each book.

4. What is the mean number of words per sentence in the combined corpus? Sentences
are delimited by periods, exclamation points, and question marks (in any combina-
tion; e.g. “!?” and “...” delimit sentences too).

5. Is sentence length (in words) distributed log-normally? Have your program write
a tab-delimited file containing the frequency of each sentence length. In reckoning
sentence length, exclude punctuation but keep digits. In R, generate a histogram of
log sentence length along with the corresponding Q-Q plot (see Assignment 2 if you
don’t recall how to do this). How close is the distribution to normal?

You might also like