
8/17/23, 9:24 PM Exp4_TSA - Jupyter Notebook

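1. Import NLTK and the Gutenberg corpus reader

In [20]: import nltk
from nltk.corpus import gutenberg

2. Download the Gutenberg corpus and the Punkt tokenizer models

In [21]: nltk.download('gutenberg')
# Punkt models are used by word_tokenize and sent_tokenize below
nltk.download('punkt')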

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data] Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!

Out[21]: True

3. Display all text associated with Gutenberg corpus

In [12]: gutenberg_files = gutenberg.fileids()
# Concatenate the raw text of every file in the corpus into one string
gutenberg_text = ''
for file_id in gutenberg_files:
    gutenberg_text += gutenberg.raw(file_id)
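
The cell above builds the combined string but never prints it, and the whole corpus is too large to display usefully. Below is a minimal sketch of one way to join the files and show a bounded preview. The names `combine_corpus`, `preview`, and the stand-in `sample_files` dictionary are hypothetical and used only for illustration, so the snippet runs without downloading anything.

```python
def combine_corpus(file_ids, read_raw):
    """Join the raw text of every file ID into a single string.

    read_raw is any callable that maps a file ID to its text
    (in the notebook this role is played by gutenberg.raw).
    """
    return "".join(read_raw(file_id) for file_id in file_ids)


def preview(text, limit=60):
    """Return at most `limit` characters, marking truncation with '...'."""
    return text if len(text) <= limit else text[:limit] + "..."


# Stand-in data (hypothetical), not the real corpus contents.
sample_files = {
    "a.txt": "First file. ",
    "b.txt": "Second file.",
}

combined = combine_corpus(sample_files.keys(), sample_files.get)
print(len(combined), "characters")
print(preview(combined, limit=20))
```

In the notebook, the equivalent call would be `combine_corpus(gutenberg.fileids(), gutenberg.raw)`. Using `"".join(...)` avoids building a new intermediate string on every loop iteration.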

4. Extract any one file from Gutenberg corpus

In [13]: selected_file_id = 'shakespeare-hamlet.txt'
selected_text = gutenberg.raw(selected_file_id)

5. Find the number of characters in the selected corpus

In [14]: num_characters = len(selected_text)
print("Number of characters:", num_characters)

Number of characters: 162881
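
Note that `len` counts every character in the string, including spaces and line breaks. As a rough sketch of how the two kinds of character can be separated, the snippet below uses a short excerpt from the opening of the play as a stand-in for `selected_text`. The variable names are illustrative only.

```python
# Short excerpt from the opening of the play (stand-in for selected_text).
excerpt = "Actus Primus. Scoena Prima.\n\nEnter Barnardo a"

# len() counts everything, including spaces and newlines.
total_chars = len(excerpt)

# Count only characters that are not whitespace.
non_space_chars = sum(1 for ch in excerpt if not ch.isspace())

print("All characters:", total_chars)
print("Non-whitespace characters:", non_space_chars)
```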

6. Display the first 100 characters from the selected corpus

In [15]: first_100_characters = selected_text[:100]
print("First 100 characters:", first_100_characters)

First 100 characters: [The Tragedie of Hamlet by William Shakespeare 1599]

Actus Primus. Scoena Prima.

Enter Barnardo a

7. Find the number of words in the selected corpus

In [16]: words = nltk.word_tokenize(selected_text)
num_words = len(words)
print("Number of words:", num_words)

Number of words: 36372

8. Display the first 50 words from the selected corpus

In [17]: first_50_words = ' '.join(words[:50])
print("First 50 words:", first_50_words)

First 50 words: [ The Tragedie of Hamlet by William Shakespeare 1599 ] Actus Primus . Scoena Prima . Enter Barnardo and Francisco two Centinels . Barnardo . Who 's there ? Fran . Nay answer me : Stand & vnfold your selfe Bar . Long liue the King Fran . Barnardo ?
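
The output of step 8 shows that the tokenizer treats punctuation marks, brackets, and the clitic `'s` as separate tokens, so the "word" count in step 7 includes them. The sketch below is one possible way to count only alphabetic tokens. It reuses the 50 tokens printed above as sample data, and `alpha_tokens` is an illustrative name that does not appear in the original notebook.

```python
# The 50 tokens printed in step 8, split back on spaces.
first_50_words = (
    "[ The Tragedie of Hamlet by William Shakespeare 1599 ] Actus Primus . "
    "Scoena Prima . Enter Barnardo and Francisco two Centinels . Barnardo . "
    "Who 's there ? Fran . Nay answer me : Stand & vnfold your selfe Bar . "
    "Long liue the King Fran . Barnardo ?"
)
tokens = first_50_words.split()

# Keep only purely alphabetic tokens; this drops punctuation, "1599" and "'s".
alpha_tokens = [tok for tok in tokens if tok.isalpha()]

print("All tokens:", len(tokens))
print("Alphabetic tokens:", len(alpha_tokens))
```

Applied to the full `words` list, the same filter would give a count closer to what a reader usually means by "number of words".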

9. Find the total number of sentences in the selected corpus

In [18]: sentences = nltk.sent_tokenize(selected_text)
num_sentences = len(sentences)
print("Total number of sentences:", num_sentences)

Total number of sentences: 2355
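
With the three counts in hand, simple averages give a quick feel for the text. This is a small sketch that plugs in the numbers reported by the cells above. The ratio names are illustrative, and because the token count includes punctuation and the character count includes whitespace, both figures are approximations.

```python
# Counts reported by the cells above for shakespeare-hamlet.txt.
num_characters = 162881   # len(selected_text), whitespace included
num_words = 36372         # tokens from word_tokenize, punctuation included
num_sentences = 2355      # sentences from sent_tokenize

avg_tokens_per_sentence = num_words / num_sentences
avg_chars_per_token = num_characters / num_words

print(f"Average tokens per sentence: {avg_tokens_per_sentence:.2f}")
print(f"Average characters per token: {avg_chars_per_token:.2f}")
```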

10. Display the first 5 sentences from the selected corpus

In [19]: first_5_sentences = '\n'.join(sentences[:5])
print("First 5 sentences:\n", first_5_sentences)

First 5 sentences:
[The Tragedie of Hamlet by William Shakespeare 1599]

Actus Primus.
Scoena Prima.
Enter Barnardo and Francisco two Centinels.
Barnardo.
Who's there?
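
The first "sentence" in this output spans several lines: the sentence tokenizer did not split at the line break after the title, so the title and "Actus Primus." form one sentence with embedded newlines. One possible way to display each sentence on a single line is to collapse its internal whitespace, sketched here on that first sentence as it is printed above (a stand-in for `sentences[0]`).

```python
# First "sentence" as printed above: it still contains the line breaks
# between the title and "Actus Primus." (stand-in for sentences[0]).
first_sentence = "[The Tragedie of Hamlet by William Shakespeare 1599]\n\nActus Primus."

# str.split() with no arguments splits on any run of whitespace,
# so joining with single spaces flattens the sentence onto one line.
flattened = " ".join(first_sentence.split())
print(flattened)
```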
