
Problem 2:

In this project, we are going to work on the inaugural corpus from NLTK in Python. We will be looking at the following speeches of Presidents of the United States of America:

1. President Franklin D. Roosevelt in 1941

2. President John F. Kennedy in 1961

3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents. – 3 Marks

Import Libraries.

import nltk
import pandas as pd   # pandas is used below to hold the speeches and their counts

nltk.download('inaugural')
from nltk.corpus import inaugural

# List the available inaugural speeches and load the three we need.
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!

# Wrap each raw speech in a one-row DataFrame so counts can be added as columns.
y0 = pd.DataFrame({'Text': inaugural.raw('1961-Kennedy.txt')}, index=[0])
y1 = pd.DataFrame({'Text': inaugural.raw('1941-Roosevelt.txt')}, index=[0])
y2 = pd.DataFrame({'Text': inaugural.raw('1973-Nixon.txt')}, index=[0])
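The code that actually computes the counts is not shown in the extract; a minimal sketch, assuming pandas is imported as pd and the punkt tokenizer is available (the char_count / word_count / sent_count column names are illustrative):

from nltk.tokenize import word_tokenize, sent_tokenize

# Add character, word and sentence counts as new columns on each speech DataFrame.
for df in (y0, y1, y2):
    df['char_count'] = df['Text'].apply(len)
    df['word_count'] = df['Text'].apply(lambda t: len(word_tokenize(t)))
    df['sent_count'] = df['Text'].apply(lambda t: len(sent_tokenize(t)))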

Output: a table per speech with the columns Text, word count, char count, and sent count.
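The frequency list below is not produced by the code shown above; one way such counts could be obtained (an assumption, since the author's exact step is not shown) is to count every token in the whole inaugural corpus:

from collections import Counter

# Token frequencies over the entire inaugural corpus (punctuation included).
corpus_counts = Counter(inaugural.words())
corpus_counts.most_common(10)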

[('the', 9446),

('of', 7087),

(',', 7045),

('and', 5146),

('.', 4856),

('to', 4414),

('in', 2561),

('a', 2184),

('our', 2021),

('that', 1748)]

The top ten most common tokens used in the inaugural speeches, before removing stop words.

2.2 Remove all the stop words from all three speeches. – 3 Marks

We can filter out the stop words with the help of NLTK's built-in English stop word list.
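The list printed below is NLTK's English stop word list; a minimal sketch of loading it into the stop_test variable used by the later code:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')               # already downloaded per the log above
stop_test = stopwords.words('english')   # standard English stop word list
print(stop_test)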

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
 "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll",
 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from',
 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than',
 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
 "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn',
 "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't",
 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

from nltk.tokenize import word_tokenize

# Tokenize the Roosevelt speech and drop every token that appears in the stop word list.
text = inaugural.raw('1941-Roosevelt.txt')
text_tokens = word_tokenize(y1['Text'][0])
tokens_without_sw = [word for word in text_tokens if word not in stop_test]
print(tokens_without_sw)
We need to tokenize all three speeches so that the stop words can be removed and the special characters, sentences, and words can be separated out of each speech.

# Join the remaining tokens back into a single filtered string.
filtered_sentence = (" ").join(tokens_without_sw)
print(filtered_sentence)

We need to filter all the speeches in the same way to get each one into a proper form; a sketch applying the same filtering step to all three speeches follows.
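A minimal sketch of applying the tokenize-and-filter step to all three speeches, assuming the stop_test list and the y0/y1/y2 DataFrames defined earlier (the remove_stopwords helper name and the lower-casing of tokens are illustrative choices):

from nltk.tokenize import word_tokenize

def remove_stopwords(text):
    # Tokenize the speech and keep only tokens not in the stop word list
    # (lower-casing also catches capitalised forms such as 'The' and 'It').
    return [word for word in word_tokenize(text) if word.lower() not in stop_test]

kennedy_tokens = remove_stopwords(y0['Text'][0])
roosevelt_tokens = remove_stopwords(y1['Text'][0])
nixon_tokens = remove_stopwords(y2['Text'][0])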

2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords) – 3 Marks

from collections import Counter

# Note: filtered_sentence was built from the Roosevelt speech only, so all three
# counters below end up counting the same set of tokens.
Roosevelt_split = filtered_sentence.split()  # y1['Text'][0].split()
Roosevelt_counter = Counter(Roosevelt_split)

Kennedy_split = filtered_sentence.split()    # y0['Text'][0].split()
Kennedy_counter = Counter(Kennedy_split)

Nixon_split = filtered_sentence.split()      # y2['Text'][0].split()
Nixon_counter = Counter(Nixon_split)


Roosevelt_most_occur = Roosevelt_counter.most_common(10)
print("Most common word of Roosevelt speech ", Roosevelt_most_occur)
Roosevelt_freq = pd.DataFrame(Roosevelt_most_occur, columns=['Roosevelt_Frequent_words', 'Roosevelt_total_words'])
Roosevelt_freq

Kennedy_most_occur = Kennedy_counter.most_common(10)
print("Most common word of Kennedy speech ", Kennedy_most_occur)
Kennedy_freq = pd.DataFrame(Kennedy_most_occur, columns=['Kennedy_Frequent_words', 'Kennedy_total_words'])
Kennedy_freq

Nixon_most_occur = Nixon_counter.most_common(10)
print("Most common word of Nixon speech ", Nixon_most_occur)
Nixon_freq = pd.DataFrame(Nixon_most_occur, columns=['Nixon_Frequent_words', 'Nixon_total_words'])
Nixon_freq

  Nixon_Frequent_words  Nixon_total_words
0                    ,                 77
1                    .                 68
2                   --                 25
3                   It                 13
4                  The                 10
5                 know                 10
6                   We                 10
7               spirit                  9
8                 life                  9
9                   us                  8

The most common words used by the three Presidents during their speeches.

Most common word of Roosevelt speech  [(',', 77), ('.', 68), ('--', 25), ('It', 13), ('The', 10), ('know', 10), ('We', 10), ('spirit', 9), ('life', 9), ('us', 8)]

Most common word of Kennedy speech  [(',', 77), ('.', 68), ('--', 25), ('It', 13), ('The', 10), ('know', 10), ('We', 10), ('spirit', 9), ('life', 9), ('us', 8)]

Most common word of Nixon speech  [(',', 77), ('.', 68), ('--', 25), ('It', 13), ('The', 10), ('know', 10), ('We', 10), ('spirit', 9), ('life', 9), ('us', 8)]
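Because all three counters above were built from the same filtered_sentence, the three result lists are identical. A minimal sketch for pulling the top three words per president, assuming the per-speech roosevelt_tokens / kennedy_tokens / nixon_tokens lists from the earlier sketch and skipping punctuation tokens:

from collections import Counter

def top_words(tokens, n=3):
    # Keep only alphabetic tokens so ',', '.' and '--' do not dominate the counts.
    return Counter(word for word in tokens if word.isalpha()).most_common(n)

print("Roosevelt:", top_words(roosevelt_tokens))
print("Kennedy:  ", top_words(kennedy_tokens))
print("Nixon:    ", top_words(nixon_tokens))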

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) – 3 Marks

from wordcloud import WordCloud, STOPWORDS

# Join the Kennedy speech into one string, dropping any token that contains a newline.
words = ' '.join(y0['Text'])
cleaned_word = " ".join([word for word in words.split()
                         if '\n' not in word])

With the help of the WordCloud function, we can see the most used words by each of the three Presidents during their speeches. We need to change the variable from y0 to y1 and y2 to produce the word cloud for each of the other speeches.
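A minimal plotting sketch, assuming matplotlib is available and cleaned_word holds the prepared text from the step above (the figure size and colour settings are illustrative):

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Build the word cloud from the cleaned speech, using wordcloud's own stop word set.
wc = WordCloud(stopwords=STOPWORDS, background_color='white',
               width=800, height=400).generate(cleaned_word)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()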
