
Problem 2:

In this project, we are going to work on the inaugural corpus from NLTK in Python. We will be looking at the following speeches of Presidents of the United States of America:

1. President Franklin D. Roosevelt in 1941

2. President John F. Kennedy in 1961

3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents. – 3 Marks

Import Libraries.

import nltk
import pandas as pd   # pandas is used below to hold the speeches and their counts

nltk.download('inaugural')
from nltk.corpus import inaugural

# List the available inaugural speeches and load the three we need.
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!

# Wrap each raw speech in a one-row DataFrame so counts can be added as columns.
y0 = pd.DataFrame({'Text': inaugural.raw('1961-Kennedy.txt')}, index=[0])
y1 = pd.DataFrame({'Text': inaugural.raw('1941-Roosevelt.txt')}, index=[0])
y2 = pd.DataFrame({'Text': inaugural.raw('1973-Nixon.txt')}, index=[0])
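The code that actually computes the counts is not shown in the extract; a minimal sketch, assuming pandas is imported as pd and the punkt tokenizer is available (the char_count / word_count / sent_count column names are illustrative):

from nltk.tokenize import word_tokenize, sent_tokenize

# Add character, word and sentence counts as new columns on each speech DataFrame.
for df in (y0, y1, y2):
    df['char_count'] = df['Text'].apply(len)
    df['word_count'] = df['Text'].apply(lambda t: len(word_tokenize(t)))
    df['sent_count'] = df['Text'].apply(lambda t: len(sent_tokenize(t)))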

Output: a table per speech with the columns Text, word count, char count, and sent count.
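The frequency list below is not produced by the code shown above; one way such counts could be obtained (an assumption, since the author's exact step is not shown) is to count every token in the whole inaugural corpus:

from collections import Counter

# Token frequencies over the entire inaugural corpus (punctuation included).
corpus_counts = Counter(inaugural.words())
corpus_counts.most_common(10)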

[('the', 9446),

('of', 7087),

(',', 7045),

('and', 5146),

('.', 4856),

('to', 4414),

('in', 2561),

('a', 2184),

('our', 2021),

('that', 1748)]

The top ten most common tokens used in the inaugural speeches, before removing stop words.

2.2 Remove all the stop words from all three speeches. – 3 Marks

We can filter out the stop words with the help of NLTK's built-in English stop word list.
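The list printed below is NLTK's English stop word list; a minimal sketch of loading it into the stop_test variable used by the later code:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')               # already downloaded per the log above
stop_test = stopwords.words('english')   # standard English stop word list
print(stop_test)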

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
 "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll",
 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from',
 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than',
 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
 "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn',
 "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't",
 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

from nltk.tokenize import word_tokenize

# Tokenize the Roosevelt speech and drop every token that appears in the stop word list.
text = inaugural.raw('1941-Roosevelt.txt')
text_tokens = word_tokenize(y1['Text'][0])
tokens_without_sw = [word for word in text_tokens if word not in stop_test]
print(tokens_without_sw)
We need to tokenize all three speeches so that the stop words can be removed and the special characters, sentences, and words can be separated out of each speech.

# Join the remaining tokens back into a single filtered string.
filtered_sentence = (" ").join(tokens_without_sw)
print(filtered_sentence)

We need to filter all the speeches in the same way to get each one into a proper form; a sketch applying the same filtering step to all three speeches follows.
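A minimal sketch of applying the tokenize-and-filter step to all three speeches, assuming the stop_test list and the y0/y1/y2 DataFrames defined earlier (the remove_stopwords helper name and the lower-casing of tokens are illustrative choices):

from nltk.tokenize import word_tokenize

def remove_stopwords(text):
    # Tokenize the speech and keep only tokens not in the stop word list
    # (lower-casing also catches capitalised forms such as 'The' and 'It').
    return [word for word in word_tokenize(text) if word.lower() not in stop_test]

kennedy_tokens = remove_stopwords(y0['Text'][0])
roosevelt_tokens = remove_stopwords(y1['Text'][0])
nixon_tokens = remove_stopwords(y2['Text'][0])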

2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords) – 3 Marks

from collections import Counter

# Note: filtered_sentence was built from the Roosevelt speech only, so all three
# counters below end up counting the same set of tokens.
Roosevelt_split = filtered_sentence.split()  # y1['Text'][0].split()
Roosevelt_counter = Counter(Roosevelt_split)

Kennedy_split = filtered_sentence.split()    # y0['Text'][0].split()
Kennedy_counter = Counter(Kennedy_split)

Nixon_split = filtered_sentence.split()      # y2['Text'][0].split()
Nixon_counter = Counter(Nixon_split)


Roosevelt_most_occur = Roosevelt_counter.most_common(10)
print("Most common word of Roosevelt speech ", Roosevelt_most_occur)
Roosevelt_freq = pd.DataFrame(Roosevelt_most_occur, columns=['Roosevelt_Frequent_words', 'Roosevelt_total_words'])
Roosevelt_freq

Kennedy_most_occur = Kennedy_counter.most_common(10)
print("Most common word of Kennedy speech ", Kennedy_most_occur)
Kennedy_freq = pd.DataFrame(Kennedy_most_occur, columns=['Kennedy_Frequent_words', 'Kennedy_total_words'])
Kennedy_freq

Nixon_most_occur = Nixon_counter.most_common(10)
print("Most common word of Nixon speech ", Nixon_most_occur)
Nixon_freq = pd.DataFrame(Nixon_most_occur, columns=['Nixon_Frequent_words', 'Nixon_total_words'])
Nixon_freq

  Nixon_Frequent_words  Nixon_total_words
0                    ,                 77
1                    .                 68
2                   --                 25
3                   It                 13
4                  The                 10
5                 know                 10
6                   We                 10
7               spirit                  9
8                 life                  9
9                   us                  8

The most common words used by the three Presidents during their speeches.

Most common word of Roosevelt speech  [(',', 77), ('.', 68), ('--', 25), ('It', 13), ('The', 10), ('know', 10), ('We', 10), ('spirit', 9), ('life', 9), ('us', 8)]

Most common word of Kennedy speech  [(',', 77), ('.', 68), ('--', 25), ('It', 13), ('The', 10), ('know', 10), ('We', 10), ('spirit', 9), ('life', 9), ('us', 8)]

Most common word of Nixon speech  [(',', 77), ('.', 68), ('--', 25), ('It', 13), ('The', 10), ('know', 10), ('We', 10), ('spirit', 9), ('life', 9), ('us', 8)]
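Because all three counters above were built from the same filtered_sentence, the three result lists are identical. A minimal sketch for pulling the top three words per president, assuming the per-speech roosevelt_tokens / kennedy_tokens / nixon_tokens lists from the earlier sketch and skipping punctuation tokens:

from collections import Counter

def top_words(tokens, n=3):
    # Keep only alphabetic tokens so ',', '.' and '--' do not dominate the counts.
    return Counter(word for word in tokens if word.isalpha()).most_common(n)

print("Roosevelt:", top_words(roosevelt_tokens))
print("Kennedy:  ", top_words(kennedy_tokens))
print("Nixon:    ", top_words(nixon_tokens))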

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) – 3 Marks

from wordcloud import WordCloud, STOPWORDS

# Join the Kennedy speech into one string, dropping any token that contains a newline.
words = ' '.join(y0['Text'])
cleaned_word = " ".join([word for word in words.split()
                         if '\n' not in word])

With the help of the WordCloud function, we can see the most used words by each of the three Presidents during their speeches. We need to change the variable from y0 to y1 and y2 to produce the word cloud for each of the other speeches.
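A minimal plotting sketch, assuming matplotlib is available and cleaned_word holds the prepared text from the step above (the figure size and colour settings are illustrative):

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Build the word cloud from the cleaned speech, using wordcloud's own stop word set.
wc = WordCloud(stopwords=STOPWORDS, background_color='white',
               width=800, height=400).generate(cleaned_word)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()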
