
MACHINE LEARNING
By G Kailash
Problem 1:
You are hired by one of the leading news channels CNBE who wants to analyze
recent elections. This survey was conducted on 1525 voters with 9 variables.
You have to build a model, to predict which party a voter will vote for on the
basis of the given information, to create an exit poll that will help in predicting
overall win and seats covered by a particular party.

SUMMARY

 The given dataset contains data collected by the leading news channel CNBE, which wants to analyze recent elections; it holds the survey responses of 1525 voters across 9 variables.
 A model needs to be built to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help predict the overall win and the seats covered by a particular party.

1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.
The following process is carried out for building a model:

 Importing the required libraries for the models, such as logistic regression, LDA, boosting, bagging, etc., along with pandas, NumPy and others.
 Reading the Excel file 'Election data'.
 As per the summary, the Excel file contains the survey data of 1525 voters with 9 variables.

Dataset sample

1525 rows × 10 columns

In this dataset the 'Unnamed: 0' column is an unnecessary variable and needs to be dropped for further processing, using the drop function.

 Data set contains 1525 rows with 9 columns (Shape).
no. of rows: 1525
no. of columns: 9
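A minimal sketch of these loading and inspection steps (the file name 'Election_Data.xlsx' is an assumption; everything else follows the report):

import pandas as pd

# Read the survey data (file name assumed)
df = pd.read_excel('Election_Data.xlsx')

# Drop the unnecessary 'Unnamed: 0' index column
df = df.drop('Unnamed: 0', axis=1)

print('no. of rows:', df.shape[0])
print('no. of columns:', df.shape[1])

# Descriptive statistics, data types, null and duplicate checks
print(df.describe(include='all').T)
df.info()
print(df.isnull().sum())
print('Number of duplicate rows =', df.duplicated().sum())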

Inferences from the Data Dictionary

 The dataset contains 7 integer variables and 2 object variables, named 'vote' and 'gender'.
 There are 9 variables, out of which 'vote' is the dependent variable and the rest are independent variables.
 Among the independent variables, except for the 'age' variable which is continuous, all the other seven variables are of categorical type.
 Among the categorical independent variables, except for 'gender', all the others are ordered categorical variables.

CHECKING DUPLICATE AND NULL VALUES:

 Checking for null values in the given dataset, we can clearly see that no null values are present.

 Checking for duplicates in the given dataset, there are 8 duplicated rows present.

Number of duplicate rows = 8

Datasets that contain duplicates may contaminate the training data with the test data, or vice versa. But in this case there are no repeated identifiers: every duplicate listed here differs slightly in its own way. The main reason duplicates appear is that several variables hold scaled data (ratings from 1 to 5) and others, like 'vote', hold only two categories ('Labour' and 'Conservative'), so there are not many distinct value combinations available. We therefore keep the data as it is.
As for the value '0', only one variable has '0' repeated as a category, so it will not change the process path either.
Let us move to the describe step. As already discussed, some variables hold scaled data, which speeds up the analysis.
The 'age' variable is expected to be not skewed, as its mean and median values are nearly the same. The minimum age of a surveyed voter is 24 and the maximum is 93.


Among the voters surveyed, most voted for the 'Labour' Party.
Counts and percentages of the ordered and numerical variables:
Vote
Labour 1063
Conservative 462
Name: vote, dtype: int64

Vote
Labour 0.697049
Conservative 0.302951
Name: vote, dtype: float64

Age
37 42
49 39
35 39
47 38
54 37
..
87 3
92 2
90 1
93 1
91 1
Name: age, Length: 70, dtype: int64

Age
37 0.027541
49 0.025574
35 0.025574
47 0.024918
54 0.024262
...
87 0.001967
92 0.001311
90 0.000656
93 0.000656
91 0.000656
Name: age, Length: 70, dtype: float64

Economic.cond.national
3 607
4 542
2 257
5 82
1 37
Name: economic.cond.national, dtype: int64

Economic.cond.national
3 0.398033
4 0.355410
2 0.168525
5 0.053770
1 0.024262
Name: economic.cond.national, dtype: float64

Economic.cond.household
3 648
4 440
2 280
5 92
1 65
Name: economic.cond.household, dtype: int64

Economic.cond.household
3 0.424918
4 0.288525
2 0.183607
5 0.060328
1 0.042623
Name: economic.cond.household, dtype: float64

Blair
4 836
2 438
5 153
1 97
3 1
Name: Blair, dtype: int64

Blair
4 0.548197
2 0.287213
5 0.100328
1 0.063607
3 0.000656
Name: Blair, dtype: float64

Hague
2 624
4 558
1 233
5 73
3 37
Name: Hague, dtype: int64

Hague
2 0.409180
4 0.365902
1 0.152787
5 0.047869
3 0.024262
Name: Hague, dtype: float64

Europe
11 338
6 209
3 129
4 127
5 124
8 112
9 111
1 109
10 101
7 86
2 79
Name: Europe, dtype: int64

Europe
11 0.221639
6 0.137049
3 0.084590
4 0.083279
5 0.081311
8 0.073443
9 0.072787
1 0.071475
10 0.066230
7 0.056393
2 0.051803
Name: Europe, dtype: float64

Political.knowledge
2 782
0 455
3 250
1 38
Name: political.knowledge, dtype: int64

Political.knowledge
2 0.512787
0 0.298361
3 0.163934
1 0.024918
Name: political.knowledge, dtype: float64

Gender
female 812
male 713
Name: gender, dtype: int64

Gender
female 0.532459
male 0.467541
Name: gender, dtype: float64

 The target variable 'vote' shows that 69.7% of voters are in favour of the 'Labour' Party and 30.3% of the 'Conservative' Party. From the modelling point of view, some class imbalance is observed. We will observe the outputs of the classification models to see this effect and consider whether any treatment is required.

 The percentages of male and female voters are nearly the same, with 53.24% female and 46.75% male. Thus there seems to be equal representation of either gender.
 Most of the voters surveyed assessed the current national economic conditions on an average scale, with around 75% of voters giving a rating of 3 or 4. Only around 8% of surveyed voters gave the extreme ratings of 5 or 1.
 Most of the voters surveyed assessed the current household economic conditions on an average scale, with around 71% of voters giving a rating of 3 or 4. Only around 10% of surveyed voters gave the extreme ratings of 5 or 1.
 The Labour Party leader 'Blair' received ratings of 4 or above from around 64% of the surveyed voters.
 The Conservative Party leader 'Hague' received ratings of 2 or below from around 56% of the surveyed voters.
 Around 22% of the surveyed voters highly disregard closer links between Britain and the European Union, i.e. they gave the maximum rating of 11 on the 'Eurosceptic' sentiment scale.
 Around 67.5% of the surveyed voters have some idea about the Labour and Conservative Parties' positions on European integration, and the rest of the surveyed voters had little or no idea on this front.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers.

 According to IQR rule there are no outliers in 'age' variable.

 As most of the ratings given by the surveyed voters for the 'economic.cond.national' and 'economic.cond.household' variables are 3 or above, the ratings of 1 are considered outliers by the IQR method.
 In the 'age' variable it is observed that voters of all ages are covered by the survey, with nearly equal representation of the different age groups, although the number of older and younger voters is comparatively smaller. There does not seem to be any skewness in the distribution.
 For the 'economic.cond.national' and 'economic.cond.household' variables it is observed that most surveyed voters have given average ratings of 3 or 4.
 The variables 'Blair' and 'Hague' show that the surveyed voters expressed greater appreciation for Blair of the Labour Party and less for Hague.
 Eurosceptic sentiment from low to high is fairly evenly present among the surveyed voters, although a high spike is observed in the number of voters who are highly sceptical of the European integration of Britain.

 The 'political.knowledge' variable shows that most surveyed voters have knowledge about the two parties' positions on European integration, although some are present who have no idea on this front.
 There is not much skewness observed in the variables.

This plot shows the relationship between 'economic.cond.national' and 'vote'.

From this plot, the Labour category dominates over the Conservative one, with more than 65% of the votes.
vote gender
Conservative female 259
male 203
Labour female 553
male 510
Name: gender, dtype: int64

Bivariate/Multivariate Analysis
 The pairplot does not give a proper visual description, as all the variables except 'age' are ordered categorical variables and hence have to be examined in a different way.

Heatmap showing the correlation values between the variables :

 The correlation values show that there is no high correlation.

 There is a minor positive correlation between the national and household economic-condition variables.
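A hedged sketch of how these plots could be produced with seaborn (assuming the data frame 'df' from the earlier loading sketch):

import matplotlib.pyplot as plt
import seaborn as sns

# Pairplot; of limited use here since most variables are ordered categorical
sns.pairplot(df, hue='vote')
plt.show()

# Heatmap of the pairwise correlations between the numeric variables
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()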

Male and female voters of all ages were covered in the survey.
The mean and median ages of voters appear to be higher for extreme views on Eurosceptic sentiment, both in favour and against.

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30).

Now let us set the target variable as 'vote'.

Conversion of 'object' data type variables:


It is a prerequisite for modelling that the object data type independent variables are converted to a numerical data type. All the ordered categorical variables are already integers, and only the 'gender' variable needs to be encoded.
Since it is a nominal variable, dummy encoding would be preferable, but as there are only two categories present, label encoding gives a similar result. Hence label encoding is done.
Using the LabelEncoder, we fit it on the dataset to obtain the encoded values; afterwards the data types of the dataset are as follows.
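A minimal sketch of this encoding step (column names follow the report):

from sklearn.preprocessing import LabelEncoder

# With only two categories per column, label encoding is
# equivalent to dummy encoding here
le = LabelEncoder()
for col in ['vote', 'gender']:
    df[col] = le.fit_transform(df[col])

# Alphabetical fitting gives: 'Conservative' -> 0, 'Labour' -> 1,
# 'female' -> 0, 'male' -> 1
print(df.dtypes)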

The object data types are converted to numerical. Now the data types of all the
variables are in acceptable format for Modeling.
In the target variable 'vote', the category: 'Conservative' is coded as 0 and 'Labour' is
coded as 1. In the variable 'gender', the category: 'female' is coded as 0 and 'male' is coded
as 1.

Scaling:
Scaling is necessary only for the KNN model, and only for the 'age' feature will scaling be done. Since all the other features have values ranging roughly from 1 to 5, scaling is not required for them. The 'age' feature has values ranging from 24 to 93 and a much larger standard deviation, so we will scale 'age' only for the KNN model, which is very sensitive to unscaled data.
As already discussed, we take the target variable 'y' as 'vote', and 'X' is the dataset excluding 'vote'.
Now we split the data into train and test in a 70:30 ratio (70% goes to the training set and 30% to the test set).
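One way the split and the KNN-only scaling of 'age' could look (random_state and stratification are assumptions, not stated in the report):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

y = df['vote']
X = df.drop('vote', axis=1)

# 70:30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Scale only 'age', and keep a separate copy for the KNN model
scaler = StandardScaler()
X_train_knn = X_train.copy()
X_test_knn = X_test.copy()
X_train_knn[['age']] = scaler.fit_transform(X_train[['age']])
X_test_knn[['age']] = scaler.transform(X_test[['age']])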
1.4 Apply Logistic Regression and LDA (linear discriminant analysis)
1. Logistic Regression
The restrictions are mainly in the 'solver' and 'penalty' used; hence different combinations of these are tried, while a fixed set of values is used in the grid search for the other hyperparameters.
The 'newton-cg', 'sag' and 'lbfgs' solvers support only L2 regularization with primal formulation, or no regularization. Elastic-Net regularization is supported only by the 'saga' solver, and 'none' (no regularization) is not supported by the 'liblinear' solver.

Hyperparameters Tuned:
'solver', 'penalty', 'max_iter', 'tol', 'C', 'class_weight'.
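A hedged sketch of this grid search (the report names the tuned hyperparameters; the exact candidate values below are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'penalty': ['l2'],
    'max_iter': [1000],
    'tol': [0.1, 0.01],
    'C': [0.5, 1.0],
    'class_weight': [None, 'balanced'],
}
grid = GridSearchCV(LogisticRegression(random_state=1), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print('Train score:', grid.score(X_train, y_train))
print('Test score:', grid.score(X_test, y_test))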

Combination 1:
penalty: 'l2'
solver: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'.

Model Performance on Train Data set Accuracy/ Score for combination 1 : 84%

Model Performance on Test Data set Accuracy/ Score for combination 1 : 82%
Combination 2:
penalty: 'none'
solver: 'sag'.

Model Performance on Train Data set Accuracy/ Score for combination 2 : 84%

Model Performance on Test Data set Accuracy/ Score for combination 2 : 82%

There is not much difference when tuning the solver alone; both combinations come out mostly equal.

The results for different combinations are as follows :


Combination 1:

The parameters are: 'max_iter': 1000, 'penalty': 'l2', 'solver': 'sag', 'tol': 0.1. The Evaluation Parameters are as follows:
For Training Set Class - Conservative:

 precision: 0.77
 recall: 0.69
 f1-score: 0.73

For Training Set Class - Labour:

 precision: 0.87
 recall: 0.91
 f1-score: 0.89

For Testing Set Class - Conservative:

 precision: 0.70
 recall: 0.65
 f1-score: 0.68

For Testing Set Class - Labour:

 precision: 0.87
 recall: 0.89

 f1-score: 0.88

Combination 2:
The parameters are: 'max_iter': 1000,'penalty': 'none', 'solver': 'sag', 'tol': 0.1. The
Evaluation Parameters are as follows:

For Training Set Class - Conservative:

 precision: 0.76
 recall: 0.68
 f1-score: 0.72

For Training Set Class - Labour:

 precision: 0.86
 recall: 0.90
 f1-score: 0.88

For Testing Set Class - Conservative:

 precision: 0.69
 recall: 0.66
 f1-score: 0.68

For Testing Set Class - Labour:

 precision: 0.87
 recall: 0.88
 f1-score: 0.88

Inferences and Final Logistic Regression Model hyperparameters:


All the combinations give nearly the same results. Since Combination 2 is the simplest model, it is selected as the best model among the Logistic Regression combinations.

2.LDA

Shrinkage LDA can be used by setting the shrinkage parameter of the LinearDiscriminantAnalysis class to 'auto'. Shrinkage is applied only for the solvers 'lsqr' and 'eigen'.
Singular Value Decomposition ('svd') is yet another dimensionality-reduction algorithm.

Hyperparameters Tuned:

'solver', 'shrinkage' and 'tol'.
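A minimal sketch of the two LDA combinations described below (variable names follow the earlier split sketch):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Combination 1: SVD solver (shrinkage is not supported with 'svd')
lda_svd = LinearDiscriminantAnalysis(solver='svd', tol=0.1).fit(X_train, y_train)

# Combination 2: 'lsqr' solver with automatic shrinkage
lda_lsqr = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X_train, y_train)

for name, model in [('svd', lda_svd), ('lsqr + shrinkage', lda_lsqr)]:
    print(name, 'train:', model.score(X_train, y_train),
          'test:', model.score(X_test, y_test))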

Combination 1:
solver: 'svd', tol: 0.1.

Model Performance on Train Data set Accuracy/ Score for combination 1 : 84%
Model Performance on Test Data set Accuracy/ Score for combination 1 : 82%
These values are nearly equal to those of the logistic regression examined previously. GridSearchCV and the direct LDA estimator also show the same results.

Combination 2:
solver: 'lsqr', tol: 0.1, shrinkage: 'auto'.
Model Performance on Train Data set Accuracy/ Score for combination 2 : 84%
Model Performance on Test Data set Accuracy/ Score for combination 2 : 82%

The results for different combinations are as follows :


Combination 1:

The parameters are solver: 'svd', tol: 0.1. The Evaluation Parameters are as follows:
For Training Set Class - Conservative:

 precision: 0.76
 recall: 0.70
 f1-score: 0.73

For Training Set Class - Labour:

 precision: 0.87
 recall: 0.90
 f1-score: 0.88

For Testing Set Class - Conservative:

 precision: 0.69
 recall: 0.66
 f1-score: 0.67

For Testing Set Class - Labour:

 precision: 0.87
 recall: 0.88
 f1-score: 0.87

Combination 2:
The parameters are: solver: 'lsqr', tol: 0.1, shrinkage: 'auto'. The Evaluation Parameters are as follows:

For Training Set Class - Conservative:

 precision: 0.76
 recall: 0.70
 f1-score: 0.73

For Training Set Class - Labour:

 precision: 0.87
 recall: 0.90
 f1-score: 0.88

For Testing Set Class - Conservative:

 precision: 0.69
 recall: 0.66
 f1-score: 0.67

For Testing Set Class - Labour:

 precision: 0.87
 recall: 0.88
 f1-score: 0.87

Inferences and Final LinearDiscriminantAnalysis Model hyperparameters:

All the combinations give nearly the same results. Since Combination 2 is the simplest model, it is selected as the best model among the LinearDiscriminantAnalysis combinations.
Both models are properly fitted; none of them is overfitted or underfitted, and the performance on the train and test datasets does not differ much.

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

1. KNN MODEL
Now we build the KNN model by importing KNeighborsClassifier from sklearn; KNN is one of the best-known methods in machine learning. Following the same process as LDA and logistic regression, we take 'vote' as the target variable, split the data into train and test sets, and fit them to the KNN model.
Combination 1:
n_neighbors= 7 , weights = 'uniform'.
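A minimal sketch of this combination (using the 'age'-scaled copies from section 1.3):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7, weights='uniform')
knn.fit(X_train_knn, y_train)
print('Train score:', knn.score(X_train_knn, y_train))
print('Test score:', knn.score(X_test_knn, y_test))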

Model Performance on Train Data set Accuracy/ Score for combination 1 : 85%

Model Performance on Test Data set Accuracy/ Score for combination 1 : 80%

The train set score is 0.8482 and the test set score is 0.7969, a difference of 0.051. As the difference between the train and test accuracies is 5.1%, which is less than 10% (industry standard), the model can be considered; but among the combinations, this one shows more variation between train and test predictions than the other models.
This is because of the unscaled 'age' data in the dataset, as the KNN model is sensitive to unscaled data.
Let us check the misclassification error, since the results depend on the k value we choose; the plot is as follows.

In the above plot, the error is lowest around a k value of 17. Let us try another combination.
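One way such an error-versus-k plot could be generated (the range of k values is an assumption):

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Misclassification error (1 - accuracy) on the test set for odd k
k_values = range(1, 31, 2)
errors = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    model.fit(X_train_knn, y_train)
    errors.append(1 - model.score(X_test_knn, y_test))

plt.plot(k_values, errors, marker='o')
plt.xlabel('k (n_neighbors)')
plt.ylabel('Misclassification error')
plt.show()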

Combination 2:
n_neighbors= 17 , weights = 'uniform'.

Model Performance on Train Data set Accuracy/ Score for combination 2 : 83%

Model Performance on Test Data set Accuracy/ Score for combination 2 : 82%
The train set score is 0.8294 and the test set score is 0.8165, a difference of 0.012. As the difference between the train and test accuracies is 1.2%, which is also less than 10% (industry standard), we can consider this model too.
Now the train and test scores match up, but the model accuracy is still below 90%. Let us try a different combination by removing 'age', the unscaled variable in the dataset.

Combination 3 (without the 'age' column):

n_neighbors = 17, weights = 'uniform'.

Model Performance on Train Data set Accuracy/ Score for combination 3 : 89%

Model Performance on Test Data set Accuracy/ Score for combination 3 : 88%

The train set score is 0.889 and the test set score is 0.877, a difference of 0.011. As the difference between the train and test accuracies is 1.1%, this is also less than 10% (industry standard).
Of all the models above, this one gives the clearest predictions.

Even though it gives better accuracy, its roc_auc value of 71.94% on the test dataset is lower than that of the other models.

Let us now check the Naïve Bayes model.

2. NAÏVE BAYES
Following the same process as the KNN model, we import the required library, GaussianNB, from sklearn, take 'vote' as the target variable and use it to build a predictive model.
Now we split the data into train and test sets and fit them to the GaussianNB model.
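A minimal sketch of this step (GaussianNB has no hyperparameters tuned in the report):

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
print('Train score:', nb.score(X_train, y_train))
print('Test score:', nb.score(X_test, y_test))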

Model Performance on Train Data set Accuracy/ Score : 83%

Model Performance on Test Data set Accuracy/ Score: 83%
The train data accuracy is 0.8331 and the test data accuracy is 0.8253; the difference between train and test is only about 1%.

The results are as follows.

For Training Set Class - Conservative:
 precision: 0.74
 recall: 0.72
 f1-score: 0.73

For Training Set Class - Labour:

 precision: 0.88
 recall: 0.88
 f1-score: 0.88

For Testing Set Class - Conservative:

 precision: 0.68
 recall: 0.72
 f1-score: 0.70

For Testing Set Class - Labour:

 precision: 0.89
 recall: 0.87
 f1-score: 0.88

From these two models:
 The Naive Bayes model is not overfitted or underfitted.
 The Recall, Precision and Accuracy scores are within a range of ±5%.
 The KNN model is also not overfitted or underfitted; of the three combinations, Combination 2 gives the better output with n_neighbors = 17.

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting

Boosting:

Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training errors. In boosting, a random sample of data is selected, fitted with a model and then trained sequentially; that is, each model tries to compensate for the weaknesses of its predecessor. With each iteration, the weak rules from the individual classifiers are combined to form one strong prediction rule.

In Boosting two models are used:


1. Adaptive Boost Classifier
2. Gradient Boost Classifier

The parameters used in ADA BOOSTING are n_estimators=100, random_state=1, with the rest of the model at its default settings.
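A minimal sketch of this AdaBoost step, under the parameters stated above:

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_train, y_train)
print('Train score:', ada.score(X_train, y_train))
print('Test score:', ada.score(X_test, y_test))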

Model Performance on Train Data set Accuracy/ Score: 85%

Model Performance on Test Data set Accuracy/ Score: 82%
As with the other models, the train and test values above come out close to each other.

The results are as follows.

For Training Set Class - Conservative:
 precision: 0.78
 recall: 0.72
 f1-score: 0.74

For Training Set Class - Labour:

 precision: 0.88
 recall: 0.91
 f1-score: 0.89

For Testing Set Class - Conservative:

 precision: 0.68
 recall: 0.69
 f1-score: 0.68

For Testing Set Class - Labour:

 precision: 0.88
 recall: 0.87
 f1-score: 0.87

Here, the recall for class 1 (Labour) on the train set is above 0.90.

The train data score is 0.8472 and the test data score is 0.8187, a difference of about 3%. So it can also be considered a good model, though it is not a perfect one, and further work on the dataset is needed to improve it.

Now let us take up Gradient Boosting by importing the GradientBoostingClassifier.

The parameter grid searched for GRADIENT BOOSTING is "n_estimators": [5, 50, 250, 500], "max_depth": [1, 3, 5, 7, 9], "learning_rate": [0.01, 0.1, 1, 10, 100].
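A hedged sketch of this grid search with the grid quoted above (cv and random_state are assumptions; this grid is large and slow to run):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [1, 3, 5, 7, 9],
    'learning_rate': [0.01, 0.1, 1, 10, 100],
}
gbc = GridSearchCV(GradientBoostingClassifier(random_state=1), param_grid, cv=5)
gbc.fit(X_train, y_train)
print(gbc.best_params_)
print('Train score:', gbc.score(X_train, y_train))
print('Test score:', gbc.score(X_test, y_test))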

Model Performance on Train Data set Accuracy/ Score: 87%

Model Performance on Test Data set Accuracy/ Score: 84%


As we can see, Gradient Boosting gives better accuracy than any other model once some tuning parameters are used.
The difference in the train and test scores is within a range of ±5%, and it is considered a good model, as its AUC score is very high, around 90%.

BAGGING
Bagging, also known as bootstrap aggregation, is an ensemble learning method commonly used to reduce variance within a noisy dataset. In bagging, a random sample of data in the training set is selected with replacement, meaning that individual data points can be chosen more than once.

Combination 1:

The parameters used in Bagging are base_estimator=RF, n_estimators=100, random_state=1, where RF is a Random Forest classifier.
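A minimal sketch of this bagging setup under the parameters stated above (note that newer scikit-learn versions rename 'base_estimator' to 'estimator'):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

RF = RandomForestClassifier(random_state=1)
bag = BaggingClassifier(base_estimator=RF, n_estimators=100, random_state=1)
bag.fit(X_train, y_train)
print('Train score:', bag.score(X_train, y_train))
print('Test score:', bag.score(X_test, y_test))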

Model Performance on Train Data set Accuracy/ Score for this combination 1: 97%

Model Performance on Test Data set Accuracy/ Score for this combination 1: 84%

The results are as follows.

For Training Set Class - Conservative:
 precision: 0.97
 recall: 0.92
 f1-score: 0.94

For Training Set Class - Labour:

 precision: 0.96
 recall: 0.99
 f1-score: 0.98

For Testing Set Class - Conservative:

 precision: 0.71
 recall: 0.71
 f1-score: 0.71

For Testing Set Class - Labour:

 precision: 0.88
 recall: 0.89
 f1-score: 0.89

In the above Combination 1 the train data shows 97% accuracy, which is more than the other models, while the test data accuracy is only 84%. Even though the recall for class 1 on the train dataset is 0.99 (close to 1), it varies on the test data at 0.89. Overall this model fits the train data well, but further work on the dataset is needed to improve the test results.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.
Final Model: Compare the models and write inference which model is best/optimized

For LR model
For LDA model
For KNN model
For Naïve Bayes
For Boosting
For Bagging

[For each model above, the original report displays the confusion matrices and ROC curves for the train and test sets.]
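A hedged sketch of how these metrics and curves could be produced for every model on the test set (the model variables come from the earlier sketches and are assumptions):

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)

models = {'LR': grid, 'LDA': lda_lsqr, 'KNN': knn,
          'Naive Bayes': nb, 'Boosting': gbc, 'Bagging': bag}
for name, model in models.items():
    X_te = X_test_knn if name == 'KNN' else X_test  # KNN uses scaled 'age'
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]
    print(name, 'accuracy:', accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
    fpr, tpr, _ = roc_curve(y_test, prob)
    plt.plot(fpr, tpr,
             label='%s (AUC = %.2f)' % (name, roc_auc_score(y_test, prob)))

plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()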
From these plots, Boosting gives the closest agreement between test and train. Bagging does show a higher AUC score, but the variation between its train and test results is larger, and we cannot rely on predictions with that kind of lag.
In my view, the Boosting (Gradient Boosting) model is the best/most optimized.

1.8 Based on these predictions, what are the insights?


All the models are affected to some extent by the given data. The data is already skewed: about 70% of the surveyed voters side with Labour and 30% with the Conservatives. For this reason we can say the data is 'unbalanced', and we could balance it using SMOTE.
But there is another problem with SMOTE: it balances the data by creating synthetic, near-duplicate records. That makes it unstable for predicting the poll from the survey, since the predictions concern votes.
Clearly, Labour dominates on almost every parameter, including the economic ratings, which means the majority of people are choosing this party.
Among these models, my preference is Boosting, as it gives better and more consistent values on the train and test data.

Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United
States of America:
Summary:
NLTK provides everything from splitting paragraphs into sentences and splitting words, to identifying parts of speech, highlighting themes, and even helping your machine understand what the text is about.
Let us look at the speeches given by the three different Presidents of the United States of America separately.

1. President Franklin D. Roosevelt in 1941


Introduction:
Let us import the required libraries, such as pandas, NumPy and regular expressions (re). The main library that plays an important role here is NLTK.
From the NLTK library we import the 'inaugural' corpus and download the 1941 inaugural speech of President Roosevelt.

2.1 Find the number of characters, words, and sentences for the mentioned documents.
After importing the text file, we first count the total number of characters in each file separately. Textstat is an easy-to-use library for calculating statistics from text; it helps determine readability, complexity and grade level.
Let us check President Roosevelt's inaugural speech:
 The number of words used in Roosevelt's inaugural speech is 7571 (total words, including stop words).
 The Roosevelt speech contains '58' unique words.
 The number of sentences used in the speech is 68.
 The number of characters used in the speech is 6174.
 The number of words used without stop words is 1360.

Num sentences: 68
Num chars: 6174
Num words: 1360
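A minimal sketch of how these counts could be obtained from the nltk inaugural corpus (exact numbers may differ slightly depending on how characters and tokens are defined):

import nltk
from nltk.corpus import inaugural

nltk.download('inaugural')
nltk.download('punkt')

text = inaugural.raw('1941-Roosevelt.txt')
print('Num chars:', len(text))
print('Num words:', len(inaugural.words('1941-Roosevelt.txt')))
print('Num sentences:', len(inaugural.sents('1941-Roosevelt.txt')))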

Create Data Frame:
Let us create a data frame of the Roosevelt speech for further processing, and import the stop words from the nltk corpus. Before removing the stop words, the word 'the' is used 104 times in the speech, 'of' 81 times, and likewise 'and', 'to', 'in', etc.
the 104
of 81
and 41
to 35
in 30
...
carried 1
undertaken 1
common 1
joined 1
God. 1
Length: 581, dtype: int64

Let us remove the punctuation and special characters from the Roosevelt speech and use a lambda with the 'lower' function to make all the words in the inaugural speech lower case.
Now the text is split into individual words and their value counts taken. The word 'the' is used 114 times in the Roosevelt speech, 'of' 81 times, etc. A sample of the repeated words follows:
[('the', 114),
('of', 81),
('and', 46),
('to', 36),
('in', 35),
('we', 32),
('a', 30),
('it', 28),
('is', 24),
………………………………
('stand', 1),
('service', 1),
('country', 1),
('god', 1)]

2.2 Remove all the stop words from the three speeches. Show the word count before and
after the removal of stop words. Show a sample sentence after the removal of stop
words.
We use the stop-word list from nltk: from nltk.corpus import stopwords.

Remove Stop Words:
We take the 'english' stop words from the nltk library, along with a list of punctuation strings and characters. Then a for-loop removes these stop words from the president's speech. After removing stop words, the Roosevelt speech contains 427 words.
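One way this cleaning loop could look (the exact preprocessing used in the report is assumed; 'text' is the raw speech from the earlier sketch):

import string
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# English stop words plus punctuation characters
stop = set(stopwords.words('english')) | set(string.punctuation)

# Lower-case, strip punctuation from each token, then filter stop words
tokens = [w.strip(string.punctuation) for w in text.lower().split()]
filtered = [w for w in tokens if w and w not in stop]

unique_words = list(dict.fromkeys(filtered))   # keep first-occurrence order
print(len(unique_words), 'total words used')
print(' '.join(unique_words))                  # the 'sample sentence' below
print(Counter(filtered).most_common(3))        # top three words, for 2.3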

427 total words used

And the sample sentences are,

'nation know spirit democracy us life people america years freedom human mind
speaks day states nations men government new body must something faith united task
within history live future free alone still every continent like person world
sacred came first destiny national 1789 sense create together disruption without
come time midst stock may lives little measure doubt measured americans true
republic acted security things present many built maintained constitution freely
american seen cannot enterprise forms hopes find even upon early peoples written
land could forward go enough would seem old words preservation inauguration since
renewed dedication washingtons weld lincolns preserve save institutions swift
happenings pause moment take recall place rediscover risk real peril inaction
determined count lifetime man threescore ten less fullness believe form frame
limited kind mystical artificial fate unexplained reason tyranny slavery become
surging wave ebbing tide eight ago seemed frozen fatalistic terror lien lived
perished daily ways often unnoticed obvious capital processes governing
sovereignties 48 counties cities towns villages hemisphere across seas enslaved
well sometimes fail hear heed voices privilege story proclaimed prophecy spoken
president inaugural almost directed year 1941 fire liberty republican model justly
considered deeply finally staked experiment intrusted hands lose fireif let
smothered fear shall reject washington strove valiantly triumphantly establish
furnish highest justification sacrifice make cause defense face great perils never
encountered strong purpose protect perpetuate integrity muster retreat content
stand service country god'

2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (After removing the stop words)

FreqDist({'nation': 11, 'know': 10, 'spirit': 9, 'democracy': 9, 'us': 8, 'life': 8, 'people': 7, 'america': 7, 'years': 6, 'freedom': 6, ...})

From this series of 427 words, the most frequently used words are 'nation', 'know', 'spirit' and 'democracy'. Here 'nation' is used 11 times, 'know' 10 times, and 'spirit' and 'democracy' 9 times each.

President Roosevelt's famous line may be 'nothing to fear but fear itself', but in this speech the word 'nation' is used most frequently, from which we can gather how much he loved his nation.
2.4 Plot the word cloud of each of the three speeches. (After removing the stop words)

Let us plot the word cloud of President Roosevelt's speech by combining all the frequently used words. Here the 'word cloud' library plays the important role.
PLOT:

INSIGHTS:
As already discussed, the word 'nation' occurs repeatedly, from which we can conclude how much President Roosevelt loved his nation. The word 'know' is also used frequently, as he conveyed the truth of the country's situation in that war period. With 'spirit' he refers to the people of America, who trusted their president greatly. Most importantly he used the word 'democracy', which comes from the Greek words 'demos', meaning people, and 'kratos', meaning power; democracy can thus be thought of as 'power of the people', a way of governing that depends on the will of the people. He was urging his people to exercise that will, which could change the world.

2. President John F. Kennedy in 1961
Introduction:
Let us import the required libraries, such as pandas, NumPy and regular expressions (re). The main library that plays an important role here is NLTK.
From the NLTK library we import the 'inaugural' corpus and download the 1961 inaugural speech of President John F. Kennedy.

2.1 Find the number of characters, words, and sentences for the mentioned documents.
After importing the text file, we first count the total number of characters in each file separately. Textstat is an easy-to-use library for calculating statistics from text; it helps determine readability, complexity and grade level.
Let us check President John F. Kennedy's inaugural speech:
 The number of words used in John F. Kennedy's inaugural speech is 7618 (total words, including stop words).
 The John F. Kennedy speech contains '57' unique words.
 The number of sentences used in the speech is 54.
 The number of characters used in the speech is 6202.
 The number of words used without stop words is 1390.

Num sentences: 54
Num chars: 6202
Num words: 1390

Create Data Frame:

Let us create a data frame of the John F. Kennedy speech for further processing, and import the stop words from the nltk corpus. Before removing the stop words, the word 'the' is used 83 times in the speech, 'of' 65 times, and 'to' and 'and' 37 times each, etc.
the 83
of 65
to 37
and 37
a 29
..
it, 1
doing 1
Communists 1
required 1
own. 1
Length: 620, dtype: int64

Let us remove the punctuation and special characters from the John F. Kennedy speech and use a lambda with the 'lower' function to make all the words in the inaugural speech lower case.
Now the text is split into individual words and their value counts taken. The word 'the' is used 86 times in the John F. Kennedy speech, 'of' 65 times, etc. A sample of the repeated words follows:
[('the', 86),
('of', 65),
('to', 42),
('and', 41),
('we', 30),
('a', 29),
('in', 26),
('our', 21),
('that', 20),
('not', 19),
('for', 16),
('let', 16),
………………………………
('knowing', 1),
('here', 1),
('gods', 1),
('work', 1),
('must', 1)]

2.2 Remove all the stop words from the three speeches. Show the word count before and
after the removal of stop words. Show a sample sentence after the removal of stop
words.
We use the stop-word list from nltk: from nltk.corpus import stopwords.

Remove Stop Words:

We take the 'english' stop words from the nltk library, along with a list of punctuation strings and characters. Then a for-loop removes these stop words from the president's speech. After removing stop words, the President John F. Kennedy speech contains 456 words.

456 total words used

And the sample sentences are,

let us world sides new pledge citizens power shall free nations ask president
fellow freedom man first americans war peace always cannot hope help arms country
call today well human poverty life globe dare go generation know bear control may
good join begin never final vice mr god forebears century hands forms yet around
rights hand revolution word forth time friend foe passed nation committed every
whether burden meet oppose assure success loyalty united little powerful states
welcome merely far tyranny find supporting back best seek south offer deeds
alliance powers instruments weak finally would anew science weakness beyond doubt
course balance negotiate fear explore problems unite instead bring absolute
together disease earth endeavor finished days though long struggle year enemies
history light truly america johnson speaker chief justice eisenhower nixon truman
reverend clergy observe victory party celebration symbolizing end beginning
signifying renewal change sworn almighty solemn oath l prescribed nearly

2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (After removing the stop words)
FreqDist({'let': 16, 'us': 12, 'world': 8, 'sides': 8, 'new': 7, 'pledge': 7,
'citizens': 5, 'power': 5, 'shall': 5, 'free': 5, ...})

From this series of 456 words, the most frequently used words are 'let', 'us', 'world' and 'sides'. Here 'let' is used 16 times, 'us' 12 times, 'world' and 'sides' 8 times each, and 'new' 7 times.

President John F. Kennedy's famous line is 'Let us never negotiate out of fear. But let us never fear to negotiate.' Consistent with this, the words 'let' and 'us' are used frequently, as his speech was an appeal for domestic and international cooperation to tackle universal humanitarian issues while promoting democratic ideals; the word 'us' also reflects that he always regarded the citizens as equals.

2.4 Plot the word cloud of each of the three speeches. (After removing the stop words)

Let us plot the word cloud of President John F. Kennedy's speech by combining all the frequently used words. Here the 'word cloud' library plays the important role.

PLOT:

INSIGHTS:
Now the plot is created. From this plot we can see that 'let' and 'us' stand out strongly; as already discussed, President John F. Kennedy's speech was an appeal for domestic and international cooperation to tackle universal humanitarian issues while promoting democratic ideals.
Even in that time of cold war, he cared about the people and always wanted to work as a team to overcome fear, which is why he used the word 'us' rather than 'I' or 'you'.
The word 'world' shows that he wanted to face new challenges together with his nation, through new things and new plans, hence the word 'new'.

3. President Richard Nixon in 1973

Introduction:
Let us import the required libraries, such as pandas, NumPy and regular expressions (re). The main library that plays an important role here is NLTK.
From the NLTK library we import the 'inaugural' corpus and download the 1973 inaugural speech of President Richard Nixon.
2.1 Find the number of characters, words, and sentences for the mentioned documents.
After importing the text file, we first count the total number of characters in each file separately. Textstat is an easy-to-use library for calculating statistics from text; it helps determine readability, complexity and grade level.
Let us check President Richard Nixon's inaugural speech:
 The number of words used in Richard Nixon's inaugural speech is 9991 (total words, including stop words).
 The Richard Nixon speech contains '61' unique words.
 The number of sentences used in the speech is 70.
 The number of characters used in the speech is 8122.
 The number of words used without stop words is 1819.

Num sentences: 70
Num chars: 8122
Num words: 1819

Creating Data Frame:

Let us create a data frame of the Richard Nixon speech for further processing, and import the stop words from the nltk corpus. Before removing the stop words, the word 'the' is used 80 times in the speech, 'of' 68 times, 'to' 65 times, 'in' 54 times, and so on.
the 80
of 68
to 65
in 54
and 47
..
Vice 1
also, 1
gladly, 1
engage; 1
purpose. 1
Length: 610, dtype: int64

Let us remove the punctuation and special characters from the Richard Nixon speech and use a lambda with the 'lower' function to make all the words in the inaugural speech lower case.
Now the text is split into individual words and their value counts taken. The word 'the' is used 83 times in the Richard Nixon speech, 'of' 68 times, etc. A sample of the repeated words follows:
the words repeated as follows,

[('the', 83),
('of', 68),
('to', 65),
('in', 58),
('and', 50),
('we', 47),
('a', 35),
('that', 33),
('our', 32),
('for', 32),
…………………………………..
('go', 1),
('sustained', 1),
('created', 1),
('striving', 1),
('always', 1),
('serve', 1),
('purpose', 1)]

2.2 Remove all the stop words from the three speeches. Show the word count before and
after the removal of stop words. Show a sample sentence after the removal of stop
words.
We use the stop-word list from nltk: from nltk.corpus import stopwords.

Remove Stop Words:

We take the 'english' stop words from the nltk library, along with a list of punctuation strings and characters. Then a for-loop removes these stop words from the president's speech. After removing stop words, the President Nixon speech contains 419 words.

419 total words used

And the sample sentences are,


'us let peace world new america responsibility government great home abroad nation
americas together years shall policies role make every history better time nations
right people help four today era responsibilities progress come respect others act
one promise long work freedom old proud faith mr country share war resolve retreat
greatly century bold end another future forward build structure live system gladly
challenges away way individual ask ashamed think spirit conflict meet stand use
enter leads danger renew past year initiatives toward merely wars generations
important understand unless preserve nature force continue arms also indispensable
preserving worlds made place differences strong chance meeting remain ever american
needs reach building turning failed shift lived washington turn best human must
learn vital play pledge boldly 200th level record confident god dreams may hope

vice president speaker chief justice senator cook mrs eisenhower fellow citizens
good met ago bleak depressed prospect seemingly endless destructive threshold
central question postwar periods often isolation stagnation invites become borne
third saw farreaching results continuing revita

2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (After removing the stop words)
FreqDist({'us': 26, 'let': 22, 'peace': 19, 'world': 16, 'new': 15, 'america': 13,
'responsibility': 11, 'government': 10, 'great': 9, 'home': 9, ...})

From this series of 419 words, the most frequently used words are 'us', 'let', 'peace', 'world' and so on. Here 'us' is used 26 times, 'let' 22 times, 'peace' 19 times, 'world' 16 times, 'new' 15 times, and so on.
In President Richard Nixon's speech too, the word 'us' is repeated most, as with the other presidents, for conveying information to the people.

2.4 Plot the word cloud of each of the three speeches. (After removing the stop words)
Let us plot the word cloud of President Richard Nixon's speech by combining all the frequently used words. Here the 'word cloud' library plays the important role.

PLOT:

INSIGHTS:

Now the plot is created. From this plot we can see that, like President John F. Kennedy, President Richard Nixon also used the word 'us' the most; it is the most highlighted in the plot. He then used the words 'let' and 'peace', from which we can clearly see that President Nixon did not want war, even though it was the time of the Vietnam War. He also used the word 'world' often, suggesting his wish for America to lead the world.
