PROJECT
Statement of Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to build a
model to predict which party a voter will vote for on the basis of the given information, in order
to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.
2. age: in years
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores represent 'Eurosceptic' sentiment.
9. gender: female or male.
index  Unnamed: 0  vote    age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender
0      1           Labour  43   3                       3                        4      1      2       2                    female
1      2           Labour  36   4                       4                        4      4      5       2                    male
2      3           Labour  35   4                       4                        5      2      3       2                    male
3      4           Labour  24   4                       2                        2      1      4       0                    female
4      5           Labour  41   2                       2                        1      1      6       2                    male
Table: First five rows of the dataset
column name              count  mean     std      min  25%  50%  75%   max
Unnamed: 0               1525   763      440.374  1    382  763  1144  1525
vote                     1525   0.69705  0.45969  0    0    1    1     1
age                      1525   54.1823  15.7112  24   41   53   67    93
economic.cond.national   1525   3.2459   0.88097  1    3    3    4     5
economic.cond.household  1525   3.14033  0.92995  1    3    3    4     5
Blair                    1525   3.33443  1.17482  1    2    4    4     5
Hague                    1525   2.74689  1.2307   1    2    2    4     5
Europe                   1525   6.72853  3.29754  1    4    6    10    11
political.knowledge      1525   1.5423   1.08332  0    0    2    2     3
gender_male              1525   0.46754  0.49911  0    0    0    1     1
Observations:
• There are 1525 rows and 10 columns in the dataset.
• It has 2 categorical variables and 8 numerical variables.
• There are no missing values in the dataset.
• There are 8 duplicated rows in the dataset.
• The 'Unnamed: 0' column has to be dropped, as it is only a serial number.
• From the five-point summary of the age column, the minimum age is 24 and the maximum age is 93, so no one below the age of 24 was surveyed.
• Female voters are more frequent than male voters.
• The mean values are slightly higher than the median values for some variables, which indicates that the data may be slightly skewed.
• From the target variable we can see that about 70% of voters chose the Labour party (a sketch of these checks follows below).
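The checks behind these observations can be reproduced with a short pandas sketch; the file name Election_Data.xlsx is an assumption, and the exact notebook code may differ:

import pandas as pd

# Assumed file name; adjust to the actual path of the survey data.
df = pd.read_excel("Election_Data.xlsx")

print(df.shape)                                  # (1525, 10)
df.info()                                        # data types and non-null counts
print(df.isnull().sum())                         # missing values per column
print(df.duplicated().sum())                     # number of duplicated rows
df = df.drop(columns=["Unnamed: 0"])             # drop the serial-number column
print(df.describe().T)                           # five-point summary of numeric columns
print(df["vote"].value_counts(normalize=True))   # Labour vs Conservative share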
1.2 Perform EDA (Check the null values, Data types, shape, Univariate, bivariate analysis).
Also check for outliers (4 pts). Interpret the inferences for each (3 pts) Distribution
plots(histogram) or similar plots for the continuous columns. Box plots, Correlation plots.
Appropriate plots for categorical variables. Inferences on each plot. Outliers proportion
should be discussed, and inferences from above used plots should be there. There is no
restriction on how the learner wishes to implement this but the code should be able to
represent the correct output and inferences should be logical and correct.
• From the above plot, the target variable's categories seem to be unbalanced.
• The Labour category has 1063 values while the Conservative category has 462 values.
• Since the two classes are in roughly a 70:30 ratio, we can consider this an acceptable spread rather than an imbalance, so we do not have to do any resampling.
Table: Crosstab of vote against gender (female, male, All)
• There are no outliers in the Blair, Hague, Europe and political.knowledge variables.
• Both the Blair and Hague features have symmetrical box plots, but the Europe feature is left skewed and the political.knowledge feature is right skewed.
• Since the political.knowledge feature has 0 as a genuine value, the left whisker is not visible because the lower quartile is equal to the minimum value.
• In the Blair feature the value 4 has the most data points, and in the Hague feature the value 3 has the most data points.
• In the Europe feature the value 11 has the most data points. Europe is an 11-point scale that measures respondents' attitudes toward European integration, with high scores representing 'Eurosceptic' sentiment, so most of the voters hold a 'Eurosceptic' sentiment.
• In the political.knowledge feature the value 2 occurs most often, but the second most frequent value is 0, which indicates that many voters do not have knowledge of the parties' positions on European integration.
• From the above heat map we can see that there is not much correlation between the features.
• Europe and age have a slight positive correlation.
• Most of the features have a weak negative correlation with each other (a plotting sketch follows this list).
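A minimal sketch of the plots these inferences are based on, assuming seaborn and matplotlib and continuing from the loading sketch above (column names follow the dataset):

import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = ["age", "economic.cond.national", "economic.cond.household",
                "Blair", "Hague", "Europe", "political.knowledge"]

# Univariate distribution plot and box plot for each numeric column.
for col in numeric_cols:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], ax=axes[0])
    sns.boxplot(x=df[col], ax=axes[1])
    plt.show()

# Count plot for the target variable and correlation heat map.
sns.countplot(x="vote", data=df)
plt.show()
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm")
plt.show()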
1.3 Encode the data (having string values) for Modelling. Is scaling necessary here or
not? (2 pts), Data Split: Split the data into train and test (70:30) (2 pts). The learner is
expected to check and comment about the difference in scale of different features on
the basis of an appropriate measure, for example std dev, variance, etc. Should justify
whether there is a necessity for scaling. Object data should be converted into
categorical/numerical data to fit in the models. (pd.Categorical().codes,
pd.get_dummies(drop_first=True)) Data split, ratio defined for the split, train-test split
should be discussed.
Encoding:
• One-hot encoding is done on the only 'object' type variable, i.e. 'gender'.
• A new column, gender_male, is created, with 1 indicating male and 0 indicating female.
• For the target variable, categorical label encoding is used (Labour = 1, Conservative = 0), as we are going to use it for classification (a code sketch of this step follows the table below).
• This is how the encoded data looks:
index  vote  age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender_male
1      1     43   3                       3                        4      1      2       2                    0
2      1     36   4                       4                        4      4      5       2                    1
3      1     35   4                       4                        5      2      3       2                    1
4      1     24   4                       2                        2      1      4       0                    0
5      1     41   2                       2                        1      1      6       2                    1
Table: Encoded dataset
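A minimal sketch of the encoding step described above, assuming the dataframe from the earlier sketch; the exact notebook code may differ:

import pandas as pd

# One-hot encode 'gender'; drop_first keeps a single dummy column, gender_male.
df = pd.get_dummies(df, columns=["gender"], drop_first=True)

# Label-encode the target: category codes are alphabetical, so Conservative -> 0 and Labour -> 1.
df["vote"] = pd.Categorical(df["vote"]).codes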
• After encoding, we have split the dataset into train and test sets in the given 70:30 ratio with random_state = 1.
• Since the dependent variable contains 1s and 0s, we need to ensure that the two classes are split into the training and testing datasets in the same proportion.
• This keeps the data balanced and avoids bias while training or testing the model; therefore the stratify argument is set to the target while splitting (see the sketch below).
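A minimal sketch of the split, assuming X and y are the encoded features and target:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["vote"])
y = df["vote"]

# 70:30 split with random_state=1; stratify keeps the roughly 70:30 class mix in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)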
Table: Sample train dataset of independent variables (rows 534, 709, 1145, 1082 and 958 of X_train)
Index  age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender_male
275    71   2                       3                        4      2      11      0                    0
768    31   2                       2                        2      4      5       2                    1
417    35   4                       3                        2      1      7       2                    1
1034   34   4                       4                        4      2      7       0                    0
508    40   3                       4                        4      2      7       3                    1
Table: Sample test dataset of independent variables
Index vote
534 1
709 1
1145 0
1082 1
958 1
Table : Sample train dataset of dependent variable
Index vote
275 1
768 0
417 1
1034 1
508 1
Table: Sample test dataset of dependent variable
Scaling:
• Scaling is a necessity for distance-based models such as KNN, and it also helps gradient-based solvers such as Logistic Regression converge.
• Algorithms like Linear Discriminant Analysis and Naive Bayes do not require feature scaling.
• So we can use the unscaled data for the Naive Bayes and LDA algorithms, and the scaled dataset for algorithms like KNN.
• In this dataset, some of the feature values are in different units and ranges.
• For example, 'age' is a continuous variable with a wide range of data points between 24 and 93; it has a mean of 54 and a standard deviation of 15.7, which is in a totally different range from the other variables, which are ordinal.
• Most of the ordinal variables take values from 1 to 5, so their mean is around 3 and their standard deviation is around 1, which is very different from the 'age' variable.
• This difference in scale might result in inaccuracies in model predictions.
• Therefore, in order for machine learning models to interpret these features on the same scale, we need to perform feature scaling.
• We can use the MinMaxScaler function, as we have both continuous and ordinal variables in our dataset (a minimal sketch follows this list).
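A minimal scaling sketch using MinMaxScaler; note that fitting the scaler on the training data only is standard practice, though it is not stated above:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Fit the scaler on the training data only and reuse it on the test data,
# so no information from the test set leaks into the scaling parameters.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)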
1.4) Apply Logistic Regression and LDA (Linear Discriminant Analysis) (2 pts). Interpret
the inferences of both models (2 pts). Successful implementation of each model. Logical
reason behind the selection of different values for the parameters involved in each model.
Calculate Train and Test Accuracies for each model. Comment on the validness of models
(over fitting or under fitting)
Model Building
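Both models in this section are fitted with default settings; the following minimal sketch (max_iter raised only so the solver converges) illustrates the step, though it is not the exact notebook code:

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

# Logistic Regression and LDA on the unscaled training data.
lr = LogisticRegression(max_iter=10000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

for name, model in [("Logistic Regression", lr), ("LDA", lda)]:
    print(name, "train accuracy:", model.score(X_train, y_train))
    print(name, "test accuracy:", model.score(X_test, y_test))
    print(classification_report(y_test, model.predict(X_test)))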
Logistic Regression model
Train accuracy: 0.8417 (classification report omitted)
Test accuracy: 0.8289 (classification report omitted)
Inference
• The model is neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is not much.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
LDA model
Train accuracy: 0.8370 (classification report omitted)
Test accuracy: 0.8333 (classification report omitted)
Inference
• The LDA model looks slightly better than the Logistic Regression model, since the accuracy values of the train and test data are nearly equal.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
1.5) Apply KNN Model and Naïve Bayes Model (2pts). Interpret the inferences of each model
(2 pts). Successful implementation of each model. Logical reason behind the selection of
different values for the parameters involved in each model. Calculate Train and Test
Accuracies for each model. Comment on the validness of models (over fitting or under fitting)
KNN Model
As the KNN model is a distance-based algorithm, we have to use the scaled dataset. After data
preprocessing and scaling, the KNN model is applied to the train and test datasets with default
hyper-parameters, so the n_neighbors value used here is 5 (a minimal sketch follows this paragraph).
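A minimal sketch of the two models in this section, with KNN on the scaled data and Gaussian Naive Bayes on the unscaled data (variable names follow the earlier sketches):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# KNN is distance based, so it is fitted on the scaled data; n_neighbors=5 is the default.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)
print("KNN train accuracy:", knn.score(X_train_scaled, y_train))
print("KNN test accuracy:", knn.score(X_test_scaled, y_test))

# Gaussian Naive Bayes does not require scaling, so the unscaled data is used.
nb = GaussianNB().fit(X_train, y_train)
print("Naive Bayes train accuracy:", nb.score(X_train, y_train))
print("Naive Bayes test accuracy:", nb.score(X_test, y_test))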
Default KNN model (n_neighbors = 5)
Train accuracy: 0.8737 (classification report omitted)
Test accuracy: 0.8158 (classification report omitted)
New KNN model
Test accuracy: 0.8553 (classification report omitted)
Inference
• In the new KNN model the test accuracy has increased and the train accuracy has decreased, but the difference between the train and test accuracies has reduced, so this is a better fit.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
Naive Bayes Model
Train accuracy: 0.8323
Classification report (train):
      precision  recall  f1-score  support
0.0   0.73       0.71    0.72      323
1.0   0.88       0.88    0.88      744
Test accuracy: 0.8202 (classification report omitted)
Inference
• The model is neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is very small.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
1.6) Model Tuning (4 pts) , Bagging ( 1.5 pts) and Boosting (1.5 pts). Apply grid search on each
model (include all models) and make models on best_params. Define a logic behind choosing
particular values for different hyper-parameters for grid search. Compare and comment on
performances of all. Comment on feature importance if applicable. Successful
implementation of both algorithms along with inferences and comments on the model
performances.
Random Forest Model
Train accuracy: 0.999 (classification report omitted)
Test accuracy: 0.8443 (classification report omitted)
Inference
• The model is over-fitted.
• As we can see, the train data has a 99.9% accuracy while the test data has about 84% accuracy. The difference is more than 10%, so we can infer that the Random Forest model is over-fitted.
• We will use bagging to improve the performance of the model.
Bagging Model
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a
random subset of the original dataset, and then aggregates their individual predictions (either
by voting or by averaging) to form a final prediction. Here we have used the default parameter
values, such as n_estimators=10, because increasing it leads to a larger over-fitting problem (a minimal sketch is shown below).
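A minimal sketch of this step; using a Random Forest as the base estimator is an assumption based on the inference below, and older scikit-learn versions take base_estimator instead of estimator:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Bagging with a Random Forest base estimator (assumption) and the default n_estimators=10.
bagging = BaggingClassifier(
    estimator=RandomForestClassifier(random_state=1),
    n_estimators=10,
    random_state=1,
)
bagging.fit(X_train, y_train)
print("Bagging train accuracy:", bagging.score(X_train, y_train))
print("Bagging test accuracy:", bagging.score(X_test, y_test))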
Train accuracy: 0.9840 (classification report omitted)
Test accuracy: 0.8421 (classification report omitted)
Inference
• Even after using the bagging classifier, the over-fitting problem is still there.
• But as we can see, both the train and test accuracies have reduced.
AdaBoost Model
Train accuracy: 0.8398 (classification report omitted)
Test accuracy: 0.8355 (classification report omitted)
Inference
• The AdaBoost model looks better than the Random Forest and Bagging models, since the accuracy values of the train and test data are nearly equal.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
Gradient Boosting Model
Train accuracy: 0.8860 (classification report omitted)
Test accuracy: 0.8421 (classification report omitted)
Inference
• The model is neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is very small.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
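A minimal sketch of the two boosting models reported above, with default hyper-parameters; random_state=1 is an assumption:

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# AdaBoost and Gradient Boosting with default hyper-parameters.
ada = AdaBoostClassifier(random_state=1).fit(X_train, y_train)
gbc = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

for name, model in [("AdaBoost", ada), ("Gradient Boosting", gbc)]:
    print(name, "train accuracy:", model.score(X_train, y_train))
    print(name, "test accuracy:", model.score(X_test, y_test))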
Best parameters:
{'penalty': 'l2', 'solver': 'sag', 'tol': 0.0001, 'max_iter': 10000, 'n_jobs': 2}
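These parameters come from a grid search; a minimal GridSearchCV sketch is shown below, where the parameter grid is a hypothetical reconstruction rather than the exact grid used in the notebook:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Hypothetical grid; the actual grid may have contained other values.
param_grid = {
    "penalty": ["l2"],
    "solver": ["sag", "lbfgs", "liblinear"],
    "tol": [1e-4, 1e-5],
}
grid = GridSearchCV(
    LogisticRegression(max_iter=10000, n_jobs=2),
    param_grid,
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
best_lr = grid.best_estimator_
print("Tuned train accuracy:", best_lr.score(X_train, y_train))
print("Tuned test accuracy:", best_lr.score(X_test, y_test))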
Logistic Regression tuned model:
Train accuracy: 0.8398 (classification report omitted)
Test accuracy: 0.8224 (classification report omitted)
Data   Score      Regular Model (%)  Tuned Model (%)
Train  Accuracy   84.17              83.98
       Precision  87                 86
       Recall     91                 92
       F1-score   89                 89
Test   Accuracy   82.89              82.24
       Precision  86                 85
       Recall     90                 90
       F1-score   88                 88
Table: Regular vs tuned Logistic Regression
Inference
• As we can see from the above tabular comparison, there is not much difference between the performance of the regular logistic regression model and the tuned logistic regression model.
• The difference between the train and test accuracies has also been reduced.
• The values are high overall and there is no over-fitting or under-fitting; therefore both models are equally good.
Tuned LDA model
Train accuracy: 0.8369
Classification report (train):
      precision  recall  f1-score  support
0.0   0.75       0.69    0.72      323
1.0   0.87       0.90    0.89      744
Test accuracy: 0.8333 (classification report omitted)
Tuned KNN model
Train accuracy: 0.8511
Classification report (train):
      precision  recall  f1-score  support
0.0   0.75       0.77    0.76      323
1.0   0.90       0.89    0.89      744
Test accuracy: 0.8266
Classification report (test):
      precision  recall  f1-score  support
0.0   0.71       0.72    0.71      138
1.0   0.88       0.87    0.88      318
Inference
• There is no over-fitting or under-fitting in the tuned KNN model. Overall, it is a good model.
• As we can see, the regular KNN model was an even better fit than the tuned model, so model tuning has not helped the model much.
• The values are better in the regular or base KNN model.
• Therefore, the regular KNN model is the better model.
Tuned Random Forest model
Train accuracy: 0.7832 (classification report omitted)
Test accuracy: 0.7917 (classification report omitted)
Inference
• The regular Random Forest model was very over-fitted, but tuning helped the model overcome the over-fitting problem.
• The difference between the train and test accuracies has also been reduced.
• However, the accuracy has decreased considerably and is very low compared to the other classifier models.
• Recall, precision and F1-scores have improved a lot after tuning.
Bagging model (tuned)
Train accuracy: 0.9268 (classification report omitted)
Test accuracy: 0.8355 (classification report omitted)
Inference
• Even after using the bagging technique, the random forest model is still over-fitted.
Tuned AdaBoost model
Train accuracy: 0.8520 (classification report omitted)
Test accuracy: 0.8136 (classification report omitted)
Tuned Gradient Boosting model
Train accuracy: 0.8860 (classification report omitted)
Test accuracy: 0.8268 (classification report omitted)
Inference
• There is no over-fitting or under-fitting in the tuned Gradient Boosting model. Overall, it is a good model.
• As we can see, the regular Gradient Boosting model was an even better fit than the tuned model, so model tuning has not helped the model much.
• The values are better in the regular or base Gradient Boosting model.
• Therefore, the regular Gradient Boosting model is the better model.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model,
classification report (4 pts) Final Model - Compare and comment on all models on the basis of
the performance metrics in a structured tabular manner. Describe on which model is
best/optimized, After comparison which model suits the best for the problem in hand on the
basis of different measures. Comment on the final model.(3 pts)
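A reusable evaluation sketch for this section; evaluate() is a hypothetical helper, not a function from the original notebook:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

# Prints accuracy, confusion matrix, classification report and ROC AUC for both splits,
# and adds the ROC curves of the given model to the current plot.
def evaluate(model, X_tr, y_tr, X_te, y_te, label):
    for split, X, y in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        pred = model.predict(X)
        prob = model.predict_proba(X)[:, 1]
        print(label, split, "accuracy:", accuracy_score(y, pred))
        print(confusion_matrix(y, pred))
        print(classification_report(y, pred))
        print(label, split, "ROC AUC:", roc_auc_score(y, prob))
        fpr, tpr, _ = roc_curve(y, prob)
        plt.plot(fpr, tpr, label=f"{label} {split}")
    plt.plot([0, 1], [0, 1], linestyle="--")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()

For example, evaluate(best_lr, X_train, y_train, X_test, y_test, "Logistic Regression") would print the metrics below and draw the corresponding ROC curves.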
Logistic Regression Model
Accuracy (train / test): 0.8398 / 0.8224
ROC AUC (train / test): 0.892 / 0.882
Inference
• Both the regular and tuned models are neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is not much.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
• The ROC_AUC score and the ROC curve indicate no over-fitting.
LDA Model
Accuracy (train / test): 0.8369 / 0.8333
ROC AUC (train / test): 0.877 / 0.882
Additional reported scores (train / test): 0.8228 / 0.8531, with ROC AUC 0.877 / 0.914
Inference
• Both the regular and tuned models are neither over-fitted nor under-fitted, as the difference between the accuracy values of the train and test data is not much.
• Accuracy, precision and recall for the test data are almost in line with the training data. This shows that no over-fitting or under-fitting has happened, and overall the model is a good model for classification.
• The ROC_AUC score and the ROC curve indicate no over-fitting.
KNN Model
Accuracy (train / test): 0.8407 / 0.8553
ROC AUC (train / test): 0.914 / 0.874
Additional reported scores (train / test): 0.8228 / 0.8266, with ROC AUC 0.922 / 0.914
Random Forest Model
Accuracy (train / test): 1.0 / 0.8443
Confusion matrix (train): [[323, 0], [0, 744]]; (test): [[93, 45], [26, 292]]
ROC AUC (train / test): 1.0 / 0.895
Tuned Random Forest: accuracy (train / test): 0.7926 / 0.7763; ROC AUC: 0.890 / 0.871
Bagging Model
Accuracy (train / test): 0.9840 / 0.8421; ROC AUC: 0.999 / 0.876
Tuned Bagging: accuracy (train / test): 0.9246 / 0.8268; ROC AUC: 0.983 / 0.881
Inference
• Even after using the bagging classifier, the over-fitting problem is still there.
• But as we can see, both the train and test accuracies have reduced.
AdaBoost Model
Accuracy (train / test): 0.8398 / 0.8355; ROC AUC: 0.900 / 0.910
Tuned AdaBoost: accuracy (train / test): 0.8520 / 0.8136; ROC AUC: 0.915 / 0.862
Inference
Gradient Boosting Model
Accuracy (train / test): 0.8860 / 0.8421; ROC AUC: 0.910 / 0.904
Tuned Gradient Boosting: accuracy (train / test): 0.9048 / 0.8289; ROC AUC: 0.969 / 0.874
Inference
• There is no over-fitting or under-fitting in the tuned Gradient Boosting model. Overall, it is a good model.
• As we can see, the regular Gradient Boosting model was an even better fit than the tuned model, so model tuning has not helped the model much.
• The values are better in the regular or base Gradient Boosting model.
• Therefore, the regular Gradient Boosting model is the better model.
Conclusion:
• There is no under-fitting or over-fitting in any of the tuned models.
• All the tuned models have high scores and every model is good, but the most consistent model across both the train and test data is the KNN model.
• The KNN model performs the best, with an 84% accuracy score on the train data and an 85% accuracy score on the test data. It also has an AUC score of about 91% on both the train and test data, which is the highest of all the models.
• It also has a precision score of 88% and a recall of 92%, which are also the highest of all the models. So we conclude that the KNN model is the best/optimized model.
Statement of Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States of
America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1) Find the number of characters, words and sentences for the mentioned documents.
(Hint: use .words(), .raw(), .sents() for extracting the counts.)
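A minimal sketch of these counts using the nltk inaugural corpus; note that character counts taken from .raw() include spaces and punctuation, so they may differ slightly from the figures reported below:

import nltk
from nltk.corpus import inaugural

nltk.download("inaugural")

# File ids of the three speeches in the inaugural corpus.
speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

for fid in speeches:
    print(fid)
    print("  characters:", len(inaugural.raw(fid)))
    print("  words:     ", len(inaugural.words(fid)))
    print("  sentences: ", len(inaugural.sents(fid)))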
Number of characters:
• President Franklin D. Roosevelt's speech has 7571 characters.
• President John F. Kennedy's speech has 7618 characters.
• President Richard Nixon's speech has 9991 characters.
Number of words:
• There are 1536 words in President Franklin D. Roosevelt's speech.
• There are 1546 words in President John F. Kennedy's speech.
• There are 2028 words in President Richard Nixon's speech.
Number of Sentences:
2.2) Remove all the stopwords from the three speeches. Show the word count before and
after the removal of stopwords. Show a sample sentence after the removal of stopwords.
Before removing the stop-words, we changed all the letters to lowercase and removed special
characters. We also stemmed the words using the Snowball Stemmer.
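A minimal sketch of this cleaning step; the regular expression and the exact stop-word list are assumptions, so the resulting counts may differ slightly from the notebook:

import re
import nltk
from nltk.corpus import inaugural, stopwords
from nltk.stem import SnowballStemmer

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def clean(words):
    # Lowercase, strip non-alphabetic characters, drop stop-words, then stem.
    words = [re.sub(r"[^a-z]", "", w.lower()) for w in words]
    words = [w for w in words if w and w not in stop_words]
    return [stemmer.stem(w) for w in words]

for fid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    raw_words = list(inaugural.words(fid))
    cleaned = clean(raw_words)
    print(fid, "words before:", len(raw_words), "after:", len(cleaned))
    print("  sample:", " ".join(cleaned[:15]))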
2.4) Plot the word cloud of each of the three speeches. (after removing the stopwords)
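A minimal word-cloud sketch, assuming the wordcloud package is installed and reusing the clean() helper and file ids from the sketch above:

import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import inaugural

# One word cloud per speech, built from the cleaned (stop-word-free, stemmed) text.
for fid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    text = " ".join(clean(inaugural.words(fid)))
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(fid)
    plt.show()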