
Great Learning

Machine Learning
Business Report
DSBA April 2023

By

Donkada Vindhya Mounika Patnaik

Contents

Problem 1
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?

Problem 2

2.1 Find the number of characters, words, and sentences for the mentioned documents.

2.2 Remove all the stopwords from all three speeches.

2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)

List of Figures
Figure 1: Univariate Analysis
Figure 2: Vote Countplot
Figure 3: Gender Countplot
Figure 4: Pairplot
Figure 5: Heatmap
Figure 6: Bivariate Analysis
Figure 7: Blair and Economic Household
Figure 8: Hague and Economic Household
Figure 9: Hague and Economic National
Figure 10: Blair and Economic National
Figure 11: Blair and Europe
Figure 12: Hague and Europe
Figure 13: Blair and Political Knowledge
Figure 14: Hague and Political Knowledge
Figure 15: Confusion Matrix
Figure 16: ROC Curve – Train Data
Figure 17: ROC Curve – Test Data
Figure 18: ROC Curve – Train Data
Figure 19: ROC Curve – Test Data
Figure 20: KNN Classification – Test Data
Figure 21: ROC Curve – Train Data
Figure 22: Naïve Bayes – Train Data
Figure 23: Naïve Bayes – Test Data
Figure 24: ADA Model – Train Data
Figure 25: ADA Model – Test Data
Figure 26: Decision Tree Classifier – Train Data
Figure 27: Decision Tree – Test Data
Figure 28: Random Forest – Train Data
Figure 29: Random Forest – Test Data

List of Tables
Table 1: Head
Table 2: Tail
Table 3: Data Information
Table 4: Null Values
Table 5: Data Description
Table 6: Skewness
Table 7: Transformed Data
Table 8: Train Test Split
Table 9: Classification Report – Train Data
Table 10: Classification Report – Test Data
Table 11: Classification Report – Train Data
Table 12: Classification Report – Test Data
Table 13: KNN Classification – Test Data
Table 14: KNN Classification – Train Data
Table 15: Naïve Bayes – Train Data
Table 16: Naïve Bayes – Test Data
Table 17: ADA Model – Train Data
Table 18: ADA Model – Test Data
Table 19: Decision Tree Classifier – Train Data
Table 20: Decision Tree – Test Data
Table 21: Random Forest Classifier – Train Data
Table 22: Random Forest Classifier – Test Data
Table 23: Model Comparison

Problem 1:

You are hired by CNBE, one of the leading news channels, which wants to analyze recent elections. This survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.

1.1 Read the dataset. Describe the data briefly. Interpret the inferences for each. Initial steps like head(), info(), data types, etc. Null value check, summary stats, and skewness must be discussed.
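A minimal sketch of these initial steps in pandas is shown below; the CSV file name is an assumption, since the report does not state it.

```python
import pandas as pd

df = pd.read_csv("Election_Data.csv")  # hypothetical file name

print(df.head())                    # first five rows (Table 1)
print(df.tail())                    # last five rows (Table 2)
df.info()                           # data types and non-null counts (Table 3)
print(df.isnull().sum())            # null value check (Table 4)
print(df.describe())                # summary statistics (Table 5)
print(df.skew(numeric_only=True))   # skewness coefficients (Table 6)
```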

Table 1: Head

Table 2: Tail

As shown in the above tables, the dataset has 1525 rows and 10 columns.

Table 3: Data Information

As mentioned in the above table, there are 8 integer variables and 2 object variables.

Table 4: Null Values

As per the output in the above table, there are zero null values in the data. There are also no duplicates and no missing values.

Table 5: Data Description

Inferences:

 The average age of the survey respondents is approximately 54 years. Age varies widely, with the youngest respondent being 24 years old and the oldest being 93. The middle 50% of respondents fall between 41 and 67 years.
 The average national economic condition rating is around 3.25.
 The average household economic condition rating is slightly lower, at around 3.14.
 The average rating for the variable "Blair" is approximately 3.33.
 "Hague" has an average rating of around 2.75.
 "Europe" has an average score of about 6.73.
 Respondents' political knowledge, represented by the variable "political.knowledge," has an average score of approximately 1.54, with a median score of 2.

Table 6: Skewness

Inferences:

 The positive skewness coefficient for "age" (0.1446) indicates a slight right (positive) skew: values are concentrated at the lower end, with some high ages stretching the right tail.
 The negative skewness coefficient for "Blair" (-0.5354) indicates a left (negative) skew: more respondents gave higher ratings, while a smaller number of low ratings stretch the left tail.

1.2 Perform EDA (Check the null values, Data types, shape, Univariate, bivariate analysis). Also check for outliers (4 pts). Interpret the inferences for each (3 pts). Distribution plots (histogram) or similar plots for the continuous columns. Box plots. Appropriate plots for categorical variables. Inferences on each plot. Outliers proportion should be discussed, and inferences from the above plots should be there. There is no restriction on how the learner wishes to implement this, but the code should be able to represent the correct output, and inferences should be logical and correct.

Figure 1: Univariate Analysis

Inferences:

 Age is approximately normally distributed, and its boxplot shows no outliers.
 Economic national and economic household are left skewed, and both show outliers at the lower end.

Figure 2: Vote Countplot

The number of votes for Labour is higher than for Conservative.

Figure 3: Gender Countplot

In the above chart, 0 represents female voters and 1 represents male voters. Thus, Figure 3 shows that there are more female voters than male voters.

Figure 4: Pairplot

Figure 5: Heatmap

The heatmap shows no strong correlation among the variables.

Figure 6: Bivariate Analysis

Inferences:

 The minimum age of voters rating Blair is 40, and the concentration of voters is between 40 and 65 years.
 The minimum age of voters rating Hague is 42, and the concentration of voters is between 42 and 65 years.

Figure 7: Blair and Economic Household

Figure 8: Hague and Economic Household

Comparing Figures 7 and 8, Blair scores higher on household economic condition than Hague.

Figure 9: Hague and Economic National

Figure 10: Blair and Economic National

Comparing Figures 9 and 10, Blair scores higher on national economic condition than Hague.

Figure 11: Blair and Europe

Figure 12: Hague and Europe

Across the Europe scale, Blair is leading in comparison to Hague.

Figure 13: Blair and Political Knowledge

Figure 14: Hague and Political Knowledge

Across all political knowledge levels, Blair leads Hague.

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? (2 pts). Data Split: Split the data into train and test (70:30).

As shown in Table 3, vote and gender are categorical variables with string values, which are converted into numeric values; a minimal sketch of this encoding follows Table 7.

Table 7: Transformed Data
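The encoding can be done with pandas category codes, as in this sketch; the dataframe df continues from the earlier sketch, and the exact integer mapping is an assumption.

```python
# Convert the two string columns to integer codes,
# e.g. Labour/Conservative -> 1/0 and male/female -> 1/0 (mapping assumed).
df["vote"] = df["vote"].astype("category").cat.codes
df["gender"] = df["gender"].astype("category").cat.codes
```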

The data is then divided into two sets – a training set and a testing set – split 70:30. Vote is the target variable, so it is dropped from the feature set.

Table 8: Train Test Split

The dataset exhibits disparities in feature magnitudes and units, which poses a challenge for machine learning algorithms that rely on Euclidean distance. Neglecting units can lead to inconsistent results, especially when comparing different units (e.g., 1 km vs. 1000 m). Scaling via MinMaxScaler standardizes the features to a common range. Here, scaling is applied only for the distance-based KNN model.
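A sketch of the 70:30 split and MinMax scaling described above; random_state is an assumption, and variable names follow the report's x_train/y_train convention.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = df.drop("vote", axis=1)   # features, with the target column dropped
y = df["vote"]                # target variable

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)  # 70:30 split; seed is assumed

# Scale only for the distance-based KNN model: fit on train, transform both.
scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
```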

1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis) (2 pts). Interpret the
inferences of both models.

Logistic Regression:

x_train and y_train are fit to a Logistic Regression model, with vote as the target variable. A minimal sketch of this step follows; the classification report and confusion matrix it produces are shown in Table 9 and Figure 15.
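This sketch uses sklearn defaults; max_iter is raised only as an assumption to ensure convergence.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

lr = LogisticRegression(max_iter=1000)  # default solver; max_iter assumed
lr.fit(x_train, y_train)

print(classification_report(y_train, lr.predict(x_train)))  # as in Table 9
print(confusion_matrix(y_train, lr.predict(x_train)))       # as in Figure 15
```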

Table 9: Classification Report – Train Data

Figure 15: Confusion Matrix

Inferences:

 The model appears to perform better for the "Labour" class in terms of precision, recall, and
F1-score, indicating it is more accurate and sensitive in identifying instances of "Labour."
 While the model has a higher precision for the "Labour" class, it has a slightly lower recall for
the "Conservative" class, suggesting that it might miss some "Conservative" instances.

 The overall accuracy of the model is 84%, which is a good starting point, but the specific
choice of metrics should depend on the problem and its requirements.

Table 10: Classification Report – Test Data

Inferences:

 The model performs better in classifying class 1 (likely 'Labour') with higher precision, recall,
and F1-score compared to class 0 (likely 'Conservative').
 The model's overall accuracy is reasonably good at 81%.

Figure 16: ROC Curve – Train Data

Figure 17: ROC Curve – Test Data
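The ROC curves above can be reproduced with a sketch like the following; the plotting details are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

probs = lr.predict_proba(x_test)[:, 1]   # predicted probability of class 1
fpr, tpr, _ = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, probs):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```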

Linear Discriminant Analysis:
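A sketch of the LDA fit, mirroring the logistic regression step above; default parameters are assumed.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

lda = LinearDiscriminantAnalysis()
lda.fit(x_train, y_train)

print(classification_report(y_train, lda.predict(x_train)))  # as in Table 11
print(classification_report(y_test, lda.predict(x_test)))    # as in Table 12
```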

Table 11: Classification Report – Train Data

Figure 18: ROC Curve – Train Data

 The model performs well in classifying both 'Conservative' and 'Labour,' with high precision,
recall, and F1-scores for both classes.
 The model has a relatively high accuracy of 84%, indicating its ability to correctly classify
most instances.
 The 'Labour' class has higher precision, recall, and F1-score compared to the 'Conservative'
class, suggesting better performance in classifying 'Labour.'

Table 12: Classification Report – Test Data

Figure 19: ROC Curve – Test Data

Inferences:

 The model has a decent accuracy of 81%, indicating its ability to correctly classify a majority
of instances.
 The 'Labour' class has higher precision, recall, and F1-score compared to the 'Conservative'
class, suggesting better performance in classifying 'Labour.'

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the inferences of each model. Calculate Train and Test Accuracies for each model. Comment on the validness of models (overfitting or underfitting).

KNN Model:

An accuracy score of 0.6819 suggests that the KNN model makes correct predictions for about 68.2% of the test data.
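A sketch of the KNN fit on the MinMax-scaled features; n_neighbors=5 is the sklearn default and an assumption here.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)  # k assumed
knn.fit(x_train_scaled, y_train)

print(accuracy_score(y_test, knn.predict(x_test_scaled)))    # test accuracy
print(accuracy_score(y_train, knn.predict(x_train_scaled)))  # train accuracy
```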

Table 13: KNN Classification – Test Data

Figure 20: KNN Classification – Test Data

Inferences:

 The precision, recall, and F1-score for the "Labour" class are considerably better than those
for the "Conservative" class. This suggests that the model is better at predicting "Labour"
instances, while its performance is poorer for "Conservative" instances.
 The weighted average F1-score (0.70) suggests an overall moderate performance of the model
on both classes.

Table 14: KNN Classification – Train Data

Figure 21: ROC Curve – Train Data

Inferences:

 The precision and recall for both classes have improved significantly compared to the
previous set of metrics. The model correctly identifies both "Conservative" and "Labour"
instances.
 The "Labour" class continues to have a higher precision and recall, indicating that the model
is more effective at identifying "Labour" instances.
 The F1-score for "Labour" is particularly high at 0.86, indicating a good balance between
precision and recall.
 The weighted average F1-score (0.80) suggests an overall good performance of the model on
both classes. The weighted average takes into account the class imbalance, and the model
performs well across both classes.
 An accuracy of 0.79 means that the model correctly predicts the class labels for the majority of instances in the training data.

Naïve Bayes:
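A sketch of the Naïve Bayes fit; GaussianNB is assumed, since the features are numeric ratings.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

nb = GaussianNB()
nb.fit(x_train, y_train)

print(classification_report(y_train, nb.predict(x_train)))  # as in Table 15
print(classification_report(y_test, nb.predict(x_test)))    # as in Table 16
```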

Table 15: Naïve Bayes – Train Data

Figure 22: Naïve Bayes – Train Data

Inferences:

 The model seems to perform well, with high precision, recall, and F1-Scores for both classes.
 "Labour" class has higher precision and recall compared to the "Conservative" class,
suggesting better model performance in predicting "Labour."
 The accuracy of 84% indicates the overall correctness of the model's predictions.

Table 16: Naïve Bayes – Test Data

Figure 23: Naïve Bayes – Test Data

Inferences:

 The model shows reasonably good performance in distinguishing between "Conservative" and "Labour."
 "Labour" class has higher precision and recall compared to the "Conservative" class,
indicating better model performance in predicting "Labour."
 The accuracy of 81% suggests that the model provides correct predictions for a substantial
portion of the dataset.
 The macro and weighted averages are similar, suggesting that class imbalances may not be
significant in this dataset.

KNN and Naïve Bayes Comparison:

 Accuracy: The Naïve Bayes model has a higher accuracy on the test data (81%) compared to
the KNN model (68.20%). This suggests that the Naïve Bayes model is making correct
predictions for a larger portion of the test dataset.
 Precision and Recall: Both models perform better in predicting the "Labour" class compared
to the "Conservative" class, but the Naïve Bayes model tends to outperform KNN in terms of
precision and recall for both classes.
 F1-Score: The F1-scores for the "Labour" class are higher for both models, but the Naïve
Bayes model generally has higher F1-scores overall.
 Class Imbalance: The Naïve Bayes model seems to handle class imbalance well, as indicated
by similar macro and weighted averages. The KNN model's performance varies more between
training and test data.

In conclusion, the Naïve Bayes model appears to be more valid and robust for the given dataset. It
consistently outperforms the KNN model in terms of accuracy, precision, recall, and F1-scores.

1.6 Model Tuning, Bagging and Boosting. Apply grid search on each model (include all models)
and make models on best_params. Compare and comment on performances of all. Comment on
feature importance if applicable. Successful implementation of both algorithms along with
inferences and comments on the model performances.
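A sketch of the grid-search tuning described here, using AdaBoost as the example; the parameter grid and cv=5 are assumptions, not the report's actual settings.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],     # grid values assumed
    "learning_rate": [0.1, 0.5, 1.0],
}
grid = GridSearchCV(AdaBoostClassifier(random_state=1), param_grid, cv=5)
grid.fit(x_train, y_train)

best_ada = grid.best_estimator_   # model refit on best_params_
print(grid.best_params_)
```

The same pattern applies to the Decision Tree and Random Forest models, each with its own parameter grid.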

ADA Boosting Model:

Table 17: ADA Model – Train Data

Figure 24: ADA Model – Train Data

Inferences:

The model appears to perform reasonably well with an overall accuracy of 86%. However, it's better
at predicting the Labour class, which is evident from higher precision and recall values for Labour.
The model's performance for the Conservative class is slightly lower, as indicated by lower precision
and recall.

Table 18: ADA Model – Test Data

Figure 25: ADA Model – Test Data

Inferences:

The model appears to perform reasonably well with an overall accuracy of 82%. It is better at
predicting the Labour class, as indicated by higher precision and recall values for Labour. The model's
performance for the Conservative class is slightly lower, as indicated by lower precision and recall.

Decision Tree Classifier:

Table 19: Decision Tree Classifier – Train Data

Figure 26: Decision Tree Classifier – Train Data

Inferences:

The model appears to perform reasonably well with an accuracy of 86%. It is better at predicting the
"Labour" class, as indicated by the higher precision and recall values for "Labour." The F1-scores
suggest that the model achieves a good balance between precision and recall for both classes.
However, the model's performance for the "Conservative" class is slightly lower than for the "Labour"
class.

Table 20: Decision Tree - Test Data

Figure 27: Decision Tree – Test Data

Inferences:
The model has a moderate overall accuracy of 75%. It is better at predicting the "Labour" class, as indicated by the higher precision and recall values for "Labour." The F1-scores suggest that the model achieves a good balance between precision and recall for the "Labour" class, but its performance for the "Conservative" class is lower.

Random Forest Classifier:

Table 21: Random Forest Classifier – Train Data

Figure 28: Random Forest – Train Data

Inferences:

The model performs well with an overall accuracy of 86.48%. It is slightly better at predicting the
"Labour" class, as indicated by the higher precision and recall values for "Labour." The F1-scores
suggest that the model achieves a good balance between precision and recall for both classes, with a
particularly high F1-score for "Labour." This model appears to be quite effective in distinguishing
between "Conservative" and "Labour" instances in the dataset.

Table 22: Random Forest Classifier – Test Data

Figure 29: Random Forest – Test Data

Inferences:
The model performs relatively well with an overall accuracy of 80%. It is better at predicting the
"Labour" class, as indicated by the higher precision and recall values for "Labour." The F1-scores
suggest that the model achieves a good balance between precision and recall for the "Labour" class,
but its performance for the "Conservative" class is lower.
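Feature importance from the random forest can be inspected as in this sketch; variable names continue from the earlier sketches.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=1).fit(x_train, y_train)

# Rank features by their contribution to the forest's splits.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```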

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, ROC curve, and ROC_AUC score for each model, along with the classification report. Final Model: Compare and comment on all models on the basis of the performance metrics in a structured tabular manner. Describe which model is best/optimized and, after comparison, which model suits the problem at hand best on the basis of the different measures. Comment on the final model.

Considering the ROC and AUC results above for each model, the following comparison is derived.

Model               | Train Acc | Test Acc | Precision (Labour) | Precision (Conservative) | Recall (Labour) | Recall (Conservative) | F1 (Labour) | F1 (Conservative)
Logistic Regression | -         | -        | High               | Moderate                 | High            | Moderate              | High        | Moderate
Linear Discriminant | High      | Moderate | High               | High                     | High            | High                  | High        | High
KNN                 | High      | Moderate | High               | Low                      | High            | Moderate              | High        | Low
Naïve Bayes         | High      | Moderate | High               | High                     | High            | High                  | High        | High
ADA Boosting        | High      | Moderate | High               | Low                      | High            | Moderate              | High        | Low
Decision Tree       | High      | Moderate | High               | Low                      | High            | Moderate              | High        | Low
Random Forest       | High      | Moderate | High               | Low                      | High            | Moderate              | High        | Low

Table 23: Model Comparison

Based on the comparison:

 Accuracy: The Logistic Regression, Linear Discriminant, and Naïve Bayes models have the
highest accuracy on both training and test data. The ADA Boosting, Decision Tree, and
Random Forest models also have good accuracy, but the KNN model lags behind.
 Precision: Logistic Regression, Linear Discriminant, KNN, Naïve Bayes, and ADA Boosting
have relatively high precision for the "Labour" class. However, Logistic Regression, Linear
Discriminant, and Naïve Bayes outperform the others for the "Conservative" class.
 Recall: Logistic Regression, Linear Discriminant, KNN, Naïve Bayes, and ADA Boosting
have high recall for the "Labour" class, but Logistic Regression and Linear Discriminant have
higher recall for the "Conservative" class.
 F1-Score: Logistic Regression, Linear Discriminant, and Naïve Bayes models have high F1-
scores for both classes. KNN and ADA Boosting have high F1-scores for the "Labour" class
but lower scores for the "Conservative" class.
 Optimization: The Decision Tree model, while performing well on the training data, shows a
drop in accuracy and F1-scores on the test data, indicating overfitting. In contrast, the
Random Forest model provides a good balance of performance on both training and test data.

Conclusion:

Among the models evaluated, the Naïve Bayes model and the Logistic Regression model seem to be
the best options. They consistently demonstrate high accuracy, precision, recall, and F1-scores for
both classes, making them well-suited for this classification problem. Naïve Bayes has slightly better
precision and recall for the "Labour" class, while Logistic Regression offers a more balanced
performance for both classes.

1.8 Based on your analysis and working on the business problem, detail out appropriate
insights and recommendations to help the management solve the business objective. There
should be at least 3-4 Recommendations and insights in total.

Business Insights:
1. The Labour party is performing better than the Conservative party.
2. There are more female voters than male voters.
3. Individuals who favor the Labour party tend to prefer it when the national economic conditions are more favorable, while those with a deeper understanding of politics tend to cast their votes in favor of the Conservative party.

Problem 2:

In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States of
America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sents() for extracting counts)

2.1 Find the number of characters, words, and sentences for the mentioned documents.
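A sketch of these counts using nltk's inaugural corpus; the file ids follow the corpus's naming convention.

```python
import nltk
nltk.download("inaugural")
from nltk.corpus import inaugural

for fid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    print(fid,
          "| characters:", len(inaugural.raw(fid)),
          "| words:", len(inaugural.words(fid)),
          "| sentences:", len(inaugural.sents(fid)))
```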

2.2 Remove all the stopwords from all three speeches.
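A sketch of the stopword removal, continuing from the imports above; lowercasing and dropping punctuation are assumptions about the preprocessing.

```python
nltk.download("stopwords")
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def clean(fid):
    # Keep lowercased alphabetic tokens that are not stopwords.
    return [w.lower() for w in inaugural.words(fid)
            if w.isalpha() and w.lower() not in stop_words]

cleaned = {fid: clean(fid) for fid in
           ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]}
```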

2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stopwords)

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) –
3 Marks [ refer to the End-to-End Case Study done in the Mentored Learning Session ]
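A sketch of one word cloud per speech, again continuing from the cleaned tokens; this uses the third-party wordcloud package, and the plot styling is an assumption.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for fid, words in cleaned.items():
    wc = WordCloud(background_color="white").generate(" ".join(words))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(fid)
    plt.show()
```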

