
MACHINE LEARNING

PROJECT

ABHISHEK.V.
PGDSBA
Table of Contents

Problem 1:
1.1 Read the dataset. Do the descriptive statistics and do the null value condition
check. Write an inference on it. (4 Marks)
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check
for Outliers. (7 Marks)
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here
or not? Data Split: Split the data into train and test (70:30). (4 Marks)
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting. (7 marks)
1.7 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score
for each model. Final Model: Compare the models and write inference which model
is best/optimized. (7 marks)
1.8 Based on these predictions, what are the insights? (5 marks)

Problem 2:
In this particular project, we are going to work on the inaugural corpora from the
nltk in Python. We will be looking at the following speeches of the Presidents of the
United States of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sents() for extracting counts)
2.1 Find the number of characters, words, and sentences for the mentioned
documents. – 3 Marks
2.2 Remove all the stopwords from all three speeches. – 3 Marks
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords) – 3 Marks
2.4 Plot the word cloud of each of the three speeches. (after removing the
stopwords) – 3 Marks

1.1) Read the dataset. Describe the data briefly. Interpret the inferences for each.
Initial steps like head() .info(), Data Types, etc . Null value check, Summary stats,
Skewness must be discussed.
Dataset head:

Shape of the dataset:

The Election dataset has 1525 rows and 10 columns.


Below are the variables and their datatypes:

All the variables except 'vote' and 'gender' are of int64 datatype.
However, looking at the values in the dataset, the other variables all appear to be
categorical columns, except 'age'.

Removing the unwanted variable 'Unnamed: 0', which does not give meaningful
information, and displaying the head of the dataset.
From the below snippet it is evident that the dataset does not have null values.

The output of the describe() function for the numerical variables.

The output of the describe() function for the categorical variables.

The dataset has a few duplicates, and removing them is the best choice as duplicates
do not add any value. The snippet below also shows the shape of the dataset after
removing the duplicates.
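The cleanup steps above (dropping the index column, checking for nulls, dropping duplicates) can be sketched as follows. The tiny frame below is a made-up stand-in for the election data; only the 'Unnamed: 0', 'vote', and 'age' column names come from the report.

```python
import pandas as pd

# Made-up stand-in for the election data (real data has 1525 rows, 10 columns).
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "vote": ["Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 35, 35],
})

df = df.drop(columns=["Unnamed: 0"])  # drop the meaningless index column
print(df.isnull().sum().sum())        # total count of null values
df = df.drop_duplicates()             # duplicates add no information
print(df.shape)                       # shape after removing duplicates
```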

The unique values of each categorical variable and their counts.


The skewness of the dataset.

The coefficient of variation of the dataset.
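The skewness and coefficient-of-variation checks can be sketched like this; the series below is a made-up stand-in for a numeric column such as 'age', so the values are illustrative only.

```python
import pandas as pd

# Illustrative stand-in for a numeric column like 'age'.
s = pd.Series([24, 35, 41, 47, 53, 58, 67, 72, 80, 93])

skew = s.skew()          # near 0 means roughly symmetric (normal-like)
cv = s.std() / s.mean()  # coefficient of variation: spread relative to the mean
print(round(skew, 3), round(cv, 3))
```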

Descriptive statistics for the dataset.


From the above snippet we can conclude that the dataset has only one
integer column, which is 'age'.
The mean and median of 'age' are almost the same, indicating that the column is
normally distributed.
'vote' has two unique values, Labour and Conservative, and is also the dependent
variable.
'gender' has two unique values, Male and Female.
All the remaining columns are object variables, with 'Europe' having the most (11)
unique values.
1.2) Perform EDA (Check the null values, Data types, shape, Univariate, bivariate
analysis). Also check for outliers. Interpret the inferences for each Distribution
plots(histogram) or similar plots for the continuous columns. Box plots.
Appropriate plots for categorical variables. Inferences on each plot. Outliers
proportion should be discussed, and inferences from above used plots should be
there. There is no restriction on how the learner wishes to implement this but the
code should be able to represent the correct output and inferences should be logical
and correct.
Univariate Analysis
For Continuous Variables.
We can see that all the numerical variables are roughly normally distributed (not
perfectly normal, and multimodal in some instances).
There are outliers present in the 'economic_cond_national' and
'economic_cond_household' variables, as can also be seen from the boxplots on the right.
The min and max values of the variables are not very clear from the boxplots, so
we obtain them separately while checking for outliers.

Bivariate Analysis.
Pairplot-
The pairplot shows the interaction of each variable with every other variable
present.
No strong relationship is present between the variables.
There is a mixture of positive and negative relationships, which is expected.
Overall, it gives a rough estimate of the interactions; a clearer picture is obtained
from the heatmap and other kinds of plots.

Analysis of Blair and Age.

People above 45 years of age think that Blair is doing a good job.
Analysis of Hague and Age.
Hague has slightly more concentration of neutral points than that of Blair for people
above 50 years of age.

Catplot analysis- Blair (count) on economic_cond_household

Catplot Analysis- Hague (count) on economic_cond_household


Blair has more points in terms of economic_cond_household than Hague.

Catplot Analysis- Blair(count) on economic_cond_national.

Catplot Analysis- Hague(count) on economic_cond_national.

Blair has more points in terms of economic national than Hague.


Catplot Analysis-Blair (count) on Europe.

Catplot Analysis- Hague(count) on Europe.

Looking at the data across the 'Europe' variable, Blair is leading.


Catplot Analysis-Blair (count) on political_knowledge.

Catplot Analysis- Hague (count) on political_knowledge.

In terms of political_knowledge, Blair is considered better.


Heatmap.

Multicollinearity is an important issue that can harm the model, and a heatmap is a
good way of identifying it, because it gives a basic idea of the relationships the
variables have with each other.
Observations:
* The highest positive correlation is between 'economic_cond_national' and
'economic_cond_household' (0.35). The good thing is that it is not large.
* The highest negative correlation is between Blair and Hague (0.35), which is also
not large.
Thus, multicollinearity will not be an issue in this dataset.
Outlier Check/Treatment:
Using boxplot.
There are outliers present in the 'economic_cond_national' and
'economic_cond_household' variables, as can be seen from the boxplots.
We will find the upper and lower limits to get a clearer picture of the outliers.

The upper and lower limits are not far apart, and the outliers are on the lower
side only, with value 1 where the lower limit is 1.5.
So it is not advisable to treat the outliers in this case.
We will move forward without treating the outliers.
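The limit calculation above is the standard 1.5 x IQR rule; a sketch on illustrative values follows (the values are chosen so the lower limit comes out at 1.5, as in the report, but they are not the real column).

```python
import pandas as pd

# Illustrative stand-in for a 1-to-5 rating column like economic_cond_national.
col = pd.Series([1, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5])

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 1.5 x IQR outlier limits
outliers = col[(col < lower) | (col > upper)]    # only the value 1 falls outside
print(lower, upper, list(outliers))
```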
1.3) Encode the data (having string values) for Modelling. Is Scaling necessary here
or not? Data Split: Split the data into train and test (70:30).
Encoding the dataset.
As many machine learning models cannot work with string values, we will encode
the categorical variables and convert their datatypes to integer.
From the info of the dataset, we know there are 2 categorical variables, so we
need to encode those 2 variables with a suitable technique.
Those 2 variables are 'vote' and 'gender'. Their distributions are given below.
Gender Distribution.

Vote Distribution.

From the above results we can see that both variables contain only two
classes.
After encoding the info.
Data.

Now, the model can be built on this dataset.

Scaling.
We are not going to scale the data for the Logistic Regression, LDA, and Naive
Bayes models, as it is not necessary.
But in the case of KNN it is necessary to scale the data, as it is a distance-based
algorithm. Scaling the data gives similar weightage to all the variables.
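A sketch of the encoding, 70:30 split, and scaling steps. The toy frame and its values are assumptions; only the 'vote' and 'gender' column names come from the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in: 'vote' and 'gender' are the two string columns, as in the report.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "gender": ["male", "female"] * 10,
    "age": range(30, 50),
})

# Binary categoricals can be encoded as 0/1 integer codes.
for c in ["vote", "gender"]:
    df[c] = df[c].astype("category").cat.codes

X, y = df.drop(columns=["vote"]), df["vote"]   # 'vote' is the dependent variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)      # 70:30 split

# Scaling matters for distance-based KNN; tree/NB/LDA models do not need it.
X_train_s = StandardScaler().fit_transform(X_train)
print(X_train.shape, X_test.shape)
```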

1.4) Apply Logistic Regression and LDA (Linear Discriminant Analysis). Interpret
the inferences of both models.
* Logistic Regression.
Applying logistic Regression and fitting the train data.
Confusion matrix display.

AUC ROC curve for Logistic Train.


AUC : 0.889

Applying logistic Regression and fitting the test data.


Confusion matrix display.

AUC ROC curve for Logistic Test.


AUC: 0.883

The model is neither overfitting nor underfitting. The training and testing results
show that the model performs well, with good precision and recall values.
* LDA (Linear Discriminant Analysis).
Applying LDA and fitting the train data.

Confusion matrix display.

AUC ROC curve for LDA Train.


AUC: 0.889
Applying LDA and fitting the test data.

Confusion matrix display.


AUC ROC curve for LDA Test.
AUC: 0.884

The training and testing results show that the model performs well, with good
precision and recall values. The LDA model is better than Logistic Regression,
with better test accuracy and recall values.
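A minimal sketch of fitting both models and computing the test ROC AUC. Synthetic data stands in for the election dataset here, so the scores will differ from the report's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the election data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

for model in (LogisticRegression(max_iter=1000), LinearDiscriminantAnalysis()):
    model.fit(X_tr, y_tr)                                        # fit on train split
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])   # test-set ROC AUC
    print(type(model).__name__, round(auc, 3))
```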

1.5) Apply KNN Model and Naïve Bayes Model. Interpret the inferences of each
model.
Applying KNN model and fitting the train data.
Confusion matrix display.

AUC ROC curve for KNN Train.


AUC: 0.932
Applying KNN model and fitting the test data.

Confusion matrix display.

AUC ROC curve for KNN Test.


AUC: 0.871
The training and testing results show that the model is neither overfitting nor
underfitting.
There is not much difference in the model scores (around 75% accuracy), so it is
not guaranteed that model tuning will improve performance.

Applying Naive Bayes model and fitting the train data.


Confusion matrix display.

AUC ROC curve for Naive Bayes Train.


AUC: 0.886
Applying Naive Bayes model and fitting test data.

Confusion matrix display.

AUC ROC curve for Naive Bayes Test.


AUC: 0.885
The training and testing results show that the model is neither overfitting nor
underfitting.
The Naive Bayes model also performs well, with good accuracy and recall values.
Even though the KNN and NB models have similar train and test accuracy, the
recall on the test set shows that KNN performs better than Naive Bayes.
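The KNN and Naive Bayes fits can be sketched as below, again on synthetic stand-in data. Note that only KNN receives scaled features, since it is distance based; n_neighbors=5 is the scikit-learn default, not a value taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Scale only for KNN; Gaussian NB is unaffected by feature scale.
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)
nb = GaussianNB().fit(X_tr, y_tr)

knn_acc = knn.score(scaler.transform(X_te), y_te)
nb_acc = nb.score(X_te, y_te)
print(round(knn_acc, 3), round(nb_acc, 3))
```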

1.6) Model Tuning, Bagging and Boosting (1.5 pts). Apply grid search on each
model (include all models) and make models on best params.

Grid search is used to find the optimal hyperparameters of a model, which result in
the most 'accurate' predictions. Here we try to minimize the false negatives in
the models by using grid search to find the optimal parameters. Grid search can be
used to improve any specific evaluation metric.
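A minimal grid-search sketch. The actual parameter grids used in the report are not shown, so the grid below and the recall scoring (chosen to match the goal of minimizing false negatives) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # hypothetical grid
    scoring="recall",                      # tune to reduce false negatives
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```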

BAGGING:
Bagging is a machine learning ensemble meta-algorithm designed to improve the
stability and accuracy of machine learning algorithms used in statistical
classification and regression.
Bagging reduces variance and helps to avoid overfitting. A Decision Tree
classifier is used for bagging below.
Applying Bagging model and fitting Train data.
Confusion matrix display.

AUC ROC curve for Bagging model Train.


AUC: 1.00
Applying Bagging model and fitting Test data.

Confusion matrix display:

AUC ROC curve for Bagging model Test.


AUC: 0.877
Comparing the train and test results, the gap exceeds the recommended +/- 10%, so
the model is overfitting the training data despite its strong train scores.

Bagging reduces variance and helps to avoid overfitting. A Random Forest is used
for bagging below.

Applying Bagging model and fitting Train data.


Confusion matrix display.

AUC ROC curve for Bagging model Train.


AUC: 0.997
Applying Bagging model and fitting Test data.

Confusion matrix display.

AUC ROC curve for Bagging model Test.


AUC: 0.897
Comparing the train and test results, the gap is around the recommended +/- 10%
limit, so some overfitting remains despite the strong scores.
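Both bagging variants described above can be sketched as follows, on synthetic stand-in data: a BaggingClassifier over decision trees, and a Random Forest (bagging plus random feature selection). The estimator counts are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Bagging over decision trees: bootstrap samples, one tree per sample.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X_tr, y_tr)
# Random Forest: bagging plus random feature subsets at each split.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(round(bag.score(X_te, y_te), 3), round(rf.score(X_te, y_te), 3))
```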

BOOSTING: Boosting is an ensemble strategy that consecutively builds on weak
learners in order to generate one final strong learner. A weak learner is a model that
may not be very accurate or may not take many predictors into account.
By building weak models, drawing conclusions about feature importances and
parameters, and then using those conclusions to build new, stronger models,
boosting can effectively convert weak learners into a strong learner. The method of
boosting used here is AdaBoost (Adaptive Boosting).
Applying AdaBoost model and fitting Train data.

Confusion matrix display:


AUC ROC curve for Boosting model (Ada boost) Train.
AUC: 0.910

Applying AdaBoost model and fitting Test data.

Confusion matrix display:


AUC ROC curve for Boosting model (Ada boost) Test.
AUC: 0.880

The method of boosting used here is GradientBoost (Gradient boosting).


Applying GradientBoost model and fitting Train data.
Confusion matrix display:

AUC ROC curve for Gradient Boosting model Train.


AUC: 0.950
Applying GradientBoost model and fitting Test data.

Confusion matrix display:

AUC ROC curve for Gradient Boosting model Test.


AUC: 0.904
The Gradient Boosting model performs the best, with 89% train accuracy, 93%
precision, and 91% recall, which is better than the other models fitted here on the
election data.
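The two boosting methods used above can be sketched as below, again on synthetic stand-in data rather than the election dataset; the estimator counts are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# AdaBoost reweights misclassified samples; Gradient Boosting fits each new
# learner to the residual errors of the current ensemble.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(round(ada.score(X_te, y_te), 3), round(gb.score(X_te, y_te), 3))
```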

1.7 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score
for each model, classification report Final Model.
Logistic Regression:
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.889, Test: 0.883

Logistic Regression (GridSearchCV):
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.890, Test: 0.883

LDA:
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.883, Test: 0.884

LDA (GridSearchCV):
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.894, Test: 0.865

KNN:
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.932, Test: 0.871

KNN (GridSearchCV):
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 1.000, Test: 0.824

Naïve Bayes:
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.886, Test: 0.885

Naïve Bayes (GridSearchCV):
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.892, Test: 0.855

Bagging - Decision Tree:
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 1.000, Test: 0.877

Bagging - Decision Tree (GridSearchCV):
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.990, Test: 0.843

Bagging - Random Forest:
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.997, Test: 0.897

Bagging - Random Forest (GridSearchCV):
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.998, Test: 0.864

Boosting - AdaBoost:
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.910, Test: 0.880

Boosting - AdaBoost (GridSearchCV):
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.915, Test: 0.861

Boosting - Gradient Boosting:
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.950, Test: 0.904

Boosting - Gradient Boosting (GridSearchCV):
Confusion matrix:
Train set: Test set:
ROC AUC - Train: 0.939, Test: 0.873

Model comparison:
In this process we compare all the models built and find the best optimized among
them. There are a total of 8 different kinds of models, each built 2 times in the
following fashion:
- Without scaling.
- With scaling (GridSearchCV).
The basis on which models are evaluated is known as performance metrics. The
metrics on which the models are evaluated are:
- Accuracy
- AUC
- Recall
- Precision
- F1-score

Without Scaling:

From the above:


- On Accuracy, the Bagging model performs better than the other models.
- On Precision, the Naïve Bayes and Gradient Boosting models perform better than
the other models.
- On Recall, the Logistic Regression and Bagging models perform better than the
other models.
- On F1-score, the Bagging model performs better than the other models.

All models performed well, with only slight differences (1-5%).
With Scaling (Grid_Searchcv)

From the above:


- On Accuracy, the Logistic Regression model performs better than the other
models.
- On Precision, the Logistic Regression model performs better than the other
models.
- On Recall, the Logistic Regression, LDA, Naïve Bayes, and Gradient Boosting
models perform better than the other models.
- On F1-score, the Logistic Regression model performs better than the other models.

All models performed well, with only slight differences (1-5%).
As for the scaled and unscaled models, scaling only improved the performance
of the distance-based algorithm; for the others it slightly decreased performance
overall. Here only the scaled KNN performed slightly better than the unscaled KNN.
Best optimized model: on the basis of all the comparisons and performance metrics,
the Logistic Regression model without scaling performed the best of all.
1.8) Based on your analysis and working on the business problem, detail out
appropriate insights and recommendations to help the management solve the
business objective.

Inferences:
Logistic Regression performed the best out of all the models built.
The Logistic Regression equation for the model shows how each feature contributes
to the predicted output.

The important features in the Logistic Regression model are:

Insights and Recommendations:


Our main business objective is: "To build a model to predict which party a voter
will vote for on the basis of the given information, to create an exit poll that will
help in predicting the overall win and seats covered by a particular party."

 Use the Logistic Regression model without scaling to predict the outcome, as it
has the best optimized performance.
 Hyperparameter tuning is an important aspect of model building. Its limitation is
that processing the parameter combinations requires a huge amount of processing
power, but if tuning can be done with more sets of parameters we might get even
better results.
 Gathering more data will also help in training the models and thus improve
their predictive power.
 We can also create a function in which all the models predict the outcomes in
sequence. This will help in better understanding what the outcome will probably
be.

Problem 2:

In this particular project, we are going to work on the inaugural corpora from the
nltk in Python. We will be looking at the following speeches of the Presidents of the
United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1) Find the number of characters, words and sentences for the mentioned
documents. (Hint: use .words(), .raw(), .sents() for extracting counts)
1. President Franklin D. Roosevelt's 1941 speech has 7571 characters, 1536 words,
and 68 sentences.
2. President John F. Kennedy's 1961 speech has 7618 characters, 1546 words, and
52 sentences.
3. President Richard Nixon's 1973 speech has 9991 characters, 2028 words, and 69
sentences.
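The counting logic can be sketched without downloading the corpus by applying the same idea to a short stand-in string; the inaugural fileids in the comments show where the report's actual counts come from.

```python
# Stand-in string; with the nltk inaugural corpus the equivalents would be:
#   len(inaugural.raw('1941-Roosevelt.txt'))    -> characters
#   len(inaugural.words('1941-Roosevelt.txt'))  -> words
#   len(inaugural.sents('1941-Roosevelt.txt'))  -> sentences
text = "We are Americans. We build for the future."

n_chars = len(text)          # character count of the raw text
n_words = len(text.split())  # crude word count for this sketch
n_sents = text.count(".")    # crude sentence count for this sketch
print(n_chars, n_words, n_sents)
```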
2.2) Remove all the stopwords from the three speeches. Show the word count
before and after the removal of stopwords. Show a sample sentence after the
removal of stopwords.
1. Words along with stopwords in Roosevelt's speech.

Frequency of words before removal of stopwords.

Only stopwords in Roosevelt Speech.

Frequency of words after removal of stopwords.


2. Words along with stopwords in John F. Kennedy's speech.

Frequency of words before removing stopwords.

Only stopwords in John F. Kennedy's speech.

Frequency of words after removal of stopwords.


3. Words along with stopwords in Richard Nixon's speech.

Frequency of words before removal of stopwords.

Only stopwords in Richard Nixon's speech.

Frequency of words after removal of stopwords.


Word counts before removal of stopwords:
Roosevelt W1 = 1536
Kennedy W2 = 1546
Nixon W3 = 2028
Word counts after removal of stopwords:
Roosevelt W1 = 680
Kennedy W2 = 719
Nixon W3 = 867
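The stopword removal and the before/after counts can be sketched like this. The report uses nltk.corpus.stopwords.words('english'); here that list is abbreviated to a handful of entries so the example is self-contained, and the text is a stand-in sentence.

```python
from collections import Counter

# Abbreviated stand-in for nltk.corpus.stopwords.words('english').
stopwords = {"the", "of", "and", "to", "a", "in", "we", "for", "are"}

text = "we build for the future and the future is peace"
words = [w for w in text.split() if w not in stopwords]  # drop stopwords
top = Counter(words).most_common(3)                      # top-3 remaining words
print(len(text.split()), len(words), top)
```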

2.3) Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords)

In the snippets below we can see the words that occur the most number of times in
each inaugural address.
1. President Roosevelt's 1941 speech.

The most frequently used words in Roosevelt's speech are:


Know
Spirit
Life
2. President John F. Kennedy, 1961.

The most frequently used words in John F. Kennedy's speech are:


World
Sides
New
3. President Richard Nixon 1973.

The most frequently used words in Richard Nixon's speech are:


America
peace
world

2.4) Plot the word cloud of each of the three speeches. (after removing the
stopwords).

1. Word cloud for President Franklin D. Roosevelt's speech (after stopword removal).


2. Word cloud for President John F. Kennedy's speech (after stopword removal).

3. Word cloud for President Richard Nixon's speech (after stopword removal).
