MACHINE LEARNING
PROJECT
By Ritesh Tandon
Problem 1:
You are hired by CNBE, one of the leading news channels, which wants to analyse recent
elections. A survey was conducted on 1525 voters with 9 variables. You have to build a
model to predict which party a voter will vote for on the basis of the given information, in order to
create an exit poll that will help predict the overall win and the seats covered by a particular
party.
Dataset for Problem: Election_Data.xlsx
Data Ingestion:
1. Read the dataset. Do the descriptive statistics and do null value condition
check. Write an inference on it.
Exploratory Data Analysis
All the variables except ‘vote’ and ‘gender’ have the int64 datatype.
However, judging by their values, all of these variables except ‘age’ are categorical
columns.
The unwanted variable “Unnamed : 0” carries no meaningful information, so it is removed before
displaying the head of the Election dataset.
The snippet below shows that the dataset has no null values.
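The ingestion checks described above can be sketched as follows, assuming pandas. A tiny invented frame stands in for Election_Data.xlsx (which would normally be loaded with `pd.read_excel`), and the column name "Unnamed: 0" is assumed to follow the pandas default spelling.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel("Election_Data.xlsx"); all values are invented.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "vote": ["Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 35, 24],
    "gender": ["female", "male", "male", "female"],
})

# Drop the leftover index column, which carries no information.
df = df.drop(columns=["Unnamed: 0"])

print(df.head())    # first rows of the dataset
print(df.dtypes)    # datatype of each column

# Count missing values per column; all zeros means there are no nulls.
null_counts = df.isnull().sum()
print(null_counts)
```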
The dataset has a few duplicate rows. They are removed because duplicates do not add any
value.
The snippet below also shows the shape of the dataset after removing the duplicates.
Converting the necessary variables to the object type, because their values are numeric
but the columns are categorical.
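A minimal sketch of the duplicate removal and type conversion, assuming pandas. The rows are invented, and `rating_cols` is a hypothetical subset of the rating-style columns.

```python
import pandas as pd

# Invented rows; the last row duplicates the first.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 43],
    "Blair": [4, 4, 4],
    "Hague": [1, 4, 1],
})

print("duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print("shape after removing duplicates:", df.shape)

# Treat the rating-style columns as categorical; 'age' stays numeric.
rating_cols = ["Blair", "Hague"]  # hypothetical subset, for illustration
df = df.astype({col: object for col in rating_cols})
print(df.dtypes)
```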
From the above snippet we can conclude that the dataset has only one integer
column, which is ‘age’.
The mean and median of the only integer column ‘age’ are almost the same, which suggests that
its distribution is roughly symmetric.
‘vote’, which is the dependent variable, has two unique values: Labour and Conservative.
‘gender’ has two unique values: male and female.
All the remaining columns are object variables, with ‘Europe’ having the most unique
values (11).
‘age’ is the only integer variable and it has no outliers. The distribution plot also shows that
the variable is approximately normally distributed.
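One way to check the outlier and symmetry claims numerically is the usual box-plot rule (values beyond 1.5 times the interquartile range). The sketch below assumes NumPy and uses invented ages, not the survey data.

```python
import numpy as np

ages = np.array([24, 35, 36, 43, 52, 61, 70, 88])  # invented ages

# Box-plot rule: points beyond 1.5 * IQR from the quartiles count as outliers.
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

# A mean close to the median points to a roughly symmetric distribution.
print("mean:", ages.mean(), "median:", np.median(ages))
print("outliers:", outliers)
```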
Plots: ‘Blair’ and ‘Hague’
Bivariate Analysis
Labour gets the most votes from both female and male voters.
Labour gets the most votes in almost all the
categories. Conservative gets slightly more votes in ‘Europe’ category ‘11’.
From the above we can see that the people who vote Conservative tend to be older.
In the variable ‘Europe’, category ‘1’ consists of older people.
Pair Plot
Heat Map
The heat map shows no notable correlation between the variables.
Data Preparation:
1. Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test (70:30).
Encoding the dataset
The variables ‘vote’ and ‘gender’ have string values. They are converted into numeric values
for modelling.
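A minimal sketch of this encoding step, assuming pandas. The rows are invented and the 0/1 assignment is an arbitrary illustrative choice, not necessarily the coding used in the report.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "male"],
})

# Explicit mappings keep the meaning of each numeric code visible.
df["vote"] = df["vote"].map({"Conservative": 0, "Labour": 1})
df["gender"] = df["gender"].map({"female": 0, "male": 1})

print(df)
```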
Scaling
We are not going to scale the data for the Logistic Regression, LDA and Naive Bayes models, as
it is not necessary.
For KNN, however, scaling is necessary because it is a distance-based algorithm
(typically based on Euclidean distance). Scaling the data gives similar weightage to all the
variables.
Splitting the data into train and test
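The 70:30 split and the scaling used for distance-based models can be sketched as below, assuming scikit-learn. The data is randomly generated, and `StandardScaler` is an assumed choice since the report does not name the scaler. The scaler is fitted on the training part only so that no test information leaks into it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data: three features on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * [10.0, 1.0, 0.1]
y = rng.integers(0, 2, size=100)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Fit the scaler on the training data, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```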
Modelling:
1. Logistic Regression.
Applying Logistic Regression and fitting the training data
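A hedged sketch of this step, assuming scikit-learn. Synthetic data from `make_classification` stands in for the encoded election data, and `max_iter=1000` is an assumed setting rather than the one used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Compare train and test accuracy to look for over- or underfitting.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print("train accuracy:", train_acc, "test accuracy:", test_acc)
print(classification_report(y_test, model.predict(X_test)))
```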
The model is neither overfitting nor underfitting. The training and testing results show that the
model performs well, with good precision and recall values.
The training and testing results show that the model performs well, with good precision and recall
values.
The LDA model is better than Logistic Regression, with better test accuracy and recall values.
2. KNN Model.
The dataset is scaled, as required, because KNN is a distance-based algorithm.
The training and testing results show that the model performs well, with good precision and recall
values.
This KNN model has good accuracy and recall values.
Naive Bayes.
Importing GaussianNB from sklearn and applying NB model
Fitting the training data
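A minimal sketch of fitting `GaussianNB`, assuming scikit-learn and using synthetic data in place of the election dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the encoded election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Accuracy on both parts, plus recall on the test part.
train_acc = nb_model.score(X_train, y_train)
test_acc = nb_model.score(X_test, y_test)
test_recall = recall_score(y_test, nb_model.predict(X_test))
print(train_acc, test_acc, test_recall)
```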
The training and testing results show that the model is neither overfitting nor underfitting.
The Naive Bayes model also performs well, with good accuracy and recall values.
Although NB and KNN have the same train and test accuracy, their recall
values on the test dataset show that KNN performs better than Naive Bayes.
Applying the KNN model and tuning the hyperparameters leaf_size and n_neighbors to find
the best model settings.
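The tuning step could look roughly like the sketch below, assuming scikit-learn's `GridSearchCV`. The grid values, the 5-fold cross-validation and the use of a scaling pipeline are illustrative assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Scaling inside the pipeline is refitted on each cross-validation fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical grid; the report does not list the values it searched.
param_grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 7, 9],
    "kneighborsclassifier__leaf_size": [10, 30, 50],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Note that in scikit-learn `leaf_size` only affects the speed and memory use of the tree-based neighbour search, so `n_neighbors` is the setting that changes the predictions.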
Basic Decision Tree classifier with gini index and random state of 1
Applying the Ada Boosting model and predicting on the train and test sets.
Applying the Gradient Boosting model with random state 1 and predicting the dependent
variable.
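A sketch of fitting both boosting models, assuming scikit-learn with default hyperparameters apart from `random_state=1`. The data is synthetic, so the scores will not match those reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "GradientBoosting": GradientBoostingClassifier(random_state=1),
}

# Store (train accuracy, test accuracy) for each model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```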
Applying Random forest, tuning the model to get the best parameters.
Applying Bagging on a Random Forest model to check its performance and predicting on the train and
test sets.
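Bagging over a random forest might be set up as in the sketch below, assuming scikit-learn. The estimator counts are small illustrative values chosen to keep the example fast, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# The random forest is the base estimator; bagging fits several copies of it
# on bootstrap samples and aggregates their predictions.
rf = RandomForestClassifier(n_estimators=20, random_state=1)
bagged_rf = BaggingClassifier(rf, n_estimators=5, random_state=1)
bagged_rf.fit(X_train, y_train)

train_acc = bagged_rf.score(X_train, y_train)
test_acc = bagged_rf.score(X_test, y_test)
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```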
The Logistic Regression model performs well with good Precision, recall and f1 score values.
Tuned KNN
The KNN model is better than the LR model, with a good train accuracy of 84% and a test accuracy of 83%,
which is also good.
The precision, recall and f1 score are the same as for Logistic Regression.
Decision Tree
The DT model is overfitted, with 100% train accuracy and 80% test accuracy.
A difference of up to 10 percentage points between train and test accuracy is treated as acceptable, but here it is 20 points. Hence the model is overfitting.
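The overfitting check can be illustrated with an unpruned decision tree, assuming scikit-learn. The data is synthetic, so only the perfect train accuracy carries over, and the 10-point threshold is the rule of thumb stated in the text rather than a formal test.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# An unpruned tree grows until it fits the training data perfectly.
tree = DecisionTreeClassifier(criterion="gini", random_state=1)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
gap = train_acc - test_acc
overfit = gap > 0.10  # rule of thumb from the text
print("train:", train_acc, "test:", test_acc, "overfit:", overfit)

# The same check applied to the accuracies reported in this document.
reported_overfit = (1.00 - 0.80) > 0.10
print("reported model overfit:", reported_overfit)
```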
Confusion Matrix and Classification report - Train
Ada Boosting
Applying Ada Boosting model and predicting the train and test.
The train and test accuracies are 84% and 82% respectively. We have seen models that perform
better than this.
Gradient Boosting
The Gradient Boosting model performs the best, with 89% train accuracy and 83% test
accuracy. The precision, recall and f1 score are also good.
Random Forest
Random forest model’s train and test accuracy scores.
The rest of the models all have more or less the same accuracy of 84%.
Inference:
1. Based on these predictions, what are the insights?
The most important variables in predicting the dependent variable are
‘Hague’ and ‘Blair’.
These are the ratings that people gave to the leaders of the ‘Conservative’ and ‘Labour’
parties respectively.
As the frequency distribution suggests, most people gave 4 stars to ‘Blair’ and a large
number of people gave 2 stars to ‘Hague’, which had an impact on the dependent variable ‘vote’.
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States of
America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words and sentences for the mentioned
documents.
(Hint: use .words(), .raw(), .sents() for extracting counts)
President Franklin D. Roosevelt’s speech has 7571 characters (including spaces) and 1360
words.
President John F. Kennedy’s speech has 7618 characters (including spaces) and 1390
words.
President Richard Nixon’s speech has 9991 characters (including spaces) and 1819 words.
President Franklin D. Roosevelt’s speech and President Richard Nixon’s speech each have 68
sentences, and
President John F. Kennedy’s speech has 52 sentences.
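The counts above come from the corpus reader's `raw`, `words` and `sents` methods. The sketch below wraps them in a hypothetical helper, `speech_counts`, and runs it on a tiny stand-in class (`TinyCorpus`, invented here) so that it works without downloading the corpus.

```python
def speech_counts(corpus, fileid):
    """Return character, word and sentence counts for one document."""
    return {
        "characters": len(corpus.raw(fileid)),
        "words": len(corpus.words(fileid)),
        "sentences": len(corpus.sents(fileid)),
    }


class TinyCorpus:
    """Hypothetical stand-in exposing the same three methods as an NLTK corpus reader."""

    text = "We the people. Ask not."

    def raw(self, fileid):
        return self.text

    def words(self, fileid):
        # Split punctuation into its own tokens, roughly as NLTK does.
        return self.text.replace(".", " .").split()

    def sents(self, fileid):
        return [s for s in self.text.split(".") if s.strip()]


counts = speech_counts(TinyCorpus(), "demo.txt")
print(counts)

# With NLTK installed and the corpus downloaded (nltk.download("inaugural")):
#   from nltk.corpus import inaugural
#   speech_counts(inaugural, "1941-Roosevelt.txt")
```

Because the NLTK reader returns punctuation marks as separate tokens, the word counts it gives include punctuation.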
2.2 Remove all the stop words from all the three speeches.
Converting all the characters to lower case and removing all the punctuation.
All the stop words have been removed from all the three speeches.
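The cleaning steps can be sketched with the standard library alone. The small `STOP_WORDS` set is an illustrative assumption; `nltk.corpus.stopwords.words("english")` would supply the fuller list, and the sample sentence is invented.

```python
import string

# Small illustrative stop-word set, not the full NLTK list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "we", "our", "not", "it", "that"}


def clean(text):
    """Lower-case the text, strip punctuation and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


tokens = clean("We renew our pledge, and the spirit of the Nation is strong.")
print(tokens)
```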
2.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the stop
words)
In the snippets below we can see the words that occurred the most number of times in each
inaugural address.
After removing the additional stop words from the above snippets and rechecking the
frequency:
Most frequently used words from President Franklin D. Roosevelt’s speech are
Nation
Democracy
Spirit
Most frequently used words from President Richard Nixon’s Speech are
Peace
World
New
America
Most frequently used words from President John F. Kennedy’s Speech are
World
New
Pledge
Power
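Frequency lists like the ones above can be produced with `collections.Counter`. In this sketch the token list and the extra stop words are invented for illustration and are not the actual speech contents.

```python
from collections import Counter

# Invented tokens standing in for a cleaned speech.
tokens = [
    "us", "nation", "spirit", "democracy", "nation", "let", "life",
    "spirit", "nation", "democracy", "us", "nation", "democracy", "us",
]

# Hypothetical additional stop words removed before counting.
extra_stop_words = {"us", "let", "know"}
filtered = [token for token in tokens if token not in extra_stop_words]

top_three = Counter(filtered).most_common(3)
print(top_three)
```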
2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stop words)
Word Cloud for President Franklin D. Roosevelt’s speech (after cleaning)