
Part 1

Question 1:
Part a)
Read the data file into a pandas data frame
Answer:
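The code itself is not reproduced in this sheet, so here is a minimal sketch. The file name ("auto-mpg.csv") is an assumption, and a tiny inline CSV stands in for the real data file:

```python
import io
import pandas as pd

# Stand-in for the real file; in practice: df = pd.read_csv("auto-mpg.csv")
csv_text = """mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```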

Part b)
Identify any duplicate record(s).
Answer:
Part c)
Keeping one of the duplicated records, delete the other record(s) from the dataset.
Answer:
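A sketch of parts b) and c) on a small made-up frame (the duplicated row here is synthetic; the real answer would run this on the full data frame):

```python
import pandas as pd

# Toy frame with one duplicated record (rows 0 and 2 are identical).
df = pd.DataFrame({"mpg": [18.0, 15.0, 18.0],
                   "cylinders": [8, 8, 8],
                   "weight": [3504, 3693, 3504]})

dup_mask = df.duplicated(keep=False)   # marks every copy of a duplicated record
print(df[dup_mask])                    # part b): show the duplicate record(s)

df = df.drop_duplicates(keep="first")  # part c): keep one copy, drop the rest
print(df.shape)
```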

Part d)
What is the dimension of the data frame after removing the duplicates?
Answer:
Dimension of the data frame after removing duplicate records: (398, 9)

Question 2:
Part a):
How many missing values are in the horsepower column?
Answer:
Total missing values in the horsepower column: 6

Part b):
Remove the records having the missing values in the horsepower column.
Answer:
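A sketch of parts a) and b) on synthetic data. Note that in the raw Auto MPG file the missing horsepower entries are usually recorded as "?", so passing na_values="?" to read_csv is typically needed first (an assumption here, since the loading code isn't shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [130.0, np.nan, 150.0, np.nan],
                   "mpg": [18.0, 15.0, 18.0, 16.0]})

print(df["horsepower"].isna().sum())   # part a): count of missing values
df = df.dropna(subset=["horsepower"])  # part b): drop those records
print(df.shape)
```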
Part c):
Take 10% of the available records as a test set and set the horsepower to null for those records.
Answer:
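One way this could be done (the synthetic frame and random_state are illustrative choices, not the original code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"horsepower": rng.uniform(50, 200, size=100),
                   "mpg": rng.uniform(10, 40, size=100)})

test = df.sample(frac=0.10, random_state=42)   # 10% held-out records
train = df.drop(test.index)                    # remaining 90% training records
true_hp = test["horsepower"].copy()            # keep the truth for the RMSE later
df.loc[test.index, "horsepower"] = np.nan      # set horsepower to null
print(len(test), df["horsepower"].isna().sum())
```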

Part d):
Fill in the missing values of the test set based on the mean and median of the horsepower of
the training set (90%). Calculate the RMSEs for the imputed values of the test set.
Answer:
RMSE for mean-imputed test horsepower: 28.68627757007431
RMSE for median-imputed test horsepower: 92.0
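A sketch of the imputation-RMSE computation on synthetic train/test samples (the data is a stand-in; the real answer uses the 90%/10% split from part c):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
train = pd.Series(rng.uniform(50, 200, size=90))      # training horsepower (90%)
true_test = pd.Series(rng.uniform(50, 200, size=10))  # held-out true values (10%)

def rmse(pred, truth):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2)))

# Impute every test value with the training mean / median, then score.
rmse_mean = rmse(np.full(len(true_test), train.mean()), true_test)
rmse_median = rmse(np.full(len(true_test), train.median()), true_test)
print(rmse_mean, rmse_median)
```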

Part e):
Using the same method, find the RMSEs if scikit-learn's KNNImputer (for n_neighbors 1, 3 and 5) is used with the weight, acceleration, displacement and mpg features. Decide whether you need to standardise the data.
Answer:
RMSE of mean imputation -> 28.68627757007431
RMSE of median imputation -> 92.0
RMSE of KNN with n_neighbors=1 -> 105.4226231257407
RMSE of KNN with n_neighbors=3 -> 105.40709369343823
RMSE of KNN with n_neighbors=5 -> 105.40709369343823
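A sketch of the KNN imputation step on synthetic data. Because the real features are on very different scales (weight in the thousands, acceleration around 10-25), standardising before the KNN distance computation is advisable:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
cols = ["horsepower", "weight", "acceleration", "displacement", "mpg"]
df = pd.DataFrame(rng.uniform(0, 1, size=(50, 5)), columns=cols)
df.loc[:4, "horsepower"] = np.nan      # pretend these are the test-set rows

# StandardScaler ignores NaNs when fitting and passes them through on transform.
scaled = StandardScaler().fit_transform(df)
for k in (1, 3, 5):
    imputed = KNNImputer(n_neighbors=k).fit_transform(scaled)
    print(k, imputed[:5, 0])           # imputed (standardised) horsepower values
```

The imputed values come back on the standardised scale; to compare RMSEs with the mean/median approach they would need to be transformed back.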

Part f):
Use the best solution to fill the missing values in the horsepower column. What are the filled values?
Answer:
The missing values can be imputed with the mean, the median, or KNN. Here mean imputation gives the lowest RMSE (28.69, versus 92.0 for the median and about 105.4 for KNN), so the missing values are filled with the training-set mean.
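A minimal sketch of the fill step, assuming mean imputation was chosen (toy values stand in for the six real missing records):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [130.0, np.nan, 150.0, np.nan, 95.0]})

# Mean imputation had the lowest RMSE, so fill with the column mean.
fill = df["horsepower"].mean()
df["horsepower"] = df["horsepower"].fillna(fill)
print(fill, df["horsepower"].tolist())
```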

Question 3:
Part a):
What are the kurtosis and skewness values of the mpg attribute? Draw the histogram using
the seaborn distplot function.
Answer:
Skew of mpg data: 0.4627614738024299
Kurtosis of mpg data: -0.5164809219117741
Part b):
Identify outliers of mpg using the Inter Quartile Range (IQR) approach and impute them with min and max values appropriately.
Answer:

IQR value of mpg is: 12.0
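A sketch of the IQR outlier step on toy values (the real column gives the IQR of 12.0 reported above). Values beyond the 1.5xIQR fences are imputed with the fence (min/max) values via clip:

```python
import pandas as pd

mpg = pd.Series([9.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 55.0])

q1, q3 = mpg.quantile(0.25), mpg.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = mpg[(mpg < lower) | (mpg > upper)]
print("IQR:", iqr, "outliers:", outliers.tolist())

# Impute outliers with the fence values (winsorise).
mpg_clean = mpg.clip(lower=lower, upper=upper)
```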


Part c):
Transform the mpg column using the loge(x+1) formula to make the mpg values follow the normal distribution.
Answer
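A one-line sketch of the transform; numpy's log1p computes loge(x+1) directly and is numerically stable for small x:

```python
import numpy as np
import pandas as pd

mpg = pd.Series([9.0, 14.0, 18.0, 23.0, 29.0, 36.0, 44.0])  # toy stand-in

mpg_log = np.log1p(mpg)   # log_e(x + 1)
print(mpg_log.round(3).tolist())
```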

Part d):

Use a QQ-plot to show that loge(x+1) is a better transformation for mpg. Find the kurtosis and skewness of mpg after the transformation.
Answer
Skew of mpg_log data: -0.10088917191646131
Kurtosis of mpg_log data: -0.830610315221898
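A sketch of the QQ-plot comparison using scipy's probplot on a synthetic right-skewed stand-in; on the transformed column the points should fall much closer to the reference line:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(4)
mpg = pd.Series(rng.gamma(5.0, 4.0, size=398))  # stand-in for the mpg column
mpg_log = np.log1p(mpg)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(mpg, dist="norm", plot=ax1)      # raw mpg vs normal quantiles
stats.probplot(mpg_log, dist="norm", plot=ax2)  # transformed mpg
ax1.set_title("mpg")
ax2.set_title("log1p(mpg)")

print(mpg_log.skew(), mpg_log.kurt())  # should sit closer to 0 than the raw column
```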

Part e):
Similarly detect and correct outliers in the weight, displacement, horsepower and acceleration
columns.
Answer
There are no outliers in the weight column.

#finding displacement outliers


There are a total of 173 outliers in the displacement column.
#finding horsepower outliers
There are a total of 10 outliers in the horsepower column.

#finding acceleration outliers


There are a total of 10 outliers in the acceleration column.
Part f):
Display the correlation matrix using the seaborn heatmap function between continuous
variables; mpg, horsepower, weight, displacement, and acceleration.
Answer:
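A sketch of the correlation heatmap (random data stands in for the real columns; on the actual dataset horsepower, weight and displacement would show strong positive correlations with each other and negative correlations with mpg):

```python
import matplotlib
matplotlib.use("Agg")
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(5)
cols = ["mpg", "horsepower", "weight", "displacement", "acceleration"]
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=cols)

corr = df[cols].corr()                 # pairwise Pearson correlations
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
print(corr.shape)
```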

Question 4:
Part a):
Transform the categorical variables 'cylinders' and 'origin' using one-hot encoding.
Answer

#Transforming 'cylinders' using one-hot encoding

#Transforming 'origin' using one-hot encoding
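A sketch of both transforms using pandas get_dummies (toy category values; the real columns have cylinders in {3,4,5,6,8} and origin in {1,2,3}):

```python
import pandas as pd

df = pd.DataFrame({"cylinders": [4, 6, 8, 4], "origin": [1, 2, 3, 1]})

cyl_ohe = pd.get_dummies(df["cylinders"], prefix="cyl")     # one column per value
org_ohe = pd.get_dummies(df["origin"], prefix="origin")
df = pd.concat([df, cyl_ohe, org_ohe], axis=1)
print(list(df.columns))
```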


Part b):
Calculate correlation matrices for 1) one-hot encoded 'cylinders' with mpg, and 2) one-hot encoded 'origin' with mpg.
Answer
#correlation matrices for one-hot encoded 'cylinders' with mpg
#correlation matrices for one-hot encoded 'origin' with mpg
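A sketch for the 'cylinders' case (the 'origin' case is identical in shape); each dummy column gets its own correlation coefficient with mpg:

```python
import pandas as pd

df = pd.DataFrame({"mpg": [30.0, 20.0, 14.0, 32.0],
                   "cylinders": [4, 6, 8, 4]})

ohe = pd.get_dummies(df["cylinders"], prefix="cyl").astype(float)
corr = pd.concat([ohe, df["mpg"]], axis=1).corr()
print(corr["mpg"])   # correlation of each dummy column with mpg
```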

Part c):
Discuss the correlation coefficient values in part b.
Answer
#correlation coefficient values of cylinders with mpg

#correlation coefficient values of origin with mpg


Part d):
Use the label encoder technique to find the correlations between 'cylinders' and mpg, and 'origin' and mpg.
Answer
#correlations between 'cylinders' and mpg using the label encoder technique

#correlations between 'origin' and mpg using the label encoder technique
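A sketch of the label-encoder approach; each categorical column becomes a single integer column, giving one correlation coefficient per variable:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"mpg": [30.0, 20.0, 14.0, 32.0],
                   "cylinders": [4, 6, 8, 4],
                   "origin": [3, 1, 1, 2]})

# LabelEncoder maps each distinct value to an integer code.
df["cyl_le"] = LabelEncoder().fit_transform(df["cylinders"])
df["origin_le"] = LabelEncoder().fit_transform(df["origin"])
print(df["mpg"].corr(df["cyl_le"]), df["mpg"].corr(df["origin_le"]))
```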


Part e):
Which encoder is better (label encoder or one-hot encoder)?
Answer
The one-hot encoder is better here. Its output is binary rather than ordinal, so it does not impose an artificial ordering on the categories, and each category gets its own dimension in an orthogonal vector space.

Question 5:
Part a):
Categorize cars into three classes based on fuel efficiency (mpg): low, medium, and high.
Use equal frequency (i.e. number of cars) categorization.
Answer
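A sketch of equal-frequency categorisation with pd.qcut, which puts (roughly) the same number of cars into each class (synthetic mpg values stand in for the real column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
mpg = pd.Series(rng.uniform(9, 47, size=99))  # stand-in for the mpg column

# qcut bins by quantiles, so each class holds ~the same number of cars.
fuel_class = pd.qcut(mpg, q=3, labels=["low", "medium", "high"])
print(fuel_class.value_counts())
```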

Part b):
Use PCA to reduce the dimensionality of the correlated features: weight, acceleration, displacement, and horsepower. (Hint: use the PCA class from sklearn.decomposition)
Answer
#Showing the dimensionality reduction of the correlated features using a heatmap

#Checking the correlation between features without PCA

We can observe from the heatmap above that weight, acceleration, displacement, and horsepower are highly correlated.
Thus, we evidently need to apply dimensionality reduction.

#Checking the correlation between features after PCA


The heatmap above plainly shows that there is no association between the derived principal components (PC1 and PC2).
As a result, we have moved from a higher-dimensional feature space to a lower-dimensional one while keeping the correlation between the obtained PCs as low as possible.
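The PCA step described above can be sketched as follows (synthetic correlated columns stand in for the real features; PCA is scale-sensitive, hence the StandardScaler):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 200
weight = rng.normal(3000, 500, n)
df = pd.DataFrame({
    "weight": weight,
    "displacement": 0.06 * weight + rng.normal(0, 20, n),  # correlated with weight
    "horsepower": 0.03 * weight + rng.normal(0, 10, n),
    "acceleration": -0.002 * weight + rng.normal(0, 1, n),
})

X = StandardScaler().fit_transform(df)      # put all features on the same scale
pca = PCA(n_components=2)
pcs = pd.DataFrame(pca.fit_transform(X), columns=["PC1", "PC2"])
print(pca.explained_variance_ratio_)
print(pcs.corr().round(6))                  # PCs are uncorrelated by construction
```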

Part c):
Using a scatter plot, display the differences of three fuel efficiency classes with the first two
principal components (PCs).
Answer
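A sketch of the class-coloured scatter plot. In the real answer the classes come from the equal-frequency categorisation in part a); here, purely as a stand-in, the classes are derived from PC1 so the example is self-contained:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
pcs = pd.DataFrame({"PC1": rng.normal(size=99), "PC2": rng.normal(size=99)})
# Stand-in classes; the real code would use the low/medium/high mpg classes.
fuel_class = pd.qcut(pcs["PC1"], q=3, labels=["low", "medium", "high"])

fig, ax = plt.subplots()
for label, group in pcs.groupby(fuel_class, observed=True):
    ax.scatter(group["PC1"], group["PC2"], label=str(label))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```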
