
Part 1

Question 1:
Part a)
Read the data file into a pandas data frame
Answer:
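The code itself is not reproduced in this sheet, so here is a minimal sketch. The file name ("auto-mpg.csv") is an assumption, and a tiny inline CSV stands in for the real data file:

```python
import io
import pandas as pd

# Stand-in for the real file; in practice: df = pd.read_csv("auto-mpg.csv")
csv_text = """mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```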

Part b)
Identify any duplicate record(s).
Answer:
Part c)
Keeping one of the duplicated records, delete the other record(s) from the dataset.
Answer:
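A sketch of parts b) and c) on a small made-up frame (the duplicated row here is synthetic; the real answer would run this on the full data frame):

```python
import pandas as pd

# Toy frame with one duplicated record (rows 0 and 2 are identical).
df = pd.DataFrame({"mpg": [18.0, 15.0, 18.0],
                   "cylinders": [8, 8, 8],
                   "weight": [3504, 3693, 3504]})

dup_mask = df.duplicated(keep=False)   # marks every copy of a duplicated record
print(df[dup_mask])                    # part b): show the duplicate record(s)

df = df.drop_duplicates(keep="first")  # part c): keep one copy, drop the rest
print(df.shape)
```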

Part d)
What is the dimension of the data frame after removing the duplicates?
Answer:
Dimension of the data frame after removing duplicate records: (398, 9)

Question 2:
Part a):
How many missing values are in the horsepower column?
Answer:
Total missing values in the horsepower column: 6

Part b):
Remove the records having the missing values in the horsepower column.
Answer:
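A sketch of parts a) and b) on synthetic data. Note that in the raw Auto MPG file the missing horsepower entries are usually recorded as "?", so passing na_values="?" to read_csv is typically needed first (an assumption here, since the loading code isn't shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [130.0, np.nan, 150.0, np.nan],
                   "mpg": [18.0, 15.0, 18.0, 16.0]})

print(df["horsepower"].isna().sum())   # part a): count of missing values
df = df.dropna(subset=["horsepower"])  # part b): drop those records
print(df.shape)
```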
Part c):
Take 10% of the available records as a test set and set the horsepower to null for those records.
Answer:
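One way this could be done (the synthetic frame and random_state are illustrative choices, not the original code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"horsepower": rng.uniform(50, 200, size=100),
                   "mpg": rng.uniform(10, 40, size=100)})

test = df.sample(frac=0.10, random_state=42)   # 10% held-out records
train = df.drop(test.index)                    # remaining 90% training records
true_hp = test["horsepower"].copy()            # keep the truth for the RMSE later
df.loc[test.index, "horsepower"] = np.nan      # set horsepower to null
print(len(test), df["horsepower"].isna().sum())
```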

Part d):
Fill in the missing values of the test set based on the mean and median of the horsepower of
the training set (90%). Calculate the RMSEs for the imputed values of the test set.
Answer:
RMSE for mean-imputed test horsepower: 28.68627757007431
RMSE for median-imputed test horsepower: 92.0
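A sketch of the imputation-RMSE computation on synthetic train/test samples (the data is a stand-in; the real answer uses the 90%/10% split from part c):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
train = pd.Series(rng.uniform(50, 200, size=90))      # training horsepower (90%)
true_test = pd.Series(rng.uniform(50, 200, size=10))  # held-out true values (10%)

def rmse(pred, truth):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2)))

# Impute every test value with the training mean / median, then score.
rmse_mean = rmse(np.full(len(true_test), train.mean()), true_test)
rmse_median = rmse(np.full(len(true_test), train.median()), true_test)
print(rmse_mean, rmse_median)
```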

Part e):
Using the same method, find the RMSEs if scikit-learn's KNNImputer (for n_neighbors 1, 3 and 5) is used with the weight, acceleration, displacement and mpg features. Decide whether you need to standardise the data.
Answer:
RMSE of mean imputation -> 28.68627757007431
RMSE of median imputation -> 92.0
RMSE of KNN with n_neighbors=1 -> 105.4226231257407
RMSE of KNN with n_neighbors=3 -> 105.40709369343823
RMSE of KNN with n_neighbors=5 -> 105.40709369343823
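A sketch of the KNN imputation step on synthetic data. Because the real features are on very different scales (weight in the thousands, acceleration around 10-25), standardising before the KNN distance computation is advisable:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
cols = ["horsepower", "weight", "acceleration", "displacement", "mpg"]
df = pd.DataFrame(rng.uniform(0, 1, size=(50, 5)), columns=cols)
df.loc[:4, "horsepower"] = np.nan      # pretend these are the test-set rows

# StandardScaler ignores NaNs when fitting and passes them through on transform.
scaled = StandardScaler().fit_transform(df)
for k in (1, 3, 5):
    imputed = KNNImputer(n_neighbors=k).fit_transform(scaled)
    print(k, imputed[:5, 0])           # imputed (standardised) horsepower values
```

The imputed values come back on the standardised scale; to compare RMSEs with the mean/median approach they would need to be transformed back.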

Part f):
Use the best solution to fill the missing values in the horsepower column. What are the filled values?
Answer:
The missing values can be imputed with the mean, the median, or KNN. Here mean imputation gives the lowest RMSE (28.69, versus 92.0 for the median and about 105.4 for KNN), so the missing values are filled with the training-set mean.
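A minimal sketch of the fill step, assuming mean imputation was chosen (toy values stand in for the six real missing records):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [130.0, np.nan, 150.0, np.nan, 95.0]})

# Mean imputation had the lowest RMSE, so fill with the column mean.
fill = df["horsepower"].mean()
df["horsepower"] = df["horsepower"].fillna(fill)
print(fill, df["horsepower"].tolist())
```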

Question 3:
Part a):
What are the kurtosis and skewness values of the mpg attribute? Draw the histogram using
the seaborn distplot function.
Answer:
Skew of mpg data: 0.4627614738024299
Kurtosis of mpg data: -0.5164809219117741
Part b):
Identify outliers of mpg using the Inter Quartile Range (IQR) approach and impute them with min and max values appropriately.
Answer:

IQR value of mpg is: 12.0
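A sketch of the IQR outlier step on toy values (the real column gives the IQR of 12.0 reported above). Values beyond the 1.5xIQR fences are imputed with the fence (min/max) values via clip:

```python
import pandas as pd

mpg = pd.Series([9.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 55.0])

q1, q3 = mpg.quantile(0.25), mpg.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = mpg[(mpg < lower) | (mpg > upper)]
print("IQR:", iqr, "outliers:", outliers.tolist())

# Impute outliers with the fence values (winsorise).
mpg_clean = mpg.clip(lower=lower, upper=upper)
```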


Part c):
Transform the mpg column using the loge(x+1) formula to make the mpg values follow the normal distribution.
Answer
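A one-line sketch of the transform; numpy's log1p computes loge(x+1) directly and is numerically stable for small x:

```python
import numpy as np
import pandas as pd

mpg = pd.Series([9.0, 14.0, 18.0, 23.0, 29.0, 36.0, 44.0])  # toy stand-in

mpg_log = np.log1p(mpg)   # log_e(x + 1)
print(mpg_log.round(3).tolist())
```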

Part d):

Use a QQ-plot to show that loge(x+1) is a better transformation for mpg. Find the kurtosis and skewness of mpg after the transformation.
Answer
Skew of mpg_log data: -0.10088917191646131
Kurtosis of mpg_log data: -0.830610315221898
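A sketch of the QQ-plot comparison using scipy's probplot on a synthetic right-skewed stand-in; on the transformed column the points should fall much closer to the reference line:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(4)
mpg = pd.Series(rng.gamma(5.0, 4.0, size=398))  # stand-in for the mpg column
mpg_log = np.log1p(mpg)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(mpg, dist="norm", plot=ax1)      # raw mpg vs normal quantiles
stats.probplot(mpg_log, dist="norm", plot=ax2)  # transformed mpg
ax1.set_title("mpg")
ax2.set_title("log1p(mpg)")

print(mpg_log.skew(), mpg_log.kurt())  # should sit closer to 0 than the raw column
```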

Part e):
Similarly detect and correct outliers in the weight, displacement, horsepower and acceleration
columns.
Answer
There are no outliers in the weight column.

#finding displacement outliers


There are a total of 173 outliers in the displacement column.
#finding horsepower outliers
There are a total of 10 outliers in the horsepower column.

#finding acceleration outliers


There are a total of 10 outliers in the acceleration column.
Part f):
Display the correlation matrix using the seaborn heatmap function between continuous
variables; mpg, horsepower, weight, displacement, and acceleration.
Answer:
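A sketch of the correlation heatmap (random data stands in for the real columns; on the actual dataset horsepower, weight and displacement would show strong positive correlations with each other and negative correlations with mpg):

```python
import matplotlib
matplotlib.use("Agg")
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(5)
cols = ["mpg", "horsepower", "weight", "displacement", "acceleration"]
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=cols)

corr = df[cols].corr()                 # pairwise Pearson correlations
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
print(corr.shape)
```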

Question 4:
Part a):
Transform the categorical variables 'cylinders' and 'origin' using one-hot encoding.
Answer

#Transforming 'cylinders' using one-hot encoding

#Transforming 'origin' using one-hot encoding
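A sketch of both transforms using pandas get_dummies (toy category values; the real columns have cylinders in {3,4,5,6,8} and origin in {1,2,3}):

```python
import pandas as pd

df = pd.DataFrame({"cylinders": [4, 6, 8, 4], "origin": [1, 2, 3, 1]})

cyl_ohe = pd.get_dummies(df["cylinders"], prefix="cyl")     # one column per value
org_ohe = pd.get_dummies(df["origin"], prefix="origin")
df = pd.concat([df, cyl_ohe, org_ohe], axis=1)
print(list(df.columns))
```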


Part b):
Calculate correlation matrices for 1) one-hot encoded 'cylinders' with mpg, and 2) one-hot encoded 'origin' with mpg.
Answer
#correlation matrices for one-hot encoded 'cylinders' with mpg
#correlation matrices for one-hot encoded 'origin' with mpg
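A sketch for the 'cylinders' case (the 'origin' case is identical in shape); each dummy column gets its own correlation coefficient with mpg:

```python
import pandas as pd

df = pd.DataFrame({"mpg": [30.0, 20.0, 14.0, 32.0],
                   "cylinders": [4, 6, 8, 4]})

ohe = pd.get_dummies(df["cylinders"], prefix="cyl").astype(float)
corr = pd.concat([ohe, df["mpg"]], axis=1).corr()
print(corr["mpg"])   # correlation of each dummy column with mpg
```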

Part c):
Discuss the correlation coefficient values in part b.
Answer
#correlation coefficient values of cylinders with mpg

#correlation coefficient values of origin with mpg


Part d):
Use the label encoder technique to find the correlations between 'cylinders' and mpg, and 'origin' and mpg.
Answer
#correlations between 'cylinders' and mpg using the label encoder technique

#correlations between 'origin' and mpg using the label encoder technique
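A sketch of the label-encoder approach; each categorical column becomes a single integer column, giving one correlation coefficient per variable:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"mpg": [30.0, 20.0, 14.0, 32.0],
                   "cylinders": [4, 6, 8, 4],
                   "origin": [3, 1, 1, 2]})

# LabelEncoder maps each distinct value to an integer code.
df["cyl_le"] = LabelEncoder().fit_transform(df["cylinders"])
df["origin_le"] = LabelEncoder().fit_transform(df["origin"])
print(df["mpg"].corr(df["cyl_le"]), df["mpg"].corr(df["origin_le"]))
```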


Part e):
Which encoder is better (label encoder or one-hot encoder)?
Answer
The one-hot encoder is better here. Its output is binary rather than ordinal, so it does not impose an artificial ordering on the categories, and each category gets its own dimension in an orthogonal vector space.

Question 5:
Part a):
Categorize cars into three classes based on fuel efficiency (mpg): low, medium, and high.
Use equal frequency (i.e. number of cars) categorization.
Answer
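A sketch of equal-frequency categorisation with pd.qcut, which puts (roughly) the same number of cars into each class (synthetic mpg values stand in for the real column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
mpg = pd.Series(rng.uniform(9, 47, size=99))  # stand-in for the mpg column

# qcut bins by quantiles, so each class holds ~the same number of cars.
fuel_class = pd.qcut(mpg, q=3, labels=["low", "medium", "high"])
print(fuel_class.value_counts())
```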

Part b):
Use PCA to reduce the dimensionality of the correlated features: weight, acceleration, displacement, and horsepower. (Hint: use the PCA class from sklearn.decomposition)
Answer
#Showing the dimensionality reduction of the correlated features using a heatmap

#Checking the correlation between features without PCA

We can observe from the heatmap above that weight, acceleration, displacement, and horsepower are highly correlated.
Thus, we evidently need to apply dimensionality reduction.

#Checking the correlation between features after PCA


The heatmap above plainly shows that there is no association between the derived principal components (PC1 and PC2).
As a result, we have moved from a higher-dimensional feature space to a lower-dimensional one while keeping the correlation between the obtained PCs as low as possible.
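The PCA step described above can be sketched as follows (synthetic correlated columns stand in for the real features; PCA is scale-sensitive, hence the StandardScaler):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 200
weight = rng.normal(3000, 500, n)
df = pd.DataFrame({
    "weight": weight,
    "displacement": 0.06 * weight + rng.normal(0, 20, n),  # correlated with weight
    "horsepower": 0.03 * weight + rng.normal(0, 10, n),
    "acceleration": -0.002 * weight + rng.normal(0, 1, n),
})

X = StandardScaler().fit_transform(df)      # put all features on the same scale
pca = PCA(n_components=2)
pcs = pd.DataFrame(pca.fit_transform(X), columns=["PC1", "PC2"])
print(pca.explained_variance_ratio_)
print(pcs.corr().round(6))                  # PCs are uncorrelated by construction
```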

Part c):
Using a scatter plot, display the differences of three fuel efficiency classes with the first two
principal components (PCs).
Answer
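A sketch of the class-coloured scatter plot. In the real answer the classes come from the equal-frequency categorisation in part a); here, purely as a stand-in, the classes are derived from PC1 so the example is self-contained:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
pcs = pd.DataFrame({"PC1": rng.normal(size=99), "PC2": rng.normal(size=99)})
# Stand-in classes; the real code would use the low/medium/high mpg classes.
fuel_class = pd.qcut(pcs["PC1"], q=3, labels=["low", "medium", "high"])

fig, ax = plt.subplots()
for label, group in pcs.groupby(fuel_class, observed=True):
    ax.scatter(group["PC1"], group["PC2"], label=str(label))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```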
