You are on page 1of 30

Cars4u Project

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
Objective
● Explore and visualize the dataset.
● Build a linear regression model to predict the prices of used cars.
● Generate a set of insights and recommendations that will help the business.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 2
Data Information
The data contains information about various cars and their characteristics

Variable Description The maximum power of the engine in bhp.


Power
S.No. Serial Number The number of seats in the car.
Seats
Name of the car which includes Brand The price of a new car of the same model in
Name
name and Model name New_Price INR Lakhs.(1 Lakh = 100, 000)
The location in which the car is being
Location The price of the used car in INR Lakhs (1
sold or is available for purchase Cities
Price
Lakh = 100, 000)
Year Manufacturing year of the car

Kilometers_driven The total kilometers driven in the car by


the previous owner(s) in KM.
The type of fuel used by the car. (Petrol,
Fuel_Type Observations Variables
Diesel, Electric, CNG, LPG)

Transmission The type of transmission used by the car.


(Automatic / Manual) 7253 14
Owner Type of ownership

The standard mileage offered by the car


Mileage
company in kmpl or km/kg
The displacement volume of the engine
Engine
in CC.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 3
Exploratory Data Analysis
1. S.No. clearly has no interpretation here but as discussed earlier let us drop it
only after having looked at the initial linear model.

2. Kilometers_Driven values have an incredibly high range. We should check a


few of the extreme values to get a sense of the data.

3. Minimum and maximum number of seats in the car also warrant a quick
check. On an average a car seems to have 5 seats, which is about right.

4. We have used cars being sold at less than a lakh rupees and as high as 160
lakh, as we saw for Lamborghini earlier. We might have to drop some of these
outliers to build a robust model.

5. Min Mileage being 0 is also concerning, we'll have to check what is going on.

6. Engine and Power mean and median values are not very different. Only
someone with more domain knowledge would be able to comment further on
these attributes.

7. New price range seems right. We have both budget friendly Maruti cars and
Lamborghinis in our stock. Mean being twice that of the median suggests that
there are only a few very high range brands, which again makes sense.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 4
Exploratory Data Analysis
● Some observations / Anomalies present in data -

● Kilometers_Driven:
● An extreme value in this column is 6500000 kms for a car manufactured in 2007, which is almost impossible.
● The other observations that follow are also on a higher end.
● There is a good chance that these are outliers. We'll look at this further while doing the univariate analysis.

● Seats:
● Audi A4 having 0 seats is clearly a data entry error. This column warrants some outlier treatment or we can treat seats == 0 as a
missing value.

● Mileage:
● Some values in mileage columns are 0 which should be treated as missing values

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 5
Exploratory Data Analysis – Price

Observations: Observations:

● This is a highly skewed distribution. Let us use log ● Log Transformation has definitely helped in reducing
transformation on this column to see of that helps the skew
normalise the distribution.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 6
Exploratory Data Analysis – Kilometers_Driven

Observations: Observations:

● This is a highly skewed distribution. Let us use log ● Log Transformation has reduced the extreme
transformation on this column to see of that helps skewness.
normalise the distribution.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 7
Exploratory Data Analysis – Price vs Location
Price vs Location

Observations:

● Price of used cars has a large IQR in Coimbatore and Bangalore

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 8
Exploratory Data Analysis
● Contrary to intuition Kilometers Driven does not seem to have a
relationship with price.
● Price has a positive relationship with Year. Newer the car, higher
the price.
● S.No. does not capture any information that we were hoping for.
The temporal element of variation is captured in the year column.
● 2 seater cars are all luxury variants. Cars with 8-10 seats are
exclusively mid to high range.
● Mileage does not seem to show much relationship with the price of
used cars.
● Engine displacement and Power of the car have a positive
relationship with the price.
● New Price and Used Car Price are also positively correlated, which
is expected.
● Kilometers Driven has a peculiar relationship with the Year variable.
Generally, newer the car lesser the distance it has travelled, but this
is not always true.
● CNG cars are conspicuous outliers when it comes to Mileage. The
mileage of these cars is very high.
● Mileage and power of newer cars is increasing owing to
advancement in technology.
● Mileage has a negative correlation with engine displacement and
power. More powerful the engine, more fuel it consumes in general.
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 9
Exploratory Data Analysis – Correlation

Observations:

● Power and engine are important predictors of price


● We will have to work on imputing New Price missing values because this is a very important feature in
predicting used car price accurately
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 10
Feature Engineering – ‘Name’ column
● The Name column in the current format might not be very useful in our analysis. Since the name contains both the brand name and the model
name of the vehicle, the column would have to many unique values to be useful in prediction.
● With 2041 unique names, car names are not going to be great predictors of the price in our current data.
● But we can process this column to extract important information and see if that reduces the number of levels for this information.
● Car names are strings where the 1st work is car brand, 2nd word is car Model and the subsequent words are further details of the models.
● We extracted the first to word to see which cars are most frequent in our dataset.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 11
Feature Engineering – Car Brand
Maruti and Hyundai dominate the Indian car market followed by other popular brands like Honda and Toyota.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 12
Feature Engineering – Car Model
Althogh Maruti and Hyundai are the most dominant brand in the market, Honda’s city seems to be the leader when it comes to models.
Hyundai’s i20s are a close second

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 13
Missing Values
● 2 Electric car variants don't have entries Column Name No. of missing values
mileage_unit 2
for Mileage.
● Engine displacement information of 46
observations is missing and maximum Mileage 2
engine_num 46
power of 175 entries is missing.
● Information about number of seats is not
Engine 46
available for 53 entries. power_num 175
● New Price as we saw earlier has a huge
missing count. We'll have to see if there is Power 46
new_price_num 6247
a pattern here.
● Price is also missing for 1234 entries.
Seats 53
Since price is our response variable that price_log 1234
we want to predict, we will have to drop
these rows when we actually build a New_Price 6247
model. These rows will not be able to help
us in modelling or model evaluation. But
while we are analysing the distributions Price 1234
and doing missing value imputations, we
will keep using information from these
km_per_unit_fuel 2
rows.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 14
Missing Values Treatment
● Strategy to treat missing values in - Seats, power_num, engine_num and new_price_num
● We'll impute these missing values by taking median value of the attribute for a particular car, using the grouped values of Brand
and Model name.
● There are still NAs in engine_num, power_num, Year,km_per_unit_fuel, new_price_num and Seats because there are a few car
brands and models in our dataset that do not contain the new price information at all.
● Strategy to impute missing values in above mentioned columns - KNN imputation
● Missing values in Price - Since it is our dependent variable any kind of imputation can have an effect on the predictions, therefore
we can drop them and proceed with model building

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 15
Linear Model Building
Steps of model building:

1. Identifying dependent variable - What we want to predict is the "Price". We will use the normalised version 'price_log' for
modelling.
2. Encoding categorical features - Before we proceed to modelling, we'll have to encode categorical features. We will drop
categorical features like - Name
3. Split the data into Train and Test -We'll split the data into train and test, to be able to evaluate the performance of the model
that we build on the train data.
4. Model Building - Build a robust Linear Regression model using the train data.
5. Check Statistical Assumptions - The final regression model we build must satisfy the assumptions
6. Model Performance -Evaluate the model performance on test data

Assumptions of Linear Regression:

1. No Multicollinearity
2. Mean of residuals should be 0
3. No Heteroscedacity
4. Linearity of variables
5. Normality of error terms

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 16
Linear Regression Model to predict Price of used cars

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 17
Checking Assumptions
1. No Multicollinearity -

● We used VIF scores, to check if there is multicollinearity in the


data.
● Multicollinear features were dropped/treated till all the features
had a VIF score <5

VIF Scores of the final model -


● We have removed multicollinearity from the data now
● Fuel_Type variables are showing high vif because most cars are
either diesel and petrol.
● These two features are correlated with each other.
● We will not drop this variable from the model because this will not
affect the interpretation of other features in the model

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 18
Checking Assumptions
2. Mean of residuals should be 0 -

● Mean of residuals = 4.9733274438848885e-14


● Mean of residuals is very close to 0. The second assumption is satisfied.

3. No Heteroscedasticity-

● Goldfeldquandt Test -
● Null hypothesis : Residuals are homoscedastic
Alternate hypothesis : Residuals have hetroscedasticity
● Test Result : [('F statistic', 0.8889075306852734), ('p-value', 0.9963654482385906)]
● Since p-value > 0.05 we cannot reject the Null Hypothesis that the residuals are homoscedastic
● Assumptions 3 is also satisfied by our model.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 19
Checking Assumptions
4. Linearity of variables -

● There is no pattern in the residual vs fitted values plot.


● Assumption 4 is also satisfied by our model.

5. Normality of error terms -

● The residuals have a close to normal distribution

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 20
Linear Model
● Now that we have seen that our model follows all the linear regression assumptions. Let us use that model to draw inferences.
● Model outcomes on training data -

R-Squared Adjusted R-Squared

0.896 0.895

● Both the R-squared and Adjusted R squared of our model are high. This is a clear indication that we have been able to create a
very good model that is able to explain variance in price of used cars for upto 90%
● The model is not an underfitting model.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 21
Model Performance
Model outcomes on test data-

Data RMSE MAE MAPE

Train 5.04 2.07 22.98

Test 3.83 1.91 20.63

● Mean Absolute Error indicates that our current model is able to predict used cars prices within mean error of 1.9 lakhs on test
data.
● The units of both RMSE and MAE are same - Lakhs in this case. But RMSE is greater than MAE because it penalises the outliers
more.
● Mean Absolute Percentage Error is ~20% on the test data.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 22
Observations from the model
1. With our linear regression model we have been able to capture ~90 variation in our data.

2. The model indicates that the most significant predictors of price of used cars are -

I. The year of manufacturing


II. Number of seats in the car
III. Power of the engine
IV. Mileage
V. Kilometers Driven
VI. Location
VII. Fuel_Type
VIII. Transmission - Automatic/Manual
IX. Car Category - budget brand to ultra luxury

The p-values for these predictors are <0.05 in our final model

3. Newer cars sell for higher prices. 1 unit increase in the year of manufacture leads to [ exp(0.1170) = 1.12 Lakh ] increase in the price
of the vehicle, when everything else is constant.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 23
Observations from the model
4. As the number of seats increases, the price of the car increases - exp(0.0343) = 1.03 Lakhs

5. Mileage is inversely correlated with Price. Generally, high Mileage cars are the lower budget cars.

6. Kilometers Driven have a negative relationship with the price which is intuitive. A car that has been driven more will have more wear
and tear and hence sell at a lower price, everything else being 0.

7. The categorical variables are a little hard to interpret. But it can be seen that all the car_category variables in the dataset have a
positive relationship with the Price and the magnitude of this positive relationship increases as the brand category moves to the luxury
brands. It will not be incorrect to interpret that the dropped car_category variable for budget friendly cars would have a negative
relationship with the price (because the other 3 are increasingly positive.)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 24
Business Insights and Recommendations
1. Our final Linear Regression model has a MAPE of 20% on The test data, which means that we are able to predict
within 20% of the price value. This is a very good model and we can use this model in production.
2. Some southern markets tend to have higher prices. It might be a good strategy to plan growth in southern cities using
this information. Markets like Kolkata(coeff = -0.2) are very risky and we need to be careful about investments in this
area.
3. We will have to analyse the cost side of things before we can talk about profitability in the business. We should
gather data regarding that.
4. The next step post that would be to cluster different sets of data and see if we should make multiple models for
different locations/car types.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 25
Appendix

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 26
Data Cleaning
● Mileage, Engine, Power and New_Price are objects when they should ideally be numerical. To be able to get summary statistics
for these columns, we can extract numeric part from them.

● Mileage
● We have car mileage in two units, kmpl and km/kg.
● After a quick research on the internet it is clear that these 2 units are used for cars of 2 different fuel types.
● kmpl - kilometers per litre - is used for petrol and diesel cars.
● km/kg - kilometers per kg - is used for CNG and LPG based engines.
● Mileage column is broken up into two new columns:
i) km_per_unit_fuel
ii) mileage_unit

Mileage km_per_unit_fuel Mileage unit

17 kmpl 17 kmpl

26.60 km/kg 26.60 km/kg

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 27
Data Cleaning

● Engine

● The data dictionary suggests that Engine indicates the displacement volume of the engine in CC. We will make sure that all the
observations follow the same format - [ numeric + " " + "CC"] and create a new numeric column from this column.

Engine engine_num

998 CC 998

● Power

● The data dictionary suggests that Power indicates the maximum power of the engine in bhp. We will make sure that all the
observations follow the same format - [ numeric + " " + "bhp"] and create a new numeric column from this column, like we did for
Engine
Power power_num

58.16 bhp 58.16

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 28
Data Cleaning

● New_Price

● We know that New_Price is the price of a new car of the same model in INR Lakhs.(1 Lakh = 100, 000)
● Not all values are in Lakhs. There are a few observations that are in Crores as well, Let us convert these to lakhs. 1 Cr = 100
Lakhs

New_Price engine_num

8.61 Lakh 8.61

1.04 Cr 104.0

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 29
Happy Learning !

30
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

You might also like