
21/11/2023, 00:01 Assignment_CarsData_Descriptive_EDA_Munjal_exercise.ipynb - Colaboratory


Problem Objective :

1. Explore and analyze the dataset.

2. Perform the necessary univariate and bivariate analysis of the features, highlighting the insights.

3. Preprocess the data: identify duplicates, missing values, outliers and bad data, and treat the missing values and outliers.

Data Dictionary

S.No. : Serial Number

Name : Name of the car which includes Brand name and Model name

Location : The city in which the car is being sold or is available for purchase

Year : Manufacturing year of the car

Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM.

Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)

Transmission : The type of transmission used by the car. (Automatic / Manual)

Owner : Type of ownership

Mileage : The standard mileage offered by the car company in kmpl or km/kg

Engine : The displacement volume of the engine in CC.

Power : The maximum power of the engine in bhp.

Seats : The number of seats in the car.

New_Price : The price of a new car of the same model in INR Lakhs.

Price : The price of the used car in INR Lakhs

Import the necessary libraries (pandas, numpy, seaborn, matplotlib, datetime) and set the display options to show a maximum of 5000 rows and 5000 columns

2 marks
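As a minimal sketch of the display-option part of this step, the pandas settings below raise the row and column limits to 5000. Only pandas is imported here; numpy, seaborn, matplotlib and datetime would be imported in the usual way alongside it.

```python
import pandas as pd

# Show up to 5000 rows and 5000 columns before pandas truncates the output.
pd.set_option("display.max_rows", 5000)
pd.set_option("display.max_columns", 5000)

print(pd.get_option("display.max_rows"), pd.get_option("display.max_columns"))
```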

Read the dataset from given csv file

2 marks

# from google.colab import drive


# drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remoun

Check the number of rows and columns in the data

2 marks

Visualize the first 10 and last 10 records of the dataset


2 marks

Check the column names of the dataset

2 marks

Describe the numerical columns in data

2 marks

Check datatype of each column

4 marks

Check the count of unique values in each column

6 marks
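The inspection steps above (shape, column names, datatypes, unique counts) can be sketched as follows. The small DataFrame is hypothetical toy data standing in for the cars dataset, which is assumed to be loaded separately.

```python
import pandas as pd

# Hypothetical toy data standing in for the cars dataset.
cars = pd.DataFrame({
    "Name": ["Maruti Wagon R", "Honda Jazz V", "Honda Jazz V"],
    "Year": [2010, 2011, 2011],
    "Price": [1.75, 4.5, 4.5],
})

n_rows, n_cols = cars.shape         # number of rows and columns
column_names = list(cars.columns)   # column labels
column_dtypes = cars.dtypes         # datatype of each column
unique_counts = cars.nunique()      # distinct non-null values per column

print(n_rows, n_cols)
print(column_names)
print(column_dtypes)
print(unique_counts)
```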

Missing values analysis

Check count of missing values in each column

2 marks

Check percentage of missing values in each column

4 marks
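One possible way to get both the count and the percentage of missing values per column is shown below on hypothetical toy data. The mean of a boolean mask is the fraction of True values, so multiplying it by 100 gives the percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries.
cars = pd.DataFrame({
    "Mileage": [26.6, np.nan, 18.2, 20.77],
    "Seats": [5.0, 5.0, np.nan, np.nan],
    "Year": [2010, 2015, 2011, 2012],
})

missing_count = cars.isnull().sum()        # missing values per column
missing_pct = cars.isnull().mean() * 100   # percentage missing per column

print(missing_count)
print(missing_pct)
```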

Drop the S.No. column, as it acts like an index

# Remove S.No. column from data

2 marks
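A minimal sketch of dropping the serial-number column, assuming the column is labelled exactly "S.No." as in the data dictionary (the toy values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; "S.No." only repeats the row index.
cars = pd.DataFrame({
    "S.No.": [0, 1, 2],
    "Year": [2010, 2015, 2011],
})

# drop() returns a new DataFrame, so the result is assigned back.
cars = cars.drop(columns=["S.No."])

print(cars.columns.tolist())
```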

Feature Engineering

Determining used car's age in years.

date.today().year gives the current year, and the age of the car is the current year minus the Year column in the dataset

6 marks
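The age calculation described above can be sketched as follows. The column name Car_Age is a hypothetical choice for illustration, and the toy years are made up.

```python
from datetime import date

import pandas as pd

# Hypothetical toy data with only the manufacturing year.
cars = pd.DataFrame({"Year": [2010, 2015, 2019]})

current_year = date.today().year
# Car_Age is an assumed column name, not one defined in the assignment.
cars["Car_Age"] = current_year - cars["Year"]

print(cars)
```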

Getting brand name and model of the car

### Get brand

4 marks

### Get model

4 marks
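One simple way to derive the two columns, assuming the brand is the first word of Name and the model is the second word (an assumption; the assignment does not spell out the rule), is sketched below on hypothetical names.

```python
import pandas as pd

# Hypothetical car names for illustration.
cars = pd.DataFrame({
    "Name": [
        "Maruti Wagon R LXI CNG",
        "Land Rover Range Rover 2.2L",
        "ISUZU D-MAX V-Cross 4X4",
    ]
})

name_parts = cars["Name"].str.split()   # split each name on whitespace
cars["Brand"] = name_parts.str[0]       # assumed: first word is the brand
cars["Model"] = name_parts.str[1]       # assumed: second word is the model

print(cars[["Brand", "Model"]])
```

With this rule a two-word brand such as Land Rover is cut to "Land", which is why some brand names need to be corrected afterwards.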

Find all the records with the brand names ISUZU, Mini and Land

Use the Brand column created in the previous step

4 marks

Replace the brand names as follows:

- ISUZU with Isuzu
- Mini with Mini Cooper
- Land with Land Rover

4 marks
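The filter and the replacements listed above can be sketched with isin() and a replacement dictionary. The Brand values below are hypothetical toy data.

```python
import pandas as pd

# Hypothetical Brand values, including the three that need fixing.
cars = pd.DataFrame({"Brand": ["ISUZU", "Mini", "Land", "Honda", "Isuzu"]})

# Records whose brand is one of the three names to inspect.
to_fix = cars[cars["Brand"].isin(["ISUZU", "Mini", "Land"])]

# A dict passed to replace() maps whole values, not substrings.
cars["Brand"] = cars["Brand"].replace({
    "ISUZU": "Isuzu",
    "Mini": "Mini Cooper",
    "Land": "Land Rover",
})

print(to_fix)
print(cars["Brand"].unique())
```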

Statistical summary

Give a summary (count, mean, std, min, 25%, 50%, 75% and max) of the numerical columns only, and transpose the results

# Numerical data summary

4 marks

Give a summary of all columns, both numerical and categorical, and transpose the results

4 marks
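Both summaries can be produced with describe(), as in this sketch on hypothetical toy data. By default describe() covers numerical columns only; include="all" adds the categorical ones, with count, unique, top and freq filled in for them.

```python
import pandas as pd

# Hypothetical toy data with one categorical and two numerical columns.
cars = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "Diesel", "CNG"],
    "Year": [2010, 2015, 2011, 2012],
    "Price": [1.75, 12.5, 4.5, 6.0],
})

num_summary = cars.describe().T                # numerical columns only
all_summary = cars.describe(include="all").T   # numerical and categorical

print(num_summary)
print(all_summary)
```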

From the statistics summary, what useful insights can you derive?


Data transformation for the Mileage, Engine and Power columns using regex.

The Power column is given as a reference example: replace bhp by ''

Replace kmpl and km/kg by '' in the Mileage column

Replace CC by '' in the Engine column

#remove bhp

cars_dataset['Power'] = cars_dataset['Power'].replace(regex='bhp',value='')

cars_dataset.head()

---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-15-25600cc4d0fb> in <cell line: 3>()
1 #remove bhp
2
----> 3 cars_dataset['Power'] = cars_dataset['Power'].replace(regex='bhp',value='')
4
5 cars_dataset.head()

NameError: name 'cars_dataset' is not defined


print("Mileage : ",cars_dataset['Mileage'].unique())
print("Engine : ",cars_dataset['Engine'].unique())
print("Power : ",cars_dataset['Power'].unique())
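Following the same pattern as the Power example, the units in Mileage and Engine might be removed as sketched below. The toy values are hypothetical, and the pattern kmpl|km/kg is one possible way to cover both mileage units in a single call.

```python
import pandas as pd

# Hypothetical raw values with units still attached.
cars = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl"],
    "Engine": ["998 CC", "1582 CC"],
})

# The regex alternation removes either mileage unit.
cars["Mileage"] = cars["Mileage"].replace(regex=r"kmpl|km/kg", value="")
cars["Engine"] = cars["Engine"].replace(regex="CC", value="")

print(cars["Mileage"].unique())
print(cars["Engine"].unique())
```

The space that sat between the number and the unit is left behind, so the values still need their whitespace stripped.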

Remove whitespaces from columns - Mileage, Engine and Power - either one by one or via a
function with for loop

# Create a function that removes extra leading
# and trailing whitespace from the data.
# Pass the dataframe and the columns as input to the function;
# str.strip() can be used for this.

cols_to_remove_space = ['Mileage', 'Engine', 'Power']

def whitespace_remover(df, cols):
    # Strip whitespace in place for each listed column
    for col in cols:
        df[col] = df[col].str.strip()

# Applying whitespace_remover function on the DataFrame


whitespace_remover(cars_dataset, cols_to_remove_space)

cars_dataset.head()

print("Mileage : ",cars_dataset['Mileage'].unique())
print("Engine : ",cars_dataset['Engine'].unique())
print("Power : ",cars_dataset['Power'].unique())

Change 'null' to np.nan in Power column and display unique values of Power column

#Change 'null' to np.nan


cars_dataset['Power'] = cars_dataset['Power'].replace(regex='null',value=np.nan)
#check unique values now
cars_dataset.Power.unique()


Change the datatypes of the Engine and Power columns to float

Use the example below

df['Mileage'] = df['Mileage'].astype('float')

4 marks
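Applying the Mileage example to Engine and Power could look like the sketch below. It assumes the units and the 'null' strings have already been removed, so every remaining value is either a numeric string or NaN; the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned values: numeric strings and NaN only.
cars = pd.DataFrame({
    "Engine": ["998", "1582", np.nan],
    "Power": ["58.16", np.nan, "88.7"],
})

for col in ["Engine", "Power"]:
    cars[col] = cars[col].astype("float")   # NaN stays NaN

print(cars.dtypes)
```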

Imputing missing values from required columns at Brand level.

Using the example below, also fill the missing values at brand level for the Mileage and Power columns

print('Imputing median values based on Brand for Mileage, Engine, Power')


# transform keeps the original row index, so the result aligns with the DataFrame
cars_dataset['Engine'] = cars_dataset.groupby(['Brand'])['Engine'].transform(lambda val: val.fillna(val.median()))

cars_dataset.describe()

6 marks
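A sketch of brand-level median imputation for Mileage and Power on hypothetical toy data follows. It uses groupby(...).transform(...), which returns a result aligned to the original rows; this is one reasonable choice rather than the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in Mileage and Power.
cars = pd.DataFrame({
    "Brand": ["Honda", "Honda", "Honda", "Audi", "Audi"],
    "Mileage": [18.0, np.nan, 20.0, 15.0, np.nan],
    "Power": [88.0, 100.0, np.nan, np.nan, 140.0],
})

for col in ["Mileage", "Power"]:
    # Fill each missing value with the median of its own brand.
    cars[col] = cars.groupby("Brand")[col].transform(
        lambda val: val.fillna(val.median())
    )

print(cars)
```

If every value of a column is missing for a brand, its median is NaN and those rows stay missing, so a second pass with an overall median may still be needed.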

Display the lists of all the numerical and categorical columns in the data in separate variables

6 marks
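One way to build the two lists is select_dtypes(), sketched here on hypothetical toy data. The variable names num_cols and cat_cols are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with mixed column types.
cars = pd.DataFrame({
    "Name": ["Maruti Wagon R", "Honda Jazz V"],
    "Fuel_Type": ["CNG", "Petrol"],
    "Year": [2010, 2011],
    "Price": [1.75, 4.5],
})

num_cols = cars.select_dtypes(include="number").columns.tolist()
cat_cols = cars.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", num_cols)
print("Categorical:", cat_cols)
```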

Plots and graphs

Matplotlib is a Python 2D plotting library used to draw basic charts

Seaborn is a Python library built on top of Matplotlib that creates and styles statistical plots in a few lines of code

Univariate analysis can be done for both Categorical and Numerical variables.

Categorical variables can be visualized using a Count plot, Bar Chart, Pie Plot, etc.

Numerical Variables can be visualized using Histogram, Box Plot, etc.

Distribution on the basis of skewness value:

Skewness = 0: the distribution is symmetric (a normal distribution is one example).

Skewness > 0: the right tail is longer and most values sit on the left, i.e. right skewed

Skewness < 0: the left tail is longer and most values sit on the right, i.e. left skewed
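The sign of the skewness can be checked numerically with the pandas skew() method. The three small series below are hypothetical and chosen only so that the direction of the tail is obvious.

```python
import pandas as pd

# Hypothetical samples: one long right tail, one symmetric, one long left tail.
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])
symmetric = pd.Series([1, 2, 3, 4, 5])
left_skewed = pd.Series([-10, -3, -2, -2, -1, -1])

skew_right = right_skewed.skew()
skew_sym = symmetric.skew()
skew_left = left_skewed.skew()

print(skew_right, skew_sym, skew_left)
```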

Univariate analysis

Numerical columns univariate analysis

6 marks

Categorical columns univariate analysis


6 marks
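The numbers behind a count plot or pie plot are the category frequencies, which can be inspected directly with value_counts(). This is a sketch on hypothetical Fuel_Type values, not a replacement for the plots.

```python
import pandas as pd

# Hypothetical categorical values.
cars = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "Diesel", "CNG", "Diesel", "Petrol"]
})

fuel_counts = cars["Fuel_Type"].value_counts()                      # bar heights of a count plot
fuel_share = cars["Fuel_Type"].value_counts(normalize=True) * 100   # slice sizes of a pie plot, in percent

print(fuel_counts)
print(fuel_share)
```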

From the univariate analysis of the numerical and categorical columns, what useful insights can you draw?

2 marks

Bivariate analysis

Bivariate analysis helps to understand how variables are related to each other, including the relationship between dependent and independent variables.

For numerical variables, pair plots and scatter plots are widely used for bivariate analysis.

A stacked bar chart can be used for categorical variables if the output variable is categorical (a class label).

Bar plots can be used if the output variable is continuous.

Draw a pairplot using seaborn library for this cars data and derive insights from that

6 marks

Bar plots can be used to show the relationship between Categorical variables and continuous variables

# Calculate the correlation of the numerical variables and store it in a variable called corr_num

4 marks
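A minimal sketch of the correlation step on hypothetical toy data is given below. Selecting the numerical columns first keeps text columns out of the calculation; corr() then returns pairwise Pearson correlations by default.

```python
import pandas as pd

# Hypothetical toy data; Fuel_Type is categorical and must be excluded.
cars = pd.DataFrame({
    "Year": [2010, 2012, 2014, 2016],
    "Kilometers_driven": [80000, 60000, 40000, 20000],
    "Price": [2.0, 3.0, 4.5, 6.0],
    "Fuel_Type": ["Petrol", "Diesel", "Diesel", "CNG"],
})

corr_num = cars.select_dtypes(include="number").corr()

print(corr_num)
```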
