Assignment - CarsData - Descriptive - EDA - Munjal - Exercise - Ipynb - Colaboratory

21/11/2023, 00:01 Assignment_CarsData_Descriptive_EDA_Munjal_exercise.ipynb - Colaboratory
Problem Objective:
1. Explore and analyze the dataset.
2. Perform the necessary univariate and bivariate analysis of the features, highlighting the insights.
3. Preprocess the data: find duplicates, missing values (and treat them), outliers (and treat them), and bad data.
Data Dictionary
S.No. : Serial Number
Name : Name of the car which includes Brand name and Model name
Location : The city in which the car is being sold or is available for purchase
Year : Manufacturing year of the car
Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM.
Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
Transmission : The type of transmission used by the car. (Automatic / Manual)
Owner : Type of ownership
Mileage : The standard mileage offered by the car company in kmpl or km/kg
Engine : The displacement volume of the engine in CC.
Power : The maximum power of the engine in bhp.
Seats : The number of seats in the car.
New_Price : The price of a new car of the same model in INR Lakhs.
Price : The price of the used car in INR Lakhs
Import necessary libraries - pandas, numpy, seaborn, matplotlib, datetime - and set the option to display a maximum of 5000 rows and 5000 columns
2 marks
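A minimal sketch of the display-option part of this step, assuming a recent pandas. The plotting imports (seaborn, matplotlib) are left out here to keep the focus on the two options.

```python
from datetime import date

import numpy as np
import pandas as pd

# Show up to 5000 rows and 5000 columns when a DataFrame is displayed
pd.set_option("display.max_rows", 5000)
pd.set_option("display.max_columns", 5000)

print(pd.get_option("display.max_rows"), pd.get_option("display.max_columns"))
```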
Read the dataset from given csv file
2 marks
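One way to read the file is `pd.read_csv`. The sketch below reads from an in-memory string so that it runs on its own; the two rows are a hypothetical sample, not the assignment's data, and with the real file the call would take the path to the csv instead.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the assignment's csv file
csv_text = """S.No.,Name,Location,Year,Kilometers_driven,Fuel_Type
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel
"""

# pd.read_csv accepts a file path or any file-like object
cars_dataset = pd.read_csv(io.StringIO(csv_text))
print(cars_dataset.shape)
```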
# from google.colab import drive

# drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remoun
Check the number of rows and column in data
2 marks
Visualize the first 10 and last 10 records of the dataset
2 marks
Check the column names of the dataset
2 marks
Describe the numerical columns in data
2 marks
Check datatype of each column
4 marks
Check for Unique values' count in each column
6 marks
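The inspection steps above (shape, first and last records, column names, numerical description, datatypes, unique counts) can each be done with a single pandas call. A sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical sample; the real data has the columns from the data dictionary
cars_dataset = pd.DataFrame({
    "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz", "Maruti Ertiga"],
    "Year": [2010, 2015, 2011, 2012],
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel"],
    "Price": [1.75, 12.5, 4.5, 6.0],
})

n_rows, n_cols = cars_dataset.shape        # number of rows and columns
first_rows = cars_dataset.head(10)         # first 10 records (fewer if the frame is shorter)
last_rows = cars_dataset.tail(10)          # last 10 records
column_names = list(cars_dataset.columns)  # column names
numeric_summary = cars_dataset.describe()  # numerical columns only, by default
column_types = cars_dataset.dtypes         # datatype of each column
unique_counts = cars_dataset.nunique()     # distinct non-null values per column

print(n_rows, n_cols)
print(unique_counts)
```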
Missing values analysis
Check count of missing values in each column
2 marks
Check percentage of missing values in each column
4 marks
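A sketch of both missing-value checks, using a hypothetical sample. `isnull()` returns a boolean frame, so its column sum is the count of missing values and its column mean times 100 is the percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some missing entries
cars_dataset = pd.DataFrame({
    "Mileage": ["26.6 km/kg", np.nan, "18.2 kmpl", "20.77 kmpl"],
    "New_Price": [np.nan, np.nan, "8.61 Lakh", np.nan],
    "Price": [1.75, 12.5, 4.5, 6.0],
})

missing_count = cars_dataset.isnull().sum()           # missing values per column
missing_percent = cars_dataset.isnull().mean() * 100  # share of missing values per column, in %

print(missing_count)
print(missing_percent)
```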
Drop the S.No. column, as it acts like an index
# Remove S.No. column from data
2 marks
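A short sketch of the drop, on a hypothetical sample. `drop(columns=...)` returns a new frame, so the result is assigned back.

```python
import pandas as pd

# Hypothetical sample with a serial-number column
cars_dataset = pd.DataFrame({
    "S.No.": [0, 1, 2],
    "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz"],
    "Price": [1.75, 12.5, 4.5],
})

# Remove the S.No. column; the default index already identifies each row
cars_dataset = cars_dataset.drop(columns=["S.No."])
print(cars_dataset.columns.tolist())
```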
Feature Engineering
Determine the used car's age in years.
date.today().year gives the current year, and
the age of the car is the current year minus the Year column in the dataset
6 marks
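A sketch of the age calculation on a hypothetical sample. The column name `Car_Age` is an assumption made for illustration; the assignment does not name the new column.

```python
from datetime import date

import pandas as pd

# Hypothetical sample of manufacturing years
cars_dataset = pd.DataFrame({"Year": [2010, 2015, 2011]})

current_year = date.today().year
# Car_Age is a hypothetical column name: age = current year - manufacturing year
cars_dataset["Car_Age"] = current_year - cars_dataset["Year"]
print(cars_dataset)
```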
Getting brand name and model of the car
### Get brand
4 marks
### Get model
4 marks
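A sketch of one way to split the Name column, assuming the brand is the first word of the name and the model is the second. Both are assumptions about the data; they explain why a two-word brand such as Land Rover shows up as just "Land". The `Brand` column name matches the one used later in the notebook, while `Model` is a hypothetical name.

```python
import pandas as pd

# Hypothetical sample of car names
cars_dataset = pd.DataFrame({
    "Name": [
        "Maruti Wagon R LXI CNG",
        "Land Rover Range Rover 2.2L Pure",
        "Mini Cooper Convertible S",
    ]
})

# Split each name on whitespace into a list of words
name_parts = cars_dataset["Name"].str.split()
cars_dataset["Brand"] = name_parts.str[0]  # first word
cars_dataset["Model"] = name_parts.str[1]  # second word
print(cars_dataset[["Brand", "Model"]])
```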
Find all the records whose brand name is ISUZU, Mini or Land
Use the brand column that you just created in the step above
4 marks
Replace the brand names as follows:
- ISUZU with Isuzu
- Mini with Mini Cooper
- Land with Land Rover
4 marks
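A sketch of both brand steps on a hypothetical sample: `isin` selects the rows with the three brand names, and `replace` with a dict swaps whole values only, so an existing "Isuzu" entry is left untouched.

```python
import pandas as pd

# Hypothetical sample of the Brand column
cars_dataset = pd.DataFrame({"Brand": ["ISUZU", "Mini", "Land", "Maruti", "Isuzu"]})

# Rows whose brand is one of the three names to be cleaned up
mask = cars_dataset["Brand"].isin(["ISUZU", "Mini", "Land"])
matching_rows = cars_dataset[mask]

# A dict maps each old value to its replacement (exact matches only)
cars_dataset["Brand"] = cars_dataset["Brand"].replace(
    {"ISUZU": "Isuzu", "Mini": "Mini Cooper", "Land": "Land Rover"}
)
print(cars_dataset["Brand"].unique())
```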
Statistical summary
Give a summary (count, mean, std, min, 25%, 50%, 75% and max) of the numerical columns only and transpose the results
# Numerical data summary
4 marks
Give a summary of all columns, both numerical and categorical, and transpose the results
4 marks
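A sketch of the two summaries on a hypothetical sample. `describe()` covers numerical columns by default, `include="all"` adds the categorical ones, and `.T` transposes so that each column becomes a row.

```python
import pandas as pd

# Hypothetical sample with numerical and categorical columns
cars_dataset = pd.DataFrame({
    "Year": [2010, 2015, 2011, 2012],
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel"],
    "Price": [1.75, 12.5, 4.5, 6.0],
})

numeric_summary = cars_dataset.describe().T           # numerical columns only
full_summary = cars_dataset.describe(include="all").T  # numerical and categorical columns

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of Fuel_Type) appear as NaN.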
From the statistics summary, what useful insights can you derive?
Data transformation for the Mileage, Engine and Power columns using regex.
An example is given for reference for the Power column, replacing bhp by ''
Replace kmpl and km/kg by '' in the Mileage column
Replace CC by '' in the Engine column
#remove bhp
cars_dataset['Power'] = cars_dataset['Power'].replace(regex='bhp',value='')
cars_dataset.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-15-25600cc4d0fb> in <cell line: 3>()
1 #remove bhp
2
----> 3 cars_dataset['Power'] = cars_dataset['Power'].replace(regex='bhp',value='')
4
5 cars_dataset.head()
NameError: name 'cars_dataset' is not defined
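The NameError above means `cars_dataset` was not defined when the cell ran, which happens when the cell that reads the csv has not been executed first. Below is a sketch of the same `replace` pattern applied to Mileage and Engine, on a hypothetical sample so that it runs on its own.

```python
import pandas as pd

# Hypothetical sample of raw Mileage and Engine strings
cars_dataset = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl"],
    "Engine": ["998 CC", "1582 CC"],
})

# Remove either unit from Mileage; | means "or" in a regular expression
cars_dataset["Mileage"] = cars_dataset["Mileage"].replace(regex=r"kmpl|km/kg", value="")
# Remove the CC unit from Engine
cars_dataset["Engine"] = cars_dataset["Engine"].replace(regex="CC", value="")

print(cars_dataset["Mileage"].unique())
print(cars_dataset["Engine"].unique())
```

The values keep the space that preceded the unit, which is why stripping whitespace is a separate step.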
print("Mileage : ",cars_dataset['Mileage'].unique())
print("Engine : ",cars_dataset['Engine'].unique())
print("Power : ",cars_dataset['Power'].unique())
Remove whitespace from the Mileage, Engine and Power columns, either one by one or via a function with a for loop
# Create a function that removes extra leading
# and trailing whitespace from the data.
# Pass the DataFrame and the columns as input to the function.
# str.strip() can be used for this.
cols_to_remove_space = ['Mileage', 'Engine', 'Power']

def whitespace_remover(df, cols):
    # Strip each listed column in place
    for col in cols:
        df[col] = df[col].str.strip()

# Apply the whitespace_remover function to the DataFrame
whitespace_remover(cars_dataset, cols_to_remove_space)
cars_dataset.head()
print("Mileage : ",cars_dataset['Mileage'].unique())
print("Engine : ",cars_dataset['Engine'].unique())
print("Power : ",cars_dataset['Power'].unique())
Change 'null' to np.nan in Power column and display unique values of Power column
#Change 'null' to np.nan

cars_dataset['Power'] = cars_dataset['Power'].replace(regex='null',value=np.nan)
#check unique values now
cars_dataset.Power.unique()
Change the datatypes of the Engine and Power columns to float
Use the example below
df['Mileage'] = df['Mileage'].astype('float')
4 marks
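A sketch of the conversion for Engine and Power on a hypothetical sample, assuming the units and whitespace have already been removed and 'null' has been replaced by np.nan. Missing values stay as NaN after the cast.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of cleaned string values with some missing entries
cars_dataset = pd.DataFrame({
    "Engine": ["998", "1582", np.nan],
    "Power": ["58.16", np.nan, "126.2"],
})

# Cast each column from string to float
for col in ["Engine", "Power"]:
    cars_dataset[col] = cars_dataset[col].astype("float")

print(cars_dataset.dtypes)
```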
Impute missing values in the required columns at Brand level.
Using the example below, also fill the missing values at brand level for the Mileage and Power columns
print('Imputing median values based on Brand for Mileage, Engine, Power')

# transform returns a result aligned with the original index, so it can be assigned back directly
cars_dataset['Engine'] = cars_dataset.groupby('Brand')['Engine'].transform(lambda val: val.fillna(val.median()))
cars_dataset.describe()
6 marks
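A sketch of the brand-level imputation for Mileage and Power on a hypothetical sample. It uses `transform`, a swapped-in alternative to `apply` that keeps the original row index, so the filled values line up with the DataFrame when assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in both columns
cars_dataset = pd.DataFrame({
    "Brand": ["Maruti", "Maruti", "Maruti", "Honda", "Honda"],
    "Mileage": [20.0, np.nan, 24.0, 18.0, np.nan],
    "Power": [60.0, 80.0, np.nan, np.nan, 100.0],
})

# Fill each missing value with the median of its own brand
for col in ["Mileage", "Power"]:
    cars_dataset[col] = cars_dataset.groupby("Brand")[col].transform(
        lambda val: val.fillna(val.median())
    )

print(cars_dataset)
```

A brand whose values are all missing has no median, so its entries would remain NaN and need a separate fallback.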
Display the lists of all numerical and all categorical columns in the data, stored in separate variables
6 marks
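One way to build the two lists is `select_dtypes`. The sketch below treats every non-numeric column as categorical, which is an assumption that fits this dataset's string columns; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample with numerical and categorical columns
cars_dataset = pd.DataFrame({
    "Brand": ["Maruti", "Hyundai", "Honda"],
    "Fuel_Type": ["CNG", "Diesel", "Petrol"],
    "Year": [2010, 2015, 2011],
    "Price": [1.75, 12.5, 4.5],
})

numerical_cols = cars_dataset.select_dtypes(include="number").columns.tolist()
categorical_cols = cars_dataset.select_dtypes(exclude="number").columns.tolist()

print("Numerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)
```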
Plots and graphs
Matplotlib is a Python 2D plotting library used to draw basic charts
Seaborn is a Python library built on top of Matplotlib that creates and styles statistical plots in a few lines of code
Univariate analysis can be done for both Categorical and Numerical variables.
Categorical variables can be visualized using a Count plot, Bar Chart, Pie Plot, etc.
Numerical Variables can be visualized using Histogram, Box Plot, etc.
Distribution on the basis of skewness value:
Skewness = 0: The distribution is symmetric, as a normal distribution is.
Skewness > 0: The right tail is longer and most values sit on the left, i.e. right skewed
Skewness < 0: The left tail is longer and most values sit on the right, i.e. left skewed
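The sign of the skewness can be checked with `Series.skew()`. A sketch on hypothetical values: one series with a single large value that stretches the right tail, and one symmetric series.

```python
import pandas as pd

# Hypothetical values with a long right tail (one very large entry)
right_tailed = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 40.0])
# Hypothetical symmetric values
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

right_skew = right_tailed.skew()  # positive: right skewed
sym_skew = symmetric.skew()       # zero: symmetric

print(right_skew, sym_skew)
```

In the right-skewed series the mean is pulled above the median by the large value, which is a quick cross-check of the sign.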
Univariate analysis
Numerical columns univariate analysis
6 marks
Categorical columns univariate analysis
6 marks
From the univariate analysis of the numerical and categorical columns, what useful insights can you draw?
2 marks
Bivariate analysis
Bivariate analysis helps to understand how variables are related to each other, including the relationship between dependent and independent variables.
For numerical variables, pair plots and scatter plots are widely used for bivariate analysis.
A stacked bar chart can be used for categorical variables if the output variable is categorical (a classification target).
Bar plots can be used if the output variable is continuous
Draw a pairplot using seaborn library for this cars data and derive insights from that
6 marks
Bar plots can be used to show the relationship between Categorical variables and continuous variables
# Calculate the correlation of the numerical variables and store it in a variable called corr_num
4 marks
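A sketch of the correlation step on a hypothetical sample. Selecting the numerical columns first keeps string columns such as Fuel_Type out of the calculation.

```python
import pandas as pd

# Hypothetical sample; the values are for illustration only
cars_dataset = pd.DataFrame({
    "Year": [2010, 2012, 2015, 2018],
    "Kilometers_driven": [90000, 70000, 40000, 15000],
    "Price": [1.5, 3.0, 6.0, 9.5],
    "Fuel_Type": ["CNG", "Petrol", "Diesel", "Diesel"],
})

# Pearson correlation between every pair of numerical columns
corr_num = cars_dataset.select_dtypes(include="number").corr()
print(corr_num)
```

Each entry lies between -1 and 1; in this sample Price rises with Year and falls with Kilometers_driven.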
