
 Title: Car Price Prediction

 Abstract
- The purpose of this project is to train a machine learning model to predict the price of a
car. To do that, we had to obtain a dataset and edit it so that it would work with the Python
program, which was written for a different original dataset. Every dataset is different, so each
one has to be edited to keep the information the model needs to predict the price and to
leave out information that would confuse it. In our case, we used only cars from the Audi
brand. The dataset has over 10,000 Audi entries, which differ in price, year, distance driven,
engine size or type, model, and fuel consumption. The program can work with other datasets,
but our focus in this project was the Audi dataset because it was simpler and it was the
biggest dataset we could find. We expect it to give more accurate predictions than the
roughly 200 entries of the original dataset, which covered many different cars, so that no two
entries in it described the same car.

The project is based on this link:


 https://thecleverprogrammer.com/2021/08/04/car-price-prediction-with-machine-learning/

The project was made on:


- https://colab.research.google.com

 Introduction (Why you choose this project?)


- We chose this project because we like cars and know about them, and because machine
learning models are a big part of technology now and will be in the future. This program can be
built into car websites to predict the price of the car a customer is looking at, so that the
customer does not get ripped off. In that way the project also does some good for society. In
our country, the biggest car marketplaces, merrjep.com and veturaneshitje.com, are missing a
price predictor, while car websites in the USA and Germany, such as mobile.de, already have
one. A predictor would help websites like these improve user experience and satisfaction, and
it would make searching for and buying a car a better process overall. The program is also not
limited to the vehicle industry: with a few slight changes it could be used for other products
such as mobile phones, computers, and food, and it could grow into a bigger and more
purposeful program.

 Summary of any one research paper


- The research paper shows how essential machine learning price prediction is. It
included statistics about used and new car prices and the differences between them. It
also described how websites like eBay needed machine learning to predict the price of
products, so that users can tell whether they are getting a good deal. The paper taught us
that the used car marketplace is very important, because used car sales would be about
four times the sales of new cars in some years. A model that can predict the price of a car
in such a huge market is therefore a big deal. The paper presented a lot of information
about its marketplace in charts: the price of a car against the kilometers traveled, the
average price of a car, the average kilometers traveled, the number of cars per
manufacturer, etc. Another important chart showed the price of a car depending on the
manufacturer and the kilometers traveled, and with it how much a car's value goes down
depending on the manufacturer. The paper showed results for linear regression, random
forest and SVM on this market, together with data about how accurately the models
predicted car values. Judging from the chart, the models they had developed looked very
promising for the future.

 Data Set Description
https://www.kaggle.com/datasets/rohitagrawal362/audi-car-price-prediction
model  year  price  transmission  mileage  fueltype  tax  highway mpg  enginesize
A1     2017  12500  Manual        15735    Petrol    150  55.4         1.4
A6     2016  16500  Automatic     36203    Diesel    20   64.2         2
A1     2016  11000  Manual        29946    Petrol    30   55.4         1.4
A4     2017  16800  Automatic     25952    Diesel    145  67.3         2
A3     2019  17300  Manual        1998     Petrol    145  49.6         1
Data Base 1: /content/audi.csv

- Table 1. The first 5 rows of the Audi dataset (audi.csv), which is the main dataset.

The number of rows: 10668

The number of columns: 9
Attributes: Model, Year, Price, Transmission, Mileage, Fuel Type, Tax, Highway Mpg,
Engine Size
Attributes Explanation:
Model: The name of the car model.
Year: The year of the car's manufacture.
Price: The price of the car.
Transmission: The type of transmission the car has.
Mileage: The total distance the car has traveled in miles.
Fuel Type: The type of fuel the car uses.
Tax: The tax amount associated with the car.
Highway Mpg: The car's estimated miles per gallon on the highway.
Engine Size: The size of the car's engine in liters.

Data Base 2: https://raw.githubusercontent.com/amankharwal/Website-data/master/CarPrice.csv

Car_ID  Symbolling  Car Name                  Fuel Type  Aspiration  Doors Number  CarBody
1       3           alfa-romero giulia        gas        std         two           convertible
2       3           alfa-romero stelvio       gas        std         two           convertible
3       1           alfa-romero Quadrifoglio  gas        std         two           hatchback
4       2           audi 100 ls               gas        std         four          sedan
5       2           audi 100ls                gas        std         four          sedan

continuing… ↓

Drive wheel  Engine location  Wheelbase  Engine size  Fuel System  Bore ratio  Stroke  Compression ratio  Horsepower
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            94.5       152          mpfi         2.68        3.47    9                  154
FWD          front            99.8       109          mpfi         3.19        3.4     10                 102
4WD          front            99.4       136          mpfi         3.19        3.4     8                  115

continuing… ↓

Peak rpm  City mpg  Highway mpg  Price
5000      21        27           13495
5000      21        27           16500
5000      19        26           16500
5500      24        30           13950
5500      18        22           17450

- Table 2. The first 5 rows of the second database.

Number of Rows: 205 & Number of Columns: 26


Attributes:
car_ID: index for each row in the table.
Symbolling: The insurance risk rating symbol associated with the car.
CarName: The name or model of the car.
Fuel type: The type of fuel the car uses
Aspiration: The type of aspiration or turbocharging the car has
Door number: The number of doors on the car.
Carbody: The type of car body or body style.
Drivewheel: The type of drivewheel or wheel drive configuration
Engine location: The location of the car's engine (e.g., front, rear).
Wheelbase: The distance between the centers of the front and rear wheels in inches.
Engine size: The size or displacement of the car's engine in cubic centimeters (CC).
Fuel system: The type of fuel delivery system used in the car
Bore ratio: The bore ratio of the car's engine.
Stroke: The stroke ratio of the car's engine.
Compression ratio: The compression ratio of the car's engine.
Horsepower: The power output of the car's engine in horsepower.
Peak rpm: The peak revolutions per minute (rpm) at which the car's engine generates its max
power.

Citympg: The car's estimated fuel efficiency in miles per gallon (mpg) during city driving.
Highway mpg: The car's estimated fuel efficiency in miles per gallon (mpg) during highway
driving.
Price: The price of the car.

 Algorithm
The algorithm for the project operates as follows:
1. The program needs input. This input is the dataset, which we load from a spreadsheet file
(audi.csv).
2. Edit the dataset so that it works with the program, because every dataset is different and
the program requires it in a specific layout.
3. Edit the program so it can work for its intended purpose. This includes, but is not limited to,
the types of input the program receives, meaning the information about the car. Some datasets
had the dimensions of the car, but the dataset we chose does not have them. In our opinion
the size of the car matters less for predicting the price than the model, engine, year, etc.
4. After the program and dataset are edited, the program makes its calculations using imports
such as pandas and its DataFrame functions.
5. After the calculations are made, the program outputs the DataFrames, the chart and the
colored matrix of values that the machine learning model needs.
6. The output of the program differs from dataset to dataset, but the reported score was the
same for both of our datasets.
7. The output of this project is meant to be used on a website that sells cars, like mobile.de.
You search for a car and open it to see its description and its price. Right below the seller's
price you then see the price predicted by the model. This could be used in different markets,
but we have only used the Audi marketplace, represented by the Audi dataset.
8. This program is not limited to vehicles. It could also be used for phone and PC prices,
so the feature could be implemented in other industries as well as the automobile industry.
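
The steps above can be sketched in a few lines of Python. This is only a minimal sketch under our own assumptions, not the exact project code: the ten rows in `cars` are made-up values standing in for audi.csv, and the names `cars`, `features`, `prices` and `estimate` are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for audi.csv: made-up rows using the same column names.
cars = pd.DataFrame({
    "enginesize": [1.0, 1.4, 1.4, 2.0, 2.0, 3.0, 1.0, 1.6, 2.0, 3.0],
    "highwaympg": [49.6, 55.4, 47.9, 64.2, 67.3, 40.9, 58.9, 60.1, 52.3, 38.2],
    "price": [17300, 12500, 15999, 16500, 16800, 31000, 11000, 14500, 19900, 35500],
})

# Steps 2-3: separate the model inputs from the column we want to predict.
target = "price"
features = cars.drop(columns=[target]).to_numpy()
prices = cars[target].to_numpy()

# Step 4: hold out 20% of the rows, then fit the model on the rest.
x_train, x_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=0
)
model = DecisionTreeRegressor(random_state=0)
model.fit(x_train, y_train)

# Steps 5-7: estimate the price of one car (1.4 L engine, 55.4 mpg).
estimate = model.predict(np.array([[1.4, 55.4]]))
print(estimate)
```

On a website, the value in `estimate` would be the number shown below the seller's price.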

 Flowchart:

Figure 1. We start with data collection, which for us is the spreadsheet dataset. Then comes
lasso regression, a regularization technique used for an accurate prediction of the car price. The
flow then splits into linear and ridge regression, and the program compiles the results of both. Then
the program uses the best model of the results and displays the predicted used car price.

 Experiment results (Entire code "change all the variable names", All outputs, All figures
outputs with explanations)
- The entire code of the first database with the variable names changed:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
audi = pd.read_csv("/content/audi.csv")
audi.head()
audi.isnull().sum()
audi.info()
print(audi.describe())
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(audi.price)
plt.show()
print(audi.corr())
plt.figure(figsize=(20, 15))
correlations = audi.corr()
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()
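predict = "price"
audi = audi[["enginesize", "highwaympg", "price"]]
# Use the renamed DataFrame (audi, not data) and pass axis by keyword:
# the positional form drop([predict], 1) is deprecated in pandas.
x = np.array(audi.drop([predict], axis=1))
y = np.array(audi[predict])
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)
from sklearn.metrics import mean_absolute_error
# Scoring against the model's own predictions always returns 1.0;
# score against ytest to measure accuracy on held-out data.
model.score(xtest, predictions)
print(audi)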


The entire code of database 1 explained in chunks (first a chunk of the code is
shown, then the output of that chunk, and then an explanation):
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
audi = pd.read_csv("/content/audi.csv")
audi.head()

- In these lines of code we import a few modules, like pandas, that the project needs in
order to work. After the imports, the program loads audi.csv and outputs the first 5 rows of
the Audi dataset.
id model year price transmission mileage fueltype tax highwaympg enginesize
0 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4
1 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0
2 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4
3 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0
4 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0

- Table 3. As we can see in this table, the Python program has printed the first 5
rows of the dataset we loaded, which is audi.csv.

audi.isnull().sum()
model 0
year 0
price 0
transmission 0
mileage 0
fueltype 0
tax 0
highwaympg 0
enginesize 0
dtype: int64

- Table 4. This table shows the output of isnull().sum(). isnull is a pandas function that
marks every empty (null) cell as True and every filled cell as False, and sum() then counts
the True values in each column. A column with missing values would show a number
greater than 0. Every column shows 0 here, which means the dataset we loaded has no
missing values and works as intended.
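
To make the missing-value check concrete, here is a small sketch on made-up rows (the frame `sample` and its values are assumptions for illustration, not part of audi.csv). One mileage cell is left empty on purpose, so its column reports 1 instead of 0.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second car is missing its mileage on purpose.
sample = pd.DataFrame({
    "model": ["A1", "A6", "A3"],
    "mileage": [15735, np.nan, 1998],
    "price": [12500, 16500, 17300],
})

# Count the empty cells in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# One simple option before training: drop the incomplete rows.
complete_rows = sample.dropna()
print(len(complete_rows))
```

Dropping rows is only one way to handle gaps. Since the Audi dataset reports 0 everywhere, no such step was needed in this project.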

audi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10668 entries, 0 to 10667
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 10668 non-null object
1 year 10668 non-null int64
2 price 10668 non-null int64
3 transmission 10668 non-null object
4 mileage 10668 non-null int64
5 fueltype 10668 non-null object
6 tax 10668 non-null int64
7 highwaympg 10668 non-null float64
8 enginesize 10668 non-null float64
dtypes: float64(2), int64(4), object(3)
memory usage: 750.2+ KB

- Table 5. This command prints the technical information about the dataset: the
number of entries (10668), every column with its non-null count and data type, and the
memory usage (about 750.2 KB). It shows that our dataset has 9 columns (model, year,
price, etc.) and 3 different data types, which are float, integer and object.

print(audi.describe())

type year price mileage tax highwaympg \
count 10668.000000 10668.000000 10668.000000 10668.000000 10668.000000
mean 2017.100675 22896.685039 24827.244001 126.011436 50.770022
std 2.167494 11714.841888 23505.257205 67.170294 12.949782
min 1997.000000 1490.000000 1.000000 0.000000 18.900000
25% 2016.000000 15130.750000 5968.750000 125.000000 40.900000
50% 2017.000000 20200.000000 19000.000000 145.000000 49.600000
75% 2019.000000 27990.000000 36464.500000 145.000000 58.900000
max 2020.000000 145000.000000 323000.000000 580.000000 188.300000

continuing… ↓

enginesize
count 10668.000000
mean 1.930709
std 0.602957
min 0.000000
25% 1.500000
50% 2.000000
75% 2.000000
max 6.300000

- Table 6. This command summarizes the numerical columns of the DataFrame. Count is
the number of non-empty values in the column, mean is the average value and std is the
standard deviation. Min is the smallest value and max is the largest value. The 25%, 50%
and 75% rows are percentiles: the value below which that share of the entries falls. For
example, in enginesize 25% of the cars have an engine size of 1.5 liters or less, and the
50% row is the median. These statistics help us understand the data before training. The
model itself is trained on the rows of the dataset, not on this summary.
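
The percentile rows can be checked on a tiny example. The five engine sizes below are made-up values (an assumption for illustration), picked so that the quartiles come out like those in Table 6.

```python
import pandas as pd

# Five made-up engine sizes in liters.
engine_sizes = pd.Series([1.0, 1.5, 2.0, 2.0, 3.0], name="enginesize")

summary = engine_sizes.describe()
print(summary)

# Share of cars at or below the 25% value: at least a quarter of them.
share_at_or_below_q1 = (engine_sizes <= summary["25%"]).mean()
print(share_at_or_below_q1)
```

Here the 25% row is 1.5 and the 50% and 75% rows are both 2.0, because pandas sorts the values and reads off the positions a quarter, half and three quarters of the way through.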

sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(audi.price)
plt.show()

Figure 2. This code draws the distribution of car prices as a histogram with a smoothed
density curve. The graph changes with the dataset we load, because the prices in the
spreadsheet differ. In that sense the dataset represents the market. In this picture most cars
have a price of around 20,000 dollars. A chart like this could also be shown to the user next
to the predicted price, as an informative chart that helps them understand the car's value
better.

sns.distplot(audi.price)
print(audi.corr())
plt.figure(figsize=(20, 15))
correlations = audi.corr()
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()

<ipython-input-9-0bb2be9b5c0c>:1: FutureWarning: The default value of numeric_only in
DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
print(audi.corr())
<ipython-input-9-0bb2be9b5c0c>:3: FutureWarning: The default value of numeric_only in
DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
correlations = audi.corr()

year price mileage tax highwaympg enginesize
year 1.000000 0.592581 -0.789667 0.093066 -0.351281 -0.031582
price 0.592581 1.000000 -0.535357 0.356157 -0.600334 0.591262
mileage -0.789667 -0.535357 1.000000 -0.166547 0.395103 0.070710
tax 0.093066 0.356157 -0.166547 1.000000 -0.635909 0.393075
highwaympg -0.351281 -0.600334 0.395103 -0.635909 1.000000 -0.365621
enginesize -0.031582 0.591262 0.070710 0.393075 -0.365621 1.000000

- Table 7. This table is the correlation matrix that the pandas function corr() returns as a
DataFrame. Each cell is the correlation between two numerical columns, from -1 to 1.
Every row and every column contains the value 1 once, on the diagonal, because there a
column is compared with itself. The matrix helps us see which columns are related to the
price: price rises with year (0.59) and engine size (0.59) and falls with mileage (-0.54) and
highway mpg (-0.60). DataFrames like this are heavily used in machine learning, data
science and many other scientific fields, which makes pandas very useful for our project.

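- As we can see, the output starts with a warning. It is a pandas FutureWarning: the default
value of numeric_only in DataFrame.corr will change in a future version, so the call may stop
working unless numeric_only is specified. Below the warning is the correlation table, and
below the table is the figure, which shows the same values as a chart with different colors.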
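
The warning itself names the remedy: select only numerical columns or pass `numeric_only`. A minimal sketch on made-up rows (the frame `sample` is an assumption for illustration) shows the second option. The text column `model` is simply left out of the result.

```python
import pandas as pd

# Made-up rows: one text column and three numerical columns.
sample = pd.DataFrame({
    "model": ["A1", "A6", "A3", "A4"],          # text, cannot be correlated
    "year": [2016, 2017, 2018, 2019],
    "mileage": [40000, 30000, 20000, 10000],    # falls as year rises
    "price": [11000, 13000, 15000, 17000],      # rises with year
})

# Restrict the correlation to numerical columns explicitly.
correlations = sample.corr(numeric_only=True)
print(correlations)
```

In this toy data price moves exactly with year and exactly against mileage, so those cells are 1 and -1. Real data, such as Table 7, gives values in between.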

“The figure is placed in the next page due to its size”

Figure 3. This heatmap shows how the different columns of the dataset relate to each other. It is
working fine, since the diagonal holds the value 1, which is the way it was intended to work:
each diagonal cell compares a column with itself, which gives a correlation of exactly 1. The red
diagonal line shows this. The other cells differ, and they indicate which columns are useful for
determining the price of the car.

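predict = "price"
audi = audi[["enginesize", "highwaympg", "price"]]
# Use the renamed DataFrame (audi, not data) and pass axis by keyword:
# the positional form drop([predict], 1) is deprecated in pandas.
x = np.array(audi.drop([predict], axis=1))
y = np.array(audi[predict])
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)
from sklearn.metrics import mean_absolute_error
# Scoring against the model's own predictions always returns 1.0;
# score against ytest to measure accuracy on held-out data.
model.score(xtest, predictions)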

<ipython-input-19-a1c592d6430c>:3: FutureWarning: In a future version of pandas all
arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
x = np.array(data.drop([predict], 1))
1.0

- In this output we see a pandas warning followed by the number 1.0. The warning says
that in a future version DataFrame.drop will accept all arguments except 'labels' only as
keywords. The number is the value returned by model.score, which is the R² score.
Because the call passes the model's own predictions as the expected values, it compares
the predictions with themselves and always returns 1.0. It therefore shows that the
program ran without errors, but it does not measure how accurate the predictions are.
For that, the score has to be computed against ytest.
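
The difference between scoring against the predictions and scoring against the held-out prices can be seen in a short sketch. The data below is synthetic (an assumption for illustration: price is generated from engine size and mpg plus random noise), so the numbers are not results for the Audi dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: price rises with engine size, falls with mpg, plus noise.
rng = np.random.default_rng(0)
engine = rng.uniform(1.0, 3.0, size=200)
mpg = rng.uniform(30.0, 70.0, size=200)
price = 10000 + 8000 * engine - 100 * mpg + rng.normal(0, 1500, size=200)

x = np.column_stack([engine, mpg])
xtrain, xtest, ytrain, ytest = train_test_split(
    x, price, test_size=0.2, random_state=0
)

model = DecisionTreeRegressor(random_state=0)
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)

self_score = model.score(xtest, predictions)   # predictions vs. themselves
r2 = model.score(xtest, ytest)                 # predictions vs. true held-out prices
mae = mean_absolute_error(ytest, predictions)  # average error in price units
print(self_score, r2, mae)
```

`self_score` is 1.0 regardless of the data, while `r2` and `mae` change with how well the model fits. The last two are the numbers to report as accuracy.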

print(audi)

- enginesize highwaympg price
0 1.4 55.4 12500
1 2.0 64.2 16500
2 1.4 55.4 11000
3 2.0 67.3 16800
4 1.0 49.6 17300
... ... ... ...
10663 1.0 49.6 16999
10664 1.0 49.6 16999
10665 1.0 49.6 17199
10666 1.4 47.9 19499
10667 1.4 47.9 15999

[10668 rows x 3 columns]

- Table 8. This output shows the modified version of the dataset we loaded. It has
been reduced to only 3 columns (enginesize, highwaympg and price) because this was the
only way to make the dataset work with the program. It needs to be modified in order to
work. Columns that say nothing about the value of a car, such as a car ID, should be left
out, because a vehicle ID does not change the value of the car in any way and could
interfere with the model. In some cases it could even cause an error.

 The entire code of the second database with the variable names changed:

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# The URL keeps its original path (Website-data); only the variable is renamed to DB2.
DB2 = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/CarPrice.csv")
DB2.head()
DB2.isnull().sum()
DB2.info()
print(DB2.describe())
DB2.CarName.unique()
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(DB2.price)
plt.show()
sns.distplot(DB2.price)
print(DB2.corr())
plt.figure(figsize=(20, 15))
correlations = DB2.corr()
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()
predict = "price"
DB2 = DB2[["symboling", "wheelbase", "carlength",
"carwidth", "carheight", "curbweight",
"enginesize", "boreratio", "stroke",
"compressionratio", "horsepower", "peakrpm",
"citympg", "highwaympg", "price"]]
# Pass axis by keyword: the positional form drop([predict], 1) is deprecated in pandas.
x = np.array(DB2.drop([predict], axis=1))
y = np.array(DB2[predict])
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)
from sklearn.metrics import mean_absolute_error
# Scoring against the model's own predictions always returns 1.0;
# score against ytest to measure accuracy on held-out data.
model.score(xtest, predictions)
print(DB2)

 The entire output of second database with explanations:

Car_ID  Symbolling  Car Name                  Fuel Type  Aspiration  Doors Number  CarBody
1       3           alfa-romero giulia        gas        std         two           convertible
2       3           alfa-romero stelvio       gas        std         two           convertible
3       1           alfa-romero Quadrifoglio  gas        std         two           hatchback
4       2           audi 100 ls               gas        std         four          sedan
5       2           audi 100ls                gas        std         four          sedan

continuing… ↓

Drive wheel  Engine location  Wheelbase  Engine size  Fuel System  Bore ratio  Stroke  Compression ratio  Horsepower
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            94.5       152          mpfi         2.68        3.47    9                  154
FWD          front            99.8       109          mpfi         3.19        3.4     10                 102
4WD          front            99.4       136          mpfi         3.19        3.4     8                  115

continuing… ↓

Peak rpm  City mpg  Highway mpg  Price
5000      21        27           13495
5000      21        27           16500
5000      19        26           16500
5500      24        30           13950
5500      18        22           17450

- Table 9. This table simply shows the input, which is the dataset we loaded
(CarPrice.csv, stored in DB2). It shows the rows and columns of the file, and only the
first 5 rows have been printed.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 car_ID 205 non-null int64
1 symboling 205 non-null int64
2 CarName 205 non-null object
3 fueltype 205 non-null object
4 aspiration 205 non-null object
5 doornumber 205 non-null object
6 carbody 205 non-null object
7 drivewheel 205 non-null object
8 enginelocation 205 non-null object
9 wheelbase 205 non-null float64
10 carlength 205 non-null float64
11 carwidth 205 non-null float64
12 carheight 205 non-null float64
13 curbweight 205 non-null int64
14 enginetype 205 non-null object
15 cylindernumber 205 non-null object
16 enginesize 205 non-null int64
17 fuelsystem 205 non-null object
18 boreratio 205 non-null float64
19 stroke 205 non-null float64
20 compressionratio 205 non-null float64
21 horsepower 205 non-null int64
22 peakrpm 205 non-null int64
23 citympg 205 non-null int64
24 highwaympg 205 non-null int64
25 price 205 non-null float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB

- This output shows the non-null count and the data type of every column in the
dataset, together with the number of different data types and the memory usage. The
dataset has 205 rows and 26 columns and uses about 41.8 KB of memory. Each column
holds one of three data types: integer, float or object.

car_ID symboling wheelbase carlength carwidth carheight \
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 103.000000 0.834146 98.756585 174.049268 65.907805 53.724878
std 59.322565 1.245307 6.021776 12.337289 2.145204 2.443522
min 1.000000 -2.000000 86.600000 141.100000 60.300000 47.800000
25% 52.000000 0.000000 94.500000 166.300000 64.100000 52.000000
50% 103.000000 1.000000 97.000000 173.200000 65.500000 54.100000
75% 154.000000 2.000000 102.400000 183.100000 66.900000 55.500000
max 205.000000 3.000000 120.900000 208.100000 72.300000 59.800000

curbweight enginesize boreratio stroke compressionratio \
count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 2555.565854 126.907317 3.329756 3.255415 10.142537
std 520.680204 41.642693 0.270844 0.313597 3.972040
min 1488.000000 61.000000 2.540000 2.070000 7.000000
25% 2145.000000 97.000000 3.150000 3.110000 8.600000
50% 2414.000000 120.000000 3.310000 3.290000 9.000000
75% 2935.000000 141.000000 3.580000 3.410000 9.400000
max 4066.000000 326.000000 3.940000 4.170000 23.000000

horsepower peakrpm citympg highwaympg price
count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 104.117073 5125.121951 25.219512 30.751220 13276.710571
std 39.544167 476.985643 6.542142 6.886443 7988.852332
min 48.000000 4150.000000 13.000000 16.000000 5118.000000
25% 70.000000 4800.000000 19.000000 25.000000 7788.000000
50% 95.000000 5200.000000 24.000000 30.000000 10295.000000
75% 116.000000 5500.000000 30.000000 34.000000 16503.000000
max 288.000000 6600.000000 49.000000 54.000000 45400.000000

- This command summarizes the numerical columns of the DataFrame. As we can see, the
numbers differ from column to column. Count is the number of non-empty values, mean is
the average value and std is the standard deviation. Min is the smallest value and max is
the largest value in the column. The 25%, 50% and 75% rows are percentiles: the value
below which 25%, 50% (the median) or 75% of the entries fall.

<ipython-input-2-3b6c97159ec3>:7: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

<ipython-input-2-3b6c97159ec3>:9: FutureWarning: The default value of numeric_only in
DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
print(data.corr())
<ipython-input-2-3b6c97159ec3>:11: FutureWarning: The default value of numeric_only in
DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
correlations = data.corr()

- In this output we can see a few warnings. They tell us to edit the program in the
future, because some calls may stop working: seaborn's distplot is deprecated and will be
removed, and the default value of numeric_only in DataFrame.corr will change. As of
4/29/2023 the program works fine. If needed, it will be updated to keep it working.

Figure 4. This figure shows the distribution of car prices in a chart. As we can see, the prices here
lie between 5k and 50k, with most used car prices around 8-10k$. This will differ for
every dataset we give to the program.

car_ID symboling wheelbase carlength carwidth \
car_ID 1.000000 -0.151621 0.129729 0.170636 0.052387
symboling -0.151621 1.000000 -0.531954 -0.357612 -0.232919
wheelbase 0.129729 -0.531954 1.000000 0.874587 0.795144
carlength 0.170636 -0.357612 0.874587 1.000000 0.841118
carwidth 0.052387 -0.232919 0.795144 0.841118 1.000000
carheight 0.255960 -0.541038 0.589435 0.491029 0.279210
curbweight 0.071962 -0.227691 0.776386 0.877728 0.867032
enginesize -0.033930 -0.105790 0.569329 0.683360 0.735433
boreratio 0.260064 -0.130051 0.488750 0.606454 0.559150
stroke -0.160824 -0.008735 0.160959 0.129533 0.182942
compressionratio 0.150276 -0.178515 0.249786 0.158414 0.181129
horsepower -0.015006 0.070873 0.353294 0.552623 0.640732
peakrpm -0.203789 0.273606 -0.360469 -0.287242 -0.220012
citympg 0.015940 -0.035823 -0.470414 -0.670909 -0.642704
highwaympg 0.011255 0.034606 -0.544082 -0.704662 -0.677218
price -0.109093 -0.079978 0.577816 0.682920 0.759325

carheight curbweight enginesize boreratio stroke \
car_ID 0.255960 0.071962 -0.033930 0.260064 -0.160824
symboling -0.541038 -0.227691 -0.105790 -0.130051 -0.008735
wheelbase 0.589435 0.776386 0.569329 0.488750 0.160959
carlength 0.491029 0.877728 0.683360 0.606454 0.129533
carwidth 0.279210 0.867032 0.735433 0.559150 0.182942
carheight 1.000000 0.295572 0.067149 0.171071 -0.055307
curbweight 0.295572 1.000000 0.850594 0.648480 0.168790
enginesize 0.067149 0.850594 1.000000 0.583774 0.203129
boreratio 0.171071 0.648480 0.583774 1.000000 -0.055909
stroke -0.055307 0.168790 0.203129 -0.055909 1.000000
compressionratio 0.261214 0.151362 0.028971 0.005197 0.186110
horsepower -0.108802 0.750739 0.809769 0.573677 0.080940
peakrpm -0.320411 -0.266243 -0.244660 -0.254976 -0.067964
citympg -0.048640 -0.757414 -0.653658 -0.584532 -0.042145
highwaympg -0.107358 -0.797465 -0.677470 -0.587012 -0.043931
price 0.119336 0.835305 0.874145 0.553173 0.079443

compressionratio horsepower peakrpm citympg \
car_ID 0.150276 -0.015006 -0.203789 0.015940
symboling -0.178515 0.070873 0.273606 -0.035823
wheelbase 0.249786 0.353294 -0.360469 -0.470414
carlength 0.158414 0.552623 -0.287242 -0.670909
carwidth 0.181129 0.640732 -0.220012 -0.642704
carheight 0.261214 -0.108802 -0.320411 -0.048640
curbweight 0.151362 0.750739 -0.266243 -0.757414
enginesize 0.028971 0.809769 -0.244660 -0.653658
boreratio 0.005197 0.573677 -0.254976 -0.584532
stroke 0.186110 0.080940 -0.067964 -0.042145
compressionratio 1.000000 -0.204326 -0.435741 0.324701
horsepower -0.204326 1.000000 0.131073 -0.801456
peakrpm -0.435741 0.131073 1.000000 -0.113544
citympg 0.324701 -0.801456 -0.113544 1.000000
highwaympg 0.265201 -0.770544 -0.054275 0.971337
price 0.067984 0.808139 -0.085267 -0.685751

highwaympg price
car_ID 0.011255 -0.109093
symboling 0.034606 -0.079978
wheelbase -0.544082 0.577816
carlength -0.704662 0.682920
carwidth -0.677218 0.759325
carheight -0.107358 0.119336
curbweight -0.797465 0.835305
enginesize -0.677470 0.874145
boreratio -0.587012 0.553173
stroke -0.043931 0.079443
compressionratio 0.265201 0.067984
horsepower -0.770544 0.808139
peakrpm -0.054275 -0.085267
citympg 0.971337 -0.685751
highwaympg 1.000000 -0.697599
price -0.697599 1.000000

- This table is the correlation matrix of the numerical columns. It shows how strongly each
column is related to the price, which helps in choosing the data used to predict the car price.
The same information is shown with colors in the chart below, which some people may find
easier to understand.

Figure 5. This chart shows how strongly the different columns of the database we input are correlated with each
other. The value 1 along the diagonal is expected, because every column is perfectly correlated with itself. The
values in the other cells differ, and they show which columns are most closely related to the price that the
machine learning model will predict.
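
The correlation table behind Figure 5 can be reproduced on a small scale. The following is only a sketch: it uses six rows and four columns copied from the table output in this report rather than the full dataset, and the variable names `correlations`, `diagonal` and `price_ranking` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Six rows (0, 2, 3, 4, 200, 202) and four columns copied from the output in this report.
data = pd.DataFrame({
    "enginesize": [130, 152, 109, 136, 141, 173],
    "horsepower": [111, 154, 102, 115, 114, 134],
    "citympg": [21, 19, 24, 18, 23, 18],
    "price": [13495.0, 16500.0, 13950.0, 17450.0, 16845.0, 21485.0],
})

# Pearson correlation between every pair of numeric columns.
correlations = data.corr()

# Every column is perfectly correlated with itself, so the diagonal holds 1.
diagonal = np.diag(correlations.to_numpy())

# Columns ranked by how strongly they move together with the price.
price_ranking = (
    correlations["price"].drop("price").abs().sort_values(ascending=False)
)

print(correlations.round(3))
print(price_ranking.round(3))
```

A plotting call such as seaborn's `heatmap(correlations, annot=True)` could then draw the matrix in color. That step is left out here so the sketch needs only pandas and NumPy.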

<ipython-input-3-09e4e61e658b>:7: FutureWarning: In a future version of pandas all arguments
of DataFrame.drop except for the argument 'labels' will be keyword-only.
x = np.array(data.drop([predict], 1))
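1.0
- In this output only the number 1.0 is shown, below a FutureWarning. The warning does
not stop the program: pandas is announcing that a future version will require the second
argument of DataFrame.drop to be passed by keyword, for example axis=1 or
columns=[predict]. The number 1.0 is most likely the score that the model reports for its
predictions, not a pandas test of the dataframe. For a regression model this score is the
coefficient of determination, where 1.0 is the best possible value and lower values mean a
worse fit, so the output does not simply switch between 1.0 and 0.0. A score of exactly 1.0
should be read with care, because it also appears whenever the predictions are scored
against themselves instead of against the real prices of the test set.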
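
To show why a score of exactly 1.0 deserves a second look, here is a minimal sketch of the coefficient of determination, assuming the 1.0 above comes from a regression score of this kind (scikit-learn regressors report it through their `score` method). The helper `r_squared` is written here for illustration, the first four actual prices are taken from the table in this report, and the predicted prices are invented.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after predicting
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread of the true values
    return 1.0 - ss_res / ss_tot

# Real prices of a hypothetical test set and invented model predictions.
actual_prices = np.array([13495.0, 16500.0, 13950.0, 17450.0, 22470.0])
predicted_prices = np.array([14000.0, 15500.0, 15000.0, 17000.0, 20000.0])

# Predictions compared with the real prices: an honest measure of the fit.
score_vs_actual = r_squared(actual_prices, predicted_prices)

# Predictions compared with themselves: the residuals are all zero.
score_vs_self = r_squared(predicted_prices, predicted_prices)

print("scored against real prices:", round(score_vs_actual, 3))
print("scored against itself:", score_vs_self)
```

If the score in the project was computed in the second way, the 1.0 says nothing about accuracy. Scoring against the real test prices would give a more useful number.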

symboling wheelbase carlength carwidth carheight curbweight \


0 3 88.6 168.8 64.1 48.8 2548
1 3 88.6 168.8 64.1 48.8 2548
2 1 94.5 171.2 65.5 52.4 2823
3 2 99.8 176.6 66.2 54.3 2337
4 2 99.4 176.6 66.4 54.3 2824
.. ... ... ... ... ... ...
200 -1 109.1 188.8 68.9 55.5 2952
201 -1 109.1 188.8 68.8 55.5 3049
202 -1 109.1 188.8 68.9 55.5 3012
203 -1 109.1 188.8 68.9 55.5 3217
204 -1 109.1 188.8 68.9 55.5 3062

enginesize boreratio stroke compressionratio horsepower peakrpm \


0 130 3.47 2.68 9.0 111 5000
1 130 3.47 2.68 9.0 111 5000
2 152 2.68 3.47 9.0 154 5000
3 109 3.19 3.40 10.0 102 5500
4 136 3.19 3.40 8.0 115 5500
.. ... ... ... ... ... ...
200 141 3.78 3.15 9.5 114 5400
201 141 3.78 3.15 8.7 160 5300
202 173 3.58 2.87 8.8 134 5500
203 145 3.01 3.40 23.0 106 4800
204 141 3.78 3.15 9.5 114 5400

citympg highwaympg price
0 21 27 13495.0
1 21 27 16500.0
2 19 26 16500.0
3 24 30 13950.0
4 18 22 17450.0
.. ... ... ...
200 23 28 16845.0
201 19 25 19045.0
202 18 23 21485.0
203 26 27 22470.0
204 19 25 22625.0

[205 rows x 15 columns]



- This output shows the edited version of the dataset we input. A few columns are missing
because they were not used or were not important for determining the price of the car, such
as car ID, fuel system and engine location. No rows are missing, but 11 columns were
removed. Keeping a column such as car ID would also interfere with the machine learning
model, because the car ID has no value or meaning for the price of the car, so these columns
had to be removed.
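
The column editing described above can be sketched as follows. This is an illustration under assumptions, not the project's exact code: the frame `raw` holds only the first three rows and a few columns, the column name `fuelsystem` and its values are assumed, and `keep` is a shortened version of the real column list. The name `predict` follows the line shown in the warning output.

```python
import numpy as np
import pandas as pd

# Small stand-in for the original dataset (illustrative subset of columns).
raw = pd.DataFrame({
    "car_ID": [1, 2, 3],                     # identifier, no meaning for the price
    "fuelsystem": ["mpfi", "mpfi", "mpfi"],  # assumed column name and values
    "enginesize": [130, 130, 152],
    "horsepower": [111, 111, 154],
    "price": [13495.0, 16500.0, 16500.0],
})

predict = "price"
keep = ["enginesize", "horsepower", "price"]  # shortened list of useful columns

# Keep only the useful columns; every row is preserved.
data = raw[keep]

# Passing columns= by keyword avoids the FutureWarning about positional arguments.
x = np.array(data.drop(columns=[predict]))  # features
y = np.array(data[predict])                 # target: the price

print(data.shape, x.shape, y.shape)
```

Selecting columns by name keeps every row, which matches the output above: the row count is unchanged and only the number of columns shrinks.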

 Compare a minimum 2 datasets with all outputs
- The two datasets give similar outputs in some parts, so only the parts where they differ are
explained here. Common parts such as warnings are not explained again.

Dataset 1 Dataset 2

- As we can see, the two charts differ a lot. What they have in common is the number 1
along the diagonal. This is because every column appears once as a row and once as a
column, and where a column meets itself the correlation is exactly 1. The biggest difference
is the number of rows and columns. The chart of the first dataset, audi.csv, has far fewer
rows and columns because it has fewer numeric attributes. The second dataset has many
more attributes, so its chart has more squares. The colors are of the same kind in both
charts, except that the second dataset shows a bigger variety because of its size.

Dataset 1 Dataset 2

- The description of these pictures is given above, so here is only the comparison. The
biggest differences are in the density scale and in the price range. In the first dataset the
density axis reads from 0 to 5, while in the second it reads from 0 to 0.00010. The number
of cars (rows) does not set this scale: the area under a density curve is always 1, so the
height of the curve depends on how widely the prices are spread. For prices in the tens of
thousands a density near 5 is very unlikely, so the axis of the first chart probably carries a
scale factor (such as 1e-5) that is easy to miss. The first dataset also seems to have a higher
average price than the second one, and the difference is drastic. Prices in the first dataset
range from 0 to $150,000, while in the second they range from 0 to $55,000. The average
car in the first dataset seems to cost around $20,000, while in the second it costs around
$10,000.
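
The link between price spread and the density scale can be checked with a short experiment. This sketch uses synthetic prices drawn from normal distributions, not the project's data, and a plain normalized histogram instead of the smoothed curve in the pictures. The helper `density_histogram` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic price samples: the same number of rows, but a different spread.
narrow_prices = rng.normal(loc=10_000, scale=3_000, size=5_000)
wide_prices = rng.normal(loc=20_000, scale=12_000, size=5_000)

# A much smaller sample from the narrow distribution, to vary the row count.
small_sample = narrow_prices[:200]

def density_histogram(prices, bins=50):
    """Return bar heights and bar widths of a histogram normalized to area 1."""
    heights, edges = np.histogram(prices, bins=bins, density=True)
    return heights, np.diff(edges)

narrow_heights, narrow_widths = density_histogram(narrow_prices)
wide_heights, wide_widths = density_histogram(wide_prices)
small_heights, small_widths = density_histogram(small_sample)

# The total area is 1 in every case, whatever the number of rows.
narrow_area = np.sum(narrow_heights * narrow_widths)
wide_area = np.sum(wide_heights * wide_widths)
small_area = np.sum(small_heights * small_widths)

# A wider spread of prices forces the bars to be lower.
print("areas:", narrow_area, wide_area, small_area)
print("highest bar, narrow spread:", narrow_heights.max())
print("highest bar, wide spread:", wide_heights.max())
```

In this experiment the sample with the wider price range has the lower density values, which is the same pattern as a chart that covers prices up to $150,000 next to one that stops at $55,000.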

 Conclusion
In conclusion, we found that both of the databases we edited and input into the program
worked. They differed in the number of columns, or attributes, but the program worked with
each of them once they were edited. It also showed some statistics of the databases, as we saw
in the graphs. The price graph was an effective way to show the typical price of the cars we
input (about 10,000 Audi cars in the first database and 205 cars in the second) and how the
prices differ between databases. We could also use databases from different markets as inputs,
which would be an efficient way to compare the prices of different car models across markets
in one graph. We also saw a big difference in the colored table, because the table of the second
database had many more rows and columns as a result of having more attributes than
Database 1. Both tables used similar colors for similar values, but they did have their
differences. We found that this project could help websites in our country, like merrjep.com,
with their vehicle prices. The machine learning model could predict prices even in a small
market, but not very accurately. It would be a good test for bigger applications like ebay.com
and mobile.de. This would also be very helpful for the future, because machine learning
models are more and more in demand. Working on it also taught us how to train machine
learning models to work better and more efficiently, which will help our career path in IT. We
also learned that this approach could be used not only for vehicles but also for other everyday
items like computers and phones. This would definitely be useful for websites.
