
Analysis of Super Car Dataset
FIT5147 Data Exploration and Visualization
Semester 1, 2018
Data Exploration Project

Name: VARUN MATHUR


ID: 28954114
Contents

1. Introduction
1.1 Motivation
1.2 Problem Description
1.3 Some Questions
2. Data Wrangling and Data Checking
3. Data Exploration
4. Conclusion
5. Reflection
6. References

1. Introduction
1.1 Motivation
As a supercar enthusiast, I was glad to find this dataset, which has motivated me to
analyse different supercars in terms of their power, top speed etc. It would be
interesting to present these parameters with suitable visualizations such as bar plots
and scatter plots.
1.2 Problem Description
The supercar dataset is well suited to data analysis. Just by looking at the dataset,
various questions about supercars came to mind: How much horsepower does a Mercedes-Benz
have? How have the cars changed over the years? How do these cars compare with other
supercars in terms of power, torque and top speed? All these questions can be answered
with these datasets. However, a little data cleaning and manipulation needs to be done
before the data can be analysed. We will need to use some data cleaning libraries to
reshape our data for analysis.
This data can then be used to plot various visualizations such as scatterplots,
histograms and bar charts.
1.3 Some Questions
Some of the interesting questions that can be answered using the datasets are as follows:
A) What were the top speeds of the supercars in various decades and how did the trend
change over time? When does the spike in the top speed happen?
B) How does the relationship between horsepower and top speed change over the decades?
What is the most common range of horsepower for the supercars?
C) Which are the fastest cars? How do their top speeds compare with each other? Can
this be shown with the help of a bar graph?
D) How does the acceleration of the supercar (the time it takes to go from 0 mph to
60 mph) vary with the horsepower of the supercar? Does it increase or decrease? What
does the curve look like?
E) Similarly, how does the acceleration of the car vary with the horsepower_per_ton of
the supercar? Does adding the weight parameter change the plot compared to
question D?
F) What is the most common top speed of the different supercars? In what range does this
value lie? Can this be described using a density plot?

2. Data Wrangling and Data Checking


The following are a few of the data sources that have helped me to obtain some of the
parameters of various supercars. They have also provided me with other information about
the cars, such as types and models.
https://www.supercars.net/blog/ : This source gives a brief description of, and interesting
facts about, various supercars such as Porsche, BMW, RUF SCR, Ferrari, Audi, etc.

https://www.thesupercarscollective.com/inside-supercars/ : This source gives an
overview of the technical specifications, the types of supercars, how their tyres are
modified in accordance with the speed of the vehicle, etc.
I have used 6 .txt files which contain all the data about the supercars.
These are as follows:
i) auto-snout_0-60-times_DATA.txt: Consists of 400 rows × 2 columns. Provides the
time in seconds each car takes to reach a speed of 60 mph.
ii) auto-snout_engine-size_DATA.txt: Consists of 1580 rows × 3 columns. Provides the car
names with their engine sizes.
iii) auto-snout_horsepower_DATA.txt: Consists of 1581 rows × 3 columns. Provides the
horsepower information.
iv) auto-snout_power-to-weight_DATA.txt: Consists of 1581 rows × 3 columns. Provides the
horsepower per ton (bhp) information.
v) auto-snout_top-speed_DATA.txt: Consists of 1573 rows × 3 columns. Provides the top
speed of the car in mph and kph.
vi) auto-snout_torque_DATA.txt: Consists of 1580 rows × 3 columns. Provides the torque
information of the supercars.

Some of the steps taken for the data wrangling / data cleaning process are as follows:
1) We search for duplicate records across all the datasets, because duplicate records
can give a faulty join. We use the library ‘dplyr’ for this purpose. We group each of
the data frames by ‘car_full_nm’, count the number of records for each car and filter
the result so that it shows only the cars that have more than one record.
After this process, 3 duplicated cars are found, each with 2 records (row number,
car_full_nm, record count):
1 Chevrolet Chevy II Nova SS 283 V8 Turbo Fire - [1964] 2
2 Koenigsegg CCX 4.7 V8 Supercharged - [2006] 2
3 Pontiac Bonneville 6.4L V8 - [1960] 2

These duplicate records would cause problems in our analysis, so we need to remove
them. This can be done using the distinct() function, so that each ‘car_full_nm’
appears only once.
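The duplicate check and removal described in step 1 can be sketched as follows. The report did this in R with dplyr; the sketch below is plain Python for illustration only, and the sample rows and horsepower values are made up rather than taken from the dataset.

```python
from collections import Counter

# Hypothetical sample rows: (car_full_nm, horsepower_bhp). Values are made up.
rows = [
    ("Koenigsegg CCX 4.7 V8 Supercharged - [2006]", 800),
    ("Koenigsegg CCX 4.7 V8 Supercharged - [2006]", 800),
    ("Pontiac Bonneville 6.4L V8 - [1960]", 300),
    ("Example Car 2.0 - [2007]", 200),
]


def find_duplicates(rows):
    """Count records per car name and keep only names seen more than once."""
    counts = Counter(name for name, _ in rows)
    return {name: n for name, n in counts.items() if n > 1}


def drop_duplicates(rows):
    """Keep the first record for each car name, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row[0] not in seen:
            seen.add(row[0])
            unique.append(row)
    return unique


duplicates = find_duplicates(rows)
clean_rows = drop_duplicates(rows)
print(duplicates)
print(len(clean_rows))
```

Running the check again on clean_rows should return an empty result, which mirrors the re-check done after the join.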
2) Now we join the 6 datasets into one large dataset that contains all the specifications
of all the supercars.
This joining is performed in the following manner:
• We start with the horsepower dataset and join the torque dataset to it.
This join is performed on the common attribute “car_full_nm”.
• Then, one by one, the remaining datasets are joined to this dataset to create one
large dataset.
• We use a left_join() because, once the duplicates have been removed, this join
keeps the same number of records as the left dataset after every join.

Finally, we check the data again and no more duplicates are found.
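To illustrate why the duplicates had to be removed before joining, here is a minimal left-join sketch in plain Python (the report used dplyr's left_join() in R). The car names, values and the torque_lb_ft column name are hypothetical.

```python
from collections import defaultdict


def left_join(left, right, key="car_full_nm"):
    """Left-join two lists of dicts on `key`.

    Every left row is kept. A left row with no match gets None for the
    right-hand columns; a left row with several matches is repeated once
    per match, which is how duplicate keys inflate the row count.
    """
    matches = defaultdict(list)
    for row in right:
        matches[row[key]].append(row)
    right_cols = sorted({col for row in right for col in row if col != key})

    joined = []
    for row in left:
        for match in matches.get(row[key]) or [{}]:
            merged = dict(row)
            for col in right_cols:
                merged[col] = match.get(col)
            joined.append(merged)
    return joined


# Hypothetical data; the torque column name is an assumption.
horsepower = [
    {"car_full_nm": "Car A - [2001]", "horsepower_bhp": 300},
    {"car_full_nm": "Car B - [2005]", "horsepower_bhp": 450},
]
torque = [{"car_full_nm": "Car A - [2001]", "torque_lb_ft": 280}]

joined = left_join(horsepower, torque)
print(len(joined))  # same number of rows as the left dataset

# With a duplicated key on the right, the joined result gains an extra row.
inflated = left_join(horsepower, torque + torque)
print(len(inflated))
```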

3) Now we add a few variables to our dataset as part of the wrangling process. We use
the mutate() function for this.
• We use regular expressions to extract the ‘year’ part from ‘car_full_nm’ and
insert it as a ‘year’ column in our main dataset. We then use the ‘substring’
function to get the first 3 characters of the year and append ‘0s’ to them to
form the decade. For example, if the 3 characters extracted are ‘193’, the
decade is ‘1930s’.
• After this, by using regular expressions again, we extract the ‘make’ of the car.
This is the brand name of the car. It is stored in another column named
‘make_nm’, which is then inserted in the main dataset.
For example,
A) From the car name ‘Bugatti Veyron 8.0 litre W16 Super Sport’, make_nm is
‘Bugatti’.
B) From the car name ‘Porsche 9FF GT9R’, make_nm is ‘Porsche’.
• Now, we calculate the car_weight in tons, which is equal to horsepower_bhp
divided by horsepower_per_ton_bhp. This new column is then inserted into the
main dataset.
car_weight in tons = horsepower_bhp / horsepower_per_ton_bhp
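The derived columns from step 3 can be sketched like this. It is a plain Python illustration rather than the R mutate() code used in the report. It assumes the year appears in square brackets at the end of car_full_nm, as in the duplicate records listed earlier, and it takes the make to be the first word of the name, which is a simplification that would not handle two-word brand names. The horsepower figures are made up.

```python
import re


def derive_columns(car_full_nm, horsepower_bhp, horsepower_per_ton_bhp):
    """Derive year, decade, make_nm and car weight from one record."""
    # Year: four digits inside square brackets at the end of the name.
    year_match = re.search(r"\[(\d{4})\]\s*$", car_full_nm)
    year = year_match.group(1) if year_match else None

    # Decade: first three characters of the year followed by '0s'.
    decade = year[:3] + "0s" if year else None

    # Make: first whitespace-delimited word (assumption, see lead-in).
    make_match = re.match(r"\S+", car_full_nm)
    make_nm = make_match.group(0) if make_match else None

    # Weight in tons = horsepower / horsepower per ton.
    car_weight_tons = horsepower_bhp / horsepower_per_ton_bhp

    return {
        "year": year,
        "decade": decade,
        "make_nm": make_nm,
        "car_weight_tons": car_weight_tons,
    }


# Hypothetical horsepower figures chosen so the arithmetic is easy to check.
record = derive_columns("Koenigsegg CCX 4.7 V8 Supercharged - [2006]", 800, 400)
print(record)
```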
4) We inspect the data again. This is done to check whether the new variables that have
been added to the dataset have been created correctly. This is done by using the
library ‘dplyr’ and the ‘%>%’ operator. Finally, we create two frequency tables, which
are as follows:
• We check the number of rows that are present for each of the decades. This is
done by using the group_by() function and the summarize() function. We get a
count of the number of rows for each of the decades.
Example:
decade count
1930s 2
1940s 7
1950s 57 etc.

• We check the number of rows that are present for each ‘make’ of the car. This is
again done by using the group_by() function and the summarize() function. We
get the number of rows for each ‘make’ of the car.
Example:
make_nm make_count
Ford 110
Audi 98
Porsche 95 etc.
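The two frequency tables amount to counting rows per group. A minimal Python sketch of that idea is shown below; the report itself used group_by() and summarize() in R, and the rows here are invented for illustration.

```python
from collections import Counter

# Hypothetical rows; only the columns needed for the counts are shown.
cars = [
    {"make_nm": "Ford", "decade": "1960s"},
    {"make_nm": "Ford", "decade": "1970s"},
    {"make_nm": "Audi", "decade": "2000s"},
    {"make_nm": "Porsche", "decade": "2000s"},
    {"make_nm": "Ford", "decade": "2000s"},
]


def frequency_table(rows, column):
    """Return (value, count) pairs for `column`, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


make_counts = frequency_table(cars, "make_nm")
decade_counts = frequency_table(cars, "decade")
print(make_counts)
print(decade_counts)
```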

3. Data Exploration
We use R for our data exploration.
We first create some themes, so that we can reuse them when plotting multiple charts
later.
Now we start exploring the data and answer the questions above.

A) What were the top speeds of the supercars in various decades and how did the trend
change over time? When does the spike in the top speed happen?
• Plots 1, 2 and 3 answer this question.

1) First, we plot a histogram of top speeds.

Here we notice that there is a large number of cars whose top speed maxes out at 150 to
155 miles per hour. But this does not yet show clearly where the actual spike in the
top speed happens. To explore this further, we can plot a bar chart concentrating on
the range around 150 to 155 mph.

2) Plotting a bar chart of top speeds between 149 and 159 miles per hour.

Now we see that there is a huge spike in the number of supercars with a top speed of
155 mph, while the counts for the other top speeds are quite close to each other. This
answers the second part of our question.
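The counting behind a bar chart like plot 2 can be sketched in a few lines. This is a plain Python illustration, not the ggplot code used in the report, and the speed values are made up; only the 149 to 159 mph range comes from the text.

```python
from collections import Counter

# Hypothetical top speeds in mph; not taken from the dataset.
top_speeds_mph = [120, 149, 150, 155, 155, 155, 155, 158, 162, 201]


def count_speeds_in_range(speeds, low=149, high=159):
    """Count how many cars share each top speed within [low, high]."""
    counts = Counter(s for s in speeds if low <= s <= high)
    return dict(sorted(counts.items()))


speed_counts = count_speeds_in_range(top_speeds_mph)
most_common_speed = max(speed_counts, key=speed_counts.get)
print(speed_counts)
print(most_common_speed)
```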

3) Plotting a histogram of top speeds by decade

We use the concept of faceting for this purpose. We plot a histogram of the top speeds by
the different decades.

Here, we notice that the top speeds are quite low in the initial decades, and the
1930s and 1940s panels are almost empty. The values start to increase gradually from
the 1950s. A spike begins to form sometime in the 1990s and is clearly visible in the
2000s, close to the value of 150 mph.

B) How does the relationship between horsepower and top speed change over the decades?
What is the most common range of horsepower for the supercars?
• Plots 4 and 5 answer this question.

4) We start by plotting horsepower vs. top speed

This is a basic graph of horsepower vs. top speed. We see that as the horsepower
increases, the top speed also increases. The data is mostly concentrated around a
horsepower of about 150 bhp to 230 bhp and a top speed of about 100 mph to 150 mph.

5) We plot horsepower vs. top speed by decade

We use the concept of faceting for this purpose.

We see that in the 1960s and the 1970s there is a substantial increase in the
horsepower. In the 1980s, the correlation between the horsepower and the top speed
becomes tighter. Through the 1980s and the 1990s, we see a mild increase in the
horsepower and the top speed.
However, in the 2000s, we see a huge increase in the horsepower and the top speed of
the supercars. The horsepower nearly crosses 750 bhp in the 2000s and even crosses
1000 bhp in the 2010s. This shows the evolution of the supercars in terms of
horsepower and top speed.

C) Which are the fastest cars? How do their top speeds compare with each other? Can
this be shown with the help of a bar graph?
• Plot 6 answers this question.

6) Plotting a bar graph of the top 10 fastest cars.

This graph gives us an overview of the top 10 fastest cars. The bars are arranged in
descending order from the right. Some geom_points have been added to the plot to make
the visualization more attractive. The car_full_nm values are displayed on the x axis.

We see that the fastest car is “SSC Ultimate Aero TT – [2008]”, followed by
“Koenigsegg Agera R 5.0 V8 – [2012]” and so on.

D) How does the acceleration of the supercar (the time it takes to go from 0 mph to
60 mph) vary with the horsepower of the supercar? Does it increase or decrease? What
does the curve look like?
• Plot 7 answers this question.

7) Plotting 0 to 60 times vs. horsepower

The column 0_to_60_times of the dataset gives us the time in seconds it takes for
the supercar to go from 0 mph to 60 mph. We plot a scatterplot of this against the
horsepower. We also include some smoothers in our visualization to make the shape of
the curve easier to see.

Here, we can see that as the horsepower increases, the time taken for the supercar to go
from 0 mph to 60 mph decreases. This is as expected: the more horsepower the car has,
the less time it takes to reach the desired speed.

E) Similarly, how does the acceleration of the car vary with the horsepower_per_ton of
the supercar? Does adding the weight parameter change the plot compared to
question D?
• Plot 8 answers this question.

8) Plotting 0 to 60 times vs. horsepower_per_ton

Similarly, we plot 0_to_60_times vs. the horsepower_per_ton. The graph is as
follows:

Here also, we can see that as the horsepower_per_ton increases, the time taken for the
supercar to go from 0 mph to 60 mph decreases. This graph gives us a more precise and
clear picture of the relation between the 0_to_60_times data and the horsepower_per_ton
data.

F) What is the most common top speed of the different supercars? In what range does this
value lie? Can this be described using a density plot?
• Plot 9 answers this question.

9) Plotting the density plot for top speed in mph

As seen from this density plot, the top speeds are most densely concentrated around
155 mph. This was also shown in the earlier bar chart (plot 2). The density drops to
negligible values for speeds over 250 mph.

4. Conclusion
I carried out an in-depth analysis and exploration of the supercar dataset. I was able
to learn a great deal about the data wrangling and data cleaning process. Various plots
were produced for parameters of the dataset such as top speed, horsepower and
horsepower_per_ton, and these gave some useful insights. We were able to answer the
questions that were posed at the start. We started by plotting the top speed by
decade, which gave us an insight into how the top speed changes over the decades.
Histograms and bar charts were plotted to depict this information. Similarly, the
horsepower vs. the top speed was plotted and shown in the form of facets. This gave us
some insight into how the horsepower affects the speed of the car. We were also able to
identify the most common range of top speed and horsepower of the supercars and to
find where the spike in the top speed occurs. All this analysis and exploration really
helped me to develop my data wrangling and data exploration skills.

5. Reflection
Given my keen interest in supercars, I was lucky to find this interesting dataset
on supercars. This project really helped me to gain some understanding of the data
wrangling and data cleaning process. I started the wrangling / cleaning process by
checking for duplicate records, because duplicate records tend to give faulty results.
The ‘dplyr’ package was used for this purpose. These duplicate records were then
removed using the distinct() function. Next, since there were multiple .txt files, I
used a left outer join to join all the datasets together into one big master dataset,
which then held all the columns of the respective datasets. Later I learnt to use the
mutate() function, which was used to add additional variables to our dataset. I also
learnt how to create frequency tables.

Next, I started the exploration part. I came up with some interesting questions to
explore the data. I learnt how to produce different kinds of plots such as a
scatterplot, a geom_bar plot, a histogram and a density plot. I learnt how to facet by
different variables and show how the trends change over time. This gave me a detailed
understanding of how to use ggplot. I used smoothers in some of the plots to make the
visualization more interesting and to get clearer information from the graph. Overall,
with this exploration project, I was able to learn a great deal about supercars, their
specifications and some interesting facts about them.

6. References
Inside Supercars. (2011). Retrieved from
https://www.thesupercarscollective.com/inside-supercars/

Exotics, Sports Cars, & Supercars | Pics, Reviews, & More. (2018). Retrieved from
https://www.supercars.net/blog/

The Supercar Blog. (2018). Retrieved from http://www.thesupercarblog.com/
