
Project 02: Fuel economy data from 1999 to 2008 for 38 popular models of cars

Statistics
2022-2023

3rd year Bachelor of Civil Engineering


Institute of Technology of Cambodia

Group members:
AN Bronith ID: e20190006
CHEA Puthpisakh ID: e20200343
CHHUON Ratanakvatey ID: e20200131
HAM Chetra ID: e20200546
HOR Songhak ID: e20200010

Submission Date: 30 January, 2023

Lecturers:
Mr. PHOK Ponna (Course)
Mr. TOUCH Sopheak (TD)
1 Introduction
In this project, we apply complete case analysis to the dataset; the variables of interest
are manufacturer, model and cty. We present summary statistics tables giving, for each
sub-brand (model), the number of cars and the mean, variance, standard deviation, minimum
and maximum of city miles per gallon. The aim of this project is to analyze whether city
fuel consumption differs among cars of the same brand.

2 Methods and Materials


2.1 Data Description
The dataset for this project is mpg. It contains a subset of the fuel economy data that the
EPA makes available on fueleconomy.gov. It contains only models which had a new release
every year between 1999 and 2008; this was used as a proxy for the popularity of the car.
The data frame has 234 rows and 11 variables: manufacturer, model, displ
(engine displacement, in litres), year (year of manufacture), cyl (number of cylinders),
trans (type of transmission), drv (the type of drive train, where f = front-wheel drive, r
= rear-wheel drive, 4 = 4wd), cty (city miles per gallon), hwy (highway miles per gallon),
fl (fuel type), and class (type of car).¹
We obtained this dataset with the following R code:

library(ggplot2)
data(mpg)
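
As a quick check of the dimensions stated above, and of whether complete case analysis
actually drops any rows, one could run a sketch like the following. It assumes the ggplot2
package is installed; the names n_rows, n_cols and n_incomplete are illustrative and not
part of the original analysis.

```r
library(ggplot2)  # provides the mpg dataset

data(mpg)

# Dimensions of the data frame (rows, variables)
n_rows <- nrow(mpg)
n_cols <- ncol(mpg)

# Number of rows with at least one missing value;
# complete case analysis would drop exactly these rows
n_incomplete <- sum(!complete.cases(mpg))

cat("rows:", n_rows, "variables:", n_cols,
    "incomplete rows:", n_incomplete, "\n")
```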

2.2 Data Exploration


Two types of graph are used:

1. A histogram of the city miles per gallon for the sub-brands of “Audi”.

2. Two sets of boxplots of the city miles per gallon, for the sub-brands of “Audi” and “Volkswagen”.

2.3 (One-way) ANOVA


An analysis of variance (ANOVA) test is a way to find out whether differences observed in
survey or experiment results are statistically significant. In other words, it helps to
determine whether the null hypothesis should be rejected in favour of the alternative
hypothesis. In essence, groups are compared to see whether they differ; in this case, in
the city fuel economy of cars.
A one-way ANOVA compares the means of two or more independent (unrelated) groups
using the F-distribution. The null hypothesis for the test is that all group means are
equal. Therefore, a significant result means that at least two of the means are unequal. [1]
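
For reference, the F statistic of a one-way ANOVA can be written as follows. The notation
(k groups, n_i observations in group i, N observations in total, group means \bar{x}_i and
overall mean \bar{x}) is standard textbook notation introduced here for illustration; it is
not taken from this report.

```latex
\[
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{\displaystyle \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 \,/\, (k-1)}
         {\displaystyle \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \,/\, (N-k)}
\]
```

Under H0, F follows an F-distribution with k - 1 and N - k degrees of freedom.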
¹ For more details on the dataset, see rdocumentation.org.

2.4 Statistical Software
R is used for the statistical analysis.

3 Results
3.1 Data Exploration
3.1.1 Summary statistics table

Table 1: Summary statistics table for sub-brand of “Audi”

Model        N.   Mean   Variance   Std. Deviation   Min.   Max.
A4           7    18.9   3.48       1.86             16     21
A4 Quattro   8    17.1   3.27       1.81             15     20
A6 Quattro   3    16     1          1                15     17

There were 3 sub-brands of Audi cars. The A4 Quattro had the most observations (8) and the
A6 Quattro the fewest (3), which is too few. City miles per gallon ranged from a minimum
of 15 to a maximum of 21. From Table 1, the A4 had the highest mean (18.9), variance (3.48)
and standard deviation (1.86), although the A4 and A4 Quattro had similar results overall.

Table 2: Summary statistics table for sub-brand of “Volkswagen”

Model        N.   Mean   Variance   Std. Deviation   Min.   Max.
GTI          5    20     4          2                17     22
Jetta        9    21.2   23.7       4.87             16     33
New Beetle   6    24     42.4       6.51             19     35
Passat       7    18.6   3.62       1.90             16     21

There were 4 sub-brands of Volkswagen cars. From Table 2, the Jetta had the most
observations (9) and the GTI the fewest (5). In city miles per gallon, the minimum was 16
(Jetta and Passat) and the maximum was 35 (New Beetle). The Passat had the lowest mean
(18.6) and the New Beetle the highest (24). However, the variances of the Jetta and New
Beetle (23.7 and 42.4, with standard deviations of 4.87 and 6.51 respectively) were far
larger than those of the GTI and Passat (4 and 3.62). This indicates that their data
points are widely spread out from the mean, and from one another.
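
Because the standard deviation is the square root of the variance, the two columns of
Table 2 are on different scales (squared miles per gallon versus miles per gallon), and
spread is best compared across models within the same column. The sketch below uses only
the Variance column of Table 2 (the vector names vw_var and vw_sd are illustrative) and
recovers the Std. Deviation column to within rounding.

```r
# Variances of city miles per gallon, copied from Table 2
vw_var <- c(GTI = 4, Jetta = 23.7, "New Beetle" = 42.4, Passat = 3.62)

# Standard deviation is the square root of the variance
vw_sd <- sqrt(vw_var)

print(round(vw_sd, 2))
```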

3.1.2 Graphs

Figure 1: Histogram of the distribution of the city mile per gallon for sub-brand of “Audi”.

Figure 1 shows the distribution of the city miles per gallon for the sub-brands Audi A4
(blue) and Audi A4 Quattro (red). The histogram covers 15 observations from the 2
sub-brands. It suggests some heterogeneity, as the two sub-brands do not appear to share
the same distribution of city miles per gallon. This indicates that the observations may
be clustered by sub-brand, in which case a finite mixture model may be needed.

Figure 2: Boxplots of the city mile per gallon for sub-brand of “Audi”.

Figure 2 indicates the range in which the middle 50% of all values lie. It shows that the
city miles per gallon of the Audi A4 Quattro is approximately symmetric, consistent with a
normal distribution, while that of the Audi A4 is right-skewed. There are no outliers in
either group.

Figure 3: Boxplots of the city mile per gallon for sub-brand of “Volkswagen”.

Figure 3 indicates the range in which the middle 50% of all values lie. It shows that the
city miles per gallon of the Volkswagen Passat and Volkswagen New Beetle is right-skewed,
while that of the Volkswagen Jetta and Volkswagen GTI is left-skewed. It is worth pointing
out that the Jetta and GTI have the same median. However, the Jetta has one outlier
(around 33) that needs to be addressed before further calculation.
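
One common way to flag such an outlier is the 1.5 × IQR rule that boxplots are based on.
The following is a minimal sketch, assuming the ggplot2 package is installed. Base R's
boxplot.stats() is used here as a stand-in and may compute the quartiles slightly
differently from geom_boxplot(); the names jetta_cty and jetta_out are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

data(mpg)

# City miles per gallon of the Volkswagen Jetta
jetta_cty <- mpg$cty[mpg$manufacturer == "volkswagen" & mpg$model == "jetta"]

# Values lying more than 1.5 * IQR beyond the hinges of the box
jetta_out <- boxplot.stats(jetta_cty, coef = 1.5)$out

print(jetta_out)
```

How a flagged value is then handled (kept, removed, or analysed separately) is a judgement
call that this sketch does not make.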

3.2 Statistical Analysis

Table 3: ANOVA of city mile per gallon for “Audi A4” and “Audi A4 Quattro”.

Source of Variation   SS      df   MS      F      P-value   F crit
Between Groups        11.20   1    11.20   3.33   0.091     4.667
Within Groups         43.73   13   3.36

Test H0 : µ1 = µ2 vs. Ha : µ1 ≠ µ2
From Table 3,
P-value = 0.091 > α = 0.05
Then H0 is not rejected.
Hence, at the 5% significance level there is no evidence that the city fuel consumption of
the “Audi A4” differs from that of the “Audi A4 Quattro”.
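
The F statistic, P-value and critical value in Table 3 can be recomputed from the mean
squares and degrees of freedom with base R. This is a minimal sketch using the rounded
values from Table 3, so the results agree with the table only up to rounding; the variable
names are illustrative.

```r
# Values copied from Table 3
ms_between <- 11.20
ms_within  <- 3.36
df_between <- 1
df_within  <- 13
alpha      <- 0.05

# F statistic: ratio of between-group to within-group mean square
f_stat <- ms_between / ms_within

# P-value: upper tail probability of the F-distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Critical value: H0 is rejected when f_stat exceeds it
f_crit <- qf(1 - alpha, df_between, df_within)

cat("F =", round(f_stat, 2), " p =", round(p_value, 3),
    " F crit =", round(f_crit, 3), "\n")
```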

Table 4: ANOVA of city mile per gallon for sub-brand of “Volkswagen”.

Source of Variation   SS       df   MS      F      P-value   F crit
Between Groups        100.58   3    33.53   1.76   0.184     3.028
Within Groups         439.27   23   19.10

Test H0 : µ1 = µ2 = µ3 = µ4 vs. Ha : at least two µi are different.


From Table 4,
P-value = 0.184 > α = 0.05
Then H0 is not rejected.
Hence, at the 5% significance level there is no evidence that the city fuel consumption
differs among the “Volkswagen GTI”, “Volkswagen Jetta”, “Volkswagen New Beetle”, and
“Volkswagen Passat”.
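
As a consistency check between Table 2 and Table 4, the within-group sum of squares and
its degrees of freedom can be rebuilt from the group sizes and variances. This is a
minimal sketch using the rounded values of Table 2, so the result matches Table 4 only
approximately; the variable names are illustrative.

```r
# Group sizes and variances copied from Table 2
# (order: GTI, Jetta, New Beetle, Passat)
n  <- c(5, 9, 6, 7)
s2 <- c(4, 23.7, 42.4, 3.62)

# Within-group sum of squares: each variance times its (n - 1)
ss_within <- sum((n - 1) * s2)

# Within-group degrees of freedom: total observations minus number of groups
df_within <- sum(n) - length(n)

# Within-group mean square
ms_within <- ss_within / df_within

cat("SS within =", round(ss_within, 2), " df =", df_within,
    " MS within =", round(ms_within, 2), "\n")
```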

4 Conclusion
Using the analysis of variance (ANOVA) test, we tested H0, that the mean city miles per
gallon is the same for each sub-brand, against Ha, that at least two means are different.
We chose a statistical significance level (α) of 5%. For both car brands (Audi and
Volkswagen), H0 was not rejected, so we found no significant difference in city fuel
consumption among the tested sub-brands. However, failing to reject H0 does not give us
enough evidence to conclude that the means are actually equal.

References
[1] Kim, H.Y., 2014. Analysis of variance (ANOVA) comparing means of more than two
groups. Restorative dentistry & endodontics, 39(1), pp.74-77.

[2] Alam, M. (2021) Reading and interpreting summary statistics, Medium. Towards Data
Science. Available at: towardsdatascience.com (Accessed: January 22, 2023).

[3] Test, Chi-square, ANOVA, regression, correlation... Available at: datatab.net
(Accessed: January 22, 2023).

Appendix

Listing 1: R code
library(ggplot2) #for dataset and plotting
library(magrittr) #for pipe operator
library(dplyr) #for group_by

data(mpg) #dataset
View(mpg) #to see the dataset
names(mpg) #to view the variable
mpg_no_na <- na.omit(mpg) #to remove missing data

#select the variables of interest
int_var <- subset(mpg_no_na, select = c(manufacturer, model, cty))

#select brands "audi" and "volkswagen"
audi <- subset(int_var, manufacturer == "audi")
volkswagen <- subset(int_var, manufacturer == "volkswagen")

#summary statistics tables
audi %>%
  group_by(model) %>%
  summarise(Amount = n(), Average = mean(cty), Var = var(cty), Std = sd(cty),
            Min. = min(cty), Max = max(cty))

volkswagen %>%
  group_by(model) %>%
  summarise(Amount = n(), Average = mean(cty), Var = var(cty), Std = sd(cty),
            Min. = min(cty), Max = max(cty))

###select only a4 and a4 quattro
ad_no_a6 <- rbind(subset(audi, model == "a4"),
                  subset(audi, model == "a4 quattro"))
#histogram
ggplot(ad_no_a6, aes(cty, fill = model)) +
  geom_histogram(binwidth = 1, color = "black",
                 alpha = 0.8, position = "identity") +
  scale_fill_manual(values = c("blue", "red")) +
  ggtitle('Histogram of the city mile per gallon for sub-brand of "Audi"') +
  labs(x = "City miles per gallon", y = "Frequency") +
  theme_classic()
ggsave("ad_hist.png", width = 6, height = 5, dpi = 300) #save

#boxplot
##audi
ggplot(ad_no_a6, aes(cty, model, fill = model)) +
  geom_boxplot(show.legend = FALSE) +
  ggtitle('Boxplots of the city mile per gallon for sub-brand of "Audi"') +
  labs(x = "City miles per gallon", y = "Audi's Model") +
  theme_classic()
ggsave("ad_boxplot.png", width = 7, height = 2, dpi = 300) #save
##volkswagen
ggplot(volkswagen, aes(cty, model, fill = model)) +
  geom_boxplot(show.legend = FALSE) +
  ggtitle('Boxplots of the city mile per gallon for sub-brand of "Volkswagen"') +
  labs(x = "City miles per gallon", y = "Volkswagen's Model") +
  theme_classic()
ggsave("vol_boxplot.png", width = 7, height = 3.5, dpi = 300) #save

#anova test
ad_ano <- aov(cty ~ model, data = ad_no_a6)
summary(ad_ano)

vol_ano <- aov(cty ~ model, data = volkswagen)


summary(vol_ano)

The End
