Los Angeles
by
Xiaofei Yan
2008
© Copyright by
Xiaofei Yan
2008
XU, HONGQUAN
Table of Contents
1 Introduction
2 Background
  2.1 Regions
  2.2 Time Variables
    2.2.1 Yearly Data
    2.2.2 Quarterly Data
    2.2.3 Monthly Data
  2.3 Site Code
3 Methodology and Results
  3.1
    3.1.1
    3.1.2 By Region
    3.1.3 By Years
    3.1.4 By Quarter
    3.1.5 By Month
    3.1.6 By Sitecode
  3.2 $\chi^2$ Test
    3.2.1
    3.2.2 Regional Results
    3.2.3 Yearly Results
    3.2.4 Quarterly Results
    3.2.5 Monthly Results
    3.2.6 Sitecode Results
  3.3 Multiple Comparisons
    3.3.1 By Region
    3.3.2 By Year
    3.3.3 By Quarter
    3.3.4 By Month
    3.3.5 By Sitecode
  3.4
    3.4.1 AR(1) Model
    3.4.2
    3.4.3
4
  4.1 Conclusions
  4.2 Discussion
References
List of Figures
2.1 Data by Regions
2.2 Data Structure
3.20 The Means and Medians of AR(1) Coefficients for Eight Years
3.22 The Means and Medians of AR(1) Coefficients for Six Years
3.24 The Means and Medians of AR(1) Coefficients for Three Years
List of Tables
2.3 Data by Years
2.4 Data by Quarters
2.5 Data by Months
Acknowledgments
Many people played an important part in the production of this thesis. Those
who made excellent contributions and suggestions to this work include Professor
Jan de Leeuw; Stanley Hecht, IT consultant at United Oil Company; Mark B.
Gilmartin, lawyer; Jeff Appel, CEO of United Oil Company; and many people whom
I have not met but who helped this work evolve.
I am particularly grateful to Professor Jan de Leeuw, who supervised me
on this project, guided me through the research and gave me many comments
that led to significant improvements, and to Stanley Hecht, who provided
important information during the development of the thesis. Many thanks to
Professor Hongquan Xu and Professor Frederic R. Paik Schoenberg for serving
on my master's thesis committee.
Special thanks to my mother Fei Chen and my father Xingbo Yan, who gave
me so much understanding and support. Without their help, I would not have been
able to live on this beautiful campus and earn an M.S. degree from UCLA.
Last but not least, I would like to express my gratitude to the
Department of Statistics of UCLA and the United Oil Company. The opportunity
they gave me to study and work on this project was a precious experience in
my life and completed my studies at UCLA.
Thanks again to all the friends I met in the United States, Canada and China,
for the joy and laughter you shared with me.
Vita
1983
2001–2005
2005–2007
M.A.S. (Materials Science and Engineering), McMaster University, Hamilton, Ontario, Canada.
2007–2008
Xiaofei Yan
Master of Science in Statistics
University of California, Los Angeles, 2008
Professor Jan de Leeuw, Chair
It is well known in the petroleum industry that fuel expands when the temperature
increases and contracts when the temperature decreases. Gasoline changes in volume
by 1% for every 15 degree F change in temperature, and diesel by 0.6% [1]. A
60 degree Fahrenheit standard has been used for temperature correction of bulk
deliveries of petroleum products. However, because fuel is sold at temperatures
above 60 degrees F in many states, it is said that every year a total of 670 million
additional gallons of gasoline and 90 million additional gallons of diesel are
consumed in the U.S. because of fuel expansion [2]. Laws and rules on installing
temperature-correction kits at dispensers are already on the way in many U.S.
states. However, besides individual consumers, the retailers who purchase fuel in
bulk from wholesalers also experience losses caused by fuel expansion. For the
sake of fairness, it is necessary to find evidence of fuel loss for the retailers in
case of future litigation. The purpose of this project is to determine whether
retailers were experiencing constant fuel loss, and which environmental factors
might influence such loss. It is also interesting to see whether the pattern
CHAPTER 1
Introduction
Fuel expands significantly when the temperature rises and contracts when the
temperature falls. The energy content and the weight of the fuel do not change
with the temperature. The practice of selling fuel without temperature correction
seems deceptive to whoever buys it by volumetric gallon. For this reason, the
temperature effect on volume has been taken into account by the petroleum
industry.
Both gasoline and diesel are less dense than water. Gasoline evaporates easily
when it is exposed to room-temperature air, while diesel does not. The gasoline
sold in California is actually a mixture of ethanol and gasoline, both of which
are volatile. Ethanol is blended with gasoline as an oxygen-enhancing additive
[3]. The volume of the gasoline and ethanol mixture is larger than the
sum of the individual parts [3]. There are no API tables for calculating a volume
correction factor when mixing ethanol with gasoline. Hence the gasoline and
ethanol are blended by ratio onto the truck before the custody transfer, in order
to eliminate the volume expansion effects [3].
In wholesale transactions, a 60 degree Fahrenheit standard is used for
temperature correction of gasoline by gallon [4]. At retail, however, the
situation is different. No technology has been applied to compensate
gasoline retailers in the United States. The effective price per gallon increases
as the amount of gasoline by weight decreases when the temperature exceeds 60
CHAPTER 2
Background
The losses of gasoline and diesel at 106 gas stations across four regions were
recorded monthly from September 2001 to July 2008. The loss of fuel is calculated as in Equation 2.1,
\[ \text{LOSS OF FUEL} = \text{PURCHASE} + \text{START} - \text{END} - \text{SALES}, \tag{2.1} \]
where PURCHASE is the amount of fuel the gas station bought
during the month, START and END are the amounts of fuel stored at the beginning and the end of the month, and SALES is the amount of fuel sold
during that month.
The fuel loss rate was then calculated as a percentage of the monthly purchased amount, as seen in Equation 2.2,
\[ \text{LOSS RATE} = \frac{\text{LOSS}}{\text{PURCHASE}} \times 100. \tag{2.2} \]
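The two definitions can be illustrated with a minimal Python sketch. It assumes the sign convention loss = purchase + start - end - sales, which is a reading of the extracted equation rather than something the source spells out; the function names and the sample figures are hypothetical.

```python
def fuel_loss(purchase, start, end, sales):
    """Monthly fuel loss in gallons (assumed sign convention, see lead-in)."""
    return purchase + start - end - sales


def loss_rate(purchase, start, end, sales):
    """Monthly loss as a percentage of the purchased amount."""
    if purchase == 0:
        raise ValueError("loss rate is undefined when nothing was purchased")
    return fuel_loss(purchase, start, end, sales) / purchase * 100


# Hypothetical month: 10,000 gal bought, 2,000 gal in stock at the start,
# 1,500 gal in stock at the end, 10,470 gal sold.
example_rate = loss_rate(10_000, 2_000, 1_500, 10_470)
print(f"loss rate: {example_rate:.2f}%")
```

A positive rate means fuel went missing during the month; a negative rate means the station sold or held more than its books account for.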
These gas stations had different operating periods. Some have operated
since 2001, some started later, some shut down permanently, and some reopened and
are still running. Gasoline and diesel are not both sold at every gas
station. Generally, gasoline was sold more widely among these gas stations
than diesel. Some of these gas stations even switched from diesel to gasoline or
vice versa.
Hence, missing data is a big problem for this data set, especially for diesel.
We will first examine whether gasoline had more loss than diesel; if this
proves to be true, we can use the ARIMA time series model
to handle the missing values.
2.1 Regions
The regions involved are Los Angeles, Santa Barbara, San Diego and Orange
County. Almost 75% of the monthly records are from Los Angeles. The number
of gas stations in each region is listed in Table 2.1 and Figure 2.1.
Table 2.1: Regional Data Distribution
Regions                 LA     SD     SB     OC          Total
Monthly observations    5287   1226   702    423         7638
Gas stations            72     18     10     [missing]   106
[Figure 2.1: Data by Regions (LA, SD, OC, SB)]
Looking at the gasoline and diesel data separately, their structures are
presented in Table 2.2, Figure 2.2 and Figure ?? respectively.
Table 2.2: Data of Gasoline and Diesel
Regions                   LA     SD     SB          OC          TOTAL
Observations by Month
  Gasoline                5087   1176   666         422         7351
  Diesel                  2359   793    119         409         3680
Gas stations
  Gasoline                72     18     10          [missing]   106
  Diesel                  34     12     [missing]   [missing]   54
[Figure 2.2: Data Structure (regions LA, SD, SB, OC)]
2.2 Time Variables
The data can be divided into groups according to the year, the quarter and the
month in which it was recorded. Determining the effects of the time variables on
the fuel loss is an essential task of this project. Details are given in the tables
and graphs.
2.2.1 Yearly Data
According to the year in which it was recorded, the data was grouped into 8
subsets, as seen in Table 2.3. This lets us see the effect of
the year on the fuel loss rate. If there is a significant difference among
these years, we can study further how and why a given year influenced
the fuel loss rate compared to the others. If not, we can exclude the year
variable from further analysis.
Since the data covers the period from September 2001 to July 2008, the
subset for 2001 is substantially smaller than the others. The subset for 2008
is also smaller than those of the other years, but much larger than that of 2001.
This will be considered in the next step of the analysis.
Table 2.3: Data by Years
Year                     2001   2002   2003   2004   2005   2006   2007   2008
Number of observations   352    881    969    1040   1076   1129   1197   707
2.2.2 Quarterly Data
Our first suspicion is that temperature is one of the major factors in the
fuel loss rate. This leads us directly to look for any seasonal effect that existed
among the fuel loss rates. Hence, the data was divided into 4 groups according
to the time variable, as seen in Table 2.4. The first quarter runs from January
to March, the second quarter from April to June, the third quarter from
July to September, and the last quarter from October to December.
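The quarter assignment described above amounts to a simple integer mapping. The sketch below is illustrative only; the function name and the sample records are hypothetical and are not taken from the thesis data.

```python
from collections import Counter


def quarter_of(month):
    """Map a calendar month (1-12) to its quarter (1-4)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return (month - 1) // 3 + 1


# Hypothetical monthly records given as (year, month) pairs.
records = [(2001, 9), (2001, 10), (2002, 1), (2002, 4), (2002, 7)]
counts = Counter(quarter_of(month) for _, month in records)
print(sorted(counts.items()))
```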
Table 2.4: Data by Quarters
Seasons    quarter 1   quarter 2   quarter 3   quarter 4
Gasoline   1842        1890        1782        1837
Diesel     920         968         880         912
From Table 2.4 it can be seen that the number of observations in each
quarter was generally even over the years.
2.2.3 Monthly Data
In order to see any significant effects of particular months, the data was also
analyzed by the month in which it was recorded, as seen in Table 2.5. The
numbers of observations are quite even over the 12 months.
Table 2.5: Data by Months
[Only eight of the twelve monthly counts per fuel survive, without their month labels.]
Gasoline   611   640   633   545   604   605   615   617
Diesel     330   328   317   273   290   302   302   308
2.3 Site Code
[Figure: Number of Observations (y-axis) by Gas Stations (x-axis, 0 to 100)]
[Figure: Number of Observations (y-axis) by Gas Stations (x-axis, 0 to 50)]
CHAPTER 3
Methodology and Results
In order to detect the effects of the variables on the fuel loss rates, we used two
methods to test the equality of the average fuel loss rates. Under the hypothesis
that the mean fuel loss rates of the different groups are all equal, as seen in
Equation 3.1,
\[ H_0 : \mu_1 = \ldots = \mu_k, \tag{3.1} \]
we used the $\chi^2$ test to test the validity of the hypothesis and the Tukey method to
determine which pairs of groups differed most.
3.1
The distributions of the fuel loss rates can be roughly described by their means,
variances and standard deviations. Statistical analysis of the data by
region, year, quarter and month can indicate the effects
of these factors on fuel loss rates.
3.1.1
Table 3.1 lists the means, variances and standard deviations of gasoline and
diesel. The 95% confidence intervals were calculated as [-0.21, 0.85] for gasoline
and [-1.67, 1.29] for diesel respectively.
Table 3.1
                      Gasoline   Diesel
Means                 0.3157     -0.1912
Variances             0.0730     0.5678
Standard deviations   0.2702     0.7536
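The bracketed intervals quoted in this chapter appear to be consistent with mean ± 1.96 standard deviations. The source does not state the formula, so the sketch below rests on that assumption; the function name is hypothetical and the inputs are the means and standard deviations of Table 3.1.

```python
def normal_range(mean, sd, z=1.96):
    """Return (mean - z*sd, mean + z*sd); z = 1.96 is the 95% normal quantile."""
    half_width = z * sd
    return mean - half_width, mean + half_width


# Means and standard deviations as listed in Table 3.1.
for label, mean, sd in [("gasoline", 0.3157, 0.2702), ("diesel", -0.1912, 0.7536)]:
    low, high = normal_range(mean, sd)
    print(f"{label}: [{low:.2f}, {high:.2f}]")
```

Under this reading the interval describes the spread of individual monthly loss rates, not the uncertainty of the mean itself, which would be far narrower with thousands of observations.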
The distributions of the gasoline and diesel loss rates can also be described
by box plots, as seen in Figure 3.1. The box plots show the locations of the
median and the first and third quartiles of the data. They also show the dispersion
of the data by presenting the possible outliers. Horizontal lines are drawn at
the median and at the 25% and 75% quartiles, and are joined by vertical lines to
produce the box. The whiskers extend from the quartiles to
the most extreme data points that are no more than one and a half times the
interquartile range from the box. The data beyond the whiskers
are marked as asterisks or dots and are usually considered outliers [6]. The
graphs show that the distributions of gasoline and diesel were generally
symmetric. The whole box of gasoline was above zero, while the median of diesel
was below zero. The number and distribution of outliers in the plots indicate
that the diesel data were more dispersed than the gasoline data, which is consistent
with the large variance of the diesel data.
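The box-plot rule described above (whiskers reaching the most extreme points within 1.5 interquartile ranges of the box, everything beyond flagged as an outlier) can be sketched as follows. The quartile convention is an assumption, since plotting packages differ slightly, and the sample values are hypothetical.

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Quartiles, whisker ends and outliers under the k * IQR rule."""
    # "inclusive" quartiles are one common convention; others give slightly
    # different cut points on small samples.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


# Hypothetical monthly loss rates (%) for one station.
summary = boxplot_summary([0.1, 0.2, 0.3, 0.4, 0.5, 3.0])
print(summary)
```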
[Figure 3.1: box plots of loss rates for Gasoline and Diesel]
3.1.2 By Region
Table 3.2 presents the means, variances and standard deviations of gasoline
for the four regions. The 95% confidence intervals were calculated as [-0.15, 0.85]
for Los Angeles, [-0.32, 0.82] for Santa Barbara, [-0.40, 0.83] for San Diego, and
[0.03, 0.68] for Orange County. The distributions of the regional gasoline loss
rates are presented by box plots in Figure 3.2.
All the boxes in Figure 3.2 were above zero. The medians
and means of gasoline loss rates were all positive. Moreover, the medians and
means of gasoline loss rates of Los Angeles and Orange County were close to each
other and were larger than those of Santa Barbara and San Diego. The variances
of gasoline loss rates of Santa Barbara and San Diego were larger than those of
the other two regions. Besides, it was also noticed that most of the data points
Table 3.2
                     LA       SD       SB       OC
Mean                 0.3454   0.2113   0.2489   0.3539
Variance             0.0652   0.0986   0.0855   0.0274
Standard deviation   0.2553   0.3140   0.2923   0.1655
[Figure 3.2: box plots of gasoline loss rates by region (LA, SB, SD, OC)]
were also much larger than those of gasoline. The data points were distributed
more evenly around zero.
Table 3.3: Regional Means and Variances of Diesel Data
                     LA        SD        SB        OC
Mean                 -0.1771   -0.2309   -0.2022   -0.1923
Variance             0.6190    0.6534    0.1777    0.2187
Standard deviation   0.7868    0.8084    0.4215    0.4677
[Figure 3.3: box plots of diesel loss rates by Region (LA, SB, SD, OC)]
3.1.3 By Years
The yearly means and variances of gasoline and diesel are shown in Table
3.4. The 95% confidence intervals of gasoline were calculated as [-0.48, 0.79]
for 2001, [-0.23, 0.73] for 2002, [-0.19, 0.85] for 2003, [-0.10, 0.87] for 2004,
[-0.23, 0.96] for 2005, [-0.24, 0.88] for 2006, [-0.17, 0.80] for 2007, and [-0.12, 0.69]
for 2008. The distributions of the yearly gasoline loss percentages are shown in
Figure 3.4.
The boxes in Figure 3.4 were all above zero except the one for 2001.
There were significant differences among the distributions of the data points for
2001, 2002 and 2003. The median was largest in
2004.
Table 3.4: Yearly Means and Variances of Gasoline
Year                 2001     2002     2003     2004
Mean                 0.1565   0.2499   0.3276   0.3817
Variance             0.1039   0.0591   0.0704   0.0611
Standard deviation   0.3223   0.2432   0.2654   0.2473
Year                 2005     2006     2007     2008
Mean                 0.3623   0.3176   0.3186   0.2848
Variance             0.0927   0.0819   0.0614   0.0430
Standard deviation   0.3045   0.2862   0.2479   0.2073
Similarly, with the means, variances and standard deviations in Table 3.5,
the 95% confidence intervals of the yearly diesel data were calculated as [-3.18, 2.50]
for 2001, [-1.26, 0.85] for 2002, [-2.14, 1.74] for 2003, [-1.29, 1.04] for 2004,
[-1.99, 1.68] for 2005, [-1.27, 0.85] for 2006, [-1.25, 0.90] for 2007, and [-1.29, 0.81]
for 2008. The distributions of the yearly diesel loss rates are shown in Figure
3.5.
Compared with the gasoline data, the means and medians of the diesel data were
negative. From the box plots, it can be seen that most of the data points tended
to be distributed below zero. The variances of the diesel data were much larger than
[Figure 3.4: box plots of yearly gasoline loss rates, 2001 to 2008]
Table 3.5
Year                 2001      2002      2003      2004
Mean                 -0.3413   -0.2053   -0.2001   -0.1270
Variance             2.0994    0.2873    0.9763    0.3518
Standard deviation   1.4489    0.5360    0.9881    0.5932
Year                 2005      2006      2007      2008
Mean                 -0.1581   -0.2070   -0.1738   -0.2365
Variance             0.8782    0.2920    0.3027    0.2872
Standard deviation   0.9371    0.5404    0.5502    0.5359
[Figure 3.5: box plots of yearly diesel loss rates, 2001 to 2008]
3.1.4 By Quarter
The quarterly means, variances and standard deviations of gasoline are shown
in Table 3.6. The 95% confidence intervals of gasoline were calculated as [-0.16, 0.76]
for the 1st quarter, [-0.30, 0.91] for the 2nd quarter, [-0.21, 0.89] for the 3rd quarter,
and [-0.17, 0.81] for the 4th quarter. The distributions of the quarterly gasoline
loss rates are illustrated in Figure 3.6. Again, both the means and medians of
gasoline were positive. The difference in quarterly gasoline loss rates was
largest between the 2nd and 3rd quarters.
Table 3.6
Seasons              quarter 1   quarter 2   quarter 3   quarter 4
Means                0.2965      0.3072      0.3397      0.3206
Variances            0.0550      0.09524     0.0777      0.0626
Standard deviation   0.2346      0.3086      0.2788      0.2502
[Figure 3.6: box plots of quarterly gasoline loss rates (x-axis: Seasons, quarter 1 to quarter 4)]
data.
Table 3.7: Quarterly Means and Variances of Diesel
Seasons              quarter 1    quarter 2   quarter 3   quarter 4
Means                -0.2041087   -0.1923     -0.1721     -0.1953
Variances            0.3283       0.6104      0.9806      0.3673
Standard deviation   0.5730       0.7813      0.9902      0.6061
[Figure 3.7: box plots of quarterly diesel loss rates (x-axis: Seasons, quarter 1 to quarter 4)]
3.1.5 By Month
The monthly means, variances and standard deviations of gasoline are given in Table
3.8. The 95% confidence intervals of gasoline are listed in Table 3.9. The
distributions of the monthly gasoline loss rates are illustrated in Figure 3.8.
Again, all the monthly gasoline loss rates had positive means and
medians. Both the means and the medians were largest in August.
Table 3.8: Monthly Means and Variances of Gasoline
Months                1           2           3           4        5        6
Means                 0.2948      0.3034      0.2911      0.3062   0.3075   0.3078
Variances             0.0463      0.0564      0.0625      0.1027   0.0869   0.0967
Standard deviations   [missing]   [missing]   [missing]   0.3205   0.2949   0.3110
Months                7           8           9           10       11       12
Means                 0.3482      0.3530      0.3187      0.3178   0.3249   0.3192
Variances             0.0727      0.0800      0.0806      0.0693   0.0707   0.0482
Standard deviations   [missing]   [missing]   [missing]   0.2632   0.2659   0.2196
Table 3.9
Month   95% confidence interval
1       [missing]
2       [ -0.16, 0.77 ]
3       [ -0.20, 0.78 ]
4       [ -0.32, 0.93 ]
5       [missing]
6       [ -0.30, 0.92 ]
7       [ -0.18, 0.88 ]
8       [ -0.20, 0.91 ]
9       [missing]
10      [ -0.20, 0.83 ]
11      [ -0.20, 0.85 ]
12      [ -0.11, 0.75 ]
[Figure 3.8: box plots of monthly gasoline loss rates, months 1 to 12]
Table 3.10 (monthly diesel loss rates)
Months           1         2         3         4         5         6
Means            -0.2645   -0.1734   -0.1756   -0.1624   -0.1757   -0.2374
Variances        0.4437    0.1952    0.3442    0.2247    0.2115    1.3769
Std. deviation   0.6661    0.4418    0.5867    0.4740    0.4599    1.1734
Months           7         8         9         10        11        12
Means            -0.1491   -0.2595   -0.1151   -0.2430   -0.1116   -0.2307
Variance         0.3069    2.4646    0.3164    0.6774    0.1449    0.2734
Std. deviation   0.5540    1.5699    0.5624    0.8231    0.3806    0.5228
Table 3.11 (95% confidence intervals of monthly diesel loss rates)
Month   95% confidence interval
1       [missing]
2       [ -1.04, 0.69 ]
3       [ -1.33, 0.97 ]
4       [ -1.09, 0.77 ]
5       [missing]
6       [ -2.54, 2.06 ]
7       [ -1.23, 0.94 ]
8       [ -3.34, 2.82 ]
9       [missing]
10      [ -1.85, 1.37 ]
11      [ -0.86, 0.63 ]
12      [ -1.26, 0.79 ]
[Figure 3.9: box plots of monthly diesel loss rates, months 1 to 12]
3.1.6 By Sitecode
The distributions of the gasoline loss rates for the 106 gas stations are illustrated
in Figure 3.10. Most of the gas stations had positive medians of gasoline loss rates,
except gas station number 130, whose box was the only one below zero
in Figure 3.10. It was found that this gas station sold gasoline for only
three months and ended with negative loss rates for that period.
[Figure 3.10: box plots of gasoline loss rates by Site Code (axis ticks from 101 to 532)]
[Figure 3.11: box plots of Diesel Loss Percentages by Site Code (axis ticks from 102 to 530)]
3.2 $\chi^2$ Test
With the means of fuel loss rates $m_i$, the corresponding variances $s_i^2$ and the group sizes $n_i$, we
minimize
\[ c(\mu) = \frac{n_1}{s_1^2}(m_1 - \mu)^2 + \frac{n_2}{s_2^2}(m_2 - \mu)^2 + \ldots + \frac{n_k}{s_k^2}(m_k - \mu)^2 \tag{3.2} \]
over $\mu$, where $k$ is the number of regions, years, quarters, months or gas station site
codes [7]. The minimized $c(\hat{\mu})$ should be asymptotically $\chi^2$ with $k - 1$ degrees
of freedom if the $k$ groups have the same mean. In this way, both $\hat{\mu}$ and
$c(\hat{\mu})$ were obtained, and a corresponding p-value was computed from the $\chi^2$
statistic to decide whether to reject the hypothesis [8].
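The test can be sketched in a few lines of Python. Setting the derivative of the criterion to zero gives the minimizer as the precision-weighted mean, with weights $n_i / s_i^2$. The function names are hypothetical, the tail probability is computed with a standard recurrence for integer degrees of freedom to keep the sketch dependency-free, and the example inputs are the gasoline and diesel summaries of Table 3.1 with the observation counts of Table 2.2.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with integer df >= 1."""
    z = x / 2.0
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(z))  # Q(1/2, z)
    else:
        a, q = 1.0, math.exp(-z)             # Q(1, z)
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def equal_means_test(means, variances, sizes):
    """Return (mu_hat, c(mu_hat), p-value) for the hypothesis of equal group means."""
    weights = [n / v for n, v in zip(sizes, variances)]
    mu_hat = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    c = sum(w * (m - mu_hat) ** 2 for w, m in zip(weights, means))
    return mu_hat, c, chi2_sf(c, len(means) - 1)


# Gasoline vs. diesel: means and variances from Table 3.1, counts from Table 2.2.
mu_hat, c, p = equal_means_test([0.3157, -0.1912], [0.0730, 0.5678], [7351, 3680])
print(f"mu_hat = {mu_hat:.4f}, c = {c:.1f}, p = {p:.3g}")
```

Because the inputs are rounded table values, the output should land close to, but not exactly on, the figures reported in Table 3.12.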
3.2.1
It was observed that the gasoline loss rates tended to be positive while the diesel
loss rates tended to be negative. It had been suspected that the loss of
gasoline per gallon could be higher than that of diesel. Unknown chemical and
physical interactions between gasoline and ethanol might lead to undetermined
volumetric changes. Comparing the loss rates of gasoline and diesel
sold at the same gas stations would help reveal the facts.
The $\chi^2$ test results are presented in Table 3.12. The p-value was almost 0,
which indicates an extremely significant result and a rejection of the null hypothesis
that gasoline and diesel experienced the same volumetric loss rates.
Table 3.12: Chi-square on Brands of Oil
Results   $\hat{\mu}$   $c(\hat{\mu})$   p-value
          0.2851        1564.610         [missing]
3.2.2 Regional Results
As illustrated in Table 3.2 and Table 3.3, the gasoline loss rates had
significant regional variations, while the diesel loss rates did not. The $\chi^2$ tests on the regional
gasoline and diesel data are listed in Table 3.13. The small p-value of the gasoline data indicated that there were detectable differences among the gasoline loss rates of
the 4 regions. The much larger p-value of the diesel data showed that at a significance
level of 4% there were no detectable differences among the diesel loss rates of the 4
regions.
Table 3.13: Chi-square test of regional data
Results    $\hat{\mu}$   $c(\hat{\mu})$   p-value
Gasoline   0.3258        244.6932         9.1950e-53
Diesel     -0.1917       2.7471           0.4323
Table 3.14
Results      $\hat{\mu}$   $c(\hat{\mu})$   p-value
LA vs. SB    0.3367        66.0619          4.3700e-16
LA vs. OC    0.3468        0.9166           0.3383
LA vs. SD    0.3278        186.1303         2.2236e-42
SB vs. OC    0.3186        57.04805         4.2529e-14
SB vs. SD    0.2262        6.6531           0.0099
OC vs. SD    0.2917        136.6207         1.4596e-31
3.2.3 Yearly Results
The fuel losses varied from year to year. The $\chi^2$ test was performed on the 8-year data
to find out whether any detectable difference existed. The p-value of the gasoline data
indicated significant variation in losses among the eight years.
The much larger p-value of the diesel data indicated that no
detectable difference existed at a significance level of 5%.
Table 3.15: Chi-square test on yearly data
Results    $\hat{\mu}$   $c(\hat{\mu})$   p-value
Gasoline   0.3146        267.3266         5.654e-54
Diesel     -0.1884       12.51671         0.0848
3.2.4 Quarterly Results
Because the data recorded in 2001 ran only from September to December, and
the means and variances of 2001 were relatively lower than the others, as
seen in Table 3.4, the equality of the quarterly loss rates was tested to see whether any
seasonal effects existed. The small p-value in Table 3.16 indicated
significant differences among the quarterly gasoline loss rates, and hence a
seasonal effect on gasoline loss rates. The large p-value of the diesel data indicated
that there were no detectable differences among the quarterly diesel loss rates and
hence no seasonal effects on the diesel loss rates.
Table 3.16: Chi-square test on quarterly data
Results    $\hat{\mu}$   $c(\hat{\mu})$   p-value
Gasoline   0.3150        27.2049          3.5343e-07
Diesel     -0.1972       0.7769           0.8550
3.2.5 Monthly Results
Detecting any surges of fuel loss rates in certain months, and finding out how
these months differ, would help determine the causes of fuel loss. In
Table 3.17, the small p-values of gasoline and diesel indicated that large differences existed among the monthly loss rates. The much smaller p-value of gasoline
indicated that the result for the gasoline data was more strongly significant than that for
diesel. The variation among the gasoline loss rates should be relatively larger
than that among the diesel loss rates when measured at a given significance level.
Table 3.17: Chi-square test on monthly data
Results    $\hat{\mu}$   $c(\hat{\mu})$   p-value
Gasoline   0.3146        34.3155          0.0003
Diesel     -0.1705       25.0979          0.0088
3.2.6 Sitecode Results
Determining whether all the gas stations were experiencing the same losses would help
detect any specific gas station that had significant loss compared with the others.
The $\chi^2$ test results are given in Equation 3.3 for gasoline
and Equation 3.4 for diesel. Both were
extremely significant, and there is no doubt that the losses of the gas stations
were statistically different. Looking at the box plots in Figure 3.10 and Figure
3.11 and at the means of the fuel loss rates, it was found that gas stations
101, 102, 108, 160 and 182 had the highest positive gasoline loss rates, while gas
stations 146, 177 and 537 had the most negative diesel loss rates.
(3.3)
$\hat{\mu} = 0.0526$, $c(\hat{\mu}) = 1310.876$, p-value $= 0$
(3.4)
3.3 Multiple Comparisons
After testing the equality of data groups, it is interesting to continue by testing the
pairwise differences between the groups that had been identified as different. We
used the Tukey method to perform the analysis [9], as given in Equation 3.5.
The t statistic to test $\mu_i = \mu_j$ vs. $\mu_i \neq \mu_j$ was
\[ t_{ij} = \frac{m_j - m_i}{\hat{\sigma}\sqrt{\frac{1}{n_j} + \frac{1}{n_i}}}, \tag{3.5} \]
where $\hat{\sigma}$ is the estimated standard deviation and $q_{k,N-k,\alpha}$ is the upper $\alpha$ quantile of the Studentized range distribution,
with parameters $k$ and $N - k$ degrees of freedom.
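A minimal sketch of the pairwise statistic follows. The definition of the standard deviation estimate did not survive in the text, so the code assumes the usual pooled within-group estimate; that choice, the function names and the use of rounded table values (regional gasoline means and variances from Table 3.2, counts from Table 2.2) are all assumptions.

```python
import math


def pooled_sd(variances, sizes):
    """Pooled within-group standard deviation (an assumed choice of sigma-hat)."""
    numerator = sum((n - 1) * v for v, n in zip(variances, sizes))
    return math.sqrt(numerator / (sum(sizes) - len(sizes)))


def pairwise_t(means, sizes, sigma, i, j):
    """t statistic for group j minus group i."""
    return (means[j] - means[i]) / (sigma * math.sqrt(1 / sizes[j] + 1 / sizes[i]))


# Regional gasoline data in the order LA, SD, SB, OC.
means = [0.3454, 0.2113, 0.2489, 0.3539]
variances = [0.0652, 0.0986, 0.0855, 0.0274]
sizes = [5087, 1176, 666, 422]

sigma = pooled_sd(variances, sizes)
t_la_sb = pairwise_t(means, sizes, sigma, 0, 2)
print(f"sigma-hat = {sigma:.4f}, t(LA vs. SB) = {t_la_sb:.2f}")
```

With these rounded inputs the LA vs. SB statistic comes out near the value listed in Table 3.18, which suggests the pooled estimate is at least close to what was used.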
3.3.1 By Region
At the 10% significance level, if $|t_{ij}| > 3.1143$, $\mu_i$ differs from $\mu_j$. Comparing with the
numbers in Table 3.18, we concluded that the pairs LA and SB, LA and SD,
SB and OC, and OC and SD were significantly different, while the pairs LA and
OC, and SB and SD were close.
Table 3.18: Tukey test on regional data
LA vs. SB   LA vs. OC   LA vs. SD   SB vs. OC   SB vs. SD   OC vs. SD
-8.8563     0.6298      -15.6679    6.3784      -2.9282     -9.4961
3.3.2 By Year
Since the $\chi^2$ tests on the yearly diesel loss rates indicated that there was not much detectable difference among the yearly data, we applied the Tukey test only to the
gasoline data. The Tukey results are presented in Table 3.19, where 1
stands for 2001, 2 stands for 2002 and so on. The loss rates of 2004 and
2005 were close, and the loss rates of 2003, 2006 and 2007 were close. The loss
rates of 2001, 2002 and 2008 were each significantly different from those of all the other
years. Combining this with the numbers in Table 3.4, it
can be seen that the gasoline loss rates increased from 2001 to 2003 and
peaked around 2004 and 2005. Afterwards, they dropped during 2006,
when the gasoline loss rates were close to those of 2003. After 2007, the
level of gasoline loss rates went down only a little in 2008.
Table 3.19: Tukey test on yearly data
1 vs. 2   1 vs. 3   1 vs. 4   1 vs. 5   1 vs. 6   1 vs. 7   1 vs. 8
5.5984    10.3918   13.8015   12.6674   9.9717    10.1019   7.4340
2 vs. 3   2 vs. 4   2 vs. 5   2 vs. 6   2 vs. 7   2 vs. 8
6.3090    10.8770   9.3494    5.6887    5.8458    2.6129
3 vs. 4   3 vs. 5   3 vs. 8   4 vs. 6   4 vs. 7   4 vs. 8
4.5768    2.9603    -3.2706   -5.6379   -5.6274   -7.51010
5 vs. 6   5 vs. 7   5 vs. 8   6 vs. 8   7 vs. 8
-3.9685   -3.9349   -6.0494   -2.5812   -2.6894
3.3.3 By Quarter
Similarly, since the $\chi^2$ tests on the quarterly diesel loss rates indicated that
no detectable differences existed, we applied the Tukey test only to the
gasoline data. The results listed in Table 3.20 showed that the quarterly
gasoline losses were mutually close, except for the pairs quarter 1 and quarter 3,
and quarter 2 and quarter 3. Combining this with the figures in Table
3.6, it can be concluded that gasoline loss rates were stable during the 1st and
2nd quarters, increased significantly in the 3rd quarter, and dropped a little
in the 4th quarter, returning to a level close to that of the 2nd
quarter.
Table 3.20: Tukey test on quarterly data
4.9150   3.3493
3.3.4 By Month
Since the p-values of both the gasoline and diesel data were small in the $\chi^2$ test,
Tukey tests were performed on the monthly gasoline data and the monthly diesel data
respectively. The test results for the monthly gasoline data are shown in Table
3.21. Combining the results in Table 3.21 with the numbers in Table 3.8, it
can be concluded that the gasoline loss rates were stable from January to June,
increased in July and August, then dropped a little in September and
stayed stable afterwards. The results of the Tukey test on the diesel data showed that
no pair was significantly different.
Table 3.21: Tukey test on monthly data
1 vs. 8   3 vs. 7   3 vs. 8
3.7331    3.9688    3.8030
3.3.5 By Sitecode
As concluded from Equation 3.3 and Equation 3.4, the fuel loss rates varied significantly among different gas stations. Tukey results showed that, at the significance
level of 1%, large differences in gasoline loss rates existed among 2220 pairs of gas
stations. Only the pair of gas stations 160 and 177 was significantly different in
diesel loss rates at the significance level of 5%.
3.4
It is interesting to do some time series analysis on the monthly recorded gas loss
data. As there were too many discontinuities in the observations for diesel, the
work were focused on the gasoline data.
The gasoline loss rates of Los Angeles area from the year 2001 to the year
2008 were presented in the first graph of the Figure 3.12. The corresponding
zoomed plot was presented in the second graph of the Figure 3.12. The graphs
described the loss rates of eight randomly selected gas stations for Los Angeles
area. It can be seen that some gas station had more than one peak, and some
had its negative peak just followed by its positive peak. It was also noticed that
the gasoline loss rate drop and rise within one year. Although the fluctuations
of the curves, an rough increasing trend can be observed from the year 2001 to
the year 2003. The peaks of the curves concentrated during period of 2004 and
2005, especially the middle of the year 2005.
The gasoline loss rates of most of the gas stations in Santa Barbara area were
shown in the Figure 3.13. All the peaks were in the summer months of that
year, and most of them were in the year 2004 and 2005. It was observed that
at some months of the year, symmetric peaks appeared together at the two sides
32
of zero. It was suspected that, because of a false record of gasoline loss in the
previous month, it had been corrected by a corresponding gain immediately in
the following month. Taking these pairs of symmetric peaks out of the curves,
an increasing trend of gasoline loss rate can be observed from the year 2001 to
2003.
The gasoline loss rates of gas stations in San Diego area were presented in
the Figure 3.14. Besides the similar fluctuations and symmetric peaks observed
in Figure 3.13, a decreasing trend of gasoline loss rates was observed during the
period of 2001 to 2003. The curves fluctuated at zero and slightly rose up and
peaked between 2004 and 2005.
In Figure 3.15, an increasing trend was observed in the gasoline loss rates of
gas stations in the Orange County area. The curves peaked mostly in 2004 and
2005, and significant increases in gasoline loss rates were generally observed
in the summer months.
From the observations above, it can be concluded that gasoline loss rates were
similar in the Los Angeles and Orange County areas and differed from those in
the Santa Barbara and San Diego areas.
Figure 3.12: Gasoline Loss Rate of Gas Stations in Los Angeles Area
[Two panels, full view and zoomed view; x-axis: Time; stations shown: 101, 102, 104, 107, 118, 144, 152, 508]
Figure 3.13: Gasoline Loss Rate of Gas Stations in Santa Barbara Area
[Two panels, full view and zoomed view; x-axis: Time; stations shown: 112, 136, 147, 153, 156, 516]
Figure 3.14: Gasoline Loss Rate of Gas Stations in San Diego Area
[Two panels, full view and zoomed view; x-axis: Time; stations shown: 132, 517, 518, 519, 521, 523, 524, 526]
[Figure 3.15, caption not recovered: two panels, full view and zoomed view; x-axis: Time; stations shown: 127, 158, 159, 162, 515]
3.4.1
AR(1) Model
$$Y_t = \phi Y_{t-1} + e_t, \tag{3.6}$$
The innovation term $e_t$ is uncorrelated with the past values $Y_{t-1}, Y_{t-2}, \ldots$
and accounts for whatever is new in the series at time $t$. In applying the
AR(1) model, we focus on the coefficient $\phi$ of the model and on the
differences among the $\phi$'s of the gas stations.
As seen in Figure 3.16, the median of $\phi$ was 0.75, and the range of the
coefficients was [0.1, 0.98]. The upper and lower quartiles were within the range
of [0.5, 1.0]. It can be concluded that the gasoline loss rates were influenced by
their past values and that most of the series were positively autocorrelated.
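As an illustration of how the coefficient in Equation 3.6 can be estimated for a single station, the following is a minimal sketch using ordinary least squares without an intercept. It is not necessarily the estimator used in this study (a time series package may use a different method), the function names are hypothetical, and the series is simulated rather than taken from the gas station data.

```python
import random


def fit_ar1(y):
    """Least-squares estimate of phi in Y_t = phi * Y_{t-1} + e_t (no intercept)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den


def simulate_ar1(phi, n, seed=0):
    """Simulate n values of an AR(1) series with standard normal innovations."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y


# Simulated series with a true coefficient equal to the median reported above.
y = simulate_ar1(phi=0.75, n=2000)
print(round(fit_ar1(y), 2))
```

Repeating the fit for each station and collecting the estimates gives the kind of distribution of coefficients that a box plot can summarize.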
3.4.2
Taking the effect of months into consideration, we regressed the gasoline loss
rates on the twelve months. The month indicator was set as $X_i = 1$ if the
observation fell in the $i$th month and 0 otherwise, and $m_i$ is the coefficient
of the variable $X_i$. The model had thirteen variables and is given in Equation 3.7.
$$Y_t = \phi Y_{t-1} + \sum_{i=1}^{12} m_i X_i + e_t, \tag{3.7}$$
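A model of the form of Equation 3.7 can be fitted by least squares on thirteen regressors: the lagged loss rate and twelve month indicators. The sketch below is one possible way to do this in plain Python and is not claimed to be the procedure used in this study. The helper names, the simulated data, and the true coefficients (including the August bump) are assumptions made for illustration.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


def fit_ar1_with_months(y, months):
    """Fit Y_t = phi*Y_{t-1} + sum_i m_i X_i + e_t, where months[t] is in 1..12."""
    X, target = [], []
    for t in range(1, len(y)):
        dummies = [1.0 if months[t] == i else 0.0 for i in range(1, 13)]
        X.append([y[t - 1]] + dummies)
        target.append(y[t])
    beta = ols(X, target)
    return beta[0], beta[1:]  # phi, then the twelve month coefficients


def simulate(phi, m, n_months, sigma=0.1, seed=1):
    """Simulate a monthly series following the model, with made-up parameters."""
    rng = random.Random(seed)
    months = [(t % 12) + 1 for t in range(n_months)]
    y = [0.0]
    for t in range(1, n_months):
        y.append(phi * y[-1] + m[months[t] - 1] + rng.gauss(0.0, sigma))
    return y, months


true_m = [0.2] * 12
true_m[7] = 0.5  # hypothetical larger effect in August
y, months = simulate(0.3, true_m, n_months=1200)
phi_hat, m_hat = fit_ar1_with_months(y, months)
print(round(phi_hat, 2), [round(v, 2) for v in m_hat])
```

Because the twelve indicators together play the role of an intercept, no separate constant term is included; adding one would make the regressors collinear.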
The distributions of the $\phi$'s and the coefficients of the twelve months were
[Figure, caption not recovered: AR(1) Coefficients; x-axis: Phi]
[Figure, caption not recovered: AR(1) Coefficients; x-axis: Phi, Jan through Dec]
[Figure, caption not recovered: Regressed Coefficients, Median and Mean; x-axis: Phi, Jan through Dec]
3.4.3
Again, due to the discontinuous operation of the 106 gas stations, it was not
possible to regress the gasoline loss rates on all eight years. For this reason,
the gas stations were grouped according to their operation periods, and the loss
rates were regressed on the years of operation in addition to their own past
values. For example, the model for the stations with eight years of operation is
given in Equation 3.8. Similarly, the year indicator was set as $X_i = 1$ if the
observation was from the $i$th year and 0 otherwise, and $m_i$ was the coefficient
of the variable $X_i$.
$$Y_t = \phi Y_{t-1} + \sum_{i=2001}^{2008} m_i X_i + e_t, \tag{3.8}$$
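The grouping of stations by operation period and the construction of the year indicators in Equation 3.8 can be sketched as follows. This is an illustration under stated assumptions, not the code used in this study: the function names and station labels are hypothetical, and the operation period is taken to be the span from the first to the last year with data, ignoring any gaps in between.

```python
from collections import defaultdict


def group_by_operation_period(records):
    """Group stations by the (first_year, last_year) span in which they report data.

    records: iterable of (station_id, year) pairs, one per observation.
    Assumption: gaps inside the span are ignored.
    """
    years_by_station = defaultdict(set)
    for station, year in records:
        years_by_station[station].add(year)
    groups = defaultdict(list)
    for station, years in sorted(years_by_station.items()):
        groups[(min(years), max(years))].append(station)
    return dict(groups)


def year_dummies(year, first_year, last_year):
    """Indicator vector X_i for one observation: 1 for its own year, 0 otherwise."""
    return [1 if year == y else 0 for y in range(first_year, last_year + 1)]


# Hypothetical stations A to D with different operation periods.
records = (
    [("A", y) for y in range(2001, 2009)]
    + [("B", y) for y in range(2003, 2009)]
    + [("C", y) for y in range(2006, 2009)]
    + [("D", y) for y in range(2003, 2009)]
)
print(group_by_operation_period(records))
print(year_dummies(2004, 2001, 2008))
```

Each group then gets its own model, with as many year indicators as there are years in its operation period.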
The distributions of the regressed coefficients of $\phi$ and the eight year
variables are presented in Figure 3.19. Most of the year coefficients were above
zero. In this yearly AR(1) model, the value of $\phi$ was quite close to zero, so
the influence of the gasoline loss rate on itself was small compared to that of
the other variables. The large median for the year 2004 indicated its
significant effect on the gasoline loss rates.
The mean of the coefficients was highest for the year 2002. The big difference
between the median and the mean for 2002 can be explained by the outliers above
the box in Figure 3.20.
The predictions by the medians seemed to be more consistent with the analysis
results already obtained. Combining the observations in Figure 3.19 and
Figure 3.20, it can at least be concluded that in this model the variables 2002
and 2008, as well as 2004, should be significant. There were increasing trends
between 2001 and 2002, and between 2003 and 2004.
[Figure, caption not recovered: AR(1) Coefficients; x-axis: Phi, 2001 through 2008]
Figure 3.20: The Means and Medians of AR(1) Coefficients for Eight Years
[y-axis: Regressed Coefficients; series: Median, Mean; x-axis: Phi, 2001 through 2008]
The distributions of the regressed coefficients for the gas stations that
operated from 2003 to 2008 are presented in Figure 3.21. The model is given in
Equation 3.9.
$$Y_t = \phi Y_{t-1} + \sum_{i=2003}^{2008} m_i X_i + e_t, \tag{3.9}$$
In Figure 3.21, most of the $\phi$ coefficients were below zero. In Figure 3.22,
both the median and the mean of $\phi$ were negative. This indicated that the
series of gasoline loss rates were negatively correlated with their own past
values. Again the largest median was in the year 2004. However, the largest mean
was in the year 2008. The data for 2008 only included the months from January to
July. The incomplete data might overestimate the
[Figure, caption not recovered: AR(1) Coefficients; x-axis: Phi, 2003 through 2008]
Figure 3.22: The Means and Medians of AR(1) Coefficients for Six Years
[y-axis: Regressed Coefficients; series: Median, Mean; x-axis: Phi, 2003 through 2008]
As seen in Figure 3.23 and Figure 3.24, the regressed coefficients showed that
the series of gasoline loss rates were mainly affected by their own past values.
Both the median and mean curves showed a decreasing significance of effects on
gasoline loss rates across the four variables. These gas stations were
distributed across the Los Angeles, Santa Barbara and San Diego areas, and the
results showed no relation to the regions.
In summary, as the number of time variables in the model decreased, the
significance of $\phi$ increased. In the model with nine regressors, $\phi$ was
closest to zero and the gasoline loss rates had the lowest correlation with
their own past values. When the regressors were reduced from nine to seven, the
importance of the past gasoline loss rate increased and the importance of the
other time variables decreased. When the regressors were reduced from seven to
four, the effect of the past gasoline loss rates dominated and the other
regressors continued to lose their importance in the model. Generally, we
believe the year 2004 had the most significant effect on the gasoline loss rates
when it was included in the model, and the model with nine regressors was the
most acceptable one among the three.
[Figure, caption not recovered: AR(1) Coefficients; x-axis: Phi, 2006 through 2008]
Figure 3.24: The Means and Medians of AR(1) Coefficients for Three Years
[y-axis: Regressed Coefficients; series: Median, Mean; x-axis: Phi, 2006 through 2008]
CHAPTER 4
Conclusions and Discussions
4.1
Conclusions
1. Most of the gasoline loss rates under study were above zero, while most of
the diesel loss rates were below zero. The average loss rates of gasoline tended
to exceed the average loss rates of diesel. While the means and medians of the
gasoline loss rates were above zero, the means and medians of the diesel loss
rates were much closer to zero or below zero.
2. The Los Angeles and Orange County gasoline loss rates were close to each
other and were a little higher than those of the Santa Barbara and San Diego
areas. For diesel, there was no measurable difference among these regions.
3. There was an increasing trend in the gasoline loss rate from 2001 to 2003.
The loss peaked in 2004 and 2005, went down slightly during 2006, and was
stable in 2007 and 2008. The variations in the yearly diesel loss rates were
barely detectable compared with the gasoline data.
4. There was a small increase in gasoline loss rates in the third quarter, which
may be related to the high temperatures of summer. Generally, the quarterly
gasoline loss rates varied little. The situation for diesel was different: the
distribution of diesel loss rates was more symmetric around zero, and there was
barely any difference among the quarterly diesel loss rates.
5. There was a slow increasing trend in gasoline loss rates from January to
August; after that the curve dropped slightly and was stable after September.
There were no obvious variations in the monthly diesel loss rates, which
fluctuated repeatedly around a certain level. The differences among the diesel
loss rates were much less detectable than those among the gasoline loss rates.
6. The $\chi^2$ test on gasoline and diesel showed that the loss rates of the two
were significantly different. Combined with the observations in the first point,
it can be concluded that the loss rate of gasoline was generally higher than
that of diesel.
7. The $\chi^2$ test on regional gasoline and diesel loss rates indicated that
the loss rates of gasoline were related to region while those of diesel were
not. Combined with the observations in the second point, it can be concluded
that the loss rates of gasoline in LA and OC were higher than those in SB and
SD.
8. The $\chi^2$ test on yearly, quarterly, and monthly gasoline and diesel loss
rates indicated that the loss rates of gasoline were related to the time
periods, while those of diesel were not.
9. The $\chi^2$ test on gasoline and diesel loss rates by gas station, identified
by site code, indicated that the loss rates of both gasoline and diesel differed
from station to station.
10. The Tukey test on regional loss rates confirmed that the gasoline loss
rates of LA and OC were significantly different from those of SB and SD; the
most different pair was LA and SD.
11. The Tukey test confirmed the increasing trend in loss rates from 2001 to
2004 and the decreasing trend from 2004 to 2008. The most different pair was
the years 2001 and 2004.
12. The Tukey test again confirmed the slight increasing trend in gasoline loss
rates from the first quarter to the third quarter. The same test on the monthly
data confirmed that the highest gasoline loss rate was around August, while the
lowest was around January. Combining these two statements, it can be suggested
that the gasoline loss rates increased in the periods when temperatures were
high.
13. The Tukey test confirmed the large variations among the loss rates of gas
stations for both gasoline and diesel.
14. The time series analysis confirmed that most of the gasoline loss rates
peaked in the year 2004.
15. The gasoline loss rates were not strongly related to their past values when
the data were also regressed on the eight year variables. This correlation
tended to increase as the number of time variables decreased.
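The $\chi^2$ comparisons summarized in points 6 to 9 can be illustrated with a small sketch of Pearson's statistic for a contingency table. How the loss rates were categorized for the tests is not restated in this chapter, so the sketch assumes a simple split into positive and non-positive loss rates; the counts are made up and the function name is hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up counts: rows are fuel type (gasoline, diesel),
# columns are (loss rate > 0, loss rate <= 0).
table = [[70, 30], [45, 55]]
stat, dof = chi_square_statistic(table)

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
print(round(stat, 3), dof, stat > 3.841)
```

A statistic above the critical value leads to rejecting independence, which in this setting corresponds to concluding that the loss rate category is related to the grouping variable.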
4.2
Discussion
1. The higher loss rates of gasoline compared with diesel might be related to
the environmental temperatures. The mixture of ethanol and gasoline is volatile,
so significant loss might result when gasoline is exposed to air.
2. The relatively higher loss rates of gasoline in LA and OC than in SB and
SD might be related to the local environmental temperature as well. This needs
to be studied together with the weather in LA, OC, SB and SD during the study
period.
3. To determine what made the gasoline loss rates increase from 2001 to
2004, the weather and temperatures of the regions involved in this study would
need to be examined. Besides, the temperature differences among years and the
References
[1] J. Siebert, Temperature Compensation at the Retail Pump, Tech. rep.,
Owner-Operator Independent Drivers Association (2007).
[2] www.users.qwest.net/ taaaz/AZgas.html, Buying gasoline in arizona why
not fairness at the pump.
[3] www.toptech.com, Application: Ethanol blending with multiload.
[4] D. J. Kucinich, Staff report of the domestic policy subcommittee majority
staff oversight and government reform committee house of representatives
(2007).
[5] www.api.org/Standards/, Api.
[6] J. A. Rice, Mathematical Statistics and Data Analysis (Duxbury Press,
1995), chap. 10, p. 372, 2nd ed.
[7] J. A. Rice, Mathematical Statistics and Data Analysis (Duxbury Press,
1995), chap. 14, p. 507, 2nd ed.
[8] J. A. Rice, Mathematical Statistics and Data Analysis (Duxbury Press,
1995), chap. 13, p. 483, 2nd ed.
[9] J. A. Rice, Mathematical Statistics and Data Analysis (Duxbury Press,
1995), chap. 12, p. 451, 2nd ed.
[10] J. D. Cryer and K.-S. Chan, Time Series Analysis: With Applications in R
(Springer, 2008), 2nd ed.