Vogel 1!

Abby Vogel
c177 Final Project
3 May 2017
East Bay Gas Prices
On a global scale, gasoline and fuel prices are determined by a range of factors: national
tariffs, taxes, and regulations, as well as state taxes and ever-changing market supply and
demand. At the local level, gas prices are more nuanced, seemingly shaped by an array of social,
political, and spatial factors. The goal of this project is to determine the driving factors of
differential gas prices in four East Bay cities: Berkeley, Emeryville, Oakland, and Piedmont. The
relative strength of each explanatory variable is assessed using two forms of regression,
Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR).

• Research Question: What factors contribute to differences in East Bay gas prices?
Are these trends spatially dependent?
• Null Hypothesis: Variation in gas prices is random and not spatially dependent.
Any variation is due to chance.
• Study Area: Cities of the East San Francisco Bay Area in California—Berkeley,
Emeryville, Oakland, and Piedmont. [Figure 1]

Data Acquisition, Cleaning and Projection
All data in this analysis are aggregated to the Census Block Group level for the regressions.
Gas station locations come from Google Maps and were exported to KML. Gas price data come from
GasBuddy (gasbuddy.com), with prices as of April 27, 2017. Using the “KML to Layer” tool in
ArcMap, the gas station locations were added as a layer and projected to State Plane CA Zone III (US
Feet). [Figure 2]
To aggregate the point data to block groups for meaningful regression analysis, I
used Network Analyst to assign gas stations to each block group. I created buffers (2, 4, 6, 8, and
10 minutes) around each gas station along the transportation network. These generated polygons
were spatially joined to the block groups to assign each block group its most probable gas price.
Where more than one network buffer overlapped a block group, the station with the
lowest price was assigned, mirroring expected consumer preference. [Figures 3 and 4]
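The overlap rule can be sketched in R. This is only a minimal illustration, not the Network Analyst workflow itself: the `joined` table and its column names (`geoid`, `station`, `price`) are hypothetical stand-ins for the output of the spatial join, and the prices are made up.

```r
# Hypothetical result of spatially joining network buffers to block groups:
# one row per (block group, overlapping station buffer) pair.
joined <- data.frame(
  geoid   = c("A", "A", "B", "C", "C", "C"),
  station = c("s1", "s2", "s2", "s1", "s3", "s4"),
  price   = c(2.99, 2.89, 2.89, 2.99, 3.15, 3.05)
)

# Keep the lowest price among the stations whose buffers reach each block group.
assigned <- aggregate(price ~ geoid, data = joined, FUN = min)
print(assigned)
```

Block group "C" is reached by three buffers here and receives the cheapest of the three prices.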
The remaining data were acquired from the 2015 American Community Survey (5-Year
Estimates). These tables were joined to the TIGER shapefiles and projected to State Plane CA
Zone III (US Feet). The variables drawn from this source were population, median rent, median household
income, and commuter data. [Figures 5-11] Once processing was complete, the full list of
explanatory variables was as follows:
• Population Density
• Proportion of population that drives 30 minutes or more to work in a car or truck
• Proportion of population that drives alone to work in a car or truck
• Median Income
• Median Rent
• Presence of Competition (another station within 0.2 miles)
• Distance to major highway or road
• Brand
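
The competition indicator in the list above can be illustrated with a short R sketch. It assumes straight-line distance between stations in State Plane feet; the `stations` table and its coordinates are hypothetical, and the actual variable was built from buffers in ArcMap.

```r
# Hypothetical station coordinates in State Plane feet (x, y).
stations <- data.frame(
  id = c("s1", "s2", "s3", "s4"),
  x  = c(0, 800, 5000, 5000),
  y  = c(0, 0, 0, 2000)
)

threshold_ft <- 0.2 * 5280  # 0.2 miles = 1,056 feet

# Pairwise Euclidean distances between stations.
d <- as.matrix(dist(stations[, c("x", "y")]))
diag(d) <- Inf  # a station is not its own competitor

# comp = 1 if any other station lies within the threshold, else 0.
stations$comp <- as.integer(apply(d, 1, min) <= threshold_ft)
print(stations)
```

Stations s1 and s2 are 800 feet apart, so both are flagged; s3 and s4 have no neighbor within 1,056 feet.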

Variable | Description | Source | Original coordinate system | Original scale
price/station location | Price as of April 27, 2017 | GasBuddy, Google Maps | WGS84 (KML) | Point data
t30_prop | Proportion of people who travel to work 30 minutes or longer by car or truck | ACS 2015 (5-Year Estimates) | NAD83 GCS (TIGER) | Census Block Group
dalone_prop | Proportion of people who drive a car or truck alone to work | ACS 2015 (5-Year Estimates) | NAD83 GCS (TIGER) | Census Block Group
popdens | Population density | ACS 2015 (5-Year Estimates) | NAD83 GCS (TIGER) | Census Block Group
med_income | Median income | ACS 2015 (5-Year Estimates) | NAD83 GCS (TIGER) | Census Block Group
med_rent | Median rent | ACS 2015 (5-Year Estimates) | NAD83 GCS (TIGER) | Census Block Group
comp | Indicator of a different gas station within 2/10 mile | Buffer of station location | State Plane CA Zone 3 | Point data
Near_dist | Euclidean distance to nearest major road/highway (13, 24, 80, 680, etc.) | East Bay Network Data | ???? | Distance in US feet to vector data
brand | Gas station brand aggregated to 76, ARCO, Chevron, Mobil, Quik Stop, Shell, Valero, and Other | GasBuddy, Google Maps | WGS84 (KML) | Point data

Exploratory Data Analysis

Linear Regression: $Y = c + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
Linear regression was used to assess the significance of each of the explanatory
variables. Exploratory regression and Ordinary Least Squares were both used to identify the
factors responsible for variation in gas prices by station. Modeling was performed in
both ArcMap and R. OLS was run in R, using the linear modeling function lm(), so that brand
could be included as an explanatory variable. R was also used to test a multiplicative
model that incorporated interaction terms.
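
The difference between the additive and multiplicative specifications can be sketched on simulated data. Everything in the block below is hypothetical (the `toy` data frame, the two-level brand, the coefficients, and the seed); it only shows how `+` and `*` differ in an R formula, not the project's results.

```r
set.seed(177)  # arbitrary seed so the illustration is reproducible
n <- 200
toy <- data.frame(
  brand      = factor(sample(c("A", "B"), n, replace = TRUE)),
  med_income = runif(n, 30, 150),  # hypothetical, thousands of dollars
  near_dist  = runif(n, 0, 5000)   # hypothetical, feet
)
# Simulated price in which the income effect exists only for brand B.
toy$price <- 2.8 + 0.2 * (toy$brand == "B") +
  0.002 * toy$med_income * (toy$brand == "B") +
  rnorm(n, sd = 0.02)

additive       <- lm(price ~ brand + med_income + near_dist, data = toy)
multiplicative <- lm(price ~ brand * med_income * near_dist, data = toy)

# `*` expands to the main effects plus all two- and three-way interactions.
length(coef(additive))        # 4 coefficients
length(coef(multiplicative))  # 8 coefficients
summary(additive)$adj.r.squared
summary(multiplicative)$adj.r.squared
```

Because the simulated income effect depends on brand, only the model with interaction terms can represent it, which is the motivation for testing the multiplicative form.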
Geographically Weighted Regression: GWR was used to assess the spatial significance
of each explanatory variable. GWR fits local regression estimates and is better suited to predicting
spatially dependent phenomena. If OLS produces spatially autocorrelated standardized
residuals, indicated by a statistically significant Moran’s I, then GWR is expected to perform better.
However, if GWR also produces spatially autocorrelated standardized residuals, then the model is
likely still missing key explanatory variables. GWR was first run without brand as an explanatory
variable and then with brand added as a dummy variable, so that the
relative strengths of the two models could be compared.
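
The idea of local regression estimates can be sketched in base R as one weighted least-squares fit per location. This is a simplified illustration under stated assumptions (simulated points, a single predictor, a fixed Gaussian kernel with a hypothetical bandwidth), not the ArcMap GWR tool, which also selects the bandwidth and reports diagnostics.

```r
set.seed(1)  # arbitrary seed
n <- 100
pts <- data.frame(x = runif(n), y = runif(n), income = rnorm(n))
# Simulated response whose income slope grows from west (x = 0) to east (x = 1).
pts$price <- 3 + (0.5 + 1.0 * pts$x) * pts$income + rnorm(n, sd = 0.05)

bandwidth <- 0.2  # hypothetical fixed kernel bandwidth, in coordinate units
d <- as.matrix(dist(pts[, c("x", "y")]))

# One weighted fit per location; nearby points get larger Gaussian weights.
local_slope <- sapply(seq_len(n), function(i) {
  pts$w <- exp(-0.5 * (d[i, ] / bandwidth)^2)  # modifies a local copy of pts
  coef(lm(price ~ income, data = pts, weights = w))[["income"]]
})

# A single global OLS slope averages over the spatial trend.
global_slope <- coef(lm(price ~ income, data = pts))[["income"]]
cor(local_slope, pts$x)  # local slopes follow the west-to-east trend
```

When a relationship varies over space like this, one global coefficient hides the pattern that the local coefficients recover.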

OLS: The exploratory regression of all variables (except brand) returned extremely low
R² values. Running the full set of variables under the OLS model
price ~ t30_prop + dalone_prop + popdens + med_income + med_rent + near_dist + comp
resulted in an adjusted R² of 0.0025. This model failed to explain the variation in gas
prices and produced highly spatially autocorrelated standardized residuals (Moran’s I z-score
of 15.61). Finally, this model yielded significant Koenker and Jarque-Bera statistics, indicating
non-stationarity and non-normally distributed residuals, respectively.
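
For reference, the Jarque-Bera statistic can be computed by hand from the skewness and kurtosis of the residuals. The helper below, `jarque_bera`, is a hypothetical name for a sketch of the textbook formula; it is not the code ArcMap runs, and the inputs are simulated rather than the project's residuals.

```r
# Jarque-Bera: JB = n/6 * (S^2 + (K - 3)^2 / 4), compared to chi-square with 2 df.
jarque_bera <- function(res) {
  n    <- length(res)
  z    <- res - mean(res)
  m2   <- mean(z^2)
  skew <- mean(z^3) / m2^1.5
  kurt <- mean(z^4) / m2^2
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(42)  # arbitrary seed
jarque_bera(rnorm(500))       # draws from a normal distribution
jarque_bera(rt(500, df = 3))  # heavy-tailed draws
```

A significant statistic means the residuals are skewed, have non-normal kurtosis, or both.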

Next, “brand” was added to the linear model to see if there was a change in the
performance of the OLS. Additionally, the model was limited to strictly significant terms. Using
R, the OLS model became:
price ~ brand + med_income + near_dist
This addition improves the predictive power of the OLS, raising the adjusted R² to 0.4336. To test
the normality of the standardized residuals, both a quantile-quantile (QQ) plot and a histogram were
generated. Excluding one extremely low outlier, the QQ plot shows a generally linear relationship
that departs sharply from normality at the low and high quantiles. This
indicates a roughly normal curve with heavy tails (positive excess kurtosis), which is also visible
in the histogram. The standardized residuals were tested for spatial autocorrelation and returned a
statistically significant Moran’s I (z-score of 3.97), suggesting that this model is missing a key
explanatory variable.
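
As a rough illustration of the spatial autocorrelation test, Moran’s I and its z-score can be computed in base R. The sketch makes assumptions that differ from ArcMap’s tool: binary distance-threshold weights, the normality form of the variance, and a hypothetical helper name (`morans_i`). The 5 x 5 grid is a toy example, not the project’s residuals.

```r
morans_i <- function(values, coords, threshold) {
  n <- length(values)
  d <- as.matrix(dist(coords))
  w <- (d > 0 & d <= threshold) * 1  # binary weights, zero on the diagonal
  z <- values - mean(values)
  s0 <- sum(w)
  i_obs <- (n / s0) * sum(w * outer(z, z)) / sum(z^2)
  e_i <- -1 / (n - 1)                # expected I under no autocorrelation
  s1 <- 0.5 * sum((w + t(w))^2)
  s2 <- sum((rowSums(w) + colSums(w))^2)
  var_i <- (n^2 * s1 - n * s2 + 3 * s0^2) / ((n^2 - 1) * s0^2) - e_i^2
  z_score <- (i_obs - e_i) / sqrt(var_i)
  list(I = i_obs, z = z_score, p = 2 * pnorm(-abs(z_score)))
}

# Toy surface with a strong west-to-east gradient on a 5 x 5 grid.
grid <- expand.grid(x = 1:5, y = 1:5)
res <- morans_i(values = grid$x, coords = grid, threshold = 1)
res$I  # 0.75
res$z
res$p
```

A large positive z-score, as in this toy gradient, means that similar values cluster in space more than chance would produce.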

Finally, the last linear model tested was a multiplicative Ordinary Least Squares
model. This model adds interaction terms among the three variables:
price ~ brand * med_income * near_dist
This model yields the strongest predictive power of all of the OLS models, with an adjusted R²
of 0.4998. Its standardized residuals are also closer to normal, as shown in the QQ
plot and histogram. Using ArcMap, the spatial autocorrelation of the standardized residuals was
calculated. This model has the least spatially autocorrelated standardized residuals, with a z-score
of 3.07 for Moran’s I. However, this value is still significant, with a p-value of 0.002.
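
The reported p-value can be checked against the z-score with a quick calculation, assuming a two-sided test against the standard normal distribution:

```r
z <- 3.07
p_two_sided <- 2 * pnorm(-abs(z))  # area in both tails beyond |z|
round(p_two_sided, 3)              # 0.002
```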

GWR: First, a GWR of all variables excluding brand was modeled in ArcMap. This
model has an adjusted R² of 0.3337, a clear improvement in fit over the OLS model of the
same variables (0.0025). However, this model yields spatially autocorrelated standardized residuals
(z-score of 7.72), indicating that the model under-performs. [Figure 12]
Brand of gasoline was then added to the model in the form of dummy variables
indicating which corporation owned the station. The levels of this variable were aggregated to
76, ARCO, Chevron, Mobil, Quik Stop, Shell, Valero, and Other to keep unique gas
brands, such as “Berkeley Smog and Gas,” from over-fitting the model. Using the model price ~
brand + med_income + near_dist, the adjusted R² improved to 0.7385. This model accounts for
the highest amount of variation in gas prices over the study area. However, it also yields
spatially autocorrelated standardized residuals, with a Moran’s I z-score of 5.62. [Figure 13]

Model | Adjusted R² | Z-value of Moran’s I | P-value of Moran’s I
OLS: t30_prop + dalone_prop + popdens + med_income + med_rent + near_dist + comp | 0.0025 | 15.61 | 0.0000
OLS: brand + med_income + near_dist | 0.4336 | 3.97 | 0.0000
OLS (with interaction): brand * med_income * near_dist | 0.4998 | 3.07 | 0.0020
GWR: t30_prop + dalone_prop + popdens + med_income + med_rent + near_dist + comp | 0.3337 | 7.72 | 0.0000
GWR: brand + med_income + near_dist | 0.7385 | 5.62 | 0.0000

With this final result, it is evident that key variables are missing from this regression.
Further iteration of this analysis is needed to build a model that has strong explanatory power
without spatial autocorrelation or non-normal standardized residuals. Overall, the
multiplicative OLS model with interaction terms,
price ~ brand * med_income * near_dist
produces the least autocorrelated standardized residuals. The GWR model of the same
explanatory variables yields the highest adjusted R² (0.7385), but with autocorrelated residuals.
From this we can conclude that brand, median income, and distance to major roads are the most
important explanatory variables considered in this analysis. We can reject the null hypothesis that
variation in gas prices is random and spatially independent.

Error and Data Availability
There are multiple sources of error and uncertainty in this analysis. For the gas station
data, only stations with prices up to date on April 27, 2017 were included. This
resulted in about 10% non-response over the study region. This non-response is likely not
missing at random, meaning that the variable of interest (price) is correlated with the likelihood
of being included in the sample. This introduces bias into the results and modeling. Further
ground-truthing of prices would improve this study. In addition, stations advertise separate “Cash”
and “Card” prices, and GasBuddy.com does not differentiate between these prices in its data.
Additionally, all ACS data are estimates from a national survey, with inherent variance
from the random sampling process. While the estimators used by ACS are unbiased and robust,
the only way to eliminate this sampling error is to use non-sampled data.

Support Software
ArcGIS version 10.4 and R version 3.3.2 were used in this analysis.

Figure 1. Map of East Bay

Figure 2. Map of Gas Stations

Figures 3 and 4. Network Analyst Map

Figures 5-11 : Explanatory Variables

Figures 12 and 13

R Code
# Load station prices; stringsAsFactors = TRUE keeps Name a factor on R >= 4.0.
gas <- read.csv("~/Desktop/gas_27_TableToExcel.csv", stringsAsFactors = TRUE)
hist(gas$price, main = "Distribution of East Bay Gas Prices", xlab = "Price (US Dollars)")
plot(price ~ Name, data = gas, main = "Price by Brand", ylab = "US Dollars")

# Collapse every brand outside the seven major ones into "Other".
# The loop body is missing in the source; assigning "Other" follows the aggregation described in the text.
major <- c("76", "ARCO", "Chevron", "Mobil", "Quik Stop", "Shell", "Valero")
gas$Name[64] <- "Valero"
levels(gas$Name) <- c(levels(gas$Name), "Other")
gas$Name[!(gas$Name %in% major)] <- "Other"
gas$Name <- droplevels(gas$Name)
write.csv(gas, "~/Desktop/gas2.csv")
gas2 <- read.csv("~/Desktop/gas2.csv")

# Block-group table with prices and explanatory variables.
full <- read.csv("~/Desktop/last.csv", stringsAsFactors = TRUE)
full$comp <- as.factor(full$comp)
# ols1 is truncated in the source after "+full$brand+full"; the remaining terms are unknown, so it is kept as a comment.
# ols1 <- lm(full$price~full$t30_prop+full$popdens+full$dalone_pro+full$med_income+full$med_rent+full$brand+full
# ols2 is truncated the same way; the final term is assumed to be comp, matching the model stated in the text.
ols2 <- lm(price ~ t30_prop + popdens + dalone_pro + med_income + med_rent + NEAR_DIST + comp, data = full)

# Aggregate brand to the same eight levels.
levels(full$brand) <- c(levels(full$brand), "Other")
full$brand[!(full$brand %in% major)] <- "Other"
full$brand <- droplevels(full$brand)

# Additive model; observation 42 (the low outlier) is dropped before plotting.
ols3 <- lm(price ~ brand + med_income + NEAR_DIST, data = full)
ols.stdres <- rstandard(ols3)
ols.stdres1 <- ols.stdres[-42]
qqplot(x = rnorm(length(ols.stdres1)), y = ols.stdres1, xlab = "Generated Normal Values", ylab = "Std Residuals", main = "QQ Plot")
abline(a = 0, b = 1)
hist(ols.stdres1, breaks = 40, main = "Histogram of Std Residuals", xlab = "StdResid")

# Multiplicative model with all interaction terms.
ols4 <- lm(price ~ brand * med_income * NEAR_DIST, data = full)
ols.stdres2 <- rstandard(ols4)
ols.stdres3 <- ols.stdres2[-42]
qqplot(x = rnorm(length(ols.stdres3)), y = ols.stdres3, xlab = "Generated Normal Values", ylab = "Std Residuals", main = "QQ Plot")
abline(a = 0, b = 1)
hist(ols.stdres3, breaks = 40, main = "Histogram of Std Residuals", xlab = "StdResid")

# Export standardized residuals for the Moran's I test in ArcMap.
# `last` is never defined in the source; it is assumed to be the table read from last.csv.
last <- full
last$multiplic <- ols.stdres2
last$addit <- ols.stdres
write.csv(last, "~/Desktop/last2.csv")