Exploring San Francisco Bay Area's Bike Share System - DataScience+
Congested streets and slow-crawling traffic are a fact of life in many metropolitan areas, such as New York City, Los Angeles, and
Chicago. Bike sharing is an innovative solution for such problems, and it works by dispersing a large fleet of publicly available bikes
throughout crowded cities for personal transport.
Implemented in 2013, Ford GoBike is the first bike-sharing system introduced on the US West Coast. Its 540 stations and 7,000 bikes
sprawl across five cities in the San Francisco Bay Area. A docked bike can be checked out at any station and must be returned to a station
when the trip is complete. As an avid GoBike user and a data fanatic, I am excited to use R to explore the trip data collected so far in
2018.
Highlights
The number of rides per month more than doubled from January to June of 2018.
Monthly and yearly subscribers have completed more than five times the number of rides compared to customers who hold single-
use or day-ride passes.
The ratio of male to female users is 3:1.
The mean age of users is about 35 (median 33), with a standard deviation of 10.5.
More than 90% of the rides are under 30 minutes.
The peak hours across all rides in 2018 are 8 AM and 5 PM, which align with normal working hours.
The bike share system in San Francisco is popular among those who commute from outside of the city for work.
Data Acquisition
First, I downloaded the system data from the Ford GoBike website and saved the files in my working directory. Currently, the website offers
a collective file for 2017 and one file for each month of 2018 up to August. In this analysis, I will focus solely on the eight monthly files
that have been generated so far in 2018.
# List of data file names
file_names = list.files(pattern = "\\.csv$") # pattern is a regular expression: names ending in .csv
file_names
## [1] "201801-fordgobike-tripdata.csv" "201802-fordgobike-tripdata.csv"
## [3] "201803-fordgobike-tripdata.csv" "201804-fordgobike-tripdata.csv"
## [5] "201805-fordgobike-tripdata.csv" "201806-fordgobike-tripdata.csv"
## [7] "201807-fordgobike-tripdata.csv" "201808-fordgobike-tripdata.csv"
To read the files into R, I iterate through file_names using the lapply function:
https://datascienceplus.com/exploring-san-francisco-bay-areas-bike-share-system/ 1/16
12/31/2019 Exploring San Francisco Bay Area’s Bike Share System | DataScience+
# Read data (this will take a second)
df_list = lapply(file_names,
function(x) read.csv(x, stringsAsFactors = FALSE))
Information on ride duration, start and end times and locations, and user descriptions is provided. Notice that the data contains a
column called "bike_share_for_all_trip", which tracks members who are enrolled in the Bike Share for All program for low-income
residents.
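Row binding only works smoothly when each column has the same type in every file, which is why some station ID columns are cast in the next step. A minimal sketch of how such mismatches could be spotted beforehand, using two made-up data frames (jan, jun, and toy_list are hypothetical stand-ins, not the real files):

```r
# Toy stand-ins for two monthly files; the real ones are read with read.csv above.
# In the second one the station ID arrives as text, mimicking a type mismatch.
jan = data.frame(start_station_id = c(120L, 15L), bike_id = c(2765L, 2815L))
jun = data.frame(start_station_id = c("120", "NULL"), bike_id = c(321L, 617L),
                 stringsAsFactors = FALSE)
toy_list = list(jan, jun)

# One row per file, one column per variable, each cell holding the column class
class_table = t(sapply(toy_list, function(d) sapply(d, function(col) class(col)[1])))

# Columns whose class differs between files need casting before row binding
mismatched = colnames(class_table)[apply(class_table, 2, function(x) length(unique(x)) > 1)]
mismatched
```

Note that as.numeric turns any non-numeric text into NA (with a warning), so it is worth counting the NAs after the cast.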
We will collapse the data frames into one and examine the data structure:
library(dplyr) # for bind_rows, glimpse, and the pipe
# Cast certain columns as numeric to enable row binding
for (i in 6:8) {
  df_list[[i]]$start_station_id = as.numeric(df_list[[i]]$start_station_id)
  df_list[[i]]$end_station_id = as.numeric(df_list[[i]]$end_station_id)
}
# Collapse the data frames into one
# (this step is missing from the listing; bind_rows is an assumption)
df = bind_rows(df_list)
# Glimpse at df
glimpse(df)
## Observations: 1,210,548
## Variables: 16
## $ duration_sec <int> 75284, 85422, 71576, 61076, 39966, 647...
## $ start_time <chr> "2018-01-31 22:52:35.2390", "2018-01-3...
## $ end_time <chr> "2018-02-01 19:47:19.8240", "2018-02-0...
## $ start_station_id <dbl> 120, 15, 304, 75, 74, 236, 110, 81, 13...
## $ start_station_name <chr> "Mission Dolores Park", "San Francisco...
## $ start_station_latitude <dbl> 37.76142, 37.79539, 37.34876, 37.77379...
## $ start_station_longitude <dbl> -122.4264, -122.3942, -121.8948, -122....
## $ end_station_id <dbl> 285, 15, 296, 47, 19, 160, 134, 93, 4,...
## $ end_station_name <chr> "Webster St at O'Farrell St", "San Fra...
## $ end_station_latitude <dbl> 37.78352, 37.79539, 37.32600, 37.78095...
## $ end_station_longitude <dbl> -122.4312, -122.3942, -121.8771, -122....
## $ bike_id <int> 2765, 2815, 3039, 321, 617, 1306, 3571...
## $ user_type <chr> "Subscriber", "Customer", "Customer", ...
## $ member_birth_year <int> 1986, NA, 1996, NA, 1991, NA, 1988, 19...
## $ member_gender <chr> "Male", "", "Male", "", "Male", "", "M...
## $ bike_share_for_all_trip <chr> "No", "No", "No", "No", "No", "No", "N...
Data cleaning is essential for downstream analysis. This entails formatting the columns into appropriate data types and extracting
essential data. First, I will isolate the hour, day of the week, and month from start_time for visualizing bike usage over time. In
addition, I will convert member_birth_year into member_age. Moreover, I will convert duration_sec into duration_min. Finally, columns
of type character should take on type factor.
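# Format data structure
library(stringr) # for regular expression
df = df %>%
  # Extract day and month from start_time
  mutate(start_month = sub(" .*", "", start_time)) %>%
  mutate(start_month = as.POSIXct(start_month)) %>%
  mutate(start_day = weekdays(start_month)) %>%
  mutate(start_month = format(start_month, "%B")) %>%
  # The remaining steps are missing from this listing; the lines below are a
  # reconstruction from the prose above and the output below, not the author's code
  mutate(start_hour = str_sub(start_time, 12, 13), # two-digit hour, e.g. "22"
         member_age = 2018 - member_birth_year,
         duration_min = duration_sec / 60) %>%
  select(-duration_sec, -start_time, -end_time, -start_station_id,
         -end_station_id, -bike_id, -member_birth_year) %>%
  mutate_if(is.character, as.factor)
# Glimpse at cleaned df
glimpse(df)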
## Observations: 1,210,548
## Variables: 14
## $ start_station_name <fct> Mission Dolores Park, San Francisco Fe...
## $ start_station_latitude <dbl> 37.76142, 37.79539, 37.34876, 37.77379...
## $ start_station_longitude <dbl> -122.4264, -122.3942, -121.8948, -122....
## $ end_station_name <fct> Webster St at O'Farrell St, San Franci...
## $ end_station_latitude <dbl> 37.78352, 37.79539, 37.32600, 37.78095...
## $ end_station_longitude <dbl> -122.4312, -122.3942, -121.8771, -122....
## $ user_type <fct> Subscriber, Customer, Customer, Custom...
## $ member_gender <fct> Male, , Male, , Male, , Male, Male, Ma...
## $ bike_share_for_all_trip <fct> No, No, No, No, No, No, No, No, Yes, Y...
## $ start_month <fct> January, January, January, January, Ja...
## $ start_day <fct> Wednesday, Wednesday, Wednesday, Wedne...
## $ start_hour <fct> 22, 16, 14, 14, 19, 22, 23, 23, 23, 23...
## $ member_age <dbl> 32, NA, 22, NA, 27, NA, 30, 38, 31, 24...
## $ duration_min <dbl> 1254.733333, 1423.700000, 1192.933333,...
The cleaned data frame contains 14 descriptors for over a million rides.
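As a quick sanity check, the transformations described above can be applied by hand to the first record shown in the raw glimpse output. This is only a sketch: computing the age as 2018 minus the birth year is an assumption, although it agrees with the cleaned output.

```r
# First raw record, copied from the glimpse output of the uncleaned data
start_time = "2018-01-31 22:52:35.2390"
duration_sec = 75284
member_birth_year = 1986

start_date = as.Date(sub(" .*", "", start_time)) # keep the date part only
start_hour = substr(start_time, 12, 13)          # two-digit hour as text
member_age = 2018 - member_birth_year            # assumed definition of age
duration_min = duration_sec / 60                 # seconds to minutes

start_hour
member_age
round(duration_min, 2)
```

The three values (hour 22, age 32, roughly 1254.73 minutes) match the first entries of start_hour, member_age, and duration_min in the cleaned data frame.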
Data Analysis
Bike Usage by Month and by Day in 2018
The first figure shows a consistent growth in the number of rides over the months with the exception of August. Six months into 2018,
the bike usage more than doubled. In addition, there's a large increase in the number of rides from April to May, which may be
associated with warmer weather in the summer months along with growing adoption of the bike-sharing service. The second figure
shows that the usage over weekdays is greater than over weekends.
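The plotting code for these two figures did not survive in this copy. One detail worth keeping in mind when reproducing them is that month names stored as a factor sort alphabetically unless the level order is fixed. A small sketch with made-up ride counts (the numbers are hypothetical):

```r
# Hypothetical month labels standing in for the start_month column
start_month = c(rep("January", 2), rep("April", 3), rep("June", 5))

# Fix the level order to the calendar; month.name is a constant built into R
start_month = factor(start_month, levels = month.name)
rides_per_month = table(droplevels(start_month))
rides_per_month

# Growth from January to June expressed as a simple ratio
growth = rides_per_month[["June"]] / rides_per_month[["January"]]
growth
```

With the levels set this way, a bar chart of start_month runs from January onward rather than starting at April.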
It is evident that the majority of the rides are made by subscribers while only a fraction (~16%) come from customers with single-use or
day passes. The usage trends over time are very similar between customers and subscribers. The number of rides in both categories
spiked from April to May and dipped in August.
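To see how a share of roughly 16% relates to the "more than five times" figure in the highlights, here is a small sketch with a made-up 84/16 split (the counts are hypothetical and only chosen to mirror that percentage):

```r
# Hypothetical user_type column with an 84/16 split
user_type = c(rep("Subscriber", 84), rep("Customer", 16))

# Share of rides per user type
share = prop.table(table(user_type))
share

# Subscriber rides per customer ride
ratio = share[["Subscriber"]] / share[["Customer"]]
ratio
```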
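# Plot barplot
# df_gender is not defined in this listing; it is presumably df restricted
# to rides with a reported gender
library(ggplot2) # for plotting
ggplot(df_gender, aes(start_month, fill = member_gender)) +
  geom_bar(position = "fill") +
  xlab("Months in 2018") +
  ylab("Proportion of Rides") +
  ggtitle("Bike Usage by Gender over Time") +
  scale_fill_discrete("Gender") +
  theme_minimal()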
There are roughly three times as many male as female users, and the proportions have remained steady over time.
Take note that 82,041 data points are omitted from the plot because age was not provided, and that the service is restricted to those 18
years or older. Furthermore, there is reason to question the validity of the self-reported ages, since the oldest user is recorded as being 137.
Nonetheless, the distribution indicates that the majority of users are in their mid-twenties to mid-thirties, with a median age of 33
and a mean of 35.3.
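A rough sketch of how missing and implausible ages could be handled before summarising, using a made-up vector of birth years. The upper cutoff of 100 is an assumption for illustration and is not taken from the original analysis:

```r
# Hypothetical birth years; NA mimics a missing entry, 1881 an implausible one
member_birth_year = c(1986, NA, 1996, NA, 1991, 1881, 1988)
member_age = 2018 - member_birth_year

# Number of rides without a reported age
n_missing = sum(is.na(member_age))

# Keep ages between the service minimum (18) and an assumed maximum (100)
plausible = member_age[!is.na(member_age) & member_age >= 18 & member_age <= 100]

n_missing
median(plausible)
mean(plausible)
```

Because the median is less sensitive to extreme entries such as 137 than the mean, reporting both gives a sense of how much the outliers matter.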
Indeed, 90% of the rides are below 22.65 minutes and 97% are below 41.37 minutes, well within the 30- and 45-minute periods, respectively. The
distribution is heavily skewed to the right, reinforcing the idea that the bikes are generally used for short trips.
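Percentiles like these can be obtained with quantile, and the share of rides under a threshold with a mean over a logical vector. A sketch on made-up durations (the values are hypothetical, with one very long ride to mimic the right skew):

```r
# Hypothetical ride durations in minutes
duration_min = c(5, 8, 10, 12, 15, 18, 20, 25, 40, 1254.73)

# 90th and 97th percentiles of ride duration
duration_quantiles = quantile(duration_min, probs = c(0.90, 0.97))
duration_quantiles

# Share of rides shorter than 30 minutes
share_under_30 = mean(duration_min < 30)
share_under_30

# In a right-skewed distribution the mean sits above the median
mean(duration_min) > median(duration_min)
```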
A bimodal curve with peaks at 8 AM and 5 PM confirms my gut instinct that the bike share system is busiest during commute hours.
Along the same lines, it should not come as a surprise that the quietest times are in the wee hours of the night.
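The hourly curve itself is not reproduced in this copy, but the two peaks can be read directly from a table of rides per hour. A minimal sketch with made-up start hours (the counts are hypothetical):

```r
# Hypothetical start hours, stored as two-digit text like the start_hour column
start_hour = c(rep("08", 5), rep("17", 4), rep("12", 2), "03")

# Count rides for every hour of the day, including hours with no rides
all_hours = sprintf("%02d", 0:23)
rides_by_hour = table(factor(start_hour, levels = all_hours))

# The two busiest hours
busiest = names(sort(rides_by_hour, decreasing = TRUE))[1:2]
busiest
```

Listing all 24 levels explicitly keeps empty hours in the table, which matters when the counts are drawn as a line.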
For those residing in the East Bay, a rail system called Bay Area Rapid Transit (BART) offers an efficient way into the city.
Alternatively, a trans-bay bus service picks up from the East Bay and drops off at the Transbay Bus Terminal in San Francisco. Yet
another option is the ferry that connects Oakland, Alameda, and Vallejo to the city. For those living in the South Bay (San Jose, for
example), a rail system called Caltrain is a popular choice. See the map of the Bay Area below for reference.
library(ggmap) # for retrieving and plotting Google Maps
Before delving into the analysis, here are all the active stations in San Francisco:
# Obtain map of San Francisco
SFmap = get_map("San Francisco",
                maptype = "terrain",
                source = "google",
                zoom = 13)
Docks are peppered throughout the city with a fairly even distance between stations. At the heart of the coverage area is South of
Market, home to Uber and Twitter Headquarters. The downtown area is well-served by GoBikes, and they reach as far south as Bernal
Heights.
Now that we have a sense of the bike station locations, it is time to test my hypothesis that stations near transport corridors are more
widely used than other stations during working hours. First, I search on Wikipedia for the coordinates of major transit stations in San
Francisco. I store this data in df_transit. Next, I write a function to extract the total number of rides taken, in 2018, to and from each
station at a given hour. The function map_trips will generate a figure displaying station usage as output. Check out the code below:
# Coordinates for stations of major transit systems
df_transit = data.frame(
  name = c("Embarcadero BART Station", # station names
           "Montgomery BART Station",
           "Powell BART Station",
           "Civic Center BART Station",
           "16th St Mission BART Station",
           "24th St Mission BART Station",
           "King St Caltrain Station",
           "22nd St Caltrain Station",
           "Ferry Building",
           "Transbay Transit Center"),
  system = c(rep("BART", 6), rep("Caltrain", 2), "Ferry", "Transbay Bus Terminal"), # transit system
  longitude = c(-122.3972, -122.4019, -122.4080, # coordinates
                -122.4135, -122.4200, -122.4187,
                -122.3944, -122.3925, -122.3937,
                -122.3966),
  latitude = c(37.79306, 37.78936, 37.78400,
               37.77986, 37.76485, 37.75200,
               37.77639, 37.75722, 37.79550,
               37.7897))
# Plot station usage at a given hour
# NOTE: the function header and the df_peak step are missing from this listing;
# they are reconstructed from the calls to map_trips and the closing brace
map_trips = function(hour, plot_title) {
  # Filter for rides taken at the given hour, grouped by start time and start station
  peak_start = df %>%
    group_by(start_hour, start_station_longitude, start_station_latitude) %>%
    rename(station_longitude = start_station_longitude,
           station_latitude = start_station_latitude) %>%
    summarise(rides = dplyr::n()) %>% # ride count for each station
    filter(start_hour == hour) %>%
    mutate(type = "Start Stations")
  # Filter for rides taken at the given hour, grouped by start time and end station
  peak_end = df %>%
    group_by(start_hour, end_station_longitude, end_station_latitude) %>%
    rename(station_longitude = end_station_longitude,
           station_latitude = end_station_latitude) %>%
    summarise(rides = dplyr::n()) %>% # ride count for each station
    filter(start_hour == hour) %>%
    mutate(type = "End Stations")
  # Combine start and end counts (assumed step; df_peak is not defined in the listing)
  df_peak = bind_rows(peak_start, peak_end)
  # Plot trips
  ggmap(SFmap) +
    geom_point(data = df_peak, # plot stations
               mapping = aes(x = station_longitude,
                             y = station_latitude,
                             size = rides, # scale point size by number of rides
                             alpha = rides), # scale point transparency by number of rides
               color = "red") +
    geom_point(data = df_transit, # plot transit systems
               mapping = aes(x = longitude,
                             y = latitude,
                             shape = system),
               size = 2.5) +
    facet_grid(. ~ type) +
    scale_x_continuous(limits = c(-122.45, -122.375)) + # trim x-axis range
    scale_y_continuous(limits = c(37.74, 37.81)) + # trim y-axis range
    scale_size_continuous("Number of Rides",
                          range = c(1, 8),
                          breaks = seq(0, 10000, by = 1000)) + # control point size
    scale_alpha_continuous("Number of Rides",
                           range = c(0.1, 0.8),
                           breaks = seq(0, 10000, by = 1000)) + # control transparency
    scale_shape_discrete("Transit Systems") +
    ggtitle(plot_title) +
    theme_void() # suppress axes
}
Finally, let's compare the bike trips at 8 AM and 5 PM. The size and shading of a red circle correspond to the number of rides at a
station: bigger and darker means more rides. In addition, the bus, train, and ferry stations are marked with symbols.
# Plot trips at 8 AM
map_trips("08", "Bike Trips at 8 AM")
# Plot trips at 5 PM
map_trips("17", "Bike Trips at 5 PM")
At 8 AM, a high volume of rides start near key transit stations as well as on the outskirts of the city and finish in downtown San Francisco
and the Financial District. Notably, a large number of rides begin at the northernmost Caltrain and BART stations and the ferry stop. This
is consistent with my prediction that many bike share users commute from outside of the city and rely on bikes to bridge the gap
between transit terminals and their workplaces. At 5 PM, we observe a complete reversal of the trend. The bikes move out of the
business centers to transit stations and the suburban parts of town.
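The reversal can also be expressed as a number per station: rides ending there minus rides starting there at a given hour. The sketch below uses a made-up trip table; the station labels and the helper net_flow are hypothetical and not part of the original analysis.

```r
# Hypothetical trips between three made-up station labels
trips = data.frame(
  start_hour = c("08", "08", "08", "17", "17", "17"),
  start_station = c("Transit Hub", "Transit Hub", "Outskirts",
                    "Downtown", "Downtown", "Downtown"),
  end_station = c("Downtown", "Downtown", "Downtown",
                  "Transit Hub", "Transit Hub", "Outskirts"),
  stringsAsFactors = FALSE)

# Net flow per station at one hour: arrivals minus departures
net_flow = function(trips, hour) {
  at_hour = trips[trips$start_hour == hour, ]
  stations = sort(unique(c(trips$start_station, trips$end_station)))
  starts = table(factor(at_hour$start_station, levels = stations))
  ends = table(factor(at_hour$end_station, levels = stations))
  setNames(as.integer(ends - starts), stations)
}

morning = net_flow(trips, "08")
evening = net_flow(trips, "17")
morning
evening
```

A positive value means a station gains bikes during that hour. In this toy table the morning and evening values are exact mirror images, which is the pattern the maps suggest for the real data.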
One important caveat is that the current analysis includes all trips, even those on weekends. Therefore, one may argue that the morning
and afternoon ride patterns are influenced by weekend getaways to San Francisco. While this is a valid point, our previous analyses reduce the
weight of this concern. The bar chart for Bike Usage by Day of the Week shows that more rides are made on weekdays than on
weekends. In addition, the line plot for Total Rides by Hour shows peak usage at 8 AM and 5 PM. Both suggest that shared bikes are
used as a means of commuting to work. Even though the weekend data certainly confound our results, it is unlikely that they dictate
the trends that emerged.
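Readers who want to rule out the weekend effect could restrict the data to weekdays before mapping. A minimal sketch on a few made-up timestamps (the rides data frame is hypothetical), using the ISO weekday number so that the result does not depend on the locale:

```r
# Hypothetical rides with start times in the same format as the raw data
rides = data.frame(
  start_time = c("2018-01-31 08:10:00", "2018-02-03 08:30:00",
                 "2018-02-04 17:05:00", "2018-02-05 17:45:00"),
  stringsAsFactors = FALSE)

# ISO weekday number: 1 = Monday through 7 = Sunday
start_date = as.Date(substr(rides$start_time, 1, 10))
iso_day = as.integer(format(start_date, "%u"))

# Keep Monday to Friday only
weekday_rides = rides[iso_day <= 5, , drop = FALSE]
weekday_rides
```

The same filter applied to the full data frame before calling map_trips would show whether the 8 AM and 5 PM patterns hold on working days alone.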
Closing Remarks
Huge respect for those who have made it to the end of this long tutorial. I had an absolute blast probing the data on the bike share system
in the San Francisco Bay Area.
I encourage those galvanized by this work to explore further on their own. Ford GoBike has a cousin called Citi Bike that operates in
New York City, whose system is more established and mature, with data records going back to 2013. Happy data mining!
Author
Tony Lin
Junior Specialist, UCSF
Disclosure
Tony Lin does not work for or receive funding from any company or organization that would benefit from this article. Views
expressed here are personal and not supported by any university or company.