
Mexico City Urban Bicycle Network Analysis
Dario Diaz Cuevas
Abstract
Public data from Mexico City's bike sharing system (ECOBICI) was analyzed,
focusing on usage patterns and bike trip duration. Visualizations were used to
identify different user behaviors. Welch's t-test was employed
to show that female users tend to make longer trips than male users. Lastly, a
Random Forest algorithm was trained to predict the duration of bike trips,
with RMSE of 2.5769 and 6.2845 on the training and test sets respectively.
Motivation
Mexico City is a large urban area with a massive transportation network. However,
the city is overpopulated, has serious traffic problems, and numerous people
come to work from neighboring cities and towns, which contributes to further
saturating the transportation services and traffic.
ECOBICI has been adopted as an efficient transportation alternative to move
around the city, not only because it complements the massive transportation
network, but also because of the health, environmental, and time-saving benefits
that contribute to a better quality of life.
Understanding and predicting user behavior and bike usage patterns is key to
providing a better service and contributing to further growth and expansion of
ECOBICI within the city, allowing more people to benefit from it.
ECOBICI Network
• ECOBICI started operating in
February 2010 with 84 bike
stations and 1,200 bikes.

• 400% growth in 6 years.

• Currently 480 bike stations,
more than 6,000 bikes, and
over 100,000 users benefit
from this service.

• 35 km² area coverage.


Service Usage Information
• ECOBICI allows registered users to take a bike at any bike station and return
it to the bike station closest to their destination. Unlimited trips of a maximum
duration of 45 minutes are permitted. Anyone who wants to access the
ECOBICI system can pay a subscription for one year, one week, three days
or one day.

• In order to withdraw and return a bike, an ECOBICI card must be scanned at
the bike station.

• The hours of service are Mon-Sun, from 5:00 am to 12:30 am.


Dataset(s)
1. Usage dataset: ECOBICI open data consists of a set of historical usage
files ranging from February 2010 to the present date, available for download as
monthly reports in csv format at the following link (English version):

https://www.ecobici.cdmx.gob.mx/en/informacion-del-servicio/open-data

The monthly report from May 2019 was employed for the analysis. This
dataset contains 750,910 instances, and 9 variables, namely: user gender,
user age, bike number, departure and arrival station, departure/arrival date,
departure/arrival time.
Dataset(s)
2. Bike station dataset: There is also an API from which a complete list of bike
stations and live bike availability data can be queried.

The following link leads to an English translation of the API manual:

https://www.ecobici.cdmx.gob.mx/sites/default/files/pdf/user_manual_api_eng_final.pdf

The bike station data was obtained through the ECOBICI API. The table
contains 480 observations (each corresponding to a bike station) and 11 fields,
out of which only station ID number, bike type and geographic location
(latitude, longitude) were used.
Data Preparation and Cleaning
Usage data was loaded from a csv file. No missing values were present.
Dates and times were combined into new variables containing the complete date
and time information for departures and arrivals. Using these new variables, the
total duration (in minutes) of bike trips was computed.

Irregularities were detected, such as negative or zero duration trips, as well as
extreme travel times as large as 2,550 hours. Percentile analysis was used to
assess the occurrence of outliers, and only observations corresponding to positive
trip durations no larger than 80 minutes were kept.
Data Preparation and Cleaning
Departure day of the week, day of the month and hour of the day were extracted
using the pandas datetime accessor.

Usage data was combined with the bike station dataset in order to add a
departure and arrival geographic location to each trip, as well as bike type.

In order to feed variables to models, preprocessing steps were taken, such as
logarithmic transformations, categorical data specification, variable
standardization, one-hot encoding, and feature engineering.
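As a rough illustration of two of the preprocessing steps listed above, the sketch below log-transforms a duration column and one-hot encodes two categorical columns with pandas and NumPy. The toy table and its values are made up; only the column names mirror the ones used in this project.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names follow the usage table
toy = pd.DataFrame({
    "user_gender": ["M", "F", "M"],
    "bike_type": ["BIKE", "ELECTRIC_BIKE", "BIKE"],
    "travel_time": [5.0, 12.5, 40.0],
})

# Logarithmic transformation of the skewed duration variable
toy["log_travel_time"] = np.log(toy["travel_time"])

# One-hot encoding: one indicator column per category level
encoded = pd.get_dummies(toy, columns=["user_gender", "bike_type"])
print(sorted(encoded.columns))
```

Each category level becomes its own indicator column, which is why station identifiers with hundreds of levels make the encoded table wide.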
Research Question(s)
• Is there a difference in the duration of bike trips made by male and female
users?

• Can the duration of a bike trip be estimated from the user gender and age, as
well as departure date and time, and start/end points?

• Can different types of user behaviors and patterns be identified using the
available data? (TSNE + clustering w/one hot encoding)
Methods
• Pandas and matplotlib were used for exploratory data analysis. Visualizations
such as histograms, bar charts, line plots, and heatmaps were used
to analyze user patterns and behaviors.

• In addition to visual tools, Welch's t-test was used to determine if there is a
statistically significant difference between female and male users' bike travel
times. Unlike Student's t-test, it does not assume equal variances,
but independence between users and a Gaussian distribution are still required.

• Using sklearn, a regression tree-based algorithm (Random Forest) was
employed to predict the duration of bike trips.
Findings
Visualization of user patterns and trends

• Number of bike trips made by male
users is almost three times the number
of trips made by female users.
• Age distribution of users is positively
skewed.
• Most trips are made by people aged
around 30.
• People in their early 20's and younger
tend to use ECOBICI far less than
people in older age groups.
Findings
• Seasonal pattern.
• Peak times = weekdays and
workday start/end hours.
• Users seem to mainly utilize
bikes as part of their daily
commute to work.
• Labour day = pattern
disruption.
• This agrees with histogram
showing it is mostly people in
working age who use ECOBICI
bikes.
Findings

• Skewed distribution heavily biased
towards shorter trips.

• Summary statistics suggest that
female users tend to make slightly
longer bike trips than male users.

• Most users seem to follow the 45-minute
trip rule established by
ECOBICI.
Findings
Statistical tests

• Hypotheses: $\mu_M = \mu_F$ vs $\mu_M \neq \mu_F$, and $\mu_M \geq \mu_F$ vs $\mu_M < \mu_F$

$\mu_M$ and $\mu_F$ represent the mean trip duration for male and female users
respectively.

• Welch's t-test statistic: $t = \dfrac{\bar{X}_M - \bar{X}_F}{\sqrt{\dfrac{S_M^2}{N_M} + \dfrac{S_F^2}{N_F}}}$

$\bar{X}$, $S^2$ and $N$ denote sample mean, sample variance, and sample size.

• Trip durations have a skewed (non-Gaussian) distribution; hence a logarithmic
transformation is applied to normalize data.
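To make the statistic concrete, here is a minimal sketch of Welch's t computed by hand on log-transformed samples, using only the Python standard library. The helper `welch_t` and the synthetic log-normal samples are illustrative assumptions, not the project's data or code; the degrees of freedom follow the usual Welch-Satterthwaite approximation.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se2 = va / na + vb / nb
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, dof

random.seed(0)
# Synthetic log-normal "durations" (hypothetical parameters, not ECOBICI data)
male = [random.lognormvariate(2.35, 0.6) for _ in range(3000)]
female = [random.lognormvariate(2.45, 0.6) for _ in range(1000)]

# Test on the log scale, where the samples are closer to Gaussian
t, dof = welch_t([math.log(x) for x in male], [math.log(x) for x in female])
print(f"t = {t:.3f}, dof = {dof:.1f}")
```

With the male sample listed first, a negative t points towards longer female trips, which is the direction of the one-sided alternative stated above.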
Findings
Transformed data has a more symmetric distribution that looks more like a Gaussian.

Welch's t-test is run on the transformed data (significance level = 0.05):

• t-statistic = −29.8334
• p-value ≈ 0

Null hypotheses are rejected in favor of the alternative hypotheses.
Findings

Predictive model: Random Forest Regression

• Ensemble of regression trees → prediction = average of individual predictions.
• Chosen because it is a powerful algorithm that requires little tuning.
• Data shuffled and randomly separated into training and test sets (70%-30%).
• Departure hour is cyclical → defined sin(2𝜋hour/24) and cos(2𝜋hour/24).
• Categorical variables were one-hot encoded.
• Due to high computational cost, all parameters were left at their default values except for maximum
tree depth (set to 35) and random subset size (square root of the number of features).
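The settings listed above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the features and target are synthetic stand-ins, and the variable names are hypothetical. Only the 70%-30% shuffled split, the maximum depth of 35, and the square-root feature subset come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in features (hypothetical): age, sin/cos of departure hour
hour = rng.integers(0, 24, size=n)
X = np.column_stack([
    rng.integers(18, 70, size=n),
    np.sin(2 * np.pi * hour / 24),
    np.cos(2 * np.pi * hour / 24),
])
y = 10 + 3 * X[:, 1] + rng.normal(0, 1, size=n)  # made-up target

# Shuffled 70%-30% split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)

# Defaults everywhere except depth and per-split feature subset
model = RandomForestRegressor(max_depth=35, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)

rmse_train = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
rmse_test = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(rmse_train, rmse_test)
print(model.feature_importances_)  # impurity-based importance scores
```

Deep, unpruned trees usually fit the training set much more closely than unseen data, so a training RMSE well below the test RMSE is expected with these settings.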
Findings
Root mean squared errors on training and test data were 2.5769
and 6.2845 respectively.

A variable importance score is automatically computed when
fitting a Random Forest by accumulating the improvement in
the split criterion, over all the trees in the forest, separately for
each variable.

Most important variables: location, age, departure day of the month,
and departure sin/cos(hour).
Limitations
The analysis presented in this project used only data from May 2019. I did not
consider all months of 2019 because of the extremely large number of
observations and the immense computation times involved in training models on my
laptop.
Adding data from other months to the models can make them richer and allow
them to generalize better.
Conclusions
• Female users tend to take slightly longer trips than male users.

• The number of trips made by male users is almost three times the number made by female users.

• Most people who ride ECOBICI bikes use the service as part of their daily
commute.

• ECOBICI travel times can be estimated with a test RMSE of 6.2845 from user variables
such as age and gender, as well as departure/destination location, departure
day/time/day of the week, and bike type. However, the most important variables
seem to be age, departure day/hour, and departure/destination location.
Acknowledgements
I acknowledge Eduardo Moreno and Elena Villalobos who worked with me on a
school project back in June 2020, using ECOBICI open data. Our previous work
focused on analysis of variance test to compare bike usage data from 2019 and
2020 in order to assess the Covid-19 pandemic’s effect on the bike network, as
well as spline interpolation and time series forecasting using ARIMA models.
References
• https://www.ecobici.cdmx.gob.mx/en
• https://github.com/dominoFire/ecobici-python/blob/master/ecobici.py
• https://www.ecobici.cdmx.gob.mx/sites/default/files/pdf/user_manual_api_eng_final.pdf
• https://en.wikipedia.org/wiki/Welch%27s_t-test
• https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
• https://towardsdatascience.com/understanding-random-forest-58381e0602d2
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
FinalProject

December 22, 2020

1 Final Project: Mexico City Urban Bicycle Network Analysis


This project aims to analyze a public database from Mexico City's public bike sharing system
(ECOBICI), obtained from the official website (available both in Spanish and English).
As described on the website, ECOBICI started operating in February 2010 with 84 bike stations
and 1,200 bikes. In only 6 years the system has grown 400% due to user demand. There are
currently 480 bike stations and more than 6,000 bikes, and more than 100,000 users benefit from this
service from Monday to Sunday inside a 35 km² area.
The public bike sharing system ECOBICI has been adopted as an efficient transportation alternative
to move around Mexico City, not only because it complements the massive transportation network,
but also because of the health, environmental, and time-saving benefits that contribute to a better
quality of life.
A map of the current ECOBICI coverage within Mexico City is shown below for illustration purposes.
[1]: from IPython.display import Image
Image(filename='ECOBICI_map.png')
[1]:

1.0.1 Usage information
ECOBICI allows registered users to take a bike at any bike station and return it
to the bike station closest to their destination. Unlimited trips of a maximum
duration of 45 minutes are permitted. Anyone who wants to access the ECOBICI
system can pay a subscription for one year, one week, three days or one day.
In order to withdraw and return a bike, an ECOBICI card must be scanned at the
bike station. The hours of service are Mon-Sun, from 5:00 am to 12:30 am.

1.0.2 Research questions


The purpose of this final project is to analyze the available data in order to
attempt to give an answer to the following questions:

• Is there a difference in the duration of bike trips made by male and female
users?
• Can the duration of a bike trip be estimated by the user gender and age, as
well as the date, time, and start/end points?
• Can I predict what area of the city you are going to? (Station
clustering, maps, approximate long-lat as Cartesian coordinates)
• Can different types of user behaviours be identified using the available
data? (TSNE + clustering w/one hot encoding)

1.0.3 Data sources, loading, cleaning, and exploratory analysis


ECOBICI open data consists of a set of historical files of usages ranging from
February 2010 to present date, available for download as monthly reports in csv
format. There is also an API where a complete list of bike stations and live bike
availability data can be queried from.
Due to the Covid-19 pandemic, the total number of ECOBICI rides as well as the
typical user behaviour have dramatically changed since early 2020. For this
reason, only data from mid-2019 was considered in the analysis. Specifically,
usage data from May 2019 was employed*.
Bikes and bike stations are uniquely labeled by an integer number. It is worth
mentioning that bike station numeration and location are not related. Hence,
stations with similar numbers are not necessarily geographically close to each
other.
The structure and data types of the May 2019 usage data are shown in the following
cells of code.
*Variable names have been translated to English. Dates follow a DD/MM/YYYY
format.
[2]: #import pandas
import pandas as pd
import numpy as np

#read data from csv file
df_usage = pd.read_csv("2019-05.csv")
#rename columns in english
eng_names = ["user_gender","user_age","bike","departure_station","departure_date","departure_time",
             "arrival_station","arrival_date","arrival_time"]
d = {df_usage.columns[i] : eng_names[i] for i in range(len(eng_names))}
df_usage.rename(columns=d,inplace=True)
#show first 5 observations
df_usage.head()

[2]: user_gender user_age bike departure_station departure_date \


0 M 25 1427 372 01/05/2019
1 M 26 8431 202 01/05/2019

2 M 28 10212 340 01/05/2019
3 F 23 12098 290 01/05/2019
4 M 33 11352 290 01/05/2019

departure_time arrival_station arrival_date arrival_time


0 00:00:04 397 01/05/2019 00:05:18
1 00:00:26 318 01/05/2019 00:32:24
2 00:00:49 394 01/05/2019 00:11:31
3 00:00:51 292 01/05/2019 00:05:30
4 00:01:03 292 01/05/2019 00:05:31

[3]: #print table dimensions


print("Usage data table contains",df_usage.shape[0],"instances, and",df_usage.shape[1],"variables.")

Usage data table contains 750910 instances, and 9 variables.

[4]: #information and data types


print("Variable list and types:\n")
df_usage.info()

Variable list and types:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750910 entries, 0 to 750909
Data columns (total 9 columns):
user_gender 750910 non-null object
user_age 750910 non-null int64
bike 750910 non-null int64
departure_station 750910 non-null int64
departure_date 750910 non-null object
departure_time 750910 non-null object
arrival_station 750910 non-null int64
arrival_date 750910 non-null object
arrival_time 750910 non-null object
dtypes: int64(4), object(5)
memory usage: 51.6+ MB
The data table is inspected for missing values; fortunately, none are
present. For this reason, neither row/column deletion nor data
imputation techniques need to be applied.
[5]: #missing value search
df_usage.isnull().any()

[5]: user_gender False


user_age False
bike False

departure_station False
departure_date False
departure_time False
arrival_station False
arrival_date False
arrival_time False
dtype: bool

Dates and times are combined into new date-time formatted variables containing
the complete date and time information for departures and arrivals.
Using these new variables, the total duration (in minutes) of bike trips is
computed.
[6]: #conversion to date and time
df_usage["departure_datetime"] = pd.to_datetime(df_usage["departure_date"] + " " + df_usage["departure_time"],dayfirst = True)
df_usage["arrival_datetime"] = pd.to_datetime(df_usage["arrival_date"] + " " + df_usage["arrival_time"],dayfirst = True)

[7]: #trip duration in minutes


df_usage["travel_time"] = (df_usage["arrival_datetime"] - df_usage["departure_datetime"]).dt.total_seconds()/60
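As a side note on the date parsing, the sketch below shows the same duration computation on two made-up trips, using an explicit day-first format string instead of `dayfirst=True`. Passing `format` is an optional alternative that makes the DD/MM/YYYY assumption explicit; the toy rows are hypothetical, and the second one crosses midnight.

```python
import pandas as pd

# Two hypothetical trips; the second one crosses midnight
toy = pd.DataFrame({
    "departure_date": ["01/05/2019", "01/05/2019"],
    "departure_time": ["00:00:04", "23:50:00"],
    "arrival_date": ["01/05/2019", "02/05/2019"],
    "arrival_time": ["00:05:18", "00:10:00"],
})

fmt = "%d/%m/%Y %H:%M:%S"  # explicit day-first format
dep = pd.to_datetime(toy["departure_date"] + " " + toy["departure_time"], format=fmt)
arr = pd.to_datetime(toy["arrival_date"] + " " + toy["arrival_time"], format=fmt)

# Trip duration in minutes
toy["travel_time"] = (arr - dep).dt.total_seconds() / 60
print(toy["travel_time"].round(2).tolist())
```

Because the full date is part of each timestamp, trips that end on the following day still get a positive duration.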

Upon inspection of the newly defined travel time field, some irregularities can
be noted:
For instance, the minimum trip duration is negative, which is clearly a system
error. Similarly the maximum value is roughly equal to 2550 hours, which is
either an error, or that observation might correspond to a user who failed to
return the bike to a station on time. Since bike trips should be at most 45
minutes long, one would expect the standard deviation to be smaller than its
rather high value, which is thus a consequence of outliers.

[8]: #summary statistics


df_usage["travel_time"].describe()

[8]: count 750910.000000


mean 14.960994
std 221.671455
min -1429.066667
25% 6.733333
50% 10.883333
75% 17.566667
max 153132.816667
Name: travel_time, dtype: float64

Trips of zero-minute duration can be attributed to users who fail to undock the
bike from its anchor point in the station, as well as system errors. For that

reason, it makes sense to keep only instances where the travel time is strictly
positive. There exist cases where users take a bike and immediately return it
to the same station, as well as very short trips made to contiguous stations.
However, defining a cut-off minimum travel time would be rather arbitrary. Hence,
trips with positive durations are retained.
[9]: #keep only trips with strictly positive duration
df_usage = df_usage[df_usage["travel_time"]>0]

As mentioned before, bike trips can be at most 45 minutes long. Exceeded travel
times might cause users to have their membership temporarily suspended, or even
terminated in extreme cases. By looking at trips longer than 45 minutes, it can
be seen that observations corresponding to large travel times are numerous.
After sorting in descending order, extremely large times are observed.

[10]: #filter long trips and sorting


df_usage[["travel_time"]][df_usage["travel_time"]>45].sort_values("travel_time",ascending = False)

[10]: travel_time
584701 153132.816667
325373 96387.983333
242703 52802.016667
70163 14637.133333
571992 11703.916667
… …
33316 45.016667
254227 45.016667
373378 45.016667
257677 45.016667
627968 45.016667

[9669 rows x 1 columns]

By looking at the 99th percentile of the duration of all trips, it can be observed
that despite the existence of multiple users who made very long trips, 99% of
all trips last under one hour. Thus, to avoid completely eliminating
moderately long trips, outlying instances corresponding to durations greater than
80 minutes are dropped.
[11]: #99% percentile computation
np.percentile(df_usage[["travel_time"]].sort_values("travel_time",ascending = False).values,99)

[11]: 48.36666666666667

[12]: #filter out trips longer than 80 minutes


df_usage = df_usage[df_usage["travel_time"]<=80]
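The effect of a few extreme values on the summary statistics, and of the 80-minute cut-off, can be illustrated on synthetic data. The sketch below is not based on ECOBICI records: the log-normal parameters and the three injected outliers are assumptions chosen only to mimic a short-trip distribution with a long tail.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical durations (minutes): mostly short trips plus a few extreme values
durations = np.concatenate([
    rng.lognormal(mean=2.4, sigma=0.6, size=10_000),
    [2000.0, 50000.0, 150000.0],
])

p50, p99 = np.percentile(durations, [50, 99])

cutoff = 80  # minutes, same threshold as above
kept = durations[durations <= cutoff]
share_kept = kept.size / durations.size

print(f"median = {p50:.1f}, 99th percentile = {p99:.1f}")
print(f"std before = {durations.std():.1f}, std after = {kept.std():.1f}")
print(f"share of trips kept = {share_kept:.4f}")
```

A handful of extreme observations is enough to inflate the standard deviation, while the percentiles barely move, which is why percentiles are a more robust guide for picking the cut-off.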

A histogram for travel time reveals a skewed distribution heavily biased towards
shorter trips. Since the number of observations is very large, a finer grid of
bins can be used to better appreciate the shape of the distribution.
[22]: #duration distribution visualization
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure(figsize=(6,4))
plt.hist(df_usage.travel_time.values,color="darkviolet",bins=100)
plt.xlabel("Travel time")
plt.ylabel("Frequency")
plt.title("May 2019 trip duration distribution")
line_mean = plt.axvline(x=df_usage.travel_time.values.mean(),color="black",linestyle="--")
line_median = plt.axvline(x=np.median(df_usage.travel_time.values),color="black",linestyle="-")

plt.legend(handles=[line_mean,line_median],labels=["mean","median"])
plt.grid(True)
plt.savefig("duration_hist.png")
plt.show()

Once rows have been filtered by detecting errors and outliers corresponding to
trip durations, other variables can be explored.
The following histograms reveal that the number of bike trips made by male users
is almost three times the number of trips made by female users. Also shown is the
fact that the age distribution of users is positively skewed, meaning that most
trips are made by rather young users. The majority of trips are made by people
aged around 30. It is also interesting to note that people in their early 20's
and younger tend to use ECOBICI far less than people in older age groups.

[21]: #gender and age distribution visualization


fig , axs = plt.subplots(figsize=(8,8),ncols=1,nrows=2)
df_usage["user_gender"].value_counts().plot(kind='bar',ax=axs[0],color=["red","blue"])

axs[0].set_xlabel("User gender")
axs[0].set_ylabel("Frequency")
axs[0].grid(axis="y")
axs[0].set_title("May 2019 bike trips")
axs[1].hist(df_usage.user_age,bins=81,color="darkgray")
axs[1].set_xlabel("User age")
axs[1].set_ylabel("Frequency")
#trips per age, drawn separately for each gender
for gender, color in [("M","red"),("F","blue")]:
    counts = df_usage["user_age"][df_usage["user_gender"]==gender].value_counts().sort_index()
    axs[1].plot(counts.index,counts.values,color=color,label=gender)

axs[1].legend()#title="Usage by gender")
axs[1].grid(True)
plt.tight_layout()
plt.savefig("gender_age_hists")
plt.show()

Another interesting insight can be obtained by looking at the distribution of
bike trips across time and date. In order to visualize this distribution, the
departure hours, as well as the day of the week (Monday=0 to Sunday=6) and
day of the month, are considered. Afterwards, a heatmap is used to observe the
two-dimensional distribution of trips.
[23]: #day of the week, day of the month, and hour of the day are extracted from departure date-time variable

df_usage["departure_weekday"] = df_usage["departure_datetime"].dt.dayofweek
df_usage["departure_day"] = df_usage["departure_datetime"].dt.day
df_usage["departure_hour"] = df_usage["departure_datetime"].dt.hour

[48]: #2d histogram
fig = plt.figure(figsize=(10,10))
plt.
,→hist2d(df_usage["departure_day"],df_usage["departure_hour"],bins=[31,24],cmin=1,cmap="plasma

plt.xticks(range(1,32),rotation="vertical",labels=["Labour day"]+[str(i) if not (i+1)%7==0 else "Monday" for i in range(2,32)])

plt.yticks(range(0,24))
plt.gca().invert_yaxis()
plt.xlabel("Day of the month")
plt.ylabel("Hour of the day")
plt.title("May 2019 bike trip departures 2D-histogram")
plt.colorbar()
plt.savefig("heatmap.png")
plt.tight_layout()
plt.show()
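The counts behind a day-by-hour heatmap like this one can also be tabulated directly, which is handy for checking individual cells. The following sketch uses `pd.crosstab` on four made-up departure timestamps; the data are hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical departure timestamps (1 May 2019 was a Wednesday)
stamps = pd.to_datetime(pd.Series([
    "2019-05-01 08:15:00", "2019-05-01 08:40:00",
    "2019-05-02 08:05:00", "2019-05-02 18:30:00",
]))
toy = pd.DataFrame({
    "departure_weekday": stamps.dt.dayofweek,  # Monday=0 ... Sunday=6
    "departure_day": stamps.dt.day,
    "departure_hour": stamps.dt.hour,
})

# Rows = hour of the day, columns = day of the month, cells = number of trips
counts = pd.crosstab(toy["departure_hour"], toy["departure_day"])
print(counts)
```

Combinations with no trips appear as zeros in the table, whereas the plot above leaves them blank through `cmin=1`.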

The previous plot shows that the number of daily trips has a seasonal component
on both day of the month and hour of the day. The peak times coincide with both
weekdays and workday start and end hours. This suggests that users who ride
ECOBICI bikes as part of their daily commute to work
make up a large portion of the total number of trips. Notice
how Labour day (celebrated on May 1st) seems to disrupt the bike usage pattern.
These findings agree with the age histograms, which show it is mostly people of
working age who ride ECOBICI bikes.
Mexico City is a large urban area with a massive subway network, and plenty of
bus, BRT, and train routes. However, the city is overpopulated, has serious
traffic problems, and numerous people come to work from neighboring cities and

towns, which contributes to further saturating the transportation services and
the city's traffic. ECOBICI might be actually helping thousands of employees to
travel between work and home without needing to spend longer times frustrated in
traffic. Besides allowing people to have a more enjoyable and healthy commute,
ECOBICI might also be contributing to lower pollution emissions in Mexico City by
offering a green transportation method.
In addition to bike usage data, a list of ECOBICI bike stations was obtained
through ECOBICI's API. The following cell of code is a modified version of a
class publicly available here, and it performs the required queries:
[25]: #import required libraries
import requests
import json

#personal id and secret keys


#removed from submitted jupyter notebook for security reasons
client_id = ''
client_secret = ''
#urls
base_url = "https://pubsbapi-latam.smartbike.com"
url_access = "{}/oauth/v2/token?client_id={}&client_secret={}&grant_type=client_credentials".format(base_url, client_id,client_secret)

#request access token


r_access = requests.get(url_access)
access_token = r_access.json()['access_token']

#url and request for station list


url = "{}/api/v1/stations.json?access_token={}".format(base_url, access_token)
r_stations = requests.get(url)

#save to pandas dataframe


stations = r_stations.json()
df_stations = pd.DataFrame(stations["stations"])
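One possible refinement, shown here only as a sketch, is to build the two request URLs with `urllib.parse.urlencode` so that any special characters in the credentials are escaped. The helper names `build_token_url` and `build_stations_url` are hypothetical; the base URL and endpoint paths are the ones used in the cell above, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://pubsbapi-latam.smartbike.com"

def build_token_url(client_id, client_secret, base_url=BASE_URL):
    """Hypothetical helper: token URL with escaped query parameters."""
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "client_credentials",
    })
    return f"{base_url}/oauth/v2/token?{query}"

def build_stations_url(access_token, base_url=BASE_URL):
    """Hypothetical helper: station-list URL for a given access token."""
    query = urlencode({"access_token": access_token})
    return f"{base_url}/api/v1/stations.json?{query}"

# Placeholder credentials, for illustration only
print(build_token_url("my id", "s&cret"))
print(build_stations_url("TOKEN"))
```

The resulting strings can be passed to `requests.get` exactly as in the original cell.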

The station list can be extracted in json format and then saved to a pandas
dataframe.
Inspecting the table shows that each of the 480 bike stations appears only once
in the table, accompanied by the corresponding information such as ID, name,
address, geographical coordinates, etc.

[26]: df_stations.head()

[26]: id name \
0 124 124 CLAUDIO BERNARD-DR. LICEAGA
1 241 E241 EJERCITO NAL-JUAN VAZQUEZ DE LA MELLA

2 243 243 MIGUEL DE CERVANTES SAAVEDRA-LAGO FILT
3 350 350 JOSE CLEMENTE OROZCO-CORREGGIO
4 445 E445 RIFF-AVENIDA RIO CHURUBUSCO

address addressNumber zipCode \


0 124 - Claudio Bernard-Dr. Liceaga S/N 06500
1 241 - Ejercito Nacional-Juan Vazquez de la Mella S/N 11520
2 243 - Miguel de Cervantes Saavedra-Lago Filt S/N 11510
3 350 - Jose Clemente Orozco-Correggio S/N 3710
4 445 - Riff-Avenida Rio Churubusco S/N 3340

districtCode districtName altitude nearbyStations \


0 1 Ampliación Granada None [119, 133]
1 1 Ampliación Granada None [222, 460]
2 1 Ampliación Granada None [199, 242, 244]
3 1 Ampliación Granada None [349, 352]
4 1 Ampliación Granada None [431, 434]

location stationType
0 {'lat': 19.422392, 'lon': -99.150358} BIKE
1 {'lat': 19.43862, 'lon': -99.20758} ELECTRIC_BIKE
2 {'lat': 19.440839, 'lon': -99.196712} BIKE
3 {'lat': 19.384062, 'lon': -99.181482} BIKE
4 {'lat': 19.35827, 'lon': -99.156105} ELECTRIC_BIKE

[27]: #no duplicated ids


df_stations["id"].duplicated().any()

[27]: False

[28]: #print table dimensions


print("Bike station table contains",df_stations.shape[0],"instances, and",df_stations.shape[1],"variables.")

Bike station table contains 480 instances, and 11 variables.

[29]: #variables and types


print("Variable list and type:\n")
df_stations.info()

Variable list and type:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 11 columns):
id 480 non-null int64
name 480 non-null object
address 480 non-null object

addressNumber 480 non-null object
zipCode 479 non-null object
districtCode 480 non-null object
districtName 480 non-null object
altitude 0 non-null object
nearbyStations 480 non-null object
location 480 non-null object
stationType 480 non-null object
dtypes: int64(1), object(10)
memory usage: 41.4+ KB
This table contains missing values in the zip code and altitude fields. However,
these variables will not be necessary and they can be deleted.
[30]: #detection of missing values
df_stations.isnull().any()

[30]: id False
name False
address False
addressNumber False
zipCode True
districtCode False
districtName False
altitude True
nearbyStations False
location False
stationType False
dtype: bool

[31]: #deletion of columns with missing values


del df_stations["altitude"]
del df_stations["zipCode"]

The location of each station is given by the latitude and longitude. These values
are contained in a column as dictionaries, from which they can be extracted to
define separate variables.
Since these variables will be used as predictors for the trip duration, they
will be standardized by subtracting their mean and dividing by their standard
deviation.
[32]: #extract latitude and longitude
df_stations["latitude"] = pd.Series([d["lat"] for d in list(df_stations["location"])])
df_stations["longitude"] = pd.Series([d["lon"] for d in list(df_stations["location"])])

#standardize variables
df_stations["latitude"] = (df_stations["latitude"]-df_stations["latitude"].mean())/df_stations["latitude"].std()
df_stations["longitude"] = (df_stations["longitude"]-df_stations["longitude"].mean())/df_stations["longitude"].std()
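A small detail worth keeping in mind when standardizing this way: the pandas `std()` method uses the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to the population version (`ddof=0`). The sketch below compares both on the five latitudes shown in the table head above; with 480 stations the difference is negligible, so this is a note on conventions rather than a correction.

```python
import numpy as np
import pandas as pd

# Latitudes (degrees) of the five stations shown in the table head above
lat = pd.Series([19.422392, 19.43862, 19.440839, 19.384062, 19.35827])

z = (lat - lat.mean()) / lat.std()                    # pandas: sample std (ddof=1)
z_pop = (lat - lat.mean()) / np.std(lat.to_numpy())   # NumPy default: ddof=0

print(z.round(3).tolist())
print(z_pop.round(3).tolist())
```

Either convention yields a centered variable; only the scale differs, by a factor of sqrt(n/(n-1)).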

By combining the usage and station tables, information about the departure and
arrival station location, as well as the bike type can be assigned to each trip
made by users.
[33]: #merge data sets
df_usage = pd.
,→merge(df_usage,df_stations[["id","latitude","longitude","stationType"]],how="left",left_on="

,→rename(columns={"latitude":"departure_latitude","longitude":

,→"departure_longitude","stationType":"bike_type"})

del df_usage["id"]
df_usage = pd.
,→merge(df_usage,df_stations[["id","latitude","longitude"]],how="left",left_on="arrival_statio

,→rename(columns={"latitude":"arrival_latitude","longitude":

,→"arrival_longitude"})

del df_usage["id"]

14 rows of the merged table have missing values in the arrival location
variables. This occurs because stations 1002 and 3000 do not appear in the
official list of 480 bike stations. Since this number of missing values is very
small, these rows are dropped.
[34]: df_usage[df_usage[["arrival_latitude","arrival_longitude"]].isna().any(axis=1)]

[34]: user_gender user_age bike departure_station departure_date \


53554 M 24 8198 116 03/05/2019
101609 F 24 12195 441 06/05/2019
186763 M 34 9794 265 08/05/2019
188612 F 39 11722 161 09/05/2019
197050 F 34 15207 449 09/05/2019
241586 M 34 11271 3 10/05/2019
253686 M 39 11094 119 11/05/2019
282677 M 34 8666 138 13/05/2019
302093 M 39 9466 107 14/05/2019
519621 M 42 7212 3 23/05/2019
640109 M 46 7382 291 28/05/2019
643460 M 24 10676 8 28/05/2019
701672 M 42 3441 113 30/05/2019
701675 M 29 10509 113 30/05/2019

departure_time arrival_station arrival_date arrival_time \


53554 13:17:38 1002 03/05/2019 13:37:28
101609 10:04:43 3000 06/05/2019 10:10:26

186763 22:19:01 1002 08/05/2019 22:28:26
188612 07:09:55 1002 09/05/2019 08:19:43
197050 10:49:31 1002 09/05/2019 11:37:56
241586 22:37:29 1002 10/05/2019 22:39:26
253686 22:38:49 1002 11/05/2019 22:42:27
282677 17:22:41 1002 13/05/2019 17:38:22
302093 09:38:29 1002 14/05/2019 09:46:49
519621 09:42:54 1002 23/05/2019 10:12:56
640109 15:08:49 1002 28/05/2019 16:19:26
643460 17:06:57 1002 28/05/2019 18:21:09
701672 13:50:51 1002 30/05/2019 14:25:21
701675 13:51:04 1002 30/05/2019 14:25:00

departure_datetime arrival_datetime travel_time \


53554 2019-05-03 13:17:38 2019-05-03 13:37:28 19.833333
101609 2019-05-06 10:04:43 2019-05-06 10:10:26 5.716667
186763 2019-05-08 22:19:01 2019-05-08 22:28:26 9.416667
188612 2019-05-09 07:09:55 2019-05-09 08:19:43 69.800000
197050 2019-05-09 10:49:31 2019-05-09 11:37:56 48.416667
241586 2019-05-10 22:37:29 2019-05-10 22:39:26 1.950000
253686 2019-05-11 22:38:49 2019-05-11 22:42:27 3.633333
282677 2019-05-13 17:22:41 2019-05-13 17:38:22 15.683333
302093 2019-05-14 09:38:29 2019-05-14 09:46:49 8.333333
519621 2019-05-23 09:42:54 2019-05-23 10:12:56 30.033333
640109 2019-05-28 15:08:49 2019-05-28 16:19:26 70.616667
643460 2019-05-28 17:06:57 2019-05-28 18:21:09 74.200000
701672 2019-05-30 13:50:51 2019-05-30 14:25:21 34.500000
701675 2019-05-30 13:51:04 2019-05-30 14:25:00 33.933333

departure_weekday departure_day departure_hour departure_latitude \


53554 4 3 13 0.791364
101609 0 6 10 -1.982620
186763 2 8 22 1.276576
188612 3 9 7 -0.168581
197050 3 9 10 0.430955
241586 4 10 22 0.900653
253686 5 11 22 0.701518
282677 0 13 17 0.329525
302093 1 14 9 0.723580
519621 3 23 9 0.900653
640109 1 28 15 -0.491209
643460 1 28 17 0.879614
701672 3 30 13 0.848015
701675 3 30 13 0.848015

departure_longitude bike_type arrival_latitude \


53554 0.902949 BIKE NaN

101609 0.215174 BIKE NaN
186763 1.179335 BIKE,TPV NaN
188612 0.380311 BIKE NaN
197050 -0.218579 ELECTRIC_BIKE NaN
241586 0.774832 BIKE,TPV NaN
253686 0.972440 BIKE NaN
282677 0.986781 BIKE NaN
302093 1.324256 BIKE NaN
519621 0.774832 BIKE,TPV NaN
640109 0.062356 BIKE NaN
643460 0.797006 BIKE NaN
701672 1.043259 BIKE NaN
701675 1.043259 BIKE NaN

arrival_longitude
53554 NaN
101609 NaN
186763 NaN
188612 NaN
197050 NaN
241586 NaN
253686 NaN
282677 NaN
302093 NaN
519621 NaN
640109 NaN
643460 NaN
701672 NaN
701675 NaN

[35]: df_stations[(df_stations["id"].isin([1002,3000]))]

[35]: Empty DataFrame
Columns: [id, name, address, addressNumber, districtCode, districtName,
nearbyStations, location, stationType, latitude, longitude]
Index: []

[36]: df_usage.dropna(axis=0,inplace=True)

Lastly, the departure and arrival station numbers, user gender, departure
weekday, and bike type are converted to the categorical data type so that these
variables can be fed into the predictive model.
Additionally, the sine and cosine transformations

sin(2πx/period), cos(2πx/period)

are applied to the departure hour variable (period=24) to take into account its
cyclical nature and include it in the model.

[37]: #conversion to categorical type
for s in ["user_gender","departure_station","arrival_station","departure_weekday","bike_type"]:
    df_usage[s] = df_usage[s].astype("category")

#cyclical variables
period = 24
df_usage["departure_hour_sin"] = np.sin(2*np.pi*df_usage["departure_hour"]/period)
df_usage["departure_hour_cos"] = np.cos(2*np.pi*df_usage["departure_hour"]/period)
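
To illustrate why the sine/cosine pair captures the cyclical nature of the hour, the following minimal sketch compares distances between encoded hours. The helper names `encode_hour` and `circle_distance` are hypothetical and are not part of the original notebook; only the transformation itself is taken from the text above.

```python
import numpy as np

def encode_hour(hour, period=24):
    """Map an hour of the day onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * hour / period
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_wrap = circle_distance(23, 0)   # one hour apart, across midnight
d_next = circle_distance(11, 12)  # one hour apart, at midday
d_far = circle_distance(0, 12)    # half a day apart

print(d_wrap, d_next, d_far)
```

With the raw hour, 23 and 0 would be 23 units apart; after the encoding they are as close as any other pair of consecutive hours, while hours half a day apart sit at opposite points of the circle.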

After the cleaning and data integration process, the usage data table now
contains the necessary variables and types. Its structure is displayed in the
following cell:
[38]: df_usage.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 748237 entries, 0 to 748250
Data columns (total 22 columns):
user_gender 748237 non-null category
user_age 748237 non-null int64
bike 748237 non-null int64
departure_station 748237 non-null category
departure_date 748237 non-null object
departure_time 748237 non-null object
arrival_station 748237 non-null category
arrival_date 748237 non-null object
arrival_time 748237 non-null object
departure_datetime 748237 non-null datetime64[ns]
arrival_datetime 748237 non-null datetime64[ns]
travel_time 748237 non-null float64
departure_weekday 748237 non-null category
departure_day 748237 non-null int64
departure_hour 748237 non-null int64
departure_latitude 748237 non-null float64
departure_longitude 748237 non-null float64
bike_type 748237 non-null category
arrival_latitude 748237 non-null float64
arrival_longitude 748237 non-null float64
departure_hour_sin 748237 non-null float64
departure_hour_cos 748237 non-null float64
dtypes: category(5), datetime64[ns](2), float64(7), int64(4), object(4)
memory usage: 107.8+ MB

1.0.4 Models and findings:
In order to address the first research question, summary statistics are first
obtained, grouping observations by user gender.

[40]: #summary statistics by gender


df_usage.groupby("user_gender")[["travel_time"]].describe()

[40]: travel_time \
count mean std min 25% 50%
user_gender
F 188769.0 13.813889 9.394245 0.016667 7.1 11.300000
M 559468.0 13.277851 9.324944 0.016667 6.6 10.683333

75% max
user_gender
F 17.916667 79.816667
M 17.300000 79.983333

According to the mean and quartiles from the previous summary, female users
seem to be making slightly longer bike trips on average. In order to determine
whether the mean travel times of female and male users differ in a statistically
significant way, a two-sample statistical test is performed.
For this purpose, Welch's t-test is employed. The test statistic is the
following:

$$
t = \frac{\bar{X}_M - \bar{X}_F}{\sqrt{\dfrac{S_M^2}{N_M} + \dfrac{S_F^2}{N_F}}}
$$

where $\bar{X}_M$ and $\bar{X}_F$ represent the male and female mean durations
respectively, $S_M^2$ and $S_F^2$ similarly denote the sample variances for each
gender, and $N_M$ and $N_F$ are the sample sizes.
This test assumes that the trip duration is normally distributed, as well as
independence between users. Contrary to the well-known Student's t-test, equal
population variances are not assumed. Provided these assumptions are satisfied,
the test statistic approximately follows a t-distribution.
The hypothesis being tested is $\mu_M = \mu_F$ vs. $\mu_M \neq \mu_F$ (two-sided
test), where $\mu_M$ and $\mu_F$ denote the mean travel time for male and female
users respectively.
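
As a rough illustration of the statistic defined above, the sketch below computes it by hand and compares it with `scipy.stats.ttest_ind(..., equal_var=False)`. The data are synthetic and the function name `welch_t` is hypothetical. The degrees of freedom use the Welch-Satterthwaite approximation, which is not spelled out in the text and is included here as an assumption about how the reference distribution is chosen.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic, approximate degrees of freedom, two-sided p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size  # S_a^2 / N_a
    vb = b.var(ddof=1) / b.size  # S_b^2 / N_b
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p

# Synthetic samples standing in for the two groups (not the ECOBICI data)
rng = np.random.default_rng(0)
sample_m = rng.normal(2.3, 0.6, size=500)
sample_f = rng.normal(2.4, 0.7, size=300)

t_manual, dof, p_manual = welch_t(sample_m, sample_f)
reference = stats.ttest_ind(sample_m, sample_f, equal_var=False)
print(t_manual, reference.statistic)
print(p_manual, reference.pvalue)
```

The sign convention matches the formula: the first sample plays the role of the male group, so a negative statistic means the second group has the larger mean.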
Before running the test, a transformation needs to be applied to the travel
time variable in order to bring its highly skewed distribution, shown above,
closer to a Gaussian distribution.
The proposed transformation is the natural logarithm, which is often used to
reduce skew. The next plot compares the distribution of trip duration with
and without the logarithmic transformation. Note how the transformed data has a
more symmetrical distribution that looks more like a Gaussian.

[41]: fig, axs = plt.subplots(figsize=(10,4),nrows=1,ncols=2)
axs[0].hist(df_usage["travel_time"],bins=100)
axs[0].set_ylabel("Frequency")
axs[0].set_xlabel("Travel time")
axs[0].set_title("Untransformed data")
axs[0].grid(True)
axs[1].hist(np.log(df_usage["travel_time"]),bins=100)
axs[1].set_ylabel("Frequency")
axs[1].set_xlabel("log(Travel time)")
axs[1].set_title("Transformed data")
axs[1].grid(True)
#apply the layout before saving so the saved file matches the displayed figure
plt.tight_layout()
plt.savefig("log_duration.png")
plt.show()
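
The visual impression of reduced skew can also be checked numerically. The sketch below is an illustration only: it uses synthetic log-normal durations rather than the ECOBICI data, and `scipy.stats.skew` as the measure of asymmetry, neither of which appears in the original notebook.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "durations" (an assumption, not the real data)
rng = np.random.default_rng(1)
durations = rng.lognormal(mean=2.4, sigma=0.6, size=10_000)

skew_raw = skew(durations)          # clearly positive for right-skewed data
skew_log = skew(np.log(durations))  # close to zero after the transformation

print("skewness before:", skew_raw)
print("skewness after: ", skew_log)
```

A skewness near zero after the transformation is consistent with a roughly symmetric distribution, which is what the normality assumption of the test requires.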

Welch's t-test is now applied to the transformed data. The obtained p-value
is less than 0.05, which means the equal mean hypothesis is rejected at the 0.05
significance level.
The test statistic has a negative value and its p-value divided by two is still
less than 0.05. This implies that the one-sided test for the hypothesis µM >= µF
would be rejected in favor of µM < µF at the 0.05 significance level.
These tests allow us to conclude that there is statistically significant evidence
to support the idea that female users tend to make longer bike trips than male
users.
[42]: #import t test for independent samples from scipy
from scipy.stats import ttest_ind

#run the test with the log durations for male and female users
ttest_ind(a=np.log(df_usage["travel_time"][df_usage["user_gender"]=="M"].values),
          b=np.log(df_usage["travel_time"][df_usage["user_gender"]=="F"].values),
          equal_var=False)

[42]: Ttest_indResult(statistic=-29.833493218651533, pvalue=2.604770940977746e-195)
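
The step from the two-sided p-value to the one-sided conclusion can be sketched as follows. This is an illustration on synthetic data, with a hypothetical helper `one_sided_less`; the cross-check uses the `alternative="less"` argument of `ttest_ind`, which assumes SciPy 1.6 or later and was not used in the original notebook.

```python
import numpy as np
from scipy import stats

def one_sided_less(a, b):
    """p-value for H0: mean(a) >= mean(b) against H1: mean(a) < mean(b),
    derived from the two-sided Welch test."""
    result = stats.ttest_ind(a, b, equal_var=False)
    t, p_two = result.statistic, result.pvalue
    # Halve the two-sided p-value only when the statistic points the right way
    p_one = p_two / 2 if t < 0 else 1 - p_two / 2
    return t, p_one

# Synthetic log-durations standing in for the two groups
rng = np.random.default_rng(2)
log_m = rng.normal(2.40, 0.65, size=4000)
log_f = rng.normal(2.50, 0.65, size=1500)

t_stat, p_less = one_sided_less(log_m, log_f)
p_direct = stats.ttest_ind(log_m, log_f, equal_var=False,
                           alternative="less").pvalue
print(t_stat, p_less, p_direct)
```

Halving is valid here only because the statistic is negative, that is, the observed difference already points in the direction of the alternative.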

The next model to be applied involves predicting the duration of bike trips
based on user and start/end point information. Travel time plays the role of
the response variable, while the variables to be included as predictors in the
model are the following: User gender, user age, departure station, arrival
station, departure day of the week, departure day of the month, departure
hour (cosine/sine transformations), bike type, departure latitude/longitude
(standardized), and arrival latitude/longitude (standardized).

[49]: #define response vector y and predictor matrix X


y = df_usage["travel_time"]
X =␣
,→df_usage[["user_gender","user_age","bike_type","departure_station","arrival_station","depart


,→"departure_hour_sin","departure_hour_cos","departure_latitude","departure_longitude","arriva

Due to the rather complex task of predicting travel times from such a limited
number of variables, it is unlikely that a linear model will perform well. For
this reason, a regression tree-based algorithm called Random Forest is chosen.
This algorithm trains a large number of regression trees, each on a bootstrap
resample of the training data of the same size as the original, considering only
a random subset of variables at each split. The algorithm's prediction is then
computed as the average of the predictions of the individual trees. The purpose
of resampling and random selection of variable subsets is to decorrelate the
trees, while the averaging procedure reduces the variance of the predictions.
This technique is attractive because it is powerful and requires little tuning.
More information about the Random Forest algorithm can be found in the references.
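
The averaging step described above can be made concrete with a small sketch. It fits a forest on synthetic data (the arrays `X_demo` and `y_demo` are assumptions made for illustration) and checks that the forest's prediction coincides with the mean of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic nonlinear regression problem (not the ECOBICI data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(300, 4))
y_demo = np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(0, 0.1, 300)

forest = RandomForestRegressor(n_estimators=25, max_features="sqrt",
                               random_state=0)
forest.fit(X_demo, y_demo)

# One row of predictions per tree, then average over the trees
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
avg_pred = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_demo)

print(np.allclose(avg_pred, forest_pred))
print("mean spread across trees:", tree_preds.std(axis=0).mean())
```

The spread across trees gives a feel for how much individual trees disagree; averaging is what smooths that disagreement out.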
Before fitting the model, the data is split into training and test datasets
(70%-30% respectively), and as a last preprocessing step, categorical variables
must be one-hot encoded as dummy binary (0-1) variables. Note however that since
the bike station label is a categorical variable, one-hot encoding the departure
and arrival stations would increase the number of variables dramatically. A very
high-dimensional setting can negatively impact the performance of the model.
For this reason, these variables are dropped, but their location information is
retained using latitude and longitude.

[50]: #preprocessing
from sklearn.model_selection import train_test_split
del X["departure_station"]
del X["arrival_station"]
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
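
To see how quickly one-hot encoding widens the predictor matrix, consider the toy frame below. The values are invented for illustration; only the column names follow the notebook. Each distinct station label becomes its own 0-1 column, so with several hundred stations each station variable would contribute several hundred columns.

```python
import pandas as pd

# Toy data: four trips, each departing from a different station
demo = pd.DataFrame({
    "user_gender": pd.Categorical(["F", "M", "M", "F"]),
    "departure_station": pd.Categorical([1, 27, 302, 480]),
    "user_age": [25, 31, 40, 52],
})

encoded_all = pd.get_dummies(demo)
encoded_small = pd.get_dummies(demo.drop(columns=["departure_station"]))

print(encoded_all.shape[1], "columns with the station variable")
print(encoded_small.shape[1], "columns without it")
print(list(encoded_small.columns))
```

Dropping the station label while keeping latitude and longitude, as done above, preserves the spatial information in two numeric columns per station variable.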

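The model is now fitted with 100 trees, a maximum depth of 35, and the square
root of the number of variables considered at each split; all other parameters
keep their default values. See the scikit-learn documentation for further
information.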
[52]: #define and fit the model
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor(max_depth = 35, n_estimators=100, random_state = 42,max_features="sqrt")
rf.fit(X_train,y_train)

[52]: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=35,
                      max_features='sqrt', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

Once the Random Forest estimator function has been learned, predictions are made
for both training and test data.

[53]: #predictions
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

Train and test model performance is then assessed by root mean squared error
(RMSE).
[55]: #training and test set performance
from sklearn.metrics import mean_squared_error
print("Train RMSE is",np.sqrt(mean_squared_error(y_train,y_train_pred)))
print("Test RMSE is",np.sqrt(mean_squared_error(y_test,y_test_pred)))

Train RMSE is 2.576982518919417
Test RMSE is 6.284552990250726

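
For reference, RMSE is the square root of the average squared prediction error. The sketch below computes it by hand on a few made-up values and compares it against a naive baseline that always predicts the mean. The helper `rmse` and the numbers are illustrative assumptions, not results from the ECOBICI data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up trip durations (minutes) and predictions
y_true_demo = np.array([5.0, 12.0, 20.0, 8.0])
y_pred_demo = np.array([7.0, 10.0, 17.0, 8.0])

model_rmse = rmse(y_true_demo, y_pred_demo)
# Baseline: always predict the mean duration
baseline_rmse = rmse(y_true_demo, np.full_like(y_true_demo, y_true_demo.mean()))

print(model_rmse, baseline_rmse)
```

A constant-mean baseline scores an RMSE equal to the standard deviation of the response, so the reported test RMSE can be read against the travel time standard deviation of roughly 9.3 minutes shown in the summary statistics.
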
A variable importance score is automatically computed when fitting a Random
Forest by accumulating, separately for each variable, the improvement in the
split criterion over all the trees in the forest.
[87]: #feature importance
figure = plt.figure(figsize=(8,6))
imp = rf.feature_importances_
plt.bar(range(1,len(imp)+1),imp,color="lightgreen")
plt.xticks(range(1,len(imp)+1),rotation="vertical")#,labels=[col for col in X_train.columns])
plt.ylabel("Importance score")
plt.xlabel("Variable")
plt.title("Random Forest variable importance score")
plt.savefig("importance.png")
plt.show()

[89]: X_train.columns

[89]: Index(['user_age', 'departure_day', 'departure_hour_sin', 'departure_hour_cos',
       'departure_latitude', 'departure_longitude', 'arrival_latitude',
       'arrival_longitude', 'user_gender_F', 'user_gender_M', 'bike_type_BIKE',
       'bike_type_BIKE,TPV', 'bike_type_ELECTRIC_BIKE', 'departure_weekday_0',
       'departure_weekday_1', 'departure_weekday_2', 'departure_weekday_3',
       'departure_weekday_4', 'departure_weekday_5', 'departure_weekday_6'],
      dtype='object')
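
Since the bar positions in the plot are numbered rather than named, it can help to pair each importance score with its column name directly. The sketch below shows one way to do this with a `pandas.Series`; it runs on synthetic data with hypothetical column names (`signal`, `noise_a`, `noise_b`), and the same pattern would apply to `rf.feature_importances_` and `X_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one informative column and two pure-noise columns
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y_demo = 3 * X_demo["signal"] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Label each score with its column and sort from most to least important
ranked = pd.Series(forest.feature_importances_,
                   index=X_demo.columns).sort_values(ascending=False)
print(ranked)
```

The scores are normalized to sum to one, so each value can be read as a share of the total reduction in the split criterion.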

1.0.5 Conclusions
• Female users tend to take slightly longer trips than male users.
• The number of trips made by male users is almost three times the number made
by female users.
• Most people who ride ECOBICI bikes use the service as part of their daily
commute.
• ECOBICI travel times can be estimated with a test RMSE of about 6.3 minutes
(training RMSE of about 2.6 minutes) from user variables such as age and
gender, as well as departure/destination location, departure day/time/day of
the week, and bike type. However, the most important variables seem to be
age, departure day/hour, and departure/destination location.

1.0.6 References
• https://www.ecobici.cdmx.gob.mx/en
• https://github.com/dominoFire/ecobici-python/blob/master/ecobici.py
• https://www.ecobici.cdmx.gob.mx/sites/default/files/pdf/user_manual_api_eng_final.pdf
• https://en.wikipedia.org/wiki/Welch%27s_t-test
• https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
• https://towardsdatascience.com/understanding-random-forest-58381e0602d2
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.h
