Network Analysis
Dario Diaz Cuevas
Abstract
Mexico City's public bike sharing system (ECOBICI) publishes open data, which
was analyzed here with a focus on usage patterns and bike trip duration.
Visualizations were used to identify different user behaviors. A Welch's t-test
was employed to show that female users tend to make longer trips than male
users. Lastly, a Random Forest algorithm was trained to predict the duration of
bike trips, achieving RMSEs of 2.5769 and 6.2845 on the training and test sets,
respectively.
Motivation
Mexico City is a large urban area with a massive transportation network. However,
the city is overpopulated, has serious traffic problems, and numerous people
come to work from neighboring cities and towns, which contributes to further
saturating the transportation services and traffic.
ECOBICI has been adopted as an efficient transportation alternative to move
around the city, not only because it complements the massive transportation
network, but also because of the health, environmental, and time-saving benefits
that contribute to a better quality of life.
Understanding and predicting user behavior and bike usage patterns is key to
providing a better service and to the further growth and expansion of ECOBICI
within the city, allowing more people to benefit from it.
ECOBICI Network
• ECOBICI started operating in February 2010 with 84 bike stations and 1,200
bikes.
https://www.ecobici.cdmx.gob.mx/en/informacion-del-servicio/open-data
Dataset(s)
1. Bike usage dataset: The monthly report from May 2019 was employed for the
analysis. This dataset contains 750,910 instances and 9 variables, namely: user
gender, user age, bike number, departure and arrival station, departure/arrival
date, and departure/arrival time.
2. Bike station dataset: There is also an API from which a complete list of bike
stations and live bike availability data can be queried.
https://www.ecobici.cdmx.gob.mx/sites/default/files/pdf/user_manual_api_eng_final.pdf
The bike station data was obtained through the ECOBICI API. The table
contains 480 observations (each corresponding to a bike station) and 11 fields,
out of which only station ID number, bike type and geographic location
(latitude, longitude) were used.
Data Preparation and Cleaning
Usage data was loaded from a csv file. No missing values were present.
Dates and times were combined into new variables containing the complete date
and time information for departures and arrivals. Using these new variables, the
total duration (in minutes) of bike trips was computed.
Usage data was combined with the bike station dataset in order to add a
departure and arrival geographic location to each trip, as well as bike type.
To feed variables to models, preprocessing steps were taken, such as logarithmic
transformations, categorical data specification, variable standardization,
one-hot encoding, and feature engineering.
Research Question(s)
• Is there a difference in the duration of bike trips made by male and female
users?
• Can the duration of a bike trip be estimated from the user gender and age, as
well as departure date and time, and start/end points?
• Can different types of user behaviors and patterns be identified using the
available data? (t-SNE + clustering with one-hot encoding)
Methods
• Pandas and matplotlib were used for exploratory data analysis; visualization
techniques such as histograms, bar charts, line plots, and heatmaps were used
to analyze user patterns and behaviors.
• The number of male users is almost three times the number of female users.
• Most people who ride ECOBICI bikes use the service as part of their daily
commute.
• ECOBICI travel times can be estimated without much error from user variables
such as age and gender, as well as departure/destination location, departure
day/time/day of the week, and bike type. However, the most important variables
seem to be age, departure day/hour, and departure/destination location.
Acknowledgements
I acknowledge Eduardo Moreno and Elena Villalobos, who worked with me on a
school project back in June 2020 using ECOBICI open data. Our previous work
focused on an analysis of variance test comparing bike usage data from 2019 and
2020 to assess the Covid-19 pandemic's effect on the bike network, as well as
spline interpolation and time series forecasting using ARIMA models.
References
• https://www.ecobici.cdmx.gob.mx/en
• https://github.com/dominoFire/ecobici-python/blob/master/ecobici.py
• https://www.ecobici.cdmx.gob.mx/sites/default/files/pdf/user_manual_api_eng_final.pdf
• https://en.wikipedia.org/wiki/Welch%27s_t-test
• https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
• https://towardsdatascience.com/understanding-random-forest-58381e0602d2
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
FinalProject
1.0.1 Usage information
ECOBICI allows registered users to take a bike at any bike station and return it
to the bike station closest to their destination. Unlimited trips of a maximum
duration of 45 minutes are permitted. Anyone who wants to access the ECOBICI
system can pay a subscription for one year, one week, three days, or one day.
In order to withdraw and return a bike, an ECOBICI card must be scanned at the
bike station. The hours of service are Mon-Sun, from 5:00 am to 12:30 am.
• Is there a difference in the duration of bike trips made by male and female
users?
• Can the duration of a bike trip be estimated by the user gender and age, as
well as the date, time, and start/end points?
• Can the area of the city a user is heading to be predicted? (station
clustering, maps, approximating Cartesian coordinates from latitude-longitude)
• Can different types of user behaviors be identified using the available
data? (t-SNE + clustering with one-hot encoding)
#rename the columns to English (the first lines of this cell were truncated in this export)
eng_names = ["user_gender","user_age","bike","departure_station","departure_date","departure_time",
             "arrival_station","arrival_date","arrival_time"]
d = {df_usage.columns[i] : eng_names[i] for i in range(len(eng_names))}
df_usage.rename(columns=d,inplace=True)
#show first 5 observations
df_usage.head()
  user_gender  user_age   bike  departure_station departure_date
2           M        28  10212                340     01/05/2019
3           F        23  12098                290     01/05/2019
4           M        33  11352                290     01/05/2019
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750910 entries, 0 to 750909
Data columns (total 9 columns):
user_gender 750910 non-null object
user_age 750910 non-null int64
bike 750910 non-null int64
departure_station 750910 non-null int64
departure_date 750910 non-null object
departure_time 750910 non-null object
arrival_station 750910 non-null int64
arrival_date 750910 non-null object
arrival_time 750910 non-null object
dtypes: int64(4), object(5)
memory usage: 51.6+ MB
The data table is inspected to detect possible missing values, which is
fortunately not the case. For this reason, no row/column deletion nor data
imputation techniques need to be applied.
[5]: #missing value search
df_usage.isnull().any()
user_gender False
user_age False
bike False
departure_station False
departure_date False
departure_time False
arrival_station False
arrival_date False
arrival_time False
dtype: bool
Dates and times are combined into new datetime-formatted variables containing
the complete date and time information for departures and arrivals.
Using these new variables, the total duration (in minutes) of bike trips is
computed.
[6]: #conversion to date and time
df_usage["departure_datetime"] = pd.to_datetime(df_usage["departure_date"] + " " + df_usage["departure_time"], dayfirst=True)
df_usage["arrival_datetime"] = pd.to_datetime(df_usage["arrival_date"] + " " + df_usage["arrival_time"], dayfirst=True)
#total trip duration in minutes
df_usage["travel_time"] = (df_usage["arrival_datetime"] - df_usage["departure_datetime"]).dt.total_seconds()/60
Upon inspection of the newly defined travel time field, some irregularities can
be noted:
For instance, the minimum trip duration is negative, which is clearly a system
error. Similarly, the maximum value is roughly equal to 2550 hours, which is
either an error, or that observation might correspond to a user who failed to
return the bike to a station on time. Since bike trips should be at most 45
minutes long, one would expect the standard deviation to be smaller than its
rather high value, which is thus a consequence of outliers.
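These irregularities can be surfaced with a quick numerical summary. The notebook's `describe()` output did not survive this export, so here is a minimal sketch on toy data reproducing the kinds of anomalies described:

```python
import pandas as pd

# toy travel times (minutes): a negative duration (system error), a zero,
# ordinary trips, and one extreme outlier, mimicking the anomalies described
travel_time = pd.Series([-3.0, 0.0, 8.5, 12.0, 15.0, 45.0, 153132.8])

summary = travel_time.describe()
print(summary)
# min is negative, max is enormous, and std is inflated by the outlier
```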
Trips of zero-minute duration can be attributed to users who fail to undock the
bike from its anchor point in the station, as well as system errors. For that
reason, it makes sense to keep only instances where the travel time is strictly
positive. There exist cases where users take a bike and immediately return it to
the same station, as well as very short trips made to contiguous stations.
However, defining a cut-off minimum travel time would be rather arbitrary.
Hence, all trips with positive durations are retained.
[9]: #filter negative durations
df_usage = df_usage[df_usage["travel_time"]>0]
As mentioned before, bike trips can be at most 45 minutes long. Exceeding this
limit might cause users to have their membership temporarily suspended, or even
terminated in extreme cases. Looking at trips longer than 45 minutes shows that
observations with large travel times are numerous. After sorting in descending
order, extremely large times are observed.
[10]: travel_time
584701 153132.816667
325373 96387.983333
242703 52802.016667
70163 14637.133333
571992 11703.916667
… …
33316 45.016667
254227 45.016667
373378 45.016667
257677 45.016667
627968 45.016667
Looking at the 99th percentile of trip duration shows that, despite the
existence of multiple users who made very long trips, 99% of all trips last
under one hour. Thus, to avoid completely eliminating moderately long trips,
outlying instances corresponding to durations greater than 80 minutes are
dropped.
[11]: #99th percentile computation
np.percentile(df_usage[["travel_time"]].sort_values("travel_time",ascending=False).values, 99)
[11]: 48.36666666666667
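The 80-minute cut-off described above amounts to a one-line filter. Since the corresponding cell did not survive this export, a minimal sketch on a toy stand-in for `df_usage`:

```python
import pandas as pd

# toy stand-in for the usage table
df_usage = pd.DataFrame({"travel_time": [5.0, 30.0, 79.9, 80.5, 153132.8]})

# drop outlying trips longer than 80 minutes
df_usage = df_usage[df_usage["travel_time"] <= 80]
print(len(df_usage))  # 3 trips remain
```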
A histogram of travel time reveals a distribution heavily skewed towards shorter
trips. Since the number of observations is very large, a finer grid of bins can
be used to better appreciate the shape of the distribution.
[22]: #duration distribution visualization
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure(figsize=(6,4))
plt.hist(df_usage.travel_time.values, color="darkviolet", bins=100)
plt.xlabel("Travel time")
plt.ylabel("Frequency")
plt.title("May 2019 trip duration distribution")
line_mean = plt.axvline(x=df_usage.travel_time.values.mean(), color="black", linestyle="--")
line_median = plt.axvline(x=np.median(df_usage.travel_time.values), color="black", linestyle="-")
plt.legend(handles=[line_mean,line_median], labels=["mean","median"])
plt.grid(True)
plt.savefig("duration_hist.png")
plt.show()
Once rows corresponding to erroneous and outlying trip durations have been
filtered out, other variables can be explored.
The following histograms reveal that the number of bike trips made by male users
is almost three times the number of trips made by female users. Also shown is
the fact that the age distribution of users is positively skewed, meaning that
most trips are made by rather young users. The majority of trips are made by
people aged around 30. It is also interesting to note that people in their early
20's and younger tend to use ECOBICI far less than people in older age groups.
#gender and age distribution visualization (the first lines of this cell were
#lost in this export; the figure setup and gender bar chart are reconstructed)
fig, axs = plt.subplots(figsize=(10,4), nrows=1, ncols=2)
axs[0].bar(["M","F"], df_usage["user_gender"].value_counts()[["M","F"]].values, color="darkgray")
axs[0].set_xlabel("User gender")
axs[0].set_ylabel("Frequency")
axs[0].grid(axis="y")
axs[0].set_title("May 2019 bike trips")
axs[1].hist(df_usage.user_age, bins=81, color="darkgray")
axs[1].set_xlabel("User age")
axs[1].set_ylabel("Frequency")
axs[1].plot(df_usage["user_age"][df_usage["user_gender"]=="M"].value_counts().sort_index().index,
            df_usage["user_age"][df_usage["user_gender"]=="M"].value_counts().sort_index().values,
            color="red", label="M")
axs[1].plot(df_usage["user_age"][df_usage["user_gender"]=="F"].value_counts().sort_index().index,
            df_usage["user_age"][df_usage["user_gender"]=="F"].value_counts().sort_index().values,
            color="blue", label="F")
axs[1].legend()
axs[1].grid(True)
plt.tight_layout()
plt.savefig("gender_age_hists")
plt.show()
Another interesting insight can be obtained by looking at the distribution of
bike trips across time and date. To visualize this distribution, the departure
hour, as well as the day of the week (Monday=0 to Sunday=6) and day of the
month, are considered. A heatmap is then used to observe the two-dimensional
distribution of trips.
[23]: #day of the week, day of the month, and hour of the day are extracted from departure date-time variable
df_usage["departure_weekday"] = df_usage["departure_datetime"].dt.dayofweek
df_usage["departure_day"] = df_usage["departure_datetime"].dt.day
df_usage["departure_hour"] = df_usage["departure_datetime"].dt.hour
[48]: #2d histogram
fig = plt.figure(figsize=(10,10))
plt.hist2d(df_usage["departure_day"], df_usage["departure_hour"], bins=[31,24], cmin=1, cmap="plasma")
plt.yticks(range(0,24))
plt.gca().invert_yaxis()
plt.xlabel("Day of the month")
plt.ylabel("Hour of the day")
plt.title("May 2019 bike trip departures 2D-histogram")
plt.colorbar()
plt.savefig("heatmap.png")
plt.tight_layout()
plt.show()
The previous plot shows that the number of trips has a seasonal component in
both the day of the month and the hour of the day. The peak times coincide with
weekdays and with workday start and end hours. This suggests that users who ride
the ECOBICI network as part of their daily commute to work make up a large
portion of the total number of trips. Notice how Labour Day (celebrated on May
1st) seems to disrupt the bike usage pattern. These findings agree with the age
histograms, which show that it is mostly people of working age who ride ECOBICI
bikes.
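The commuter reading of the heatmap can also be checked numerically by counting departures per hour; a sketch on toy data (the column follows the cleaned table's naming):

```python
import pandas as pd

# toy departure hours standing in for df_usage["departure_hour"]
hours = pd.Series([8, 8, 9, 13, 18, 18, 18, 23])

# trips per hour of the day; peaks at morning/evening rush hours support
# the daily-commute interpretation
per_hour = hours.value_counts().sort_index()
print(per_hour)
```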
Mexico City is a large urban area with a massive subway network and plenty of
bus, BRT, and train routes. However, the city is overpopulated, has serious
traffic problems, and numerous people come to work from neighboring cities and
towns, which contributes to further saturating the transportation services and
the city's traffic. ECOBICI may actually be helping thousands of employees to
travel between work and home without having to spend long, frustrating stretches
in traffic. Besides allowing people to have a more enjoyable and healthy
commute, ECOBICI might also be contributing to lower pollution emissions in
Mexico City by offering a green transportation method.
In addition to bike usage data, a list of ECOBICI bike stations was obtained
through ECOBICI's API. The following cell of code is a modified version of a
class publicly available here, and it performs the required queries:
[25]: #import required libraries
import requests
import json
#...(the authentication and query code was truncated in this export; it builds a
#request URL via "...".format(base_url, client_id, client_secret) using personal
#API credentials)
The station list can be extracted in json format and then saved to a pandas
dataframe.
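Assuming the API returns the station list as a JSON array of objects (as the rows shown below suggest), conversion to a DataFrame is direct. A sketch with a hard-coded sample payload standing in for the live response:

```python
import json
import pandas as pd

# illustrative payload mimicking the structure of the station-list response
payload = """[
  {"id": 124, "name": "124 CLAUDIO BERNARD-DR. LICEAGA",
   "location": {"lat": 19.422392, "lon": -99.150358}, "stationType": "BIKE"},
  {"id": 241, "name": "E241 EJERCITO NAL-JUAN VAZQUEZ DE LA MELLA",
   "location": {"lat": 19.43862, "lon": -99.20758}, "stationType": "ELECTRIC_BIKE"}
]"""

df_stations = pd.DataFrame(json.loads(payload))
print(df_stations[["id", "stationType"]])
```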
Inspecting the table shows that each of the 480 bike stations appears only once
in the table, accompanied by the corresponding information such as ID, name,
address, geographical coordinates, etc.
[26]: df_stations.head()
[26]: id name \
0 124 124 CLAUDIO BERNARD-DR. LICEAGA
1 241 E241 EJERCITO NAL-JUAN VAZQUEZ DE LA MELLA
2 243 243 MIGUEL DE CERVANTES SAAVEDRA-LAGO FILT
3 350 350 JOSE CLEMENTE OROZCO-CORREGGIO
4 445 E445 RIFF-AVENIDA RIO CHURUBUSCO
location stationType
0 {'lat': 19.422392, 'lon': -99.150358} BIKE
1 {'lat': 19.43862, 'lon': -99.20758} ELECTRIC_BIKE
2 {'lat': 19.440839, 'lon': -99.196712} BIKE
3 {'lat': 19.384062, 'lon': -99.181482} BIKE
4 {'lat': 19.35827, 'lon': -99.156105} ELECTRIC_BIKE
[27]: #check that no station ID is duplicated (input reconstructed; only the output survived this export)
df_stations["id"].duplicated().any()
[27]: False
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 11 columns):
id 480 non-null int64
name 480 non-null object
address 480 non-null object
addressNumber 480 non-null object
zipCode 479 non-null object
districtCode 480 non-null object
districtName 480 non-null object
altitude 0 non-null object
nearbyStations 480 non-null object
location 480 non-null object
stationType 480 non-null object
dtypes: int64(1), object(10)
memory usage: 41.4+ KB
This table contains missing values in the zip code and altitude fields. However,
these variables will not be needed, so they can be dropped.
[30]: #detection of missing values
df_stations.isnull().any()
[30]: id False
name False
address False
addressNumber False
zipCode True
districtCode False
districtName False
altitude True
nearbyStations False
location False
stationType False
dtype: bool
The location of each station is given by its latitude and longitude. These
values are stored in a single column as dictionaries, from which they can be
extracted into separate variables.
Since these variables will be used as predictors for the trip duration, they are
standardized by subtracting their mean and dividing by their standard deviation.
[32]: #extract latitude and longitude
df_stations["latitude"] = pd.Series([d["lat"] for d in list(df_stations["location"])])
df_stations["longitude"] = pd.Series([d["lon"] for d in list(df_stations["location"])])
#standardize variables
df_stations["latitude"] = (df_stations["latitude"]-df_stations["latitude"].mean())/df_stations["latitude"].std()
df_stations["longitude"] = (df_stations["longitude"]-df_stations["longitude"].mean())/df_stations["longitude"].std()
By combining the usage and station tables, information about the departure and
arrival station location, as well as the bike type can be assigned to each trip
made by users.
[33]: #merge data sets
df_usage = pd.merge(df_usage, df_stations[["id","latitude","longitude","stationType"]],
                    how="left", left_on="departure_station", right_on="id").rename(
                    columns={"latitude":"departure_latitude","longitude":"departure_longitude",
                             "stationType":"bike_type"})
del df_usage["id"]
df_usage = pd.merge(df_usage, df_stations[["id","latitude","longitude"]],
                    how="left", left_on="arrival_station", right_on="id").rename(
                    columns={"latitude":"arrival_latitude","longitude":"arrival_longitude"})
del df_usage["id"]
14 rows of the merged table have missing observations in the arrival location
variables (their indices are listed below). This occurs because stations 1002
and 3000 do not appear in the official list of 480 bike stations. Since this
number of missing values is very small, these rows are dropped.
[34]: df_usage[df_usage[["arrival_latitude","arrival_longitude"]].isna().any(axis=1)]
186763 22:19:01 1002 08/05/2019 22:28:26
188612 07:09:55 1002 09/05/2019 08:19:43
197050 10:49:31 1002 09/05/2019 11:37:56
241586 22:37:29 1002 10/05/2019 22:39:26
253686 22:38:49 1002 11/05/2019 22:42:27
282677 17:22:41 1002 13/05/2019 17:38:22
302093 09:38:29 1002 14/05/2019 09:46:49
519621 09:42:54 1002 23/05/2019 10:12:56
640109 15:08:49 1002 28/05/2019 16:19:26
643460 17:06:57 1002 28/05/2019 18:21:09
701672 13:50:51 1002 30/05/2019 14:25:21
701675 13:51:04 1002 30/05/2019 14:25:00
101609 0.215174 BIKE NaN
186763 1.179335 BIKE,TPV NaN
188612 0.380311 BIKE NaN
197050 -0.218579 ELECTRIC_BIKE NaN
241586 0.774832 BIKE,TPV NaN
253686 0.972440 BIKE NaN
282677 0.986781 BIKE NaN
302093 1.324256 BIKE NaN
519621 0.774832 BIKE,TPV NaN
640109 0.062356 BIKE NaN
643460 0.797006 BIKE NaN
701672 1.043259 BIKE NaN
701675 1.043259 BIKE NaN
arrival_longitude
53554 NaN
101609 NaN
186763 NaN
188612 NaN
197050 NaN
241586 NaN
253686 NaN
282677 NaN
302093 NaN
519621 NaN
640109 NaN
643460 NaN
701672 NaN
701675 NaN
[35]: df_stations[(df_stations["id"].isin([1002,3000]))]
[36]: df_usage.dropna(axis=0,inplace=True)
Lastly, the data types of the departure and arrival station numbers, user
gender, departure weekday, and bike type are converted to categorical in order
to feed these variables into the predictive model.
Additionally, the sine and cosine transformations
sin(2πx/period), cos(2πx/period)
are applied to the departure hour variable (period = 24) to take its cyclical
nature into account and include it in the model.
[37]: #conversion to categorical type
for s in ["user_gender","departure_station","arrival_station","departure_weekday","bike_type"]:
    df_usage[s] = df_usage[s].astype("category")
#cyclical variables
period = 24
df_usage["departure_hour_sin"] = np.sin(2*np.pi*df_usage["departure_hour"]/period)
df_usage["departure_hour_cos"] = np.cos(2*np.pi*df_usage["departure_hour"]/period)
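The point of the sine/cosine encoding is that hours on either side of midnight end up close together in feature space, which the raw hour value (0 vs. 23) does not capture. A quick check:

```python
import numpy as np

period = 24
hours = np.array([0, 23, 12])
sin_h = np.sin(2 * np.pi * hours / period)
cos_h = np.cos(2 * np.pi * hours / period)

def encoded_distance(i, j):
    # Euclidean distance between two hours in (sin, cos) space
    return np.hypot(sin_h[i] - sin_h[j], cos_h[i] - cos_h[j])

# 23:00 is close to 00:00 in the encoded space, while 12:00 is far from it
print(encoded_distance(0, 1), encoded_distance(0, 2))
```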
After the cleaning and data integration process, the usage data table now
contains the necessary variables and types, and is displayed in the following
cell:
[38]: df_usage.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 748237 entries, 0 to 748250
Data columns (total 22 columns):
user_gender 748237 non-null category
user_age 748237 non-null int64
bike 748237 non-null int64
departure_station 748237 non-null category
departure_date 748237 non-null object
departure_time 748237 non-null object
arrival_station 748237 non-null category
arrival_date 748237 non-null object
arrival_time 748237 non-null object
departure_datetime 748237 non-null datetime64[ns]
arrival_datetime 748237 non-null datetime64[ns]
travel_time 748237 non-null float64
departure_weekday 748237 non-null category
departure_day 748237 non-null int64
departure_hour 748237 non-null int64
departure_latitude 748237 non-null float64
departure_longitude 748237 non-null float64
bike_type 748237 non-null category
arrival_latitude 748237 non-null float64
arrival_longitude 748237 non-null float64
departure_hour_sin 748237 non-null float64
departure_hour_cos 748237 non-null float64
dtypes: category(5), datetime64[ns](2), float64(7), int64(4), object(4)
memory usage: 107.8+ MB
1.0.4 Models and findings:
In order to address the first research question, summary statistics are first
obtained, grouping observations by user gender.
[40]:            travel_time
                count       mean       std       min  25%        50%        75%        max
user_gender
F            188769.0  13.813889  9.394245  0.016667  7.1  11.300000  17.916667  79.816667
M            559468.0  13.277851  9.324944  0.016667  6.6  10.683333  17.300000  79.983333
According to the mean and quartiles from the previous summary, female users seem
to make slightly longer bike trips on average. To determine whether the mean
travel times of female and male users differ in a statistically significant way,
a two-sample statistical test is performed.
For this purpose, Welch's t-test is employed. The test statistic is the
following:
t = (X̄_M − X̄_F) / √(S²_M/N_M + S²_F/N_F)
where X̄_M and X̄_F represent the male and female mean durations respectively,
S²_M and S²_F denote the sample variances for each gender, and N_M and N_F are
the sample sizes.
This test assumes that the trip duration is normally distributed, as well as
independence between users. Contrary to the well-known Student's t-test, equal
population variances are not assumed. Given that these assumptions are
satisfied, the test statistic approximately follows a t-distribution.
The hypothesis being tested is µM = µF vs. µM ≠ µF (two-sided test), where µM
and µF denote the mean travel times for male and female users respectively.
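The statistic above can be computed by hand and checked against `scipy.stats.ttest_ind` with `equal_var=False`; a sketch on synthetic stand-ins for the two log-duration samples:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
x_m = rng.normal(loc=2.5, scale=0.6, size=500)  # stand-in for male log-durations
x_f = rng.normal(loc=2.6, scale=0.7, size=200)  # stand-in for female log-durations

# manual Welch statistic: (mean_M - mean_F) / sqrt(S_M^2/N_M + S_F^2/N_F)
t_manual = (x_m.mean() - x_f.mean()) / np.sqrt(
    x_m.var(ddof=1) / len(x_m) + x_f.var(ddof=1) / len(x_f))

t_scipy, p_value = ttest_ind(x_m, x_f, equal_var=False)
print(t_manual, t_scipy)  # the two statistics agree
```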
Before running the test, a transformation needs to be applied to the travel time
variable in order to bring its highly skewed distribution, shown above, closer
to a Gaussian.
The proposed transformation is the natural logarithm, which is often used to
eliminate skew. The next plot compares the distribution of trip duration with
and without the logarithmic transformation. Note how the transformed data has a
more symmetrical, more Gaussian-looking distribution.
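That the logarithm symmetrizes right-skewed data of this kind can be verified with a sample-skewness computation; a sketch on lognormal toy data (not the actual durations):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# right-skewed positive values, qualitatively similar to trip durations
durations = rng.lognormal(mean=2.4, sigma=0.5, size=10_000)

print(skew(durations), skew(np.log(durations)))
# the raw sample is strongly right-skewed; the log-sample skew is near zero
```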
[41]: fig, axs = plt.subplots(figsize=(10,4),nrows=1,ncols=2)
axs[0].hist(df_usage["travel_time"],bins=100)
axs[0].set_ylabel("Frequency")
axs[0].set_xlabel("Travel time")
axs[0].set_title("Untransformed data")
axs[0].grid(True)
axs[1].hist(np.log(df_usage["travel_time"]),bins=100)
axs[1].set_ylabel("Frequency")
axs[1].set_xlabel("log(Travel time)")
axs[1].set_title("Transformed data")
axs[1].grid(True)
plt.savefig("log_duration.png")
plt.tight_layout()
plt.show()
Welch's t-test is now applied to the transformed data. The obtained p-value is
less than 0.05, which means the equal-means hypothesis is rejected at the 0.05
significance level.
The test statistic is negative and its p-value divided by two is still less than
0.05. This implies that the one-sided test of the hypothesis µM >= µF would be
rejected in favor of µM < µF at the 0.05 significance level.
These tests allow us to conclude that there is statistically significant
evidence that female users tend to make longer bike trips than male users.
[42]: #import t test for independent samples from scipy
from scipy.stats import ttest_ind
#run the test with the durations for male and female users
ttest_ind(a=np.log(df_usage["travel_time"][df_usage["user_gender"]=="M"].values),
          b=np.log(df_usage["travel_time"][df_usage["user_gender"]=="F"].values),
          equal_var=False)
The next model to be applied involves predicting the duration of bike trips
based on user and start/end point information. Travel time plays the role of
the response variable, while the variables to be included as predictors in the
model are the following: User gender, user age, departure station, arrival
station, departure day of the week, departure day of the month, departure
hour (cosine/sine transformations), bike type, departure latitude/longitude
(standardized), and arrival latitude/longitude (standardized).
#response variable and predictor selection (the start of this cell was truncated in this export)
y = df_usage["travel_time"]
X = df_usage[["user_gender","user_age","departure_station","arrival_station","departure_weekday",
              "departure_day","departure_hour_sin","departure_hour_cos","departure_latitude",
              "departure_longitude","arrival_latitude","arrival_longitude","bike_type"]]
Due to the rather complex task of predicting travel times from such a limited
number of variables, it is unlikely that a linear model will perform well. For
this reason, a regression-tree-based algorithm, Random Forest, is chosen.
This algorithm trains a large number of regression trees on resampled datasets
of equal size, considering only random subsets of variables at each split. The
algorithm's prediction is then computed as the average of the predictions of the
individual trees. The purpose of resampling and random selection of variable
subsets is to decorrelate the trees, while the averaging procedure attempts to
reduce the variance of the predictions.
This technique is attractive because it is powerful and requires little tuning.
More information about the Random Forest algorithm can be found here.
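The averaging step can be observed directly in scikit-learn: a fitted forest's prediction is the mean of its individual trees' predictions. A toy sketch (synthetic data, not the notebook's model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 3))
y = 10 * X[:, 0] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=25, random_state=42).fit(X, y)

x_new = X[:1]
forest_pred = rf.predict(x_new)[0]
# average the predictions of the individual trees stored in estimators_
tree_mean = np.mean([tree.predict(x_new)[0] for tree in rf.estimators_])
print(forest_pred, tree_mean)  # identical up to floating point
```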
Before fitting the model, the data is split into training and test sets (70%-30%
respectively), and as a last preprocessing step, categorical variables must be
one-hot encoded as dummy binary (0-1) variables. Note, however, that since the
bike station label is a categorical variable, one-hot encoding the departure and
arrival stations would increase the number of variables dramatically, and a very
high-dimensional setting can negatively impact the performance of the model. For
this reason, these variables are dropped, while their location information is
retained through latitude and longitude.
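The dimensionality concern can be illustrated on a toy frame: every distinct category becomes its own 0/1 column, so one-hot encoding a 480-level station variable would add roughly 480 columns (the station IDs below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user_gender": ["M", "F", "M"],
    "departure_station": pd.Categorical([124, 241, 445]),  # imagine 480 levels
})

encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
# one dummy column per gender level and per station level
```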
[50]: #preprocessing
from sklearn.model_selection import train_test_split
del X["departure_station"]
del X["arrival_station"]
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
The model is now fitted; see the scikit-learn documentation for further
information on the chosen parameters.
[52]: #define and fit the model
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(max_depth=35, n_estimators=100, random_state=42, max_features="sqrt")
rf.fit(X_train, y_train)
Once the Random Forest estimator function has been learned, predictions are made
for both training and test data.
[53]: #predictions
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
Train and test model performance is then assessed by root mean squared error
(RMSE).
[55]: #training and test set performance
from sklearn.metrics import mean_squared_error
print("Train RMSE is",np.sqrt(mean_squared_error(y_train,y_train_pred)))
print("Test RMSE is",np.sqrt(mean_squared_error(y_test,y_test_pred)))
#variable importance bar chart (the first lines of this cell were lost in this
#export; the plot is reconstructed from the fitted model's feature_importances_)
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
plt.figure(figsize=(10,4))
plt.bar(importances.index, importances.values, color="darkviolet")
plt.xticks(rotation=90)
plt.ylabel("Importance score")
plt.xlabel("Variable")
plt.title("Random Forest variable importance score")
plt.savefig("importance.png")
plt.show()
[89]: X_train.columns
1.0.5 Conclusions
• Female users tend to take slightly longer trips than male users.
• The number of male users is almost three times the number of female users.
• Most people who ride ECOBICI bikes use the service as part of their daily
commute.
• ECOBICI travel times can be estimated without much error from user variables
such as age and gender, as well as departure/destination location, departure
day/time/day of the week, and bike type. However, the most important
variables seem to be age, departure day/hour, and departure/destination
location.
1.0.6 References
• https://www.ecobici.cdmx.gob.mx/en
• https://github.com/dominoFire/ecobici-python/blob/master/ecobici.py
• https://www.ecobici.cdmx.gob.mx/sites/default/files/pdf/user_manual_api_eng_final.pdf
• https://en.wikipedia.org/wiki/Welch%27s_t-test
• https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
• https://towardsdatascience.com/understanding-random-forest-58381e0602d2
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html