Abstract
Based on a publicly available dataset of all taxi rides in New York City, we model
taxi behaviour after customer dropoff, with the aim of assessing its efficiency
in passenger replenishment.
Results from simulation runs under different heuristics are compared with each
other and with real-life data.
Our model does not aim to be a faithful reproduction of real-life conditions,
but the results support the application of Agent Based Modelling (ABM) to the study
of complex environments such as those created by urban traffic.
Introduction
Data from taxi and other shared rides have been used widely to study various
urban phenomena, such as community detection (de Montis, Caschili, & Chessa,
2013), route selection (E. J. Manley, Addison, & Cheng, 2015), identification of
functional urban regions (E. Manley, 2014), urban traffic congestion (Solé-Ribalta,
Gómez, & Arenas, 2016) and optimal taxi stand selection (Moreira-Matias et al.,
2012).
In this study we aim to model, through an ABM approach, the decisions taxi drivers
make at the conclusion of a paid ride, when they go looking for their next customer. The
objective is to check whether this approach produces "credible" results and is
an appropriate tool for such an aim. We try to answer questions such
as: "What is a good strategy? Is it a random walk, or would a historically informed
decision process deliver improved results?"
Our analysis is based on the 2013 data set of yellow taxis in New York, obtained
through a Freedom of Information Law/Act (FOIL/FOIA) request (NYC DOT, n.d.).
In this data set, the taxi IDs (medallions) and the driver identification
(license), although masked, can be used to uniquely pair taxi
runs. By using data from two consecutive trip records, the point of dropoff of
one customer and the point of pickup of the next can be determined for any taxi.
Timestamps are available to compute “unproductive time” and, although
routing is not available, distances can be approximated through “city block” distance,
given the grid nature of most of Manhattan's streets and avenues.
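The pairing of consecutive trip records can be sketched as follows. This is a minimal illustration and not the code used in the study: the record keys (`medallion`, `license`, `pickup_time`, `dropoff_time`, `pickup_xy`, `dropoff_xy`) are hypothetical names, and coordinates are assumed to be already projected to planar metres.

```python
from collections import defaultdict
from datetime import datetime


def idle_legs(trips):
    """Pair consecutive trips of the same taxi and driver.

    Returns a list of (medallion, idle_seconds, city_block_distance),
    one entry per dropoff followed by a later pickup.
    """
    by_taxi = defaultdict(list)
    for trip in trips:
        by_taxi[(trip["medallion"], trip["license"])].append(trip)

    legs = []
    for (medallion, _driver), runs in by_taxi.items():
        runs.sort(key=lambda t: t["pickup_time"])
        for prev, nxt in zip(runs, runs[1:]):
            idle = (nxt["pickup_time"] - prev["dropoff_time"]).total_seconds()
            if idle < 0:
                continue  # overlapping records are skipped as erroneous
            (x0, y0), (x1, y1) = prev["dropoff_xy"], nxt["pickup_xy"]
            legs.append((medallion, idle, abs(x1 - x0) + abs(y1 - y0)))
    return legs


# Two made-up trips for one taxi: 5 minutes idle, 700 m city-block distance.
example_trips = [
    {"medallion": "M1", "license": "D1",
     "pickup_time": datetime(2013, 10, 24, 8, 0),
     "dropoff_time": datetime(2013, 10, 24, 8, 10),
     "pickup_xy": (0, 0), "dropoff_xy": (0, 0)},
    {"medallion": "M1", "license": "D1",
     "pickup_time": datetime(2013, 10, 24, 8, 15),
     "dropoff_time": datetime(2013, 10, 24, 8, 30),
     "pickup_xy": (400, 300), "dropoff_xy": (900, 900)},
]
example_legs = idle_legs(example_trips)
```

Grouping on both medallion and license keeps a shift change from being counted as unproductive time, which is one possible reading of "uniquely pair taxi runs".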
Our analysis is based on the taxi ride patterns of October 24, 2013, a Thursday.
This was a randomly selected working day that seemed to exhibit a normal
pattern. The dataset was filtered for the island of Manhattan and cleaned of
erroneous records.
We simulate the behaviour of “artificial” taxi drivers, with the exact same number
of taxis and passenger trips as included in the dataset after clean-up (respectively
12,981 and 369,205). We have the timestamp of each passenger pickup, but no
information is available about taxi demand. We therefore assume that the
requesting passenger is present at the pickup location "t" seconds before the pickup
time. "t" is a simulation parameter, which for our analysis was set to 300
seconds (5 minutes).
Our expectation was that a decision based on prior historical knowledge would
fare well. To build that knowledge, the hourly frequency of passenger
pickups and dropoffs over the previous 3 Thursdays is used. The probability of an
“artificial” taxi driver driving to a given location (or cell, in our cell-based mapping
of Manhattan) in search of customers is:
The validity of the historical data was assessed by comparing key statistics and
testing for goodness of fit with a normalized root mean squared error cost function:
Materials and Methods
The data for this simulation were obtained from NYC DOT (n.d.). The dataset
includes multiple fields (NYC Taxi & Limousine Commission,
2015), of which we used:
The data were selected for the appropriate time frame and cleaned, as the dataset
had a significant number of erroneous records, such as invalid timestamps or
GPS locations. Rides originating or ending outside of Manhattan were also
excluded.
The simulation run used the passenger pickup data of Thursday, October 24,
2013, and the historical data were built from the 3 previous Thursdays, namely
October 3, 10 and 17.
A passenger data set with the location and timestamp at which each passenger
started looking for a taxi. This is retrieved directly from the pickup data in the
original data set for October 24, 2013, with "t" seconds (a simulation
parameter) subtracted from the pickup time.
A list of all pickup point locations and timestamps for the previous 3
Thursdays. These points were clustered into 1000 centroids on an hourly
basis through the K-Means algorithm in Matlab, and used to partition
Manhattan into 1000 cells for every hour. An example of the resulting
centroids can be found in figures 2 and 3. The relative weight of each cell
was then computed from the number of pickups and dropoffs in
the cell and used to build a strategy for taxi behaviour after passenger
dropoff.
A data set for the activation and deactivation of taxis. A taxi is considered
active the first time it is seen in the original dataset, and dormant if it
switches drivers or is not seen in the original dataset for more than 2 hours.
The average speeds of a taxi on an hourly basis, both when the taxi is busy
and when it is free.
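The activation and deactivation rule can be illustrated with a short sketch. It assumes that each taxi's sightings are available as a time-sorted list of (timestamp in seconds, driver id) pairs, and that an active interval ends at the last sighting before a break. Neither detail is stated in the text, and the function name is hypothetical.

```python
TWO_HOURS = 2 * 3600  # dormancy threshold in seconds


def activation_table(sightings, max_gap=TWO_HOURS):
    """Build (start, end) active intervals for one taxi.

    A new interval starts at the first sighting, on a driver change,
    or after the taxi has been unseen for more than `max_gap` seconds.
    """
    intervals = []
    start = last = driver = None
    for ts, drv in sightings:
        if start is None:
            start, last, driver = ts, ts, drv
        elif drv != driver or ts - last > max_gap:
            intervals.append((start, last))  # close the previous interval
            start, last, driver = ts, ts, drv
        else:
            last = ts
    if start is not None:
        intervals.append((start, last))
    return intervals


# Driver A is seen twice, then driver B takes over and is unseen for > 2 hours.
example_table = activation_table(
    [(0, "A"), (3600, "A"), (4000, "B"), (11201, "B")]
)
```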
The actual simulation was built on the Repast Simphony platform (North et al.,
2013).
Model Initialization
Three agents, the taxi, the passenger, and a master controller that generates
passenger requests, were built into the model as described below.
Finally, during initialization, all Taxi agents are loaded and added to the context
with their corresponding activation/deactivation tables, and their status is set to
"dormant".
The controller is activated every second and reads the passenger dataset to detect
whether any passenger should be created. Once the controller's time of day reaches
or passes a passenger's timestamp, that passenger is created and added to the context
of the simulation.
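A minimal sketch of this once-per-second release loop is given below. It is written in plain Python and not against the Repast Simphony API; the `Controller` class, its attributes and the tuple layout are illustrative assumptions.

```python
from collections import deque


class Controller:
    """Releases passengers into the simulation as the clock advances."""

    def __init__(self, passengers, wait_t=300):
        # passengers: iterable of (pickup_second_of_day, location).
        # Each passenger starts waiting `wait_t` seconds before pickup.
        requests = sorted(
            ((pickup - wait_t, loc) for pickup, loc in passengers),
            key=lambda r: r[0],
        )
        self.pending = deque(requests)
        self.clock = 0  # seconds since the start of the simulated day

    def step(self):
        """Advance one second; return the passengers created at this tick."""
        created = []
        while self.pending and self.pending[0][0] <= self.clock:
            created.append(self.pending.popleft())
        self.clock += 1
        return created


ctrl = Controller([(305, "a"), (300, "b"), (302, "c")])
first_tick = ctrl.step()                       # clock 0
next_ticks = [ctrl.step() for _ in range(5)]   # clocks 1 to 5
```

Keeping the requests in a sorted queue means each tick only inspects the head of the queue, which matters when the controller fires 86,400 times per simulated day.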
The controller also detects the passage of the hour and selects the corresponding
Manhattan cell-based data for Taxi strategy selection.
Finally, the controller is also responsible for terminating the simulation at the end of
the day.
The Passenger Agent
The passenger agent advertises that it is looking for a taxi and collects
statistics on its waiting time. Passenger agents are removed from the context by
the Taxi agent when they are picked up.
The taxi agent enters and exits the dormant state according to its
activation/deactivation table. This is checked at each step as simulation time
progresses. An active taxi agent can be in one of two states: free or busy. If
busy it moves at "busy" speed, if free at "free" speed. Speeds are set hourly
according to historical data, adjusted by a truncated Gaussian distribution
(𝜌 = 1) with boundaries set as a parameter in the simulation. All simulation runs
used a minimum of 0.7 and a maximum of 1.3 times the average
speed.
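One way to draw such a bounded speed is rejection sampling, sketched below. The sketch assumes the adjustment is a multiplicative factor centred on 1, and reads the "= 1" in the text as a unit spread. Both are interpretations, and the function names are hypothetical.

```python
import random


def truncated_gauss(mean, sigma, low, high, rng=random):
    """Draw from a Gaussian, redrawing until the value lies in [low, high]."""
    while True:
        x = rng.gauss(mean, sigma)
        if low <= x <= high:
            return x


def sample_speed(hourly_average, low=0.7, high=1.3, sigma=1.0, rng=random):
    """Scale the hourly average speed by a truncated Gaussian factor."""
    factor = truncated_gauss(1.0, sigma, low, high, rng)
    return hourly_average * factor


rng = random.Random(0)  # seeded for reproducibility
speeds = [sample_speed(20.0, rng=rng) for _ in range(1000)]
```

With a spread this wide relative to the 0.7 to 1.3 window, the accepted factors are close to uniformly distributed, so the bounds matter more than the Gaussian shape.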
Taxis move along the typical incline of Manhattan's streets and avenues,
with no attempt to follow actual routes. This introduces an obvious error, but
given the grid nature of Manhattan roadways, this is estimated to be small.
Further development of this simulator could use actual routing information from
a map-based routing service.
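A city-block distance measured along the inclined grid, as opposed to the north-south and east-west axes, can be sketched as follows. The 29 degree rotation is an assumed approximate value for Manhattan's grid and is not taken from the text; coordinates are assumed to be planar (east, north) pairs in metres.

```python
import math

GRID_ANGLE_DEG = 29.0  # assumed bearing of the avenues, clockwise from north


def grid_distance(p, q, angle_deg=GRID_ANGLE_DEG):
    """City-block distance between p and q along the rotated street grid."""
    theta = math.radians(angle_deg)
    dx, dy = q[0] - p[0], q[1] - p[1]
    # Project the displacement onto the avenue axis and the street axis.
    along_avenue = dx * math.sin(theta) + dy * math.cos(theta)
    along_street = dx * math.cos(theta) - dy * math.sin(theta)
    return abs(along_avenue) + abs(along_street)


# A point 1000 m away exactly along an avenue needs no cross-street travel.
uptown = (1000 * math.sin(math.radians(29.0)), 1000 * math.cos(math.radians(29.0)))
avenue_only = grid_distance((0.0, 0.0), uptown)
```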
$$C_{score} = \left(N_{trunc}\left(\frac{i+s}{2},\, 1\right)\right) \in |i..s| \ast \frac{3600}{v} + S$$
Where:
The term $\left(N_{trunc}\left(\frac{i+s}{2},\, 1\right)\right) \in |i..s| \ast \frac{3600}{v}$
represents the perceived time to get to
a given cell, and $S$ the cell historical statistical "attractiveness".
To avoid overcrowding cells with taxis, a taxi randomly selects one of the top 10
cells (out of 1,000) resulting from applying the formula above.
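The random choice among the best-rated cells can be sketched as below. The scores are taken as given, and whether a higher or a lower score counts as better is left as a parameter, since the ranking direction is not spelled out here. The function name and the `higher_is_better` flag are hypothetical.

```python
import random


def pick_destination(scores, k=10, higher_is_better=True, rng=random):
    """Pick one cell uniformly at random among the k best-scored cells.

    scores: dict mapping cell id -> score for the current hour.
    """
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    return rng.choice(ranked[:k])


# 1,000 cells with made-up scores; only cells 990..999 can be selected.
example_scores = {cell: float(cell) for cell in range(1000)}
rng = random.Random(1)
picks = {pick_destination(example_scores, rng=rng) for _ in range(200)}
```

Spreading taxis over the top 10 cells keeps every free taxi from converging on the single best cell at the same time.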
"i" and "s" are simulation parameters and indirectly control the relative weights
of distance and cell attractiveness in the calculation of destination cell selection
rating. Runs were made with boundaries of 0.1 and 0.5, as these seemed to produce
good results. A parameter sweep would allow the determination of
the values that maximize the strategy's effectiveness.
In short, this is a history-based heuristic that uses the hourly frequency of passenger
pickups on a cell subdivision of the island of Manhattan, together with the
distance to each cell from the taxi's current position.
The Taxi agent also collects several statistics, which are displayed and recorded
during the simulation.
Results
Several simulations were run with various environments. Within the parameter
space, we studied:
We also compared key metrics between the real data and the simulated runs.
The number of waiting passengers
Our simulation allows a real-time view of taxi movement. Figures 4 and 5
show snapshots of two simulations, where red dots are busy taxis, green dots
free taxis, and blue crosses waiting passengers.
As mentioned before, the lack of data on latent and unfulfilled demand restricts the
analysis that can be made when comparing simulated and real data. Some assumptions
need to be made to compare results.
In figure 6 we compare the number of waiting passengers from real data and
simulated results, assuming a wait time of 1, 3 and 5 minutes for real data. The
simulation run used a 0.15 fraction of random taxis and a relative distance weight
between 0.1 and 0.5 in the destination heuristic, the combination that produced the
lowest number of waiting passengers.
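The real-data curves rest on a simple counting rule: under an assumed wait of w seconds, a passenger picked up at time p is counted as waiting during the interval from p - w up to, but not including, p. A sketch of that count, with hypothetical function and variable names:

```python
from bisect import bisect_right


def waiting_counts(pickup_seconds, wait, sample_times):
    """Number of passengers waiting at each sample time.

    A passenger picked up at p waits during [p - wait, p), so at time t
    the waiting passengers are those with t < p <= t + wait.
    """
    pickups = sorted(pickup_seconds)
    counts = []
    for t in sample_times:
        already_picked_up = bisect_right(pickups, t)
        appeared_so_far = bisect_right(pickups, t + wait)
        counts.append(appeared_so_far - already_picked_up)
    return counts


# Three made-up pickups (seconds of day), sampled at four instants.
example_pickups = [100, 160, 400]
five_minute = waiting_counts(example_pickups, 300, [0, 99, 100, 350])
one_minute = waiting_counts(example_pickups, 60, [0, 99, 100, 350])
```

A longer assumed wait can only raise the count at any instant, which is why the 5 minute curve is the upper envelope of the three.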
As can be seen, the simulated runs perform well when compared to real data, and
the simulated number of waiting passengers is always lower than the real-data
figure for an assumed passenger waiting time of 5 minutes.
Figure 6 - Number of waiting passengers on Thursday, October 24, 2013, assuming a wait
time of 1, 3 and 5 minutes, compared with the ABM simulation.
Figure 7 - A pure random walk performs poorly when looking for passengers. From a systems
perspective, a mixture of a history-informed heuristic with a small percentage of random
walks delivers much better results.
What is most interesting is that adding a small population of "random" taxi
drivers vastly improves results. In fact, a 15% population of "random" taxis lowers
the number of waiting passengers to slightly above 1,000.
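A mixed population of this kind can be set up as in the sketch below. It assumes that each taxi keeps one strategy for the whole run; the text does not say whether the choice is made per taxi or per decision. The function name and strategy labels are hypothetical.

```python
import random


def assign_strategies(taxi_ids, random_fraction=0.15, rng=random):
    """Label a fixed fraction of taxis as random walkers, the rest as history-driven."""
    ids = list(taxi_ids)
    n_random = round(random_fraction * len(ids))
    random_ids = set(rng.sample(ids, n_random))
    return {t: ("random" if t in random_ids else "history") for t in ids}


# With the 12,981 taxis of the cleaned dataset, 15% is 1,947 random walkers.
strategies = assign_strategies(range(12981), rng=random.Random(7))
n_random_taxis = sum(1 for s in strategies.values() if s == "random")
```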
Finally, we also looked at fine-tuning the heuristic for destination selection of free
taxis. Here, a parameter sweep would be invaluable, but we could see that
lowering the relative importance of the distance to the destination cell improved
results up to a point. Figure 8 shows the impact on the number of
waiting passengers of lowering this weight from 1 to a Gaussian-distributed value
between 0.1 and 0.5. The improvement is almost 40% during the morning peak hour.
Figure 8 - How much weight should a taxi driver put on the time it will take to reach a
destination cell? Simulation runs seem to indicate not too much; after all, the taxi may
pick up a passenger along the way.
Discussion
The results from the agent based model built around the NY Department of
Transport Taxi dataset show the power of Agent Based Modeling when applied
to real world data. Specifically, we can show that using historical data to drive taxi
drivers' decisions can be beneficial, improving the system as a whole.
It is clear that, with these refinements, a valid model could be built to improve
service and optimize costs from an environmental, social and financial perspective. As
referenced before, this is an area of continuous study that deserves attention in order to
face the urban challenges ahead.
Bibliography