
An Agent Based Model Simulation of Taxi Drivers Passenger Acquisition Behaviour

Luis Ramada Pereira1,*

1 ISCTE, Instituto Universitário de Lisboa, Lisbon, Portugal
* Corresponding author. E-mail: ramadap@gmail.com

Abstract

Based on a publicly available dataset of all taxi rides in the city of New York,
we model taxi behaviour after customer dropoff, with the aim of assessing its
efficiency in passenger replenishment.

For practical reasons we focus our analysis on the borough of Manhattan,
excluding trips that originate or finish outside of its limits.

We compare the results of multiple agent based simulations where agents
employ different heuristics, such as random walks or history based decision
making. For this we gather historical knowledge about major passenger pickup
points, determined through a clustering algorithm with a resolution fine enough
to approximate real life conditions (a division of Manhattan into 1,000 cells
of approximately 250 m x 250 m). We gather this knowledge with one hour
granularity, based on the results obtained from the previous 3 corresponding
week days, and apply it to a single day in October 2013.

Results are compared between simulation runs under different heuristics and real
life results.

Our model does not aim to be a faithful reproduction of real life conditions,
but the results validate the application of Agent Based Modelling (ABM) to the
study of complex environments like the ones created by urban traffic.

Introduction

Data from taxi and other shared rides has been used widely to study various
urban phenomena, such as community detection (de Montis, Caschili, & Chessa,
2013), route selection (E. J. Manley, Addison, & Cheng, 2015), identification of
functional urban regions (E. Manley, 2014), urban traffic congestion
(Solé-Ribalta, Gómez, & Arenas, 2016) and optimal taxi stand selection
(Moreira-Matias et al., 2012).

In this study we aim to model, through an ABM approach, the decisions taxi
drivers make at the conclusion of a paid ride, when they go looking for their
next customer. The objective is to assess whether this approach produces
"credible" results and is an appropriate tool for such an aim. We will try to
answer questions such as: "What is a good strategy? A random walk? Or would a
historically informed decision process deliver improved results?"

Our analysis is based on the 2013 data set of yellow taxis in New York, obtained
through a Freedom of Information Law/Act (FOIL/FOIA) request (NYC DOT, n.d.).

In this data set, the taxi identifiers (medallions) and the driver identifiers
(licenses), although masked, can be used to uniquely pair taxi runs. By using
data from two consecutive trip records, the dropoff point of one customer and
the pickup point of the next can be determined for any taxi. Timestamps are
available to compute “unproductive time” and, although routing is not
available, distances can be approximated through “city block” distance, given
the grid layout of most of Manhattan's streets and avenues.
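The pairing of consecutive trip records can be sketched as below. This is a minimal illustration, not the procedure used in the study: the record field names, the 29 degree grid tilt and the flat-earth projection are all assumptions introduced for the example.

```python
import math

GRID_ANGLE_DEG = 29.0        # assumed tilt of the Manhattan grid from true north
EARTH_RADIUS_M = 6_371_000.0


def city_block_distance_m(lat1, lon1, lat2, lon2, angle_deg=GRID_ANGLE_DEG):
    """City-block (L1) distance in metres along axes rotated by angle_deg."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    # Local flat projection: east and north offsets in metres.
    east = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_M
    north = math.radians(lat2 - lat1) * EARTH_RADIUS_M
    a = math.radians(angle_deg)
    # Project the offsets onto the avenue and street directions.
    along_avenue = east * math.sin(a) + north * math.cos(a)
    along_street = east * math.cos(a) - north * math.sin(a)
    return abs(along_avenue) + abs(along_street)


def idle_gaps(trips):
    """Return (idle_seconds, city_block_metres) for consecutive trips.

    trips: records of ONE medallion/license pair; the dictionary keys used
    here are hypothetical names, not the dataset's column names.
    """
    ordered = sorted(trips, key=lambda trip: trip["pickup_time"])
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        idle = (nxt["pickup_time"] - prev["dropoff_time"]).total_seconds()
        dist = city_block_distance_m(
            prev["dropoff_lat"], prev["dropoff_lon"],
            nxt["pickup_lat"], nxt["pickup_lon"],
        )
        gaps.append((idle, dist))
    return gaps
```

Each tuple gives the "unproductive time" between a dropoff and the next pickup, together with the approximate distance covered while empty.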

Our analysis is based on the taxi ride patterns of October 24, 2013, a Thursday.
This was a randomly selected working day that seemed to exhibit a normal
pattern. The dataset was filtered for the island of Manhattan and cleaned of
erroneous records.

We simulate the behaviour of “artificial” taxi drivers, with exactly the same
number of taxis and passenger trips as in the dataset after clean-up (12,981
and 369,205, respectively). We have the timestamp of each passenger pickup, but
no information is available about taxi demand. We therefore assume, as an
empirical simplification, that the requesting passenger is present at the
pickup location "t" seconds before the pickup time. "t" is a simulation
parameter, which for our analysis was set to 300 seconds (5 minutes).

To calculate the efficiency of the "artificial" drivers we look at parameters
such as fleet utilization and the number of passengers waiting for a taxi,
among others.

Our expectation was that a decision based on prior historical knowledge would
fare well. To build that knowledge, the hourly frequency of passenger pickups
and dropoffs for the previous 3 Thursdays is used. The probability of an
“artificial” taxi driver driving to a given location (or cell, in our cell based
mapping of Manhattan) looking for customers is:

• inversely dependent on the distance from the cell
• directly dependent on how popular that cell was previously for passenger
  pickup
• inversely dependent on the number of passenger dropoffs in the same cell

This is an empirical, but intuitive, heuristic.

The validity of historical data was assessed comparing key statistics. When testing
for goodness of fit, using a normalized Root Mean Squared cost function:

$$NRMSD = \frac{\sqrt{\sum_{t=1}^{n} (x_{1,t} - x_{2,t})^2}}{y_{max} - y_{min}}$$

we get the results seen in the table below, which indicate a very strong
agreement between the historical data and the data for October 24th.

Metric                    Goodness of Fit
Number of Active Taxis    0.9594
Number of Taxi Rides      0.9307
Average Speed             0.9246
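
As a hedged illustration, the NRMSD expression given above can be evaluated as follows. The text does not say which series supplies y_max and y_min, so the sketch assumes it is the second (reference) series. It also does not say how the goodness-of-fit values in the table are derived from this quantity, so no attempt is made to reproduce them.

```python
import math


def nrmsd(x1, x2):
    """Evaluate sqrt(sum((x1_t - x2_t)^2)) / (y_max - y_min) as printed.

    Assumption: y_max and y_min are the extremes of the reference series x2.
    """
    if len(x1) != len(x2) or not x1:
        raise ValueError("series must be non-empty and of equal length")
    spread = max(x2) - min(x2)
    if spread == 0:
        raise ValueError("reference series has zero range")
    squared_error = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.sqrt(squared_error) / spread
```

For hourly statistics, `x1` would hold the 24 historical values and `x2` the 24 values for the day of analysis.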

The hourly variation can be seen in the figure below.

Figure 1 Historical data compared with day of analysis

As no data is available for latent, unfulfilled demand or actual passenger
waiting times, it is not feasible to reason about the fitness of the artificial
models when compared with the real data. However, comparing between simulated
models can help understand some of the performance characteristics of the taxi
service as a function of passenger replenishment behaviour.

Materials and Methods

Data Sources and tools

The data for this simulation was gathered from (NYC DOT, n.d.). The data
collected includes multiple fields (NYC Taxi & Limousine Commission, 2015), out
of which we made use of:

• Taxi Medallion (i.e. masked vehicle identification)
• Hack_license (i.e. masked driver identification)
• Latitude of pickup point
• Longitude of pickup point
• Latitude of dropoff point
• Longitude of dropoff point
• Timestamp of pickup point
• Timestamp of dropoff point
• Ride Distance

The data was selected for the appropriate time frame and cleaned, as the dataset
had a significant number of erroneous records, such as invalid timestamps or
GPS locations. Data was also filtered to exclude any rides originating or ending
outside of Manhattan.

The simulation run used the passenger pickup data of Thursday, October 24th,
2013, and the historical data was built from the 3 previous Thursdays, namely
October 3, 10 and 17.

This data preparation process was run in MySQL Workbench.

Once the data was cleaned, several datasets were produced:

• A passenger data set with the location and timestamp at which each passenger
  started looking for a taxi. This is derived directly from the pickup data in
  the original data set for October 24, 2013, with "t" seconds (a simulation
  parameter) subtracted from the pickup time.
• A list of all pickup point locations and timestamps for the previous 3
  Thursdays. These points were clustered into 1,000 centroids on an hourly
  basis through the K-Means algorithm in Matlab, and used to partition
  Manhattan into 1,000 cells for every hour. An example of the resulting
  centroids can be found in figures 2 and 3. The relative weight of each cell
  was then computed from the number of pickups and dropoffs in the cell and
  used to build a strategy for taxi behaviour after passenger dropoff.
• A data set for activation and deactivation of taxis. A taxi is considered
  active the first time it is seen in the original dataset, and dormant if it
  switches drivers or is not seen in the original dataset for more than 2 hours.
• The average speeds of a taxi on an hourly basis, both when the taxi is busy
  and when it is free.
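
The activation rule in the list above can be sketched as follows. This is one possible reading, not the study's actual procedure: it assumes that "not seen for more than 2 hours" is measured from the last dropoff to the next pickup, and that an active period ends at its last dropoff. The tuple layout is hypothetical.

```python
from datetime import timedelta

MAX_GAP = timedelta(hours=2)  # dormancy threshold stated in the text


def activity_intervals(trips, max_gap=MAX_GAP):
    """Derive (activation_time, deactivation_time) pairs for one medallion.

    trips: iterable of (pickup_time, dropoff_time, driver_id) tuples.
    A new interval starts when the driver changes or when the taxi has not
    been seen for longer than max_gap.
    """
    intervals = []
    start = end = driver = None
    for pickup, dropoff, drv in sorted(trips):
        if start is None:
            start, end, driver = pickup, dropoff, drv
        elif drv != driver or pickup - end > max_gap:
            intervals.append((start, end))  # close the previous active period
            start, end, driver = pickup, dropoff, drv
        else:
            end = max(end, dropoff)         # same shift continues
    if start is not None:
        intervals.append((start, end))
    return intervals
```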

The actual simulation was built on the Repast Simphony platform (North et al.,
2013).

Figure 2 - 10 PM. Figure 3 - 10 AM.

Centroids of cells, circle diameter proportional to pickup frequency

Model Initialization

The agent based simulation was run on the Repast platform.

Three agents were built into the model, as described below: the taxi, the
passenger, and a master controller that generates passenger requests.

During initialization, the model does the necessary housekeeping of setting up
the environment, collecting parameters, setting up the hourly cell
"attractiveness" and activating the cell array for hour 0 (the run starts at
midnight). The hourly heuristic is only fully computed in the Taxi agent (see
below), but during initialization a table is built with the initial statistics
of each of the 1,000 hourly cells (every hour has a different set of 1,000
cells). This attractiveness increases with the number of historical pickups in
a cell and decreases with the number of dropoffs in that same cell. Intuitively,
a large number of dropoffs increases the competition for the available pickups.
The formula applied is:

$$S = \frac{3600}{p - d - \min(p_h - d_h) + 1}$$

Where:

$S$ = resulting score of the cell
$p$ = number of pickups in that cell
$d$ = number of dropoffs in that cell
$\min(p_h - d_h)$ = $\min(p - d)$ for all cells in the hour

This formula penalizes cells with a historical record of many dropoffs and few
pickups. As some cells may have more dropoffs than pickups, the difference of
the least favourable cell in the hour is subtracted from every cell's
difference and 1 is added, so that the denominator for the least attractive
cell equals 1.
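
A minimal sketch of this computation for one hour is shown below, reading the formula as a fraction with 3600 in the numerator. Under that reading S equals 3600 for the least favourable cell and shrinks as pickups increasingly exceed dropoffs. The function name and the list-based input are illustrative assumptions.

```python
def cell_scores(pickups, dropoffs):
    """Compute S = 3600 / (p - d - min(p_h - d_h) + 1) for every cell.

    pickups, dropoffs: per-cell historical counts for a single hour,
    given as two lists in the same cell order.
    """
    diffs = [p - d for p, d in zip(pickups, dropoffs)]
    lowest = min(diffs)  # least favourable cell of the hour
    # The shift makes the denominator 1 for that cell and larger elsewhere.
    return [3600.0 / (diff - lowest + 1) for diff in diffs]
```

For example, three cells with pickups `[10, 5, 0]` and dropoffs `[2, 5, 4]` have differences 8, 0 and -4, hence denominators 13, 5 and 1.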

Finally, during initialization all Taxi agents are loaded and added to the
context with their corresponding activation/deactivation tables, and their
status is set to "dormant".

The Controller Agent

The controller is activated every second and checks the passenger dataset to
detect whether any passenger should be created. Once the controller's time of
day equals or passes the time at which a passenger starts waiting, that
passenger is created and added to the context of the simulation.

The controller also detects the passage of the hour and selects the
corresponding Manhattan cell data for taxi strategy selection.

Finally, the controller is also responsible for terminating the simulation at
the end of the day.
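
The controller's per-second loop can be sketched in plain Python as below. This is not Repast Simphony code. Times are seconds since midnight, request times are clamped at 0 for pickups in the first "t" seconds of the day (an assumption), and the function names are hypothetical.

```python
from collections import deque

WAIT_BEFORE_PICKUP_S = 300  # the parameter "t" from the text


def build_request_queue(pickup_seconds, t=WAIT_BEFORE_PICKUP_S):
    """Convert pickup times into a sorted queue of request times."""
    return deque(sorted(max(0, p - t) for p in pickup_seconds))


def release_due(queue, now_s):
    """Pop every request whose time has been reached at now_s."""
    due = []
    while queue and queue[0] <= now_s:
        due.append(queue.popleft())
    return due


def run_day(pickup_seconds, t=WAIT_BEFORE_PICKUP_S, day_length_s=86_400):
    """Tick once per second and count the passengers created in each hour."""
    queue = build_request_queue(pickup_seconds, t)
    created_per_hour = [0] * 24
    for now_s in range(day_length_s):
        hour = now_s // 3600  # an hour change would switch the cell data
        created_per_hour[hour] += len(release_due(queue, now_s))
    return created_per_hour   # leaving the loop terminates the simulated day
```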

The Passenger Agent

The passenger agent advertises the fact that it is looking for a taxi and
collects statistics on its waiting time. Passenger agents are removed from the
context by the Taxi agent when they are picked up.

The Taxi Agent

The taxi agent exits and enters the dormant state according to its
activation/deactivation table. This is checked at each step as the time series
progresses. If the taxi agent is active, it can be in one of two states: free
or busy. If busy, it moves at "busy" speed; if free, at "free" speed. Speeds
are set hourly according to historical data, adjusted by a truncated Gaussian
distribution (𝜌 = 1) with boundaries set as a parameter in the simulation. All
simulation runs used a minimum of 0.7 and a maximum of 1.3 times the average
speed.
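
One way to read the speed adjustment is sketched below. The text does not fully specify the distribution, so the sketch assumes a multiplier drawn from a Gaussian with mean 1 and standard deviation 1, truncated to [0.7, 1.3] by rejection sampling, and then applied to the hourly average speed.

```python
import random


def truncated_gauss(mean, sigma, low, high, rng=random):
    """Draw from a Gaussian, rejecting samples outside [low, high]."""
    if low > high or sigma <= 0:
        raise ValueError("need low <= high and sigma > 0")
    while True:
        value = rng.gauss(mean, sigma)
        if low <= value <= high:
            return value


def sample_speed(hourly_avg_mph, low=0.7, high=1.3, sigma=1.0, rng=random):
    """Scale the hourly average speed by a truncated Gaussian factor.

    The mean of 1.0 and sigma of 1.0 are assumptions made for illustration.
    """
    return hourly_avg_mph * truncated_gauss(1.0, sigma, low, high, rng)
```

Passing a seeded `random.Random` instance as `rng` makes runs repeatable.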

Taxi movement is along the typical incline of Manhattan streets and avenues,
with no effort to follow the actual routes. This introduces an obvious error, but
given the grid nature of Manhattan roadways, this is estimated to be small.
Further development of this simulator could use actual routing information from
a map based routing service.

A taxi agent can be of two types:

1. Random. Upon becoming free, it selects a random location inside Manhattan
   as the next destination point.
2. History based. The selection of the destination point is based on
   historical information and distance.

When an agent of the "History" type becomes free, it selects a destination
cell according to the formula:

$$C_{score} = \left( N_{trunc}\left(\frac{i+s}{2},\, 1\right) \in [i, s] \right) \cdot \frac{3600}{v} + S$$

Where:

$C_{score}$ = resulting score
$N_{trunc}$ = truncated Gaussian distribution
$v$ = speed in mph
$S$ = cell score from the hourly strategy array

The term $\left( N_{trunc}\left(\frac{i+s}{2},\, 1\right) \in [i, s] \right) \cdot \frac{3600}{v}$
represents the perceived time to get to a given cell, and $S$ the cell's
historical statistical "attractiveness".

To avoid overcrowding cells with taxis, a taxi randomly selects one of the top
10 cells (out of 1,000) resulting from applying the formula above.

"i" and "s" are simulation parameters that indirectly control the relative
weights of distance and cell attractiveness in the destination cell rating.
Runs were made with boundaries of 0.1 and 0.5, as these seemed to produce good
results. Parameter sweep simulation runs would allow determining the values
that maximize the strategy's effectiveness.

Basically, this is a history based heuristic that uses the frequency of passenger
pickup, per hour, on a cell subdivision of the island of Manhattan and the
distance to the cells from current position.
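
The destination choice can be sketched as below, under three assumptions that the text leaves open. First, the perceived-time term is multiplied by the distance to the cell in miles (the distance factor is not explicit in the printed formula). Second, the truncated Gaussian weight is drawn separately for each cell. Third, the "top" cells are those with the lowest score.

```python
import random


def truncated_gauss(mean, sigma, low, high, rng):
    """Rejection-sample a Gaussian restricted to [low, high]."""
    while True:
        value = rng.gauss(mean, sigma)
        if low <= value <= high:
            return value


def choose_destination(cells, speed_mph, i=0.1, s=0.5, top_n=10, rng=random):
    """Pick a destination cell for a free history-based taxi.

    cells: list of (cell_id, distance_miles, S) tuples, where S is the
    hourly cell score. Returns the id of the chosen cell.
    """
    scored = []
    for cell_id, distance_miles, cell_score in cells:
        weight = truncated_gauss((i + s) / 2.0, 1.0, i, s, rng)
        perceived_time = weight * distance_miles * 3600.0 / speed_mph
        scored.append((perceived_time + cell_score, cell_id))
    scored.sort()  # assumption: a lower score is better
    # Choosing at random among the best cells spreads taxis out.
    return rng.choice(scored[:top_n])[1]
```

With `i` and `s` both close to 1 the travel time dominates the choice; smaller values let the historical score weigh more.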

Once a destination cell is determined, the precise location is randomly selected
inside the 250m x 250m cell, and the taxi moves along the avenues and streets
according to its "free" speed.

When the taxi agent is free, it queries the environment to determine whether a
potential customer is waiting for pickup in its neighborhood. The radius of this
neighborhood is a parameter in the simulation. All runs were made with a radius
of about 50 meters. If a customer is found, the taxi becomes busy and the
passenger agent is removed from the context.
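
The neighborhood query can be illustrated with the plain-Python sketch below. It is not the Repast Simphony implementation: positions are assumed to be planar coordinates in metres, the radius is taken as a straight-line distance, and picking the nearest waiting passenger first is an assumption.

```python
import math


def find_waiting_passenger(taxi_xy, passengers, radius_m=50.0):
    """Return the id of the nearest passenger within radius_m, or None.

    passengers: dict mapping passenger id to an (x, y) position in metres.
    """
    best_id, best_dist = None, radius_m
    for pid, (px, py) in passengers.items():
        dist = math.hypot(px - taxi_xy[0], py - taxi_xy[1])
        if dist <= best_dist:
            best_id, best_dist = pid, dist
    return best_id


def try_pickup(taxi, passengers, radius_m=50.0):
    """If a passenger is in range, remove it and mark the taxi busy."""
    pid = find_waiting_passenger(taxi["xy"], passengers, radius_m)
    if pid is not None:
        del passengers[pid]      # the passenger leaves the context
        taxi["state"] = "busy"
    return pid
```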

The Taxi agent also collects several statistics that are displayed and collected
during the simulation.

Results

Several simulations were run with various environments. Within the parameter
space, we studied:

• The effect of using different weights for historical information and
  distance in the destination selection heuristic
• The effect of an increasing population of random taxis
• How a totally random heuristic compares with a history based one
• The effect of smaller neighborhoods in the passenger detection phase

We also looked at key metrics between real data and the simulated runs.

The metrics we used for analysis were:

• The number of active taxis
• The number of busy taxis
• The number of waiting passengers

Our simulation allows a real-time view of taxi movement. Figures 4 and 5 show
snapshots of two simulations, where red dots are busy taxis, green dots are
free taxis and blue crosses are waiting passengers.

Figure 4 - 0% Random

Figure 5 - 15% Random, taxi coverage widens when compared with no random taxis

Comparing real and simulated data

As mentioned before, the lack of data on latent and unfulfilled demand restricts
the analysis that can be made comparing simulated and real data. Some
assumptions need to be made to compare results.

In figure 6 we compare the number of waiting passengers from real data and
simulated results, assuming a wait time of 1, 3 and 5 minutes for real data. The
simulation run used 15% random taxis and a relative distance weight between 0.1
and 0.5 in the destination heuristic, the configuration that produced the
lowest number of waiting passengers.

As can be seen, the simulated run performs well when compared with real data:
its number of waiting passengers is always lower than the real-data figure for
a passenger waiting time of 5 minutes.

Figure 6 - Number of waiting passengers on Thursday, October 24th, 2013,
assuming a wait time of 1, 3 and 5 minutes, compared with the ABM simulation.

Exploring the parameter space in simulated runs


Simulating a full day of taxi runs is a heavy computing process at the "desktop"
performance level, and running batch parameter sweeps would probably take
weeks. However, we were able to compare selected parameters and determine that
a purely random strategy does not perform well when compared with the history
based heuristic. As can be seen in figure 7, a purely random strategy has a
maximum of over 3,500 passengers waiting for a taxi, while a purely history
based strategy lowers that number to 3,000.

Figure 7 - A pure random walk performs poorly when looking for passengers. From a systems
perspective a mixture of history informed heuristic with a small percentage of random walks delivers
much better results.

What is most interesting is that adding a small population of "random" taxi
drivers vastly improves results. In fact, a 15% population of "random" taxis
lowers the maximum number of waiting passengers to slightly above 1,000.

There is an intuitive explanation for this, as a purely historical heuristic
tends to cluster a lot of taxis in key cells. Having a percentage of free
roaming taxis in the population allows for the capture of passengers that would
otherwise be ignored.

Finally, we also looked at fine-tuning the heuristic for destination selection
of free taxis. Here, a parameter sweep would be invaluable, but we could see
that lowering the relative importance of distance to the destination cell
improved results up to a point. In figure 8 we can see the impact on the number
of waiting passengers of lowering the distance weight from 1 to a Gaussian
distributed value between 0.1 and 0.5. The improvement is almost 40% during the
morning peak hour.

Figure 8 - How much weight should a taxi driver put on the time it will take to
reach a destination cell? Simulation runs seem to indicate not too much; after
all, the taxi may pick up a passenger along the way.

Discussion

The results from the agent based model built around the NY Department of
Transport taxi dataset show the power of Agent Based Modelling when applied
to real world data. Specifically, we can show that using historical data to
drive taxi drivers' decisions can be beneficial, improving the system as a
whole.

Certainly, to be useful in practice, extensive refinements would be needed. For
a better simulation, actual routing with real world coordinates of roadways
would be needed. Instead of global averages, location based speeds would need
to be used (although these could be extracted up to a point from the datasets).
Historical data could incorporate non traffic related events, such as weather
or social events, to factor anomalies out or to relate to the current simulated
environment.

To allow a deeper analysis, additional information would be required, especially
unfulfilled demand (customers giving up) and queue performance (time to be
served).

With these refinements, a valid model could be built to improve service and
optimize costs from an environmental, social and financial perspective. As
referenced before, this is an area of continuous study that deserves attention
in order to face the urban challenges ahead.

Bibliography

de Montis, A., Caschili, S., & Chessa, A. (2013).
Commuter networks and community detection: A
method for planning sub regional areas. European
Physical Journal: Special Topics, 215(1), 75–91.
http://doi.org/10.1140/epjst/e2013-01716-4
Manley, E. (2014). Identifying functional urban regions
within traffic flow. Regional Studies, Regional
Science, 1(1), 40–42.
http://doi.org/10.1080/21681376.2014.891649
Manley, E. J., Addison, J. D., & Cheng, T. (2015). Shortest
path or anchor-based route choice: A large-scale
empirical analysis of minicab routing in London.
Journal of Transport Geography, 43, 123–139.
http://doi.org/10.1016/j.jtrangeo.2015.01.006
Moreira-Matias, L., Fernandes, R., Gama, J., Ferreira, M.,
Mendes-Moreira, J., & Damas, L. (2012). An online
recommendation system for the taxi stand choice
problem (Poster). IEEE Vehicular Networking
Conference, VNC, 173–180.
http://doi.org/10.1109/VNC.2012.6407427
North, M. J., Collier, N. T., Ozik, J., Tatara, E. R., Macal, C.
M., Bragen, M., & Sydelko, P. (2013). Complex
adaptive systems modeling with Repast Simphony.
Complex Adaptive Systems Modeling, 1(1), 3.
http://doi.org/10.1186/2194-3206-1-3
NYC DOT. (n.d.). TLC Trip Record Data. Retrieved from
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
NYC Taxi & Limousine Commission. (2015). Data
Dictionary – Yellow Taxi Trip Records. Retrieved July
12, 2016, from
http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
Solé-Ribalta, A., Gómez, S., & Arenas, A. (2016). A model
to identify urban traffic congestion hotspots in
complex networks. arXiv, 1–8. Retrieved from
http://arxiv.org/abs/1604.07728
