
PERVASIVE computing  Published by the IEEE CS  1536-1268/11/$26.00 © 2011 IEEE
LARGE-SCALE OPPORTUNISTIC SENSING
A Tale of One City:
Using Cellular Network
Data for Urban Planning
With the continuing urbanization of the world’s population and the rapid growth of cities, urban planners are faced with many challenges, including heavily congested roads, overzealous development, and increasing pollution. To efficiently address these problems, urban planners must develop a better understanding of the dynamics of modern cities. This includes studying the flow of people into and out of cities as well as the use of the commercial and residential parts of a given city.
Traditional studies of city
dynamics are time-consuming
and expensive, entailing tech-
niques such as surveys and
vehicle counting. Large-scale
commuting studies can take
years to complete, and many
municipalities learn about new trends only when
a detailed census is released. In contrast, cellular
networks must know the approximate locations
of all active cellular phones to provide them
with communication services, and thus cellular
network data has the potential to revolutionize
the study of city dynamics. Given these phones’
ubiquity and their almost constant proximity
to their owners, cellular networks can be used
to opportunistically sense the locations of large
populations of people. They provide a means to
monitor city dynamics frequently, cheaply, and
at an unprecedented scale.
Anonymized call detail records (CDRs) are one means
of capturing city dynamics (see the sidebar for
a discussion of related work in this area). CDRs
document the location of the wireless cell car-
rying every voice call and short message service
(SMS) transaction, as well as the time the event
occurred. Because service providers routinely
collect CDRs for operational, planning, and
billing purposes, the incremental cost and re-
sources required to analyze this data are mini-
mal. However, CDRs have several limitations.
First, they’re sparse in time because they’re gen-
erated only when a transaction occurs, render-
ing cell phone users invisible at all other times.
Second, they’re coarse in space because they re-
cord locations at the granularity of a cell tower
sector, giving an uncertainty on the order of one
square mile for each transaction.
Nonetheless, the convenience and prevalence of CDR collection make it worthwhile to investigate how much can be learned from these records. We present several ways in which CDRs can provide urban planners with important information about city dynamics. We analyze two months of cellular traffic in and around Morristown, New Jersey, a suburban US city with approximately 20,000 residents, and demonstrate the feasibility of our approach through tabulation, statistical analysis, and visualization.
Cellular data from call detail records can help urban planners better
understand city dynamics. The authors use CDR data to analyze the flow of
people in and out of a suburban city near New York City.
Richard A. Becker,
Ramón Cáceres, Karrie Hanson,
Ji Meng Loh, Simon Urbanek,
Alexander Varshavsky,
and Chris Volinsky
AT&T Labs–Research
OCTOBER–DECEMBER 2011 PERVASIVE computing 3
Dataset
We collected anonymized CDRs
from the cellular network of a large
US communications service provider,
capturing transactions carried by the
35 cell towers located within five miles
of Morristown’s center. These 35 cell
towers house approximately 300 antennas
pointed in various directions and
supporting various radio technologies
and frequencies. Our goal was to capture
cellular traffic in and around the
town, and choosing the five-mile radius
let us cover both Morristown proper
and its neighboring areas. A party not
involved in the data analysis collected
and anonymized the data. The data
included voice and SMS traffic for
60 days between 29 November 2009 and
27 January 2010. In total, we collected
15 million voice CDRs and 26 million
SMS CDRs for 475,000 unique phones.
Given the data’s sensitivity, we took
several steps to ensure individuals’ pri-
vacy. First, we only used anonymous
records in this study. In particular, we
removed personally identifying char-
acteristics from our CDRs and linked
CDRs for the same phone using an
anonymous unique identifier that consisted
of the 5-digit billing zip code and
a unique integer, rather than a tele-
phone number. No demographic data
is linked to any cell phone user or CDR.
Each CDR also contained the starting
time of the voice or SMS event, the
event’s duration, and the locations and
azimuths of the cell tower antennas as-
sociated with the event.
Related Work in Using Cellular Network Data for Urban Planning
Several recent papers have studied how cellular network data can be used for urban planning. In a case study in Milan, Italy, Carlo Ratti and his colleagues [1] and later Riccardo Pulselli and his colleagues [2] demonstrated that it is possible to graphically represent the intensity of urban activities and their evolution through space and time using call volume at cellular towers. Jonathan Reades and his colleagues also looked at how a cell tower’s call volume correlates with urban activities in the tower’s geographic vicinity [3]. They studied call volume activity in six distinct locations in Rome, Italy, and showed that the call volume varies drastically between the studied locations and between weekdays and weekends. They also proposed an algorithm for clustering geographic locations that exhibit similar call volumes.
Fabien Girardin and his colleagues used tagged photographs from Flickr in combination with the call volume data to determine the whereabouts of locals and tourists in Rome, Italy [4]. They later repeated the study with only call volume data to examine the differences in behavior between tourists and locals in New York City [5]. Francesco Calabrese and his colleagues studied where people came from to attend special events in the Boston, Massachusetts, area [6]. They found that people who live close to an event are more likely to attend it and that events of the same type attract people from roughly the same home locations. Although we also study how cellular network data can be used for urban planning, we looked at a different set of research questions, such as deriving and validating the laborshed, calculating the partyshed, capturing a city’s lifebeat, and clustering users based on their calling patterns.
Marta González and her colleagues used cellular records from an unnamed European country to form statistical models of how individuals move [7]. Sibren Isaacman and his colleagues looked at the difference in human mobility between New York and Los Angeles populations [8]. Chaoming Song and his colleagues studied the predictability of individuals’ movements and showed that, given sufficient past history, one could guess the current location of a given user with high accuracy [9]. Although we also look at human mobility patterns, this article concentrates on studying how to use cellular network data in urban planning.
REFERENCES
1. C. Ratti et al., “Mobile Landscapes: Using Location Data from Cell Phones for Urban Analysis,” Environment and Planning B: Planning and Design, vol. 33, no. 5, 2006, pp. 727–748.
2. R. Pulselli et al., “Computing Urban Mobile Landscapes Through Monitoring Population Density Based on Cellphone Chatting,” Int’l J. Design and Nature and Ecodynamics, vol. 3, no. 2, 2008, pp. 121–134.
3. J. Reades et al., “Cellular Census: Explorations in Urban Data Collection,” IEEE Pervasive Computing, vol. 6, no. 3, 2007, pp. 30–38.
4. F. Girardin et al., “Digital Footprinting: Uncovering Tourists with User-Generated Content,” IEEE Pervasive Computing, vol. 7, no. 4, 2008, pp. 36–43.
5. F. Girardin et al., “Towards Estimating the Presence of Visitors from the Aggregate Mobile Phone Network Activity They Generate,” Proc. Int’l Conf. Computers in Urban Planning and Urban Management, ACM Press, 2009, pp. 52–61.
6. F. Calabrese et al., “The Geography of Taste: Analyzing Cell-Phone Mobility and Social Events,” Proc. Int’l Conf. Pervasive Computing, LNCS 6030, Springer, 2010, pp. 22–37.
7. M.C. González, C.A. Hidalgo, and A.-L. Barabási, “Understanding Individual Human Mobility Patterns,” Nature, vol. 453, 5 June 2008, pp. 779–782.
8. S. Isaacman et al., “A Tale of Two Cities,” Proc. 11th ACM Workshop on Mobile Computing Systems and Applications (HotMobile 10), ACM Press, 2010, pp. 19–24.
9. C. Song et al., “Limits of Predictability in Human Mobility,” Science, vol. 327, no. 5968, 2010, pp. 1018–1021.
4 PERVASIVE computing www.computer.org/pervasive
Second, we present all our results as aggregates. That is, no individual anonymous identifier was singled out for the study. By observing and reporting only on the aggregates, we protect individuals’ privacy.
Finally, each CDR only included location information for the cellular towers with which a phone was associated during a voice call or at the time of a text message. The phones were effectively invisible to us aside from these events. In addition, we could estimate the phone locations only to the granularity of the cell tower antenna coverage area. The effective radius of an antenna depends on tower height, radio power, and terrain and can vary significantly from location to location. Chaoming Song and his colleagues estimate the average coverage of a cell tower as three square miles [1]. Because each tower typically has three directional sectors, we estimate the coverage of a single sector as about one square mile.
Calculating and
Validating the Laborshed
Understanding the daily flow of people in and out of a city is important for urban planning. In particular, for a city with a commercial district, understanding where workers live can help manage vehicular traffic flow and plan public transportation services. Specialized reports from the census are available to study daily flows of people, but because the reports are only available once a decade, they don’t help planners in evaluating shorter-term effects from new commuting options such as carpooling initiatives.
Morristown is a regional center of
commerce and shopping, with a developed
downtown area, many office complexes,
a large hospital, and the county
courthouse. It draws a large worker
base from the nearby suburbs and even
some workers from the much larger
New York City. The geographical area
representing where a city’s workers live
is its laborshed.
We used our CDR data to calculate Morristown’s laborshed. We classified cell phone users as workers if they were frequently observed in Morristown during business hours (9 a.m. to 5 p.m., Monday to Friday). More specifically, a worker must satisfy two conditions. First, the worker must engage in an average of at least four calls or messages per week during business hours, involving one of the Morristown cell towers. Second, the worker must make those calls or messages on an average of at least two unique weekdays per week. We derived these thresholds experimentally and observed that moderate changes to these values didn’t affect our results. We used account-billed zip codes to identify place of residence.
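The two worker conditions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the `(phone_id, timestamp)` event format, the function names, and the reading of "unique weekdays per week" as distinct business-hour dates divided by the number of weeks observed are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def is_business_hours(ts: datetime) -> bool:
    """Monday to Friday, from 9 a.m. up to (not including) 5 p.m."""
    return ts.weekday() < 5 and 9 <= ts.hour < 17


def find_workers(events, num_weeks, min_events_per_week=4.0, min_days_per_week=2.0):
    """Return the phone ids that satisfy both worker conditions.

    events: iterable of (phone_id, datetime) pairs for calls or messages
            carried by one of the Morristown towers (assumed input format).
    num_weeks: length of the observation period in weeks.
    """
    counts = defaultdict(int)   # business-hour events per phone
    days = defaultdict(set)     # distinct business-hour dates per phone
    for phone_id, ts in events:
        if is_business_hours(ts):
            counts[phone_id] += 1
            days[phone_id].add(ts.date())
    return {
        phone
        for phone in counts
        if counts[phone] / num_weeks >= min_events_per_week
        and len(days[phone]) / num_weeks >= min_days_per_week
    }
```

The thresholds are keyword arguments so that the sensitivity check mentioned in the text (moderate changes to the thresholds) can be repeated by calling the function with different values.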
We validated our CDR-based laborshed results by comparing them to publicly available US Census data. We used the 2000 Census Transportation Planning Package (CTPP), specifically the “Journey to Work” package, which includes detailed information on commuting patterns including counts of commuters to and from specific census tracts [2]. We mapped census tract counts to zip codes by calculating the fraction of a census tract that fell within each zip code of interest.
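A minimal sketch of this tract-to-zip mapping follows. It assumes that commuters are spread evenly over a tract's area, so that a tract's count can be split in proportion to the overlap fractions; the function name, the dictionary layout, and the tract and zip labels are hypothetical.

```python
from collections import defaultdict


def tract_counts_to_zip(tract_counts, overlap_fraction):
    """Redistribute census tract counts onto zip codes.

    tract_counts: {tract_id: number of commuters}
    overlap_fraction: {(tract_id, zip_code): fraction of the tract's area
                       that falls within that zip code}
    """
    zip_counts = defaultdict(float)
    for (tract, zip_code), fraction in overlap_fraction.items():
        # Each zip code receives its area share of the tract's commuters.
        zip_counts[zip_code] += tract_counts.get(tract, 0) * fraction
    return dict(zip_counts)


# Made-up numbers: tract T1 straddles two zip codes, T2 lies inside one.
example = tract_counts_to_zip(
    {"T1": 100, "T2": 40},
    {("T1", "zip_a"): 0.25, ("T1", "zip_b"): 0.75, ("T2", "zip_a"): 1.0},
)
```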
Figure 1 shows contour maps of the Morristown laborshed as calculated from the CDR data (Figure 1a) and from the census data (Figure 1b). We don’t expect the two maps to be identical because our CDRs only show those who are actively generating calls or SMS records, and they reflect only the activity of the part of the population on our company’s network. Additionally, the 2000 census data is more than a decade old, and commuting patterns might have changed significantly in that time. However, we expect the maps to be similar if our methodology is sound. Indeed, we see that the geographic distributions are similar, especially for the regions close to Morristown, but also for the area near Newark, roughly midway between Morristown and New York City, and for the region to the south and east of Morristown. The CDR numbers are lower than the census numbers for the more distant northwestern region and somewhat higher for New York City.
Figure 1. Morristown laborshed maps calculated from (a) call detail record (CDR) data and (b) 2000 US Census data. The two maps show similar patterns, indicating that CDR data provides a plausible estimate of where Morristown workers live. The red dots represent Morristown’s center. (Both map panels are labeled with New Jersey, New York, and Manhattan.)
Studying the contour maps them-
selves reveals a high-density region
centered directly over Morristown, in-
dicating a large concentration of people
who both live and work in Morristown.
We also see that many more workers
come from areas north of Morristown
than south and that there seems to be
a cluster of workers to the east, in the
more heavily populated areas close to
New York City. Additionally, there are
some pockets of workers who come
from towns west of Morristown. Urban
planners could use this information to
reduce traffic congestion by organizing
new transit and bike routes or park-
and-ride programs.
Another way to validate our laborshed results against the CTPP data is to show the data underlying Figure 1 as a scatterplot. We plotted each zip code as a point reflecting the relationship between our estimates (which use account-billed zip codes) and the census numbers. Figure 2 shows the result, plotting the two axes on a logarithmic scale. Although, as discussed earlier, we don’t expect our estimates to perfectly agree with the census numbers, we do expect our points to align with theirs up to a multiplicative factor. Hence, they should fall close to a straight line in this plot. For comparison, we show the y = x equality line as a dotted line, and the best linear fit, y = 0.387x, as a solid line. The correlation coefficient is 0.81. There does seem to be a clear correspondence between the CDR- and census-based numbers, and if we want to roughly estimate numbers of people, we would multiply the CDR numbers by 1/0.387 (roughly 2.6).
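The article does not say how the fit was computed, so the following is only a sketch of one plausible reading: an ordinary least-squares line through the origin on the raw counts, a Pearson correlation coefficient, and the rescaling by 1/0.387. The function names are hypothetical.

```python
import math


def slope_through_origin(census, cdr):
    """Least-squares slope b for the model cdr = b * census (no intercept)."""
    return sum(x * y for x, y in zip(census, cdr)) / sum(x * x for x in census)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def estimate_people(cdr_count, slope=0.387):
    """Scale a CDR-based worker count up to an estimated head count."""
    return cdr_count / slope
```

With the reported slope, a zip code with 387 CDR-identified workers would correspond to an estimate of about 1,000 workers.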
Calculating the Partyshed
We can apply techniques similar to
those described previously to other
groups of people as well, such as people
who are active late at night. Like many
cities, Morristown has a lively bar and
restaurant scene that attracts both
locals and people from other communi-
ties. We refer to the geographical distri-
bution of where this group lives as the
partyshed.
We identified the partyshed cohort using CDRs by setting different criteria and thresholds from those we used for the laborshed. We looked for cell phone users who had voice call or text messaging activity late on weekend nights (10 p.m. to 3 a.m., Fridays to Sundays).
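Because the late-night window crosses midnight, a filter for it has to attribute the hours after midnight to the evening on which the night began. The sketch below shows one way to do that. The function name is hypothetical, and treating "Fridays to Sundays" as nights that begin on Friday, Saturday, or Sunday is an assumption; the set of days is a parameter so that a narrower reading can be substituted.

```python
from datetime import datetime, timedelta


def in_party_window(ts: datetime, night_days=(4, 5, 6)) -> bool:
    """True if ts lies between 10 p.m. and 3 a.m. of a night that starts
    on one of night_days (Monday is 0, so 4, 5, 6 are Friday to Sunday)."""
    if ts.hour >= 22:
        night_start = ts.weekday()
    elif ts.hour < 3:
        # After midnight: the night began on the previous calendar day.
        night_start = (ts - timedelta(days=1)).weekday()
    else:
        return False
    return night_start in night_days
```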
Figure 3 shows the resulting par-
tyshed. The distribution of partiers
appears to be considerably more con-
centrated in and near Morristown than
the distribution of workers. Nonethe-
less, there is still some representation of
people who live far away, even as far as
New York City. Knowing where groups
of revelers come from and where they
return to at the end of the night could
allow towns to tailor services such as
late-night shuttle buses intended to
keep inebriated drivers off the road.
Figure 2. Scatterplot showing agreement between Morristown laborshed numbers from call detail record (CDR) data and US Census data. Each point represents one zip code, and the point for Morristown is labeled. Both axes (census estimates on the horizontal axis, CDR estimates on the vertical axis) are logarithmic and run from 1 to 5,000. The solid line shows the best linear fit, where the CDR count equals 0.387 of the census count. If the CDR estimates exactly matched the census numbers, the points would fall on the dotted line.
Capturing a City’s Lifebeat
We can identify patterns of human activity in different parts of a city (a city’s lifebeat) by observing cell phone usage in different cell tower antenna coverage regions. By studying these patterns, city officials could potentially model the typical flow of people between different parts of the city over time. Monitoring these patterns might in turn allow the timely detection of anomalies such as dangerous overcrowding surrounding a popular music concert, or make it possible to follow the traffic flow during a weather emergency. Studies using data at the time scale of the national census clearly can’t address events of this nature, but CDR-based analysis can be performed in almost real time.
To analyze data from multiple cell tower antennas simultaneously, we developed a novel visual display capable of representing the data’s multivariate nature. We first aggregated the underlying data in two-minute intervals and then aggregated by the day of the week, which let us study day-specific patterns. In total, we ended up with 720 two-minute bins for each day of the week for both voice calls and SMS messages. We then used the principle of small multiples [3] to display the data for all combinations of the partitioned variables.
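The two-level aggregation can be sketched as follows. The event format `(kind, timestamp)` and the names are hypothetical, and the sketch counts events by their start time only, which is an assumption since the article does not say how call durations enter these bins.

```python
from collections import Counter
from datetime import datetime

BIN_MINUTES = 2
BINS_PER_DAY = 24 * 60 // BIN_MINUTES  # 720 two-minute bins per day


def bin_index(ts: datetime) -> int:
    """Index of the two-minute bin (0 to 719) containing the time of day."""
    return (ts.hour * 60 + ts.minute) // BIN_MINUTES


def aggregate(events):
    """Count events per (kind, weekday, bin).

    events: iterable of (kind, datetime) pairs, kind being "voice" or "sms".
    Events on the same weekday in different weeks land in the same bin.
    """
    totals = Counter()
    for kind, ts in events:
        totals[(kind, ts.weekday(), bin_index(ts))] += 1
    return totals
```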
Before illustrating how viewing the behavior of multiple antennas simultaneously can provide powerful insights, we present plots for two specific antennas in Morristown in detail. Figure 4 shows usage plots captured on two different days of the week for two antennas located on the same cell tower but pointing in different directions. The x-axis represents time, starting and ending at 6 a.m. The plot’s height shows the amount of traffic: height above the axis represents voice call volume, while height below the axis represents SMS volume. By using these opposite directions of the axes, we avoid overplotting and achieve at-a-glance shape recognition. In addition, our visual cognitive system can evaluate symmetry quickly, so we can easily assess whether voice usage strongly deviates from SMS usage. For both traffic types, we use color to distinguish inbound from outbound traffic. The resulting shape of the plot when used for within-day activity resembles lips; hence, we call this type of visualization lip plots.
Figure 3. Morristown partyshed map showing the home locations of people who used their cell phones during weekend late nights in downtown Morristown. Compared to the laborshed maps, partiers’ homes are concentrated in areas closer to Morristown than workers’ homes. (The map is labeled with New Jersey, New York, and Manhattan.)
Figure 4. Lip plots of voice call and SMS volumes show unusual spikes highlighting local patterns or events in Morristown, New Jersey. Call volume (plotted upward: inbound, red; outbound, blue) and SMS volume (plotted downward: inbound, light green; outbound, dark green) on two antennas are shown, with time running from 6 a.m. to 6 a.m. the next day. The antenna in (a), plotted for a Saturday, points toward the commercial and restaurant district, and the antenna in (b), plotted for a Tuesday, points toward the high school. A voice peak occurs Saturday at 2 a.m. when the bars close. Both voice and SMS peaks occur Tuesday when the school lets out.
The patterns of these two plots are strikingly different. Figure 4a shows data from a Saturday in one part of town. The SMS traffic dominates the voice traffic, and the volumes keep rising throughout the day with maximal usage between 11 p.m. and 1 a.m. The voice traffic, despite being dominated by the SMS traffic, has a noticeable spike at 2 a.m. This is the cell tower antenna pointing to the downtown area, including most of the restaurants and bars in town. The spikes might represent late night revelers. In particular, this voice spike might reflect the fact that the bars close at 2 a.m. and patrons are looking for a ride home.
The plot in Figure 4b has a different pattern. This plot is from a weekday and shows more data from SMS traffic than from voice. But here most traffic is in the morning. In particular, we see spikes in SMS usage at 7 a.m., 11 a.m., and 2 p.m. Only at 2 p.m. do we see a similar spike in voice traffic. This cell antenna points toward the town’s high school and could reflect the students’ communication patterns, texting before and after school and during lunch. The larger 2 p.m. spike at the end of the school day might reflect calls between students and parents, where voice channels would be more likely.
But how can we get a sense of all the
traffic in a given area in a quick and visually
appealing manner? We devised a
visualization that shows all the data for
all the days and antennas by treating
the cell tower, segment, and technology
as partitioning variables in a large
display. Figure 5 is a subset of such a
display showing three antennas of the
main cell tower downtown, but the
complete display showing all antennas
of this tower is best viewed by printing
it out on large-scale paper and hanging
it on a wall (a high-resolution, zoom-
able version of this complete display
is available at http://bit.ly/BigLipPlot).
In these composite displays, the indi-
vidual lip plots are laid out in a grid.
Each row represents a tower antenna
with a pictogram on the left annotat-
ing the particular combination: the cir-
cle’s size represents the frequency, the
line direction represents the segment’s
direction, and its color represents
the technology (red for 3G and blue
for 2.5G).
We’re interested in comparing relative behavior across sectors, so we scaled each row of the composite plot individually to remove the influence of the volume per segment. However, it is important to recognize sectors with a small volume because patterns for such sectors will be inherently more noisy. We’ve therefore included a bar chart of the volume for each activity type on the right side of each row.
Figure 5. Composite lip plots of three sectors from a single cell tower for each day of the week (Monday through Sunday, each day running from 6 a.m. to 6 a.m.). The lines with the circles on the left show the antennas’ compass directions. The bars on the right show relative traffic volumes of in/out voice and SMS traffic. Each antenna (row) has unique characteristics; for example, the high school antenna (top row) shows the characteristic SMS spikes of school activity only on weekdays.
These composite plots show intricate
patterns of communication and reward
careful scrutiny. We learned several
things from looking at the larger plot.
First, the patterns are heterogeneous. Each row reflects a single direction, so any directional patterns can be quickly seen. There’s a lot of variability across directions, reflecting differing usage patterns in different parts of the city. There’s also a wide variance in the volume covered by the different directions, as the bar plots on the right show. Certain directions simply have more traffic than others. In addition, the relationship between SMS and voice changes by direction. The middle row appears to have much more SMS traffic relative to voice than the other rows. This is particularly apparent on weekends. Finally, small-volume antennas have high variance and often result in a fuzzier look.
Identifying Usage Patterns
We used cell phone activity to group city
dwellers into categories that can help
urban planners answer several questions.
For example, do people who commute
to Morristown stay after work to eat
dinner and hang out, or do they head
home as quickly as possible? Do those
who live in Morristown head down-
town after work to grab dinner? Are
the users who shop on a weekend day
the same as those who hang out in the
bars on the weekend nights? We can address some of these questions by clustering users into groups based on their cell phone usage profiles and studying these groups in more detail. We use an unsupervised clustering algorithm that has no prior assumptions as to what these user profiles might look like. Our algorithm identifies clusters of behavior, with each cluster centroid representing the mean behavior in that group.
To capture usage patterns, we ag-
gregated voice and SMS usage sepa-
rately into bins. Each bin represents a
particular hour of the day and day of
the week, giving us a total of 168 bins
(24 × 7). We didn’t differentiate be-
tween incoming and outgoing events, as
our analysis shows that there is a strong
correlation between the two. A voice
call contributes to a bin an amount
proportional to the duration of the call
falling into that bin. For example, the
call contributes 60 minutes to the bin
if it spans the entire hour. For SMS, a
bin contains the number of SMS events
during a corresponding hour and day of
the week. The result of the aggregation is two matrices, $Q = (q_{i,j})$ for voice and $P = (p_{i,j})$ for SMS, with seven columns corresponding to days of the week and 24 rows corresponding to hours of the day.
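As an illustration of this aggregation, the sketch below builds the two 24 × 7 matrices as nested lists, splitting a voice call's duration across the hour bins it overlaps and counting each SMS once. The function names and the input format (a start timestamp plus a duration in minutes) are assumptions.

```python
from datetime import datetime, timedelta


def empty_matrix():
    """A 24 x 7 matrix of zeros: rows are hours, columns are weekdays."""
    return [[0.0] * 7 for _ in range(24)]


def add_voice_call(Q, start: datetime, duration_minutes: float) -> None:
    """Add a call to Q, giving each hour bin the minutes that fall into it."""
    remaining = timedelta(minutes=duration_minutes)
    t = start
    while remaining > timedelta(0):
        next_hour = t.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        chunk = min(remaining, next_hour - t)  # part of the call inside this hour
        Q[t.hour][t.weekday()] += chunk.total_seconds() / 60.0
        t += chunk
        remaining -= chunk


def add_sms(P, ts: datetime) -> None:
    """Count one SMS event in the bin for its hour and weekday."""
    P[ts.hour][ts.weekday()] += 1
```

A call that starts at 9:50 a.m. and lasts 30 minutes therefore adds 10 minutes to the 9 a.m. bin and 20 minutes to the 10 a.m. bin, and a call that spans a whole hour adds 60 minutes to that hour's bin, as in the example in the text.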
Finding stable clusters is challenging in datasets like these because of the high variance of individual behavior. For low-volume users, patterns might not present themselves clearly. If there are too many low-volume users, the clustering algorithm will wander too much, trying to fit a signal to the noise. We therefore fit the clustering algorithm on a reduced, thresholded set of users, and then used the clusters found from this reduced set to assign a cluster label to each user. We used a threshold of 10 hours of traffic over the 60-day period (where an SMS counts as a 1.5-minute call), leaving us with approximately 26,000 users. After identifying the clusters, we assigned each user a cluster based on the shortest Euclidean distance from the cluster means.
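The thresholding step amounts to converting each user's activity into equivalent call minutes and keeping those who reach 10 hours. A minimal sketch, with hypothetical names and the assumption that a user with exactly 10 hours is kept:

```python
SMS_MINUTES = 1.5            # an SMS counts as a 1.5-minute call
THRESHOLD_MINUTES = 10 * 60  # 10 hours over the 60-day period


def total_traffic_minutes(voice_minutes: float, sms_count: int) -> float:
    """Total traffic expressed in equivalent call minutes."""
    return voice_minutes + SMS_MINUTES * sms_count


def select_heavy_users(usage):
    """usage: {user_id: (voice_minutes, sms_count)} over the whole period.

    Returns the ids of users whose traffic reaches the threshold.
    """
    return {
        user
        for user, (voice, sms) in usage.items()
        if total_traffic_minutes(voice, sms) >= THRESHOLD_MINUTES
    }
```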
To make voice and SMS usage profiles comparable for the purpose of clustering, we divided the bin counts by the global mean of each activity group. We used a k-means clustering algorithm with normalized usage vectors $v_i$ consisting of the entries of both matrices $P$ and $Q$:
$$v_i = (u_{i,1}, \ldots, u_{i,336}) = (p_{1,1}, \ldots, p_{24,1}, p_{1,2}, \ldots, p_{24,7}, q_{1,1}, \ldots, q_{24,1}, q_{1,2}, \ldots, q_{24,7})$$
with $\forall i: \sum_j u_{i,j} = 1$ and a Euclidean distance measure. Despite the simplicity of the distance measure, which doesn’t account for the temporal relationship of the entries in the usage matrix, the resulting clusters are remarkably consistent and informative. The best result with respect to cluster size distribution is achieved for k = 7.
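The normalization and clustering can be sketched in plain Python as below. This is a textbook k-means (Lloyd's algorithm) written for illustration, not the authors' implementation: the random initialization, the iteration cap, the reading of "global mean" as the mean over all users and bins of one activity type, and all function names are assumptions.

```python
import random


def flatten_by_day(M):
    """Turn a 24 x 7 matrix into 168 values, one day (column) after another."""
    return [M[hour][day] for day in range(7) for hour in range(24)]


def usage_vectors(users):
    """users: list of (P, Q) pairs of 24 x 7 matrices (SMS counts, voice minutes).

    Returns one 336-element vector per user, SMS entries first, scaled by the
    global mean of each activity type and then normalized to sum to 1.
    Assumes every user has some traffic, as after thresholding.
    """
    sms = [flatten_by_day(P) for P, _ in users]
    voice = [flatten_by_day(Q) for _, Q in users]
    sms_mean = sum(map(sum, sms)) / (len(sms) * 168) or 1.0
    voice_mean = sum(map(sum, voice)) / (len(voice) * 168) or 1.0
    vectors = []
    for s, v in zip(sms, voice):
        u = [x / sms_mean for x in s] + [x / voice_mean for x in v]
        total = sum(u)
        vectors.append([x / total for x in u])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50, seed=0):
    """Return k cluster means found by Lloyd's algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # start from k randomly chosen users
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: sq_dist(v, centers[c]))
            groups[nearest].append(v)
        # New mean per cluster; an empty cluster keeps its previous center.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Calling `kmeans(usage_vectors(users), 7)` would correspond to the k = 7 setting reported in the text, although results from a simple initialization like this one can vary with the seed.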
Figure 6 shows a plot of the seven cluster means, each representing a specific usage profile. A few key usage types are evident. For example, cluster 2 consists of voice users who have heaviest usage during business hours, are a bit less active on the weekend, and have little to no SMS usage; cluster 4 are after-dinner voice callers; and cluster 7 are business-hour texters. Few individuals’ profiles will look like these idealized cluster means, but we can calculate a distance from any single usage profile to these cluster means to determine that user’s most likely cluster. In aggregate, this lets us calculate the cluster mix, or the proportion of different types of user profiles present in the city at any given point of time. Additionally, using this method, we could study user behavior through time to see seasonal changes.
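Assigning a profile to its nearest cluster mean and tallying the resulting labels gives the cluster mix. The sketch below uses hypothetical names and tiny two-dimensional profiles in place of the 336-element usage vectors.

```python
from collections import Counter


def nearest_cluster(profile, cluster_means):
    """Index of the cluster mean with the smallest Euclidean distance."""
    return min(
        range(len(cluster_means)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(profile, cluster_means[c])),
    )


def cluster_mix(profiles, cluster_means):
    """Proportion of the given profiles assigned to each cluster."""
    labels = Counter(nearest_cluster(p, cluster_means) for p in profiles)
    total = sum(labels.values())
    return {c: labels.get(c, 0) / total for c in range(len(cluster_means))}


# Made-up example: three of four users are closest to the first mean.
mix = cluster_mix(
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.7, 0.3]],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

Restricting `profiles` to the users observed in a given time window would give the mix for that window.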
By combining our clustering results
with contour maps, we can visualize
and compare the geographical foot-
print of users belonging to different
clusters. If the different clusters really
represent different segments of society,
they might have a different geographi-
cal footprint. Urban planners might be
interested in which of these clusters re-
fer to groups of people that live in Mor-
ristown, as opposed to those that are
coming to visit or are passing through.
Compare, for example, the usage profiles of clusters 5 and 7 in Figure 6. At first glance, these two profiles look fairly similar: both show heavy SMS usage with little voice activity. The major difference is that cluster 5 shows usage on the weekends and at night, whereas cluster 7 is predominantly active during business hours.
Figure 7 plots the geographical footprint of users belonging to these two clusters. Despite their similarity in usage, the two clusters have quite different geographical footprints. Cluster 7 is a geographically diverse group of people who by and large live outside of Morristown, with significant concentrations both to the east and west of town. This helps explain why this profile doesn’t show a lot of weekend activity: people who work in Morristown but live further away might not want to return to the town during their free weekend hours. Cluster 5 is different: it is concentrated in and around Morristown, perhaps indicating students in the town. In fact, each cluster has a unique signature of usage indicated by hours and days of primary use, indicating interesting aspects about city visitors.
In the future, we will investigate how cellular network data can be used to identify anomalous events, such as parades, holidays, or disruptions due to traffic or weather incidents. We will also investigate the accuracy of estimating the geographical distribution of city residents’ work locations. Finally, we plan to study how useful it is to capture the temporal effects in our data for urban planners.
REFERENCES
1. C. Song et al., “Limits of Predictability
in Human Mobility,” Science, vol. 327,
no. 5968, 2010, pp. 1018–1021.
2. Census Transportation Planning Package (CTPP) 2000: Part 3;
www.transtats.bts.gov.
3. E.R. Tufte, The Visual Display of Quanti-
tative Information, Graphics Press, 1983.
Figure 6. Seven cell phone usage patterns identified via clustering. Patterns emerge based on voice call and SMS volumes on different days of the week and hours of the day. Voice usage is shown on the left of the gray vertical bars, SMS usage on the right. Darker colors indicate higher volumes. The bar at the top shows the relative size of the cluster, with the cluster number on the top left. For example, cluster 1 shows only voice calls, just before and just after business hours. In contrast, cluster 7 shows primarily SMS usage during business hours.
(Cluster sizes: cluster 1, 5.4%; cluster 2, 19.8%; cluster 3, 18.4%; cluster 4, 8.1%; cluster 5, 22.2%; cluster 6, 15.9%; cluster 7, 10.2%. In each panel, the horizontal axis shows days of the week and the vertical axis shows the hour of the day.)
Figure 7. The cell phone usage clusters shown in Figure 6 have different geographic footprints. For example, (a) cluster 5 (primarily SMS usage outside of school hours) has a Morristown-centric footprint while (b) cluster 7 (SMS usage during business hours) draws people who live in a much wider area. (Both map panels are labeled with New Jersey, New York, and Manhattan.)
The Authors
Richard A. Becker is a member of the technical staff in the Statistics Research Department at AT&T Labs–Research, Florham Park, New Jersey. His research interests include statistical computing, data visualization, and data analysis. He is author of The New S Language (Chapman & Hall, 1988), an AT&T fellow, a fellow of the American Statistical Association, and a member of ACM and the IEEE Computer Society. Contact him at rab@research.att.com.
Ramón Cáceres is a member of the technical staff at AT&T Labs–Research. His research interests include mobile and pervasive computing, wireless networking, virtualization, security, and privacy. Cáceres has a PhD in computer science from the University of California, Berkeley. He is an ACM distinguished scientist, an IEEE senior member, and a Usenix member. Contact him at ramon@research.att.com.
Karrie Hanson is a member of the Communications Technology Research group at AT&T Labs–Research. Her research interests center around building new services using technologies such as voice over IP, teleconferencing, medical sensing, and location information. Hanson has a PhD in chemical engineering, with an emphasis in electrochemistry, from the University of California, Berkeley. Contact her at karrie@research.att.com.
Ji Meng Loh is a member of the technical staff at AT&T Labs–Research. His research interests include spatial statistics, bootstrap of spatial data, and anomaly detection, with applications in areas such as astronomy, fMRI studies, and public health. Loh has a PhD in statistics from the University of Chicago. Contact him at loh@research.att.com.
Simon Urbanek is a member of the technical staff in the Department of Statistics at AT&T Labs–Research. His research interests include visualization, exploratory model analysis, statistical computing, and data mining. He is associate editor of the Journal of Statistical Software and a member of the R Core Development Team, Interface Foundation, and American Statistical Association. Urbanek has a PhD in statistics from the University of Augsburg, Germany. Contact him at urbanek@research.att.com.
Alexander Varshavsky is a member of the technical staff at AT&T Labs–Research. His research interests include mobile and ubiquitous computing, context-awareness, and security in mobile systems. Varshavsky has a PhD in computer science from the University of Toronto. Contact him at varshavsky@research.att.com.
Chris Volinsky is the executive director of the Statistics Research Department at AT&T Labs–Research in Florham Park, New Jersey. His research interests include large-scale data mining, recommender systems, social networks, statistical computation, and anomaly detection. Volinsky has a PhD in statistics from the University of Washington. Contact him at volinsky@research.att.com.