
An Investigation in Defining Neighbourhood Boundaries

Using Location Based Social Media


Tai Tong KAM
28th August 2015
For BENVGSC6: Dissertation

Supervised by: Steven Gray, Dr Elsa Arcaute


Word Count: 10,169 words
This dissertation is submitted in partial fulfilment of the requirements for the MSc in
Smart Cities and Urban Analytics in the Centre for Advanced Spatial Analysis, Bartlett
Faculty of the Built Environment, University College London.

ABSTRACT
The widespread use of smartphones and social media has opened opportunities for
researchers to define one of the most elusive concepts in cities: neighbourhoods. While
the number of neighbourhood detection methods using location based social media has
increased in recent years, there is much that we do not know about the process. For
example, researchers have rarely integrated the neighbourhoods detected with
administrative data to add meaning beyond what can be inferred from social media.
This work takes a step towards better understanding neighbourhood detection methods,
and also attempts to add meaning to the clusters / neighbourhoods generated by
incorporating administrative data to these clusters / neighbourhoods.
I break down the neighbourhood detection process into three common elements: (a) the
unit used for aggregation, (b) the type of clustering method used, and (c) the similarity
measure.
I then illustrate one way of better understanding the neighbourhood detection process by
applying multiple variations of the Livehoods method (Cranshaw et al., 2012) on data
from Greater London, and find that in addition to neighbourhood clusters, the
Livehoods method may also be able to generate clusters that depict the city's boundaries
from the residents' perspective.
I also make a preliminary attempt in this work to combine the clusters / neighbourhoods
formed using the Livehoods method with data from London's Lower Super Output
Areas to investigate ethnic diversity in neighbourhoods. I find that using location
based social media may generate neighbourhood boundaries that are more appropriate
than, or can complement, traditional administrative boundaries for studies where the
definition of neighbourhood goes beyond arbitrary administrative boundaries and a
multifaceted view of neighbourhoods is needed.

DECLARATION
I, Tai Tong Kam, hereby declare that this dissertation is all my original work and that all
sources have been acknowledged. It is 10,169 words in length.
Signature

====================
Date: 28th August 2015

TABLE OF CONTENTS

1. RESEARCH GOAL AND OVERVIEW
   1.1. Research goal, motivations, and limitations
   1.2. Overview
2. INTRODUCTION
   2.1. Neighbourhoods
   2.2. Location Based Social Media and Detecting Neighbourhood Boundaries
   2.3. Review of Methods for Neighbourhood Detection
3. METHODOLOGY
   3.1. Data sources
   3.2. Data sorting, import, storage and analysis
   3.3. The Livehoods method
4. ANALYZING THE LIVEHOODS METHOD
   4.1. Tuning the number of smallest eigenvalues (k)
   4.2. Tuning the alpha constant (α)
   4.3. Tuning the nearest neighbours parameter (m)
   4.4. Using cosine similarity
   4.5. Nearest neighbours versus full similarity graph
   4.6. Summary
5. DESCRIPTION OF LIVEHOOD CLUSTERS / NEIGHBOURHOODS
   5.1. Overview of neighbourhoods
   5.2. Breakdown of individual neighbourhoods
6. COMPARING LIVEHOODS CLUSTERS TO LOWER SUPER OUTPUT AREAS
7. CONCLUSION
   7.1. Concluding Remarks
   7.2. Limitations and Future Research
8. BIBLIOGRAPHY
9. APPENDIX
   9.1. Scripts for collecting and formatting data for analysis
      9.1.1. IPython notebook: twitter_streaming.ipynb
      9.1.2. IPython notebook: extract_twitter_data.ipynb
      9.1.3. IPython notebook: foursquare_search_place.ipynb
      9.1.4. IPython notebook: format_data_for_analysis.ipynb
   9.2. Scripts for Livehoods clustering method
      9.2.1. Bash script: install.sh
      9.2.2. Bash script: runLDN.sh
      9.2.3. Python script: clustering.py
      9.2.4. Python script: clusteringalgo.py
      9.2.5. Python script: getdata.py
      9.2.6. Python script: utils.py
   9.3. Scripts for visualizing cluster results
      9.3.1. Python script: formatresults.py
      9.3.2. Python script: visualize_cluster_results.py
   9.4. Scripts for comparing Lower Super Output Areas with Livehoods clusters in terms of ethnic diversity
      9.4.1. Python script: extract_ldn_lsoa.ipynb
      9.4.2. Python script: add_ethnic_diversity_to_geojson.ipynb
      9.4.3. Python script: stats_for_eth_diversity.ipynb
      9.4.4. R script: ethnic_diversity_chart.R
   9.5. Livehood clusters for nearest neighbours parameter m=5 to m=20
   9.6. Largest cluster generated from Livehoods method

LIST OF FIGURES
Figure 1: Relationship between number of smallest eigenvalues (k) found and number of clusters formed
Figure 2: Boundaries formed for different numbers of clusters
Figure 3: Boundaries formed for different alpha constants
Figure 4: Boundaries formed for different nearest neighbours parameters (m)
Figure 5: Clustering results for London
Figure 6: Properties of Livehood clusters
Figure 7: Overall distribution of venues and check-ins across clusters
Figure 8: Hirschman concentration index (HI) for clusters

LIST OF TABLES
Table 1: Summary statistics for cluster results for London
Table 2: Percentage difference between proportion of venues within cluster and proportion of venues within city in terms of Foursquare's main categories
Table 3: Percentage difference between proportion of users within cluster checking in and proportion of users within city checking in, in terms of Foursquare's main categories

ACKNOWLEDGMENTS
I would like to thank my supervisors, Steven Gray and Elsa Arcaute, who have been
extremely supportive and helpful throughout the dissertation process. Steven was also
instrumental in helping me process the data by guiding me on the process for setting up
the cloud computing infrastructure required to run the time-consuming scripts in parallel.
Elsa introduced me to Anastasios Noulas from the University of
Cambridge, who kindly provided the Foursquare data used in this work.
I would also like to thank all the teachers, staff and fellow course mates at CASA, who
have given me a great year of friendship, learning and joy in my time at CASA and
inspired me to do better.
Finally, I would like to thank my partner Cherlyn Ng, whose love, patience and support
made it possible for me to focus on my work while we were 6,740 miles apart.

1. RESEARCH GOAL AND OVERVIEW


1.1. Research goal, motivations, and limitations
The widespread use of smartphones and social media has generated an immense
amount of data which has been used to study topics such as mobility and event
detection in the city (Silva et al., 2013). Some researchers have been attempting to
use the data to define one of the most elusive concepts in cities: neighbourhoods
(Cranshaw et al., 2012; Falher et al., 2015; Zhang et al., 2013). While the research is
promising, there is much that we do not understand about the process of detecting
neighbourhoods using location based social media. For example, we do not know
how the neighbourhoods detected compare with traditional administrative
boundaries, and how we can combine the neighbourhoods detected with data from
these administrative boundaries to help us better understand cities dynamically. We
also do not know how the neighbourhoods detected may change when data over
different time periods or different time intervals are used and what these changes
may mean.
This work takes a step towards better understanding neighbourhood detection
methods. I break down the neighbourhood detection process into three common
elements: (a) the unit used for aggregation, (b) the type of clustering method used,
and (c) the similarity measure used, so that they can be studied in depth.
Better understanding can come in the form of research on particular elements in the
neighbourhood detection process across a variety of methods and comparing the
differences when different elements are used. It can also come in the form of better
understanding a particular method in depth and exploring how the neighbourhoods
formed are different depending on the parameters used.

In this dissertation, I illustrate one way of doing this by applying multiple variations
of the Livehoods method (Cranshaw et al., 2012) on data from Greater London. The
Livehoods method was chosen as it is a venues-based approach which has not been
used as much in the literature. In addition, it has not yet been applied to the Greater
London area.
As mentioned above, we do not understand how we can combine the clusters /
neighbourhoods detected via neighbourhood detection methods with data from these
administrative boundaries to help us better understand cities. Integrating clusters /
neighbourhoods detected using neighbourhood detection methods with data from
administrative boundaries is rare in the neighbourhood detection literature, as most
researchers using neighbourhood detection methods have used them for developing
recommendation engines that find similar places based on social media activity. As
such, I make a preliminary attempt in this work to combine the clusters /
neighbourhoods formed using the Livehoods method with data from more
traditional administrative boundaries (the Lower Super Output Areas in this case) to
extend the meaningfulness of the clusters / neighbourhoods formed. In particular, I
have tried to integrate ethnic diversity data with the clusters / neighbourhoods
formed using the Livehoods method.
As neighbourhood detection using location based social media is relatively new and
there are few comparisons between existing neighbourhood detection methods, this
work is not aimed at evaluating whether one method or even whether particular
elements of a method are better than another. Neighbourhood detection is a form of
clustering, and determining the best clustering method has a certain degree of
subjectivity.

1.2. Overview
The dissertation is divided into seven sections.
Section Two discusses the concept of neighbourhoods, its importance for
understanding cities and why social media is a useful source of data for defining
neighbourhoods. I will review the methods that have so far been used for defining
neighbourhoods and three common elements used by the methods: (a) the unit used
for aggregation, (b) the type of clustering method used, and (c) the similarity
measure used. I will then describe what we have learnt so far about neighbourhood
detection using location based social media, and outline some ideas for better
understanding these methods.
Sections Three to Six illustrate one way we can better understand neighbourhood
detection methods by taking a closer look at the Livehoods method (Cranshaw et al.,
2012). Section Three begins by describing the data and methodology used.
Section Four then considers different variations of Cranshaw et al.'s (2012)
Livehoods method for neighbourhood detection and tests three different parameters
to find out if changing them affects the clustering results.
Section Five describes the clusters / neighbourhoods that are formed using the
Livehoods method and explores some types of information that can be derived from
these clusters, by combining the clusters with Foursquare's venues database.
Section Six describes the clusters / neighbourhoods that are formed using the
Livehoods method by combining them with data from Lower Super Output Areas
(LSOAs) in Greater London. It discusses the modifiable areal unit problem
(Openshaw, 1984) and how the clusters / neighbourhoods formed using the
Livehoods method may, given their characteristics, be more appropriate than
traditional administrative boundaries such as the LSOAs.

Section Seven consists of concluding remarks and outlines some ideas for further
research that can help us better understand neighbourhood detection methods using
location based social media.


2. INTRODUCTION
2.1. Neighbourhoods
Neighbourhoods are a ubiquitous feature of urban living: everyone lives in a
neighbourhood. Many groups have an interest in understanding neighbourhoods.
Cranshaw and Yano (2010) note that analysing neighbourhoods is of interest to
businesses such as realtors and developers as the quality of a neighbourhood
affects the value of their assets, and to researchers in the social sciences as they seek
to understand neighbourhood and community level factors that influence
phenomena such as obesity rates and perceived happiness through neighbourhood
effects (Sampson et al., 2002). A third group that has an interest in neighbourhoods
are city governments that implement neighbourhood interventions and wish to
identify where the interventions would make sense and be most effective. Being
able to identify neighbourhoods in our cities would be valuable to all three groups.
While there is a general consensus that a neighbourhood is "a contiguous
geographic area within a larger city, limited in size, and somewhat homogeneous in
its characteristics" (Weiss et al., 2007), it is hard to pin down a more exact definition
(Chaskin, 1998; Weiss et al., 2007). Researchers have defined neighbourhoods in
terms of three dimensions, with varying emphasis: social ties, physical demarcations
and residents' experiences (Chaskin, 1997). These are influenced by many factors
such as administrative boundaries, manmade features such as roads, natural features
such as rivers, demographics, social networks of the people that live in or frequent
the area, and the availability of services and facilities (Cranshaw and Yano, 2010).
Each person's perception of their neighbourhood boundaries may differ, even from
their neighbours, and these perceptions may also differ from the official boundaries
used by city governments for urban planning or neighbourhood initiatives

(Campbell et al., 2009). However, researchers have also found evidence that
residents often identify a common core within their neighbourhood, and the
differences are about the boundaries where neighbourhoods begin and end
(Campbell et al., 2009).
Neighbourhoods differ from communities, in the sense that neighbourhoods are tied
to a spatial unit with boundaries, while communities are not limited to spatial units.
This difference is reflected in how the role of neighbourhoods in cities has shifted
over time. To summarize Chaskin (1997), neighbourhoods in the past were tied
closely to the idea of community. There were close ties between those living within
a neighbourhood and a strong sense of identity, akin to an urban village. However,
as transportation systems improved and communication over long distances became
available, ties within a neighbourhood have become less close and more functional,
providing a space where neighbours share information, aid and services. When
studying social ties within neighbourhoods, it may be useful to look at common
social and functional activities between those living in a neighbourhood and where
these activities take place. These may give an indication of places that are
considered part of the neighbourhood for those involved in the activities.
Traditionally, studies on neighbourhoods and the neighbourhood effect have used
boundaries where data was easily available, such as administrative and political
boundaries. The data is often reliable as it is typically collected by government
agencies, and the boundaries used usually do not change greatly. Such data is useful
for understanding long term trends and behaviours such as demographics and
urbanisation. However, these traditional data sources are usually collected at certain
periods with long intervals between each period. The data collected represents
snapshots at particular points in time, and does not capture the multiple changes that
may occur in between data collection periods. This means that data from traditional
sources is less useful for researchers interested in questions where trends and
behaviours are more short term or temporary in nature, such as commuting behaviour
during transport strikes or riots. For example, full censuses in the United Kingdom
take place only once every ten years. In addition, data from traditional sources is
often expensive and time consuming to collect. Such issues mean that data from
traditional sources is less suitable for studying trends and behaviours that are short
term in nature and change frequently. For studying such dynamic trends and
behaviours, location based social media is likely to be a more suitable data source.

2.2. Location Based Social Media and Detecting Neighbourhood Boundaries


Location based social media is a relatively new source of data for researchers. Users
of these platforms post their thoughts or activities with location data attached. Many
of the characteristics of data from these posts or check-ins make it suitable for
studying short term phenomena and behaviours. It is easily available, it is cheap and
quick to collect, and it provides multiple points of data within a short period. Its
biggest advantage over other data sources is the amount of context that it provides.
A typical data point from location based social media contains information on who
the user is, where the user was, and when the data was created. It also provides
additional information depending on the social media platform used. For example,
Twitter (https://twitter.com/) users post tweets indicating what they were doing or
thinking, Instagram (https://instagram.com/) users post photos, and Foursquare
(https://foursquare.com/) users provide more detailed information about their
location. Social media platforms may provide additional contextual information.
The aforementioned Foursquare, for example, maintains a database of
venues that their users post from. This database contains rich contextual information
such as the type of venue (e.g. restaurant, school) and its popularity, which can be
linked to the posts from its users. Furthermore, it is possible to look at the
relationships between different users on social media platforms through the users'
interactions with each other.
Silva et al (2012) observe that the widespread adoption of smartphones and social
media websites has created a valuable opportunity to study city dynamics. Data
from location based social media provides rich contextual information on user
activity at different times of day. These characteristics make location-based social
media useful in detecting the invisible image of cities (Silva et al., 2012), such as
patterns of transition between locations that serve different functions in the city.
Given that city neighbourhoods do not follow strict boundaries and can shift over
time (Chaskin, 1997), location-based social media, which provides a large amount
of data in real time, is a useful source of information for neighbourhood detection in
cities and identifying changes over time. As such, researchers have also started to
use social media to detect neighbourhood boundaries.
Using data from location based social media has its limitations. While data from
location based social media has rich context and can be collected easily, such
platforms are typically used by young males who are interested in technology
(Cranshaw et al., 2012), thus the data represents a skewed demographic. Using such
data may generate clusters / neighbourhoods that reflect the views of a certain
demographic, which may not be in agreement with the general population. In
addition, data on these platforms are usually private unless the user agrees to share

the data publicly, which further limits the amount of data available for analysis.
Another factor to consider is that users may curate the types of places that they
check-in at using location based social media. Places that are considered more
socially desirable to be at may be over-represented when using data from location
based social media. For example, people may be more likely to check in when eating
at a new fancy restaurant or shopping in a branded goods store rather than when
they are eating at a fast food restaurant or shopping in a discount store. This means
that conclusions based on data from location based social media will likely be
biased towards such socially desirable venues. In the case of neighbourhood
detection, the clusters / neighbourhoods formed may be similarly biased. Previous
research has shown that users have been more likely to check in at venues
concerning travel and transport, office buildings, and residences (Preotiuc-Pietro
and Cohn, 2013). Despite these limitations, researchers believe that data from
location based social media can still be valuable for its rich contextual information
and sheer volume available (Silva et al., 2013).
2.3. Review of Methods for Neighbourhood Detection
What follows is a review of neighbourhood detection methods using location-based
social media. Neighbourhood detection using location based social media is
typically treated as a clustering problem, and the methods used so far reflect this
paradigm. Essentially, researchers wish to cluster users' social media activities into
contiguous geographic areas based on certain measures of similarity.
Neighbourhood detection methods usually contain three elements:
a. The unit used for aggregation (e.g. grid-based, venue-based)
b. The type of clustering method (e.g. K-Means clustering, spectral
clustering)

c. The similarity measures used

Unit used for aggregation


While the data from location based social media comes in the form of individual
posts or check-ins, they are usually aggregated in some spatial form before being
clustered. A common method used in neighbourhood detection is to take the grid-based approach for aggregating the posts. This means dividing the city into multiple
grid squares of equal size and aggregating the properties of the posts within the grid
square. The properties of the grid squares are later used to calculate similarity
measures between grid squares during clustering. Noulas et al (2011), for example,
used a grid-square approach where each grid contained the distribution of
Foursquare venue categories nearby and the number of check-ins at these venues.
Grid squares that are contiguous and are similar to each other based on the
clustering algorithm are then grouped up and form neighbourhoods. Grid-based
approaches can alter the neighbourhoods formed depending on the number, size and
shape of the grid cells used, and this is an important consideration when adopting
the approach. For example, large grid cells mean fewer cells overall and faster
processing, but they are less precise in delineating neighbourhood boundaries. In
certain cases, the grid square itself may be treated as a neighbourhood. The size of
the grid is thus often a key decision in grid-based approaches.
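As an illustration of this kind of aggregation, the following is a minimal sketch (my own, not taken from any of the cited papers) that maps check-ins to equal-sized grid cells and counts venue categories per cell; the origin, cell size and check-in data are invented for the example:

    from collections import Counter

    def grid_cell(lat, lon, origin=(51.28, -0.51), cell_size_deg=0.01):
        # Map a check-in coordinate to an integer (row, col) grid cell index.
        # origin is the south-west corner of the study area and cell_size_deg
        # is the side length of each square cell in degrees (illustrative values).
        return (int((lat - origin[0]) // cell_size_deg),
                int((lon - origin[1]) // cell_size_deg))

    # Hypothetical check-ins as (latitude, longitude, venue category).
    checkins = [(51.501, -0.124, "Theatre"),
                (51.503, -0.119, "Restaurant"),
                (51.510, -0.130, "Restaurant")]

    # Aggregate the category distribution within each grid cell.
    cell_categories = Counter((grid_cell(lat, lon), cat)
                              for lat, lon, cat in checkins)
    print(cell_categories)

The per-cell category counts produced this way are the kind of grid properties that similarity measures are later computed over.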
A second method is the venues-based approach. Venues are locations specifically
identified by location-based social media platforms, which usually have a database
of venues that users can check-in from. Researchers can make use of the data
contained in these venue databases in addition to the posts made by the users to

develop methods for neighbourhood detection. Venues that are considered similar to
each other and fulfil a proximity criterion such as being within a certain distance
from each other are then grouped together and the area bounded by these venues
form a neighbourhood. The proximity criterion is important as it defines the
geographic aspect of the venues. It is similar to how defining the size and shape of
the grids in the grid-based approach determines how the grids are geographically
related to each other. One of the earliest attempts at neighbourhood detection using
location based social media is called Livehoods (Cranshaw et al., 2012) and this
took the venues-based approach. Zhang et al (2013) pointed out that one of the
weaknesses of the venues-based approach is that the neighbourhoods formed have to
be geographically tied to the network of venues used, whereas neighbourhoods
formed by the grid-based approach do not.
Clustering methods
Clustering methods used in neighbourhood detection are a reflection of the breadth
and variety of clustering methods used in other fields. This dissertation does not
seek to determine which clustering methods are the best methods for
neighbourhood detection using location baesd social media, since there is a certain
degree of subjectivity. So far, neighbourhood detection methods have included
clustering methods such as K-Means clustering (Del Bimbo et al., 2014), spectral
clustering (Cranshaw et al., 2012; Noulas et al., 2011), and topic-based modelling
(Cranshaw and Yano, 2010). Each clustering method involves the researcher
choosing parameters, such as the number of topics to use for topic-based
modelling and the number of clusters in K-Means clustering.


Similarity measures
A variety of similarity measures have been used in neighbourhood detection. In
terms of properties to include in the similarity measure, researchers have used
properties related to users, such as users' check-in patterns and interests (Del
Bimbo et al., 2014). Researchers have also used properties related to venues in the
databases of location based social media platforms, such as the distribution of
Foursquare venue categories nearby and the number of check-ins at these venues
(Noulas et al., 2011). Other researchers have combined the above-mentioned
properties with temporal properties to provide a contextually richer set of properties
to calculate similarity (Falher et al., 2015; Zhang et al., 2013). Different properties
characterise neighbourhoods in different ways, making them useful for different
purposes. Amongst the three dimensions of neighbourhoods mentioned earlier
(social ties, physical demarcations and residents' experiences), methods in
neighbourhood detection using location based social media have typically used
properties related to residents' experiences, for example the number of check-ins,
the temporal pattern of check-ins, and the type and number of venues in the area.
Cosine similarity measures similarity as the angle between two vectors (Xia et al.,
2015). In neighbourhood detection methods, these vectors represent the properties of
the grid and of the venues in the grid-based method and the venues-based method
respectively. Cosine similarity is often used for clustering in neighbourhood
detection with location based social media, and often preferred over other similarity
measures because cosine similarity does not take the magnitude of the vectors into
account. This is useful in cases where the magnitudes of the vectors differ greatly
but at the same time less important for determining similarity. For example, cosine
similarity is often used in information retrieval to determine document similarity,
as the relative frequencies of words in each document and across documents are
more important than the total number of words in a document (Huang, 2008).
Similarly, the magnitudes of vectors used in neighbourhood detection differ greatly. The most
popular venues often garner many more check-ins than those less popular and the
most active users check-in much more frequently than those who are less active
(Scellato and Mascolo, 2011). As such, researchers have found that relative
frequencies between venues/grid squares are more useful for neighbourhood
detection rather than absolute numbers, and prefer cosine similarity measures over
Euclidean distance measures when measuring similarity for neighbourhood
detection (Cranshaw et al., 2012; Preotiuc-Pietro et al., 2013).
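To make this concrete, here is a minimal sketch (with invented venue vectors, not data from this work) showing that cosine similarity treats two venues with the same relative check-in pattern as identical even when one is far more popular, while Euclidean distance does not:

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between two venue check-in count vectors.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Components are check-in counts by users u1..u4; venue_b has the same
    # relative pattern as venue_a but is ten times as popular overall.
    venue_a = np.array([2, 0, 1, 4])
    venue_b = np.array([20, 0, 10, 40])

    print(cosine_similarity(venue_a, venue_b))  # 1.0 -- same direction
    print(np.linalg.norm(venue_a - venue_b))    # ~41.2 -- large Euclidean distance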
Researchers use different combinations of the three elements (unit used for
aggregation, clustering method, similarity measure) of neighbourhood detection to
create neighbourhoods, depending on their research purpose. Within each element,
researchers have also had to make decisions that influence the eventual
neighbourhoods formed. Most of the research so far seeks to compare urban
neighbourhoods within and across cities so that recommendation engines can make
better recommendations based on criteria such as the user's check-in patterns, the
user's preferred venue categories and the user's interests. Their goal is to suggest
new places that the user may wish to visit, which are similar to places the user has
visited in the past.
A typical example of a neighbourhood detection method for recommendation
engines comes from Noulas et al (2011). They take a grid-based approach and use a
spectral clustering algorithm to cluster grid squares based on the distribution of
Foursquare venue categories nearby and the number of check-ins at these venues.
The method creates neighbourhoods that give us an idea of what types of places are
in an area, and a measure of their importance based on users' check-in activity.
Another example is Del Bimbo et al.'s (2014) LiveCities method, which performed
K-means clustering using data on Facebook check-ins and user interests and
Foursquare venue categories.
An early attempt at neighbourhood detection was the Livehoods algorithm
(Cranshaw et al., 2012), which took the venues-based approach and used spectral
clustering to cluster Foursquare venues in Pittsburgh in the United States based on
spatial and social proximity. Through interviews with local residents, Cranshaw et al
(2012) found that neighbourhood detection methods could generate clusters /
neighbourhoods that reflect the character of life in cities. More recent attempts
have combined more information and experimented with different elements. For
example, Zhang et al.'s (2013) Hoodsquare method takes a grid-based approach and
assesses the similarity of a grid cell with its neighbouring grid cells based on (a) the
distribution of Foursquare venue categories in the vicinity; (b) whether these venues
were frequented by tourists or locals, and; (c) the busiest time of the day in terms of
check-ins at these venues. Neighbourhoods were then formed by finding groups of
grid cells that had high relative homogeneity. Zhang et al (2013) point out that using
multiple types of information may better represent the multifaceted nature of
neighbourhoods, and that grid-based methods may be more suitable for identifying
neighbourhoods as the boundaries formed using grid-based methods are not bound
to a particular set of venues.
The most recent attempt at neighbourhood detection using location based social
media describes neighbourhoods in terms of the activity they host (Falher et al.,
2015). Falher et al. consider two neighbourhoods to be similar if they contain the
same kinds of Foursquare venues in the same proportions. In addition to basing the
similarity of these venues on the number of check-ins and unique users as well as
the temporal distribution of the check-ins, they also take into account the
distribution of Foursquare venues in the surrounding area.
Cranshaw and Yano (2010) provided a different perspective by treating the question
as an issue of latent topic discovery. They divided the city into grids and applied
topic-based modelling to the grids, using each grid as a document and each
Foursquare category tag as a word. With this method, they were able to identify
clusters of places and activities that often appeared together (e.g. beach and seafood).
While research on neighbourhood detection using location based social media has
flourished, there is less research available on understanding whether these methods
accurately reflect neighbourhoods in reality, and how they can contribute to
purposes other than recommending new places that users may wish to visit.
Researchers using the Livehoods algorithm attempted to validate the
neighbourhoods generated through their algorithm (Cranshaw et al., 2012). The
neighbourhoods identified by Cranshaw et al.'s algorithm included neighbourhoods
that corresponded with municipal boundaries, those that were subsets of municipal
boundaries and those that spilled over to more than one municipal boundary.
Cranshaw et al. interviewed 27 residents who lived in the city and found that the
neighbourhoods generated by their Livehoods method closely matched the residents'
perspectives of neighbourhoods in the city. Cranshaw et al.'s research provides
evidence that the boundaries generated by neighbourhood detection algorithms can
capture local dynamics that include factors such as municipal boundaries,
demographics, traffic flow and economic development.


Some researchers have argued that including more properties in the similarity
measures would better characterise the units being aggregated and produce clusters
that more closely match actual neighbourhoods. For example, Del Bimbo et al (2014)
use both static features (e.g. categories assigned by location based social networks)
and dynamic features (e.g. distribution of the interests of the people who check in at
venues) in their LiveCities method to create neighbourhoods for Florence, which
they then validated qualitatively through online questionnaires with 28 residents.
They found that including both types of features produces neighbourhoods that better
reflect the residents' perceptions.
There is much that we do not know about the methods used for neighbourhood
detection process with location based social media. For example, we do not know
how the neighbourhoods detected compare with traditional administrative
boundaries, and how we can combine the neighbourhoods detected with data from
these administrative boundaries to help us better understand cities dynamically. We
also do not know how the neighbourhoods detected may change when data over
different time periods or different time intervals are used and what these changes
may mean.
Better understanding can come in the form of research on particular elements in the
neighbourhood detection process across a variety of methods and comparing the
differences when different elements are used. It can also come in the form of better
understanding a particular method in depth and exploring how the neighbourhoods
formed are different depending on the parameters used. In this dissertation, I look at
the Livehoods method in depth by applying variations of the method to data
collected on Greater London. The Livehoods method was chosen as it is a
venues-based approach, which has not been used as much in the literature. It is
also one of the rare methods in the neighbourhood detection literature whose
clusters / neighbourhoods have been validated with the city's residents, with
strong support found that the residents' perceptions agreed with the clusters
formed. This gives it legitimacy in being able to detect actual neighbourhoods
compared to other neighbourhood detection methods. In addition, it has not yet
been applied to the Greater London area.


3. METHODOLOGY
Python was used for most of the analysis and visualization in this work. IPython
notebooks were used for early exploration and experimentation with the data and
Python scripts were written in the later stages to run the neighbourhood detection
method. All scripts used for this work can be found in the appendix section.
3.1. Data sources
The data used for analysis consists of 42,581 Foursquare check-ins at 8,845 venues
by 12,397 unique users in the Greater London area from 6th April 2011 to 31st May
2011. This data was kindly provided by Anastasios Noulas from the University of
Cambridge. For each check-in, the data consists of the user ID, the time, the latitude
and longitude, and the venue ID. Further information on the venues was collected
using the Python package foursquare. This included information on each venue's
name, category and subcategory (as categorized by the social media network
Foursquare).
Data was also collected from 6th April 2015 to 31st May 2015 for three cities:
London, Singapore and New York City. The Python package tweepy was used to
collect data from Twitter's streaming API, which offers samples of the data being
posted on Twitter in real time. A subset of this data consists of Foursquare check-ins
from users who have linked their Foursquare accounts to their Twitter accounts such
that their Foursquare check-ins also appear as tweets on Twitter. The scripts for
collecting this data and formatting it for analysis are also included in the
appendix. While this data was eventually not used in the analysis for this work,
future work could compare the results generated across the three different cities, or
the results generated from two different time periods in London.
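As a rough sketch of what such a collection script might look like (assuming the tweepy 3.x streaming API that was current at the time, not the exact script in the appendix; the credentials are placeholders and the bounding box is an approximate one for Greater London):

    from tweepy import OAuthHandler, Stream
    from tweepy.streaming import StreamListener

    class GeoTweetListener(StreamListener):
        # Appends each raw geotagged tweet (a JSON string) to a file.
        def on_data(self, raw_json):
            with open("london_tweets.json", "a") as f:
                f.write(raw_json)
            return True

        def on_error(self, status_code):
            # Returning False disconnects the stream on rate-limit errors.
            return status_code != 420

    auth = OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # placeholder keys
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    stream = Stream(auth, GeoTweetListener())
    # locations takes [sw_lon, sw_lat, ne_lon, ne_lat] bounding-box corners.
    stream.filter(locations=[-0.51, 51.28, 0.33, 51.69])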


3.2. Data sorting, import, storage and analysis


The data was formatted using the Python package pandas, which was developed to
mimic the R software's capabilities in managing large tables of data quickly and
easily. To improve the speed of the analysis, much of the intermediate data required
was pre-generated and stored in various file formats such as JSON files, NumPy files
for matrices and pickle files created using the Python pickle package.
As each run of the method took a significant amount of time (one to two hours), an
Amazon cloud server was set up to run the multiple variations of the neighbourhood
detection method in parallel. This greatly sped up the process.
The results of the neighbourhood detection method were stored in pickle files. They
were subsequently converted to GeoJSON format and also stored in a MySQL
database using Python's sqlalchemy package for further analysis and visualization.
In parts of the process where GeoJSON files had to be manipulated, the Python
packages fiona and shapely were used to manage GeoJSON files and check for
relationships between geographic features, for example whether a particular venue
was within a particular boundary.
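For instance, a point-in-polygon test of this kind might look like the following sketch (the boundary and venue coordinates are invented; in practice the polygon would be read from a GeoJSON feature with fiona):

    from shapely.geometry import Point, shape

    # A hypothetical cluster boundary expressed as a GeoJSON-style polygon.
    boundary = shape({
        "type": "Polygon",
        "coordinates": [[(-0.15, 51.50), (-0.10, 51.50),
                         (-0.10, 51.53), (-0.15, 51.53),
                         (-0.15, 51.50)]],
    })

    venue = Point(-0.12, 51.51)      # (longitude, latitude) of a venue
    print(boundary.contains(venue))  # True -- the venue lies within the boundary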
Many of the visualizations in this work were created using Python's matplotlib and
seaborn packages. Figure 8 was created using the software R and its ggplot library.
3.3. The Livehoods method
The Livehoods method is Cranshaw et al.'s (2012) method for neighbourhood
detection using location based social media. It is a venues-based approach that
performs spectral clustering on an affinity matrix that takes both spatial affinity and
social affinity into consideration. The method seeks to fit the intuitive notion that
neighbourhoods are areas that a similar set of people frequent: the more often the
same people go to the same venues, the more likely these venues are to be in the
same neighbourhood. To validate this method, Cranshaw et al (2012) conducted
qualitative interviews with residents in their study area and verified that the
neighbourhoods generated by their method closely matched the residents'
perspectives of neighbourhoods in the city.
Specifically, I applied the following steps from Cranshaw et al (2012) to generate
the affinity matrices used in the spectral clustering algorithm:
1. Given the following sets:
   a. Set $V$, a set of $n_v$ Foursquare venues, for which we can compute a geographic distance $d(i, j)$ between the venues given their latitude and longitude coordinates.
   b. Set $U$, a set of $n_u$ Foursquare users.
   c. Set $C$, a set of check-ins of users in $U$ to the venues in $V$.
   Each venue $v$ in $V$ is then represented by an $n_u$-dimensional vector $c_v$, where the $u$-th component of $c_v$ is the number of times user $u$ checked in to $v$.
2. Compute the social similarity $s(i, j)$ between each pair of venues $i, j \in V$ by comparing the vectors $c_i$ and $c_j$. Cosine similarity was used for this measure, where
   $$s(i, j) = \frac{c_i \cdot c_j}{\lVert c_i \rVert \, \lVert c_j \rVert}$$
3. Compute an $n_v \times n_v$ affinity matrix $A$ on the venues. For a given venue $v$, let $N_m(v)$ be the $m$ closest venues to $v$ according to the geographic distance $d(v, \cdot)$ for some parameter $m$. Then we let
   $$A_{i,j} = \begin{cases} s(i, j) + \alpha, & \text{if } i \in N_m(j) \text{ or } j \in N_m(i) \\ 0, & \text{otherwise} \end{cases}$$
   where $\alpha$ is a small constant that prevents any degenerate matrices from forming. In Cranshaw et al.'s (2012) work, a value of $1 \times 10^{-2}$ was used for $\alpha$.
The affinity matrices were generated using the Python packages NumPy (Van Der
Walt et al., 2011) and SciPy (Jones et al., 2001), and spectral clustering was
performed on the affinity matrices using the Python package scikit-learn (Pedregosa
et al., 2011). To determine the number of clusters that the algorithm should create, I
used the commonly-used eigengap heuristic (Noulas et al., 2011; Planck and
Luxburg, 2006). This involved calculating the k smallest eigenvalues of the
normalized Laplacian of the affinity matrix, and setting the number of clusters as the
number where the largest difference in eigenvalues occurred.
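What follows is a condensed, self-contained sketch of these steps (not the actual clustering scripts listed in the appendix); the check-in matrix and coordinates are randomly generated toy data, and the helper function names are my own:

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.sparse.csgraph import laplacian
    from sklearn.cluster import SpectralClustering

    def livehoods_affinity(checkins, coords, m=10, alpha=0.01):
        # Affinity matrix in the spirit of Cranshaw et al. (2012).
        # checkins: (n_venues, n_users) matrix of check-in counts.
        # coords:   (n_venues, 2) array of venue latitude/longitude.

        # Social similarity: cosine similarity between venue check-in vectors.
        norms = np.maximum(np.linalg.norm(checkins, axis=1, keepdims=True), 1e-12)
        unit = checkins / norms
        social = unit @ unit.T

        # Keep only geographically m-nearest-neighbour pairs (symmetrised).
        dist = cdist(coords, coords)
        nearest = np.argsort(dist, axis=1)[:, 1:m + 1]  # skip self at column 0
        mask = np.zeros_like(dist, dtype=bool)
        mask[np.repeat(np.arange(len(coords)), m), nearest.ravel()] = True
        mask |= mask.T

        return np.where(mask, social + alpha, 0.0)

    def eigengap_n_clusters(affinity, k):
        # Eigengap heuristic: position of the largest gap among the
        # k smallest eigenvalues of the normalized Laplacian.
        L = laplacian(affinity, normed=True)
        eigvals = np.sort(np.linalg.eigvalsh(L))[:k]
        return int(np.argmax(np.diff(eigvals))) + 1

    # Hypothetical toy data: 50 venues, 200 users, random check-ins/coordinates.
    rng = np.random.default_rng(0)
    checkins = rng.poisson(0.1, size=(50, 200)).astype(float)
    coords = rng.uniform([51.3, -0.5], [51.7, 0.3], size=(50, 2))

    A = livehoods_affinity(checkins, coords)
    n_clusters = max(eigengap_n_clusters(A, k=20), 2)
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                random_state=0).fit_predict(A)

The precomputed affinity is passed directly to scikit-learn's spectral clustering, mirroring the division of labour between the affinity construction and the clustering step described above.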
The question of determining parameters such as the number of clusters to form is an
important issue for clustering algorithms (Lancichinetti and Fortunato, 2009; Planck
and Luxburg, 2006; Zelnik-Manor and Perona, 2004). For some clustering
algorithms, researchers have found that maximizing modularity is a useful technique
to guide which values to use for various parameters (Lancichinetti and Fortunato,
2009), though they also recognize that this technique has its own limitations
(Fortunato and Barthélemy, 2007; Good et al., 2010; Lancichinetti and Fortunato,
2011). For spectral clustering algorithms such as the one used in the Livehoods
method, the eigengap heuristic was developed in particular to maximize modularity
for the clusters generated (Donetti and Muñoz, 2004).
Cranshaw et al (2012) included a post-processing step after spectral clustering to
break up any cluster that spanned too large a geographic area (more than 40% of the
geographic area in their work on Pittsburgh), and redistributed the venues in those
clusters to the nearest cluster instead. In my work, the spectral clustering algorithm
typically produced one cluster that spans a large part of the city. This seems to be a
qualitatively different type of cluster where its boundaries are a reflection of what
the users of the social media platform regard as the boundaries of their city, rather
than any particular neighbourhood. As there was no theoretical reason to redistribute
the venues in this large cluster and as a result expand the boundaries of the other
clusters, I chose not to break up the large cluster.


4. ANALYZING THE LIVEHOODS METHOD


As described above, there are a number of parameters in the Livehoods method
(Cranshaw et al., 2012) that can be tuned to generate the neighbourhood boundaries:
the number of smallest eigenvalues to calculate (k), the number of nearest
neighbours (m), and the alpha constant (α). Cranshaw et al.'s (2012) values for these
parameters for the Pittsburgh metropolitan area were 45, 10 and 0.01 respectively.
Cranshaw et al (2012) acknowledged that tuning the clusters is non-trivial and may
lead to experimenter bias. As such, it is worth exploring how tuning the parameters
affects the resulting neighbourhoods formed to better understand the Livehoods
method.
4.1. Tuning the number of smallest eigenvalues (k)
In general, as the value for k increased, the total number of clusters formed
increased as well. Figure 1 illustrates the relationship between k and the total
number of clusters formed using the eigengap heuristic, for values of k from 0 to
200 and Cranshaw et al.'s (2012) values of 0.01 for the alpha constant and 10 for the
number of nearest neighbours. The number of clusters formed increases at certain
threshold values of k, and remains constant until the next threshold is reached. The
threshold values for k in this case are 7, 9, 13, 25, 43, 74 and 101 with the
corresponding values for the number of clusters formed being 5, 7, 11, 23, 41, 72
and 99.
Figure 2 shows the boundaries of the clusters that are formed when the 7 different
values are used in the Livehoods method, with m = 10 and α = 0.01. As the number
of clusters created increases, the larger clusters tend to break up into smaller and
smaller clusters. The areas near the centre of the city tend to be broken up first, and
continue to be broken up into smaller clusters as the number of clusters increase.

The clusters nearer to the edges of the city tend to remain large and unbroken.
Generally, the clusters formed nearer the edge of the city are larger than the clusters
formed nearer the centre of the city. This phenomenon is likely because the density
of venues further from the centre of the city is much lower than the density of
venues nearer the centre of the city. Since the Livehoods method uses a nearest
neighbours criterion for identifying adjacent venues, areas where venues are less
dense will cover larger areas when searching for adjacent venues and result in the
method creating boundaries with larger areas. Many of the clusters formed when
there are a higher number of clusters are either subsets of the clusters formed using a
lower number of clusters, or very similar to the clusters formed using a lower
number of clusters. The clear exception occurs where k = 74 and 72 clusters are
formed: a previously undetected large cluster appears. This is the qualitatively
different cluster mentioned earlier.
Donetti and Muñoz (2004) have pointed out that the weakest part of the eigengap
heuristic is that we do not know a priori how many eigenvalues (k in the Livehoods
method) should be calculated. While Cranshaw et al (2012) also have not provided
any guidelines on how to choose the right value of k for cities of different sizes,
cities occupying a larger area can be expected to contain more neighbourhoods, so
larger values of k should be used. As the Greater London area is much larger
than Pittsburgh, k should be larger than 45. A k value of 100 was arbitrarily chosen
in this work to test the effects of tuning the nearest neighbour parameter and the
alpha constant, to reflect the possibility of a higher number of neighbourhoods in
London. An even higher value may be more suitable as London is many times larger
than Pittsburgh, but this value was used to keep computation requirements
manageable.

Figure 1: Relationship between number of smallest eigenvalues (k) found and number of clusters formed


Figure 2: Boundaries formed for different numbers of clusters. Panels: 5 clusters (k = 7); 7 clusters (k = 9); 11 clusters (k = 13); 23 clusters (k = 25); 41 clusters (k = 43); 72 clusters (k = 74); 99 clusters (k = 101).

4.2. Tuning the alpha constant (α)

To see if the alpha constant influenced the clusters formed using the Livehoods
method, clusters were formed with k = 100, m = 10 and α varying from 0.00 to 0.05.
In general, there was little difference in the clusters formed. Figure 3 depicts the
boundaries formed using the various alpha constants. Almost all clusters formed are
consistent or highly similar at the different alpha values. In certain rare instances,
some clusters are merged or subdivided into two clusters. This shows that varying
the alpha constant between 0.00 and 0.05 does not greatly influence the boundaries
formed. A clear exception occurs with the largest cluster in the shift from α = 0.00
to α = 0.01: it expands greatly to include many other parts of the Greater London
area. This boundary remains consistent as α increases. This behaviour again
highlights the qualitatively different nature of this cluster.
Figure 3: Boundaries formed for different alpha constants. Panels: α = 0.00; α = 0.01; α = 0.02; α = 0.03; α = 0.04; α = 0.05.

4.3. Tuning the nearest neighbours parameter (m)


To see if the nearest neighbours parameter influenced the clusters formed using the
Livehoods method, clusters were formed with k = 100, α = 0.01, and m varying
from 5 to 20. Figure 4 depicts the boundaries formed for some of the values used.
When m = 5, the boundaries formed overlap many of the other boundaries. As m
increases, the number of overlaps decreases and more stable clusters are formed. For
m = 8 to m = 20, the clusters formed are largely consistent with each other. Smaller
clusters with a high density of venues are more consistent than larger clusters with a
low density of venues. The largest cluster changes in shape and size at different
levels of m. It is hard to determine the optimal value for m, but values of 8
and higher seem to generate reasonably consistent clusters.
Figure 4: Boundaries formed for different nearest neighbours parameters (m). Panels: m = 5; m = 8; m = 10; m = 15; m = 18; m = 20.

4.4. Using cosine similarity


It has been mentioned earlier that cosine similarity was preferred over other
similarity measures because cosine similarity does not take the magnitude of vectors
into account. In the case of forming neighbourhoods and determining venue
similarity, the relative frequency of user check-ins at each venue and across
venues matters more than the total number of user check-ins at each venue.
Similarity measures that include magnitude, such as Euclidean distance, are thus
less suitable than the cosine similarity measure. Using the Jaccard similarity, a
related magnitude-insensitive measure, produced results similar to the cosine
similarity measure.
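For reference, a minimal sketch of the Jaccard measure as it might be applied here (my own formulation, comparing the sets of users who checked in at each venue rather than the raw counts; the venue vectors are invented):

    import numpy as np

    def jaccard_similarity(a, b):
        # Jaccard similarity between the sets of users with at least one
        # check-in at each of two venues (a, b are check-in count vectors).
        users_a, users_b = a > 0, b > 0
        union = np.sum(users_a | users_b)
        return np.sum(users_a & users_b) / union if union else 0.0

    venue_a = np.array([2, 0, 1, 4])   # check-ins by users u1..u4
    venue_b = np.array([20, 0, 10, 40])
    print(jaccard_similarity(venue_a, venue_b))  # 1.0 -- same set of users

Like cosine similarity, this measure ignores how many check-ins each user made, which is consistent with the similar results the two measures produced.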


4.5. Nearest neighbours versus full similarity graph


The k-nearest neighbours similarity graph was chosen for constructing the affinity
matrix instead of the full similarity graph as the k-nearest neighbours graph better
captured check-in behaviour in neighbourhoods. While individuals have regular
mobility patterns and often return to a few highly frequented locations such as home,
school or work (Gonzlez et al., 2008), this differs from their check-in behaviour on
location based social media networks 60% to 80% of check-ins occur at places
that were not visited before by individual users (Noulas et al., 2012). Using the full
similarity graph meant that most of the similarity captured would relate to new
places that the users visited over the time period. This would create clusters of
venues that related to types of places that groups of users preferred to visit, such as
museums, nightspots and stadiums, and generate boundaries that span most of the
city. These boundaries cannot be classified as neighbourhoods, given that they
overlap each other greatly and cover areas that are similar to each other.
The nearest neighbours graph, on the other hand, captures similarity relating to users
who visited sets of venues close to one another. The boundaries formed often have
clear separation from each other and there is very little overlap in terms of area
covered by the boundaries. These boundaries better fit the intuitive notion of
neighbourhoods in a city.
4.6. Summary
Through an investigation of the Livehoods method, I have found that using different
alpha values from 0.01 to 0.05 and nearest neighbours parameters above 8 generally
does not affect the clusters formed. I have also found that using different
values for the number of smallest eigenvalues changes the resulting number of
clusters formed, with more clusters being formed when the number of eigenvalues
increases. The investigation also revealed that two types of clusters may be formed
by the method. One type of cluster is the contiguous geographic space that can be
associated with neighbourhoods, while the other type is large and spans the entire
city.
In the next two sections, I will use one of the sets of clusters / neighbourhoods
generated by the Livehoods method to illustrate the types of information that can be
derived from clusters formed using the Livehoods method, and neighbourhood
detection methods in general. In section 5, I combine the clusters formed with data
from Foursquare's venues database and use it to describe the types of venues and
activities that take place within the cluster. Incorporating information from location
based social media to better understand the clusters / neighbourhoods formed is
common for researchers using neighbourhood detection methods.
In section 6, I attempt to combine the clusters / neighbourhoods formed using the
Livehoods method with data from administrative boundaries (the Greater London
Lower Super Output Areas in this case) and determine the ethnic diversity of the
clusters / neighbourhoods formed. Integrating clusters / neighbourhoods detected
using neighbourhood detection methods with data from administrative boundaries is rare in
the neighbourhood detection literature as most researchers using neighbourhood
detection methods have used them for developing recommendation engines that find
similar places based on social media activity. My attempt tries to add more meaning
to the clusters formed so that they can be used for other purposes, such as
investigating ethnic diversity issues within neighbourhoods.


5. DESCRIPTION OF LIVEHOOD CLUSTERS / NEIGHBOURHOODS


5.1. Overview of neighbourhoods
For comparison, the Livehoods method was applied to the Foursquare data with k =
100, α = 0.01, and m = 10. For the Greater London area, 72 clusters were generated.
Their boundaries are depicted in Figure 5. The numbers on the clusters will be used
as a reference for labelling and describing the results below. As mentioned earlier,
the largest cluster formed (cluster 66 in this case) is not depicted in the figures as it
is a qualitatively different type of cluster, and not included when describing the
clustering results. The boundaries for this cluster can be found in the appendix.
Table 1 contains summary statistics related to each cluster. The area for each cluster
ranged from 0.11 square kilometers (cluster 48) to 203 square kilometers (cluster
18) with a median of 1.86 square kilometers per cluster. While tests (using Python's
powerlaw package) show no support for a power law distribution, the distribution is
highly skewed with many small clusters and a few huge clusters. The huge clusters
also tend to have low density in terms of checkins and venues, and as such they
could be an artefact of the nearest neighbours proximity criterion. In sparse areas,
the nearest neighbours tend to be further apart from each other than in dense areas,
thus venues far apart from each other are more likely to be linked and clustered
together.
Figures 6a to 6c depict properties of the clusters in terms of absolute numbers - the
number of venues in each cluster ranged from 16 (cluster 45) to 279 (cluster 38)
with a median of 129.0; the number of check-ins in each cluster ranged from 43
(cluster 45) to 5147 (cluster 2) with a median of 412; and the number of unique
users checking-in in each cluster ranged from 10 (cluster 45) to 2585 (cluster 2) with
a median of 230. Figures 6d to 6f depict properties of the clusters relative to the area
of the cluster and the number of venues in the cluster - the number of venues per
square kilometer ranged from 0.77 (cluster 18) to 1,304.61 (cluster 7) with a median
of 43.95; the number of checkins per venue ranged from 1.26 (cluster 65) to 40.09
(cluster 26) with a median of 3.23; and the number of unique users per venue ranged
from 0.55 (cluster 67) to 19.52 (cluster 16) with a median of 1.89.
Many of the distributions of cluster properties are highly skewed. Clusters 2, 13, 16
and 26 are particularly active clusters and are in the top 5 in terms of users and
checkins across all clusters, whether in absolute terms or on a per venue basis.
Collectively, the four clusters account for 29.5% of all checkins from 60% of unique
users despite containing only 5.7% of all venues across the city. This is
understandable for clusters 2 and 13 as they are in the city centre, and cluster 26 as it
is at Heathrow airport. Cluster 16 consists of Wembley stadium, and it is likely that
it had such high values for users and checkins during that period as it was the host
for the 2011 UEFA Champions League Final on 28th May 2011, which is within the
period of analysis. People attending this event are highly likely to checkin on social
media as it is a rare and meaningful event for them. Under more normal
circumstances, cluster 16 likely would have values closer to the median.
Across all clusters, cluster 18 stands out with the largest area and relatively low
frequencies of users and venues over such a large area. It could be classified as an
outlier, but results for the cluster have been included for completeness. In addition,
all variations of the Livehoods method detect this cluster or a cluster similar to this
cluster. This is more likely an artefact of using the nearest neighbours proximity
criterion as discussed above.
Figure 5: Clustering results for London (panels: Greater London area; City area)
Table 1: Summary statistics for cluster results for London
Note: Cluster 66, the large city-spanning cluster, is excluded, as described above.

Cluster | Area (sq km) | Check-ins | Users | Venues | Check-ins per sq km | Users per sq km | Venues per sq km | Check-ins per venue | Check-ins per user | Users per venue
0 | 0.69 | 1002 | 641 | 238 | 1447.35 | 925.9 | 343.78 | 4.21 | 1.56 | 2.69
1 | 0.89 | 469 | 321 | 165 | 527.2 | 360.84 | 185.48 | 2.84 | 1.46 | 1.95
2 | 1.25 | 5147 | 2585 | 161 | 4121.23 | 2069.82 | 128.91 | 31.97 | 1.99 | 16.06
3 | 26.83 | 356 | 178 | 180 | 13.27 | 6.63 | 6.71 | 1.98 | 2 | 0.99
4 | 2.95 | 851 | 450 | 163 | 288.6 | 152.61 | 55.28 | 5.22 | 1.89 | 2.76
5 | 0.75 | 462 | 230 | 102 | 616.58 | 306.95 | 136.13 | 4.53 | 2.01 | 2.25
6 | 2.19 | 1055 | 556 | 239 | 481.71 | 253.87 | 109.13 | 4.41 | 1.9 | 2.33
7 | 0.16 | 695 | 447 | 215 | 4217.23 | 2712.38 | 1304.61 | 3.23 | 1.55 | 2.08
8 | 0.82 | 754 | 493 | 195 | 924.93 | 604.76 | 239.21 | 3.87 | 1.53 | 2.53
9 | 1.77 | 610 | 325 | 241 | 344.83 | 183.72 | 136.24 | 2.53 | 1.88 | 1.35
10 | 1.5 | 806 | 409 | 253 | 536.37 | 272.18 | 168.36 | 3.19 | 1.97 | 1.62
11 | 0.6 | 967 | 622 | 231 | 1602.32 | 1030.65 | 382.77 | 4.19 | 1.55 | 2.69
12 | 1.09 | 294 | 163 | 120 | 270.77 | 150.12 | 110.52 | 2.45 | 1.8 | 1.36
13 | 2.73 | 2888 | 2032 | 202 | 1056.98 | 743.7 | 73.93 | 14.3 | 1.42 | 10.06
14 | 4.62 | 540 | 213 | 155 | 116.81 | 46.07 | 33.53 | 3.48 | 2.54 | 1.37
15 | 0.62 | 1357 | 578 | 108 | 2184.13 | 930.31 | 173.83 | 12.56 | 2.35 | 5.35
16 | 22.55 | 3508 | 1737 | 89 | 155.54 | 77.01 | 3.95 | 39.42 | 2.02 | 19.52
17 | 1.74 | 691 | 322 | 165 | 396.12 | 184.59 | 94.59 | 4.19 | 2.15 | 1.95
18 | 203.11 | 257 | 110 | 157 | 1.27 | 0.54 | 0.77 | 1.64 | 2.34 | 0.7
19 | 0.88 | 248 | 154 | 101 | 280.51 | 174.19 | 114.24 | 2.46 | 1.61 | 1.52
20 | 2.08 | 556 | 296 | 154 | 267.1 | 142.2 | 73.98 | 3.61 | 1.88 | 1.92
21 | 23.94 | 831 | 398 | 257 | 34.71 | 16.63 | 10.74 | 3.23 | 2.09 | 1.55
22 | 12.1 | 453 | 304 | 157 | 37.43 | 25.12 | 12.97 | 2.89 | 1.49 | 1.94
23 | 4.7 | 378 | 168 | 139 | 80.49 | 35.78 | 29.6 | 2.72 | 2.25 | 1.21
24 | 1.56 | 464 | 296 | 123 | 296.6 | 189.21 | 78.62 | 3.77 | 1.57 | 2.41
25 | 42.64 | 285 | 121 | 135 | 6.68 | 2.84 | 3.17 | 2.11 | 2.36 | 0.9
26 | 0.35 | 2165 | 975 | 54 | 6131.41 | 2761.26 | 152.93 | 40.09 | 2.22 | 18.06
27 | 0.41 | 348 | 235 | 163 | 844.05 | 569.97 | 395.34 | 2.13 | 1.48 | 1.44
28 | 0.31 | 167 | 117 | 48 | 543.27 | 380.61 | 156.15 | 3.48 | 1.43 | 2.44
29 | 1.24 | 827 | 384 | 54 | 668.99 | 310.63 | 43.68 | 15.31 | 2.15 | 7.11
30 | 1.71 | 1921 | 547 | 148 | 1126.03 | 320.63 | 86.75 | 12.98 | 3.51 | 3.7
31 | 0.75 | 160 | 124 | 31 | 214.22 | 166.02 | 41.5 | 5.16 | 1.29 | 4
32 | 136.96 | 432 | 340 | 131 | 3.15 | 2.48 | 0.96 | 3.3 | 1.27 | 2.6
33 | 25.62 | 405 | 224 | 141 | 15.81 | 8.74 | 5.5 | 2.87 | 1.81 | 1.59
34 | 0.21 | 637 | 394 | 188 | 3098.25 | 1916.34 | 914.4 | 3.39 | 1.62 | 2.1
35 | 0.15 | 181 | 94 | 38 | 1197.88 | 622.1 | 251.49 | 4.76 | 1.93 | 2.47
36 | 22.11 | 321 | 140 | 93 | 14.52 | 6.33 | 4.21 | 3.45 | 2.29 | 1.51
37 | 0.6 | 358 | 183 | 73 | 600.17 | 306.79 | 122.38 | 4.9 | 1.96 | 2.51
38 | 0.32 | 1169 | 740 | 279 | 3624.81 | 2294.57 | 865.12 | 4.19 | 1.58 | 2.65
39 | 1.4 | 1366 | 622 | 161 | 974.53 | 443.75 | 114.86 | 8.48 | 2.2 | 3.86
40 | 8.27 | 179 | 69 | 81 | 21.65 | 8.34 | 9.8 | 2.21 | 2.59 | 0.85
41 | 5.94 | 144 | 82 | 87 | 24.23 | 13.79 | 14.64 | 1.66 | 1.76 | 0.94
42 | 0.28 | 481 | 311 | 75 | 1702.65 | 1100.88 | 265.49 | 6.41 | 1.55 | 4.15
43 | 1.86 | 172 | 134 | 29 | 92.24 | 71.87 | 15.55 | 5.93 | 1.28 | 4.62
44 | 75.25 | 167 | 69 | 99 | 2.22 | 0.92 | 1.32 | 1.69 | 2.42 | 0.7
45 | 1.13 | 43 | 10 | 16 | 38.16 | 8.88 | 14.2 | 2.69 | 4.3 | 0.62
46 | 6.48 | 65 | 30 | 40 | 10.03 | 4.63 | 6.17 | 1.62 | 2.17 | 0.75
47 | 11.88 | 315 | 149 | 144 | 26.51 | 12.54 | 12.12 | 2.19 | 2.11 | 1.03
48 | 0.11 | 199 | 155 | 36 | 1761.06 | 1371.68 | 318.58 | 5.53 | 1.28 | 4.31
49 | 31.95 | 173 | 86 | 89 | 5.42 | 2.69 | 2.79 | 1.94 | 2.01 | 0.97
50 | 0.66 | 255 | 117 | 99 | 387.71 | 177.89 | 150.52 | 2.58 | 2.18 | 1.18
51 | 0.55 | 385 | 248 | 131 | 705.65 | 454.55 | 240.1 | 2.94 | 1.55 | 1.89
52 | 39.21 | 775 | 287 | 129 | 19.77 | 7.32 | 3.29 | 6.01 | 2.7 | 2.22
53 | 1.12 | 751 | 413 | 209 | 670.36 | 368.65 | 186.56 | 3.59 | 1.82 | 1.98
54 | 87.89 | 202 | 93 | 107 | 2.3 | 1.06 | 1.22 | 1.89 | 2.17 | 0.87
55 | 5.6 | 316 | 98 | 123 | 56.39 | 17.49 | 21.95 | 2.57 | 3.22 | 0.8
56 | 18.86 | 551 | 287 | 200 | 29.21 | 15.21 | 10.6 | 2.76 | 1.92 | 1.44
57 | 1.12 | 189 | 105 | 79 | 168.69 | 93.72 | 70.51 | 2.39 | 1.8 | 1.33
58 | 0.33 | 766 | 444 | 132 | 2296.85 | 1331.33 | 395.8 | 5.8 | 1.73 | 3.36
59 | 21.86 | 412 | 195 | 193 | 18.85 | 8.92 | 8.83 | 2.13 | 2.11 | 1.01
60 | 47.01 | 228 | 88 | 107 | 4.85 | 1.87 | 2.28 | 2.13 | 2.59 | 0.82
61 | 1.27 | 115 | 60 | 56 | 90.25 | 47.08 | 43.95 | 2.05 | 1.92 | 1.07
62 | 1.99 | 181 | 56 | 66 | 90.82 | 28.1 | 33.12 | 2.74 | 3.23 | 0.85
63 | 9.31 | 47 | 20 | 28 | 5.05 | 2.15 | 3.01 | 1.68 | 2.35 | 0.71
64 | 8.39 | 1325 | 681 | 261 | 157.85 | 81.13 | 31.09 | 5.08 | 1.95 | 2.61
65 | 10.86 | 54 | 31 | 43 | 4.97 | 2.86 | 3.96 | 1.26 | 1.74 | 0.72
67 | 33.75 | 99 | 28 | 51 | 2.93 | 0.83 | 1.51 | 1.94 | 3.54 | 0.55
68 | 14.95 | 103 | 44 | 38 | 6.89 | 2.94 | 2.54 | 2.71 | 2.34 | 1.16
69 | 4.78 | 113 | 76 | 73 | 23.62 | 15.89 | 15.26 | 1.55 | 1.49 | 1.04
70 | 0.5 | 699 | 367 | 115 | 1388.01 | 728.75 | 228.36 | 6.08 | 1.9 | 3.19
71 | 34.32 | 532 | 323 | 221 | 15.5 | 9.41 | 6.44 | 2.41 | 1.65 | 1.46
Figure 6: Properties of Livehood clusters
5.2. Breakdown of individual neighbourhoods
The venues within each cluster are venues that can be found on the location based
social network Foursquare. Foursquare categorizes its venues in a category hierarchy
with three levels. The 10 main categories at the top of the hierarchy are: Arts &
Entertainment, College & University, Event, Food, Nightlife Spot, Outdoors &
Recreation, Professional & Other Places, Residence, Shop & Service, and Travel &
Transport. Each of these 10 main categories has its own subcategories, which
themselves can be further subcategorized. There are more than 200 subcategories and
sub-subcategories altogether. As places may be referred to at different levels of
granularity, some venues may not have a sub-subcategory. For example, London
Heathrow's Terminal 5 falls in the Travel & Transport main category, the airport
subcategory, and the airport terminal sub-subcategory. London Heathrow Airport,
on the other hand, falls in the same main category and subcategory but does not have a
sub-subcategory.
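The appendix script foursquare_search_place.ipynb flattens this hierarchy into lookup dictionaries; a minimal sketch of the same idea, assuming categories is the nested list returned by Foursquare's venue categories endpoint:

def flatten_categories(categories, parent=None, depth=0):
    # yields (category id, name, depth, parent id) for every node in the
    # three-level Foursquare category tree
    for cat in categories:
        yield cat['id'], cat['name'], depth, parent
        for node in flatten_categories(cat.get('categories', []), cat['id'], depth + 1):
            yield node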
We can gain insight into the makeup of the city by creating city profiles using
information on the venue categories and the behaviour of users of location based
social media networks. To calculate the distribution of venues / checkins by category
for the city, the value for each category (A) was calculated as:

$$A = \frac{\text{no. of venues (or checkins) in the category across the city}}{\text{total no. of venues (or checkins) across the city}} \times 100$$
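As a concrete sketch of this calculation (the DataFrame and column names are hypothetical, not those used in the original analysis), assuming df is a pandas DataFrame with one row per check-in and columns venueId and maincat:

import pandas as pd

def city_profile(df):
    # % of check-ins per main category across the whole area covered by df
    checkin_pct = df['maincat'].value_counts(normalize=True) * 100
    # % of venues per main category, counting each venue once
    venue_pct = (df.drop_duplicates('venueId')['maincat']
                   .value_counts(normalize=True) * 100)
    return pd.DataFrame({'% of venues': venue_pct, '% of checkins': checkin_pct})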

Figure 7 shows the overall distribution of venues and checkins across all clusters
according to Foursquare's main categories in percentage values. 29.23% of venues in
the data are in the food category, followed by 17.05% of venues in the nightlife spot
category. Users, however, check-in mostly at venues related to travel & transport
(23.04%), professional & other places (18.86%), and arts & entertainment venues
(15.68%). From here, we can observe that venues in the travel & transport, professional
& other places, nightlife spot and arts & entertainment categories receive a
disproportionate number of checkins. This means that clusters formed based on
Foursquare checkins are likely to be biased towards venues in these categories, and
may be more suitable for research questions related to such categories (e.g. transport,
culture).
Figure 7: Overall distribution of venues and checkins across clusters (% of venues vs. % of checkins)
Similar profiles can be created for each cluster to form neighbourhood profiles. To
calculate the distribution of venues / checkins by category within a neighbourhood, the
value for each category (B) was calculated as:

$$B = \frac{\text{no. of venues (or checkins) in the category within the cluster}}{\text{total no. of venues (or checkins) within the cluster}} \times 100$$

This gives a sense of the type of venues in the clusters and the type of activities that
occur within them. These neighbourhood profiles were compared with the city profile
to understand which categories within the neighbourhood were overrepresented /
underrepresented. For each category, the percentage difference was calculated as:

$$\text{difference} = \frac{B - A}{A} \times 100$$
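Continuing the hypothetical sketch above, the cluster profiles (B) and their percentage differences from the city profile (A) could be computed as follows, assuming the check-in DataFrame now carries a cluster column:

def cluster_vs_city(df):
    city = city_profile(df)  # A: distribution across the whole city
    differences = {}
    for cluster_id, group in df.groupby('cluster'):
        profile = city_profile(group)  # B: distribution within the cluster
        differences[cluster_id] = (profile - city) / city * 100
    return differences

Categories absent from a cluster come out as missing values, matching the empty cells in the tables below.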

Tables 2 and 3 contain the percentage difference figures for all clusters for venues and
checkins respectively, with the highest positive difference for each cluster highlighted.
These percentage differences were used to determine which types of venues occurred
more frequently within each cluster and at which types of venues users checked in
more frequently. For example, clusters 28 and 29 have more venues
and checkins in the travel and transport category, as these clusters are essentially the
London Heathrow airport terminals, which we expect to have a higher concentration of
venues and checkins related to travel and transport. Another example is clusters with
high levels of concentration of venues and checkins in the college & university category.
Clusters 27, 46 and 47 have percentage difference figures of over 1000% for users
checking-in, and they contain University College London, Brunel University London,
and the Queen Mary University of London respectively.
From tables 2 and 3, we again observe differences between checkin behaviour and types
of venues. For many clusters, the most overrepresented category in terms of venues is
different from the most overrepresented category in terms of checkins. Cluster 3, for
example, would be characterised as a cluster in the outdoors & recreation category in
terms of venues, and as a cluster in the residence category in terms of checkins.
Table 2: Percentage difference between proportion of venues within cluster to proportion of venues
within city in terms of Foursquare's main categories
Note: Empty cells indicate that the cluster did not contain venues in that category
Columns: Cluster | Arts & Entertainment | College & University | Food | Nightlife Spot | Outdoors & Recreation | Professional & Other Places | Residence | Shop & Service | Travel & Transport
[Per-cluster figures not reproduced: the table body did not survive extraction intact.]
Table 3: Percentage difference between proportion of users within cluster checking-in to proportion of users
within city checking-in in terms of Foursquare's main categories
Note: Empty cells indicate that the cluster did not contain checkins at venues in that category
Columns: Cluster | Arts & Entertainment | College & University | Food | Nightlife Spot | Outdoors & Recreation | Professional & Other Places | Residence | Shop & Service | Travel & Transport
[Per-cluster figures not reproduced: the table body did not survive extraction intact.]
6. COMPARING LIVEHOODS CLUSTERS TO LOWER SUPER OUTPUT AREAS
One aspect of neighbourhood detection using location based social media that is
underexplored is how the clusters formed using these methods compare to more
traditional administrative boundaries, and how data from traditional administrative
boundaries can be used to make clusters more informative and useful. To illustrate how
data from administrative boundaries can enhance clusters formed using neighbourhood
detection, I combine the clusters from the previous section with data from the 2011
census. The administrative boundaries used in this illustration come from the Lower
Layer Super Output Areas (LSOAs) for the 2011 Census in the UK. LSOAs were chosen
as they are designed to be more stable over time and consistent in size (Sturgis et al.,
2013), and thus provide an interesting contrast to clusters formed using the Livehoods
method and neighbourhood detection methods in general, as the clusters formed tend to
be dynamic and inconsistent in size. In 2011, there were 34,753 LSOAs in England and
Wales (Stokes, n.d.). Of these, I extracted data for 4,835 LSOAs from the Greater
London area.
For this illustration, I attempted to calculate the ethnic diversity of the clusters formed
using the Livehoods method. Ethnic diversity was chosen as the ethnic composition of
local neighbourhoods is often tied to issues of interpersonal trust and community
cohesion. Sturgis et al (2013), for example, studied how neighbourhood ethnic diversity
in London is related to perceived social cohesion of neighbourhood residents. In their
research, they used Londons LSOAs as proxies for neighbourhood boundaries. To
measure ethnic diversity, they use Hirschmans (1964) concentration index (HI):
54

where si is the share of ethnic group i, out of a total of n ethnic groups. This can be
interpreted as the probability that two randomly selected individuals from the same
area are of different ethnic origin (Sturgis et al., 2013). A higher score reflects a more
ethnically diverse population.
I use the same Hirschman concentration index to calculate the scores for London's
LSOAs using counts of the following ethnic groups: white, mixed/multiple ethnic
groups, Asian/Asian British, Black/African/Caribbean/Black British, Others.
To calculate the Hirschman concentration index for the clusters formed using the
Livehoods method, I summed up the counts of the ethnic groups from LSOAs that
intersected the cluster, even if just partially, and calculated the Hirschman concentration
index based on those sums.
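As a sketch of this intersect-and-sum step (under stated assumptions: lsoas and clusters are hypothetical GeoDataFrames in the same coordinate reference system, lsoas carries one population-count column per ethnic group, and clusters has a cluster_id column; note that gpd.sjoin's predicate argument is named op in older geopandas releases):

import geopandas as gpd

ETHNIC_COLS = ['white', 'mixed', 'asian', 'black', 'other']  # hypothetical column names

def hirschman_index(counts):
    # HI = 1 - sum of squared group shares
    shares = counts / counts.sum()
    return 1 - (shares ** 2).sum()

# attach every LSOA to each cluster it intersects, even if only partially
joined = gpd.sjoin(lsoas, clusters[['cluster_id', 'geometry']],
                   how='inner', predicate='intersects')

# sum the group counts over the intersecting LSOAs, then compute HI per cluster
cluster_hi = (joined.groupby('cluster_id')[ETHNIC_COLS].sum()
                    .apply(hirschman_index, axis=1))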
Figure 8 shows the Hirschman concentration index (HI) for each cluster along with the
average HI value for all LSOAs that intersect the cluster and the maximum and
minimum HI values amongst LSOAs within the cluster.

Figure 8: Hirschman concentration index (HI) for clusters
From the figure, we see that HI values for clusters are generally lower than the average
HI values for the LSOAs within the clusters; that is, each cluster appears less diverse
than its component parts. In some cases, such as clusters 16 and 52, the cluster HI
value is even lower than the minimum HI value amongst all LSOAs within the cluster.
It is clear that the picture of ethnic diversity is different depending on whether LSOAs
or the clusters from the Livehoods method are used, and which measure to use depends
on the research question being asked. This is expected whenever different boundaries
are used and is related to the modifiable areal unit problem (Openshaw, 1984), an issue
in spatial analysis where there is "an almost infinite number of different ways by which
a geographical region of interest can be areally divided" (Openshaw, 1984), and using
different areal units may affect the results of geographical studies. One way to manage
the problem is to rely on theory to select areal units that are relevant to the purpose of
the study (Openshaw, 1984).
In the case of measuring the effect of ethnic diversity on social cohesion, the idea that
people in the same neighbourhoods go to similar places may be important, as they then
come into contact with each other at these places, whether it is the neighbourhood
supermarket, bus stop or school. If this is indeed an important factor, the clusters
created by the Livehoods method may be a more suitable unit of analysis than LSOAs
as the clusters have been created to reflect areas where a similar set of people frequent,
while the LSOAs do not reflect such information. While not attempted in this work, it
would be interesting to replicate Sturgis et al.'s (2013) research using the clusters
created using the Livehoods method and compare the results to the original results.
Other studies on neighbourhood characteristics and effects could similarly benefit from
the perspective offered by the clusters created using the Livehoods method. Examples
of other domains where this perspective could be useful are areas where
neighbourhood diversity is considered important, such as the effects of racial, ethnic,
religious and/or socioeconomic diversity on social trust, cohesion, crime and/or
voting behaviour within neighbourhoods. In these cases, the idea of neighbourhood may
be better represented by clusters / neighbourhoods formed using the Livehoods method
instead of LSOAs. Other neighbourhood detection methods may also produce clusters /
neighbourhoods that are more suitable than the LSOAs.
7. CONCLUSION
7.1. Concluding Remarks
In this work, I have argued that social media is a useful source of data for defining
neighbourhoods as it provides rich contextual information on user activity at different
times of day. The neighbourhood boundaries formed by neighbourhood detection
methods are dynamic and contain much contextual information about the city, and as
such can be useful for research and analysis in social science, policymaking, and urban
planning. While there is some research on how to generate neighbourhood boundaries
using location based social media, our understanding of these methods is limited and we
need to better understand these methods so that the methods and the boundaries/clusters
generated can be put to better use.
I pointed out that neighbourhood detection methods using location based social media
generally have three elements: the unit used for aggregation, the type of clustering
method, and the similarity measures used. To illustrate how we can better understand
neighbourhood detection methods, I undertook an in-depth exploration of Cranshaw et
al.'s (2012) Livehoods method using Foursquare data from London and analysed the
various elements in the method. Through the analysis, I found that the method is
relatively robust even when the parameters in the method are tweaked. I also found that
the method may generate two qualitatively different types of clusters, where one type is
contiguous geographic spaces that can be associated with neighbourhoods, and the other
type may reflect the boundaries of the city as perceived by Foursquare users.
I then illustrated some types of information that clusters generated using Livehoods
method could provide, in terms of information derived from just social media, and also
in terms of information derived by combining the clusters with Lower Super Output
Area (LSOA) data. In the latter case, I showed how the ethnic diversity score can differ
based on whether the Livehoods clusters or the LSOA boundaries were used, and
argued that Livehoods clusters may be more suitable than LSOA boundaries in
situations where the concept of neighbourhoods encompasses the idea that
neighbourhoods are a set of places that a similar set of people go to. For social scientists
and policy makers, this may be cases where they wish to implement or evaluate policies
and programmes that are neighbourhood based and the effects are influenced by ethnic
diversity.
Beyond the Livehoods method, other methods can be investigated in a similar manner,
such as the Hoodsquare (Zhang et al., 2013) and LiveCities (Del Bimbo et al., 2014)
methods. In particular, it is important to better understand how tuning parameters for a
particular neighbourhood detection method changes the clusters formed. It is also
important to understand the type of information that is contained within the clusters
formed.
7.2. Limitations and Future Research
There were many other lines of inquiry that have not been explored regarding the
investigation of the Livehoods method. One example is how the clusters formed change
over time. The method could be applied over different time scales (e.g. weeks or
months) or different times of day (e.g. morning/night). This could give us an idea of
how dynamic or consistent neighbourhood detection using location based social media
is. A second line of inquiry would be to collect Foursquare data from different cities
and compare the results of tuning the Livehoods parameters in these cities. This
comparison would give us an idea of the method's robustness across cities and provide
clues as to how characteristics of cities may influence neighbourhood detection methods in
general. A third line of inquiry would be to generate clusters using distance as a
proximity criterion instead of nearest neighbours. From the analysis in this work, using
the nearest neighbours as a proximity criterion seems to lead to larger clusters being
formed in areas where Foursquare venues are less dense. Clusters formed using the two
types of proximity criterion could be compared to provide insight for when it may be
better to use either criterion.
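As a sketch of what swapping the proximity criterion might involve (coords and the parameter values are assumptions for illustration, not part of the Livehoods method), scikit-learn provides both graph constructions directly:

from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

# coords: (n_venues, 2) array of projected venue coordinates (hypothetical)
# nearest neighbours criterion: always links each venue to its 10 nearest venues,
# however far away they are - in sparse areas the links stretch
knn_graph = kneighbors_graph(coords, n_neighbors=10, mode='connectivity')

# distance criterion: links venues within 500 m of each other; in sparse areas
# venues simply remain unlinked instead of being pulled into large clusters
radius_graph = radius_neighbors_graph(coords, radius=500.0, mode='connectivity')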
As mentioned earlier, better understanding neighbourhood detection methods can also
be about investigating certain elements of the neighbourhood detection process across a
number of methods. For example, it would be useful to consider how the venues-based
approach compares against the grid-based approach in terms of unit aggregation. While
the Hoodsquare method did not take a strict grid-based approach, Zhang et al (2013)
showed how a comparison could be done by comparing their Hoodsquare method with
the Livehoods method in terms of which could better predict a user's home
neighbourhood.
In terms of the type of clustering method, we could learn from the community detection
literature on how to compare clustering algorithms. Lancichinetti and Fortunato
(2009), for example, evaluated a variety of clustering algorithms against benchmark
graphs in the community detection literature. Is there a
similar set of benchmark clusters/neighbourhoods that neighbourhood detection
methods could be evaluated against? What would the evaluation criteria be?
In terms of similarity measures used, it is especially important to look at what types of
data are useful for detecting neighbourhoods. The Livehoods method uses a limited set
of information to detect neighbourhoods, and it would be helpful to investigate if
different types of properties detect neighbourhoods that would be useful for different
purposes. Other methods have included temporal aspects (Falher et al., 2015; Zhang et
al., 2013) to detect boundaries using a richer set of data.
As mentioned previously, using data from location based social media has its
limitations. It is thus useful to think about what kinds of information outside location
based social media could be added to complement its context-rich nature. One
candidate is cellphone data. While cellphone data is not as context-rich as data from
location based social media, cellphone data is a good source for tracking movements of
large amounts of people across the city. The amount of time users spend in locations
across the city could serve as a strong signal of where users perceived neighbourhoods
are, and be used to validate or complement the boundaries detected by neighbourhood
detection methods. Such research could be based on Ratti et al.'s (2010) research, which
took a grid-based approach and clustered grid squares based on total talk time on land
lines between grid squares to identify regions within Great Britain.
To detect neighbourhoods that better match the perceptions of their residents,
neighbourhood detection methods could also incorporate information based on the three
dimensions of neighbourhoods mentioned earlier - social ties, physical demarcations
and residents' experiences (Chaskin, 1998). Neighbourhood detection using location
based social media has focused on residents' experiences thus far, but it could also
attempt to incorporate information on social ties and physical demarcations. Social
media platforms contain rich information on social ties, but researchers would have to
find ways to differentiate between neighbourly ties and other forms of social ties.
Neighbourhood detection using location based social media is a relatively new
development, given that the location based social media platforms have only existed for
a short time. These platforms offer an exciting new way for researchers to think about
city and neighbourhood boundaries, given the amount of contextual data they provide.
It allows researchers, policy makers and urban planners to look at neighbourhoods as
dynamic entities that may have different characteristics when studied over different
time periods, leading to a more dynamic understanding of cities.
8. BIBLIOGRAPHY

Campbell, E., Henly, J.R., Elliott, D.S., Irwin, K., 2009. Subjective constructions of
neighborhood boundaries: Lessons from a qualitative study of four neighborhoods. J.
Urban Aff. 31, 461–490. doi:10.1111/j.1467-9906.2009.00450.x
Chaskin, R.J., 1998. Neighborhood as a Unit of Planning and Action: A Heuristic Approach.
J. Plan. Lit. 13, 11–30.
Chaskin, R.J., 1997. Perspectives on Neighborhood and Community: A Review of the
Literature. Soc. Serv. Rev. 71, 521–547.
Cranshaw, J., Schwartz, R., Hong, J.I., Sadeh, N., 2012. The Livehoods Project: Utilizing
Social Media to Understand the Dynamics of a City, in: Proc. ICWSM 2012, pp. 58–65.
Cranshaw, J., Yano, T., 2010. Seeing a home away from the home: Distilling proto-neighborhoods
from incidental data with Latent Topic Modeling, in: CSSWC Workshop at NIPS.
Del Bimbo, A., Ferracani, A., Pezzatini, D., D'Amato, F., Sereni, M., 2014. LiveCities:
Revealing the Pulse of Cities by Location-Based Social Networks Venues and Users
Analysis. Proc. Companion Publ. 23rd Int. Conf. World Wide Web Companion, 163–166.
doi:10.1145/2567948.2577035
Donetti, L., Muñoz, M.A., 2004. Detecting Network Communities: a new systematic and
efficient algorithm. J. Stat. Mech. Theory Exp. 2004, P10012.
doi:10.1088/1742-5468/2004/10/P10012
Falher, L., Gionis, A., Mathioudakis, M., 2015. Where is the Soho of Rome? Measures and
algorithms for finding similar neighborhoods in cities.
Fortunato, S., Barthélemy, M., 2007. Resolution limit in community detection. Proc. Natl.
Acad. Sci. U.S.A. 104, 36–41. doi:10.1073/pnas.0605965104
González, M.C., Hidalgo, C.A., Barabási, A.-L., 2008. Understanding individual human
mobility patterns. Nature 453, 779–782. doi:10.1038/nature06958
Good, B.H., De Montjoye, Y.A., Clauset, A., 2010. Performance of modularity
maximization in practical contexts. Phys. Rev. E 81, 046106.
doi:10.1103/PhysRevE.81.046106
Hirschman, A.O., 1964. The Paternity of an Index. Am. Econ. Rev. 54, 761.
Huang, A., 2008. Similarity measures for text document clustering. Proc. Sixth New
Zealand Computer Science Research Student Conference, 49–56.
Jones, E., Oliphant, T., Peterson, P., others, 2001. SciPy: Open Source Scientific Tools for
Python [WWW Document]. URL http://www.scipy.org/ (accessed 7.14.15).
Lancichinetti, A., Fortunato, S., 2011. Limits of modularity maximization in community
detection. Phys. Rev. E 84, 066122. doi:10.1103/PhysRevE.84.066122
Lancichinetti, A., Fortunato, S., 2009. Community detection algorithms: a comparative
analysis. Phys. Rev. E 80, 056117.
Noulas, A., Scellato, S., Mascolo, C., Pontil, M., 2011. Exploiting Semantic Annotations for
Clustering Geographic Areas and Users in Location-based Social Networks.
Noulas, A., Scellato, S., Lathia, N., Mascolo, C., 2012. A random walk around the city:
New venue recommendation in location-based social networks. Proc. 2012 ASE/IEEE
Int. Conf. on Social Computing, 144–153. doi:10.1109/SocialCom-PASSAT.2012.70
Openshaw, S., 1984. The modifiable areal unit problem. Concepts Tech. Mod. Geogr. 38,
1–41.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,
M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau,
D., Brucher, M., Perrot, M., Duchesnay, É., 2011. Scikit-learn: Machine Learning in
Python. J. Mach. Learn. Res. 12, 2825–2830.
Von Luxburg, U., 2007. A Tutorial on Spectral Clustering. Stat. Comput. 17, 395–416.
doi:10.1007/s11222-007-9033-z
Preoţiuc-Pietro, D., Cohn, T., 2013. Mining User Behaviours: A Study of Check-in Patterns
in Location Based Social Networks. Proc. ACM Web Science Conference.
doi:10.1145/2464464.2464479
Preoţiuc-Pietro, D., Cranshaw, J., Yano, T., 2013. Exploring venue-based city-to-city
similarity measures. Proc. 2nd ACM SIGKDD International Workshop on Urban
Computing, 1–4. doi:10.1145/2505821.2505832
Ratti, C., Sobolevsky, S., Calabrese, F., Andris, C., Reades, J., Martino, M., Claxton, R.,
Strogatz, S.H., 2010. Redrawing the map of Great Britain from a network of human
interactions. PLoS One 5. doi:10.1371/journal.pone.0014248
Sampson, R.J., Morenoff, J.D., Gannon-Rowley, T., 2002. Assessing Neighborhood
Effects: Social Processes and New Directions in Research. Annu. Rev. Sociol. 28,
443–478. doi:10.1146/annurev.soc.28.110601.141114
Scellato, S., Mascolo, C., 2011. Measuring user activity on an online location-based social
network. 2011 IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS),
918–923. doi:10.1109/INFCOMW.2011.5928943
Silva, T.H., Vaz de Melo, P.O.S., Almeida, J.M., Loureiro, A.A.F., 2013. Social Media as a
Source of Sensing to Study City Dynamics and Urban Social Behavior: Approaches,
Models, and Opportunities, in: Atzmueller, M., Chin, A., Helic, D., Hotho, A. (Eds.),
Ubiquitous Social Media Analysis. Springer, pp. 63–87.
Silva, T.H., Vaz De Melo, P.O.S., Almeida, J.M., Salles, J., Loureiro, A.A.F., 2012.
Visualizing the invisible image of cities. Proc. 2012 IEEE Int. Conf. on Green
Computing and Communications (GreenCom) / iThings / CPSCom, 382–389.
doi:10.1109/GreenCom.2012.62
Stokes, P., n.d. 2011 Census, Population and Household Estimates for Small Areas in
England and Wales.
Sturgis, P., Brunton-Smith, I., Kuha, J., Jackson, J., 2013. Ethnic diversity, segregation and
the social cohesion of neighbourhoods in London. Ethn. Racial Stud. 1–21.
doi:10.1080/01419870.2013.831932
Van Der Walt, S., Colbert, S.C., Varoquaux, G., 2011. The NumPy array: A structure for
efficient numerical computation. Comput. Sci. Eng. 13, 22–30.
doi:10.1109/MCSE.2011.37
Weiss, L., Ompad, D., Galea, S., Vlahov, D., 2007. Defining Neighborhood Boundaries for
Urban Health Research. Am. J. Prev. Med. 32, 154–159.
doi:10.1016/j.amepre.2007.02.034
Xia, P., Zhang, L., Li, F., 2015. Learning similarity with cosine similarity ensemble. Inf.
Sci. 307, 39–52. doi:10.1016/j.ins.2015.02.024
Zelnik-Manor, L., Perona, P., 2004. Self-Tuning Spectral Clustering. Adv. Neural Inf.
Process. Syst. 17, 1601–1608.
Zhang, A.X., Noulas, A., Scellato, S., Mascolo, C., 2013. Hoodsquare: Modeling and
recommending neighborhoods in location-based social networks. Proc. SocialCom
2013, 69–74. doi:10.1109/SocialCom.2013.17
9. APPENDIX
9.1. Scripts for collecting and formatting data for analysis
9.1.1. IPython notebook: twitter_streaming.ipynb
# This script is used to collect tweets from the Twitter streaming API. It runs
# throughout the period of data collection.
import json
import sys
import tweepy

# from http://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte
# this handles unicode errors
reload(sys)
sys.setdefaultencoding('utf8')

def tweepy_oauth():
    CONSUMER_KEY = '[your consumer key]'
    CONSUMER_SECRET = '[your consumer secret]'
    ACCESS_TOKEN = '[your access token]'
    ACCESS_TOKEN_SECRET = '[your access token secret]'

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

    return auth

# Code from http://badhessian.org/2012/10/collecting-real-time-twitter-data-with-the-streaming-api/
# with minor modifications

import json, time, sys
from tweepy import StreamListener

# create an instance of a tweepy StreamListener to handle the incoming data.


class SListener(StreamListener):

    def __init__(self, fprefix='streamer'):
        # self.api = api or API()
        self.counter = 0
        self.fprefix = fprefix
        self.output = open('../Dissertation/twitter_data/raw/' + fprefix + '.' +
                           time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')
        self.delout = open('delete.txt', 'a')

    def on_data(self, data):
        if 'in_reply_to_status' in data:
            self.on_status(data)
        elif 'delete' in data:
            delete = json.loads(data)['delete']['status']
            if self.on_delete(delete['id'], delete['user_id']) is False:
                return False
        elif 'limit' in data:
            if self.on_limit(json.loads(data)['limit']['track']) is False:
                return False
        elif 'warning' in data:
            warning = json.loads(data)['warnings']
            print warning['message']
            return False

    def on_status(self, status):
        self.output.write(status)
        self.counter += 1

        if self.counter >= 5000:
            # New file is started every 5,000 tweets, tagged with prefix and a timestamp.
            self.output.close()
            self.output = open('../Dissertation/twitter_data/raw/' + self.fprefix + '.' +
                               time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')
            self.counter = 0

        return

    def on_delete(self, status_id, user_id):
        self.delout.write(str(status_id) + "\n")
        return

    def on_limit(self, track):
        sys.stderr.write(track + "\n")
        return

    def on_error(self, status_code):
        sys.stderr.write('Error: ' + str(status_code) + "\n")
        return True  # Don't kill the stream

    def on_timeout(self):
        sys.stderr.write("Timeout, sleeping for 60 seconds...\n")
        time.sleep(60)
        return True  # Don't kill the stream

# This sets up the Twitter Streaming API
twitter_api = tweepy_oauth()
Q = "twitter.com"

# the streaming api will filter for the cities indicated by their bounding boxes here.
# the order is West Lon, South Lat, East Lon, North Lat
# using the maximum number of bounding boxes allowed (25) by the Twitter Streaming API
locations = [103.549467,1.145502,104.123447,1.478481, # Singapore
-0.489, 51.28, 0.236, 51.686, # London
-74.255641,40.495865,-73.699793,40.91533, # New York City
]

# Create a streaming API and set a timeout value of 60 seconds.


streaming_api = tweepy.streaming.Stream(twitter_api, SListener(), timeout=60)

# Used infinite loop from https://github.com/ryanmcgrath/twython/issues/288
# because I kept getting IncompleteRead errors
# streaming_api.filter(follow=None, track=None, locations=locations)
while True:  # Endless loop
    try:
        streaming_api.filter(follow=None, track=None, locations=locations,
                             stall_warnings=True)
    except Exception:
        continue

9.1.2. IPython notebook: extract_twitter_data.ipynb


# This script looks in the folder path given for Twitter json files and generates a
# json containing all Foursquare checkins

day = '20150824'

json_path = 'twitter_raw/' + day + '/*.json'  # folder path where the tweets are (json files)
output_path = 'foursquare/' + day + '.json'  # name of file to save the final output to

from sqlalchemy import create_engine

# create the connection string to the MySQL database
engine = create_engine('mysql://[username]:[password]@[host]:[port]/[schemaname]',
                       encoding='utf-8')

################################

def main(json_path, output_path):
    tweets = merge_json(json_path)  # Get all tweets in the files in the folder
    print("Total number of tweets: " + str(len(tweets)))

    # Extract all tweets that have coordinates and return a pandas DataFrame
    df = extract_tweets(tweets)
    print()
    print("Number of tweets with coordinates: " + str(len(df)))

    df.to_sql('tweets', engine, if_exists='append', chunksize=10000)

    # Extract tweets from FourSquare
    # Comment out this block if all tweets with coordinates are needed instead
    df = df.loc[df['source'] == '<a href="http://foursquare.com" rel="nofollow">Foursquare</a>']
    print("Number of tweets from FourSquare: " + str(len(df)))

    df.to_json(output_path)  # Save the FourSquare tweets in json
    print("Saved at: " + output_path)

################################
import json
import glob

def merge_json(json_path):
    # this merges all the json files given in the json_path and returns a list of tweets
    tweets = []
    for f in glob.glob(json_path):
        tweets_file = open(f, "r")
        for tweet in tweets_file:
            if tweet != '\n':
                tweets.append(tweet)

    return tweets

#################################
import pandas as pd
from pandas import DataFrame, Series
def extract_tweets(tweets):
    # takes a list of tweets and returns those with coordinates data as a pandas DataFrame
    # it is also much faster than the iterrows method for some reason
    df = []
    for tweet in tweets:
        if len(df) % 10000 == 0:
            print(len(df), end=' ')

        data = {}
        try:
            t = json.loads(tweet)
        except ValueError:
            continue  # skip tweets that cannot be parsed as json

        if t['coordinates']:
            data['tweetId'] = t['id_str']
            data['dateTime'] = t["created_at"]
            data['tweet'] = t["text"]
            data['lng'] = t["coordinates"]["coordinates"][0]
            data['lat'] = t["coordinates"]["coordinates"][1]
            data['source'] = t["source"]
            data['userId'] = t["user"]["id_str"]

            if (t["entities"]["hashtags"]):
                data['hashtags'] = str([h['text'] for h in t["entities"]["hashtags"]]).strip('[]')

            if (t["entities"]["urls"]):
                data['url'] = t["entities"]["urls"][0]["expanded_url"]

            df.append(data)

    df = pd.DataFrame.from_dict(df)

    df["dateTime"] = pd.to_datetime(df["dateTime"])  # convert to datetime format

    import pytz

    # Converts the dateTime column to an index, localizes it to UTC time, then stores
    # it back in the dateTime column
    df["dateTime"] = pd.DatetimeIndex(df["dateTime"]).tz_localize('UTC')

    # Converts the dateTime column to an index, localizes it to UTC time, converts it
    # to SG time, then stores it back in the dateTime column
    # df["DateTime"] = pd.DatetimeIndex(df["DateTime"]).tz_localize('UTC').tz_convert('Singapore')

    # Used when converting for Processing - I don't know how to deal with epoch time in Processing
    # df["Year"] = pd.DatetimeIndex(df["DateTime"]).year
    # df["Month"] = pd.DatetimeIndex(df["DateTime"]).month
    # df["Day"] = pd.DatetimeIndex(df["DateTime"]).day
    # df["Hour"] = pd.DatetimeIndex(df["DateTime"]).hour
    # df["Minute"] = pd.DatetimeIndex(df["DateTime"]).minute
    # df["Second"] = pd.DatetimeIndex(df["DateTime"]).second

    return df

#################################
main(json_path, output_path)

9.1.3. IPython notebook: foursquare_search_place.ipynb


# This script is used to find and add the foursquare venueID for every foursquare
# checkin extracted from the Twitter data
from __future__ import print_function
import pandas as pd
import numpy as np
import json

# variables for main()


day = '20150824'

tweets_path = 'foursquare/' + day + '.json'  # load the tweets data stored here
venues_path = 'venues/venues_info.json'  # master venues info file; data will be loaded from and saved to this file
# foursquare client details


client_id = '[your client id]'
client_secret = '[your client secret]'
import foursquare
client = foursquare.Foursquare(client_id=client_id, client_secret=client_secret)

################################

def main(tweets_path, client, venues_path):
    tweets = pd.read_json(tweets_path)  # load tweets
    tweets.index = np.arange(len(tweets))
    print("Total foursquare tweets: " + str(len(tweets)))

    # Finds the venue ids based on the Twitter check-ins and saves them to the json file;
    # returns a df with the original tweets info plus the venue_ids
    tweets_w_venues = extract_venues(tweets, tweets_path)

    # Cross-references venues in tweets_w_venues with the master venue file and adds
    # any new venues found to the master venue file
    add_venues_to_master(venues_path, tweets_w_venues, client)

################################
from lxml import html
import requests
import collections

def extract_venue_id(url):
    # takes the tweet url, goes to the foursquare page behind the url and finds the venue id;
    # returns the url if the page / tweet / venue info cannot be found
    try:
        page = requests.get(url)  # insert tweet url here
    except (requests.ConnectionError, requests.exceptions.MissingSchema,
            requests.exceptions.InvalidSchema):
        # raised when the url leads to localhost instead of an actual online page
        return url

    try:
        tree = html.fromstring(page.text)
    except ValueError:
        return url

    venue_url = tree.xpath('.//div[@class="venue push"]/h1/a/@href')

    try:
        venue_id = (venue_url[0].split('/'))[-1]
        return venue_id
    except IndexError:
        return url
################################
def extract_venue_info(venue_id, client):
    # uses the venue_id to request venue info from the FourSquare api
    venue_info = client.venues(venue_id.strip())
    return venue_info

################################
def extract_venues(tweets, tweets_path):
    first_iter_errors = 0
    second_iter_errors = 0
    first_list = []
    second_list = []

    print("Getting venue Ids...")

    for index, row in tweets.iterrows():
        if index % 10 == 0:
            print(index, end=' ')

        tweets.ix[index, 'venueId'] = extract_venue_id(tweets.ix[index, 'url'])

        if tweets.ix[index, 'venueId'] == tweets.ix[index, 'url']:
            first_iter_errors += 1
            first_list.append(tweets.ix[index, 'url'])

    # Doing a second iteration because sometimes urls are returned even when they are
    # perfectly legit - ConnectionError handling might trigger when the net connection
    # is unstable, for example. Doing this twice seems to catch all the venue_ids that
    # were missed the first time.
    print()
    print("Second iteration getting venue IDs...")
    for index, row in tweets.iterrows():
        if index % 100 == 0:
            print(index, end=' ')

        if tweets.ix[index, 'venueId'] == tweets.ix[index, 'url']:
            tweets.ix[index, 'venueId'] = extract_venue_id(tweets.ix[index, 'url'])
            if tweets.ix[index, 'venueId'] == tweets.ix[index, 'url']:
                second_iter_errors += 1
                second_list.append(tweets.ix[index, 'url'])

    # Probably does the same thing just as slowly; the one above prints the count so I know progress
    # tweets['VenueID'] = tweets.apply(lambda row: extract_venue_id(row['url']), axis=1)
    # tweets['VenueID']

    print()
    tweets.to_json(tweets_path)
    print("Tweets saved at: " + tweets_path)

    print("Errors from first iteration: " + str(first_iter_errors))
    print(first_list)
    print("Errors from second iteration: " + str(second_iter_errors))
    print(second_list)

    return tweets

################################
def add_venues_to_master(venues_path, tweets, client):
    venues_info = pd.read_json(venues_path)

    res = client.venues.categories()
    maincats_dict = dict([(category['id'], category['name'])
                          for category in res['categories']])
    subcats_dict = dict([(subcat['id'], subcat['name'])
                         for category in res['categories']
                         for subcat in category['categories']])
    subsubcats_dict = dict([(subsubcat['id'], subsubcat['name'])
                            for category in res['categories']
                            for subcat in category['categories']
                            for subsubcat in subcat['categories']])

    # Create conversion dictionaries
    subsubcat_to_subcat_ids_dict = dict([(subsubcat['id'], subcat['id'])
                                         for category in res['categories']
                                         for subcat in category['categories']
                                         for subsubcat in subcat['categories']])
    subcat_to_maincat_ids_dict = dict([(subcat['id'], category['id'])
                                       for category in res['categories']
                                       for subcat in category['categories']])

    added = 0
    errors = 0
    skipped = 0
    venues_w_errors = []

    print("Getting venue details for venues master file...")

    for index, row in tweets.iterrows():
        if index % 100 == 0:
            print(index, end=' ')

        # check for this to ignore all failed extract_venue_id() attempts and ignore
        # all venue_ids that are already in the venues_info file
        if (tweets.ix[index, 'venueId'] != tweets.ix[index, 'url']) and \
                (tweets.ix[index, 'venueId'] not in venues_info['venueId']):
            try:
                venue_info = extract_venue_info(tweets.ix[index, 'venueId'], client)

                key = venue_info['venue']['id']

                venues_info.ix[key, 'venueId'] = venue_info['venue']['id']
                venues_info.ix[key, 'name'] = venue_info['venue']['name']

                venues_info.ix[key, 'lat'] = venue_info['venue']['location']['lat']
                venues_info.ix[key, 'lng'] = venue_info['venue']['location']['lng']

                if ('categories' in venue_info['venue']):
                    for cat in venue_info['venue']['categories']:
                        if 'primary' in cat:
                            if cat['primary']:
                                venues_info.ix[key, 'categoryId'] = cat['id']
                                venues_info.ix[key, 'category'] = cat['name']

                                if cat['id'] in subsubcats_dict:
                                    venues_info.ix[key, 'subsubcatId'] = cat['id']
                                    venues_info.ix[key, 'subsubcat'] = cat['name']
                                    venues_info.ix[key, 'subcatId'] = subsubcat_to_subcat_ids_dict[cat['id']]
                                    venues_info.ix[key, 'subcat'] = subcats_dict[venues_info.ix[key, 'subcatId']]
                                    venues_info.ix[key, 'maincatId'] = subcat_to_maincat_ids_dict[venues_info.ix[key, 'subcatId']]
                                    venues_info.ix[key, 'maincat'] = maincats_dict[venues_info.ix[key, 'maincatId']]

                                elif cat['id'] in subcats_dict:
                                    venues_info.ix[key, 'subsubcatId'] = None
                                    venues_info.ix[key, 'subsubcat'] = None
                                    venues_info.ix[key, 'subcatId'] = cat['id']
                                    venues_info.ix[key, 'subcat'] = cat['name']
                                    venues_info.ix[key, 'maincatId'] = subcat_to_maincat_ids_dict[cat['id']]
                                    venues_info.ix[key, 'maincat'] = maincats_dict[venues_info.ix[key, 'maincatId']]

                                elif cat['id'] in maincats_dict:
                                    venues_info.ix[key, 'subsubcatId'] = None
                                    venues_info.ix[key, 'subsubcat'] = None
                                    venues_info.ix[key, 'subcatId'] = None
                                    venues_info.ix[key, 'subcat'] = None
                                    venues_info.ix[key, 'maincatId'] = cat['id']
                                    venues_info.ix[key, 'maincat'] = cat['name']

                if ('description' in venue_info['venue']):
                    venues_info.ix[key, 'description'] = venue_info['venue']['description']

                added += 1

            except foursquare.ParamError:
                errors += 1
                venues_w_errors.append((tweets.ix[index, 'venueId'],
                                        tweets.ix[index, 'lat'],
                                        tweets.ix[index, 'lng']))

    # Overwrites master venues_info file
    venues_info.to_json(venues_path)

    # backs up new copy of master venues_info file
    backup_path = 'venues/venues_info_caa_' + day + '.json'
    venues_info.to_json(backup_path)

    print()
    print(venues_path + " has been updated with " + str(added) + " venues")
    print(str(errors) + " venues could not be found from the venue_id")
    print(venues_w_errors)

    return venues_w_errors

################################
main(tweets_path, client, venues_path)

9.1.4. IPython notebook: format_data_for_analysis.ipynb


# This script is used to create the checkin files which are later used as input by the
# scripts for the Livehoods method.
locations = ['London']
dates = list(range(20150406, 20150431)) + list(range(20150501, 20150532))

# use get_checkins to create the file for a new date range if needed, and extract
# checkins within polygons; these two steps are not needed if the file already exists
checkins = get_checkins(dates=dates)
extract_checkins_within_polygons(checkins, locations=locations)
def get_checkins(dates, columns = ['tweetId', 'dateTime', 'lat', 'lng', 'userId',
'venueId']):

84

# returns a df of all checkins on the dates provided. Also saves the df as a json
file.
print("Getting data...")
df = pd.DataFrame({}, columns = columns)

for day in dates:


print(day)
day = day
checkins_path = 'foursquare/' + str(day) + '.json'
checkins = pd.read_json(checkins_path)
checkins['tweetId'] = checkins['tweetId'].astype('str')
df = df.append(checkins)

df = df[columns]
df = df.drop_duplicates() # remove duplicates if any
df.index = np.arange(len(df))

# save output to .json


outpath = 'checkins_' + str(dates[0]) + '_' + str(dates[-1]) + '.json'
df.to_json(outpath)
print('data saved to %s' % outpath)

return df
def filter_by_location(df, locations = ['London', 'Singapore', 'New York']):
# filter the checkins by list of locations to analyze. Also adds a column
indicating the name of the city
df['venueId'].apply(str)
df['userId'].apply(str)
df['city'] = ''

df_out = pd.DataFrame(columns = df.columns)


85

for location in locations:


print("Getting checkins for %s..." % location)
# assign bounding boxes
if location.lower() in ['london', 'ldn']:
location = 'London'
bb = [-0.489, 51.28, 0.236, 51.686]
elif location.lower() in ['singapore', 'sg']:
location = 'Singapore'
bb = [103.549467,1.145502,104.123447,1.478481]
elif location.lower() in ['new york', 'newyork', 'nyc']:
location = 'New York'
bb = [-74.255641,40.495865,-73.699793,40.91533]
else:
raise NameError('Cannot find one of the locations, or you have entered a
string instead of a list')

# use bounding boxes to look for tweets that are within these boxes
df_int = df[ (df['lat']>=bb[1]) & (df['lat']<=bb[3]) & (df['lng']>=bb[0]) &
(df['lng']<=bb[2]) ].copy()
df_int.loc[:,'city'] = location
print('Number of checkins from %s: %s ' %(location, len(df_int)) )
df_out = df_out.append(df_int)

df_out.index = np.arange(len(df_out))
print('Total number of checkins: %s' % len(df_out))
print()
return df_out

86

from shapely.geometry import mapping, Polygon, Point, MultiPoint, shape,


MultiPolygon
import fiona
import json
import geopandas as geopd

def extract_checkins_within_polygons(df, locations = ['London', 'Singapore',


'New York']):
# extract the checkins by list of locations to analyze. Also adds a column
indicating the name of the city
# Uses city boundaries according to the geojson boundaries instead of
bounding boxes

df['venueId'].apply(str)
df['userId'].apply(str)
df['city'] = ''

df_out = pd.DataFrame(columns = df.columns)

for location in locations:


print("Getting checkins for %s..." % location)
# assign bounding boxes
if location.lower() in ['london', 'ldn']:
location = 'London'
loc = 'ldn'
bb = [-0.187894,51.483718,-0.109978,51.516466]
elif location.lower() in ['singapore', 'sg']:
location = 'Singapore'
loc = 'sg'
bb = [103.549467,1.145502,104.123447,1.478481]
elif location.lower() in ['new york', 'newyork', 'nyc']:
87

location = 'New York'


loc = 'nyc'
bb = [-74.255641,40.495865,-73.699793,40.91533]
else:
raise NameError('Cannot find one of the locations, or you have entered a
string instead of a list')

with open(loc + '_boundaries.geojson') as f:


boundary = json.load(f)
boundary = shape(boundary['features'][0]['geometry'])

# use shapely polygons to look for tweets that are within ciy boundaries
df_int = df[ (df['lat']>=bb[1]) & (df['lat']<=bb[3]) & (df['lng']>=bb[0]) &
(df['lng']<=bb[2]) ].copy()

pts = geopd.GeoDataFrame([Point(x, y) for x, y in zip(df_int['lng'].values,


df_int['lat'].values)], columns = ['geometry'])
df_int['city'] = np.where(pts['geometry'].within(boundary), location, None)
df_int = df_int[df_int['city'].notnull()]

print('Number of checkins from %s: %s ' %(location, len(df_int)) )


df_out = df_out.append(df_int)

df_out.index = np.arange(len(df_out))
print('Total number of checkins: %s' % len(df_out))

outpath = 'checkins_' + str(dates[0]) + '_' + str(dates[-1]) + '.json'


df_out.to_json(outpath)
print('data saved to %s' % outpath)
print()

9.2. Scripts for Livehoods clustering method

This set of scripts was used to run the Livehoods clustering method on the Amazon
cloud server.

9.2.1. Bash script: install.sh

# This script sets up all the software needed to run the Livehoods clustering
# method on the Cloud server

ln -s /mnt /home/ubuntu/mnt
byobu-enable
wget https://bootstrap.pypa.io/get-pip.py
sudo apt-get -y install zip unzip git
sudo python get-pip.py
sudo apt-get -y install python-numpy python-scipy python-matplotlib python-sympy python-nose git
sudo apt-get -y install python-dev htop
sudo pip install pandas
sudo pip install descartes
sudo pip install -U scikit-learn
sudo apt-get -y install libspatialindex-dev

sudo add-apt-repository -y ppa:ubuntugis/ubuntugis-unstable
sudo apt-get -y update
sudo apt-get -y install libgdal-dev gdal-bin

sudo pip install fiona
sudo pip install git+git://github.com/geopandas/geopandas.git
sudo pip install Shapely
sudo pip install sqlalchemy

9.2.2. Bash script: runLDN.sh

# This runs the different variations of the Livehoods method in parallel on the
# cloud server.
python clustering.py -c London -r set7 -n 10 -a 0.01 &
python clustering.py -c London -r set8 -n 10 -a 0.01 &
python clustering.py -c London -r set9 -n 10 -a 0.01 &
python clustering.py -c London -r set10 -n 10 -a 0.01 &
python clustering.py -c London -r set11 -n 10 -a 0.01 &
python clustering.py -c London -r set12 -n 10 -a 0.01 &

python clustering.py -c London -r set7_00 -n 10 -a 0.00 &
python clustering.py -c London -r set7_01 -n 10 -a 0.01 &
python clustering.py -c London -r set7_02 -n 10 -a 0.02 &
python clustering.py -c London -r set7_03 -n 10 -a 0.03 &
python clustering.py -c London -r set7_04 -n 10 -a 0.04 &
python clustering.py -c London -r set7_05 -n 10 -a 0.05 &

python clustering.py -c London -r set7 -n 5 -a 0.01 &
python clustering.py -c London -r set7 -n 6 -a 0.01 &
python clustering.py -c London -r set7 -n 7 -a 0.01 &
python clustering.py -c London -r set7 -n 8 -a 0.01 &
python clustering.py -c London -r set7 -n 9 -a 0.01 &
python clustering.py -c London -r set7 -n 11 -a 0.01 &
python clustering.py -c London -r set7 -n 12 -a 0.01 &
python clustering.py -c London -r set7 -n 13 -a 0.01 &
python clustering.py -c London -r set7 -n 14 -a 0.01 &
python clustering.py -c London -r set7 -n 15 -a 0.01 &
python clustering.py -c London -r set7 -n 16 -a 0.01 &
python clustering.py -c London -r set7 -n 17 -a 0.01 &
python clustering.py -c London -r set7 -n 18 -a 0.01 &
python clustering.py -c London -r set7 -n 19 -a 0.01 &
python clustering.py -c London -r set7 -n 20 -a 0.01 &
9.2.3. Python script: clustering.py

#!/usr/bin/python
# This script is called by runLDN.sh. It runs the Livehoods clustering method
# and assigns parameters depending on the arguments supplied.

from __future__ import print_function

import sys
import getopt
from getdata import *
from clusteringalgo import spectral_clustering
from formatresults import format_results_for_viz, send_to_mysql

__author__ = 'Tai Tong KAM'

def main(argv):
    city = ''
    results_set = ''
    n_neighbors = 10
    local_sql = False
    alpha = 0.01

    try:
        opts, args = getopt.getopt(argv, "c:r:n:a:l",
                                   ["city=", "results_set=", "n_neighbors=", "alpha=", "local_sql="])
    except getopt.GetoptError:
        sys.exit(2)

    print('e.g. command: python clustering.py -c London -r set7 -n 10 -a 0.01 -l')
    print(argv)

    for opt, arg in opts:
        if opt in ("-c", "--city"):
            city = arg
        elif opt in ("-r", "--results_set"):
            results_set = arg
        elif opt in ("-n", "--n_neighbors"):
            n_neighbors = int(arg)
        elif opt in ("-a", "--alpha"):
            alpha = float(arg)
        elif opt in ("-l", "--local_sql"):
            local_sql = True

    clustering = get_set_info(results_set, 'clustering')

    if clustering == 'spectral':
        metric = get_set_info(results_set, 'metric')
        input_matrix = get_set_info(results_set, 'input_matrix')

        results_dict = spectral_clustering(city, n_neighbors, metric, input_matrix, results_set, alpha)
        save_city_file(city, 'results_dict', results_dict,
                       **{'results_set': results_set, 'n_neighbors': n_neighbors})

        #send_to_mysql(results_dict, city, results_set, n_neighbors, local_sql)
        format_results_for_viz(city, results_set, n_neighbors)

if __name__ == "__main__":
    main(sys.argv[1:])
9.2.4. Python script: clusteringalgo.py

#!/usr/bin/python
# This script contains functions used to run spectral clustering for the
# Livehoods method.

from __future__ import print_function

import sys
from getdata import *

__author__ = 'Tai Tong KAM'

def spectral_clustering(city, n_neighbors, metric, input_matrix, results_set, alpha):
    # input_matrix can be 'full_graph' or 'nearest_neighbors'
    from sklearn.cluster import SpectralClustering
    print()
    print('Performing spectral clustering for %s city, %s nearest neighbors, using %s and %s metric'
          % (city, str(n_neighbors), input_matrix, metric))

    venues_latlng = get_venues_latlng(city)

    if input_matrix == 'full_graph':
        matrix = get_social_similarity_matrix(city, metric)
        matrix = (matrix + matrix.T) / 2  # symmetrize the matrix
    elif input_matrix == 'nearest_neighbors':
        matrix = get_affinity_matrix(city, n_neighbors, metric, alpha)
        matrix = (matrix + matrix.T) / 2
    else:
        matrix = None

    (n_clusters, n_clusters_chart) = get_optimal_n_clusters_value(matrix, k=200,
                                                                  min_clusters=30, max_clusters=100)
    print(n_clusters, 'clusters')
    print()

    spec = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    spec.fit_predict(matrix)

    results_df = pd.DataFrame(venues_latlng).merge(pd.DataFrame(spec.labels_),
                                                   left_index=True, right_index=True)
    results_df.drop(['city', 'venueId'], axis=1, inplace=True)
    results_df.columns = ['lat', 'lng', 'label']

    dates = get_city_info(city, 'dates')

    results = {'results_df': results_df,
               'city': city,
               'n_neighbors': n_neighbors,
               'metric': metric,
               'n_clusters': n_clusters,
               'n_clusters_chart': n_clusters_chart,
               'input_matrix': input_matrix,
               'dates': dates,
               'results_set': results_set
               }

    return results

def get_affinity_matrix(city, n_neighbors, metric, alpha=None):
    if alpha is None:
        alpha = 0.01
    nearest_neighbors_matrix = get_nearest_neighbors_matrix(city, n_neighbors)
    social_similarity_matrix = get_social_similarity_matrix(city, metric=metric)
    affinity_matrix = create_affinity_matrix(nearest_neighbors_matrix,
                                             social_similarity_matrix, alpha=alpha)

    return affinity_matrix

def create_affinity_matrix(nearest_neighbors_matrix, social_similarity_matrix, alpha=None):
    if alpha is None:
        alpha = 0.01
    # keep similarity scores only for venue pairs that are nearest neighbours,
    # and add a small constant alpha to every nearest-neighbour edge
    out = np.multiply(nearest_neighbors_matrix, social_similarity_matrix)
    alpha_matrix = nearest_neighbors_matrix * alpha
    out = np.add(out, alpha_matrix)
    np.fill_diagonal(out, 0)

    return out
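To make the construction above concrete, the following is a minimal sketch with
made-up 3 x 3 inputs (the real matrices are venue-by-venue and far larger; it
assumes create_affinity_matrix above is available):

import numpy as np

# toy inputs for illustration only: row i marks venue i's nearest neighbours
# with 1; the second matrix holds pairwise social similarity scores
nearest = np.array([[0, 1, 1],
                    [1, 0, 1],
                    [1, 1, 0]])
similarity = np.array([[1.0, 0.6, 0.0],
                       [0.6, 1.0, 0.2],
                       [0.0, 0.2, 1.0]])

create_affinity_matrix(nearest, similarity, alpha=0.01)
# array([[ 0.  ,  0.61,  0.01],
#        [ 0.61,  0.  ,  0.21],
#        [ 0.01,  0.21,  0.  ]])

Pairs that are not nearest neighbours keep an affinity of zero, while every
nearest-neighbour pair receives at least alpha, so the graph stays connected
even where no users overlap.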

def get_nearest_neighbors_matrix(city, n_neighbors):
    # using the distance matrix, this creates a matrix identifying the nearest
    # n neighbors for each venue.
    # venues that are nearest n neighbors are assigned 1, and 0 otherwise.
    # used to create the affinity matrix later. Returns an np array
    distance_arr = get_distance_matrix(city)
    out = np.apply_along_axis(find_nearest_neighbors,
                              1,
                              distance_arr,
                              n_neighbors)
    return out

def find_nearest_neighbors(a, n_neighbors):
    # used by get_nearest_neighbors_matrix. This is applied to each row/column
    # of the distance matrix.
    ranks = np.argsort(a, kind='mergesort')
    out = (a <= a[ranks][:n_neighbors].max()).astype(int)
    return out
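As a quick illustration of the thresholding above (a sketch with made-up
distances; note that a venue's own zero distance counts as one of its
n_neighbors):

import numpy as np

a = np.array([0., 120., 40., 300.])  # distances (m) from one venue, made up
find_nearest_neighbors(a, 2)         # -> array([1, 0, 1, 0])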

def get_social_similarity_matrix(city, metric):
    from scipy.sparse import csc_matrix
    # this creates a sparse matrix where the columns are users and rows are venues.
    # it tells us how many times a user checked into a particular venue over the
    # time period

    users_by_venue = get_users_by_venue(city)
    users_dict = get_users_dict(city)
    venues_dict = get_venues_dict(city)
    users_by_venue['userId'].replace(users_dict, inplace=True)
    users_by_venue['venueId'].replace(venues_dict, inplace=True)
    users_by_venue.drop('city', axis=1, inplace=True)

    df_int = users_by_venue.copy()
    df_int.index = np.arange(len(df_int))
    row = df_int['venueId'].values
    col = df_int['userId'].values
    data = df_int['count'].values
    try:
        matrix_int = csc_matrix((data, (row, col)),
                                shape=((len(df_int['venueId'].unique()) + 0),
                                       (len(df_int['userId'].unique()) + 0))).toarray()
    except ValueError:
        matrix_int = csc_matrix((data, (row, col)),
                                shape=((len(df_int['venueId'].unique()) + 1),
                                       (len(df_int['userId'].unique()) + 1))).toarray()

    # calculate the social similarity for venues
    # (pdist returns pairwise distances for the chosen metric, e.g. the cosine
    # distance; squareform turns the condensed result into a square matrix)
    from scipy.spatial.distance import pdist, squareform
    social_similarity_matrix = squareform(pdist(matrix_int, metric=metric))
    #social_similarity_matrix = pairwise_kernels(matrix_int, metric=metric)

    return social_similarity_matrix

def get_optimal_n_clusters_value(matrix, k, min_clusters, max_clusters):
    # using the eigengap heuristic to find how many clusters should be used for
    # spectral clustering
    from scipy.sparse import csgraph
    from scipy.sparse import linalg as la
    l = csgraph.laplacian(matrix, normed=True)
    eigenvalues = la.eigs(l, k=k, which='SM', return_eigenvectors=False)
    eigenvalues.sort()

    n_clusters = get_n_clusters_value(eigenvalues, min_clusters, max_clusters)
    n_clusters_chart = [(get_n_clusters_value(eigenvalues, 0, x)) for x in range(2, k)]

    return (n_clusters, n_clusters_chart)

def get_n_clusters_value(eigenvalues, min_clusters, max_clusters):
    eigenvalues = eigenvalues[min_clusters: max_clusters]
    n_clusters = np.argmax(np.ediff1d(eigenvalues)) + min_clusters
    return n_clusters
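For reference, this is the eigengap heuristic: with the eigenvalues of the
normalised graph Laplacian sorted so that λ_1 ≤ λ_2 ≤ ..., the number of
clusters is chosen at the largest gap between consecutive eigenvalues,

    k* = argmax_k (λ_(k+1) - λ_k),    k_min ≤ k < k_max

which is what get_n_clusters_value computes: np.ediff1d returns the successive
differences λ_(k+1) - λ_k and np.argmax picks the largest one within the
allowed range.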


9.2.5. Python script: getdata.py

#!/usr/bin/python
# This script contains functions that get data required for the Livehoods method
# or other analysis and visualizations as needed. The data is loaded if the file
# already exists, and generated from raw checkin data if the file does not exist.
from __future__ import print_function
import sys
from utils import *

__author__ = 'Tai Tong KAM'

def get_checkins(city):
    # returns a df of all checkins on the dates provided.
    if city_file_exists(city, 'checkins'):
        checkins = load_city_file(city, 'checkins')
    else:
        columns = ['tweetId', 'dateTime', 'lat', 'lng', 'userId', 'venueId']
        df = pd.DataFrame({}, columns=columns)
        dates = get_city_info(city, 'dates')

        for day in dates:
            print(day)
            checkins_path = '../_Data/foursquare/' + str(day) + '.json'
            checkins = pd.read_json(checkins_path)
            checkins['tweetId'] = checkins['tweetId'].astype('str')
            df = df.append(checkins)

        df = df[columns]
        df = df.drop_duplicates()  # remove duplicates if any
        df.index = np.arange(len(df))

        checkins = extract_checkins_within_polygons(df)
        save_city_file(city, 'checkins', checkins)

    df_out = filter_by_location(checkins, city)

    return df_out

def extract_checkins_within_polygons(df):
    # extract the checkins by list of locations to analyze. Also adds a column
    # indicating the name of the city.
    # Uses city boundaries according to the geojson boundaries instead of
    # bounding boxes

    from shapely.geometry import Point, shape
    import json
    import geopandas as geopd

    df['venueId'].apply(str)
    df['userId'].apply(str)
    df['city'] = ''

    df_out = pd.DataFrame(columns=df.columns)

    for city in ['London', 'Singapore', 'New York']:
        # assign bounding boxes
        bb = get_city_info(city, 'bounding_box')
        path = get_city_file_path(city, 'boundaries')

        with open(path) as f:
            boundary = json.load(f)
        boundary = shape(boundary['features'][0]['geometry'])

        # use shapely polygons to look for tweets that are within city boundaries
        df_int = df[(df['lat'] >= bb[1]) & (df['lat'] <= bb[3]) & (df['lng'] >= bb[0])
                    & (df['lng'] <= bb[2])].copy()

        pts = geopd.GeoDataFrame([Point(x, y) for x, y in zip(df_int['lng'].values,
                                                              df_int['lat'].values)],
                                 columns=['geometry'])
        df_int['city'] = np.where(pts['geometry'].within(boundary), city, None)
        df_int = df_int[df_int['city'].notnull()]
        df_out = df_out.append(df_int)

    df_out.index = np.arange(len(df_out))

    return df_out

def filter_by_location(df, city):
    # filter the checkins by list of locations to analyze.
    df['venueId'].apply(str)
    df['userId'].apply(str)

    bb = get_city_info(city, 'bounding_box')

    # use bounding boxes to look for tweets that are within these boxes
    df_out = df[(df['lat'] >= bb[1]) & (df['lat'] <= bb[3]) & (df['lng'] >= bb[0]) &
                (df['lng'] <= bb[2])].copy()
    df_out.index = np.arange(len(df_out))

    return df_out

def get_value_counts(city, col):
    # counts how many checkins were made based on the list of df_cols given
    # (e.g. ['venueId', 'userId'])
    # returns a dictionary with the df_cols as keys and pd dataframes as values

    if col == 'venueId':
        file = 'venue_counts'
    elif col == 'userId':
        file = 'user_counts'
    else:
        sys.exit('Error in col')

    if city_file_exists(city, file):
        df_out = load_city_file(city, file)
    else:
        checkins = get_checkins(city)
        # Group by the column, date, and hour. This creates a multiindex df
        df_out = pd.DataFrame(checkins.groupby([checkins['city'],
                                                checkins[col]]).tweetId.count())
        # this returns a df where the multiindex is put in columns in a normal df instead.
        df_out = df_out.reset_index()
        df_out.columns = ['city', col, 'count']
        df_out = df_out.sort('count', ascending=False)
        df_out.index = np.arange(len(df_out))

        save_city_file(city, file, df_out)

    return df_out

def get_df_by_time(city, col):
    # creates dfs for each df_col provided with the columns col, 'date', 'hour', 'count'
    # returns a dictionary with the df_cols as keys and dataframes as values
    if col == 'venueId':
        file = 'venues_by_time'
    elif col == 'userId':
        file = 'users_by_time'
    else:
        sys.exit('Error in col')

    if city_file_exists(city, file):
        df_out = load_city_file(city, file)
    else:
        checkins = get_checkins(city)
        # Group by the column, date, and hour. This creates a multiindex df
        df_out = pd.DataFrame(checkins.groupby([checkins['city'], checkins[col],
                                                pd.DatetimeIndex(checkins['dateTime']).date,
                                                pd.DatetimeIndex(checkins['dateTime']).hour]).tweetId.count())
        # this returns a df where the multiindex is put in columns in a normal df instead.
        df_out = df_out.reset_index()
        df_out.columns = ['city', col, 'date', 'hour', 'count']
        save_city_file(city, file, df_out)

    return df_out

def get_haversine_distance(pos1, pos2, r=6371*1000):
    # distance calculation adapted from http://stackoverflow.com/a/19414306/4632696
    # using this instead of vincenty() from geopy because it is much, much faster
    # r above is the radius of the Earth, usually in km; converted here to metres
    # (for miles, use 3958.75)
    # added np.around to round to the nearest metre

    pos1 = pos1 * np.pi / 180
    pos2 = pos2 * np.pi / 180
    cos_lat1 = np.cos(pos1[..., 0])
    cos_lat2 = np.cos(pos2[..., 0])
    cos_lat_d = np.cos(pos1[..., 0] - pos2[..., 0])
    cos_lon_d = np.cos(pos1[..., 1] - pos2[..., 1])

    return np.around(r * np.arccos(cos_lat_d - cos_lat1 * cos_lat2 * (1 - cos_lon_d)))
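For reference, the vectorised expression above is the spherical law of cosines
rearranged so that only cosines of the coordinate differences are needed; with
latitudes φ and longitudes λ in radians it computes the great-circle distance

    d = r · arccos( cos(φ_1 - φ_2) - cos φ_1 · cos φ_2 · (1 - cos(λ_1 - λ_2)) )

on a sphere of radius r.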

def get_checkins_by_hour(city):
    file = 'checkins_by_hour'

    if city_file_exists(city, file):
        df_out = load_city_file(city, file)
    else:
        checkins = get_checkins(city)
        # use venues_by_time or users_by_time as the df. The result is the same
        df_int = checkins.copy()
        df_int['dateTime'] = df_int.apply(
            lambda row: row['dateTime'].normalize() +
            pd.DateOffset(hours=row['dateTime'].hour), axis=1)

        df_int = df_int.groupby(df_int['dateTime']).count()

        date_range = pd.date_range(start=df_int.index.min(),
                                   end=df_int.index.max(),
                                   freq='H')

        df_out = pd.DataFrame(index=date_range)
        df_out = df_out.merge(df_int, how='left', left_index=True, right_index=True)
        df_out = df_out.fillna(value=0)
        df_out = pd.DataFrame(df_out['tweetId'])
        df_out.columns = ['count']

        save_city_file(city, file, df_out)

    return df_out

def get_users_by_venue(city):
    # used as input for the affinity matrix later. It gives information on the
    # number of checkins
    file = 'users_by_venue'

    if city_file_exists(city, file):
        df_out = load_city_file(city, file)
    else:
        checkins = get_checkins(city)
        # Group by the city, userId, venueId. This creates a multiindex df
        df_out = pd.DataFrame(checkins.groupby([checkins['city'],
                                                checkins['userId'],
                                                checkins['venueId']]).tweetId.count())
        # this returns a df where the multiindex is put in columns in a normal df instead.
        df_out = df_out.reset_index()
        df_out.columns = ['city', 'userId', 'venueId', 'count']

        save_city_file(city, file, df_out)

    return df_out

def get_users_dict(city):
    # Create users_dict for city
    file = 'users_dict'

    if city_file_exists(city, file):
        dict_out = load_city_file(city, file)
    else:
        users_by_venue = get_users_by_venue(city)
        dict_out = dict(zip(users_by_venue['userId'].unique(),
                            np.arange(len(users_by_venue['userId'].unique()))))
        save_city_file(city, file, dict_out)

    return dict_out

def create_distance_matrix(city):
    # Creates distance matrices for each city using venues in the df supplied.
    # Saves two files in the folder: a numpy array containing the distance
    # calculations, and a dictionary containing the keys and index numbers for
    # the venues.
    # Returns a dictionary with the city name as keys and dataframes as values.
    # It takes a really long time to complete the process.

    checkins = get_checkins(city)

    df_int = checkins.copy()
    df_int = df_int.iloc[df_int[['venueId']].drop_duplicates().index]
    df_int = df_int[['lat', 'lng', 'venueId']]

    print('Creating distance matrix for %s venues in %s' % (len(df_int), city))

    locations = df_int[['lat', 'lng']].values

    df_out = np.empty([len(locations), len(locations)])
    for i in range(len(locations)):
        if i % 100 == 0:
            print(i, end=' ')
        df_out[i] = get_haversine_distance(locations[i:i + 1, None], locations)

    df_out = pd.DataFrame(df_out, index=df_int['venueId'],
                          columns=df_int['venueId'])
    array = df_out.values

    print()
    print('Saving distance matrix..')
    # Saving the output - one venues_dict and one np array
    venues_dict = dict(zip(df_out.index, np.arange(len(df_out.index))))
    save_city_file(city, 'venues_dict', venues_dict)
    save_city_file(city, 'distances', array)

    return array

def get_distance_matrix(city):
    file = 'distances'

    if city_file_exists(city, file):
        array = load_city_file(city, file)
    else:
        create_distance_matrix(city)
        array = load_city_file(city, file)
    return array

def get_venues_dict(city):
    file = 'venues_dict'

    if city_file_exists(city, file):
        venues_dict = load_city_file(city, file)
    else:
        create_distance_matrix(city)
        venues_dict = load_city_file(city, file)
    return venues_dict

def get_venues_latlng(city):
    file = 'venues_latlng'

    if city_file_exists(city, file):
        venues_latlng = load_city_file(city, file)
        venues_latlng.index = venues_latlng['venueId']
    else:
        checkins = get_checkins(city)
        venues_dict = get_venues_dict(city)

        venues_latlng = checkins[['venueId', 'city', 'lat', 'lng']].drop_duplicates('venueId')
        venues_latlng['venueId'].replace(venues_dict, inplace=True)
        venues_latlng.index = venues_latlng['venueId']
        save_city_file(city, file, venues_latlng)

    return venues_latlng

9.2.6. Python script: utils.py

#!/usr/bin/python
# This script contains utility functions for saving and loading files, and
# contains information on the different variations used for the Livehoods
# method that can be retrieved by other scripts.
from __future__ import print_function
import pandas as pd
import numpy as np
import pickle
import os.path
from scipy import sparse
import json
from sqlalchemy import create_engine
import sys
sys.path.append('D:/Dissertation/')
sys.path.append('/home/ucfntka/dissertation/')

__author__ = 'Tai Tong KAM'

"""
get_set_info(set, field)
get_city_info(city, field)
save_city_file(city, file, data, local_sql = False, **kwargs)
load_city_file(city, file, **kwargs)
get_city_file_path(city, file, results_set=None, n_neighbors=10)
"""

def get_set_info(set, field):
    set_info = {'settest': {'name': 'settest',
                            'input_matrix': 'nearest_neighbors',
                            'metric': 'cosine',
                            'clustering': 'spectral'
                            },
                'set7': {'name': 'set7',
                         'input_matrix': 'nearest_neighbors',
                         'metric': 'cosine',
                         'clustering': 'spectral'
                         },
                'set8': {'name': 'set8',
                         'input_matrix': 'full_graph',
                         'metric': 'cosine',
                         'clustering': 'spectral'
                         },
                'set9': {'name': 'set9',
                         'input_matrix': 'nearest_neighbors',
                         'metric': 'euclidean',
                         'clustering': 'spectral'
                         },
                'set10': {'name': 'set10',
                          'input_matrix': 'full_graph',
                          'metric': 'euclidean',
                          'clustering': 'spectral'
                          },
                'set11': {'name': 'set11',
                          'input_matrix': 'nearest_neighbors',
                          'metric': 'jaccard',
                          'clustering': 'spectral'
                          },
                'set12': {'name': 'set12',
                          'input_matrix': 'full_graph',
                          'metric': 'jaccard',
                          'clustering': 'spectral'
                          },
                'set7_00': {'name': 'set7_00',
                            'input_matrix': 'nearest_neighbors',
                            'metric': 'cosine',
                            'clustering': 'spectral'
                            },
                'set7_01': {'name': 'set7_01',
                            'input_matrix': 'nearest_neighbors',
                            'metric': 'cosine',
                            'clustering': 'spectral'
                            },
                'set7_02': {'name': 'set7_02',
                            'input_matrix': 'nearest_neighbors',
                            'metric': 'cosine',
                            'clustering': 'spectral'
                            },
                'set7_03': {'name': 'set7_03',
                            'input_matrix': 'nearest_neighbors',
                            'metric': 'cosine',
                            'clustering': 'spectral'
                            },
                'set7_04': {'name': 'set7_04',
                            'input_matrix': 'nearest_neighbors',
                            'metric': 'cosine',
                            'clustering': 'spectral'
                            },
                'set7_05': {'name': 'set7_05',
                            'input_matrix': 'nearest_neighbors',
                            'metric': 'cosine',
                            'clustering': 'spectral'
                            },
                }

    return set_info[set][field]
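As a usage illustration (a minimal sketch; it assumes utils.py above is
importable):

from utils import get_set_info

get_set_info('set7', 'metric')         # -> 'cosine'
get_set_info('set10', 'input_matrix')  # -> 'full_graph'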

def get_city_info(city, field):
    city_info = {'London': {'dates': [20110406, 20110531],
                            'bounding_box': [-0.489, 51.28, 0.236, 51.686],
                            'loc': 'ldn'
                            },
                 'Singapore': {'dates': list(range(20150406, 20150431)) +
                                        list(range(20150501, 20150532)),
                               'bounding_box': [103.549467, 1.145502, 104.123447, 1.478481],
                               'loc': 'sg'
                               },
                 'New York': {'dates': list(range(20150406, 20150431)) +
                                       list(range(20150501, 20150532)),
                              'bounding_box': [-74.255641, 40.495865, -73.699793, 40.91533],
                              'loc': 'nyc'
                              },
                 'testcity': {
                     #'dates': [20150406, 20150408],
                     'dates': list(range(20150406, 20150431)) +
                              list(range(20150501, 20150532)),
                     'bounding_box': [-0.489, 51.28, 0.236, 51.686],
                     'loc': 'testcity'
                 },
                 }

    return city_info[city][field]

def save_city_file(city, file, data, local_sql=False, **kwargs):
    path = get_city_file_path(city, file, **kwargs)

    if file in ['checkins', 'venue_counts', 'user_counts',
                'venues_by_time', 'users_by_time', 'users_by_venue',
                'checkins_by_hour', 'venues_latlng']:
        data.to_json(path)
    elif file == 'distances_dict':
        pd.DataFrame.from_dict(data, orient='index').to_json(path)
    elif file == 'distances':
        np.save(path, data)
    elif file in ['users_dict', 'venues_dict', 'results_dict']:
        with open(path, 'wb') as fp:
            pickle.dump(data, fp)
    elif file == 'affinity_sparse':
        np.savez(path, data=data.data, indices=data.indices,
                 indptr=data.indptr, shape=data.shape)
    elif file == 'results_geojson':
        open(path, "wb").write(json.dumps(data).encode('utf-8'))
    elif file == 'results_sql':
        # create the connection string to the MySQL database
        # (credentials redacted)
        if local_sql:
            engine = create_engine('mysql://[username]:[password]@[host:port]/[tablename]?charset=utf8',
                                   encoding='utf-8')
        else:
            engine = create_engine('mysql://[username]:[password]@[host:port]/[tablename]?charset=utf8',
                                   encoding='utf-8')
        # send results to mysql
        # if an operational error occurs, it's probably because too much info is
        # being sent over; give a smaller chunksize
        data.to_sql(path, engine, if_exists='replace', chunksize=3000)

    print('Data saved to:', path)

def load_city_file(city, file, **kwargs):
    path = get_city_file_path(city, file, **kwargs)

    if file in ['checkins', 'venue_counts', 'user_counts',
                'venues_by_time', 'users_by_time', 'users_by_venue',
                'checkins_by_hour', 'venues_latlng']:
        data = pd.read_json(path)
    elif file == 'distances':
        data = np.load(path)
    elif file in ['users_dict', 'venues_dict', 'results_dict']:
        try:
            data = pickle.load(open(path, "rb"))
        except UnicodeDecodeError:
            data = pickle.load(open(path, "rb"), encoding='latin1')
    elif file == 'affinity_sparse':
        loader = np.load(path)
        data = sparse.csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                                 shape=loader['shape'])
    elif file == 'results_geojson':
        with open(path) as f:
            data = json.load(f)
    else:
        data = None
        print('No data')

    return data

def get_city_file_path(city, file, results_set=None, n_neighbors=10):
    dates = get_city_info(city, 'dates')
    suffix = '_' + str(dates[0]) + '_' + str(dates[-1]) + '.json'
    p_suffix = '_' + str(dates[0]) + '_' + str(dates[-1]) + '.p'
    npy_suffix = '_' + str(dates[0]) + '_' + str(dates[-1]) + '.npy'
    sparse_suffix = '_' + str(dates[0]) + '_' + str(dates[-1]) + '.npz'
    loc = get_city_info(city, 'loc')

    if file == 'checkins':
        path = '../_Data/' + file + suffix
    elif file in ['venue_counts', 'user_counts',
                  'venues_by_time', 'users_by_time', 'users_by_venue',
                  'checkins_by_hour', 'venues_latlng'
                  ]:
        path = '../_Data/' + loc + '_' + file + suffix
    elif file == 'distances':
        path = '../_Data/' + loc + '_distances' + npy_suffix
    elif file in ['users_dict', 'venues_dict']:
        path = '../_Data/' + loc + '_' + file + p_suffix
    elif file == 'boundaries':
        path = '../_Data/' + loc + '_boundaries.geojson'
    elif file == 'affinity_sparse':
        path = '../_Analysis/' + results_set + '/' + loc + '_csr' + sparse_suffix
    elif file == 'results_dict':
        path = '../_Analysis/' + results_set + '_' + loc + '_' + str(n_neighbors) + '_results.p'
    elif file == 'results_geojson':
        path = '../_Analysis/wamp/' + results_set + '_' + loc + '_' + str(n_neighbors) + '_results.geojson'
    elif file == 'results_sql':
        path = results_set + '_' + loc + '_' + str(n_neighbors)
    else:
        path = None

    return path

def city_file_exists(city, file, **kwargs):
    path = get_city_file_path(city, file, **kwargs)
    return(os.path.isfile(path))
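As a usage sketch of the caching pattern these helpers support (the dictionary
contents here are made up, and the ../_Data folder layout described in
get_city_file_path is assumed to exist):

from utils import save_city_file, load_city_file, city_file_exists

# made-up venue-id-to-index mapping, pickled on the first run and re-loaded
# from disk on subsequent runs
if not city_file_exists('London', 'venues_dict'):
    save_city_file('London', 'venues_dict', {'venueA': 0, 'venueB': 1})
venues_dict = load_city_file('London', 'venues_dict')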

9.3. Scripts for visualizing cluster results

These scripts contain functions used to format the results of the Livehoods
method and generate many of the figures and diagrams in this work. The
functions were executed in IPython notebooks.

9.3.1. Python script: formatresults.py

#!/usr/bin/python
# This script contains functions used to format the results of the Livehoods
# method in GeoJSON format, and to send the results to a MySQL database.
from __future__ import print_function
from datetime import datetime, timedelta
import sys
from shapely.geometry import mapping, Point, MultiPoint, shape, MultiPolygon
sys.path.append('D:/Dissertation/')
sys.path.append('/home/ucfntka/dissertation/')
from _PythonScripts.getdata import *

__author__ = 'Tai Tong KAM'

def format_results_for_viz(city, results_set, n_neighbors):
    results_dict = load_city_file(city, 'results_dict', results_set=results_set,
                                  n_neighbors=n_neighbors)

    feature_collection = create_results_geojson(results_dict)
    save_city_file(city, 'results_geojson', feature_collection, results_set=results_set,
                   n_neighbors=n_neighbors)

    add_properties_to_geojson(city, results_set, n_neighbors)

def create_results_geojson(results_dict):
    results = results_dict['results_df']
    # remove largest cluster from the results, because spectral clustering
    # usually returns a cluster that covers the entire area
    # results = results[results.label != results.groupby('label').count().lat.idxmax()]
    feature_collection = create_polygons_for_clusters(results)
    for key in ['city', 'n_neighbors', 'metric', 'n_clusters', 'n_clusters_chart',
                'input_matrix', 'results_set']:
        feature_collection[key] = str(results_dict[key])

    return feature_collection

def send_to_mysql(results_dict, city, results_set, n_neighbors, local_sql):
    venues_dict = get_venues_dict(city)
    venue_counts = get_value_counts(city, 'venueId')
    venues_info = pd.read_json('../_Data/venues/venues_info.json')

    # merge data
    results_for_sql = venue_counts.merge(venues_info, how='left', on='venueId')
    results_for_sql.index = results_for_sql['venueId'].replace(venues_dict)

    results = results_dict['results_df']
    results.drop(['lat', 'lng'], axis=1, inplace=True)
    results.columns = ['label']
    results_for_sql = results_for_sql.merge(results, left_index=True,
                                            right_index=True)

    save_city_file(city, 'results_sql', results_for_sql, local_sql=local_sql,
                   **{'results_set': results_set,
                      'n_neighbors': n_neighbors})

def create_polygons_for_clusters(results):
    """Formats the spectral clustering results in a form that can be saved in geoJSON"""
    features = []

    # shapely stuff adapted from
    # http://blog.thehumangeo.com/2014/05/12/drawing-boundaries-in-python/
    for label in results['label'].unique():

        subset = results[results['label'] == label]

        # Create shapely points from results (note that lng comes first)
        points = [Point(row[0], row[1]) for row in subset[['lng', 'lat']].values]

        # Instantiate a MultiPoint, then ask the MultiPoint for its envelope,
        # which is a Polygon.
        # convex_hull and buffer are used so that it is a smooth shape
        point_collection = MultiPoint(list(points))
        point_collection.envelope
        convex_hull_polygon = point_collection.convex_hull.buffer(0.0001)

        # Define the polygon feature
        fea = {'type': 'Feature',
               'geometry': mapping(convex_hull_polygon),
               'properties': {'id': str(label),
                              'label': str(label)}
               }
        features.append(fea)

    # create feature collection for geojson
    # geojson format from http://gis.stackexchange.com/a/41658
    feature_collection = {'type': 'FeatureCollection',
                          'features': features,
                          'crs': {'type': 'link'}
                          }

    return feature_collection
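To illustrate the hull-and-buffer step above (a minimal sketch with made-up
coordinates):

from shapely.geometry import MultiPoint, Point

# three made-up venue locations belonging to one cluster
points = [Point(-0.10, 51.500), Point(-0.11, 51.510), Point(-0.09, 51.505)]
hull = MultiPoint(points).convex_hull.buffer(0.0001)
hull.geom_type  # -> 'Polygon'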

def create_ordered_multi(feature_collection):
    """Convert feature collection to Shapely multipolygon sorted according to
    their label property"""
    ordered_features = [None] * len(feature_collection['features'])

    for feature in feature_collection['features']:
        label = feature['properties']['label']
        ordered_features[int(label)] = feature

    # creating multipolygon for shapely - adapted from
    # http://gis.stackexchange.com/a/70608
    ordered_multi = MultiPolygon([shape(feature['geometry']) for feature in ordered_features])

    return ordered_multi

def add_properties_to_geojson(city, results_set, n_neighbors):
    feature_collection = load_city_file(city, 'results_geojson', results_set=results_set,
                                        n_neighbors=n_neighbors)

    from pyproj import Proj

    # create the connection string to the MySQL database
    # engine = create_engine('mysql://[username]:[password]@[host]:[port]/[schemaname]', encoding='utf-8', echo=False)
    engine = create_engine('mysql://[username]:[password]@[host:port]/[tablename]?charset=utf8',
                           encoding='utf-8')
    # make the connection to the database
    conn = engine.raw_connection()

    dates = get_city_info(city, 'dates')

    q = ("SELECT * FROM checkins WHERE city = '" + city + "' AND dateTime BETWEEN '" +
         str(datetime.strptime(str(dates[0]), "%Y%m%d").date()) + "' AND '" +
         str(datetime.strptime(str(dates[-1]), "%Y%m%d").date() + timedelta(days=1)) + "'")

    checkins = pd.read_sql(q, conn)

    for feature in feature_collection['features']:
        polygon = shape(feature['geometry'])
        bounds = polygon.bounds
        label = feature['properties']['label']

        # Select venues that are within the polygon's bounding box from SQL
        q = ("SELECT * FROM venues_info WHERE lng > " + str(bounds[0]) +
             " AND lng < " + str(bounds[2]) + " AND lat > " + str(bounds[1]) +
             " AND lat < " + str(bounds[3]))
        q_result = pd.read_sql(q, conn)

        # Further filter for venues that are within the polygon itself.
        # 2 stages are used because .within is not fast
        for i, row in q_result.iterrows():
            if not (Point(row['lng'], row['lat']).within(polygon)):
                q_result.drop(i, inplace=True)

        # generate list of venues to use for filtering for checkins/tweets
        # within the polygon
        q_result_venues = list(q_result['urlId'].values) + list(q_result['venueId'].values)
        for_merge = q_result[['maincat', 'maincatId', 'subcat', 'subcatId', 'name', 'urlId']].copy()

        if dates == [20110406, 20110531]:
            for_merge['urlId'] = q_result['venueId']

        # create c, the df of checkins that occurred within the polygon
        c = checkins.copy()
        c['label'] = np.where(c['venueId'].isin(q_result_venues), label, None)
        c = c[c['label'] == label]
        c = c.merge(for_merge, how='left', left_on='venueId', right_on='urlId')

        # add properties to feature_collection
        feature['properties']['num_venues'] = str(len(c['urlId'].unique()))
        feature['properties']['num_checkins'] = str(len(c))
        feature['properties']['num_users'] = str(len(c['userId'].unique()))
        feature['properties']['venues'] = str(list(c['urlId'].unique()))

        checkins_by_categories_dict = dict(c.groupby('maincat').size().order(ascending=False))
        checkins_by_sub_categories = c.groupby(['maincat', 'subcat']).size()
        for k, v in checkins_by_categories_dict.items():
            checkins_by_categories_dict[k] = {'count': str(v),
                                              'checkins_by_sub_categories_dict':
                                                  dict(zip(checkins_by_sub_categories[k].index,
                                                           list(map(int, checkins_by_sub_categories[k].values))))}
        feature['properties']['checkins_by_categories'] = checkins_by_categories_dict

        venues_by_categories_dict = dict(c.drop_duplicates('venueId').groupby('maincat').size().order(ascending=False))
        venues_by_sub_categories = c.drop_duplicates('venueId').groupby(['maincat', 'subcat']).size()
        for k, v in venues_by_categories_dict.items():
            venues_by_categories_dict[k] = {'count': str(v),
                                            'venues_by_sub_categories_dict':
                                                dict(zip(venues_by_sub_categories[k].index,
                                                         list(map(int, venues_by_sub_categories[k].values))))}
        feature['properties']['venues_by_categories'] = venues_by_categories_dict

        users_by_categories_dict = dict(c.drop_duplicates('userId').groupby('maincat').size().order(ascending=False))
        users_by_sub_categories = c.drop_duplicates('userId').groupby(['maincat', 'subcat']).size()
        for k, v in users_by_categories_dict.items():
            users_by_categories_dict[k] = {'count': str(v),
                                           'users_by_sub_categories_dict':
                                               dict(zip(users_by_sub_categories[k].index,
                                                        list(map(int, users_by_sub_categories[k].values))))}
        feature['properties']['users_by_categories'] = users_by_categories_dict

        # calculate polygon area from http://stackoverflow.com/a/4683144
        lng, lat = zip(*list(polygon.exterior.coords))
        pa = Proj("+proj=aea")
        x, y = pa(lng, lat)
        cop = {"type": "Polygon", "coordinates": [zip(x, y)]}
        feature['properties']['area'] = round(shape(cop).area / 1000000, 4)  # convert to square km
        feature['properties']['label'] = label

    save_city_file(city, 'results_geojson', feature_collection, results_set=results_set,
                   n_neighbors=n_neighbors)

9.3.2. Python script: visualize_cluster_results.py

# This script contains functions for creating many of the figures found in this work.
__author__ = 'Tai Tong KAM'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.append('D:/Dissertation/')
from _PythonScripts.utils import load_city_file
#from utils import *

categories = ['Arts & Entertainment', 'College & University', 'Food', 'Nightlife Spot',
              'Residence', 'Outdoors & Recreation', 'Professional & Other Places',
              'Shop & Service', 'Travel & Transport']
colors = dict(zip(categories, sns.color_palette("hls", 10)))

def get_cluster_properties(city, results_set, n_neighbors, filter=False):
    feature_collection = load_city_file(city, 'results_geojson', results_set=results_set,
                                        n_neighbors=n_neighbors)
    properties = pd.DataFrame([feature['properties'] for feature in
                               feature_collection['features']])
    properties['checkins_per_sqkm'] = properties['num_checkins'].astype(float) / properties['area'].astype(float)
    properties['users_per_sqkm'] = properties['num_users'].astype(float) / properties['area'].astype(float)
    properties['venues_per_sqkm'] = properties['num_venues'].astype(float) / properties['area'].astype(float)
    properties['checkins_per_venue'] = properties['num_checkins'].astype(float) / properties['num_venues'].astype(float)
    properties['checkins_per_user'] = properties['num_checkins'].astype(float) / properties['num_users'].astype(float)
    properties['users_per_venue'] = properties['num_users'].astype(float) / properties['num_venues'].astype(float)
    if filter:
        properties.drop(properties['area'].argmax(), inplace=True)
    return properties

def get_city_properties_by_category(cluster_properties, category):
    data = get_all_clusters_df(cluster_properties, category=category)  # category = 'main' or 'sub'
    if category == 'main':
        data = data.groupby('category').sum()
        data['category'] = data.index
    elif category == 'sub':
        data = data.groupby('subcategory').sum()
        data['subcategory'] = data.index

    for prop in ['venues', 'checkins', 'users']:
        data[prop + '_perc'] = 100 * data[prop] / data[prop].sum()

    if category == 'main':
        data = data[['category', 'venues', 'checkins', 'users', 'venues_perc',
                     'checkins_perc', 'users_perc']]
    elif category == 'sub':
        data = data[['subcategory', 'venues', 'checkins', 'users', 'venues_perc',
                     'checkins_perc', 'users_perc']]

    return data

def get_cluster_counts_by_category_df(cluster_properties, prop, label):
    cluster_counts_by_category = pd.DataFrame.from_dict(
        cluster_properties[cluster_properties['label'] == str(label)][prop + '_by_categories'].values[0],
        orient='index')
    df = pd.DataFrame([(maincat, subcat, count) for maincat, row in
                       cluster_counts_by_category.iterrows()
                       for subcat, count in row[prop + '_by_sub_categories_dict'].items()])
    df.columns = ['category', 'subcategory', prop]
    return df

def get_all_clusters_df(cluster_properties, category):
    data = pd.DataFrame()
    columns = []
    for i, row in cluster_properties.iterrows():
        if category == 'sub':
            cluster_data_by_sub_category = get_cluster_df(cluster_properties, row['label'], category='sub')
        elif category == 'main':
            cluster_data_by_sub_category = get_cluster_df(cluster_properties, row['label'], category='main')
        cluster_data_by_sub_category['label'] = row['label']
        cluster_data_by_sub_category['area'] = row['area']
        cluster_data_by_sub_category['checkins_per_sqkm'] = cluster_data_by_sub_category['checkins'] / row['area']
        cluster_data_by_sub_category['users_per_sqkm'] = cluster_data_by_sub_category['users'] / row['area']
        cluster_data_by_sub_category['venues_per_sqkm'] = cluster_data_by_sub_category['venues'] / row['area']

        data = data.append(cluster_data_by_sub_category.values.tolist())
        columns = list(cluster_data_by_sub_category.columns)

    data.columns = columns
    data.index = np.arange(len(data))

    return data

def get_perc_diff_by_category_data(cluster_properties, category, prop):
    city_data = get_city_properties_by_category(cluster_properties, category)
    clusters_data = get_all_clusters_df(cluster_properties, category)

    # 'Event' was taken out from this list of categories because it doesn't
    # appear in the London data and causes an error
    categories = ['Arts & Entertainment', 'College & University', 'Food', 'Nightlife Spot',
                  'Residence', 'Outdoors & Recreation', 'Professional & Other Places',
                  'Shop & Service', 'Travel & Transport']

    for category in categories:
        for prop_perc in ['checkins_perc', 'venues_perc', 'users_perc']:
            clusters_data.ix[clusters_data.category == category, prop_perc + '_diff'] = \
                np.round((clusters_data.ix[clusters_data.category == category, prop_perc] -
                          city_data[prop_perc][category]) / city_data[prop_perc][category], 4) * 100

    clusters_data = clusters_data[['label', 'category', 'checkins_perc_diff',
                                   'venues_perc_diff', 'users_perc_diff']]

    data = clusters_data.pivot('label', 'category', prop + '_perc_diff').copy()

    return data
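In other words, for each category the function computes the relative difference
between a cluster's category share and the city-wide share, stated here for
reference:

    diff = 100 × (p_cluster - p_city) / p_city

where p is the percentage of checkins, venues, or users falling in that
category. Positive values mean the category is over-represented in the cluster
relative to the city as a whole.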

def get_cluster_df(cluster_properties, label, category):
    cluster_venue_counts_by_category_df = get_cluster_counts_by_category_df(cluster_properties, 'venues', label)
    cluster_checkin_counts_by_category_df = get_cluster_counts_by_category_df(cluster_properties, 'checkins', label)
    cluster_users_counts_by_category_df = get_cluster_counts_by_category_df(cluster_properties, 'users', label)
    data = cluster_venue_counts_by_category_df.merge(
        cluster_checkin_counts_by_category_df, on=['category', 'subcategory']).merge(
        cluster_users_counts_by_category_df, on=['category', 'subcategory'])

    if category == 'main':
        data = data.groupby('category').sum()
        data['category'] = data.index
    elif category == 'sub':
        data.index = data['subcategory']

    data['checkins_per_venue'] = np.round(data['checkins'] / data['venues'], 2)
    data['checkins_per_user'] = np.round(data['checkins'] / data['users'], 2)
    data['users_per_venue'] = np.round(data['users'] / data['venues'], 2)

    for prop in ['venues', 'checkins', 'users']:
        data[prop + '_perc'] = 100 * data[prop] / data[prop].sum()

    return data

def plot_overview(cluster_properties, prop, title=None):
    # options: checkins_per_sqkm, users_per_sqkm, venues_per_sqkm,
    # checkins_per_venue, checkins_per_user, users_per_venue,
    # area, num_checkins, num_users, num_venues
    data = cluster_properties
    data[prop] = data[prop].astype(float)
    data = data.sort(prop)
    ax = data.plot('label', prop,
                   kind='barh',
                   figsize=(5, len(data) / 3),
                   legend=False)
    ax.xaxis.tick_top()
    ax.set_ylabel('cluster')
    if title:
        plt.title(title, y=1.02)
    else:
        plt.title(' '.join(prop.split(sep='_')) + ' by cluster', y=1.02)
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() +
                 ax.get_yticklabels()):
        item.set_fontsize(13)

    return data.sort(prop, ascending=False)

def plot_indiv_cluster_by_subcategory(cluster_properties, label, prop, title=None,
                                      sortdata=True,
                                      categories=['Arts & Entertainment', 'College & University',
                                                  'Event', 'Food', 'Nightlife Spot', 'Residence',
                                                  'Outdoors & Recreation', 'Professional & Other Places',
                                                  'Shop & Service', 'Travel & Transport']):
    # options for prop: venues, checkins, users, checkins_per_venue,
    # checkins_per_user, users_per_venue
    data = get_cluster_df(cluster_properties, label, category='sub')
    data = data[data['category'].isin(categories)]
    if sortdata:
        data.sort(prop, inplace=True)

    ax = data.plot('subcategory', prop,
                   kind='barh',
                   color=[colors[i] for i in data['category']],
                   figsize=(10, len(data) / 3),
                   legend=False)
    ax.xaxis.tick_top()

    if title:
        plt.title(title, y=1.035)
    else:
        plt.title('number of ' + prop + " in cluster " + str(label) +
                  " by Foursquare's subcategories", y=1.035)
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() +
                 ax.get_yticklabels()):
        item.set_fontsize(13)

    plt.show()
    return data.sort(prop, ascending=False)

def plot_indiv_cluster_by_category(cluster_properties, label, prop, title=None,
                                   sortdata=True):
    data = get_cluster_df(cluster_properties, label, category='main')

    if sortdata:
        data.sort(prop, inplace=True)

    ax = data.plot('category', prop,
                   kind='barh',
                   color=[colors[i] for i in data['category']],
                   figsize=(10, len(data) / 3),
                   legend=False)
    ax.xaxis.tick_top()

    if title:
        plt.title(title, y=1.17)
    else:
        plt.title('number of ' + prop + " in cluster " + str(label) +
                  " by Foursquare's main categories", y=1.17)

    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() +
                 ax.get_yticklabels()):
        item.set_fontsize(13)

    plt.show()
    return data.sort(prop, ascending=False)

def plot_all_clusters_by_category(cluster_properties, prop, title=None,
                                  sortdata=True, categories=categories):
    data = get_all_clusters_df(cluster_properties, category='main')  # category = 'main' or 'sub'
    data = data[data['category'].isin(categories)]
    if sortdata == True:
        data.sort(prop, inplace=True)

    fig = plt.figure()
    ax = fig.add_subplot(111)
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() +
                 ax.get_yticklabels()):
        item.set_fontsize(13)
    ax.xaxis.tick_top()
    ax2 = ax.twinx()

    data.plot('label', prop,
              kind='barh',
              ax=ax,
              color=[colors[i] for i in data['category']],
              figsize=(10, len(data) / 3),
              legend=True)

    data.plot('category', prop,
              kind='barh',
              ax=ax2,
              color=[colors[i] for i in data['category']],
              figsize=(10, len(data) / 3),
              legend=True)

    if title:
        plt.title(title, y=1.035)
    else:
        plt.title('number of ' + prop + " in clusters by Foursquare's main categories: "
                  + ', '.join(categories), y=1.035)

    plt.show()
    return data.sort(prop, ascending=False)

def plot_all_clusters_by_subcategory(cluster_properties, prop, title=None,
                                     sortdata=True, categories=categories):
    data = get_all_clusters_df(cluster_properties, category='sub')  # category = 'main' or 'sub'
    data = data[data['category'].isin(categories)]
    if sortdata == True:
        data.sort(prop, inplace=True)

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.xaxis.tick_top()
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() +
                 ax.get_yticklabels()):
        item.set_fontsize(13)
    ax2 = ax.twinx()

    data.plot('label', prop,
              kind='barh',
              ax=ax,
              color=[colors[i] for i in data['category']],
              figsize=(10, len(data) / 3),
              legend=True)

    data.plot('subcategory', prop,
              kind='barh',
              ax=ax2,
              color=[colors[i] for i in data['category']],
              figsize=(10, len(data) / 3),
              legend=True)
    if title:
        plt.title(title, y=1.005)
    else:
        plt.title('number of ' + prop + " in clusters (label) by Foursquare's subcategories "
                  "in these main categories: " + ', '.join(categories), y=1.005)

    plt.show()
    return data.sort(prop, ascending=False)

def plot_city_properties_by_category(cluster_properties, prop, title=None,
                                     sortdata=False):
    data = get_city_properties_by_category(cluster_properties, 'main')
    if sortdata == True:
        data.sort(prop, inplace=True)

    fig = plt.figure()
    ax = fig.add_subplot(111)
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() +
                 ax.get_yticklabels()):
        item.set_fontsize(13)
    ax.xaxis.tick_top()

    data.plot('category', prop,
              kind='barh',
              ax=ax,
              color=[colors[i] for i in data['category']],
              figsize=(10, len(data) / 3),
              legend=False)

    if title:
        plt.title(title, y=1.1)
    else:
        plt.title(prop + " in city by Foursquare's main categories", y=1.1)

    plt.show()
    return data.sort(prop, ascending=False)

9.4. Scripts for comparing Lower Super Output Areas with Livehoods clusters in
terms of ethnic diversity

9.4.1. Python script: extract_ldn_lsoa.ipynb

# This script was used to extract Greater London LSOAs from the UK LSOAs.
# Before this, I downloaded the LSOA shapefiles from Ordnance Survey
# and used ogr2ogr to convert them to geojson in the right projection. See
# http://ben.balter.com/2013/06/26/how-to-convert-shapefiles-to-geojson-for-use-on-github/

import fiona
c = fiona.open('../_Data/ldn_lsoa_2011_shp/lsoa_2011.geojson', 'r')

import pandas as pd
# this file was downloaded from lsoa 2011 census data
lsoa_data = pd.read_excel('../_Data/lsoa-data.xls', skiprows=2)
london_lsoa_codes = lsoa_data['Codes']

import json
outpath = ('../_Data/ldn_lsoa_2011_shp/ldn_lsoa_2011.geojson')
crs = " ".join("+%s=%s" % (k, v) for k, v in c.crs.items())

features = []
for polygon in c:
    if polygon['properties']['LSOA11CD'] in london_lsoa_codes.values:
        fea = {'type': 'Feature',
               'geometry': polygon['geometry'],
               'properties': polygon['properties'],
               'id': polygon['properties']['LSOA11CD'],
               }
        features.append(fea)

# create feature collection for geojson
# geojson format from http://gis.stackexchange.com/a/41658
feature_collection = {'type': 'FeatureCollection',
                      'features': features,
                      'crs': {'type': 'name',
                              'properties': {
                                  'name': 'urn:ogc:def:crs:EPSG::4326'
                              }
                              }
                      }

# Save as GeoJSON
open(outpath, "wb").write(json.dumps(feature_collection).encode('utf-8'))
print('File saved at: ', outpath)

9.4.2. Python script: add_ethnic_diversity_to_geojson.ipynb

# This script adds ethnic diversity values to the Livehood clusters GeoJSON and
# the 2011 Greater London Lower Super Output Areas GeoJSON.

# this calculates the ethnic diversity value (Hirschman Index) for each Livehood
# cluster by using 2011 LSOA data.
# It looks for LSOAs that intersect the Livehoods cluster and uses data from
# these LSOAs.
# The result is saved in the geojson for visualization and analysis

import fiona
import pandas as pd
import json
from shapely.geometry import shape

results_geojson = fiona.open('../_Analysis/wamp/set7_ldn_10_results.geojson')
lsoa_geojson = fiona.open('../_Data/ldn_lsoa_2011_shp/ldn_lsoa_2011.geojson', 'r')
# this file was downloaded from lsoa 2011 census data
lsoa_data_eth = pd.read_csv('../_Data/lsoa_2011_data_eth.csv')

def get_hirschman_index(feature):
    import math
    cluster_pop_total = 0
    cluster_eth_white = 0
    cluster_eth_mixed_multi = 0
    cluster_eth_asian = 0
    cluster_eth_black = 0
    cluster_eth_others = 0
    cluster_eth_BAME = 0

    lsoa_intersect_str = feature['properties']['lsoa_intersect']
    lsoa_intersect = lsoa_intersect_str.replace("'", "").replace("[", "").replace("]", "").replace(" ", "").split(sep=",")

    for lsoa in lsoa_intersect:
        cluster_pop_total += lsoa_data_eth[lsoa_data_eth['codes'] == lsoa]['pop_total'].values[0]
        cluster_eth_white += lsoa_data_eth[lsoa_data_eth['codes'] == lsoa]['eth_white'].values[0]
        cluster_eth_mixed_multi += lsoa_data_eth[lsoa_data_eth['codes'] == lsoa]['eth_mixed_multi'].values[0]
        cluster_eth_asian += lsoa_data_eth[lsoa_data_eth['codes'] == lsoa]['eth_asian'].values[0]
        cluster_eth_black += lsoa_data_eth[lsoa_data_eth['codes'] == lsoa]['eth_black'].values[0]
        cluster_eth_others += lsoa_data_eth[lsoa_data_eth['codes'] == lsoa]['eth_others'].values[0]
        cluster_eth_BAME += lsoa_data_eth[lsoa_data_eth['codes'] == lsoa]['eth_BAME'].values[0]

    hirschman_index = 1 - (math.pow(cluster_eth_white / cluster_pop_total, 2) +
                           math.pow(cluster_eth_mixed_multi / cluster_pop_total, 2) +
                           math.pow(cluster_eth_asian / cluster_pop_total, 2) +
                           math.pow(cluster_eth_black / cluster_pop_total, 2) +
                           math.pow(cluster_eth_others / cluster_pop_total, 2) +
                           math.pow(cluster_eth_BAME / cluster_pop_total, 2)
                           )
    return hirschman_index
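For reference, the index computed above has the form

    HI = 1 - Σ_g (n_g / N)²

where n_g is one of the six group counts summed over the intersecting LSOAs and
N is the total population of those LSOAs; values closer to 1 indicate a more
even mix of groups.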

features = []
for feature in results_geojson:
    lsoa_intersect = []
    for lsoa in lsoa_geojson:
        if shape(lsoa['geometry']).intersects(shape(feature['geometry'])):
            lsoa_intersect.append(lsoa['properties']['LSOA11CD'])

    feature['properties']['lsoa_intersect'] = str(lsoa_intersect)
    features.append(feature)

for feature in features:
    hirschman_index = get_hirschman_index(feature)
    feature['properties']['eth_diversity_HI'] = str(round(hirschman_index * 100) / 100)
    print(feature['properties']['label'], hirschman_index)

feature_collection = {'type': 'FeatureCollection',
                      'features': features,
                      'crs': {'type': 'name',
                              'properties': {
                                  'name': 'urn:ogc:def:crs:EPSG::4326'
                              }
                              }
                      }

# Save as GeoJSON
outpath = ('../_Analysis/wamp/set7_ldn_10_results.geojson')
open(outpath, "wb").write(json.dumps(feature_collection).encode('utf-8'))
print('File saved at: ', outpath)

# this saves the ethnic diversity (Hirschman Index) measure in the lsoa geojson.
# the measure was precomputed in excel using excel formulas
import fiona
import pandas as pd
import json
from shapely.geometry import shape

lsoa_geojson = fiona.open('../_Data/ldn_lsoa_2011_shp/ldn_lsoa_2011.geojson', 'r')
# this file was downloaded from lsoa 2011 census data
lsoa_data_eth = pd.read_csv('../_Data/lsoa_2011_data_eth.csv')

features = []
for feature in lsoa_geojson:
    lsoa11cd = feature['properties']['LSOA11CD']
    hirschman_index = lsoa_data_eth[lsoa_data_eth['codes'] == lsoa11cd]['eth_HI'].values[0]
    feature['properties']['eth_diversity_HI'] = str(round(hirschman_index * 100) / 100)
    features.append(feature)

feature_collection = {'type': 'FeatureCollection',
                      'features': features,
                      'crs': {'type': 'name',
                              'properties': {
                                  'name': 'urn:ogc:def:crs:EPSG::4326'
                              }
                              }
                      }

# Save as GeoJSON
outpath = ('../_Data/ldn_lsoa_2011_shp/ldn_lsoa_2011.geojson')
open(outpath, "wb").write(json.dumps(feature_collection).encode('utf-8'))
print('File saved at: ', outpath)

9.4.3. Python script: stats_for_eth_diversity.ipynb

# This script combines the ethnic diversity information from the Livehoods
# clusters and the Greater London Lower Super Output Areas and saves them as a
# csv file.
import fiona
import pandas as pd
import json
from shapely.geometry import shape

results_geojson = fiona.open('../_Analysis/wamp/set7_ldn_10_results.geojson')
lsoa_geojson = fiona.open('../_Data/ldn_lsoa_2011_shp/ldn_lsoa_2011.geojson', 'r')
# this file was downloaded from lsoa 2011 census data
lsoa_data_eth = pd.read_csv('../_Data/lsoa_2011_data_eth.csv')

# create a pandas df that lists the cluster label, the ethnic diversity index
# for the cluster, the lsoas that intersect the cluster, and the average ethnic
# diversity index for these clusters
feature_eth_list = []
for feature in results_geojson:
    lsoa_intersect_str = feature['properties']['lsoa_intersect']
    lsoa_intersect = lsoa_intersect_str.replace("'", "").replace("[", "").replace("]", "").replace(" ", "").split(sep=",")
    #print(feature['properties']['label'], feature['properties']['eth_diversity_HI'], lsoa_intersect)
    feature_eth_list.append((feature['properties']['label'],
                             feature['properties']['eth_diversity_HI'], lsoa_intersect))

feature_eth_list = pd.DataFrame(feature_eth_list,
                                columns=['label', 'eth_HI', 'lsoa_intersect'])

for i, row in feature_eth_list.iterrows():
    list_HI = []
    for lsoa in row['lsoa_intersect']:
        list_HI.append(lsoa_data_eth[lsoa_data_eth['codes'] == lsoa]['eth_HI'].values[0])
    sum_HI = sum(list_HI)
    feature_eth_list.ix[i, 'avg_HI'] = round(sum_HI / len(row['lsoa_intersect']) * 100) / 100
    feature_eth_list.ix[i, 'min_HI'] = round(min(list_HI) * 100) / 100
    feature_eth_list.ix[i, 'max_HI'] = round(max(list_HI) * 100) / 100

sub_feature_eth_list = feature_eth_list[['label', 'eth_HI', 'avg_HI', 'min_HI', 'max_HI']].copy()
sub_feature_eth_list.sort('label', ascending=True, inplace=True)
for col in sub_feature_eth_list:
    if col == 'label':
        sub_feature_eth_list[col] = sub_feature_eth_list[col].astype(int)
    else:
        sub_feature_eth_list[col] = sub_feature_eth_list[col].astype(float)
sub_feature_eth_list.to_csv('ethnic_diversity_table.csv', index=False)

9.4.4. R script: ethnic_diversity_chart.R

# This script was used to create the Hirschman concentration index chart for
# ethnic diversity
library(ggplot2)
library(reshape2)
setwd("D:/Dissertation/_Analysis")
dat <- read.csv('ethnic_diversity_table.csv')
c <- ggplot(dat, aes(x=eth_HI, y=label))
c + geom_segment(aes(y=label,
                     x=min_HI,
                     yend=label,
                     xend=max_HI),
                 colour="grey50",
                 linetype='dashed') +
  geom_point(size=3, shape=4) +
  geom_point(aes(x=avg_HI, y=label), size=3, shape=3) +
  geom_point(aes(x=min_HI, y=label), size=3, shape=124) +
  geom_point(aes(x=max_HI, y=label), size=3, shape=124) +
  scale_y_continuous(breaks=seq(0,71,1)) +
  scale_x_continuous(breaks=seq(0,0.8,0.1)) +
  theme_bw() +
  theme(panel.grid.major.y=element_blank()) +
  labs(x="value", y="Cluster")

9.5. Livehood clusters for nearest neighbours parameter m=5 to m=20

The diagrams below depict the clusters generated using the Livehoods method with
k = 100, α = 0.01 and m varying from 5 to 20.

[Figures: sixteen cluster maps, one for each value of m from 5 to 20; the map
images are not reproduced here.]
9.6. Largest cluster generated from Livehoods method

The diagram below depicts the largest cluster generated using the Livehoods
method with k = 100, α = 0.01 and m = 10.

[Figure: map of the largest cluster.
Black shaded area: largest cluster generated using the Livehoods method.
Black line: Greater London administrative boundary.]