ABSTRACT
The widespread use of smartphones and social media has opened opportunities for
researchers to define one of the most elusive concepts in cities: neighbourhoods. While
the number of neighbourhood detection methods using location based social media has
increased in recent years, there is much that we do not know about the process. For
example, researchers have rarely integrated the neighbourhoods detected with
administrative data to add meaning beyond what can be inferred from social media.
This work takes a step towards better understanding neighbourhood detection methods,
and also attempts to add meaning to the clusters / neighbourhoods generated by
incorporating administrative data into them.
I break down the neighbourhood detection process into three common elements: (a) the
unit used for aggregation, (b) the type of clustering method used, and (c) the similarity
measure.
I then illustrate one way of better understanding the neighbourhood detection process by
applying multiple variations of the Livehoods method (Cranshaw et al., 2012) on data
from Greater London, and find that in addition to neighbourhood clusters, the
Livehoods method may also be able to generate clusters that depict the city's boundaries
from the residents' perspective.
I also make a preliminary attempt in this work to combine the clusters / neighbourhoods
formed using the Livehoods method with data from London's Lower Super Output
Areas to investigate ethnic diversity in neighbourhoods. I find that location
based social media may generate neighbourhood boundaries that are more appropriate
than, or can complement, traditional administrative boundaries in studies where the
definition of a neighbourhood goes beyond arbitrary administrative units and a
multifaceted view of neighbourhoods is needed.
DECLARATION
I, Tai Tong Kam, hereby declare that this dissertation is all my original work and that all
sources have been acknowledged. It is 10,169 words in length.
Signature
====================
Date: 28th August 2015
TABLE OF CONTENTS
1.2. Overview ........................................................................................................... 10
2. INTRODUCTION .................................................................................................... 12
2.1. Neighbourhoods ................................................................................................ 12
3. METHODOLOGY ................................................................................................... 25
4.6. Summary ........................................................................................................... 36
7. CONCLUSION ......................................................................................................... 59
8. BIBLIOGRAPHY .................................................................................................... 64
9. APPENDIX .............................................................................................................. 67
9.4. Scripts for comparing Lower Super Output Areas with Livehoods clusters in
terms of ethnic diversity ..............................................................................................138
LIST OF FIGURES
Figure 1: Relationship between number of smallest eigenvalues (k) found and number of
clusters formed ................................................................................................................. 32
Figure 2: Boundaries formed for different number of clusters .............................................. 33
Figure 3: Boundaries formed for different alpha constants ................................................... 34
Figure 4: Boundaries formed for different nearest neighbours parameter (m) ....................... 35
Figure 5: Clustering results for London ................................................................................ 40
Figure 6: Properties of Livehood clusters ............................................................................. 44
Figure 7: Overall distribution of venues and checkins across clusters .................................... 47
Figure 8: Hirschman concentration index (HI) for clusters..................................................... 56
LIST OF TABLES
Table 1: Summary statistics for cluster results for London .................................................... 41
Table 2: Percentage difference between proportion of venues within cluster and proportion of
venues within city, in terms of Foursquare's main categories ............................................... 50
Table 3: Percentage difference between proportion of users within cluster checking-in and
proportion of users within city checking-in, in terms of Foursquare's main categories ............ 52
ACKNOWLEDGMENTS
I would like to thank my supervisors, Steven Gray and Elsa Arcaute, who have been
extremely supportive and helpful throughout the dissertation process. Steven was also
instrumental in helping me process the data by guiding me on the process for setting up
the cloud computing infrastructure required to run the time-consuming scripts in parallel.
Elsa also introduced me to Anastasios Noulas from the University of
Cambridge, who kindly provided the Foursquare data used in this work.
I would also like to thank all the teachers, staff and fellow course mates at CASA, who
have given me a great year of friendship, learning and joy in my time at CASA and
inspired me to do better.
Finally, I would like to thank my partner Cherlyn Ng, whose love, patience and support
made it possible for me to focus on my work while we were 6,740 miles apart.
In this dissertation, I illustrate one way of doing this by applying multiple variations
of the Livehoods method (Cranshaw et al., 2012) on data from Greater London. The
Livehoods method was chosen as it is a venues-based approach which has not been
used as much in the literature. In addition, it has not yet been applied to the Greater
London area.
As mentioned above, we do not yet understand how the clusters / neighbourhoods
detected via neighbourhood detection methods can be combined with data from
administrative boundaries to help us better understand cities. Such integration is
rare in the neighbourhood detection literature, as most researchers have used
neighbourhood detection methods to develop recommendation engines that find
similar places based on social media activity. As
such, I make a preliminary attempt in this work to combine the clusters /
neighbourhoods formed using the Livehoods method with data from more
traditional administrative boundaries (the Lower Super Output Areas in this case) to
extend the meaningfulness of the clusters / neighbourhoods formed. In particular, I
have tried to integrate ethnic diversity data with the clusters / neighbourhoods
formed using the Livehoods method.
As neighbourhood detection using location based social media is relatively new and
there are few comparisons between existing neighbourhood detection methods, this
work does not aim to evaluate whether one method, or even particular elements
of a method, is better than another. Neighbourhood detection is a form of
clustering, and determining the best clustering method has a certain degree of
subjectivity.
1.2. Overview
The dissertation is divided into seven sections.
Section Two discusses the concept of neighbourhoods, its importance for
understanding cities and why social media is a useful source of data for defining
neighbourhoods. I will review the methods that have so far been used for defining
neighbourhoods and three common elements used by these methods: (a) the unit used
for aggregation, (b) the type of clustering method used, and (c) the similarity
measure used. I will then describe what we have learnt so far about neighbourhood
detection using location based social media, and outline some ideas for better
understanding these methods.
Sections Three to Six illustrate one way we can better understand neighbourhood
detection methods by taking a closer look at the Livehoods method (Cranshaw et al.,
2012). Section Three begins by describing the data and methodology used.
Section Four then considers different variations of Cranshaw et al.'s (2012)
Livehoods method for neighbourhood detection and tests three different parameters
to find out if changing them affects the clustering results.
Section Five describes the clusters / neighbourhoods that are formed using the
Livehoods method and explores some types of information that can be derived from
these clusters, by combining the clusters with Foursquares venues database.
Section Six describes the clusters / neighbourhoods that are formed using the
Livehoods method by combining them with data from Lower Super Output Areas
(LSOAs) in Greater London. It discusses the issue of the modifiable areal unit
problem (Openshaw, 1984) and how the clusters / neighbourhoods
formed using the Livehoods method may be more appropriate than traditional
administrative boundaries such as the LSOAs.
Section Seven consists of concluding remarks and outlines some ideas for further
research that can help us better understand neighbourhood detection methods using
location based social media.
2. INTRODUCTION
2.1. Neighbourhoods
Neighbourhoods are a ubiquitous feature of urban living: everyone lives in a
neighbourhood. Many groups have an interest in understanding neighbourhoods.
Cranshaw and Yano (2010) note that analysing neighbourhoods is of interest to
businesses such as realtors and developers as the quality of a neighbourhood
affects the value of their assets, and to researchers in the social sciences as they seek
to understand neighbourhood and community level factors that influence
phenomena such as obesity rates and perceived happiness through neighbourhood
effects (Sampson et al., 2002). A third group that has an interest in neighbourhoods
are city governments that implement neighbourhood interventions and wish to
identify where the interventions would make sense and be most effective. Being
able to identify neighbourhoods in our cities would be valuable to all three groups.
While there is a general consensus that a neighbourhood is "a contiguous
geographic area within a larger city, limited in size, and somewhat homogeneous in
its characteristics" (Weiss et al., 2007), it is hard to pin down a more exact definition
(Chaskin, 1998; Weiss et al., 2007). Researchers have defined neighbourhoods in
terms of three dimensions, with varying emphasis: social ties, physical demarcations
and residents' experiences (Chaskin, 1997). These are influenced by many factors
such as administrative boundaries, manmade features such as roads, natural features
such as rivers, demographics, social networks of the people that live in or frequent
the area, and the availability of services and facilities (Cranshaw and Yano, 2010).
Each person's perception of their neighbourhood boundaries may differ, even from
their neighbours', and these perceptions may also differ from the official boundaries
used by city governments for urban planning or neighbourhood initiatives
(Campbell et al., 2009). However, researchers have also found evidence that
residents often identify a common core within their neighbourhood, and the
differences are about the boundaries where neighbourhoods begin and end
(Campbell et al., 2009).
Neighbourhoods differ from communities, in the sense that neighbourhoods are tied
to a spatial unit with boundaries, while communities are not limited to spatial units.
This difference is reflected in how the role of neighbourhoods in cities has shifted
over time. To summarize Chaskin (1997), neighbourhoods in the past were tied
closely to the idea of community. There were close ties between those living within
a neighbourhood and a strong sense of identity, akin to an urban village. However,
as transportation systems improved and communication over long distances became
available, ties within a neighbourhood have become less close and more functional,
providing a space where neighbours share information, aid and services. When
studying social ties within neighbourhoods, it may be useful to look at common
social and functional activities between those living in a neighbourhood and where
these activities take place. These may give an indication of places that are
considered part of the neighbourhood for those involved in the activities.
Traditionally, studies on neighbourhoods and the neighbourhood effect have used
boundaries where data was easily available, such as administrative and political
boundaries. The data is often reliable as it is typically collected by government
agencies, and the boundaries used usually do not change greatly. Such data is useful
for understanding long term trends and behaviours such as demographics and
urbanisation. However, these traditional data sources are usually collected at certain
periods with long intervals between each period. The data collected represents
snapshots at particular points in time, and does not capture the multiple changes that
may occur in between data collection periods. For example, full censuses
in the United Kingdom take place once every ten years. In addition, data from
traditional sources is often expensive and time-consuming to collect. Such issues
mean that data from traditional sources is less suitable for studying trends and
behaviours that are short term in nature or change frequently, such as commuting
behaviour during transport strikes or riots. For studying
more short term and dynamic trends and behaviours, location based social media is
likely to be a more suitable data source.
the data publicly, which further limits the amount of data available for analysis.
Another factor to consider is that users may curate the types of places that they
check in at on location based social media. Places that are considered more
socially desirable may be over-represented in data from location based social
media. For example, people may be more likely to check in when eating
at a new fancy restaurant or shopping in a branded goods store rather than when
they are eating at a fast food restaurant or shopping in a discount store. This means
that conclusions based on data from location based social media will likely be
biased towards such socially desirable venues. In the case of neighbourhood
detection, the clusters / neighbourhoods formed may be similarly biased. Previous
research has shown that users have been more likely to check-in at venues
concerning travel and transport, office buildings, and residences (Preotiuc-Pietro
and Cohn, 2013). Despite these limitations, researchers believe that data from
location based social media can still be valuable for its rich contextual information
and the sheer volume available (Silva et al., 2013).
2.3. Review of Methods for Neighbourhood Detection
What follows is a review of neighbourhood detection methods using location-based
social media. Neighbourhood detection using location based social media is
typically treated as a clustering problem, and the methods used so far reflect this
paradigm. Essentially, researchers wish to cluster users' social media activities into
contiguous geographic areas based on certain measures of similarity.
Neighbourhood detection methods usually contain three elements:
a. The unit used for aggregation (e.g. grid-based, venue-based)
b. The type of clustering method (e.g. K-Means clustering, spectral
clustering)
c. The similarity measure
develop methods for neighbourhood detection. Venues that are considered similar to
each other and fulfil a proximity criterion, such as being within a certain distance
of each other, are then grouped together, and the area bounded by these venues
forms a neighbourhood. The proximity criterion is important as it defines the
geographic aspect of the venues. It is similar to how defining the size and shape of
the grids in the grid-based approach determines how the grids are geographically
related to each other. One of the earliest attempts at neighbourhood detection using
location based social media is called Livehoods (Cranshaw et al., 2012) and this
took the venues-based approach. Zhang et al. (2013) pointed out that one of the
weaknesses of the venues-based approach is that the neighbourhoods formed have to
be geographically tied to the network of venues used, whereas the grid-based
approach does not.
Clustering methods
Clustering methods used in neighbourhood detection are a reflection of the breadth
and variety of clustering methods used in other fields. This dissertation does not
seek to determine which clustering methods are best for neighbourhood detection
using location based social media, since there is a certain degree of subjectivity
involved. So far, neighbourhood detection methods have included
clustering methods such as K-Means clustering (Del Bimbo et al., 2014), spectral
clustering (Cranshaw et al., 2012; Noulas et al., 2011), and topic-based modelling
(Cranshaw and Yano, 2010). Each clustering method requires the researcher to
choose parameters, such as the number of topics in topic-based modelling or the
number of clusters in K-Means clustering.
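To make these parameter choices concrete, the sketch below clusters a small made-up feature matrix with scikit-learn's K-Means and spectral clustering implementations (scikit-learn is my choice of library for illustration, not necessarily what the cited studies used). In both calls the researcher must fix the number of clusters up front.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

# Toy feature matrix: one row per spatial unit (grid square or venue),
# drawn as two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# The number of clusters is a parameter the researcher must choose.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=5, random_state=0).fit(X)
print(km.labels_)
print(sc.labels_)
```

Changing `n_clusters` (or, for spectral clustering, the affinity construction) changes the partition returned, which is why such choices have to be reported alongside the results.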
Similarity measures
A variety of similarity measures have been used in neighbourhood detection. In
terms of properties to include in the similarity measure, researchers have used
properties related to users, such as the users' check-in patterns and interests (Del
Bimbo et al., 2014). Researchers have also used properties related to venues in the
databases of location based social media platforms, such as the distribution of
Foursquare venue categories nearby and the number of check-ins at these venues
(Noulas et al., 2011). Other researchers have combined the above mentioned
properties with temporal properties to provide a contextually richer set of properties
to calculate similarity (Falher et al., 2015; Zhang et al., 2013). Different properties
characterise neighbourhoods in different ways, making them useful for different
purposes. Amongst the three dimensions of neighbourhoods mentioned earlier
(social ties, physical demarcations and residents' experiences), methods in
neighbourhood detection using location based social media have typically used
properties related to residents' experiences, for example the number of check-ins,
the temporal pattern of check-ins, and the type and number of venues in the area.
Cosine similarity measures similarity as the angle between two vectors (Xia et al.,
2015). In neighbourhood detection methods, these vectors represent the properties of
the grid and of the venues in the grid-based method and the venues-based method
respectively. Cosine similarity is often used for clustering in neighbourhood
detection with location based social media, and often preferred over other similarity
measures because cosine similarity does not take the magnitude of the vectors into
account. This is useful in cases where the magnitudes of the vectors differ greatly
but are less important for determining similarity. For example, cosine
similarity is often used in information retrieval to determine document similarity as
the relative frequency of words in each document and across documents is more
important than the total number of words in a document (Huang, 2008). Similarly,
the magnitudes of the vectors used in neighbourhood detection can differ greatly. The most
popular venues often garner many more check-ins than those less popular and the
most active users check-in much more frequently than those who are less active
(Scellato and Mascolo, 2011). As such, researchers have found that relative
frequencies between venues/grid squares are more useful for neighbourhood
detection rather than absolute numbers, and prefer cosine similarity measures over
Euclidean distance measures when measuring similarity for neighbourhood
detection (Cranshaw et al., 2012; Preotiuc-Pietro et al., 2013).
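The contrast can be illustrated with a small sketch (the check-in vectors here are invented): two venues with the same relative visitor pattern but very different popularity are identical under cosine similarity, while Euclidean distance is dominated by the popularity gap.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two check-in vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical check-in vectors: entry i is the number of check-ins
# by user i at the venue. The two cafes share the same *relative*
# pattern of visitors, but one is ten times more popular.
quiet_cafe = np.array([2.0, 1.0, 0.0, 1.0])
busy_cafe  = np.array([20.0, 10.0, 0.0, 10.0])
station    = np.array([0.0, 0.0, 5.0, 0.0])

# Cosine similarity ignores magnitude: the two cafes are identical,
# and the cafe and the station share no visitors at all.
print(cosine_similarity(quiet_cafe, busy_cafe))   # 1.0
print(cosine_similarity(quiet_cafe, station))     # 0.0

# Euclidean distance instead reflects the popularity difference.
print(np.linalg.norm(quiet_cafe - busy_cafe))
```

This is exactly the property described above: the relative frequencies of check-ins drive the similarity, not the absolute counts.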
Researchers use different combinations of the three elements (unit used for
aggregation, clustering method, similarity measure) of neighbourhood detection to
create neighbourhoods, depending on their research purpose. Within each element,
researchers have also had to make decisions that influence the eventual
neighbourhoods formed. Most of the research so far seeks to compare urban
neighbourhoods within and across cities so that recommendation engines can make
better recommendations based on criteria such as the user's check-in patterns,
preferred venue categories and interests. Their goal is to suggest
new places that the user may wish to visit, which are similar to places the user has
visited in the past.
A typical example of a neighbourhood detection method for recommendation
engines comes from Noulas et al. (2011). They take a grid-based approach and use a
spectral clustering algorithm to cluster grid squares based on the distribution of
Foursquare venue categories nearby and the number of check-ins at these venues.
The method creates neighbourhoods that give us an idea of what type of places are
similarity of these venues on the number of check-ins and unique users as well as
the temporal distribution of the check-ins, they also take into account the
distribution of Foursquare venues in the surrounding area.
Cranshaw and Yano (2010) provided a different perspective by treating the question
as an issue of latent topic discovery. They divided the city into grids and applied
topic-based modelling to the grids, using each grid as a document and each
Foursquare category tag as a word. With this method, they were able to identify
clusters of places and activities that often appeared together (e.g. beach and seafood).
While research on neighbourhood detection using location based social media has
flourished, there is less research available on understanding whether these methods
accurately reflect neighbourhoods in reality, and how they can contribute to
purposes other than recommending new places that users may wish to visit.
Researchers using the Livehoods algorithm attempted to validate the
neighbourhoods generated through their algorithm (Cranshaw et al., 2012). The
neighbourhoods identified by Cranshaw et al.'s algorithm included neighbourhoods
that corresponded with municipal boundaries, those that were subsets of municipal
boundaries and those that spilled over to more than one municipal boundary.
Cranshaw et al. interviewed 27 residents of the city and found that the
neighbourhoods generated by their Livehoods method closely matched the residents'
perspectives of neighbourhoods in the city. Cranshaw et al.'s research provides
evidence that the boundaries generated by neighbourhood detection algorithms can
capture local dynamics that include factors such as municipal boundaries,
demographics, traffic flow and economic development.
Some researchers have argued that including more properties in the similarity
measures would better characterise the units being aggregated and produce clusters
that more closely match actual neighbourhoods. For example, Del Bimbo et al. (2014)
use both static features (e.g. categories assigned by location based social networks)
and dynamic features (e.g. distribution of the interests of the people who check in at
venues) in their LiveCities method to create neighbourhoods for Florence, which
they then validated qualitatively through online questionnaires with 28 residents.
They found that including both types of features produces neighbourhoods that better
reflect the residents' perceptions.
There is much that we do not know about the methods used in the neighbourhood
detection process with location based social media. For example, we do not know
how the neighbourhoods detected compare with traditional administrative
boundaries, and how we can combine the neighbourhoods detected with data from
these administrative boundaries to help us better understand cities dynamically. We
also do not know how the neighbourhoods detected may change when data over
different time periods or different time intervals are used and what these changes
may mean.
Better understanding can come in the form of research on particular elements in the
neighbourhood detection process across a variety of methods and comparing the
differences when different elements are used. It can also come in the form of better
understanding a particular method in depth and exploring how the neighbourhoods
formed are different depending on the parameters used. In this dissertation, I look at
the Livehoods method in depth by applying variations of the method on data
collected on Greater London. The Livehoods method was chosen as it is a venues-based approach which has not been used as much in the literature. It is also one of
the rare methods in the neighbourhood detection literature that has validated the
clusters / neighbourhoods generated with the city's residents and found strong
support that the residents' perceptions agreed with the clusters formed. This gives it
legitimacy in being able to detect actual neighbourhoods compared to other
neighbourhood detection methods. In addition, it has not yet been applied to the
Greater London area.
3. METHODOLOGY
Python was used for most of the analysis and visualization in this work. IPython
notebooks were used for early exploration and experimentation with the data and
Python scripts were written in the later stages to run the neighbourhood detection
method. All scripts used for this work can be found in the appendix section.
3.1. Data sources
The data used for analysis consists of 42,581 Foursquare check-ins at 8,845 venues
by 12,397 unique users in the Greater London area from 6th April 2011 to 31st May
2011. This data was kindly provided by Anastasios Noulas from the University of
Cambridge. For each check-in, the data consists of the user ID, the time, the latitude
and longitude, and the venue ID. Further information on the venues was collected
using the Python package foursquare. This included information on the venue's
name, category and subcategory (as categorized by the social media network
Foursquare).
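A sketch of how such venue details might be retrieved is shown below. The client calls and the response layout used here are assumptions based on the foursquare package's typical usage and should be checked against the live API; the demonstration therefore runs offline on a mocked response.

```python
# Sketch of enriching check-in records with venue details via the
# `foursquare` Python package. The response layout assumed here
# (a "venue" dict with "name" and "categories") is an assumption.

def venue_summary(response):
    """Extract the venue name and category names from a venue response."""
    venue = response["venue"]
    categories = [c["name"] for c in venue.get("categories", [])]
    return {"name": venue["name"], "categories": categories}

# Live call (requires credentials and network access):
# import foursquare
# client = foursquare.Foursquare(client_id="...", client_secret="...")
# summary = venue_summary(client.venues(VENUE_ID))

# Offline demonstration with a mocked response:
mock = {"venue": {"name": "Example Cafe",
                  "categories": [{"name": "Coffee Shop"}]}}
print(venue_summary(mock))
```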
Data was also collected from 6th April 2015 to 31st May 2015 for three cities:
London, Singapore and New York City. The Python package tweepy was used to
collect data from Twitters streaming API, which offers samples of the data being
posted on Twitter in real time. A subset of this data consists of Foursquare check-ins
from users who have linked their Foursquare accounts to their Twitter accounts such
that their Foursquare check-ins also appear as tweets on Twitter. The scripts for
collecting this data and formatting them for analysis are also included in the
appendix. While this data was eventually not used in the analysis for this work,
future work could compare the results generated across the three different cities, or
the results generated from two different time periods in London.
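One way such cross-posted check-ins might be picked out of the stream is sketched below. Treating a 4sq.com or swarmapp.com link as the marker of a Foursquare check-in, and the tweet layout used, are assumptions; the tweepy wiring is indicated only in comments since it needs credentials.

```python
# A minimal sketch of filtering a Twitter stream for Foursquare
# check-ins. The URL-based marker and tweet layout are assumptions.

def is_foursquare_checkin(status):
    """True if any URL attached to the tweet points at Foursquare."""
    urls = status.get("entities", {}).get("urls", [])
    return any("4sq.com" in u.get("expanded_url", "") or
               "swarmapp.com" in u.get("expanded_url", "")
               for u in urls)

# With tweepy (credentials omitted), this predicate would sit inside
# a stream listener's handler, e.g.:
# import tweepy
# class CheckinListener(tweepy.StreamListener):
#     def on_status(self, status):
#         if is_foursquare_checkin(status._json):
#             ...  # write the check-in to disk

tweet = {"text": "I'm at Example Cafe",
         "entities": {"urls": [{"expanded_url": "https://4sq.com/abc"}]}}
print(is_foursquare_checkin(tweet))  # True
```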
neighbourhoods are areas that a similar set of people frequent: the more often the
same people go to the same venues, the more likely these venues are in the same
neighbourhood. To validate this method, Cranshaw et al (2012) had conducted
qualitative interviews with residents in their study area and verified that the
neighbourhoods generated by their method closely matched the residents'
perspectives of neighbourhoods in the city.
Specifically, I applied the following steps from Cranshaw et al (2012) to generate
the affinity matrices used in the spectral clustering algorithm:
1. Given the following sets:
a. Set V, a set of n_v Foursquare venues, for which we can compute a
geographic distance d(v_i, v_j) between the venues given their latitude
and longitude coordinates.
b. Set U, a set of n_u Foursquare users
c. Set C, a set of check-ins by users in U at the venues in V
Each venue v in V is then represented by an n_u-dimensional vector
c_v = (c_{v,u_1}, ..., c_{v,u_{n_u}}), where c_{v,u} is the number of
check-ins made by user u at venue v.
2. The affinity matrix A used in the spectral clustering algorithm is built
from the social similarity s(v_i, v_j) (the cosine similarity of the venues'
check-in vectors) and the set N_m(v_i) of the m geographically nearest
venues to v_i:
A_{ij} = s(v_i, v_j) + α, if v_j ∈ N_m(v_i)
A_{ij} = 0, otherwise
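The affinity construction described in the steps above can be sketched as follows. This is a simplified illustration, not the scripts used in this work: it assumes cosine similarity for s(v_i, v_j), uses plain Euclidean distance on the coordinates in place of true geographic distance, and assumes α enters additively for neighbouring venues.

```python
import numpy as np

def build_affinity(checkins, coords, m, alpha):
    """Affinity matrix in the spirit of the Livehoods construction:
    cosine similarity of check-in vectors plus a constant alpha,
    restricted to each venue's m nearest neighbours, then symmetrised
    for spectral clustering."""
    n = len(coords)
    norms = np.linalg.norm(checkins, axis=1)
    cos = checkins @ checkins.T / np.outer(norms, norms)
    # Pairwise Euclidean distances (stand-in for geographic distance).
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    A = np.zeros((n, n))
    for i in range(n):
        # m nearest neighbours of venue i, excluding itself.
        nbrs = np.argsort(dist[i])[1:m + 1]
        A[i, nbrs] = cos[i, nbrs] + alpha
    return np.maximum(A, A.T)

# Tiny made-up example: 4 venues, 3 users.
checkins = np.array([[3., 1., 0.],
                     [2., 1., 0.],
                     [0., 0., 4.],
                     [0., 1., 5.]])
coords = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
A = build_affinity(checkins, coords, m=1, alpha=0.01)
print(A)
```

The resulting matrix is sparse away from each venue's neighbourhood, which is what keeps the clusters geographically contiguous.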
The clusters nearer to the edges of the city tend to remain large and unbroken.
Generally, the clusters formed nearer the edge of the city are larger than the clusters
formed nearer the centre of the city. This phenomenon is likely because the density
of venues further from the centre of the city is much lower than the density of
venues nearer the centre of the city. Since the Livehoods method uses a nearest
neighbours criterion for identifying adjacent venues, areas where venues are less
dense will cover larger areas when searching for adjacent venues and result in the
method creating boundaries with larger areas. Many of the clusters formed when
there are a higher number of clusters are either subsets of the clusters formed using a
lower number of clusters, or very similar to the clusters formed using a lower
number of clusters. The clear exception occurs where k = 74 and 72 clusters are
formed: a previously undetected large cluster appears. This is the qualitatively
different cluster mentioned earlier.
Donetti and Muñoz (2004) have pointed out that the weakest part of the eigengap
heuristic is that we do not know how many eigenvalues (k in the Livehoods method)
should be calculated a priori. While Cranshaw et al. (2012) also have not provided
any guidelines on how to choose the right value of k for cities of different sizes,
cities occupying a larger area can be expected to contain more neighbourhoods,
suggesting that larger values of k should be used. As the Greater London area is
much larger than Pittsburgh, k should be larger than 45. A k value of 100 was arbitrarily chosen
in this work to test the effects of tuning the nearest neighbour parameter and the
alpha constant, to reflect the possibility of a higher number of neighbourhoods in
London. An even higher value may be more suitable as London is many times larger
than Pittsburgh, but this value was used to keep computation requirements
manageable.
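The eigengap heuristic itself can be sketched as follows: compute the k smallest eigenvalues of the normalised graph Laplacian and read off the position of the largest gap as the suggested number of clusters. The code below is a generic illustration of the heuristic, not the scripts used in this work.

```python
import numpy as np

def eigengap_clusters(A, k):
    """Suggest a cluster count via the eigengap heuristic: the largest
    gap among the k smallest eigenvalues of the normalised graph
    Laplacian built from affinity matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt
    eigvals = np.sort(np.linalg.eigvalsh(L))[:k]
    gaps = np.diff(eigvals)
    # The gap after the j-th eigenvalue suggests j clusters.
    return int(np.argmax(gaps)) + 1

# Block-diagonal affinity with two obvious communities of 3 venues:
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
print(eigengap_clusters(A, k=4))  # 2
```

The heuristic's weakness noted above is visible here: k must already be large enough for the true gap to fall inside the first k eigenvalues, which cannot be known in advance.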
Figure 1: Relationship between number of smallest eigenvalues (k) found and number of clusters formed
Figure 2: Boundaries formed for different number of clusters (panels: 7 clusters, k = 9; 11 clusters, k = 13; 23 clusters, k = 25; 41 clusters, k = 43; 72 clusters, k = 74; 99 clusters, k = 101)
to α = 0.01 it expands greatly to include many other parts of the Greater London
area. This boundary remains consistent as α increases. This behaviour again
highlights the qualitatively different nature of this cluster.
Figure 3: Boundaries formed for different alpha constants
Panels: α = 0.00, 0.01, 0.02, 0.03, 0.04, 0.05
levels of m. It is hard to determine the optimal number to use for m, but values of 8
and higher seem to generate reasonably consistent clusters.
Figure 4: Boundaries formed for different nearest neighbours parameter (m)
Panels: m = 5, 8, 10, 15, 18, 20
clusters formed, with more clusters being formed when the number of eigenvalues
increases. The investigation also revealed that two types of clusters may be formed
by the method. One type of cluster is the contiguous geographic space that can be
associated with neighbourhoods, and another type of cluster seems to be large and
spans the entire city.
In the next two sections, I will use one of the sets of clusters / neighbourhoods
generated by the Livehoods method to illustrate the types of information that can be
derived from such clusters, and from neighbourhood detection methods in general. In
section 5, I combine the clusters formed with data from Foursquare's venues database
to describe the types of venues and activities that take place within each cluster.
Incorporating information from location based social media to better understand the
clusters / neighbourhoods formed is common for researchers using neighbourhood
detection methods.
In section 6, I attempt to combine the cluster / neighbourhoods formed using the
Livehoods method with data from administrative boundaries (the Greater London
Lower Super Output Areas in this case) and determine the ethnic diversity of the
clusters / neighbourhoods formed. Integrating clusters / neighbourhoods detected
using neighbourhood detection methods with data from administrative boundaries is rare in
the neighbourhood detection literature as most researchers using neighbourhood
detection methods have used them for developing recommendation engines that find
similar places based on social media activity. My attempt tries to add more meaning
to the clusters formed so that they can be used for other purposes, such as
investigating ethnic diversity issues within neighbourhoods.
of the cluster and the number of venues in the cluster: the number of venues per
square kilometre ranged from 1.27 (cluster 18) to 1,304.61 (cluster 7) with a median
of 43.95; the number of checkins per venue ranged from 1.26 (cluster 65) to 40.09
(cluster 26) with a median of 3.23; and the number of unique users per venue ranged
from 0.55 (cluster 67) to 19.52 (cluster 16) with a median of 1.89.
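As a concrete check on these derived measures, the per-area and per-venue ratios can be recomputed from a cluster's raw counts. The raw figures below are cluster 0's from the cluster statistics table; the per-square-kilometre values can differ slightly from the tabulated ones because the printed area is rounded to two decimal places.

```python
# Recomputing the derived cluster statistics from raw counts (cluster 0).
cluster = {"area_km2": 0.69, "checkins": 1002, "users": 641, "venues": 238}

stats = {
    "checkins_per_km2": cluster["checkins"] / cluster["area_km2"],
    "users_per_km2": cluster["users"] / cluster["area_km2"],
    "venues_per_km2": cluster["venues"] / cluster["area_km2"],
    "checkins_per_venue": cluster["checkins"] / cluster["venues"],
    "checkins_per_user": cluster["checkins"] / cluster["users"],
    "users_per_venue": cluster["users"] / cluster["venues"],
}
print({k: round(v, 2) for k, v in stats.items()})
```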
Many of the distributions of cluster properties are highly skewed. Clusters 2, 13, 16
and 26 are particularly active clusters and are in the top 5 in terms of users and
checkins across all clusters, whether in absolute terms or on a per venue basis.
Collectively, the four clusters account for 29.5% of all checkins from 60% of unique
users despite containing only 5.7% of all venues across the city. This is
understandable for clusters 2 and 13 as they are in the city centre, and cluster 26 as it
is at Heathrow airport. Cluster 16 consists of Wembley stadium, and it is likely that
it had such high values for users and checkins during that period as it was the host
for the 2011 UEFA Champions League Final on 28th May 2011, which is within the
period of analysis. People attending this event are highly likely to checkin on social
media as it is a rare and meaningful event for them. Under more normal
circumstances, cluster 16 likely would have values closer to the median.
Across all clusters, cluster 18 stands out with the largest area and relatively low
frequencies of users and venues over such a large area. It could be classified as an
outlier, but results for the cluster have been included for completeness. In addition,
all variations of the Livehoods method detect this cluster or a cluster similar to this
cluster. This is more likely an artefact of using the nearest neighbours proximity
criterion as discussed above.
City area

Cluster | Area (sq km) | Check-ins | Users | Venues | Check-ins per sq km | Users per sq km | Venues per sq km | Check-ins per venue | Check-ins per user | Users per venue
0 | 0.69 | 1002 | 641 | 238 | 1447.35 | 925.9 | 343.78 | 4.21 | 1.56 | 2.69
1 | 0.89 | 469 | 321 | 165 | 527.2 | 360.84 | 185.48 | 2.84 | 1.46 | 1.95
2 | 1.25 | 5147 | 2585 | 161 | 4121.23 | 2069.82 | 128.91 | 31.97 | 1.99 | 16.06
3 | 26.83 | 356 | 178 | 180 | 13.27 | 6.63 | 6.71 | 1.98 | 2 | 0.99
4 | 2.95 | 851 | 450 | 163 | 288.6 | 152.61 | 55.28 | 5.22 | 1.89 | 2.76
5 | 0.75 | 462 | 230 | 102 | 616.58 | 306.95 | 136.13 | 4.53 | 2.01 | 2.25
6 | 2.19 | 1055 | 556 | 239 | 481.71 | 253.87 | 109.13 | 4.41 | 1.9 | 2.33
7 | 0.16 | 695 | 447 | 215 | 4217.23 | 2712.38 | 1304.61 | 3.23 | 1.55 | 2.08
8 | 0.82 | 754 | 493 | 195 | 924.93 | 604.76 | 239.21 | 3.87 | 1.53 | 2.53
9 | 1.77 | 610 | 325 | 241 | 344.83 | 183.72 | 136.24 | 2.53 | 1.88 | 1.35
10 | 1.5 | 806 | 409 | 253 | 536.37 | 272.18 | 168.36 | 3.19 | 1.97 | 1.62
11 | 0.6 | 967 | 622 | 231 | 1602.32 | 1030.65 | 382.77 | 4.19 | 1.55 | 2.69
12 | 1.09 | 294 | 163 | 120 | 270.77 | 150.12 | 110.52 | 2.45 | 1.8 | 1.36
13 | 2.73 | 2888 | 2032 | 202 | 1056.98 | 743.7 | 73.93 | 14.3 | 1.42 | 10.06
14 | 4.62 | 540 | 213 | 155 | 116.81 | 46.07 | 33.53 | 3.48 | 2.54 | 1.37
15 | 0.62 | 1357 | 578 | 108 | 2184.13 | 930.31 | 173.83 | 12.56 | 2.35 | 5.35
16 | 22.55 | 3508 | 1737 | 89 | 155.54 | 77.01 | 3.95 | 39.42 | 2.02 | 19.52
17 | 1.74 | 691 | 322 | 165 | 396.12 | 184.59 | 94.59 | 4.19 | 2.15 | 1.95
18 | 203.11 | 257 | 110 | 157 | 1.27 | 0.54 | 0.77 | 1.64 | 2.34 | 0.7
19 | 0.88 | 248 | 154 | 101 | 280.51 | 174.19 | 114.24 | 2.46 | 1.61 | 1.52
20 | 2.08 | 556 | 296 | 154 | 267.1 | 142.2 | 73.98 | 3.61 | 1.88 | 1.92
21 | 23.94 | 831 | 398 | 257 | 34.71 | 16.63 | 10.74 | 3.23 | 2.09 | 1.55
22 | 12.1 | 453 | 304 | 157 | 37.43 | 25.12 | 12.97 | 2.89 | 1.49 | 1.94
23 | 4.7 | 378 | 168 | 139 | 80.49 | 35.78 | 29.6 | 2.72 | 2.25 | 1.21
24 | 1.56 | 464 | 296 | 123 | 296.6 | 189.21 | 78.62 | 3.77 | 1.57 | 2.41
25 | 42.64 | 285 | 121 | 135 | 6.68 | 2.84 | 3.17 | 2.11 | 2.36 | 0.9
26 | 0.35 | 2165 | 975 | 54 | 6131.41 | 2761.26 | 152.93 | 40.09 | 2.22 | 18.06
27 | 0.41 | 348 | 235 | 163 | 844.05 | 569.97 | 395.34 | 2.13 | 1.48 | 1.44
28 | 0.31 | 167 | 117 | 48 | 543.27 | 380.61 | 156.15 | 3.48 | 1.43 | 2.44
29 | 1.24 | 827 | 384 | 54 | 668.99 | 310.63 | 43.68 | 15.31 | 2.15 | 7.11
30 | 1.71 | 1921 | 547 | 148 | 1126.03 | 320.63 | 86.75 | 12.98 | 3.51 | 3.7
31 | 0.75 | 160 | 124 | 31 | 214.22 | 166.02 | 41.5 | 5.16 | 1.29 | 4
32 | 136.96 | 432 | 340 | 131 | 3.15 | 2.48 | 0.96 | 3.3 | 1.27 | 2.6
33 | 25.62 | 405 | 224 | 141 | 15.81 | 8.74 | 5.5 | 2.87 | 1.81 | 1.59
34 | 0.21 | 637 | 394 | 188 | 3098.25 | 1916.34 | 914.4 | 3.39 | 1.62 | 2.1
35 | 0.15 | 181 | 94 | 38 | 1197.88 | 622.1 | 251.49 | 4.76 | 1.93 | 2.47
36 | 22.11 | 321 | 140 | 93 | 14.52 | 6.33 | 4.21 | 3.45 | 2.29 | 1.51
37 | 0.6 | 358 | 183 | 73 | 600.17 | 306.79 | 122.38 | 4.9 | 1.96 | 2.51
38 | 0.32 | 1169 | 740 | 279 | 3624.81 | 2294.57 | 865.12 | 4.19 | 1.58 | 2.65
39 | 1.4 | 1366 | 622 | 161 | 974.53 | 443.75 | 114.86 | 8.48 | 2.2 | 3.86
40 | 8.27 | 179 | 69 | 81 | 21.65 | 8.34 | 9.8 | 2.21 | 2.59 | 0.85
41 | 5.94 | 144 | 82 | 87 | 24.23 | 13.79 | 14.64 | 1.66 | 1.76 | 0.94
42 | 0.28 | 481 | 311 | 75 | 1702.65 | 1100.88 | 265.49 | 6.41 | 1.55 | 4.15
43 | 1.86 | 172 | 134 | 29 | 92.24 | 71.87 | 15.55 | 5.93 | 1.28 | 4.62
44 | 75.25 | 167 | 69 | 99 | 2.22 | 0.92 | 1.32 | 1.69 | 2.42 | 0.7
45 | 1.13 | 43 | 10 | 16 | 38.16 | 8.88 | 14.2 | 2.69 | 4.3 | 0.62
46 | 6.48 | 65 | 30 | 40 | 10.03 | 4.63 | 6.17 | 1.62 | 2.17 | 0.75
47 | 11.88 | 315 | 149 | 144 | 26.51 | 12.54 | 12.12 | 2.19 | 2.11 | 1.03
48 | 0.11 | 199 | 155 | 36 | 1761.06 | 1371.68 | 318.58 | 5.53 | 1.28 | 4.31
49 | 31.95 | 173 | 86 | 89 | 5.42 | 2.69 | 2.79 | 1.94 | 2.01 | 0.97
50 | 0.66 | 255 | 117 | 99 | 387.71 | 177.89 | 150.52 | 2.58 | 2.18 | 1.18
51 | 0.55 | 385 | 248 | 131 | 705.65 | 454.55 | 240.1 | 2.94 | 1.55 | 1.89
52 | 39.21 | 775 | 287 | 129 | 19.77 | 7.32 | 3.29 | 6.01 | 2.7 | 2.22
53 | 1.12 | 751 | 413 | 209 | 670.36 | 368.65 | 186.56 | 3.59 | 1.82 | 1.98
54 | 87.89 | 202 | 93 | 107 | 2.3 | 1.06 | 1.22 | 1.89 | 2.17 | 0.87
55 | 5.6 | 316 | 98 | 123 | 56.39 | 17.49 | 21.95 | 2.57 | 3.22 | 0.8
56 | 18.86 | 551 | 287 | 200 | 29.21 | 15.21 | 10.6 | 2.76 | 1.92 | 1.44
57 | 1.12 | 189 | 105 | 79 | 168.69 | 93.72 | 70.51 | 2.39 | 1.8 | 1.33
58 | 0.33 | 766 | 444 | 132 | 2296.85 | 1331.33 | 395.8 | 5.8 | 1.73 | 3.36
59 | 21.86 | 412 | 195 | 193 | 18.85 | 8.92 | 8.83 | 2.13 | 2.11 | 1.01
60 | 47.01 | 228 | 88 | 107 | 4.85 | 1.87 | 2.28 | 2.13 | 2.59 | 0.82
61 | 1.27 | 115 | 60 | 56 | 90.25 | 47.08 | 43.95 | 2.05 | 1.92 | 1.07
62 | 1.99 | 181 | 56 | 66 | 90.82 | 28.1 | 33.12 | 2.74 | 3.23 | 0.85
63 | 9.31 | 47 | 20 | 28 | 5.05 | 2.15 | 3.01 | 1.68 | 2.35 | 0.71
64 | 8.39 | 1325 | 681 | 261 | 157.85 | 81.13 | 31.09 | 5.08 | 1.95 | 2.61
65 | 10.86 | 54 | 31 | 43 | 4.97 | 2.86 | 3.96 | 1.26 | 1.74 | 0.72
67 | 33.75 | 99 | 28 | 51 | 2.93 | 0.83 | 1.51 | 1.94 | 3.54 | 0.55
68 | 14.95 | 103 | 44 | 38 | 6.89 | 2.94 | 2.54 | 2.71 | 2.34 | 1.16
69 | 4.78 | 113 | 76 | 73 | 23.62 | 15.89 | 15.26 | 1.55 | 1.49 | 1.04
70 | 0.5 | 699 | 367 | 115 | 1388.01 | 728.75 | 228.36 | 6.08 | 1.9 | 3.19
71 | 34.32 | 532 | 323 | 221 | 15.5 | 9.41 | 6.44 | 2.41 | 1.65 | 1.46
Figure 7 shows the overall distribution of venues and checkins across all clusters
according to Foursquare's main categories, in percentage values. 29.23% of venues in
the data are in the food category, followed by 17.05% of venues in the nightlife spot
category. Users, however, check in mostly at venues related to travel & transport
(23.04%), professional & other places (18.86%), and arts & entertainment venues
(15.68%). From this, we can observe that venues in the travel & transport, professional
& other places, nightlife spot and arts & entertainment categories receive a
disproportionate number of checkins. This means that clusters formed based on
Foursquare checkins are likely to be biased towards venues in these categories, and
may be more suitable for research questions related to such categories (e.g. transport,
culture).
Figure 7: Overall distribution of venues and checkins across clusters (% of venues vs % of checkins)
Similar profiles can be created for each cluster to form neighbourhood profiles. To
calculate the distribution of venues / checkins by category within a neighbourhood, the
formula used to calculate the value for each category (B) was:

    share of category B (%) = (venues or checkins in category B within the cluster / total venues or checkins within the cluster) × 100
This gives a sense of the type of venues in the clusters and the type of activities that
occur within them. These neighbourhood profiles were compared with the city profile
to understand which categories within the neighbourhood were overrepresented /
underrepresented. For each category, the formula was:
    percentage difference (%) = ((cluster share of the category − city share of the category) / city share of the category) × 100
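Both calculations can be sketched in a few lines of Python; the category counts below are made-up illustrative numbers, not the actual London figures.

```python
# Hypothetical venue counts per Foursquare main category (illustrative only).
city = {"Food": 2923, "Nightlife Spot": 1705, "Travel & Transport": 900}
cluster = {"Food": 40, "Nightlife Spot": 35, "Travel & Transport": 25}

def profile(counts):
    """Share of each category as a percentage of the total (first formula)."""
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}

city_pct = profile(city)
cluster_pct = profile(cluster)

# Percentage difference of the cluster profile from the city profile
# (second formula): over-/under-representation of each category.
diff = {cat: 100.0 * (cluster_pct[cat] - city_pct[cat]) / city_pct[cat]
        for cat in city_pct}
print({cat: round(v, 2) for cat, v in diff.items()})
```

A positive difference marks a category that is overrepresented in the cluster relative to the city, a negative one a category that is underrepresented.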
Tables 2 and 3 contain the percentage difference figures for all clusters for venues and
checkins respectively, with the highest positive difference for each cluster highlighted.
These percentage differences for each category were used to determine which types
of venues occurred more frequently and which types of venues users checked-in at
more frequently within the cluster. For example, clusters 28 and 29 have more venues
and checkins in the travel and transport category, as these clusters are essentially the
London Heathrow airport terminals, which we expect to have a higher concentration of
venues and checkins related to travel and transport. Another example is clusters with
high concentrations of venues and checkins in the college & university category.
Clusters 27, 46 and 47 have percentage difference figures of over 1000% for users
checking-in, and they contain University College London, Brunel University London,
and the Queen Mary University of London respectively.
From tables 2 and 3, we again observe differences between checkin behaviour and types
of venues. For many clusters, the most overrepresented category in terms of venues is
different from the most overrepresented category in terms of checkins. Cluster 3, for
Table 2: Percentage difference between proportion of venues within cluster and proportion of venues
within city in terms of Foursquare's main categories
Note: Empty cells indicate that the cluster did not contain venues in that category
Cluster | Arts & Entertainment | College & University | Food | Nightlife Spot | Outdoors & Recreation | Professional & Other Places | Residence | Shop & Service | Travel & Transport
-36.9
22.98
-12.42
-2.02
-69.76
99.3
-74.25
73.22
-75.35
1
2
3
4
5
6
7
56.72
-30.23
-38.26
-58.42
67.78
-9.85
-31.44
-36.97
9
-79.5
-77.73
-22.09
-31.03
11.54
18.11
34.81
9.09
60.26
94.24
-20.18
-4.65
-40.54
-45.16
13.45
56.88
-56.19
102.84
134.71
-22.52
-62.7
-20.29
58.7
18.75
-65.77
31.04
-78.1
56.2
193.33
-28.4
10.51
-30.51
47.02
2.95
-25.86
-75.62
-29.9
-57.63
-37.2
-17.33
-52.95
-73.22
-38.01
5.49
75.24
63.82
-35.31
-79.92
8
9
10
-54.73
-55.9
-48.85
-22.79
180.75
55.06
23.46
2.22
-18.64
-9.36
-4.14
-4.91
-49.38
-90.14
-42.81
2.55
133.34
122.44
-83.84
-62.21
-75.65
48.28
-42.23
-21.83
-18.77
-57.8
-15.52
11
12
13
14
15
16
17
49.53
26.02
30.04
-0.96
-20.1
129.2
21.43
154.99
-4.49
27.43
25.29
10.82
-13.19
5.53
-39.23
24.18
9.59
92.24
-22.51
21.4
-25.73
-34.44
54.36
-68.65
-76.52
9.05
-36.72
-14.57
-24.57
-8.55
-54.33
13.9
38.45
-13.13
1.53
-41.32
45.47
-80.73
10.19
-8.28
-43.21
-13.5
-73.83
33.45
6.06
-61.68
-71.29
40.73
-8.14
166.2
56.65
-44.67
18
19
20
-52.25
-14.22
-15.11
117.14
-51.24
286.03
-29.47
31.58
-8.37
-13.5
-38.66
-23.11
60.17
-4.08
58.2
-25.01
-53.37
7.68
263.68
22.5
-19.18
18.16
-25.08
-1.14
3.35
61.23
-12.97
21
22
23
24
25
13.19
60.08
-34.8
18.68
-49.49
-1.02
-33.82
-62.94
-10.05
-10.97
-18.15
-2.77
31.48
-28.26
31.98
70.65
-39.4
-60.4
-22.95
143.38
95.26
64.02
-55.77
88.27
-49.51
-36.71
24.05
-21.14
9.84
74.07
24.69
-6.9
-71.75
92.36
14.06
-36.45
-0.35
141.86
61.76
-46.44
-35.35
33.67
-52.68
9.32
26
27
5.15
557.5
-84.9
-5.91
-74.66
-9.77
52.44
-43.69
-80.66
-2.42
520.55
-68.56
280.27
78.17
-59.01
49.43
-42.13
-57.48
24.38
6.84
-4.74
-27.19
-86.13
-64.32
-69.56
34.47
-48.43
-20.11
-69.86
-6.33
-43.2
-67.33
-52.36
-57.63
-27.37
65.99
33.85
-13.9
413.88
383.09
-6.48
-46.44
66.35
34.75
21.84
15.75
-61.95
-17.32
54.33
-21.75
12.62
-7.7
-6.71
-73.29
-36.87
-56.6
-36.87
5.64
23.61
35.7
1.74
33.81
10.8
-37
69.65
-13.03
-77.59
-1.62
-27.16
5.95
-60.03
-30.97
212.06
-83.13
-51.87
23.14
15.34
-32
-12.38
166.26
184.75
107.09
12.27
-55.25
21.38
5.48
-64.45
4
-80.04
-81.31
-44.9
13.6
-23.29
269.19
11.88
118.52
-19.12
-14.69
-85.88
-24.21
-68.67
20.23
57.67
-86.72
-65.78
15.55
-79.64
-57.8
-4.63
143.13
-79.37
76.51
41.13
-73.56
180.57
328.44
71.79
7.84
-36.71
-33.73
73.62
-2.14
-45.37
143.45
42.38
7.09
-76.93
8.67
-72.2
-25.59
28
29
30
31
32
33
-51.49
-84.8
291.8
62.27
113.59
34
35
36
37
38
39
40
41
42
43
44
45
46
-52.34
85.21
-16.27
-35.32
171.65
-13.91
-44.94
3.16
-39.18
-21.64
-73.54
47
48
-6.69
473
40.38
-36.54
-14.21
226.23
17.28
38.28
20.32
461.5
112.17
-32
-12.38
20.96
28.6
-70.01
-53.57
41.43
-34.86
11.9
80.23
205.03
164.5
99.28
38.55
-89.22
-38.53
253.85
120.97
-62.21
263.68
76.33
99.89
49
4.48
-15.42
-17.82
-12.38
-14.8
198.41
59.69
-10.74
50
51
52
53
54
55
56
19.84
36.96
-25.23
-58.63
-75.45
7.23
19.84
9
-61.07
27.5
-76.49
11.62
-2.48
-75.23
34.81
48.82
-55.4
30.43
-41.43
-12.28
16.98
-38.3
37.11
-62.58
-17.18
-1.71
-7.99
-22.1
-46.4
-61.71
-58.2
-19.06
37.23
43.88
-39.09
-13.13
-13.13
8.39
-47.53
-19.93
-45.59
-48.67
-31.54
25.6
-25.24
22.44
94.2
125.1
124.77
66.52
-18.09
-70.75
149.11
-8.11
17.43
-56.03
-25.54
57
58
59
60
61
62
63
-68.66
92.51
-88.42
-76.85
-53.7
-18.51
-7.39
-28.73
-63.52
-47.36
5.28
49.58
-15.24
-9.25
-21.08
-21.08
4.17
-36.87
25.51
0.94
65.54
-60.27
-7.3
74.81
58.92
-64.95
25.56
42.38
81.21
-48.23
173.36
3.55
-43.2
51.16
-66.44
-24.48
-16.09
-26.16
-32.87
-10.48
-31.27
98.37
429
32.25
36.88
-29.94
1.1
81.98
21.32
-46.62
142.65
-67.87
-17.77
2.85
-28.8
89.88
-58.23
-36.71
64
65
67
23.23
-62.64
-6.19
-48.94
-81.23
-10.71
2.83
25.99
65.34
-33
146.27
10.16
73.74
-0.22
-6.15
156.72
-21.37
-46.19
-73.83
20.23
34.75
104.77
50.53
40.38
-47.39
81.63
-17.67
-17.4
-11.71
-10.35
-15.9
13.15
38.06
75.23
-6.06
54.82
-77.62
-77.28
-31.49
-3.22
-11.83
79.04
-86.31
-81.66
-17.94
-15.61
-14.31
215.81
-22.27
68
69
70
71
-7.35
120.25
764.33
-68.66
18.68
-52.24
-10.05
140.23
5.16
83.76
117.82
154.23
Table 3: Percentage difference between proportion of users within cluster checking-in and proportion of users
within city checking-in, in terms of Foursquare's main categories
Note: Empty cells indicate that the cluster did not contain checkins at venues in that category
Cluster | Arts & Entertainment | College & University | Food | Nightlife Spot | Outdoors & Recreation | Professional & Other Places | Residence | Shop & Service | Travel & Transport
-79.6
-0.55
40.74
27.39
-68.98
-18.97
-85.59
301.5
-56.14
1
2
3
4
5
6
7
-76.69
-96.62
-81.43
-72.77
126.08
116.72
-81.33
9.73
-58.82
-82.9
-86.93
102.86
-89.9
104.46
-4.76
45.69
21.01
259.98
443.13
-89.18
56.17
-70.33
-29.67
42.39
268.23
-84.49
-79.65
47.46
-73.65
-53.68
-94.49
512.27
4.96
-83.59
-48.17
-43.21
-9.12
397.75
-35.65
-20.14
-90.09
-15.14
-52.84
-62.41
-98.82
-40.5
-88.73
-64.74
50.33
-28.4
-75.13
-93.59
-20.42
154.67
54.72
-77.63
-87.57
8
9
10
-78.01
-85.31
-77.83
-39.08
425.78
92.76
159.78
109.76
14.47
23.12
105.26
19.42
-55.41
-98.28
-85.73
-63.36
67.16
52.78
-90.29
-40.13
-81.92
176.79
-46.84
8.74
-32.2
-47.13
26.19
11
12
13
14
15
16
17
-13.69
-48.3
-82.6
-38.17
-72.15
524.9
107.66
445.6
71.85
72.18
178.87
-61.13
13.87
-35.39
-91.76
10.02
60.66
242.42
-82.41
51.86
-67.18
-95.61
96.5
-68.8
-60.69
612.28
-92
-99.12
-12.62
-57.56
-44.18
-83.99
40.98
-73.94
-92.89
-78.23
-81.57
-78.63
207.14
-45.07
-94.21
-50.08
-96.01
-86.8
95.86
-75.39
-77.12
-28.25
13.97
236.68
-93.28
-68.76
18
19
20
-80.65
-34.14
-32.68
243.03
-61.93
524.9
21.36
244.84
32.26
83.09
8.38
-3.85
-41.15
-56.45
83.51
-43.26
-81.68
-38.03
822.86
142.79
75.02
37.06
-17.42
-42.16
-12.31
-14.89
-10.84
21
22
23
24
25
195.81
200.33
-84.82
-28.72
-89.82
-44.45
117.03
-24.29
41.17
12.16
8.52
61.91
65.53
2.18
78.69
100.78
-3.02
-59.82
15.6
-30.11
-40.43
41.43
-88.47
-7.1
-76.61
-73.89
-2.85
-39.35
-20.61
192.27
90.3
20.7
-67.85
331.64
-35.31
-72.75
-7.8
397.29
62.27
-77.81
-70.07
11.63
-90.41
28.77
26
27
-84.32
1098.59
-97.28
126.17
-94.73
174.43
-12.23
107.7
-99.06
-4.81
316.58
-73.98
31.14
186.79
-56.49
70.33
3.49
-75.58
-37.7
-24.24
-11.74
-26.53
-91.78
-87.72
-92.67
14.29
-65.94
3.9
-96.76
-84.51
-93.1
-84.3
-61.95
-73.64
-86.46
505.79
-35.42
-22.22
248.71
295.48
-84.18
-83.63
-45.16
-20.15
174.14
104.41
-67.45
29.23
205.93
-54.12
60.65
74.14
-14.85
-45.11
28.68
-65.28
28.21
129.52
58.14
32.59
11.4
166.06
70.66
-31.84
200.22
9.32
-84.94
103.37
-47.62
117.59
-40.18
204.98
69.69
-44.27
-95.43
-39.14
-24.61
-91.22
-93.95
4
405.12
16.57
-36.21
-83.29
-31.51
121.94
-81.88
95.53
-93.6
-92.07
-57.26
-61.82
-29.68
99.24
-48.91
167.03
-74.63
-60.61
-96.81
-22.12
-95.44
15.42
118.24
-82.46
-93.96
26.05
-86.18
-28.66
-17.95
-29.54
-76.82
53.02
56.91
-87.46
184.94
249.66
-10.42
35.75
-67.68
5.24
140.56
97.37
-80.56
-17.25
-37.51
-38.35
-91.78
47.99
-94.8
-44.08
28
29
30
31
32
33
-91.32
-82.04
352.94
287.79
95.82
34
35
36
37
38
39
40
41
42
43
44
45
46
-92.35
-79.11
71.27
5.11
5.28
-69.44
-87.99
-40.51
-89.89
-96.02
-90.23
47
48
-72.79
307.49
-44.45
-30.99
254.39
73.2
31.84
-3.99
614.44
1768.53
1136.03
108.91
-93.44
-77.61
-16.51
201.37
-89.89
290.4
-77.6
220.05
125.45
559.49
342.83
312.56
-33.15
-92.94
-84.07
1130.06
688.18
3.54
193.38
170.81
308.53
49
-70.19
50
51
52
53
54
55
56
-65.64
-59.06
-82.27
-90.58
-61.77
-81.91
-70.65
128.44
-75.26
-13.16
-87.47
27.09
71.85
-82.26
65.61
61.18
-61.14
-11.45
396.48
22.82
-3.05
97.57
147.44
-74.25
97.41
7.92
126.99
114.2
95.08
234.55
-59.39
-0.14
80.89
36.97
31.27
-78.23
-32.08
-97.16
-61.31
-56.39
151.58
-77.69
-67.94
-55.35
-26.87
-53.28
-61.78
-81.39
-75.46
-39.3
60.82
52.53
-66.01
246.14
132.15
72.64
234.11
17.7
-70.58
196.19
-16.58
2.76
-70.58
-42.64
57
58
59
60
61
62
63
-64.41
53.65
-98.27
-89.93
-73.43
-89.11
-81.55
5.16
-87.98
-7.75
78.45
182.75
-21.77
50.14
9.43
66.67
84.43
131.49
161.91
93.28
175.73
-55.55
38.29
116.4
214.3
-75.95
195.51
-31.42
-59.18
-89.9
-17.19
-71.94
-74.7
-53.74
-83.35
-75.85
-57.5
-12.9
-70.48
151.47
-71.26
230.91
1749.17
40.82
38.05
-83.54
-44.71
225.95
41.18
-50.41
152.1
-39.99
-49.7
62.34
-46.96
68.04
17.08
-53.32
64
65
67
-81.27
-51.58
-14.51
16.28
-67.1
-6.47
163.13
115.05
285.27
-76.51
-46.83
-55.48
-1.15
11.85
-17.29
718.73
363.23
-81.84
-53.1
19.42
23.69
17.24
5.03
68
69
70
71
250.01
-93.49
184.94
-46.23
259.49
-40.83
9.13
21.66
121.86
-24.04
57.15
-56.55
28.76
-69.48
-43.88
-94.29
-79.16
-79.13
-24.1
-24.29
176.12
-90.11
-96.95
-41.98
-9.66
-29.98
232.53
-33.32
44.8
-38.81
55.53
-59.97
5.16
48.34
355.9
228.77
295.88
486.75
    HI = 1 − Σ s_i²   (summed over i = 1, …, n)

where s_i is the share of ethnic group i, out of a total of n ethnic groups. This can be
interpreted as the probability that two randomly selected individuals from the same
area are of different ethnic origin (Sturgis et al., 2013). A higher score reflects a more
ethnically diverse population.
I use the same Hirschman concentration index to calculate the scores for London's
LSOAs using counts of the following ethnic groups: white, mixed/multiple ethnic
groups, Asian/Asian British, Black/African/Caribbean/Black British, Others.
To calculate the Hirschman concentration index for the clusters formed using the
Livehoods method, I summed up the counts of the ethnic groups from LSOAs that
intersected the cluster, even if just partially, and calculated the Hirschman concentration
index based on those sums.
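This aggregation can be sketched as follows, following the probability-of-difference interpretation of the index given above; the LSOA counts are hypothetical, not actual 2011 Census figures.

```python
def hirschman_index(counts):
    """HI = 1 - sum of squared shares: the probability that two randomly
    selected individuals from the area are of different ethnic origin."""
    total = sum(counts)
    shares = [c / total for c in counts]
    return 1.0 - sum(s * s for s in shares)

# Hypothetical counts (white, mixed, Asian, Black, other) for two LSOAs
# that intersect the same cluster.
lsoa_a = [900, 50, 30, 15, 5]
lsoa_b = [200, 100, 400, 250, 50]

# Cluster-level score: sum the group counts across intersecting LSOAs,
# then compute the index on the summed counts.
cluster_counts = [a + b for a, b in zip(lsoa_a, lsoa_b)]
print(round(hirschman_index(cluster_counts), 3))
```

Computing the index on the pooled counts, rather than averaging the LSOA-level scores, is what allows the cluster-level value to diverge from the scores of its component LSOAs.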
Figure 8 shows the Hirschman concentration index (HI) for each cluster along with the
average HI value for all LSOAs that intersect the cluster and the maximum and
minimum HI values amongst LSOAs within the cluster.
From the figure, we see that HI values for clusters are lower than the average HI values
for LSOAs within the clusters. This means that each cluster is less diverse than its
component parts in general. In some cases, such as clusters 16 and 52, the cluster HI
value is even lower than the minimum HI value amongst all LSOAs within the cluster.
It is clear that the picture of ethnic diversity is different depending on whether LSOAs
or the clusters from the Livehoods method are used, and which measure to use depends
on the research question being asked. This is expected whenever different boundaries
are used and is related to the modifiable areal unit problem (Openshaw, 1984), an issue
in spatial analysis where there is an almost infinite number of different ways by which
a geographical region of interest can be areally divided (Openshaw, 1984), and using
different areal units may affect the results of geographical studies. One way to manage
the problem is to rely on theory to select areal units that are relevant to the purpose of
the study (Openshaw, 1984).
In the case of measuring the effect of ethnic diversity on social cohesion, the idea that
people in the same neighbourhoods go to similar places may be important, as they then
come into contact with each other at these places, whether it is the neighbourhood
supermarket, bus stop or school. If this is indeed an important factor, the clusters
created by the Livehoods method may be a more suitable unit of analysis than LSOAs
as the clusters have been created to reflect areas that a similar set of people frequent,
while the LSOAs do not reflect such information. While not attempted in this work, it
would be interesting to replicate Sturgis et al.'s (2013) research using the clusters
created using the Livehoods method and compare the results to the original results.
Other studies on neighbourhood characteristics and effects could similarly benefit from
the perspective offered by the clusters created using the Livehoods method. Examples
of other domains where this perspective could be useful are in the areas where
neighbourhood diversity is considered important, such as the effects of racial, ethnic,
religious and /or socioeconomic diversity on social trust, cohesion, crime and / or
voting behaviour within neighbourhoods. In these cases, the idea of neighbourhood may
be better represented by clusters / neighbourhoods formed using the Livehoods method
instead of LSOAs. Other neighbourhood detection methods may also produce clusters /
neighbourhoods that are more suitable than the LSOAs.
7. CONCLUSION
7.1. Concluding Remarks
In this work, I have argued that social media is a useful source of data for defining
neighbourhoods as it provides rich contextual information on user activity at different
times of day. The neighbourhood boundaries formed by neighbourhood detection
methods are dynamic and contain much contextual information about the city, and as
such can be useful for research and analysis in social science, policymaking, and urban
planning. While there is some research on how to generate neighbourhood boundaries
using location based social media, our understanding of these methods is limited and we
need to better understand these methods so that the methods and the boundaries/clusters
generated can be put to better use.
I pointed out that neighbourhood detection methods using location based social media
generally have three elements: the unit used for aggregation, the type of clustering
method, and the similarity measures used. To illustrate how we can better understand
neighbourhood detection methods, I undertook an in-depth exploration of Cranshaw et
al.'s (2012) Livehoods method using Foursquare data from London and analysed the
various elements in the method. Through the analysis, I found that the method is
relatively robust even when the parameters in the method are tweaked. I also found that
the method may generate two qualitatively different types of clusters, where one type is
contiguous geographic spaces that can be associated with neighbourhoods, and the other
type may reflect the boundaries of the city as perceived by Foursquare users.
I then illustrated some types of information that clusters generated using the Livehoods
method could provide, in terms of information derived from just social media, and also
in terms of information derived by combining the clusters with Lower Super Output
Area (LSOA) data. In the latter case, I showed how the ethnic diversity score can differ
based on whether the Livehoods clusters or the LSOA boundaries were used, and
argued that Livehoods clusters may be more suitable than LSOA boundaries in
situations where the concept of neighbourhoods encompasses the idea that
neighbourhoods are a set of places that a similar set of people go to. For social scientists
and policy makers, this may be cases where they wish to implement or evaluate policies
and programmes that are neighbourhood based and the effects are influenced by ethnic
diversity.
Beyond the Livehoods method, other methods can be investigated in a similar manner,
such as the Hoodsquare (Zhang et al., 2013) and LiveCities (Del Bimbo et al., 2014)
methods. In particular, it is important to better understand how tuning parameters for a
particular neighbourhood detection method changes the clusters formed. It is also
important to understand the type of information that is contained within the clusters
formed.
7.2. Limitations and Future Research
There are many other lines of inquiry that have not been explored in the
investigation of the Livehoods method. One example is how the clusters formed change
over time. The method could be applied over different time scales (e.g. weeks or
months) or different times of day (e.g. morning/night). This could give us an idea of
how dynamic or consistent neighbourhood detection using location based social media
is. A second line of inquiry would be to collect Foursquare data from different cities
and compare the results of tuning the Livehoods parameters in these cities. This
comparison would give us an idea of the method's robustness across cities and provide
clues as to how characteristics of cities may influence neighbourhood detection methods in
general. A third line of inquiry would be to generate clusters using distance as a
proximity criterion instead of nearest neighbours. From the analysis in this work, using
the nearest neighbours as a proximity criterion seems to lead to larger clusters being
formed in areas where Foursquare venues are less dense. Clusters formed using the two
types of proximity criterion could be compared to provide insight for when it may be
better to use either criterion.
As mentioned earlier, better understanding neighbourhood detection methods can also
be about investigating certain elements of the neighbourhood detection process across a
number of methods. For example, it would be useful to consider how the venues-based
approach compares against the grid-based approach in terms of unit aggregation. While
the Hoodsquare method did not take a strict grid-based approach, Zhang et al. (2013)
showed how a comparison could be done by comparing their Hoodsquare method with
the Livehoods method in terms of which could better predict a user's home
neighbourhood.
In terms of the type of clustering method, we could learn from the community detection
literature on how to compare clustering algorithms. Lancichinetti and Fortunato
(2009), for example, evaluated a variety of clustering
algorithms against benchmark graphs in the community detection literature. Is there a
8. BIBLIOGRAPHY
Campbell, E., Henly, J.R., Elliott, D.S., Irwin, K., 2009. Subjective constructions of
neighborhood boundaries: Lessons from a qualitative study of four neighborhoods. J.
Urban Aff. 31, 461–490. doi:10.1111/j.1467-9906.2009.00450.x
Chaskin, R.J., 1998. Neighborhood as a Unit of Planning and Action: A Heuristic Approach.
J. Plan. Lit. 11–30.
Chaskin, R.J., 1997. Perspectives on Neighborhood and Community: A Review of the
Literature. Soc. Serv. Rev. 71, 521–547.
Cranshaw, J., Schwartz, R., Hong, J.I., Sadeh, N., 2012. The Livehoods Project: Utilizing
Social Media to Understand the Dynamics of a City. ICWSM, 58–65.
Cranshaw, J., Yano, T., 2010. Seeing a home away from the home: Distilling
proto-neighborhoods from incidental data with Latent Topic Modeling, in: CSSWC
Workshop at NIPS.
Del Bimbo, A., Ferracani, A., Pezzatini, D., D'Amato, F., Sereni, M., 2014. LiveCities:
Revealing the Pulse of Cities by Location-Based Social Networks Venues and Users
Analysis. Proc. Companion Publ. 23rd Int. Conf. World Wide Web Companion,
163–166. doi:10.1145/2567948.2577035
Donetti, L., Muñoz, M.A., 2004. Detecting Network Communities: a new systematic and
efficient algorithm. J. Stat. Mech. Theory Exp. 10, 8. doi:10.1088/1742-5468/2004/10/P10012
Falher, L., Gionis, A., Mathioudakis, M., 2015. Where is the Soho of Rome? Measures and
algorithms for finding similar neighborhoods in cities.
Fortunato, S., Barthélemy, M., 2007. Resolution limit in community detection. Proc. Natl.
Acad. Sci. U. S. A. 104, 36–41. doi:10.1073/pnas.0605965104
González, M.C., Hidalgo, C.A., Barabási, A.-L., 2008. Understanding individual human
mobility patterns. Nature 453, 779–782. doi:10.1038/nature06958
Good, B.H., De Montjoye, Y.A., Clauset, A., 2010. Performance of modularity
maximization in practical contexts. Phys. Rev. E - Stat. Nonlinear, Soft Matter Phys.
81, 1–20. doi:10.1103/PhysRevE.81.046106
Hirschman, A.O., 1964. The Paternity of an Index. Am. Econ. Rev. 54, 761.
Huang, A., 2008. Similarity measures for text document clustering. Proc. Sixth New Zeal.
49–56.
Jones, E., Oliphant, T., Peterson, P., Others, 2001. SciPy: Open Source Scientific Tools for
Python [WWW Document]. URL http://www.scipy.org/ (accessed 7.14.15).
Lancichinetti, A., Fortunato, S., 2011. Limits of modularity maximization in community
detection. Phys. Rev. E - Stat. Nonlinear, Soft Matter Phys. 84, 1–9.
doi:10.1103/PhysRevE.84.066122
Lancichinetti, A., Fortunato, S., 2009. Community detection algorithms: a comparative
analysis. Phys. Rev. E 80, 1–12.
Noulas, A., Scellato, S., Mascolo, C., Pontil, M., 2011. Exploiting Semantic Annotations for
Clustering Geographic Areas and Users in Location-based Social Networks.
Noulas, A., Scellato, S., Lathia, N., Mascolo, C., 2012. A random walk around the city:
New venue recommendation in location-based social networks. Proc. - 2012
ASE/IEEE Int. Conf. Privacy, Secur. Risk Trust 2012 ASE/IEEE Int. Conf. Soc.
Comput. Soc. 2012, 144–153. doi:10.1109/SocialCom-PASSAT.2012.70
Openshaw, S., 1984. The modifiable areal unit problem. Concepts Tech. Mod. Geogr. 38,
1–41.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,
M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau,
D., Brucher, M., Perrot, M., Duchesnay, É., 2011. Scikit-learn: Machine Learning in
Python. J. Mach. Learn. Res. 12, 2825–2830.
von Luxburg, U., 2007. A Tutorial on Spectral Clustering. Stat. Comput. 17, 395–416.
doi:10.1007/s11222-007-9033-z
Preotiuc-Pietro, D., Cohn, T., 2013. Mining User Behaviours: A Study of Check-in Patterns
in Location Based Social Networks. doi:10.1145/2464464.2464479
Preoţiuc-Pietro, D., Cranshaw, J., Yano, T., 2013. Exploring venue-based city-to-city
similarity measures. Proc. 2nd ACM 14. doi:10.1145/2505821.2505832
Ratti, C., Sobolevsky, S., Calabrese, F., Andris, C., Reades, J., Martino, M., Claxton, R.,
Strogatz, S.H., 2010. Redrawing the map of Great Britain from a network of human
interactions. PLoS One 5. doi:10.1371/journal.pone.0014248
Sampson, R.J., Morenoff, J.D., Gannon-Rowley, T., 2002. Assessing Neighborhood
Effects: Social Processes and New Directions in Research. Annu. Rev. Sociol. 28,
443–478. doi:10.1146/annurev.soc.28.110601.141114
Scellato, S., Mascolo, C., 2011. Measuring user activity on an online location-based social network. 2011 IEEE Conf. Comput. Commun. Work. (INFOCOM WKSHPS) 918–923. doi:10.1109/INFCOMW.2011.5928943
Silva, T.H., Vaz de Melo, P.O.S., Almeida, J.M., Loureiro, A.A.F., 2013. Social Media as a Source of Sensing to Study City Dynamics and Urban Social Behavior: Approaches, Models, and Opportunities, in: Atzmueller, M., Chin, A., Helic, D., Hotho, A. (Eds.), Ubiquitous Social Media Analysis. Springer, pp. 63–87.
Silva, T.H., Vaz de Melo, P.O.S., Almeida, J.M., Salles, J., Loureiro, A.A.F., 2012. Visualizing the invisible image of cities. Proc. 2012 IEEE Int. Conf. Green Comput. Commun. (GreenCom) 382–389. doi:10.1109/GreenCom.2012.62
Stokes, P., n.d. 2011 Census, Population and Household Estimates for Small Areas in England and Wales.
Sturgis, P., Brunton-Smith, I., Kuha, J., Jackson, J., 2013. Ethnic diversity, segregation and the social cohesion of neighbourhoods in London. Ethn. Racial Stud. 1–21. doi:10.1080/01419870.2013.831932
Van Der Walt, S., Colbert, S.C., Varoquaux, G., 2011. The NumPy array: A structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30. doi:10.1109/MCSE.2011.37
Weiss, L., Ompad, D., Galea, S., Vlahov, D., 2007. Defining Neighborhood Boundaries for Urban Health Research. Am. J. Prev. Med. 32, 154–159. doi:10.1016/j.amepre.2007.02.034
Xia, P., Zhang, L., Li, F., 2015. Learning similarity with cosine similarity ensemble. Inf. Sci. (Ny). 307, 39–52. doi:10.1016/j.ins.2015.02.024
Zelnik-Manor, L., Perona, P., 2004. Self-Tuning Spectral Clustering. Adv. Neural Inf. Process. Syst. 2, 1601–1608.
Zhang, A.X., Noulas, A., Scellato, S., Mascolo, C., 2013. Hoodsquare: Modeling and recommending neighborhoods in location-based social networks. Proc. SocialCom 2013, 69–74. doi:10.1109/SocialCom.2013.17
9. APPENDIX
9.1. Scripts for collecting and formatting data for analysis
9.1.1. IPython notebook: twitter_streaming.ipynb
# this script is used to collect tweets from the Twitter streaming API;
# it runs throughout the period of data collection
import json
import sys
import tweepy
# from http://stackoverflow.com/questions/21129020/how-to-fixunicodedecodeerror-ascii-codec-cant-decode-byte
# this handles unicode errors
reload(sys)
sys.setdefaultencoding('utf8')
def tweepy_oauth():
    # builds the tweepy OAuth handler from the API credentials (keys omitted here)
    return auth
        # (these lines are from the stream listener's on_data handler)
        if 'in_reply_to_status' in data:
            self.on_status(data)
        elif 'delete' in data:
            delete = json.loads(data)['delete']['status']
            if self.on_delete(delete['id'], delete['user_id']) is False:
                return False
        elif 'limit' in data:
            if self.on_limit(json.loads(data)['limit']['track']) is False:
                return False
        elif 'warning' in data:
            warning = json.loads(data)['warnings']
            print warning['message']
            return False
        self.counter += 1
        if self.counter >= 5000:
            # a new file is started every 5,000 tweets, tagged with a prefix and a timestamp
            self.output.close()
            self.output = open('../Dissertation/twitter_data/raw/' + self.fprefix + '.' +
                               time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')
            self.counter = 0
        return
    def on_timeout(self):
        sys.stderr.write("Timeout, sleeping for 60 seconds...\n")
        time.sleep(60)
        return True  # don't kill the stream
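The listener above rotates its output file every 5,000 tweets. A minimal, self-contained sketch of that rotation pattern, with a hypothetical helper class and file names of my own (the original rotates inside the tweepy listener itself):

```python
import os
import tempfile
import time


class RotatingJsonWriter(object):
    """Writes one record per line, starting a new timestamped file every
    `max_records` records (illustrative helper, not from the dissertation)."""

    def __init__(self, directory, prefix, max_records=5000):
        self.directory = directory
        self.prefix = prefix
        self.max_records = max_records
        self.counter = 0
        self.file_index = 0
        self.output = self._new_file()

    def _new_file(self):
        # tag each file with the prefix and a timestamp, as the listener does;
        # an index is added so files created within the same second do not collide
        self.file_index += 1
        name = '%s.%s.%d.json' % (self.prefix,
                                  time.strftime('%Y%m%d-%H%M%S'),
                                  self.file_index)
        return open(os.path.join(self.directory, name), 'w')

    def write(self, line):
        self.output.write(line + '\n')
        self.counter += 1
        if self.counter >= self.max_records:
            self.output.close()
            self.output = self._new_file()
            self.counter = 0


# demo: with max_records=2, five records produce three files
demo_dir = tempfile.mkdtemp()
writer = RotatingJsonWriter(demo_dir, 'tweets', max_records=2)
for i in range(5):
    writer.write('{"n": %d}' % i)
writer.output.close()
num_files = len(os.listdir(demo_dir))
```

Rotating by record count rather than by time keeps each raw file a predictable size, which makes the later merge step cheap.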
twitter_api = tweepy_oauth()
Q = "twitter.com"
# the streaming api will filter for the cities indicated by their bounding boxes here.
# each box is given as south-west longitude, south-west latitude,
# north-east longitude, north-east latitude
# using the maximum number of bounding boxes (25) allowed by the Twitter Streaming API
locations = [103.549467,1.145502,104.123447,1.478481, # Singapore
-0.489, 51.28, 0.236, 51.686, # London
-74.255641,40.495865,-73.699793,40.91533, # New York City
]
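Each bounding box is four numbers in Twitter's ordering: south-west longitude, south-west latitude, north-east longitude, north-east latitude. A small sketch of testing a point against such a box (the function name is mine; the filtering scripts later in the appendix apply the same comparison with pandas):

```python
def in_bounding_box(lng, lat, bb):
    """bb = (sw_lng, sw_lat, ne_lng, ne_lat), matching Twitter's ordering."""
    return (bb[0] <= lng <= bb[2]) and (bb[1] <= lat <= bb[3])


london_bb = (-0.489, 51.28, 0.236, 51.686)
# Trafalgar Square falls inside the London box; Times Square does not
print(in_bounding_box(-0.128, 51.508, london_bb))   # True
print(in_bounding_box(-73.985, 40.758, london_bb))  # False
```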
day = '20150824'
json_path = 'twitter_raw/' + day + '/*.json'  # folder path where the tweets (json files) are
output_path = 'foursquare/' + day + '.json'  # name of file to save the final output to
################################
import json
import glob
def merge_json(json_path):
    # merges all the json files matching json_path and returns a list of tweets
    tweets = []
    for f in glob.glob(json_path):
        tweets_file = open(f, "r")
        for tweet in tweets_file:
            if tweet != '\n':
                tweets.append(tweet)
    return tweets
#################################
import pandas as pd
from pandas import DataFrame, Series
def extract_tweets(tweets):
    # takes a list of tweets and returns those with coordinates data as a pandas DataFrame
    # (building a list of dicts first is also much faster than the iterrows method)
    df = []
    for tweet in tweets:
        if len(df) % 10000 == 0:
            print(len(df), end=' ')
        data = {}
        try:
            t = json.loads(tweet)
        except ValueError:
            continue  # skip tweets that cannot be parsed
        if t['coordinates']:
            data['tweetId'] = t['id_str']
            data['dateTime'] = t["created_at"]
            data['tweet'] = t["text"]
            data['lng'] = t["coordinates"]["coordinates"][0]
            data['lat'] = t["coordinates"]["coordinates"][1]
            data['source'] = t["source"]
            data['userId'] = t["user"]["id_str"]
            if t["entities"]["hashtags"]:
                data['hashtags'] = str([h['text'] for h in t["entities"]["hashtags"]]).strip('[]')
            if t["entities"]["urls"]:
                data['url'] = t["entities"]["urls"][0]["expanded_url"]
            df.append(data)
    df = pd.DataFrame.from_dict(df)
    # converts the dateTime column to a DatetimeIndex localized to UTC,
    # then stores it back in the dateTime column
    df["dateTime"] = pd.DatetimeIndex(df["dateTime"]).tz_localize('UTC')
    # used when converting for Processing - epoch time is harder to deal with in Processing
    # df["Year"] = pd.DatetimeIndex(df["DateTime"]).year
    # df["Month"] = pd.DatetimeIndex(df["DateTime"]).month
    # df["Day"] = pd.DatetimeIndex(df["DateTime"]).day
    # df["Hour"] = pd.DatetimeIndex(df["DateTime"]).hour
    # df["Minute"] = pd.DatetimeIndex(df["DateTime"]).minute
    # df["Second"] = pd.DatetimeIndex(df["DateTime"]).second
    return df
#################################
main(json_path, output_path)
tweets_path = 'foursquare/' + day + '.json'  # load the tweets data stored here
venues_path = 'venues/venues_info.json'  # the master venues info file; data is loaded from and saved to this file
################################
# finds the venue ids based on the Twitter check-ins, and saves them to the json file
# returns a df with the original tweets info plus the venue ids
tweets_w_venues = extract_venues(tweets, tweets_path)
# cross-references venues in tweets_w_venues with the master venue file and adds
# any new venues found to the master venue file
add_venues_to_master(venues_path, tweets_w_venues, client)
################################
from lxml import html
import requests
import collections
def extract_venue_id(url):
    # follows the tweet url to the Foursquare page and extracts the venue id from it;
    # returns the url unchanged if the page/tweet/venue info cannot be found
    try:
        page = requests.get(url)  # insert tweet url here
    except (requests.ConnectionError, requests.exceptions.MissingSchema,
            requests.exceptions.InvalidSchema):
        # raised when the url leads to localhost instead of an actual online page
        return url
    try:
        tree = html.fromstring(page.text)
    except ValueError:
        return url
    try:
        venue_id = (venue_url[0].split('/'))[-1]
        return venue_id
    except IndexError:
        return url
################################
def extract_venue_info(venue_id, client):
    # uses the venue_id to request venue info from the Foursquare API
    venue_info = client.venues(venue_id.strip())
    return venue_info
################################
def extract_venues(tweets, tweets_path):
    first_iter_errors = 0
    second_iter_errors = 0
    first_list = []
    second_list = []
    # a second iteration is done because urls are sometimes returned even when they
    # are perfectly legit - the ConnectionError handling might trigger when the net
    # connection is unstable, for example;
    # doing this twice seems to catch all the venue_ids that were missed the first time
    print()
    print("Second iteration getting venue IDs...")
    for index, row in tweets.iterrows():
        if index % 100 == 0:
            print(index, end=' ')
    # the line below probably does the same thing just as slowly; the loop above
    # prints the count so progress can be tracked
    # tweets['VenueID'] = tweets.apply(lambda row: extract_venue_id(row['url']), axis=1)
    print()
    tweets.to_json(tweets_path)
    print("Tweets saved at: " + tweets_path)
    return tweets
################################
def add_venues_to_master(venues_path, tweets, client):
    venues_info = pd.read_json(venues_path)
    res = client.venues.categories()
    maincats_dict = dict([(category['id'], category['name'])
                          for category in res['categories']])
    subcats_dict = dict([(subcat['id'], subcat['name'])
                         for category in res['categories']
                         for subcat in category['categories']])
    subsubcats_dict = dict([(subsubcat['id'], subsubcat['name'])
                            for category in res['categories']
                            for subcat in category['categories']
                            for subsubcat in subcat['categories']])
    added = 0
    errors = 0
    skipped = 0
    venues_w_errors = []
    # the check below ignores all failed extract_venue_id() attempts and all
    # venue_ids that are already in the venues_info file
    if (tweets.ix[index, 'venueId'] != tweets.ix[index, 'url']) and \
            (tweets.ix[index, 'venueId'] not in venues_info['venueId']):
        try:
            venue_info = extract_venue_info(tweets.ix[index, 'venueId'], client)
            key = venue_info['venue']['id']
            venues_info.ix[key, 'venueId'] = venue_info['venue']['id']
            venues_info.ix[key, 'name'] = venue_info['venue']['name']
            venues_info.ix[key, 'lat'] = venue_info['venue']['location']['lat']
            venues_info.ix[key, 'lng'] = venue_info['venue']['location']['lng']
            if 'categories' in venue_info['venue']:
                if cat['id'] in subsubcats_dict:
                    venues_info.ix[key, 'subsubcatId'] = cat['id']
                    venues_info.ix[key, 'subsubcat'] = cat['name']
                    venues_info.ix[key, 'subcatId'] = subsubcat_to_subcat_ids_dict[cat['id']]
                    venues_info.ix[key, 'subcat'] = subcats_dict[venues_info.ix[key, 'subcatId']]
                    venues_info.ix[key, 'maincatId'] = subcat_to_maincat_ids_dict[venues_info.ix[key, 'subcatId']]
                    venues_info.ix[key, 'maincat'] = maincats_dict[venues_info.ix[key, 'maincatId']]
            if 'description' in venue_info['venue']:
                venues_info.ix[key, 'description'] = venue_info['venue']['description']
            added += 1
        except foursquare.ParamError:
            errors += 1
            venues_w_errors.append((tweets.ix[index, 'venueId'],
                                    tweets.ix[index, 'lat'],
                                    tweets.ix[index, 'lng']))
            pass
    print()
    print(venues_path + " has been updated with " + str(added) + " venues")
    print(str(errors) + " venues could not be found from the venue_id")
    print(venues_w_errors)
    return venues_w_errors
################################
main(tweets_path, client, venues_path)
# returns a df of all checkins on the dates provided; also saves the df as a json file
print("Getting data...")
df = pd.DataFrame({}, columns=columns)
df = df[columns]
df = df.drop_duplicates()  # remove duplicates if any
df.index = np.arange(len(df))
return df
def filter_by_location(df, locations=['London', 'Singapore', 'New York']):
    # filters the checkins by a list of locations to analyze;
    # also adds a column indicating the name of the city
    df['venueId'].apply(str)
    df['userId'].apply(str)
    df['city'] = ''
    # use bounding boxes to look for tweets that are within these boxes
    df_int = df[(df['lat'] >= bb[1]) & (df['lat'] <= bb[3]) &
                (df['lng'] >= bb[0]) & (df['lng'] <= bb[2])].copy()
    df_int.loc[:, 'city'] = location
    print('Number of checkins from %s: %s ' % (location, len(df_int)))
    df_out = df_out.append(df_int)
    df_out.index = np.arange(len(df_out))
    print('Total number of checkins: %s' % len(df_out))
    print()
    return df_out
df['venueId'].apply(str)
df['userId'].apply(str)
df['city'] = ''
# use shapely polygons to look for tweets that are within city boundaries
df_int = df[(df['lat'] >= bb[1]) & (df['lat'] <= bb[3]) &
            (df['lng'] >= bb[0]) & (df['lng'] <= bb[2])].copy()
df_out.index = np.arange(len(df_out))
print('Total number of checkins: %s' % len(df_out))
ln -s /mnt /home/ubuntu/mnt
byobu-enable
wget https://bootstrap.pypa.io/get-pip.py
sudo apt-get -y install zip unzip git
sudo python get-pip.py
sudo apt-get -y install python-numpy python-scipy python-matplotlib python-sympy python-nose git
sudo apt-get -y install python-dev htop
sudo pip install pandas
sudo pip install descartes
sudo pip install -U scikit-learn
sudo apt-get -y install libspatialindex-dev
def main(argv):
    city = ''
    results_set = ''
    n_neighbors = 10
    local_sql = False
    alpha = 0.01
    try:
        opts, args = getopt.getopt(argv, "c:r:n:a:l",
                                   ["city=", "results_set=", "n_neighbors=",
                                    "alpha=", "local_sql="])
    except getopt.GetoptError:
        sys.exit(2)
    if clustering == 'spectral':
        metric = get_set_info(results_set, 'metric')
        input_matrix = get_set_info(results_set, 'input_matrix')

if __name__ == "__main__":
    main(sys.argv[1:])
venues_latlng = get_venues_latlng(city)
if input_matrix == 'full_graph':
    matrix = get_social_similarity_matrix(city, metric)
    matrix = (matrix + matrix.T) / 2
elif input_matrix == 'nearest_neighbors':
    matrix = get_affinity_matrix(city, n_neighbors, metric, alpha)
    matrix = (matrix + matrix.T) / 2
else:
    matrix = None
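Spectral clustering expects a symmetric affinity matrix, which is why both branches average the matrix with its transpose. A tiny sketch of that symmetrization step on plain nested lists (illustrative only; the scripts operate on numpy/scipy matrices):

```python
def symmetrize(m):
    """Returns (M + M^T) / 2 for a square matrix given as nested lists."""
    n = len(m)
    return [[(m[i][j] + m[j][i]) / 2.0 for j in range(n)] for i in range(n)]


w = [[0, 1, 0],
     [3, 0, 2],
     [0, 0, 0]]
# symmetrize(w) → [[0.0, 2.0, 0.0], [2.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

After this step the affinity between venues i and j no longer depends on the direction in which the similarity was computed.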
results_df = pd.DataFrame(venues_latlng).merge(pd.DataFrame(spec.labels_),
                                               left_index=True, right_index=True)
results_df.drop(['city', 'venueId'], axis=1, inplace=True)
results_df.columns = ['lat', 'lng', 'label']
return results
alpha=0.01
nearest_neighbors_matrix = get_nearest_neighbors_matrix(city, n_neighbors)
social_similarity_matrix = get_social_similarity_matrix(city, metric=metric)
affinity_matrix = create_affinity_matrix(nearest_neighbors_matrix,
                                         social_similarity_matrix, alpha=alpha)
return affinity_matrix
return out
return out
users_by_venue = get_users_by_venue(city)
users_dict = get_users_dict(city)
venues_dict = get_venues_dict(city)
users_by_venue['userId'].replace(users_dict, inplace=True)
users_by_venue['venueId'].replace(venues_dict, inplace=True)
users_by_venue.drop('city', axis=1, inplace=True)
df_int = users_by_venue.copy()
df_int.index = np.arange(len(df_int))
row = df_int['venueId'].values
col = df_int['userId'].values
data = df_int['count'].values
try:
return social_similarity_matrix
def get_checkins(city):
    # returns a df of all checkins on the dates provided
    if city_file_exists(city, 'checkins'):
        checkins = load_city_file(city, 'checkins')
    else:
        columns = ['tweetId', 'dateTime', 'lat', 'lng', 'userId', 'venueId']
        df = pd.DataFrame({}, columns=columns)
        dates = get_city_info(city, 'dates')
        df = df[columns]
        df = df.drop_duplicates()  # remove duplicates if any
        df.index = np.arange(len(df))
        checkins = extract_checkins_within_polygons(df)
        save_city_file(city, 'checkins', checkins)
    return checkins
def extract_checkins_within_polygons(df):
    # extracts the checkins by a list of locations to analyze; also adds a column
    # indicating the name of the city
    # uses city boundaries according to the geojson boundaries instead of bounding boxes
    df['venueId'].apply(str)
    df['userId'].apply(str)
    df['city'] = ''
    df_out = pd.DataFrame(columns=df.columns)
    with open(path) as f:
        boundary = json.load(f)
    boundary = shape(boundary['features'][0]['geometry'])
    # use shapely polygons to look for tweets that are within city boundaries
    df_int = df[(df['lat'] >= bb[1]) & (df['lat'] <= bb[3]) &
                (df['lng'] >= bb[0]) & (df['lng'] <= bb[2])].copy()
    df_out.index = np.arange(len(df_out))
    return df_out
bb = get_city_info(city, 'bounding_box')
# use bounding boxes to look for tweets that are within these boxes
df_out = df[(df['lat'] >= bb[1]) & (df['lat'] <= bb[3]) &
            (df['lng'] >= bb[0]) & (df['lng'] <= bb[2])].copy()
df_out.index = np.arange(len(df_out))
return df_out
if col == 'venueId':
    file = 'venue_counts'
elif col == 'userId':
    file = 'user_counts'
else:
    sys.exit('Error in col')
if city_file_exists(city, file):
    df_out = load_city_file(city, file)
else:
    checkins = get_checkins(city)
    # group by the city and the given column; this creates a multiindex df
    df_out = pd.DataFrame(checkins.groupby([checkins['city'],
                                            checkins[col]]).tweetId.count())
    # reset_index returns a df where the multiindex is put in columns of a normal df instead
    df_out = df_out.reset_index()
    df_out.columns = ['city', col, 'count']
    df_out = df_out.sort('count', ascending=False)
    df_out.index = np.arange(len(df_out))
return df_out
if city_file_exists(city, file):
    df_out = load_city_file(city, file)
else:
    checkins = get_checkins(city)
    # group by the city, the given column, date, and hour; this creates a multiindex df
    df_out = pd.DataFrame(checkins.groupby([checkins['city'], checkins[col],
                                            pd.DatetimeIndex(checkins['dateTime']).date,
                                            pd.DatetimeIndex(checkins['dateTime']).hour]).tweetId.count())
    # reset_index returns a df where the multiindex is put in columns of a normal df instead
    df_out = df_out.reset_index()
    df_out.columns = ['city', col, 'date', 'hour', 'count']
    save_city_file(city, file, df_out)
return df_out
def get_checkins_by_hour(city):
    file = 'checkins_by_hour'
    if city_file_exists(city, file):
        df_out = load_city_file(city, file)
    else:
        checkins = get_checkins(city)
        # use venues_by_time or users_by_time as the df - the result is the same
        df_int = checkins.copy()
        # round each timestamp down to the hour
        df_int['dateTime'] = df_int.apply(
            lambda row: row['dateTime'].normalize() + pd.DateOffset(hours=row['dateTime'].hour),
            axis=1)
        df_int = df_int.groupby(df_int['dateTime']).count()
        date_range = pd.date_range(start=df_int.index.min(),
                                   end=df_int.index.max(),
                                   freq='H')
        df_out = pd.DataFrame(index=date_range)
        df_out = df_out.merge(df_int, how='left', left_index=True, right_index=True)
        df_out = df_out.fillna(value=0)
        df_out = pd.DataFrame(df_out['tweetId'])
        df_out.columns = ['count']
    return df_out
def get_users_by_venue(city):
    # used as input for the affinity matrix later; gives information on the number of checkins
    file = 'users_by_venue'
    if city_file_exists(city, file):
        df_out = load_city_file(city, file)
    else:
        checkins = get_checkins(city)
        # group by the city, userId, venueId; this creates a multiindex df
        df_out = pd.DataFrame(checkins.groupby([checkins['city'],
                                                checkins['userId'],
                                                checkins['venueId']]).tweetId.count())
    return df_out
def get_users_dict(city):
    # creates the users_dict for the city
    file = 'users_dict'
    if city_file_exists(city, file):
        dict_out = load_city_file(city, file)
    else:
        users_by_venue = get_users_by_venue(city)
        dict_out = dict(zip(users_by_venue['userId'].unique(),
                            np.arange(len(users_by_venue['userId'].unique()))))
        save_city_file(city, file, dict_out)
    return dict_out
def create_distance_matrix(city):
    # creates distance matrices for each city using the venues in the df supplied;
    # saves two files in the folder: a numpy array containing the distance calculations,
    # and a dictionary containing the keys and index numbers for the venues;
    # returns a dictionary with the city name as keys and dataframes as values
    checkins = get_checkins(city)
    df_int = checkins.copy()
    df_int = df_int.iloc[df_int[['venueId']].drop_duplicates().index]
    df_int = df_int[['lat', 'lng', 'venueId']]
    print()
    print('Saving distance matrix..')
    # saving the output - one venues_dict and one np array
    venues_dict = dict(zip(df_out.index, np.arange(len(df_out.index))))
    save_city_file(city, 'venues_dict', venues_dict)
    save_city_file(city, 'distances', array)
    return array
def get_distance_matrix(city):
    file = 'distances'
    if city_file_exists(city, file):
        array = load_city_file(city, file)
    else:
        create_distance_matrix(city)
        array = load_city_file(city, file)
    return array

def get_venues_dict(city):
    file = 'venues_dict'
    if city_file_exists(city, file):
        venues_dict = load_city_file(city, file)
    else:
        create_distance_matrix(city)
        venues_dict = load_city_file(city, file)
    return venues_dict
def get_venues_latlng(city):
    file = 'venues_latlng'
    if city_file_exists(city, file):
        venues_latlng = load_city_file(city, file)
        venues_latlng.index = venues_latlng['venueId']
    else:
        checkins = get_checkins(city)
        venues_dict = get_venues_dict(city)
    return venues_latlng
},
'set7': {'name': 'set7',
'input_matrix': 'nearest_neighbors',
'metric': 'cosine',
'clustering': 'spectral'
},
'set8': {'name': 'set8',
'input_matrix': 'full_graph',
'metric': 'cosine',
'clustering': 'spectral'
},
'set9': {'name': 'set9',
'input_matrix': 'nearest_neighbors',
'metric': 'euclidean',
'clustering': 'spectral'
},
'set10': {'name': 'set10',
'input_matrix': 'full_graph',
'metric': 'euclidean',
'clustering': 'spectral'
},
'set11': {'name': 'set11',
'input_matrix': 'nearest_neighbors',
'metric': 'jaccard',
'clustering': 'spectral'
},
'set12': {'name': 'set12',
'input_matrix': 'full_graph',
'metric': 'jaccard',
'clustering': 'spectral'
},
return set_info[set][field]
return city_info[city][field]
engine = create_engine('mysql://[username]:[password]@[host:port]/[tablename]?charset=utf8',
                       encoding='utf-8')
# send results to mysql
# if an operational error occurs, it's probably because too much info is being sent over;
# give a smaller chunksize
data.to_sql(path, engine, if_exists='replace', chunksize=3000)
with open(path) as f:
data = json.load(f)
else:
data = None
print('No data')
return data
if file == 'checkins':
    path = '../_Data/' + file + suffix
elif file in ['venue_counts', 'user_counts',
              'venues_by_time', 'users_by_time', 'users_by_venue',
              'checkins_by_hour', 'venues_latlng']:
    path = '../_Data/' + loc + '_' + file + suffix
elif file == 'distances':
    path = '../_Data/' + loc + '_distances' + npy_suffix
elif file in ['users_dict', 'venues_dict']:
    path = '../_Data/' + loc + '_' + file + p_suffix
elif file == 'boundaries':
    path = '../_Data/' + loc + '_boundaries.geojson'
elif file == 'affinity_sparse':
return path
feature_collection = create_results_geojson(results_dict)
save_city_file(city, 'results_geojson', feature_collection,
               results_set=results_set, n_neighbors=n_neighbors)
def create_results_geojson(results_dict):
    results = results_dict['results_df']
    # remove the largest cluster from the results, because spectral clustering
    # usually returns a cluster that covers the entire area
    # results = results[results.label != results.groupby('label').count().lat.idxmax()]
    feature_collection = create_polygons_for_clusters(results)
    for key in ['city', 'n_neighbors', 'metric', 'n_clusters', 'n_clusters_chart',
                'input_matrix', 'results_set']:
        feature_collection[key] = str(results_dict[key])
    return feature_collection
venues_dict = get_venues_dict(city)
venue_counts = get_value_counts(city, 'venueId')
venues_info = pd.read_json('../_Data/venues/venues_info.json')
# merge data
results_for_sql = venue_counts.merge(venues_info, how='left', on='venueId')
results_for_sql.index = results_for_sql['venueId'].replace(venues_dict)
results = results_dict['results_df']
results.drop(['lat', 'lng'], axis=1, inplace=True)
results.columns = ['label']
results_for_sql = results_for_sql.merge(results, left_index=True, right_index=True)
def create_polygons_for_clusters(results):
    """Formats the spectral clustering results in a form that can be saved as GeoJSON"""
    features = []
    # create shapely Points from the results (note that lng comes first)
    points = [Point(row[0], row[1]) for row in subset[['lng', 'lat']].values]
    # instantiate a MultiPoint, then ask the MultiPoint for its envelope, which is a Polygon;
    # convex_hull and buffer are used so that the result is a smooth shape
    point_collection = MultiPoint(list(points))
    point_collection.envelope
    convex_hull_polygon = point_collection.convex_hull.buffer(0.0001)
    features.append(fea)
    return feature_collection
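The cluster polygons come from Shapely's `convex_hull` plus a small buffer to smooth the edge. For intuition, here is a dependency-free sketch of the same hull idea using Andrew's monotone chain algorithm (my own illustration; the scripts themselves rely on Shapely):

```python
def convex_hull(points):
    """Returns the convex hull of 2D (x, y) tuples in counter-clockwise order
    (Andrew's monotone chain algorithm)."""
    points = sorted(set(points))
    if len(points) <= 2:
        return points

    def cross(o, a, b):
        # z-component of the cross product of vectors OA and OB
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # omit each chain's last point: it repeats the start of the other chain
    return lower[:-1] + upper[:-1]


# an interior point is dropped from the hull of a square
hull = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

Buffering the hull afterwards, as the script does with `.buffer(0.0001)`, keeps venues sitting exactly on the boundary inside the cluster polygon.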
def create_ordered_multi(feature_collection):
    """Converts a feature collection to a Shapely MultiPolygon sorted according to the label property"""
    ordered_features = [None] * len(feature_collection['features'])
    return ordered_multi
# select venues that are within the polygon's bounding box from SQL
q = ("SELECT * FROM venues_info WHERE lng > " + str(bounds[0]) +
     " AND lng < " + str(bounds[2]) + " AND lat > " + str(bounds[1]) +
     " AND lat < " + str(bounds[3]))
q_result = pd.read_sql(q, conn)
# further filter for venues that are within the polygon itself;
# two stages are used because .within is not fast
for i, row in q_result.iterrows():
    if not Point(row['lng'], row['lat']).within(polygon):
        q_result.drop(i, inplace=True)
# generate the list of venues used for filtering for checkins/tweets within the polygon
q_result_venues = list(q_result['urlId'].values) + list(q_result['venueId'].values)
for_merge = q_result[['maincat', 'maincatId', 'subcat', 'subcatId', 'name', 'urlId']].copy()
feature['properties']['num_checkins'] = str(len(c))
feature['properties']['num_users'] = str(len(c['userId'].unique()))
feature['properties']['venues'] = str(list(c['urlId'].unique()))
checkins_by_categories_dict = dict(c.groupby('maincat').size().order(ascending=False))
checkins_by_sub_categories = c.groupby(['maincat', 'subcat']).size()
for k, v in checkins_by_categories_dict.items():
    checkins_by_categories_dict[k] = {
        'count': str(v),
        'checkins_by_sub_categories_dict': dict(zip(checkins_by_sub_categories[k].index,
                                                    list(map(int, checkins_by_sub_categories[k].values))))}
feature['properties']['checkins_by_categories'] = checkins_by_categories_dict
venues_by_categories_dict = dict(c.drop_duplicates('venueId').groupby('maincat').size().order(ascending=False))
venues_by_sub_categories = c.drop_duplicates('venueId').groupby(['maincat', 'subcat']).size()
for k, v in venues_by_categories_dict.items():
    venues_by_categories_dict[k] = {
        'count': str(v),
        'venues_by_sub_categories_dict': dict(zip(venues_by_sub_categories[k].index,
                                                  list(map(int, venues_by_sub_categories[k].values))))}
feature['properties']['venues_by_categories'] = venues_by_categories_dict
users_by_categories_dict = dict(c.drop_duplicates('userId').groupby('maincat').size().order(ascending=False))
users_by_sub_categories = c.drop_duplicates('userId').groupby(['maincat', 'subcat']).size()
for k, v in users_by_categories_dict.items():
    users_by_categories_dict[k] = {
        'count': str(v),
        'users_by_sub_categories_dict': dict(zip(users_by_sub_categories[k].index,
                                                 list(map(int, users_by_sub_categories[k].values))))}
feature['properties']['users_by_categories'] = users_by_categories_dict
# calculate polygon area, from http://stackoverflow.com/a/4683144
lng, lat = zip(*list(polygon.exterior.coords))
pa = Proj("+proj=aea")  # Albers equal-area projection
x, y = pa(lng, lat)
cop = {"type": "Polygon", "coordinates": [zip(x, y)]}
feature['properties']['area'] = round(shape(cop).area / 1000000, 4)  # convert to square km
feature['properties']['label'] = label
categories = ['Arts & Entertainment', 'College & University', 'Food', 'Nightlife Spot',
              'Residence', 'Outdoors & Recreation', 'Professional & Other Places',
              'Shop & Service', 'Travel & Transport']
colors = dict(zip(categories, sns.color_palette("hls", 10)))
properties['checkins_per_venue'] = properties['num_checkins'].astype(float) / properties['num_venues'].astype(float)
properties['checkins_per_user'] = properties['num_checkins'].astype(float) / properties['num_users'].astype(float)
properties['users_per_venue'] = properties['num_users'].astype(float) / properties['num_venues'].astype(float)
if filter:
    properties.drop(properties['area'].argmax(), inplace=True)
return properties
if category == 'main':
    data = data[['category', 'venues', 'checkins', 'users',
                 'venues_perc', 'checkins_perc', 'users_perc']]
elif category == 'sub':
    data = data[['subcategory', 'venues', 'checkins', 'users',
                 'venues_perc', 'checkins_perc', 'users_perc']]
return data
cluster_data_by_sub_category['venues_per_sqkm'] = cluster_data_by_sub_category['venues'] / row['area']
data = data.append(cluster_data_by_sub_category.values.tolist())
columns = list(cluster_data_by_sub_category.columns)
data.columns = columns
data.index = np.arange(len(data))
return data

# 'Event' was taken out from this list of categories because it doesn't appear in
# the London data and causes an error
categories = ['Arts & Entertainment', 'College & University', 'Food', 'Nightlife Spot',
              'Residence', 'Outdoors & Recreation', 'Professional & Other Places',
              'Shop & Service', 'Travel & Transport']
if category == 'main':
    data = data.groupby('category').sum()
    data['category'] = data.index
elif category == 'sub':
    data.index = data['subcategory']
return data
ax = data.plot('subcategory', prop,
               kind='barh',
               color=[colors[i] for i in data['category']],
               figsize=(10, len(data) / 3),
               legend=False)
ax.xaxis.tick_top()
if title:
    plt.title(title, y=1.035)
else:
    plt.title('number of ' + prop + " in cluster " + str(label) +
              " by Foursquare's subcategories", y=1.035)
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(13)
plt.show()
return data.sort(prop, ascending=False)
if sortdata:
    data.sort(prop, inplace=True)
ax = data.plot('category', prop,
               kind='barh',
               color=[colors[i] for i in data['category']],
               figsize=(10, len(data) / 3),
               legend=False)
ax.xaxis.tick_top()
if title:
    plt.title(title, y=1.17)
else:
    plt.title('number of ' + prop + " in cluster " + str(label) +
              " by Foursquare's main categories", y=1.17)
plt.show()
return data.sort(prop, ascending=False)
fig = plt.figure()
ax = fig.add_subplot(111)
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(13)
ax.xaxis.tick_top()
ax2 = ax.twinx()
data.plot('label', prop,
          kind='barh',
          ax=ax,
          color=[colors[i] for i in data['category']],
          figsize=(10, len(data) / 3),
          legend=True)
data.plot('category', prop,
          kind='barh',
          ax=ax2,
          color=[colors[i] for i in data['category']],
          figsize=(10, len(data) / 3),
          legend=True)
if title:
    plt.title(title, y=1.035)
else:
plt.show()
return data.sort(prop, ascending=False)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.xaxis.tick_top()
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(13)
ax2 = ax.twinx()
data.plot('label', prop,
          kind='barh',
          ax=ax,
          color=[colors[i] for i in data['category']],
          figsize=(10, len(data) / 3),
          legend=True)
data.plot('subcategory', prop,
          kind='barh',
          ax=ax2,
          color=[colors[i] for i in data['category']],
          figsize=(10, len(data) / 3),
          legend=True)
if title:
    plt.title(title, y=1.005)
else:
    plt.title('number of ' + prop + " in clusters (label) by Foursquare's subcategories" +
              " in these main categories: " + ', '.join(categories), y=1.005)
plt.show()
return data.sort(prop, ascending=False)
fig = plt.figure()
ax = fig.add_subplot(111)
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(13)
ax.xaxis.tick_top()
data.plot('category', prop,
          kind='barh',
          ax=ax,
if title:
    plt.title(title, y=1.1)
else:
    plt.title(prop + " in city by Foursquare's main categories", y=1.1)
plt.show()
return data.sort(prop, ascending=False)
9.4. Scripts for comparing Lower Super Output Areas with Livehoods clusters in
terms of ethnic diversity
9.4.1. Python script: extract_ldn_lsoa.ipynb
# This script was used to extract the Greater London LSOAs from the UK LSOAs.
# Before this, the LSOA shapefiles were downloaded from Ordnance Survey
# and converted with ogr2ogr to geojson in the right projection. See
# http://ben.balter.com/2013/06/26/how-to-convert-shapefiles-to-geojson-for-useon-github/
import fiona
c = fiona.open('../_Data/ldn_lsoa_2011_shp/lsoa_2011.geojson', 'r')
import pandas as pd
# this file was downloaded from the LSOA 2011 census data
lsoa_data = pd.read_excel('../_Data/lsoa-data.xls', skiprows=2)
london_lsoa_codes = lsoa_data['Codes']
import json
outpath = '../_Data/ldn_lsoa_2011_shp/ldn_lsoa_2011.geojson'
crs = " ".join("+%s=%s" % (k, v) for k, v in c.crs.items())
features = []
for polygon in c:
    if polygon['properties']['LSOA11CD'] in london_lsoa_codes.values:
        fea = {'type': 'Feature',
               'geometry': polygon['geometry'],
               'properties': polygon['properties'],
               'id': polygon['properties']['LSOA11CD'],
               }
        features.append(fea)
# Save as GeoJSON
open(outpath, "wb").write(json.dumps(feature_collection).encode('utf-8'))
print('File saved at: ', outpath)
# this calculates the ethnic diversity value (Hirschman Index) for each
# Livehoods cluster by using 2011 LSOA data.
# It looks for LSOAs that intersect the Livehoods cluster and uses data
# from these LSOAs.
# The result is saved in the geojson for visualization and analysis
import fiona
import pandas as pd
import json
from shapely.geometry import shape
results_geojson = fiona.open('../_Analysis/wamp/set7_ldn_10_results.geojson')
lsoa_geojson = fiona.open('../_Data/ldn_lsoa_2011_shp/ldn_lsoa_2011.geojson', 'r')
# this file was downloaded from lsoa 2011 census data
lsoa_data_eth = pd.read_csv('../_Data/lsoa_2011_data_eth.csv')
def get_hirschman_index(feature):
    import math
    cluster_pop_total = 0
    cluster_eth_white = 0
    cluster_eth_mixed_multi = 0
    cluster_eth_asian = 0
    cluster_eth_black = 0
    cluster_eth_others = 0
    cluster_eth_BAME = 0
    lsoa_intersect_str = feature['properties']['lsoa_intersect']
    lsoa_intersect = lsoa_intersect_str.replace("'", "").replace("[", "").replace("]", "").replace(" ", "").split(sep=",")
    # (omitted: loop over lsoa_intersect summing each LSOA's population and
    # ethnic-group counts from lsoa_data_eth into the totals above)
    hirschman_index = 1 - (math.pow(cluster_eth_white/cluster_pop_total, 2) +
                           math.pow(cluster_eth_mixed_multi/cluster_pop_total, 2) +
                           math.pow(cluster_eth_asian/cluster_pop_total, 2) +
                           math.pow(cluster_eth_black/cluster_pop_total, 2) +
                           math.pow(cluster_eth_others/cluster_pop_total, 2) +
                           math.pow(cluster_eth_BAME/cluster_pop_total, 2)
                           )
    return hirschman_index
features = []
for feature in results_geojson:
    lsoa_intersect = []
    for lsoa in lsoa_geojson:
        if shape(lsoa['geometry']).intersects(shape(feature['geometry'])):
            lsoa_intersect.append(lsoa['properties']['LSOA11CD'])
    feature['properties']['lsoa_intersect'] = str(lsoa_intersect)
    features.append(feature)
feature_collection = {'type': 'FeatureCollection',
                      'features': features,
                      'crs': {'type': 'name',
                              'properties': {
                                  'name': 'urn:ogc:def:crs:EPSG::4326'
                              }
                              }
                      }
# Save as GeoJSON
outpath = ('../_Analysis/wamp/set7_ldn_10_results.geojson')
open(outpath, "wb").write(json.dumps(feature_collection).encode('utf-8'))
print('File saved at: ', outpath)
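The index computed above is the complement of a Herfindahl-Hirschman concentration measure: one minus the sum of squared group shares, so a higher value means a more evenly mixed population. A minimal standalone sketch with hypothetical group counts (the function name and numbers are illustrative, not from the census data):

```python
import math

def hirschman_diversity(counts):
    """Return 1 - sum of squared population shares; higher = more diverse."""
    total = sum(counts)
    return 1 - sum(math.pow(c / total, 2) for c in counts)

# hypothetical counts for three ethnic groups in one cluster
counts = [500, 300, 200]
print(round(hirschman_diversity(counts), 2))  # shares 0.5, 0.3, 0.2 -> 1 - 0.38 = 0.62
```

A single-group area scores 0, and the value approaches 1 as the population spreads evenly over more groups.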
# this saves the ethnic diversity (Hirschman Index) measure in the lsoa geojson.
# the measure was precomputed in Excel using Excel formulas
import fiona
import pandas as pd
import json
from shapely.geometry import shape
features = []
for feature in lsoa_geojson:
    lsoa11cd = feature['properties']['LSOA11CD']
    hirschman_index = lsoa_data_eth[lsoa_data_eth['codes'] == lsoa11cd]['eth_HI'].values[0]
    feature['properties']['eth_diversity_HI'] = str(round(hirschman_index*100)/100)
    features.append(feature)
feature_collection = {'type': 'FeatureCollection',
                      'features': features,
                      'crs': {'type': 'name',
                              'properties': {
                                  'name': 'urn:ogc:def:crs:EPSG::4326'
                              }
                              }
                      }
# Save as GeoJSON
outpath = ('../_Data/ldn_lsoa_2011_shp/ldn_lsoa_2011.geojson')
open(outpath, "wb").write(json.dumps(feature_collection).encode('utf-8'))
print('File saved at: ', outpath)
results_geojson = fiona.open('../_Analysis/wamp/set7_ldn_10_results.geojson')
lsoa_geojson = fiona.open('../_Data/ldn_lsoa_2011_shp/ldn_lsoa_2011.geojson', 'r')
# this file was downloaded from lsoa 2011 census data
lsoa_data_eth = pd.read_csv('../_Data/lsoa_2011_data_eth.csv')
# create a pandas df that lists the cluster label, the ethnic diversity index
# for the cluster, the lsoas that intersect the cluster, and the average
# ethnic diversity index for these clusters
feature_eth_list = []
for feature in results_geojson:
    lsoa_intersect_str = feature['properties']['lsoa_intersect']
    lsoa_intersect = lsoa_intersect_str.replace("'", "").replace("[", "").replace("]", "").replace(" ", "").split(sep=",")
    #print(feature['properties']['label'], feature['properties']['eth_diversity_HI'], lsoa_intersect)
    feature_eth_list.append((feature['properties']['label'],
                             feature['properties']['eth_diversity_HI'], lsoa_intersect))
feature_eth_list = pd.DataFrame(feature_eth_list,
                                columns=['label', 'eth_HI', 'lsoa_intersect'])
for i, row in feature_eth_list.iterrows():
    list_HI = []
    for lsoa in row['lsoa_intersect']:
        list_HI.append(lsoa_data_eth[lsoa_data_eth['codes'] == lsoa]['eth_HI'].values[0])
    sum_HI = sum(list_HI)
    feature_eth_list.ix[i, 'avg_HI'] = round(sum_HI/len(row['lsoa_intersect'])*100)/100
    feature_eth_list.ix[i, 'min_HI'] = round(min(list_HI)*100)/100
    feature_eth_list.ix[i, 'max_HI'] = round(max(list_HI)*100)/100
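The per-cluster summary above (average, minimum, and maximum of the intersecting LSOAs' indices, with the same round-to-two-decimals trick) can be checked on a toy table. This sketch uses `.loc` in place of the older `.ix` indexer, and all values and column names are made up:

```python
import pandas as pd

# toy version of feature_eth_list: one row per cluster, each row carrying
# the Hirschman indices of its intersecting LSOAs
df = pd.DataFrame({'label': ['A', 'B'],
                   'lsoa_HI': [[0.2, 0.4, 0.6], [0.5, 0.7]]})

for i, row in df.iterrows():
    vals = row['lsoa_HI']
    # round(x*100)/100 keeps two decimal places, as in the script above
    df.loc[i, 'avg_HI'] = round(sum(vals) / len(vals) * 100) / 100
    df.loc[i, 'min_HI'] = round(min(vals) * 100) / 100
    df.loc[i, 'max_HI'] = round(max(vals) * 100) / 100

print(df[['label', 'avg_HI', 'min_HI', 'max_HI']])
```

Cluster A gets avg_HI 0.4 (mean of 0.2, 0.4, 0.6) and cluster B gets avg_HI 0.6, with min/max taken over each cluster's LSOA list.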
# The remaining notebook cells set m = 6 through m = 20 in turn and repeated
# the steps above for each value; their output is not reproduced here.