Bachelor Thesis
(i) the thesis comprises only my original work toward the Bachelor Degree
(ii) due acknowledgement has been made in the text to all other material used
Ahmed Soliman
XX July, 20XX
Acknowledgments
Text
Abstract
Contents
Acknowledgments
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Outline
2 Related Work
2.1 Non-Content-Based Location Estimation
2.2 Content-Based Location Estimation
3 Data
3.1 Dataset Overview
3.2 Data Analysis
3.2.1 Geographical Analysis
3.2.2 Language Analysis
3.3 Ground Truth Data
5 Conclusion
6 Future Work
Appendix
A Lists
List of Abbreviations
List of Figures
References
Chapter 1
Introduction
1.1 Motivation
Micro-blogging services such as Twitter, Facebook and Tumblr have grown rapidly in recent
years. As of March 2013, 400 million tweets were being posted every day [8]. This has
prompted enormous research efforts to mine this data and use it in various applications,
such as event detection [Sakaki et al. 2010; Agarwal et al. 2012] and news recommendation
[Phelan et al. 2009]. Many applications could make use of information about users'
locations, but unfortunately this information is very sparse. The research firm Sysomos
studied Twitter usage between mid-October and mid-December 2009 and found that only
0.23% of tweets in that period were geo-tagged, which indicates how sparse this
information is. Although blogging services allow users to specify their location in their
profiles, the profile location field is not reliable. Cheng et al. found that only 26% of a
random sample of over 1 million Twitter users revealed their city-level location in their
profiles, and only 0.42% of the tweets in that dataset were geo-tagged [Cheng et al. 2010].
Moreover, these profile locations are not always valid: only 42% of Twitter users in a
random dataset reported a valid city-level location in their profiles [Hecht et al. 2011].
1.2 Aim
In this thesis, approaches for predicting users' locations are discussed to overcome the
location sparseness problem mentioned above. These approaches are based purely on tweet
content and tweeting behaviour, in the absence of any other location information. The
goal is to develop approaches that can predict the location of a tweet. The key step
towards achieving this goal is to predict the home location of the user, as the home
location can give important clues to the possible actual location of the tweet. The
intuition here is that the content of a tweet may contain words, entity names or phrases
that are more likely to be used in particular places than in others, which could give
indicators of the actual location. Approaches that can predict the possible locations of a
tweet would be very beneficial in tracking applications such as news verification, in
which we want to distinguish tweets reported by users who are likely to be at the actual
location of an event from tweets reported by users who are likely to be far away.
1.3 Outline
The remainder of this thesis covers related work, the dataset, a formalization of the
location prediction problem, location classification approaches, and an evaluation of the
discussed algorithms and approaches. It closes with the conclusion and a discussion of
future work.
Chapter 2
Related Work
This chapter reviews prior work related to this study. The prior studies can be
categorized into the following areas:
Some studies use other social information to infer the location of users. For example,
Popescu and Grefenstette [10] tried to estimate the home country of Flickr users using the
place names and coordinates provided with their photos. Backstrom et al. [2] presented an
algorithm to predict the physical location of a user using the social network structure of
Facebook. Given the known locations of users' friends, they were able to locate 69.1% of
the users with 16 or more located friends to within 25 miles, compared to only 57.2% using
IP-based methods.
Data
Twitter gives users the option to embed their current GPS location. Only about 2.1%
(16,874,517) of the activities in the dataset included an embedded precise GPS location.
On the other hand, geographical information can be inferred indirectly from the profile
location field in a user's profile, but, as noted, this information is not reliable: only
43.1% of unique user profiles included a non-empty profile location field. In addition,
only about half of these non-empty fields were successfully mapped to a real location. The
other half contained non-existent locations or, in some cases, valid but incomplete
addresses, such as state or country names, which are hard to map to a unique location.
Previous research has focused primarily on English data or has used datasets that
consisted primarily of English tweets. However, Twitter is a multilingual platform, and
including other languages may help in the task of location prediction, as language can be
a powerful indicator of location. For example, if a user tweets mostly in Chinese, this
could be an indicator that the user is from China.
For the analysis of the languages used in tweets, a language detector was applied. Figure
3.2 shows the top 10 most frequently used languages in the dataset.
Fig. 3.1 This pie chart shows the top-10 countries determined by GPS information.
Legend: United States, Brazil, Argentina, Japan, United Kingdom, Turkey, Philippines,
Indonesia, Malaysia, Spain, All Other Countries. Labelled shares: Brazil 14.49%,
Argentina 6.58%, Japan 5.26%, United Kingdom 4.62%, Turkey 3.51%, Philippines 3.09%,
Indonesia 2.49%, Malaysia 2.45%, Spain 2.16%.
Fig. 3.2 (pie chart of the languages used in the dataset) Labelled shares: English 36.5%,
Spanish 10.4%, Portuguese 6.6%, Arabic 6.3%, Korean 3.2%, Indonesian 3.1%, French 2.6%,
Turkish 2.2%, Thai 2.1%, All Other Languages 7.2%.
The first naive approach is to obtain the location of the user from the profile location
field included in the user's profile. As discussed before, this data is not always
reliable, for reasons such as: 1) profile fields contain invalid locations, for example
phrases expressing a desire to keep that information private, jokes, sexual content and
even expressions that indicate how much a user hates his or her current location
[Hecht et al. 2011]; 2) profile fields may be completely empty; 3) some fields contain
valid but incomplete addresses, such as a state name or country name only.
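The three failure cases of the profile location field can be illustrated with a small sketch. It is not the procedure used in this work: the gazetteer, the list of coarse place names and the function name `classify_profile_location` are all hypothetical and only show how a field might be sorted into "empty", "city", "incomplete" or "invalid".

```python
from typing import Optional

# Hypothetical toy gazetteer: lower-cased city name -> (city, country).
# A real system would need a full gazetteer; these entries are illustrative only.
CITY_GAZETTEER = {
    "cairo": ("Cairo", "Egypt"),
    "london": ("London", "United Kingdom"),
    "tokyo": ("Tokyo", "Japan"),
}

# Hypothetical list of valid places that are too coarse for a city-level location.
COARSE_PLACES = {"egypt", "japan", "texas", "united kingdom"}


def classify_profile_location(field: Optional[str]) -> str:
    """Return 'empty', 'city', 'incomplete' or 'invalid' for a profile location field."""
    if field is None or not field.strip():
        return "empty"
    # Look at each comma-separated part, e.g. "Cairo, Egypt" -> ["cairo", "egypt"].
    parts = [p.strip().lower() for p in field.split(",") if p.strip()]
    if any(p in CITY_GAZETTEER for p in parts):
        return "city"
    if any(p in COARSE_PLACES for p in parts):
        return "incomplete"
    return "invalid"
```

Under these assumptions, a field such as "Cairo, Egypt" counts as a usable city-level location, "Texas" as valid but incomplete, and a joke or privacy statement as invalid.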
Social media applications give users the free choice to publish their status updates and
tweets. Language can be a strong indicator of location: for example, a user who writes
tweets in Chinese is most probably located in China. The problem is that some languages
are spoken in many locations around the world, such as English, which is an official
language in 67 countries [8]. Prediction based on language alone is therefore not accurate
enough to determine the country where the tweet was published, but a list of possible
locations (at country level) can be obtained by mapping languages to the countries that
speak them as an official or second language [10]. By classifying the language of the
tweet, a list of countries that speak this language can then be obtained. The problem with
this approach is that the list could be large and contain irrelevant countries.
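The mapping from a detected language to candidate countries can be sketched as a simple lookup. The table below is a deliberately tiny, hypothetical excerpt keyed by ISO 639-1 codes, and `candidate_countries` is an illustrative name; the language detection step itself is assumed to have happened already.

```python
from typing import Dict, List

# Hypothetical, deliberately small mapping from language code to countries where
# the language is official or widely spoken. Illustrative only, not exhaustive.
LANGUAGE_TO_COUNTRIES: Dict[str, List[str]] = {
    "ja": ["Japan"],
    "tr": ["Turkey", "Cyprus"],
    "pt": ["Brazil", "Portugal", "Angola", "Mozambique"],
    "en": ["United States", "United Kingdom", "Australia", "Canada", "India", "Nigeria"],
}


def candidate_countries(language_code: str) -> List[str]:
    """Return country-level candidates for a detected tweet language."""
    # Unknown languages yield an empty candidate list rather than an error.
    return list(LANGUAGE_TO_COUNTRIES.get(language_code.lower(), []))
```

Even in this toy table, the candidate list for "en" is much longer than the one for "ja", which is the weakness described above: for widely spoken languages the list narrows the location very little.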
10 CHAPTER 4. LOCATION DETECTION APPROACHES
Where $P(W)$ is the probability of the presence of word $W$, calculated as the frequency
of the word over the total number of words,
\[ P(W) = \frac{f(W)}{|S|}, \]
and $P(W \mid c)$ is the conditional probability of the presence of word $W$ given some
class $c$, calculated as the number of times word $W$ appeared with class $c$ over the
total number of occurrences of all words with class $c$,
\[ P(W \mid c_i) = \frac{\mathrm{count}(W, c_i)}{\sum_j \mathrm{count}(t_j, c_i)}, \]
so $\max_{c_i \in C} P(W \mid c_i)$ is evaluated as
\[ \max_i \frac{\mathrm{count}(W, c_i)}{\sum_j \mathrm{count}(t_j, c_i)}. \]
Now after calculating a score for each word, the algorithm sorts the words according to
the calculated score in non-decreasing order. In chapter 6 we discuss the effect of using
top-n% of the feature set generated using this algorithm.
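The two probabilities and the top-n% selection can be illustrated with a minimal sketch. Several points are assumptions rather than the method evaluated here: the toy tweets are invented, the total number of words is taken to be the total number of word occurrences, and the ranking key is simply the maximum class-conditional probability, used as a stand-in for the word score.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy training data: (city label, tokenised tweet).
TWEETS: List[Tuple[str, List[str]]] = [
    ("cairo", ["nile", "traffic", "today"]),
    ("cairo", ["nile", "sunset"]),
    ("london", ["tube", "traffic", "today"]),
    ("london", ["tube", "rain", "today"]),
]


def word_probabilities(tweets):
    """Estimate P(W) and P(W | c) from raw word counts."""
    total = Counter()                    # f(W): occurrences of W over all tweets
    per_class = defaultdict(Counter)     # count(W, c): occurrences of W with class c
    for city, words in tweets:
        total.update(words)
        per_class[city].update(words)
    n_words = sum(total.values())        # assumed meaning of "total number of words"
    p_w = {w: f / n_words for w, f in total.items()}
    p_w_given_c = {
        c: {w: n / sum(counts.values()) for w, n in counts.items()}
        for c, counts in per_class.items()
    }
    return p_w, p_w_given_c


def top_fraction(scores: Dict[str, float], fraction: float) -> List[str]:
    """Keep the highest-scoring fraction of words (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


p_w, p_w_given_c = word_probabilities(TWEETS)
# Stand-in score (an assumption): max over classes of P(W | c).
scores = {w: max(probs.get(w, 0.0) for probs in p_w_given_c.values()) for w in p_w}
selected = top_fraction(scores, 0.5)
```

With a stand-in score this crude, frequent but non-local words such as "today" can still rank highly, which is one reason a more discriminative measure is useful.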
4.3. LOCATION DETECTION BY MACHINE LEARNING 11
First, let us define two important terms. The first is Information Gain (IG). The
Information Gain is the difference in class (location) entropy due to a split of the data
on some attribute (word), so the higher the value, the greater the predictive power of the
word. Given the set $S$ of all words in our training set, the IG of a word $W \in S$
across all classes (cities) $C$ is calculated as follows:
\[ IG(W) = H(c) - H(c \mid W) \]
\[ -H(c \mid W) = P(W) \sum_{c \in C} P(c \mid W) \log P(c \mid W) + P(\bar{W}) \sum_{c \in C} P(c \mid \bar{W}) \log P(c \mid \bar{W}) \]
where $P(W)$ and $P(\bar{W})$ are the probabilities of the presence and absence of word
$W$, respectively, and $P(c \mid W)$ and $P(c \mid \bar{W})$ are the conditional
probabilities of class $c$ when word $W$ is present and absent, respectively. Because
$H(c)$ is constant over all words, only the conditional entropy given word $W$ needs to be
calculated to rank the features.
The second term is the Intrinsic Value (IV). Local words occurring in a small number of
cities usually have a low intrinsic value, whereas non-local words have a high intrinsic
value. When words are comparable in IG value, words with a smaller intrinsic value should
therefore be preferred, because they are more locally used (location-indicative).
With the two terms defined above, the Information Gain Ratio (IGR) is defined as the ratio
of the information gain to the intrinsic value:
\[ IGR(W) = \frac{IG(W)}{IV(W)} \]
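The ratio can be illustrated with a short sketch on invented data. Two points are assumptions: word presence is counted per user, and IV is taken to be the entropy of the presence/absence split of the word, a common definition that is not spelled out in this section. The names `USERS`, `entropy` and `information_gain_ratio` are illustrative.

```python
import math
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical toy data: (city label, set of words used by one user).
USERS: List[Tuple[str, Set[str]]] = [
    ("cairo", {"nile", "today"}),
    ("cairo", {"nile", "traffic"}),
    ("london", {"tube", "today"}),
    ("london", {"tube", "traffic"}),
]


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    positive = [n for n in counts if n > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    return -sum(n / total * math.log2(n / total) for n in positive)


def information_gain_ratio(word: str, users) -> float:
    """IGR(W) = IG(W) / IV(W), with IV assumed to be the presence/absence entropy."""
    all_classes = Counter(city for city, _ in users)
    present = Counter(city for city, words in users if word in words)
    absent = Counter(city for city, words in users if word not in words)
    n = len(users)
    n_present, n_absent = sum(present.values()), sum(absent.values())
    # H(c | W) = P(W) * H(c | W present) + P(not W) * H(c | W absent)
    h_cond = ((n_present / n) * entropy(present.values())
              + (n_absent / n) * entropy(absent.values()))
    ig = entropy(all_classes.values()) - h_cond
    iv = entropy([n_present, n_absent])   # assumed definition of IV
    # A word that is always or never present carries no split information.
    return ig / iv if iv > 0 else 0.0
```

On this toy data, "nile" appears only with Cairo users and receives the maximum ratio, whereas "today" is used equally in both cities and receives a ratio of zero.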
Chapter 5
Conclusion
Chapter 6
Future Work
Text
Appendix
Appendix A
Lists
List of Figures
Bibliography
[1] Puneet Agarwal, Rajgopal Vaithiyanathan, Saurabh Sharma, and Gautam Shroff.
Catching the long-tail: Extracting local news events from Twitter. 2012.
[2] Lars Backstrom, Eric Sun, and Cameron Marlow. Find me if you can: Improving
geographical prediction with social and spatial proximity. pages 61-70, 2010.
[3] Frederik Bilhaut, Thierry Charnois, Patrice Enjalbert, and Yann Mathet. Geographic
reference analysis for geographic document querying. pages 55-62, 2003.
[4] Hau-wen Chang, Dongwon Lee, Mohammed Eltaher, and Jeongkyu Lee. @Phillies
tweeting from Philly? Predicting Twitter user locations with spatial word usage.
pages 111-118, 2012.
[5] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: A
content-based approach to geo-locating Twitter users. pages 759-768, 2010.
[6] Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. A latent
variable model for geographic lexical variation. pages 1277-1287, 2010.
[7] Bo Han, Paul Cook, and Timothy Baldwin. Text-based Twitter user geolocation
prediction. Journal of Artificial Intelligence Research, 49:451-500, 2014.
[8] Juhi Kulshrestha, Farshad Kooti, Ashkan Nikravesh, and P. Krishna Gummadi.
Geographic dissection of the Twitter network. 2012.
[10] Adrian Popescu, Gregory Grefenstette, et al. Mining user home location and gender
from Flickr tags. 2010.