Location Detection Over Social Media

Media Engineering and Technology Faculty
German University in Cairo
Bachelor Thesis
Author: Ahmed Soliman

Supervisors: Sarah Elkasrawy
Submission Date: XX July, 20XX

This is to certify that:
(i) the thesis comprises only my original work toward the Bachelor Degree
(ii) due acknowledgement has been made in the text to all other material used
Ahmed Soliman
XX July, 20XX
Acknowledgments
Text
Abstract
Abstract
Contents
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Outline
2 Related Work
  2.1 Non-Content-Based Location Estimation
  2.2 Content-Based Location Estimation
3 Data
  3.1 Dataset Overview
  3.2 Data Analysis
    3.2.1 Geographical Analysis
    3.2.2 Language Analysis
  3.3 Ground Truth Data
4 Location Detection Approaches
  4.1 Profile Location Identification
  4.2 Location Detection by Language
  4.3 Location Detection by Machine Learning
    4.3.1 Heuristic-Based Approach
    4.3.2 Information Theory-Based Approach
5 Implementation
  5.1 Data Preprocessing
    5.1.1 Data Collection
    5.1.2 Data Labeling
    5.1.3 Data Distribution
  5.2 Feature Set Extraction
    5.2.1 Filtering Out Noisy Words
    5.2.2 CALGARI Algorithm Implementation
    5.2.3 IGR Algorithm Implementation
  5.3 Training and Testing
  5.4 Web Demo
6 Conclusion
7 Future Work
Appendix
A Lists
  List of Abbreviations
  List of Figures
References
Chapter 1
Introduction
1.1 Motivation
Micro-blogging services such as Twitter, Facebook and Tumblr have grown rapidly in recent
years. As of March 2013, 400 million tweets were being posted every day [8]. This has
prompted substantial research efforts to mine this data and use it in various applications,
such as event detection [Sakaki et al. 2010; Agarwal et al. 2012] and news recommendation
[Phelan et al. 2009]. Many applications could make use of information about users'
locations, but unfortunately this information is very sparse. The research firm Sysomos
studied Twitter usage between mid-October and mid-December 2009 and found that only 0.23%
of tweets in that period were geo-tagged, which indicates how sparse this information is.
Although blogging services allow users to specify their location in their profiles, the
profile location field is not reliable. Cheng et al. found that only 26% of a random sample
of over 1 million Twitter users revealed their city-level location in their profiles, and
only 0.42% of the tweets in that dataset were geo-tagged [Cheng et al. 2010]. Moreover,
these profile locations are not always valid: only 42% of Twitter users in a random dataset
were reported to have entered a valid city-level location in their profiles [Hecht et al.
2011].
1.2 Aim
This thesis discusses approaches for predicting users' locations in order to overcome the
location sparseness problem described above. These approaches are based purely on tweet
content and tweeting behaviour, in the absence of any other location information. The goal
is to develop approaches that can predict the location of a tweet. The key step towards
this goal is to predict the home location of the user, as the home location can give
important clues to the actual location of the tweet. The intuition is that the content of a
tweet may contain words, entity names or phrases that are more likely to be used in
particular places than in others, and these can serve as indicators of the actual location.
Being able to predict the possible locations of a tweet would be very beneficial in
tracking applications such as news verification, in which we want to distinguish tweets
reported by users who are likely to be at the actual location of an event from tweets
reported by users who are likely to be far away.
1.3 Outline
The remainder of this thesis covers related work, the dataset, the formalization of the
location prediction problem, location classification approaches, and an evaluation of the
discussed algorithms and approaches. It closes with the conclusion and a discussion of
future work.
Chapter 2
Related Work
This chapter reviews prior work related to this study. The prior studies can be
categorized into the following areas:
2.1 Non-Content-Based Location Estimation

Many studies have explored estimating users' locations based on information provided in
user profiles, geo-tagged tweets and other social information.
A number of studies make use of the location information provided in users' Twitter
profiles. For example, Kulshrestha et al. [2012][8] used the location information provided
by users in their profiles together with map APIs to estimate country-level location; they
were able to estimate the country-level location for about 23.5% of users with 94.7%
accuracy. However, location prediction techniques that rely on information provided by
users are not always reliable, because map APIs do not always return correct results. In
addition, users frequently do not enter correct information: for example, Hecht et al.
reported that 34% of users either enter non-geographic information in the location field
or leave the field empty.
Other studies make use of GPS coordinates provided by users' mobile devices. However, the
number of geotagged tweets is too small to rely on: Cheng et al. [2010] reported that the
proportion of geotagged tweets is around 1%, and the locations of the majority of users are
not geotagged. Other studies use methods based on IP addresses (Buyukkokten, Cho,
Garcia-Molina, Gravano & Shivakumar, 1999). These methods have been shown to achieve around
90% accuracy at mapping Internet hosts to their locations, as reported by Padmanabhan &
Subramanian [2001] [9]. However, these methods are not applicable to Twitter and other
social media services, as the geographical divisions of IP addresses are not always valid.
For example, departments of an international corporation might use the same IP addresses
while their true locations are spread across the world. As another example, users of VPNs
may be assigned IP addresses from locations other than their true ones.
Some studies use other social information to infer the location of users. For example,
Popescu and Grefenstette [10] tried to estimate the home country of Flickr users using
place names and coordinates provided with their photos. Backstrom et al. [2] presented an
algorithm that predicts the physical location of a user from the social network structure
of Facebook: given the known locations of a user's friends, they were able to locate 69.1%
of the users with 16 or more located friends to within 25 miles, compared to only 57.2%
using IP-based methods.
2.2 Content-Based Location Estimation

The content of users' tweets has been exploited to extract their location: for example, if
a place is frequently mentioned in a user's tweets, the user is likely tweeting from that
place. Several methods are based on that intuition, such as naive gazetteer matching
(Bilhaut, Charnois, Enjalbert & Mathet, 2003)[3], named entity recognition, and
vocabulary-based methods that identify location names in tweets (Agarwal et al. [2012])[1].
A number of methods have been proposed to estimate the home location of users based on
content analysis of tweets. These methods build probabilistic models from tweet content.
For example, Eisenstein et al. [2011][6] reported 58% accuracy for predicting regions
(4 regions) and 24% accuracy for predicting states (48 continental US states and the
District of Columbia) using geographic topic models.
Estimating city-level location is more challenging than estimating location at coarser
levels, because the number of cities in a dataset is often larger than the number of
states, regions or countries. Cheng et al. [2010][5] described a city-level location
estimation algorithm in which local words are identified in tweets (for example, "red sox"
is local to Boston) and statistical models are built from them. However, their method has
the drawback that it needs manual selection of such local words to train a supervised
classification model. In addition, they reported approximately 51% accuracy using their
approach. Chang et al. [2012][4] more recently described another content-based location
estimation method using a Gaussian Mixture Model (GMM) and Maximum Likelihood Estimation
(MLE), and reported 50% accuracy in predicting city-level location within 100 miles of the
actual city location, which is comparable to Cheng et al. [2010].
Chapter 3
Data
3.1 Dataset Overview

The experiments conducted in this thesis use a dataset collected and analyzed by Benjamin
Bischke [?]. This dataset was collected from the Decahose streaming API and represents
about ten percent of the random public tweets over all activities on Twitter during the
period from 2016-01-15 to 2016-02-06.
The dataset consists of around 1 billion activities, divided as follows: 47.16% are new
posts, 30.23% are shared content or retweets, and the remaining 22.6% are deletions of
tweets.
3.2 Data Analysis

The geolocation prediction models presented here have primarily been trained and evaluated
using geotagged tweets, which are filtered by language when conducting the different
experiments. This section presents a geographic and a language analysis of the tweets used.
3.2.1 Geographical Analysis
Twitter gives users the option to embed their current GPS location. Only about 2.1%
(16,874,517) of the activities in the dataset included an embedded precise GPS location.
Geographical information can also be inferred indirectly from the profile location field in
a user's profile, but this information is not reliable: only 43.1% of unique user profiles
included a non-empty profile location field. In addition, only about half of these
non-empty profile location fields were successfully mapped to a real location; the other
half contained locations that do not exist or, in some cases, valid but incomplete
addresses, such as state or country names, which are hard to map to a unique location.
By extracting GPS coordinates from users' profiles and mapping them to countries, we can
see the distribution of the geo-locations of tweets. Fig. 3.1 shows the top 10 countries
extracted from the dataset.
3.2.2 Language Analysis
Previous research has focused primarily on English data or has used datasets that
consisted primarily of English tweets. However, Twitter is a multilingual platform, and
including other languages may help in the task of location prediction, as language can be
a powerful indicator of location. For example, if a user tweets mostly in Chinese, this
could be an indicator that the user is from China.
For the analysis of the languages used in tweets, a language detector was applied. Fig.
3.2 shows the top 10 most frequently used languages in the dataset.
Fig. 3.1 Country distribution based on GPS information. This pie chart shows the top-10
countries determined by GPS information.

Country               Share
United States         29.76 %
Brazil                14.49 %
Argentina              6.58 %
Japan                  5.26 %
United Kingdom         4.62 %
Turkey                 3.51 %
Philippines            3.09 %
Indonesia              2.49 %
Malaysia               2.45 %
Spain                  2.16 %
All Other Countries   25.59 %
Fig. 3.2 Language distribution based on the language detector. This pie chart shows the
top-10 most frequently used languages.

Language              Share
English               36.5 %
Japanese              19.8 %
Spanish               10.4 %
Portuguese             6.6 %
Arabic                 6.3 %
Korean                 3.2 %
Indonesian             3.1 %
French                 2.6 %
Turkish                2.2 %
Thai                   2.1 %
All Other Languages    7.2 %
3.3 Ground Truth Data

To conduct the experiments presented in Chapter 4, the dataset described in the previous
sections was filtered. Only geotagged post activities were used to train the prediction
models: from the total of 16,874,517 geotagged activities, a sample of around 13.9%
(2,350,906) was chosen as ground truth data.
This sample included only post activities. By extracting GPS coordinates from users'
profiles and mapping them to countries, we obtained the distribution of the geo-locations
of tweets in the sample; Fig. 3.3 shows the top 10 countries extracted from the sample.
The sample also covered many languages, as some experiments in Chapter 4 use tweets in
various languages. Fig. 3.4 shows the top 10 most frequently used languages in the sample
tweets.
Chapter 4
Location Detection Approaches
4.1 Profile Location Identification
The first, naive approach is to obtain the location of the user from the profile location
field in the user's profile. As discussed before, this data is not always reliable, for
reasons such as: 1) profile fields contain invalid locations, for example phrases
expressing a desire to keep that information private, jokes, sexual content and even
expressions of how much a user hates their current location [Hecht et al. 2011]; 2) profile
fields may be completely empty; 3) some fields contain valid but incomplete addresses, such
as a state name or country name only.
4.2 Location Detection by Language
Social media applications let users publish their status updates and tweets in any
language they choose. Language can be a strong indicator of location: for example, a user
who writes tweets in Chinese is most probably located in China. The problem is that some
languages are spoken in many locations around the world, such as English, which is an
official language in 67 countries [8]. Prediction based on language is therefore not
accurate enough to determine the country where a tweet was published, but a list of
possible locations (at country level) can be obtained by mapping languages to the countries
that speak them as an official or second language [10]. By classifying the language of a
tweet, a list of countries that speak this language can then be obtained. The problem with
this approach is that the list can be large and contain irrelevant countries.
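The language-to-country mapping described in this section can be sketched as a simple
lookup. The table below is a hypothetical, deliberately tiny excerpt chosen for
illustration; it is not the mapping used in the thesis, and the function name
candidate_countries is an assumption. The language code is assumed to come from a separate
language detector.

```python
from typing import Dict, List

# Hypothetical, deliberately tiny lookup table: language code -> countries in
# which that language is an official or widely used language.
LANGUAGE_TO_COUNTRIES: Dict[str, List[str]] = {
    "zh": ["China", "Singapore", "Taiwan"],
    "ja": ["Japan"],
    "pt": ["Brazil", "Portugal", "Angola", "Mozambique"],
    "tr": ["Turkey", "Cyprus"],
}


def candidate_countries(language_code: str) -> List[str]:
    """Return the candidate countries for a detected tweet language.

    An unknown language yields an empty list, meaning that this approach
    cannot narrow down the location at all.
    """
    return LANGUAGE_TO_COUNTRIES.get(language_code, [])


if __name__ == "__main__":
    print(candidate_countries("pt"))
```

Even in this small table, a single language code already maps to several countries,
which shows why the resulting candidate list can grow large for widely spoken languages.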
4.3 Location Detection by Machine Learning

Machine learning can be used to detect the location of a tweet based purely on its
content. A classifier for city-level location detection has been implemented. Each tweet in
our training dataset corresponds to a training example, and the corresponding output is the
geolocation provided with that tweet. The number of classes in the trained models equals
the total number of locations (cities) in our training dataset.
To train a classifier on tweet content, the words contained in the tweets should be ranked
according to their location indicativeness, which measures how strongly a word is
associated with a particular location. The next two sections present two different
approaches to ranking the words in order to obtain the feature set.
4.3.1 Heuristic-Based Approach

In this approach a simple heuristic algorithm called CALGARI [2] is used. This algorithm
is based on the intuition that a model will perform better if it is trained on words that
are more likely to be used by users from particular regions than by users from the general
population. A score is calculated for each term, and the words are then ranked according to
this score. In mathematical terms, the score of a term is defined as the maximum
conditional probability of the word given a class, taken over the set of all classes
(cities), divided by the probability of the word.
The score is calculated as follows. Let score(W) be a function that takes a word W and
calculates its score, f(W) be the frequency of the word W in our dataset, count(W, c) be a
function that counts how many times the word W appeared with class c, S be the set of all
words in our dataset and C be the set of all classes (city locations) in our dataset. The
score of each word is then calculated as follows:
\[ \mathrm{score}(W) = \frac{\max_{c \in C} P(W \mid c)}{P(W)} \]

where P(W) is the probability of the presence of the word W, calculated as the frequency of
the word over the total number of words,

\[ P(W) = \frac{f(W)}{|S|}, \]

and P(W | c) is the conditional probability of the presence of word W given some class c,
calculated as the number of times word W appeared with class c over the total number of
occurrences of all words with class c. The numerator is therefore evaluated as

\[ \max_{c \in C} P(W \mid c) = \max_{i} \frac{\mathrm{count}(W, c_i)}{\sum_{j} \mathrm{count}(t_j, c_i)} \]

where c_i ranges over the classes in C and t_j over the words in S.
After calculating a score for each word, the algorithm sorts the words by score in
non-increasing order, so that the highest-scoring words come first. Chapter 6 discusses the
effect of using the top-n% of the feature set generated by this algorithm.
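As a purely illustrative calculation, consider made-up counts that are not taken from the
dataset: a word W appears 30 times among the 1,000 word occurrences of city c_1 and 10
times among the 4,000 word occurrences of city c_2, so that f(W) = 40 and the total number
of words is 5,000. The score then follows directly from the definitions above:

\[ P(W \mid c_1) = \frac{30}{1000} = 0.03, \qquad P(W \mid c_2) = \frac{10}{4000} = 0.0025, \qquad P(W) = \frac{40}{5000} = 0.008 \]

\[ \mathrm{score}(W) = \frac{\max(0.03,\ 0.0025)}{0.008} = 3.75 \]

A word that is used equally often everywhere has P(W | c_i) = P(W) for every city and
therefore a score of 1, so scores well above 1 mark words concentrated in one city.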
4.3.2 Information Theory-Based Approach
In addition to the heuristic method described in the previous section, we also discuss an
information-theoretic feature selection method, as such methods, e.g. Information Gain (IG)
(Yang & Pedersen, 1997), have proved to be efficient in text classification tasks. In
addition, it was reported that this method achieves the best results in the location
detection task [7].
First, let us define two important terms. The first is Information Gain (IG). The
Information Gain is the difference in class (location) entropy due to splitting the data on
some attribute (word), so the higher the value, the greater the predictive power of the
word. Given the set S of all words in our training set, the IG of a word $W \in S$ across
all classes (cities) C is calculated as follows:

\[ IG(W) = H(c) - H(c \mid W) \]

\[ -H(c \mid W) = P(W) \sum_{c \in C} P(c \mid W) \log P(c \mid W) + P(\overline{W}) \sum_{c \in C} P(c \mid \overline{W}) \log P(c \mid \overline{W}) \]

where $P(W)$ and $P(\overline{W})$ are the probabilities of the presence and absence of
word W, respectively, and $P(c \mid W)$ and $P(c \mid \overline{W})$ are the conditional
probabilities of class c when word W is present and absent, respectively. Because H(c) is
constant over all words, only the conditional entropy given word W needs to be calculated
to rank the features.
The second term is the Intrinsic Value (IV). Local words occurring in a small number of
cities usually have a low intrinsic value, whereas non-local words have a high intrinsic
value. When words are comparable in IG, words with a smaller intrinsic value should
therefore be preferred, because they are more locally used (location indicative).

\[ IV(W) = -P(W) \log P(W) - P(\overline{W}) \log P(\overline{W}) \]

With these two terms, the Information Gain Ratio (IGR) is defined as the ratio of the
information gain to the intrinsic value:

\[ IGR(W) = \frac{IG(W)}{IV(W)} \]
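The effect of the intrinsic value can be illustrated with two made-up words that are
assumed to have the same information gain of 0.02 bits. The probabilities below are
hypothetical and base-2 logarithms are assumed. A common word with P(W_1) = 0.5 has the
largest possible intrinsic value, while a rare word with P(W_2) = 0.01 has a much smaller
one:

\[ IV(W_1) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 \]

\[ IV(W_2) = -0.01 \log_2 0.01 - 0.99 \log_2 0.99 \approx 0.0664 + 0.0144 \approx 0.081 \]

\[ IGR(W_1) = \frac{0.02}{1} = 0.02, \qquad IGR(W_2) \approx \frac{0.02}{0.081} \approx 0.25 \]

Dividing by the intrinsic value therefore ranks the rare word well above the common one,
although both have the same information gain.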
Chapter 5
Implementation
The main aim of this study is to detect location based purely on the content of tweets.
Every part of this task is challenging: for example, a large amount of data is needed to
train a classifier that gives good results, and the training itself is challenging because
out-of-core learning has to be implemented to handle this volume of data. This chapter
presents and discusses how each part of the project operates, along with the challenges we
faced when developing each component.
5.1 Data Preprocessing
The first requirement for getting good results from a machine learning model is clean and
well-preprocessed data. This section presents every step in collecting our training and
testing data.
5.1.1 Data Collection
The first step in building a good machine learning model is collecting data: generally,
the more data we have, the more accurate the results the model can give. This study focuses
mainly on Twitter data. As mentioned in previous chapters, Twitter gives users the option
to include a geolocation (GPS coordinates) with their tweets, so the main task here is to
obtain a large number of geotagged tweets, which is challenging because geotagged tweets
are scarce. The first source of data is the dataset described in Chapter 3, from which the
geotagged new posts were collected using a simple Python script. As the second source of
Twitter data, we collected tweets covering multiple regions of the world over one week
using the Twitter Public API with the help of the tweepy library (footnote 1).
Footnote 1: http://www.tweepy.org/
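The extraction of geotagged new posts could look roughly like the sketch below. It is not
the script used in the thesis. It assumes that the dataset is stored as one JSON activity
per line and that the records follow the classic Twitter tweet format, in which a
"coordinates" field holds a GeoJSON point in (longitude, latitude) order, retweets carry a
"retweeted_status" field and deletions carry a "delete" field. The function name
geotagged_new_posts is hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def geotagged_new_posts(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield new posts that carry GPS coordinates and skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            activity = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records
        if not isinstance(activity, dict):
            continue
        # Deletions and retweets are not new posts (assumed field names).
        if "delete" in activity or "retweeted_status" in activity:
            continue
        coordinates = activity.get("coordinates")
        if coordinates and coordinates.get("type") == "Point":
            # GeoJSON stores points as [longitude, latitude].
            longitude, latitude = coordinates["coordinates"]
            yield {
                "text": activity.get("text", ""),
                "lat": latitude,
                "lon": longitude,
            }
```

Because the function is a generator over lines, it can be fed an open file object and
processes a large dump one record at a time without loading it into memory.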
5.1.2 Data Labeling
To make the data ready for training and testing, the tweets need to be labeled with unique
identifiers that map to a location, as these labels are used as the expected output for
each tweet input. A Python script labels the tweets by deriving the location from the
coordinates provided with each tweet using a reverse geocoding library (footnote 2). In
this library, a K-D tree is populated with cities that have a population of more than one
thousand; the source of the data is GeoNames (footnote 3). When the library is called with
GPS coordinates, it returns a city name, which is mapped to a unique identifier used to
label the queried tweet.
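The labeling step can be sketched as follows. To keep the example runnable without the
reverse geocoding library, the city lookup is passed in as a function; the names
label_tweets and lookup_city are hypothetical, and the sketch shows only how city names
might be turned into integer labels, not the thesis script itself.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude)


def label_tweets(
    coordinates: Iterable[Coordinate],
    lookup_city: Callable[[Coordinate], str],
) -> Tuple[List[int], Dict[str, int]]:
    """Map each coordinate to a city and each city to a stable integer label."""
    city_to_id: Dict[str, int] = {}
    labels: List[int] = []
    for coordinate in coordinates:
        city = lookup_city(coordinate)
        if city not in city_to_id:
            city_to_id[city] = len(city_to_id)  # next unused identifier
        labels.append(city_to_id[city])
    return labels, city_to_id
```

With the reverse-geocoder package installed, lookup_city could presumably be written as
lambda c: reverse_geocoder.search(c)[0]["name"]. That call shape follows the library's
README and should be checked against the installed version. City names alone are not
unique worldwide, so a real labeling script may need to combine the name with the region
or country code.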
5.1.3 Data Distribution
Before features are extracted, the data needs to be divided into training data and testing
data. The data is therefore shuffled and divided into two groups to be used later in
training and testing.
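A minimal sketch of this shuffle-and-split step is shown below. The 80/20 ratio and the
fixed seed are assumptions made for illustration, since the text does not state the split
that was actually used, and the function name shuffle_split is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffle_split(
    examples: Sequence[T], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle the examples and split them into (training, testing) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(examples)              # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed makes the split reproducible, so that different feature selection methods
can be compared on exactly the same training and testing tweets.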
5.2 Feature Set Extraction
Not all input words are equally important for training classifiers, so feature selection
methods are implemented to filter out noisy words and rank the remaining words according
to their importance. This section discusses the implementation of the two feature
selection approaches presented in Chapter 4.
5.2.1 Filtering Out Noisy Words
Before running the algorithms that rank words according to their importance, noisy words
need to be filtered out. These words may be very frequent but are not important (the, an,
in, etc.). They are called stop words and are usually the most common words in a language.
To remove them, the NLTK (footnote 4) stopwords list is used to filter out any word that
appears in this list. After the stop words are filtered out, links, hashtags, mentions and
any word containing non-Latin alphabet characters are filtered out as well. In the end,
the tweets contain only non-noisy words, which are then ranked by the feature selection
algorithms.
Footnote 2: https://github.com/thampiman/reverse-geocoder
Footnote 3: http://download.geonames.org/export/dump/
Footnote 4: http://www.nltk.org
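The filtering steps described in this section can be sketched as below. To stay
self-contained, the sketch uses a hypothetical miniature stop-word set instead of the NLTK
list, and it approximates "Latin alphabet" by the ASCII letters a to z, which also drops
accented Latin words; both are simplifying assumptions, as is the function name
clean_tweet.

```python
import re
from typing import List, Set

# Hypothetical miniature stop-word list; the thesis uses the NLTK list instead.
STOP_WORDS: Set[str] = {"the", "a", "an", "in", "on", "at", "is", "to", "and", "of"}

# Simplification: only lowercase ASCII letters count as Latin alphabet here.
LATIN_WORD = re.compile(r"[a-z]+")


def clean_tweet(text: str, stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Return the words of a tweet that survive the noise filters."""
    kept: List[str] = []
    for token in text.lower().split():
        # Drop links, hashtags and mentions.
        if token.startswith(("http://", "https://", "www.", "#", "@")):
            continue
        token = token.strip(".,!?;:\"'()[]")  # remove surrounding punctuation
        if LATIN_WORD.fullmatch(token) and token not in stop_words:
            kept.append(token)
    return kept
```

For example, a tweet such as "The Red Sox won in Boston! #redsox @mlb" is reduced to the
four words red, sox, won and boston, which are then passed on to the ranking algorithms.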
5.2.2 CALGARI Algorithm Implementation
This section presents pseudocode for the CALGARI algorithm. Algorithm 1 presents the main
function CALGARI, which takes a word as one of its parameters and calculates the score of
that word based on the algorithm discussed in Chapter 4. This function makes use of two
other functions presented in the same algorithm, COUNT and FREQUENCY.
Algorithm 1 CALGARI Algorithm

function CALGARI(word, classes, total_number_of_words)
    W ← words in tweets
    max_probability ← 0
    for class in classes do
        conditional_probability ← COUNT(word, class) / COUNT(W, class)
        max_probability ← max(conditional_probability, max_probability)
    probability_of_word ← FREQUENCY(word) / total_number_of_words
    score_of_word ← max_probability / probability_of_word
    return score_of_word

function COUNT(words, class)
    count ← 0
    for word in words do
        if word appeared with class then
            count ← count + 1
    return count

function FREQUENCY(word)
    W ← words in tweets
    count ← 0
    for w in W do
        if word is w then
            count ← count + 1
    return count
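A compact Python sketch of the same scoring scheme is given below. It is not the thesis
implementation. It assumes that each training example is a pair of a cleaned word list and
a city label, and that the total number of words means the total number of word
occurrences in the dataset. The function name calgari_scores is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def calgari_scores(tweets: Iterable[Tuple[List[str], str]]) -> Dict[str, float]:
    """Score every word as max over cities of P(word | city), divided by P(word)."""
    word_freq: Counter = Counter()                                # f(W)
    class_word_counts: Dict[str, Counter] = defaultdict(Counter)  # count(W, c)
    for words, city in tweets:
        word_freq.update(words)
        class_word_counts[city].update(words)

    total_words = sum(word_freq.values())
    # Total number of word occurrences observed with each city.
    class_totals = {c: sum(cnt.values()) for c, cnt in class_word_counts.items()}

    scores: Dict[str, float] = {}
    for word, freq in word_freq.items():
        p_word = freq / total_words
        max_conditional = max(
            class_word_counts[c][word] / class_totals[c]
            for c in class_totals
            if class_totals[c] > 0
        )
        scores[word] = max_conditional / p_word
    return scores
```

The ranked feature list can then be obtained with
sorted(scores, key=scores.get, reverse=True), from which the top-n% of words are kept.
Counting all words in one pass avoids calling COUNT and FREQUENCY once per word, which
would rescan the whole dataset each time.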
5.2.3 IGR Algorithm Implementation
This section presents pseudocode for the Information Gain Ratio algorithm. In Algorithm 2,
the main function IGR takes a word as one of its parameters and calculates the Information
Gain Ratio of the word by dividing the information gain of the word by its intrinsic
value. These two quantities are calculated by the IG and IV functions, respectively, as
discussed in Chapter 4. The IG function uses the COUNT function presented in Algorithm 1.
Algorithm 2 IGR Algorithm

function IGR(word, classes, total_number_of_words)
    Information_Gain ← IG(word, classes, total_number_of_words)
    Intrinsic_Value ← IV(word, classes, total_number_of_words)
    Information_Gain_Ratio ← Information_Gain / Intrinsic_Value
    return Information_Gain_Ratio

function IG(word, classes, total_number_of_words)
    probability_of_word ← FREQUENCY(word) / total_number_of_words
    sum1 ← 0
    sum2 ← 0
    for class in classes do
        word_given_class ← COUNT(word, class) / COUNT(W, class)
        class_probability ← COUNT(W, class) / total_number_of_words
        appearance_probability ← word_given_class × class_probability / probability_of_word
        sum1 ← sum1 + appearance_probability × log2(appearance_probability)
        absence_probability ← (1 - word_given_class) × class_probability / (1 - probability_of_word)
        sum2 ← sum2 + absence_probability × log2(absence_probability)
    Information_Gain ← probability_of_word × sum1 + (1 - probability_of_word) × sum2
    return Information_Gain

function IV(word, classes, total_number_of_words)
    probability_of_word ← FREQUENCY(word) / total_number_of_words
    appearance_entropy ← -probability_of_word × log2(probability_of_word)
    absence_entropy ← -(1 - probability_of_word) × log2(1 - probability_of_word)
    return appearance_entropy + absence_entropy
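The following Python sketch computes the information gain ratio directly from the
definitions in Chapter 4. It is not the thesis implementation, and it makes two
assumptions: probabilities are estimated at tweet level (a word is present or absent in a
tweet) rather than from word-occurrence counts, and the full information gain including
H(c) is used. The names igr_scores and _plogp are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def _plogp(p: float) -> float:
    """Return p * log2(p), with the usual convention 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0


def igr_scores(tweets: Iterable[Tuple[List[str], str]]) -> Dict[str, float]:
    """Information gain ratio of each word, from tweet-level presence/absence."""
    class_counts: Counter = Counter()                   # tweets per city
    present: Dict[str, Counter] = defaultdict(Counter)  # word -> city -> tweets with word
    for words, city in tweets:
        class_counts[city] += 1
        for word in set(words):
            present[word][city] += 1

    n = sum(class_counts.values())
    class_entropy = -sum(_plogp(k / n) for k in class_counts.values())  # H(c)

    scores: Dict[str, float] = {}
    for word, by_city in present.items():
        n_present = sum(by_city.values())
        n_absent = n - n_present
        p_present = n_present / n
        p_absent = n_absent / n
        neg_cond_entropy = 0.0  # accumulates -H(c | W)
        for city, k in class_counts.items():
            if n_present:
                neg_cond_entropy += p_present * _plogp(by_city[city] / n_present)
            if n_absent:
                neg_cond_entropy += p_absent * _plogp((k - by_city[city]) / n_absent)
        info_gain = class_entropy + neg_cond_entropy
        intrinsic_value = -_plogp(p_present) - _plogp(p_absent)
        # A word present in every tweet has IV = 0 and carries no information.
        scores[word] = info_gain / intrinsic_value if intrinsic_value > 0 else 0.0
    return scores
```

In a toy dataset where one word appears only in the tweets of a single city, that word
removes all uncertainty about the city and receives a higher ratio than a word that
appears in several cities.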
5.3 Training and Testing
5.4 Web Demo

Chapter 6
Conclusion
Conclusion
Chapter 7
Future Work
Text
Appendix
Appendix A
Lists
List of Figures
Bibliography
[1] Puneet Agarwal, Rajgopal Vaithiyanathan, Saurabh Sharma, and Gautam Shroff.
Catching the long-tail: Extracting local news events from Twitter. 2012.
[2] Lars Backstrom, Eric Sun, and Cameron Marlow. Find me if you can: improving
geographical prediction with social and spatial proximity. pages 61-70, 2010.
[3] Frederik Bilhaut, Thierry Charnois, Patrice Enjalbert, and Yann Mathet. Geographic
reference analysis for geographic document querying. pages 55-62, 2003.
[4] Hau-wen Chang, Dongwon Lee, Mohammed Eltaher, and Jeongkyu Lee. @Phillies
tweeting from Philly? Predicting Twitter user locations with spatial word usage.
pages 111-118, 2012.
[5] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: a
content-based approach to geo-locating Twitter users. pages 759-768, 2010.
[6] Jacob Eisenstein, Brendan O'Connor, Noah A Smith, and Eric P Xing. A latent
variable model for geographic lexical variation. pages 1277-1287, 2010.
[7] Bo Han, Paul Cook, and Timothy Baldwin. Text-based Twitter user geolocation
prediction. Journal of Artificial Intelligence Research, 49:451-500, 2014.
[8] Juhi Kulshrestha, Farshad Kooti, Ashkan Nikravesh, and P Krishna Gummadi.
Geographic dissection of the Twitter network. 2012.
[9] Venkata N Padmanabhan and Lakshminarayanan Subramanian. An investigation of
geographic mapping techniques for internet hosts. 31(4):173-185, 2001.
[10] Adrian Popescu, Gregory Grefenstette, et al. Mining user home location and gender
from Flickr tags. 2010.
