
Human mobility semantics analysis: a probabilistic and scalable approach

Article  in  GeoInformatica · July 2018


DOI: 10.1007/s10707-017-0295-0


4 authors, including:

Xiaohui Guo Richong Zhang


Hangzhou Innovation Institute. Beihang University (BUAA) Beihang University (BUAA)

All content following this page was uploaded by Xiaohui Guo on 05 October 2021.



Geoinformatica (2018) 22:507–539
DOI 10.1007/s10707-017-0295-0

Human mobility semantics analysis: a probabilistic and scalable approach

Xiaohui Guo1,2 · Richong Zhang1,2 · Xudong Liu1,2 · Jinpeng Huai1,2

Received: 1 August 2015 / Revised: 4 December 2016 / Accepted: 28 February 2017 /
Published online: 10 April 2017
© Springer Science+Business Media New York 2017

 Richong Zhang
zhangrc@act.buaa.edu.cn
Xiaohui Guo
guoxh@act.buaa.edu.cn
Xudong Liu
Liuxd@act.buaa.edu.cn
Jinpeng Huai
Huaijp@buaa.edu.cn

1 State Key Laboratory of Software Development Environment, School of Computer Science
and Engineering, Beihang University, 37 Xueyuan Road, Beijing 100191, China
2 Beijing Advanced Innovation Center for Big Data and Brain Computing (BDBC), Beihang
University, 37 Xueyuan Road, Beijing 100191, China

Abstract The popularity of data generated by smart mobile devices, e.g., check-ins and
geo-tagged statuses, offers new opportunities for better understanding the regularity of
human mobility. Existing work on this problem usually resorts to explicit frequency
statistics models, such as association rules and sequential patterns, and relies on
Euclidean distance to measure spatial dependence. However, the noisiness and uncertainty
of geospatial data hinder the application of these methods to human mobility in a robust
and intuitive way. Moreover, the volume and accumulation speed of mobility data challenge
traditional methods in efficiency, scalability, and time-space complexity. In this
context, we leverage full Bayesian sequential modeling to revisit mobility regularity
discovery from the perspective of high-level probabilistic semantic knowledge, and to
alleviate the uncertainty that is inherent in mobility modeling and induced by noisy
geo-data. Specifically, the mobility semantics is embodied by the underlying geospatial
topics and topical transitions of mobility trajectories. A classic variational inference
is derived to estimate posterior and predictive probabilities, and furthermore, stochastic
optimization is utilized to mitigate the costly computational overhead of the message
passing subroutine. The experimental results confirm that our approach not only reasonably
recognizes the geospatial mobility semantic patterns, but also scales up well to embrace
massive spatial-temporal human mobility activity data.

Keywords Human mobility · Geospatial semantics · Stochastic variational inference ·
Scalability

1 Introduction

With the prevalence of mobile devices that record geospatial metadata, crowd e-footprints
such as check-ins and geo-tagged texts and images are flooding into our information
society. They can be reorganized into trajectories and contain abundant regularities of
human mobility. In practice, many aspects of human mobility have been exploited, such as
frequent transition patterns for tourism [75], temporal visiting patterns [51], periodic
phenomena [41], destination preference [38], and decision tree based patterns [48]
combined with prefix trees [26]. However, these studies mainly utilize explicit
statistical analysis and interpret mobility patterns as frequent sequential movements,
which only weakly express geospatial mobility semantics in the presence of the noisiness
and imbalance of massive geospatial data. Moreover, as the forms of mobility data change,
their volume rapidly inflates, and their accumulation speed quickly increases, previously
effective methods gradually lose their effectiveness and may even stop working. All of
this naturally calls for effective and flexible methods for expressing and discovering
geospatial mobility semantics, which tolerate imperfect data and accommodate scalability
and uncertainty. As a simple example, the semantically enriched pattern
“Office → Food” can be abstracted from explicit and concrete sequential movements
like “Office Building A → Restaurant B”. Such high-level semantic patterns generalize
explicit sequential movements, and thereby heighten the expressive power of geospatial
modeling.
To unleash this kind of geospatial mobility semantic knowledge, we assume that there
are latent topics corresponding to the observed venues. Meanwhile, when people move
from place to place, semantic transitions between latent topics are induced accordingly.
Our objective is to discover these latent yet perceptible geospatial semantic topics and
topical transitions. As the previous example, “many people go to restaurants after
work”, indicates, such phenomena are captured by coarse-grained transitions among the
geospatial semantic topics rather than by the adjacent locations themselves. Elevated to
an abstraction level, these geospatial topics and transitions, whose granularity can be
flexibly traded off by probabilistic modeling, are apt to reflect the geospatial
regularity of human mobility as semantic knowledge. To demonstrate the versatility and
usage of our mobility semantics analysis approach, we identify two potential application
problems, semantic venue labeling and successive place prediction, which are the focus
throughout this paper.
Based on the above analysis, this study adopts the full Bayesian hidden Markov model
framework, a probabilistic generative sequential model, to mine and convey human
mobility semantics. With Multinomial and Gaussian distributions modeling the two types of
venue representation, discrete venue identifiers and continuous geo-coordinates
respectively, these models can simultaneously express the geospatial topical semantics of
the locations and reveal the topical transitions among the check-in trajectories (also
called location sequences when there is no ambiguity). Specifically, assimilating the
essence of probabilistic latent variable models and the topic metaphor, this modeling
uncovers the emergent hidden semantic structure from the observed venues. Furthermore, by
imposing the first-order Markov property on the trajectories’ topic assignment sequences,
we characterize the underlying coarse-grained topical transitions among successive venue
pairs. This is a flexible and abstract kind of semantic knowledge, rather than explicit
co-occurrence frequency statistics of venues in mobility trajectories.
With respect to estimation and inference, we first design a batch structured mean-field
variational inference, based on deterministic optimization, to approximate the posterior
distributions. Specifically, it divides the latent model variables into a global,
model-related category and a local, sequence-specific category, and fully factorizes the
distributions of the global variables, but retains the linear chain dependency structure
of the local variables. Then, it alternates between the global and local variational
factors to optimize the variational bound objective through the coordinate ascent
algorithm.
In the local factor update steps, however, this batch structured variational inference
method resorts to a forward filtering and backward smoothing [52] subroutine, which scans
all the observed sequences to pursue an optimal posterior distribution of the local
variables for the gradient computation in the global steps. The volume of trajectory data
has inevitably become tremendous, and the requirements on business response time are
becoming more and more demanding. This makes the local procedure both time and space
consuming, and hence the scalability of the algorithm is severely limited. Stochastic
Variational Inference (SVI) [27] is one of the most effective solutions for unleashing
the potential of variational inference in big data scenarios. It leverages a stochastic
optimization approach based on the Robbins-Monro averaging scheme [53] to optimize the
evidence lower bound. This family of recursive optimization algorithms appeals to
researchers addressing scalability issues, because only one sample or minibatch is fed to
the stochastic algorithm when the gradient is recomputed at each update step, just as in
stochastic gradient descent [13].
In light of this, we also present a stochastic structured variational inference approach
for the two proposed models, the full Bayesian hidden Markov models with Multinomial and
Gaussian emissions. Instead of analyzing all observed trajectories in each iteration, we
feed the data to the algorithm sequence by sequence, i.e., one sequence per iteration.
Specifically, before updating the parameters of the global factors, the algorithm
performs the local update steps on a randomly sampled sequence only, to find the optimal
local variational factors. Then, it replicates the sample to simulate the full-sized
dataset, and estimates a noisy but unbiased natural gradient. Finally, the algorithm
executes a stochastic natural gradient ascent step to complete this round of updating the
global variational parameters. In addition, a minibatch extension of this stochastic
inference is also given.
For the experimental studies, we collected real-world Twitter, Foursquare, and Gowalla
data to validate our models in several aspects. The semantics analysis experiments
consist of an intuitive explanation of the geospatial semantics, and the performance on
the two application problems we identified, semantic venue labeling and successive place
prediction. The results confirm the effectiveness of our approach for recognizing and
expressing human mobility semantics. In addition, we compare the scalability of the
proposed structured variational inference method with that of a state-of-the-art
inference algorithm. Moreover, the batch version, the stochastic version, and the
minibatch extension of the structured variational inference method are investigated
intensively. The results show that these stochastic approximation based variational
methods not only greatly reduce the time cost, but also converge to better optima.
In summary, the contributions of this work lie mainly in the following five aspects.
– We discover the probabilistic semantics of human mobility trajectories, i.e., the
high-level geospatial topics and topical transitions;
– We identify two applications, semantic venue labeling and successive place prediction,
to embody the value of probabilistic human mobility semantics;
– We formulate geospatial human mobility as a probabilistic generative model, i.e.,
hidden Markov models;
– Besides a structured variational inference method, we also propose a scalable
stochastic variational inference and its minibatch extension;
– Last, experimental studies are carried out to demonstrate the ability to discover
human mobility semantics and the scalability performance.
The rest of this paper is organized as follows. First, we review some necessary and
fundamental techniques in Section 2. Then, we elaborate the human mobility semantics
analysis problem, the modeling, the inference, and the computational complexity analysis
in Section 3. The experimental results on semantics analysis and scalability comparison
are reported in Section 4. After summarizing the related work in Section 5, we conclude
this work in Section 6.

2 Preliminaries

In this section, we review some important components of probabilistic modelling and
inference methodology: probabilistic graphical latent variable models, stochastic
approximation optimization, and its counterpart used in mean-field variational inference
for probabilistic models.

2.1 Probabilistic latent variable models

Probabilistic graphical models incorporating latent variables have been widely used
to capture hidden knowledge and complex structure from observed data. Mixture models
(e.g., the Gaussian mixture), admixture models (e.g., latent Dirichlet allocation), and
sequential data models (e.g., the hidden Markov model) belong to this category.
Generally speaking, a Hidden Markov Model (HMM) [52] is assumed to consist of two coupled
stochastic processes, namely an observation sequence $\{y_t\}_{t>0}$ and a hidden state
transition sequence $\{z_t\}_{t>0}$, as exemplified by Fig. 1a, where $t$ is the temporal
evolution index and the discrete-time case is considered here. The states are also called
topics or regimes, and often take values in a finite set of natural numbers, namely
$z_t \in \{1, \cdots, K\}$. The underlying hidden state transitions are governed by a
Markov chain.

[Figure 1 panels: (a) Hidden Markov Model, (b) Mixture Model, (c) S]

Fig. 1 Review of latent variable models & SVI hierarchy framework


Suppose two specific corresponding sequences $y := \langle y_0, y_1, \cdots, y_N \rangle$
and $z := \langle z_0, z_1, \cdots, z_N \rangle$; the joint distribution can accordingly
be written as

$$p(y, z) = p(z_0)\, p(y_0 \mid z_0) \prod_{t=1}^{N} p(z_t \mid z_{t-1}, A)\, p(y_t \mid z_t, \Theta)$$

where $p(z_0)$ is the initial state probability, $A$ is the transition probability matrix
with $A_{ij} = p(z_t = j \mid z_{t-1} = i)$, and $p(y_t \mid z_t = k, \theta_k)$ is the
emission or observation probability with the parameters
$\Theta = \{\theta_k\}_{k=1\cdots K}$. These emission probabilities vary according to the
specific application modelling context.
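To make the factorization concrete, the following minimal sketch evaluates the logarithm of the joint distribution for one sequence. It assumes discrete emissions stored in a hypothetical matrix `B` with `B[k, v] = p(y_t = v | z_t = k)`; the function name, the argument names, and the toy numbers are illustrative assumptions, not part of the original model description.

```python
import numpy as np

def hmm_log_joint(y, z, pi0, A, B):
    """Log of p(y, z) for one sequence under an HMM with discrete emissions.

    pi0[k]  = p(z_0 = k)                  (initial state probability)
    A[i, j] = p(z_t = j | z_{t-1} = i)    (transition matrix)
    B[k, v] = p(y_t = v | z_t = k)        (assumed discrete emission table)
    """
    # Initial state and first observation.
    logp = np.log(pi0[z[0]]) + np.log(B[z[0], y[0]])
    # One transition term and one emission term per later time step.
    for t in range(1, len(y)):
        logp += np.log(A[z[t - 1], z[t]]) + np.log(B[z[t], y[t]])
    return logp

# Toy parameters: K = 2 hidden states, 3 possible observations.
pi0 = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
y_obs = [0, 1, 2]
z_hid = [0, 0, 1]
joint = np.exp(hmm_log_joint(y_obs, z_hid, pi0, A, B))
```

Working in log space avoids numerical underflow for long sequences, which is why the sketch sums logarithms instead of multiplying probabilities.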
The mixture model, in Fig. 1b, assumes that $N$ latent variables $z$ independently control
mixture component selection for generating $N$ observations $y$. Here $\pi$ denotes the
mixing coefficients and $y$ is governed by the mixture components
$p(y \mid z = k, \theta_k)$, parameterized by $\Theta = \{\theta_k\}_{k=1\cdots K}$. This
differs slightly from the admixture model, latent Dirichlet allocation, which adds one
more layer to the Bayesian hierarchical model to convey document-specific topic mixing
weights, whereas the mixture model mixes the different topics globally.
The hidden Markov model can be viewed as a generalization of the mixture model: it treats
the sequence as a mixture of latent topics, and enriches the transitions of the temporal,
sequential latent variables with first-order Markov chain correlation. These modelling
capabilities of HMMs, namely the topic assignments and the successive topic transitions,
are highly desirable in human mobility semantics analysis.

2.2 Stochastic approximation

The stochastic approximation method traces back to the seminal Robbins-Monro algorithm
[53], a scheme of recursive optimization algorithms for root or extremum estimation via
small batches of noisy samples. Recently, stochastic approximation has been revived and
plays an instrumental role in big data analytics. It endows batch algorithms, which
iteratively scan the whole observed data for many rounds, with new versatility and high
efficiency.

2.2.1 Stochastic gradient descent

Assume that we optimize some loss function $\ell(\hat{y}, y)$ of a training pair $(x, y)$,
with $\hat{y} = f_w(x)$ parameterized by $w$, to seek $f_{w^*} \in \mathcal{F}$ in a
hypothesized function family, in order to predict future data following a certain
distribution $p(x, y)$.
Theoretically, to settle this, we need the expected risk of the loss function
$E[\ell] = \int \ell(f_w(x), y)\, dp(x, y)$. But this is intractable because the
distribution is unknown. Usually, we resort to minimizing the empirical risk
$Q(w) = \sum_{i=1}^{N} Q_i((x_i, y_i), w) = \sum_{i=1}^{N} \ell(f_w(x_i), y_i)$, which
measures the performance of $f_w$ on the training set $\{(x_i, y_i)\}_{i=1}^{N}$.
Of course, the gradient descent method is competent for this task. But in order to update
the parameter $w$ that determines the predictor $f_w$, we need to operate on the whole
data to compute the true gradient $\nabla Q(w)$. Instead, at each iteration, the
stochastic gradient descent (SGD) algorithm [13] randomly picks a sample $(x_i, y_i)$ with
$i \sim \mathrm{Unif}(1, \cdots, N)$, and updates the parameters according to the
Robbins-Monro averaging scheme

$$w_{t+1} = w_t - \rho_t \nabla Q_i(w_t). \qquad (1)$$


In the light of [53], given a step-size sequence $\{\rho_t\}_{t=1}^{\infty}$ satisfying
the conditions $\sum_{t=0}^{\infty} \rho_t = \infty$ and
$\sum_{t=0}^{\infty} \rho_t^2 < \infty$, the noisy gradient $\nabla Q_i(w_t)$ can lead the
empirical risk $Q(w)$ into a local minimum, i.e., $\nabla Q(w^*) = 0$. In other words, the
sequence $\{w_t\}_{t=1}^{\infty}$ almost surely converges to some $w^*$ under mild
conditions.
By virtue of feeding only one example or minibatch to the update scheme in Eq. 1 at
every iteration, this method is appealing and scales well in big data settings.
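The update in Eq. 1 can be sketched on a toy least-squares problem. The squared loss, the step-size schedule $\rho_t = a(t + \tau)^{-\kappa}$ with $\kappa \in (0.5, 1]$ (one common choice that satisfies the two step-size conditions), and all names and constants below are assumptions made for illustration only.

```python
import numpy as np

def sgd_least_squares(X, y, n_steps=5000, a=0.5, tau=10.0, kappa=0.7, seed=0):
    """SGD with Robbins-Monro step sizes for Q_i(w) = 0.5 * (x_i . w - y_i)^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_steps + 1):
        i = rng.integers(N)                  # i ~ Unif over the N training pairs
        rho_t = a * (t + tau) ** (-kappa)    # assumed schedule: sum = inf, sum of squares < inf
        grad_i = (X[i] @ w - y[i]) * X[i]    # gradient of the single-sample loss Q_i
        w = w - rho_t * grad_i               # Eq. 1
    return w

# Noise-free toy data generated from a known parameter vector.
data_rng = np.random.default_rng(1)
X_train = data_rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y_train = X_train @ w_true
w_hat = sgd_least_squares(X_train, y_train)
```

Each iteration touches a single training pair, so the per-step cost is independent of $N$; this is the property that the stochastic variational inference of the next subsection inherits.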

2.2.2 Stochastic variational inference

In the probabilistic graphical model context, stochastic variational inference [27] is a
generic and powerful framework that leverages stochastic approximation to scale
variational inference to larger data. Specifically, it optimizes the evidence lower bound
through the Robbins-Monro scheme.
The SVI framework can handle a general class of Bayesian network models with global and
local hidden variables in their hierarchical graphical structures, schematically rendered
in Fig. 1c. Therein, the $N$ observations are $y = y_{1:N}$; the vector of global hidden
variables is $\theta$; the $N$ local hidden variables are $z = z_{1:N}$, each of which is
a collection of $K$ variables $z_n = z_{n,1:K}$; the vector of fixed hyperparameters is
$\alpha$. The corresponding joint distribution factorizes as
$p(y, z, \theta \mid \alpha) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(y_n, z_n \mid \theta)$,
and the desired posterior distribution is $p(z, \theta \mid y)$. Relaxing the latent
variable distributions with free parameters, i.e., $\lambda$ for the global distribution
and $\phi_{1:N}$ for the local ones, forms the mean-field variational family
$q(z, \theta) = q(\theta \mid \lambda) \prod_{n=1}^{N} q(z_n \mid \phi_n)$. Using
alternating coordinate ascent, we maximize the Evidence Lower Bound (ELBO)

$$\mathcal{L}(q) \triangleq \mathbb{E}_q[\ln p(y, z, \theta)] - \mathbb{E}_q[\ln q(z, \theta)] \le \ln p(y), \qquad (2)$$

with respect to the variational parameters $\{\lambda, \phi_{1:N}\}$, which is equivalent
to minimizing the Kullback-Leibler divergence from the variational distribution to the
posterior. The variational distributions are thus determined through their respective
parameters, and then used as surrogates to approximate the predictive distribution.
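The equivalence between maximizing the ELBO and minimizing the Kullback-Leibler divergence follows from a standard decomposition of the log evidence, sketched here in the notation of Eq. 2 (the fixed hyperparameter $\alpha$ is left implicit):

```latex
\begin{align*}
\ln p(y)
  &= \mathbb{E}_q\!\left[\ln \frac{p(y, z, \theta)}{q(z, \theta)}\right]
   + \mathbb{E}_q\!\left[\ln \frac{q(z, \theta)}{p(z, \theta \mid y)}\right] \\
  &= \mathcal{L}(q) + \mathrm{KL}\big(q(z, \theta)\,\|\,p(z, \theta \mid y)\big).
\end{align*}
```

Because the divergence is non-negative, $\mathcal{L}(q) \le \ln p(y)$; and because $\ln p(y)$ does not depend on $q$, any increase of $\mathcal{L}(q)$ is matched by an equal decrease of the divergence.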
It further assumes that all the complete conditionals are in the exponential family, i.e.,

$$\ln p(\theta \mid y, z, \alpha) = \langle \eta_g(y, z, \alpha), t_g(\theta) \rangle - A_g(\eta_g(y, z, \alpha)),$$
$$\ln p(z_{nk} \mid y_n, z_{n,-k}, \theta) = \langle \eta_\ell(y_n, z_{n,-k}, \theta), t_\ell(z_{nk}) \rangle - A_\ell(\eta_\ell(y_n, z_{n,-k}, \theta)). \qquad (3)$$

These assumptions imply that the prior of the global variable $\theta$ and the local
context complete likelihood of $(y_n, z_n)$ are from a conjugate exponential distribution
family [24], namely

$$\ln p(\theta \mid \alpha) = \langle \alpha, t(\theta) \rangle - A_g(\alpha)$$
$$\ln p(y_n, z_n \mid \theta) = \langle \theta, t(y_n, z_n) \rangle - A_\ell(\theta).$$

The sufficient statistics are $t(\theta) = (\theta, -A_\ell(\theta))$ and the
corresponding natural parameter is $\alpha = (\alpha_1, \alpha_2)$, where $\alpha_1$ and
$\theta$ are vectors of the same dimension, and $\alpha_2$ is a scalar. Then the global
complete conditional in Eq. 3 has the natural parameter

$$\eta_g(y, z, \alpha) = \Big(\alpha_1 + \sum_{n=1}^{N} t(y_n, z_n),\ \alpha_2 + N\Big). \qquad (4)$$

Furthermore, the variational distributions are assumed to be from the same corresponding
exponential families,

$$\ln q(\theta \mid \lambda) = \langle \lambda, t(\theta) \rangle - A_g(\lambda)$$
$$\ln q(z_{nk} \mid \phi_{nk}) = \langle \phi_{nk}, t(z_{nk}) \rangle - A_\ell(\phi_{nk}).$$

Then, substituting the variational and complete conditional distributions into the ELBO
of Eq. 2, we can derive the gradients of $\mathcal{L}$ with respect to the global
variational parameter $\lambda$ and the local variational parameters
$\{\phi_{nk}\}_{n,k=1}^{N,K}$ through the exponential family properties. They take nearly
identical forms, as follows,

$$\nabla_\lambda \mathcal{L} = \nabla_\lambda^2 A_g(\lambda)\,\big(\mathbb{E}_q[\eta_g(y, z, \alpha)] - \lambda\big)$$
$$\nabla_{\phi_{nk}} \mathcal{L} = \nabla_{\phi_{nk}}^2 A_\ell(\phi_{nk})\,\big(\mathbb{E}_q[\eta_\ell(y_n, z_{n,-k}, \theta)] - \phi_{nk}\big). \qquad (5)$$

Classical variational inference maximizes the bound with a coordinate ascent algorithm
that alternates between the global $\lambda$ and the local $\phi_{1:N}$. Before refining
the global parameter, the algorithm needs to analyze the whole data set to update all
$N \times K$ local parameters. This local step is time consuming when the data is massive.
To mitigate this computational bottleneck, SVI updates the global parameter using a noisy
natural gradient estimated from $N$ replicates of a random sample $y_n$, where
$n \sim \mathrm{Unif}(1, \cdots, N)$. The natural gradient can be computed by
premultiplying the gradient in Eq. 5 by the inverse Fisher information. For a specific
sample $y_n$, employing the conjugacy property of the exponential family natural parameter
in Eq. 4, we have

$$\tilde{\nabla}_\lambda \mathcal{L} = \alpha + N\, \mathbb{E}_q[t(y_n, z_n), 1] - \lambda. \qquad (6)$$

Thereby, for a given step-size sequence $\{\rho_t\}_{t=0}^{\infty}$, SVI updates the global
variational parameters along the noisy natural gradient direction at each step via

$$\lambda_{t+1} = (1 - \rho_t)\lambda_t + \rho_t\big(\alpha + N\, \mathbb{E}_q[t(y_n, z_n), 1]\big). \qquad (7)$$

For each local factor $q(z_n \mid \phi_n)$, SVI proceeds on the random sample $y_n$ with
the standard mean-field update for $\phi_n$, using the current estimate of the global
factor $q(\theta \mid \lambda_t)$.
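The global update in Eq. 7 is a convex combination of the current parameter and an intermediate estimate computed as if the sampled data point were replicated $N$ times. A minimal sketch follows; the function name and the vector `expected_stats_n`, which stands for $\mathbb{E}_q[t(y_n, z_n), 1]$ obtained from the local step, are hypothetical placeholders rather than the paper's implementation.

```python
import numpy as np

def svi_global_step(lam, alpha, expected_stats_n, N, rho):
    """One SVI update of the global variational parameter (Eq. 7).

    lam              : current global variational parameter lambda_t
    alpha            : prior natural parameter
    expected_stats_n : stand-in for E_q[t(y_n, z_n), 1] of the sampled data point
    N                : number of data points in the full data set
    rho              : step size rho_t
    """
    # Intermediate estimate: the prior plus N replicates of the sampled statistics.
    lam_hat = alpha + N * expected_stats_n
    # Convex combination, identical to lam + rho * (lam_hat - lam), i.e. a step of
    # size rho along the noisy natural gradient of Eq. 6.
    return (1.0 - rho) * lam + rho * lam_hat

# Toy numbers, chosen only to exercise the function.
lam_t = np.array([1.0, 2.0, 3.0])
alpha_prior = np.array([0.5, 0.5, 1.0])
stats_n = np.array([0.2, 0.8, 1.0])
lam_next = svi_global_step(lam_t, alpha_prior, stats_n, N=100, rho=0.1)
```

With `rho = 1` the update discards the previous estimate entirely, while a decaying step size averages the noisy intermediate estimates over iterations.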

3 Human mobility semantics model

In this section, we model the human mobility semantics analysis application with Hidden
Markov Models, incorporating two kinds of emission probabilities, the Gaussian and
Multinomial distributions. In addition, we propose a structured mean-field variational
inference algorithm based on deterministic optimization. Furthermore, in order to apply
this method to large-scale data and applications, we design two scalable inference
algorithms, stochastic structured variational inference and its minibatch extension.

3.1 Metaphors and assumptions

As analyzed in the previous sections, two factors should be highlighted in modelling the
semantics of human mobility trajectories, or sequences. One is that user-generated
trajectories collectively and statistically exhibit geospatial topical semantics. The
other is that sequential venue shifts may be governed by topical transition patterns
among the geospatial topics.
Venues serve different functions, and a single venue may even provide multiple functions.
For example, a college is first of all a place for education activities, but at the same
time it also offers sites for physical exercise and entertainment, and may even host some
very famous restaurants. From an urban planning perspective, some city regions may be
designed for particular purposes, such as industrial, commercial, residential, and other
service facility regions. Therefore, a specific venue manifests multi-functional
geospatial semantics, which may be influenced by its nearby venues, by the city planning
region where it resides, and even by user mobility contexts. This semantic uncertainty is
well suited to being expressed by a probabilistic topic modelling scheme.
On the other hand, the trajectories connecting venues are the traces left by users’ daily
routines or by unusual itineraries on holidays. They naturally reflect the user
intentions behind sequential transfers from venue to venue, and these intentions must
align with the geospatial semantics of the traversed venues to a large extent. Therefore,
the semantics of a mobility trajectory can be viewed as a mixture of geospatial topics.
For instance, on workdays, people usually go to fast-food restaurants after the morning’s
work, and go to a nearby supermarket or home after the whole day’s work. On weekends,
young people usually meet friends at entertainment sites, such as KTV or bars, after
morning physical exercise or long-distance running, whereas middle-aged people usually go
to their parents’ house. We intend to employ the first-order Markov property to capture
these conditional transitions of topical intentions, i.e., the geospatial transitional
semantics underlying the sequential check-ins.

3.2 Problem definitions

From massive check-in data with each record in the form of (userId, venue, time), we
can derive many user trajectories, i.e., venue sequences composed of $N_s$ successively
observed check-in points $s = \langle y_1, y_2, \cdots, y_{N_s} \rangle$, with
$S = \{s_i\}_{i=1}^{N}$ denoting the entire trajectory set of cardinality $N$. We may also
reorganize them in a user-specific view, $S = \{\{s_i\}_{i=1}^{N_u}\}_{u=1}^{U}$.
The venue representation depends on the particular way the data is collected, and there
are usually two kinds of data format for $y$: a bivariate geographic coordinate
(longitude, latitude) $\in \mathbb{R}^2$ or a predefined venue identifier
(venue-Id) $\in \mathbb{N}$. The venue identifier set has size $V$.
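As a rough sketch of how such trajectories could be derived from raw records, the snippet below groups check-ins by user and orders them by time. The text does not state how one user's check-ins are segmented into separate sequences, so the optional `max_gap` threshold, like the function name itself, is purely a hypothetical assumption.

```python
from collections import defaultdict

def build_trajectories(records, max_gap=None):
    """Group (userId, venue, time) records into per-user venue sequences.

    max_gap is a hypothetical segmentation rule: when given, a new sequence starts
    whenever two consecutive check-ins of a user are more than max_gap apart.
    """
    by_user = defaultdict(list)
    for user_id, venue, time in records:
        by_user[user_id].append((time, venue))

    trajectories = {}
    for user_id, visits in by_user.items():
        visits.sort(key=lambda visit: visit[0])   # chronological order
        sequences, current, last_time = [], [], None
        for time, venue in visits:
            if current and max_gap is not None and time - last_time > max_gap:
                sequences.append(current)         # close the current sequence
                current = []
            current.append(venue)
            last_time = time
        if current:
            sequences.append(current)
        trajectories[user_id] = sequences
    return trajectories

# Toy records with integer timestamps.
toy_records = [
    ("u1", "office", 1),
    ("u2", "home", 2),
    ("u1", "restaurant", 3),
    ("u1", "bar", 50),
]
whole = build_trajectories(toy_records)
split = build_trajectories(toy_records, max_gap=10)
```

The returned dictionary corresponds to the user-specific view of the trajectory set; flattening its values gives the entire set $S$.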
On the basis of the above formulation, we aim to understand the underlying semantic
regularity of human mobility in a probabilistic way. Specifically, we crystallize the
mobility semantics metaphor of Section 3.1 into two aspects: semantically labeling the
geospatial venues, and discovering personal or crowd topical transition regularities.
Furthermore, we identify the following two application problems to demonstrate the usage
and usefulness of the discovered mobility semantics.

Problem 1 Semantic Venue Labeling. Assuming a spontaneous alignment between user
intention and venue functionality, we want to predict the topical category, semantic
label, or annotation of the venues through understanding the topical transition pattern.
Namely, by recognizing the latent user intention transitions
$z := \langle z_0, z_1, \cdots, z_{N_s} \rangle$ for the observed trajectory $s$,
$y := \langle y_0, y_1, \cdots, y_{N_s} \rangle$, we can categorize the functionality
themes of the venues into certain statistically emergent topics. In this way, we expect
consecutive check-in sequences to benefit the venue labeling problem.

Problem 2 Successive Place Prediction. Having harvested the user mobility regularities,
we expect to forecast the user’s future movements. Formally, given the observed check-in
sequence $y := \langle y_0, y_1, \cdots, y_n \rangle$, determining what the next place
$y_{n+1}$ is most likely to be is one of the successive place prediction problems. Another
potential and interesting problem is predicting the missing venue $y_n$ in a partially
observed sequence $y := \langle \cdots, y_{n-1}, ?, y_{n+1}, \cdots \rangle$.
Although this paper focuses on the former, our modelling approach is able to handle
both of them.
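For intuition on the first variant of Problem 2, the sketch below computes a distribution over the next venue with forward filtering, assuming venue identifiers as observations and point estimates of the HMM parameters. The names `pi0`, `A`, and `B` and the toy values are hypothetical; the approach described in this paper works with variational posterior and predictive distributions rather than fixed point estimates.

```python
import numpy as np

def predict_next_venue(y, pi0, A, B):
    """Distribution p(y_{n+1} | y_0, ..., y_n) under fixed HMM parameters.

    pi0[k]  : initial topic probability
    A[i, j] : topic transition probability from topic i to topic j
    B[k, v] : probability that topic k emits venue identifier v (assumed table)
    """
    # Forward filtering: belief over the current topic after each observation.
    belief = pi0 * B[:, y[0]]
    belief = belief / belief.sum()
    for venue in y[1:]:
        belief = (belief @ A) * B[:, venue]
        belief = belief / belief.sum()
    # One-step-ahead topic distribution, then marginalize over the emitted venue.
    next_topic = belief @ A
    return next_topic @ B

# Toy example: 2 topics, 2 venues, topics tend to alternate.
pi0 = np.array([0.5, 0.5])
A = np.array([[0.1, 0.9],
              [0.9, 0.1]])
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])
next_dist = predict_next_venue([0], pi0, A, B)
```

The most likely next place is then `int(np.argmax(next_dist))`; ranking all venues by `next_dist` gives a top-k recommendation list instead of a single guess.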

3.3 Modelling

Introducing geospatial topic variables and their sequential Markov chain dependency into
trajectory data, we formulate human mobility semantics understanding as a full Bayesian
generative model, focusing on the two ingredients mentioned above, the topic labeling of
venues and the topical transitions.
To recognize the latent geospatial topics, we adopt the conjugate Dirichlet-Multinomial
distribution to characterize a trajectory as a topic mixture, i.e., we place a Dirichlet
prior on the global topic mixing weights, $\pi \sim \mathrm{Dir}(\alpha)$. For a specific
check-in venue $y_t^s$ in an observed sequence $s \in S$, we assume it belongs to one of
$K$ geospatial topic categories, and introduce a latent topic assignment (a.k.a. semantic
label) variable $z_t^s \sim \mathrm{Mult}(\pi)$. Given a particular $z_t^s$, we further
consider the conditional distribution of the observed venue $p(y_t^s \mid z_t^s)$ for the
two venue representation modes declared in Section 3.2. For the geo-coordinates case, $K$
bivariate Gaussian topic components parameterized by $\theta_k = \{\mu_k, \Sigma_k\}$ are
employed as in [65], with a common Normal-inverse-Wishart (NIW, in Fig. 2a) prior imposed,

$$\mu_k, \Sigma_k \mid \mu_0, \kappa_0, \nu_0, \Sigma_0 \sim \mathcal{N}(\mu_k \mid \mu_0, (\kappa_0 \Sigma_k))\, \mathcal{W}(\Sigma_k \mid \Sigma_0, \nu_0)$$

where $\mathcal{W}$ is the inverse Wishart distribution, $\nu_0$ is called the number of
degrees of freedom, and $\Sigma_0$ is a scale matrix with the same size $K \times K$ as
the Gaussian precision. The NIW distribution above, which couples the Gaussian mean
$\mu_k$ with the Wishart distributed $\Sigma_k$, is the conjugate prior for Gaussian
distributions whose mean and covariance are both unknown. For the identifier
representation case, we assume each venue identifier is drawn from one of $K$ Multinomial
distributions with parameters $\theta_k = \{\omega_k\}$, all of which share a Dirichlet
prior $\mathrm{Dir}(\gamma)$, schematically rendered in Fig. 2c.
So far, we have fulfilled the topic metaphor of venue labels, and the geospatial
semantics of a trajectory is a topic mixture, i.e., a “bag of venue topics”. Note that
the successive venue labeling topics are independent. In this way, the expressive power
of the model is actually equivalent to that of the Mixture of Gaussians, attained by
combining Figs. 1b and 2a, and of the Mixture of Multinomials, attained by combining
Figs. 1b and 2c.

[Figure 2 panels: (a) Normal Inverse Wishart, (b) Bayesian HMM, (c) Dirichlet-Multinomial]

Fig. 2 Bayesian HMMs with conjugated emission distributions

For human mobility applications, however, the topic of the current venue is often
correlated with and strongly dependent on the previous locations in the trajectory, owing
to geographical restrictions and user intention. For simplicity and effectiveness, a
first-order Markov chain is utilized to realize this conditional transition dependency of
the geospatial topics. Specifically, we denote the initial geospatial topic distribution
of all trajectories by $\mathrm{Mult}(\pi_0)$, and the subsequent topic assignments are
governed by the transition probability matrix $A_{K \times K}$, formed by stacking the
Multinomial-distributed probability row vectors $\pi_k = A_{k\cdot}$, with
$p(A) = \prod_{k=1}^{K} p(\pi_k)$. Namely, given the current geospatial topic assignment
$z_t = k$, the next topic will be $z_{t+1} \sim \mathrm{Mult}(A_{k\cdot})$. Moreover, in
order to obtain a full Bayesian version of our formulation, we impose reasonable priors
on the relevant parameters, i.e., the Dirichlet prior $\mathrm{Dir}(\alpha)$ on the
initial and transition distribution parameters $\pi_0, \pi_{k=1\cdots K}$ of the
geospatial topics.
Finally, integrating the emission probabilities defined above as two kinds of mixture
components, the Hidden Markov Model with Gaussian EMission (HMM-GausEM) is derived, as
shown jointly by Fig. 2a and b. Likewise, the Hidden Markov Model with Multinomial
EMission (HMM-MultEM) is obtained with the graphical notations of Fig. 2b and c. Overall,
the generative process of the full Bayesian HMMs can be summarized as follows, with the
hidden Markov temporal modelling structure,


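$$\pi_k \overset{iid}{\sim} \mathrm{Dir}(\alpha), \quad k = 0, \cdots, K, \qquad A \triangleq \begin{pmatrix} -\,\pi_1\,- \\ \cdots \\ -\,\pi_K\,- \end{pmatrix}$$
$$z_1 \sim \mathrm{Mult}(\pi_0), \qquad z_{t+1} \mid z_t = k \sim \mathrm{Mult}(\pi_k).$$

The Gaussian-distributed geographic coordinates are generated through

$$y_t \mid z_t = k \sim \mathcal{N}(\mu_k, \Sigma_k), \qquad \mu_k, \Sigma_k \overset{iid}{\sim} \mathrm{NIW}(\mu_0, \kappa_0, \nu_0, \Sigma_0),$$

and the Multinomial venue identifier observations are generated through

$$y_t \mid z_t = k \sim \mathrm{Mult}(\omega_k), \qquad \omega_k \overset{iid}{\sim} \mathrm{Dir}(\gamma).$$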
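The generative process with Multinomial emissions can be simulated in a few lines. The sketch below draws one trajectory of venue identifiers; the symmetric Dirichlet hyperparameters, the function name, and the default values are assumptions chosen for illustration, and the Gaussian emission case is not covered.

```python
import numpy as np

def sample_hmm_multem(K, V, n_steps, alpha=1.0, gamma=1.0, seed=0):
    """Sample one topic sequence z and venue sequence y from the HMM-MultEM process.

    K: number of geospatial topics, V: number of venue identifiers.
    alpha, gamma: assumed symmetric Dirichlet hyperparameters.
    """
    rng = np.random.default_rng(seed)
    # pi_k ~ Dir(alpha), k = 0..K: row 0 is the initial distribution pi_0,
    # rows 1..K are stacked into the transition matrix A.
    pi = rng.dirichlet(alpha * np.ones(K), size=K + 1)
    pi0, A = pi[0], pi[1:]
    # omega_k ~ Dir(gamma): per-topic distribution over venue identifiers.
    omega = rng.dirichlet(gamma * np.ones(V), size=K)

    z = np.empty(n_steps, dtype=int)
    y = np.empty(n_steps, dtype=int)
    z[0] = rng.choice(K, p=pi0)               # first topic from pi_0
    y[0] = rng.choice(V, p=omega[z[0]])
    for t in range(1, n_steps):
        z[t] = rng.choice(K, p=A[z[t - 1]])   # next topic from the row of the current topic
        y[t] = rng.choice(V, p=omega[z[t]])   # venue identifier emitted by the topic
    return z, y, pi0, A, omega

z_seq, y_seq, pi0_s, A_s, omega_s = sample_hmm_multem(K=3, V=10, n_steps=20)
```

Sampling synthetic trajectories in this way is a convenient check for an inference implementation, since the recovered topics can be compared against the known `A` and `omega`.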

From the above modelling analysis, HMM-GausEM and HMM-MultEM can flexibly and suitably
characterize the semantics of daily human mobility. The corresponding generative process
encodes the following joint distribution over the geospatial topic labeling sequences
$Z = \{z^s := \langle z_0^s, z_1^s, \cdots, z_{N_s}^s \rangle\}_{s=1}^{N}$, the observed
trajectories $Y = \{y^s := \langle y_0^s, y_1^s, \cdots, y_{N_s}^s \rangle\}_{s=1}^{N}$,
and the global latent variables $\Phi = \{\pi_0, A, \Theta\}$ with
$\Theta = \{\theta_k\}_{k=1\cdots K}$, which reside outside the trajectory plate in
Fig. 2b,

$$p(Y, Z \mid \Phi) = \prod_{s=1}^{N} p(y^s, z^s \mid \Phi) = \prod_{s=1}^{N} p(z_0^s \mid \pi_0)\, p(y_0^s \mid z_0^s, \theta_{z_0^s}) \prod_{t=1}^{N_s} p(z_t^s \mid z_{t-1}^s, A)\, p(y_t^s \mid z_t^s, \theta_{z_t^s}).$$

For notational simplicity, the model hyperparameters $\Psi = \{\alpha, \beta\}$, where
$\beta = \{\mu_0, \kappa_0, \nu_0, \Sigma_0\}$ for HMM-GausEM and $\beta = \{\gamma\}$ for
HMM-MultEM, are omitted hereinafter and the dependence on them is assumed implicitly.

3.4 Inference

Having grafted the two types of emission models onto the Bayesian HMM framework, we seek
the posterior distribution of the venue topic labeling sequences and the global parameter
variables given the trajectory observations, $p(Z, \pi_0, A, \Theta \mid Y)$.
Unfortunately, exact posterior inference is intractable, and we must appeal to
approximate methods. The Expectation Maximization algorithm [52] and Markov chain Monte
Carlo [55] sampling can indeed handle posterior inference for HMMs. For the sake of
scalability and time efficiency, however, we resort to optimization-oriented Bayesian
variational inference, and in particular promote stochastic variational inference to
adapt the HMM-GausEM and HMM-MultEM models to extremely massive mobility data.

3.4.1 Batch variational inference

Inspired by [9], we sort the latent variables into global and local categories, i.e.,
$\Phi = \{\pi_0, A, \Theta\}$ and $Z = \{z^s\}_{s=1\cdots N}$, according to whether they lie outside or inside
the trajectory plate of the HMM Bayesian hierarchical model, see Fig. 2b. Then, the HMM
posterior is approximated with the following structured variational family

$$q(Z, \Phi) = q(Z)\, q(\Phi) = q(Z)\, q(\pi_0)\, q(A)\, q(\Theta),$$

with the variational parameters $\hat{\Lambda} = \{\hat{\alpha}, \hat{\beta}\}$, wherein $\hat{\alpha} = \{\hat{\alpha}_0, \hat{\alpha}_{1,\cdots,K}\}$ is for the
Dirichlet priors imposed on the initial and transition parameter variables $\pi_0$ and $A$, and
$\hat{\beta} = \{\hat{\kappa}_{0k}, \hat{\mu}_{0k}, \hat{\nu}_{0k}, \hat{\Sigma}_{0k}\}_{k=1\cdots K}$ is for the NIW prior on the Gaussian emission
parameters $\theta_k = \{\mu_k, \Sigma_k\}$ in HMM-GausEM, or $\hat{\beta} = \{\hat{\gamma}_k\}_{k=1\cdots K}$ is for the Dirichlet prior
on the Multinomial emission parameter $\omega_k$ in HMM-MultEM. Concretely, this factorization
drops the dependencies between the global $\Phi$ and the local $z^s \in Z$, but preserves the
temporal chain-structured correlation within each labeling sequence $z^s$. This structured
factorization is an inherent and crucial requirement for accurate sequential model inference.
The ELBO takes the form

$$L(Z, \Phi) \triangleq E_q\!\left[\ln \frac{p(Z, Y \mid \Phi)}{q(Z)}\right] + E_q\!\left[\ln \frac{p(\Phi)}{q(\Phi)}\right] \le \ln p(Y). \qquad (8)$$
Batch variational inference maximizes this lower bound by coordinate ascent,
alternately optimizing the global and local variational factors.
In the global optimization step, we fix the local factors $q(Z) = \prod_{s=1}^{N} q(z^s)$ and
maximize the ELBO $L(\Phi)$ with respect to its variational parameters $\hat{\Lambda}$. Because conjugate
exponential families lead to easily derived coordinate ascent updates [21], we place each
variational factor in the same exponential family as its complete conditional counterpart.
Natural parameterizations are employed to represent these distributions, with the natural
parameter sets $\eta = \{\eta_\alpha, \eta_\beta\}$, $\hat{\eta} = \{\eta_{\hat{\alpha}}, \eta_{\hat{\beta}}\}$ and the same sufficient statistics
$t = \{t_\alpha, t_\beta\}$, $\hat{t} = \{t_{\hat{\alpha}}, t_{\hat{\beta}}\}$.
Then the global step is obtained by differentiating the $\Phi$-involved terms with respect to
the global variational natural parameters $\hat{\eta}$. Relying on the conjugacy (see Eq. 3) of the
Dirichlet-Multinomial and NIW-Gaussian distribution pairs, and on the exponential family
identity $E_{q(\Phi)}[t(\Phi)] = \nabla A(\hat{\eta})$, this yields the following coordinate ascent steps

$$\eta_{\hat{\alpha}_k} = \eta_{\alpha_k} + \sum_{s=1}^{N} E_{q(z^s)}\left[t_{\alpha_k}(z^s)\right] \qquad (9)$$

$$\eta_{\hat{\beta}_k} = \eta_{\beta_k} + \sum_{s=1}^{N} E_{q(z^s)}\left[t_{\beta_k}(y^s, z^s)\right] \qquad (10)$$

where the rightmost summation terms are the expectations of the sufficient statistics with
respect to the optimal local variational factors $q(Z)$. Equation 9 is the Markov chain related
update. For $k = 0$, the sufficient statistic $t_{\alpha_0}(z^s)$ is a K-dimensional vector for
$\pi_0$, and for $k = 1 \cdots K$ it is the corresponding vector for row $k$ of $A$,

$$t_{\alpha_0}(z^s) = \left(-\,t_{0j}(z^s)\,-\right)_{j=1\cdots K}, \qquad t_{0j}(z^s) = 1(z_0^s = j);$$

$$t_{\alpha_k}(z^s) = \left(-\,t_{kj}(z^s)\,-\right)_{j=1\cdots K}, \qquad t_{kj}(z^s) = \sum_{t=1}^{N_s} 1(z_{t-1}^s = k,\, z_t^s = j)$$

where $1(\cdot)$ is the indicator function, equal to 1 when its arguments or conditions are satisfied
and 0 otherwise, throughout this paper. The update in Eq. 10 is for the two types of emission
parameter variables $\Theta = \{\theta_k\}_{k=1\cdots K}$. In HMM-GausEM, $\theta_k = \{\mu_k, \Sigma_k\}$ and the sufficient
statistic $t_{\beta_k}(y^s, z^s)$ is the compact representation of the K NIW sufficient statistics,

$$t_{\mu_{0k}}(y^s, z^s) = \sum_{t=0}^{N_s} y_t^s\, 1(z_t^s = k), \qquad t_{\kappa_{0k}}(y^s, z^s) = \sum_{t=0}^{N_s} 1(z_t^s = k),$$

$$t_{\nu_{0k}}(y^s, z^s) = \sum_{t=0}^{N_s} 1(z_t^s = k), \qquad t_{\Sigma_{0k}}(y^s, z^s) = \sum_{t=0}^{N_s} y_t^s\, {y_t^s}^{\top}\, 1(z_t^s = k).$$

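In HMM-MultEM, $\theta_k = \{\gamma_k\}$, and $t_{\beta_k}(y^s, z^s)$ is the condensation of the sufficient
statistics of the K Dirichlet topic components,

$$t_{\gamma_k}(y^s, z^s) = \left(-\,t_{\gamma_k i}(y^s, z^s)\,-\right)_{i=1\cdots V}, \qquad t_{\gamma_k i}(y^s, z^s) = \sum_{t=1}^{N_s} 1(y_t^s = i,\, z_t^s = k).$$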

Note that batch variational inference requires scanning all trajectories to calculate
these sufficient statistics terms.
Henceforth, with the updates in Eqs. 9 and 10, we obtain the current variational posterior
distributions of the global variables, $q(\Phi) = q(\pi_0)\, q(A)\, q(\Theta)$, after estimating the
expectations of the above sufficient statistics.
Picking out the $q(Z)$-involved terms from the ELBO in Eq. 8, the local step seeks the optimal
local variational posterior factors for the topical labeling variable sequences,

$$q(Z) \propto \exp\left\{ \sum_{s=1}^{N} \left( E_q \ln \pi_{0 z_0^s} + \sum_{t=1}^{N_s} E_q \ln A_{z_{t-1}^s z_t^s} + \sum_{t=0}^{N_s} E_q \ln p(y_t^s \mid z_t^s) \right) \right\} \qquad (11)$$

This is the exponentiated expected log probability under the current global variational
factors $q(\Phi)$. Because of the chain coupling, we utilize Forward-Backward message propagation,
expounded in the next subsection.

3.4.2 Forward-backward message propagation

The core ingredient of the structured variational inference framework is evaluating the optimal
local factor $q(Z) = \prod_{s=1}^{N} q(z^s)$, so that, for each sequence $s$, the pairwise beliefs
$p(z_{t-1}^s, z_t^s)$, the marginal beliefs $p(z_t^s)$, and the most probable topical assignment sequence,
$z^{s*} = \arg\max_{z^s} p(z^s \mid y^s, \Phi)$, can be inferred for estimating the expectations of the
sufficient statistics and for evaluating posterior probabilities. A dynamic programming based
message passing algorithm is presented below, with the distributions of the global parameter
variables $\Phi$ held fixed.

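Since the expectation and the logarithm do not commute in Eq. 11, we define the following
parameters for the message passing procedure

$$\tilde{\pi}_{0k} \triangleq \exp E_{q(\pi_0)}\left[\ln \pi_{0k}\right], \qquad \tilde{A}_{ik} \triangleq \exp E_{q(A)}\left[\ln A_{ik}\right],$$
$$\tilde{L}_{tk}^{s} \triangleq \exp E_{q(\Theta)}\left[\ln p(y_t^s \mid z_t^s = k)\right].$$

Then, the Forward-Backward algorithm propagates the set of forward messages
$F = \{F_{tk}^{s}\}_{k=1,t=0,s=1}^{K,N_s,S}$ according to

$$F_{tk}^{s} = \sum_{i=1}^{K} F_{t-1,i}^{s}\, \tilde{A}_{ik}\, \tilde{L}_{tk}^{s}, \qquad F_{0k}^{s} = \tilde{\pi}_{0k}.$$

Then, it propagates the backward messages $B = \{B_{tk}^{s}\}_{k=1,t=0,s=1}^{K,N_s,S}$ in reverse as

$$B_{tk}^{s} = \sum_{i=1}^{K} \tilde{A}_{ki}\, \tilde{L}_{t+1,i}^{s}\, B_{t+1,i}^{s}, \qquad B_{N_s k}^{s} = 1.$$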

After performing these pairwise passes for every sequence, we retain the messages to
compute the optimal local variational distributions, $q^{*}(Z)$, in Eq. 11, and thereby also evaluate
the expected sufficient statistics in the global updates. Owing to space limitations, we only
sketch the Markov structure case of Eq. 9 and the emission case of Eq. 10 in the global
updates here, for completeness of the inference,

$$E_{q^{*}(z^s)}\left[t_{\eta_{\alpha_k}}(z^s)\right]_{kj} = \sum_{t=0}^{N_s-1} q^{*}(z_t^s = k,\, z_{t+1}^s = j) = \sum_{t=0}^{N_s-1} \frac{F_{t,k}^{s}\, \tilde{A}_{kj}\, \tilde{L}_{t+1,j}^{s}\, B_{t+1,j}^{s}}{\sum_{i=1}^{K} F_{N_s,i}^{s}}$$

$$E_{q^{*}(z^s)}\left[t_{\eta_{\beta_k}}(y^s, z^s)\right] = \sum_{t=0}^{N_s} t_{\eta_{\beta_k}}(y_t^s, z_t^s)\, q^{*}(z_t^s = k) = \sum_{t=0}^{N_s} \frac{F_{t,k}^{s}\, B_{t,k}^{s} \cdot t_{\eta_{\beta}}(y_t^s, k)}{\sum_{i=1}^{K} F_{N_s,i}^{s}}.$$
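
The message passing of this subsection can be sketched for a single sequence as follows. This is a minimal illustration rather than the authors' code. It assumes the quantities π̃, Ã and L̃ have already been computed and are passed in as plain lists, and it folds the emission factor at t = 0 into the initial forward message, a common convention that may differ from the indexing used above. The function names are hypothetical, and no rescaling is applied, so a long sequence would need rescaled or log-space messages.

```python
def forward_backward(pi_t, A_t, L_t):
    """Run unnormalized forward-backward on one sequence.

    pi_t: length-K initial factors, A_t: K x K transition factors,
    L_t: T x K emission factors. Returns (marginals, expected transition counts, normalizer).
    """
    T, K = len(L_t), len(pi_t)
    F = [[0.0] * K for _ in range(T)]
    B = [[1.0] * K for _ in range(T)]          # backward messages end at 1
    for k in range(K):
        F[0][k] = pi_t[k] * L_t[0][k]          # assumption: emission at t = 0 included here
    for t in range(1, T):
        for k in range(K):
            F[t][k] = sum(F[t - 1][i] * A_t[i][k] for i in range(K)) * L_t[t][k]
    for t in range(T - 2, -1, -1):
        for k in range(K):
            B[t][k] = sum(A_t[k][i] * L_t[t + 1][i] * B[t + 1][i] for i in range(K))
    norm = sum(F[T - 1])
    marginals = [[F[t][k] * B[t][k] / norm for k in range(K)] for t in range(T)]
    transitions = [
        [sum(F[t][k] * A_t[k][j] * L_t[t + 1][j] * B[t + 1][j] for t in range(T - 1)) / norm
         for j in range(K)]
        for k in range(K)
    ]
    return marginals, transitions, norm


def expected_emission_counts(marginals, venues, V):
    """Expected topic-venue counts for Multinomial emissions: sum_t q(z_t = k) 1(y_t = v)."""
    K = len(marginals[0])
    counts = [[0.0] * V for _ in range(K)]
    for q_t, v in zip(marginals, venues):
        for k in range(K):
            counts[k][v] += q_t[k]
    return counts


if __name__ == "__main__":
    marg, trans, _ = forward_backward(
        [0.6, 0.4], [[0.7, 0.3], [0.2, 0.8]], [[0.9, 0.2], [0.1, 0.7], [0.5, 0.5]])
    print(marg)
    print(trans)
```

The returned transition matrix and emission counts are the per-sequence expectations that enter the global updates.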

3.4.3 Stochastic variational inference

The local update steps of batch variational inference (BVI) run over the whole dataset
and are therefore very time-consuming. Moreover, the global factors are usually randomly
initialized and hence uninformative, so the local updates are wasteful and unrewarding in the
first several iterations. Meanwhile, big data usually comes with large volume and redundant
data points. Therefore, an entire pass of the batch variational inference algorithm is costly
yet yields little benefit.
There is a considerable body of research on incremental learning [35], involving
block-wise and symbol-wise maximum likelihood estimation of HMM parameters. In
variational inference for HMMs, sub-sequence (i.e., block-wise) based SVI [21] has been
applied to very long time series. For check-in data analysis, in contrast, we have a massive
number of relatively short trajectories. Therefore, we feed the data sequence by sequence to
each iteration of the upcoming SVI algorithm for HMMs.
Concretely, considering only one randomly sampled sequence $s$ instead of all sequences,
the SVI local update, i.e., forward filtering backward smoothing, finds the
optimal local variational distribution,

$$q(z^s) \propto \exp\left\{ E_q \ln \pi_{0 z_0^s} + \sum_{t=1}^{N_s} E_q \ln A_{z_{t-1}^s z_t^s} + \sum_{t=0}^{N_s} E_q \ln p(y_t^s \mid z_t^s) \right\}.$$

The SVI global update step replicates the sampled sequence N times to mimic the full dataset,
with the corresponding ELBO given by

$$L^{s} \triangleq E_q\!\left[\ln \frac{p(y^s, z^s \mid \Phi)}{q(z^s)}\right] + E_q\!\left[\ln \frac{p(\Phi)}{q(\Phi)}\right] \le \ln p(y).$$

In this way, it obtains a noisy but unbiased gradient direction, $E_s[\nabla_{\hat{\eta}} L^{s}] = \nabla_{\hat{\eta}} L$, and
further approximates the natural gradient $\tilde{\nabla}_{\hat{\eta}} L$ according to Eq. 6. Following this natural
gradient direction, SVI performs stochastic natural gradient ascent on the global variational
parameters $\hat{\eta}$ using the Robbins-Monro averaging scheme as in Eq. 7, modifying the
global updates as

$$\eta_{\hat{\alpha}_k}^{(t+1)} = (1 - \rho_t)\, \eta_{\hat{\alpha}_k}^{(t)} + \rho_t \left( \eta_{\alpha_k}^{(t)} + N\, E_{q(z^s)}\left[t_{\eta_{\alpha_k}}(z^s)\right] \right) \qquad (12)$$

$$\eta_{\hat{\beta}_k}^{(t+1)} = (1 - \rho_t)\, \eta_{\hat{\beta}_k}^{(t)} + \rho_t \left( \eta_{\beta_k}^{(t)} + N\, E_{q(z^s)}\left[t_{\eta_{\beta_k}}(y^s, z^s)\right] \right). \qquad (13)$$

Here, we slightly abuse the notation $t$ to denote the iteration index. Following [27], we set
the step-size sequence as

$$\rho_t = (t + \tau)^{-\kappa} \qquad (14)$$

where $\kappa \in (0.5, 1]$ is the forgetting rate controlling how quickly old information is
forgotten, and the delay coefficient $\tau \ge 0$ down-weights unstable updates in the early
phases.
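
As a rough illustration of Eqs. 12 to 14, the sketch below blends the current variational natural parameters with the estimate obtained from one sampled sequence scaled by N. The function names and the flat-list representation of the parameter vectors are assumptions made for this example. For a Dirichlet factor the natural parameter is the concentration minus one, so applying the same blend directly to concentrations gives an equivalent update.

```python
def step_size(t, tau=1.0, kappa=0.7):
    """Robbins-Monro step size rho_t = (t + tau) ** (-kappa); requires t + tau > 0."""
    if t + tau <= 0:
        raise ValueError("t + tau must be positive")
    return (t + tau) ** (-kappa)


def svi_global_step(eta_hat, eta_prior, expected_stats, scale, rho):
    """One stochastic natural-gradient step on a flat parameter vector.

    eta_hat: current variational natural parameters,
    eta_prior: prior natural parameters,
    expected_stats: expected sufficient statistics of the sampled sequence,
    scale: replication factor (N for single-sequence SVI).
    """
    return [
        (1.0 - rho) * current + rho * (prior + scale * stat)
        for current, prior, stat in zip(eta_hat, eta_prior, expected_stats)
    ]


if __name__ == "__main__":
    eta = [1.0, 1.0]
    for t in range(3):
        rho = step_size(t)
        eta = svi_global_step(eta, [0.5, 0.5], [0.0, 2.0], scale=10, rho=rho)
        print(t, rho, eta)
```

With τ = 1 the first step size equals 1, so the first iteration overwrites the random initialization with the scaled single-sequence estimate; a larger τ damps these early steps.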

3.4.4 Minibatch stochastic variational inference

Minibatch SVI is a compromise between the full-batch and the stochastic versions of
variational inference. As many stochastic approximation methods benefit from
mini-batched data feeding, we also adopt it to mitigate the instability of SVI
performance. Specifically, minibatch SVI performs the global and local updates on a
sampled set of sequences $M_t = \{s_i\}_{i=0}^{B}$, with equal batch size $B = |M_t|$ for each
iteration.
For minibatch SVI, the local updates and the computation of the expected sufficient statistics
remain largely the same. The major difference is that only the observations and local variables
of the sequences in $M_t$ are considered. Then, in the global update stage, minibatch SVI
scales the minibatch data by the multiplying factor $c = N/B$ to imitate the full dataset. The
updates of the variational parameters of the global transition matrix and the topic components
in minibatch SVI, similar to Eqs. 12 and 13, are given by

$$\eta_{\hat{\alpha}_k}^{(t+1)} = (1 - \rho_t)\, \eta_{\hat{\alpha}_k}^{(t)} + \rho_t \left( \eta_{\alpha_k}^{(t)} + c \sum_{s \in M_t} E_{q(z^s)}\left[t_{\eta_{\alpha_k}}(z^s)\right] \right)$$

$$\eta_{\hat{\beta}_k}^{(t+1)} = (1 - \rho_t)\, \eta_{\hat{\beta}_k}^{(t)} + \rho_t \left( \eta_{\beta_k}^{(t)} + c \sum_{s \in M_t} E_{q(z^s)}\left[t_{\eta_{\beta_k}}(y^s, z^s)\right] \right).$$

The corresponding sufficient statistics computation for the SVI and minibatch SVI global
updates can be derived straightforwardly following the batch variational inference section.
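
The minibatch variant can be sketched in the same spirit. In this hypothetical example the per-sequence expected sufficient statistics are assumed to be precomputed and stored in `stats_per_sequence`; in an actual run they would be produced by the local step for the sampled sequences only. The sketch shows the scaling factor c = N/B and the sum over the sampled set.

```python
import random


def minibatch_global_step(eta_hat, eta_prior, stats_per_sequence, batch_size, rho, rng):
    """One minibatch SVI global step on a flat parameter vector (illustrative helper).

    stats_per_sequence[s][d] is the d-th expected sufficient statistic of sequence s.
    Returns the updated parameters and the sampled sequence indices.
    """
    n_sequences = len(stats_per_sequence)
    batch = rng.sample(range(n_sequences), batch_size)   # sample without replacement
    c = n_sequences / batch_size                          # scaling factor c = N / B
    summed = [
        sum(stats_per_sequence[s][d] for s in batch)
        for d in range(len(eta_hat))
    ]
    new_eta = [
        (1.0 - rho) * current + rho * (prior + c * total)
        for current, prior, total in zip(eta_hat, eta_prior, summed)
    ]
    return new_eta, batch


if __name__ == "__main__":
    stats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    eta, used = minibatch_global_step([1.0, 1.0], [0.5, 0.5], stats, 2, 0.3, random.Random(0))
    print(used, eta)
```

Setting the batch size to N and the step size to 1 recovers the batch updates of Eqs. 9 and 10, which gives a quick sanity check of the scaling.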

3.4.5 Computational complexity analysis

The main computational overhead of the above three variational inference approaches is
the local step, i.e., the Forward-Backward message passing of Section 3.4.2, at each
iteration. For simplicity of analysis, we assume all sequences have the same average
length M, and that there are N sequences in total in the whole dataset. Feeding a sequence to
the Forward-Backward algorithm takes $O(M \times K^2)$ time to obtain the optimal local
variational factors. Thus, batch variational inference scans the whole dataset in
$O(N \times M \times K^2)$ time in each local step, whereas SVI and minibatch SVI take only $O(M \times K^2)$
and $O(B \times M \times K^2)$, with $B \ll N$, respectively. Although this is only a linear-factor
difference, it should not be overlooked in the big data age. In the global steps, the three
algorithms spend the same time, because SVI and minibatch SVI use their expansion factors N
and c to imitate the scale of the entire dataset.

3.5 Prediction

By harvesting the topical and transitional human mobility semantics, HMM-GausEM and
HMM-MultEM are able to handle the future stay-point prediction demanded by pervasive
location-based services for geographical user targeting and advertisement dissemination,
and are promising for recognizing and partitioning urban functional zones in governmental
city planning. Most of these geospatial functionality requirements are covered by the two
problems identified in Section 3.2, and can be resolved through the proposed hidden Markov
modelling framework with its two kinds of emissions.
In particular, as defined in Problem 1. Semantic Venue Labeling, we intend to find the
most probable topical transition sequence z that is in line with the user's intent in
producing the itinerary y, so as to assign a label to each venue. In conformity with the HMM
modelling framework, the labeling problem is formulated as

$$z^{*} = \arg\max_{z}\, p(y, z \mid \Phi; \hat{\Lambda}). \qquad (15)$$

The Viterbi algorithm [20] is suited to this task. Note that Eq. 15 depends explicitly on the
variational parameter $\hat{\Lambda}$, which implies that the probability of the local sequence
pair (y, z) is calculated with the optimal global variational posterior factor $q(\Phi)$ in the
proposed variational inference context.
As for Problem 2. Successive Place Prediction, given a user's most recent trajectory
$y := \langle y_0, y_1, \cdots, y_n \rangle$, and by virtue of the inferred transition regularities $A$ and the
labeled trace $z := \langle z_0, z_1, \cdots, z_n \rangle$, we aim to predict the next place $y_{n+1}$. The
corresponding predictive probability takes the form

$$p(y_{n+1} \mid y) = \frac{1}{p(y)} \sum_{z_{n+1}} p(y_{n+1} \mid z_{n+1}) \sum_{z_n} p(z_{n+1} \mid z_n)\, p(z_n, y). \qquad (16)$$

We can likewise compute this prediction through the Forward-Backward message propagation
algorithm of Section 3.4.2 with the inferred posterior $q(\Phi)$.
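
Both prediction tasks can be sketched compactly for the Multinomial emission case. The code below is an illustration under stated assumptions, not the authors' procedure: it uses point estimates of the initial, transition and emission probabilities (for instance posterior means, which is an assumption here), takes the final forward message as the unnormalized p(z_n, y), and all function names are hypothetical.

```python
import math


def _log(x):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(x) if x > 0 else float("-inf")


def viterbi(pi, A, L):
    """Most probable topic sequence; L[t][k] is the emission probability of y_t under topic k."""
    T, K = len(L), len(pi)
    score = [_log(pi[k]) + _log(L[0][k]) for k in range(K)]
    back = []
    for t in range(1, T):
        prev, score, ptr = score, [], []
        for j in range(K):
            best = max(range(K), key=lambda i: prev[i] + _log(A[i][j]))
            ptr.append(best)
            score.append(prev[best] + _log(A[best][j]) + _log(L[t][j]))
        back.append(ptr)
    path = [max(range(K), key=lambda k: score[k])]
    for ptr in reversed(back):                 # follow back-pointers from the last step
        path.append(ptr[path[-1]])
    return path[::-1]


def next_place_distribution(filtered, A, omega):
    """Eq. 16 for Multinomial emissions; filtered[k] is proportional to p(z_n = k, y)."""
    K, V = len(A), len(omega[0])
    total = sum(filtered)                      # plays the role of p(y)
    p_zn = [f / total for f in filtered]
    p_next_topic = [sum(p_zn[k] * A[k][j] for k in range(K)) for j in range(K)]
    return [sum(p_next_topic[j] * omega[j][v] for j in range(K)) for v in range(V)]


def top_n(scores, n):
    """Indices of the n highest-scoring venues, best first."""
    return sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)[:n]


if __name__ == "__main__":
    print(viterbi([0.9, 0.1], [[0.8, 0.2], [0.2, 0.8]], [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]))
    dist = next_place_distribution([2.0, 0.0], [[0.0, 1.0], [1.0, 0.0]],
                                   [[1.0, 0.0, 0.0], [0.0, 0.25, 0.75]])
    print(dist, top_n(dist, 2))
```

The ranked list returned by `top_n` is what a position-sensitive metric would then score.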

4 Experiments

This section describes the evaluation process and results of the proposed method
for analyzing the geospatial and topical human mobility semantics. The datasets, from
Twitter, Foursquare, and Gowalla, are first analyzed in Section 4.1. Three aspects
of experimental study are then conducted. Firstly, we exhibit the informativeness,
expressiveness and intuitiveness of our methods for human mobility semantics, by rendering
the topical transition patterns underlying the user-generated, geo-tagged Tweets dataset in
Section 4.2. Then, we directly and experimentally answer the two identified problems, i.e.,
illustrating the convincing venue labeling performance in Section 4.3 and the strongly
explanatory venue prediction capability in Section 4.4. Last but not least, Section 4.5
presents the scalability performance of the promising stochastic variational inference and
its minibatch extension, compared with batch variational inference and other state-of-the-art
inference approaches on the much larger Gowalla dataset.

Table 1 Summary of the datasets' basic information

Dataset   ID/Geo    Category   GeoRange   #User    #Venue    #Checkin   #Trajectory
TWT-NY    Yes/Yes   No         City       1243     2648      208007     22612
FQ-TKY    Yes/Yes   Yes        City       2293     7870      447076     61971
GOWLA     Yes/Yes   No         Global     107092   1280969   6442892    1295364

4.1 Experimental dataset

Three datasets are used to validate our models: the Foursquare Tokyo dataset (Abbr.
FQ-TKY), the Gowalla dataset provided by the Stanford SNAP project [17] (Abbr. GOWLA),
and the Twitter dataset contributed by [16]. The original Twitter dataset is composed of 22
million check-ins by over 220 thousand users. We only extract the check-ins that occurred in
New York City, which involve a total of 840,046 check-in samples, with 12,623 users and over
50,000 venues (Abbr. TWT-NY). This data subset, without loss of generality, should
be sufficient to cover most of the geospatial transition semantic patterns in an urban area.
All three datasets provide both venue-Id and geo-coordinate data formats. Except for
FQ-TKY, categorical information is not originally available in our datasets. In order to
demonstrate the expressive power of the mobility semantics, we need some
ground-truth semantic labeling information. The Foursquare category hierarchy¹ is a predefined
taxonomy for venue categorization. Despite its arbitrariness and subjectivity, it is
reasonable to use this handcrafted categorical information as the
alignment reference for evaluating mobility semantics quality. By invoking the Foursquare
API and mapping each tweet's geo-tag onto a unique Foursquare venue entity, we extract the
corresponding categorical labels to supplement the TWT-NY dataset. Finally, we
reorganize each user's check-in coordinates into consecutive venue sequences according to
the timestamps, with a fixed idle time-span threshold (24 hours). Table 1 gives a summary of
the datasets.
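
The trajectory construction step can be sketched as follows. The exact segmentation rule is not spelled out in the text, so this example assumes that a new trajectory starts whenever the gap between two consecutive check-ins of the same user exceeds the 24-hour threshold; the function name and the `(timestamp, venue_id)` record layout are likewise hypothetical.

```python
from datetime import datetime, timedelta


def split_into_trajectories(checkins, max_gap=timedelta(hours=24)):
    """Split one user's check-ins into venue sequences at idle gaps longer than max_gap.

    checkins: iterable of (timestamp, venue_id) pairs, in any order.
    """
    trajectories, current, last_time = [], [], None
    for timestamp, venue in sorted(checkins, key=lambda record: record[0]):
        if last_time is not None and timestamp - last_time > max_gap:
            trajectories.append(current)       # close the trajectory at a long idle gap
            current = []
        current.append(venue)
        last_time = timestamp
    if current:
        trajectories.append(current)
    return trajectories


if __name__ == "__main__":
    day = datetime(2012, 4, 1, 9, 0)
    records = [
        (day, "cafe"),
        (day + timedelta(hours=3), "office"),
        (day + timedelta(hours=40), "station"),
        (day + timedelta(hours=41), "mall"),
    ]
    print(split_into_trajectories(records))
```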
As shown in Fig. 3, we conduct a preliminary analysis of the TWT-NY and FQ-TKY
datasets. The histograms of venue categories, shown in Fig. 3a and b, illustrate that the
check-in frequency of location-based services varies considerably not only across categories
but also across communities or cities. We can clearly observe that check-ins occur
significantly more often in places categorized as food, transport&travel, and shop&services,
and that New Yorkers' check-in categories are more diverse. The check-in habits of Tokyo
Foursquare users and New York Twitter users differ, which may also be caused by the
functional discrepancy between the two communities. Through basic statistical analysis, the
overall trajectory length distributions of both datasets exhibit a prominent and consistent
long tail effect. The empirically estimated power law distributions of trajectory length for
the TWT-NY and FQ-TKY datasets are shown in Fig. 3c and d, respectively. They show that most
of the trajectories (about 80–90%) are shorter than 10 check-ins. The proportion
of medium-length trajectories in the TWT-NY dataset is higher than in FQ-TKY (see the
decaying factors αTWT-NY = 2.81 ≤ αFQ-TKY = 3.50), which may result from successive
tweet-reply behaviors. We also study the data from the perspective of individual users, and
find in Fig. 3e and f that the overall trajectory quality and quantity of TWT-NY are weaker
than those of FQ-TKY. The averaged per-user trajectory lengths of the TWT-NY dataset are
short, and the numbers of trajectories per user are small as well.

1 https://developer.foursquare.com/categorytree

Fig. 3 Preliminary analysis of TWT-NY & FQ-TKY datasets. (a), (b) Check-in frequency per venue category; (c), (d) distributions of trajectory length (X) on log-log axes, with decaying factors 2.81 and 3.50; (e) averaged trajectory length per user, with users indexed in descending order of averaged trajectory length; (f) #Trajectory per user, with users indexed in descending order of #Trajectory

4.2 Probabilistic human mobility semantics analysis

In this section, we schematically demonstrate the capacity of our proposed method for
recognizing the mobility semantics in two aspects, the geospatial topics and the topical
transitions. To this end, we first set our model's latent topic number to 10, the same as the
number of collected categories. We then employ our HMM-MultEM model to identify the latent
topic assignments for the trajectories and simultaneously infer the transition probabilities
among topics. Finally, by maximizing the overall match between the topic assignment
sequences and the corresponding Foursquare categorical label sequences with the
Kuhn-Munkres algorithm [44], we align and assign the 10 predefined categories to the latent
topic indexes.
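
The alignment step can be illustrated with a small sketch. To stay within the Python standard library, it replaces the Kuhn-Munkres algorithm with an exhaustive search over permutations, which finds the same optimal one-to-one assignment but is only practical for a handful of topics (10 topics already mean about 3.6 million permutations); a dedicated assignment solver would be the natural choice in practice. All function names are hypothetical.

```python
from itertools import permutations


def confusion_counts(pred_seqs, true_seqs, K):
    """counts[k][c] = number of positions labeled with topic k whose true category is c."""
    counts = [[0] * K for _ in range(K)]
    for pred, true in zip(pred_seqs, true_seqs):
        for topic, category in zip(pred, true):
            counts[topic][category] += 1
    return counts


def best_alignment(counts):
    """One-to-one topic-to-category mapping with maximum total match (brute force)."""
    K = len(counts)
    return max(
        permutations(range(K)),
        key=lambda perm: sum(counts[k][perm[k]] for k in range(K)),
    )


def aligned_mismatch_rate(pred_seqs, true_seqs, mapping):
    """Fraction of positions whose mapped topic differs from the true category."""
    mismatches = total = 0
    for pred, true in zip(pred_seqs, true_seqs):
        for topic, category in zip(pred, true):
            mismatches += mapping[topic] != category
            total += 1
    return mismatches / total


if __name__ == "__main__":
    predicted = [[0, 0, 1], [2, 1, 1]]
    reference = [[1, 1, 2], [0, 2, 0]]
    mapping = best_alignment(confusion_counts(predicted, reference, 3))
    print(mapping, aligned_mismatch_rate(predicted, reference, mapping))
```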
Four geospatial and topical transition semantics matrices for the TWT-NY dataset are
rendered in Fig. 4, i.e., the overall mobility semantics of the New York Twitter crowd in
Fig. 4a, and three other representative samples of personal mobility semantics. These also
mirror the users' Twitter usage habits. Each matrix row is a Multinomial probability
distribution for transiting out of the row topic. The darker the gray-scale matrix cell, the
higher the corresponding transition probability.

Fig. 4 Semantics analysis of New York Twitter users' mobility. Panels (a)–(d) are topical transition matrices whose rows and columns are the ten topics Arts, University, Food, Nightlife, Outdoors, Profession, Residence, Shop&Serv, Transport, and Hotel
It can be clearly observed from the global semantics in Fig. 4a that the self-transition
probabilities on the diagonal are generally higher than the off-diagonal probabilities. This
phenomenon conforms to reality, because two adjacent locations are usually generated
under the same user intention (i.e., geospatial topic), and multiple consecutive venues
of the same topic together serve the user's mobility purpose. Taking the "Shop&Serv"
topic as an example, people regularly go to several stores to compare and purchase
commodities. Moreover, the "Transport" topic also makes sense, in that people usually
transfer at several midway stops to reach their ultimate destinations. It is also noticeable
that the "University" and "Hotel" topics have relatively low in-degree probabilities, but
their out-degree probabilities are diverse.
The other matrices account for three selected individual mobility semantics.
Figure 4b shows a user with diverse check-in habits, except that he may stay in hotels
day to day, seldom goes to residential areas, and never goes to universities. A probable
commuter, who has no time for nightlife and follows the "Profession → Transport" mobility
regularity with probability of almost 1, is rendered in Fig. 4c. He apparently lives in a
residential zone rather than a hotel, and does not go to universities either. Figure 4d
presents a student's mobility semantics: this user seldom engages in arts and nightlife
related activities, seems never to have been to a hotel, and likes outdoor and sport
activities.
All in all, based on the above observations, the proposed method is able to reveal
human mobility semantics. Whether the global or the personal mobility semantics is more
useful for mobility analysis, especially for future check-in venue prediction, depends on the
actual context, such as the data size or its manifestation.

4.3 Semantic venue labeling

As an important aspect of evaluating the expressive power of HMM modelling in human
mobility applications, Problem 1. Semantic Venue Labeling, mentioned in Section 3.2, is
experimentally evaluated in the following.
The TWT-NY and FQ-TKY datasets are used in this subsection to verify the semantic
labeling or annotation accuracy, with the Foursquare category information described
in Section 4.1 as the ground truth. For both datasets, we use the user-specific view of the
data, $S = \{S^u\}_{u=1}^{U}$, and split each user's sequence set $S^u$ into a training set $S^u_{train}$
and a testing set $S^u_{test}$, with test ratios varying over 10%, 20%, · · · , 50%. In addition,
the HMMs are personalized, i.e., the models are trained user by user to learn user-specific
topical intention transition matrices. Finally, the overall labeling performance is
summarized through the averaged Hamming Error metric,

$$HM_{S_{test}} = \frac{1}{|S_{test}|} \sum_{S^u_{test} \subset S_{test}} \; \sum_{s \in S^u_{test}} HM(\tilde{z}^{us}, z^{us})$$

where $S_{test} = \cup_{u \in U} S^u_{test}$, $\tilde{z}^{us}$ is the label sequence predicted for $z^{us}$ by the personalized
HMMs, and HM(·, ·) is the Hamming Distance or Error between two corresponding sequences.
The maximum match of all the corresponding $\tilde{z}^{us}$ and $z^{us}$, where $s \in S^u_{test}$, is again found by the
Kuhn-Munkres algorithm [44] and used as the alignment for labeling validation. We
consider the following baselines,

– Linear Chain CRF (Abbr. LC-CRF): Conditional Random Fields have been used in
activity recognition [4, 58] to handle temporal classification problems. A CRF is a
discriminative probabilistic graphical model that retains the chain structure of sequences
through undirected edge dependencies.
– Multi-Class SVM (Abbr. MC-SVM): The multi-class support vector machine [18] is
an extension of the binary SVM [63]. It is borrowed off the shelf to address venue
annotation in LBS as a multi-class classification problem.
– Geo Topic Model (Abbr. GeoTM): The geo topic model [37] jointly estimates both
the user's topical interests and past activities. Specifically, for each user, it adjusts the
geospatial topic distributions with distance constraints from his historical activity zone.
Here, we employ it to topically label the check-in sequence by $p(z_i \mid y_i, y_{i-1}, u) \propto
p(y_i \mid y_{i-1}, z_i, u)\, p(z_i \mid y_{i-1}, u)\, p(y_{i-1}, u)$.
Since the first two of the above models are discriminative, we extract the location-specific
explicit pattern (EP) features as in [63] to train them. For both the FQ-TKY and TWT-NY
datasets, we again set the topic number hyperparameter to K = 10 for GeoTM, HMM-MultEM,
and HMM-GausEM.
The average Hamming Errors of semantic annotation on the FQ-TKY dataset
are presented in Table 2a and Fig. 5a. GeoTM and HMM-GausEM clearly perform worst.
Investigating the experiments, we find that the geo-coordinate data format is what hurts
HMM-GausEM's performance. In the raw data, a venue identifier is usually associated with
several geo-coordinates, and nearby geo-coordinates in the densely populated downtown
area often relate to several different venues. This makes it difficult to fit the variance
parameters of the Gaussian topic components. Therefore, the venue-Id data format is a more
appropriate venue modelling granularity than geo-coordinates, and the Multinomial emission
probability is better suited to modelling geospatial topic components. GeoTM adopts maximum
likelihood estimation via the EM algorithm without closed-form updates, which requires
gradient-based numerical optimization under awkward probability simplex constraints.
Therefore, GeoTM inherently struggles to reach a stable solution. Our experiments confirm
this point, and the relevant results on both the FQ-TKY and TWT-NY datasets, shown in Fig. 5,
are barely satisfactory.

Table 2 Average Hamming errors of venue annotation

(a) Hamming errors on the FQ-TKY dataset

Ratio   HMM-MultEM   LC-CRF     MC-SVM     HMM-GausEM   GeoTM
10%     0.353981     0.327505   0.330647   0.478169     0.484208
20%     0.375599     0.351196   0.357189   0.472164     0.607541
30%     0.365097     0.355465   0.366843   0.457011     0.606203
40%     0.373813     0.374405   0.380434   0.464455     0.497900
50%     0.376940     0.386691   0.407947   0.465847     0.568998

(b) Hamming errors on the TWT-NY dataset

Ratio   HMM-MultEM   LC-CRF     MC-SVM     HMM-GausEM   GeoTM
10%     0.530534     0.544994   0.502796   0.584732     0.671774
20%     0.539046     0.554969   0.502917   0.580766     0.689817
30%     0.544320     0.564962   0.516147   0.590344     0.635453
40%     0.548274     0.572285   0.540062   0.602257     0.695088
50%     0.544400     0.593444   0.539610   0.598185     0.648147
Fig. 5 Semantic venue labeling performance. Panels (a) and (b) plot the Hamming Error against the test data splitting ratio (0.1 to 0.5) for HMM-MultEM, LC-CRF, MC-SVM, HMM-GausEM, and GeoTM

The performances of HMM-MultEM, LC-CRF, and MC-SVM differ only slightly.
Furthermore, the performances of the generative and discriminative
models are influenced by the test data splitting ratio very differently. The generative models
HMM-MultEM and HMM-GausEM are insensitive to the ratio; at least, their labeling
performance deviations show no notable ascending or descending trend. For
the discriminative models LC-CRF and MC-SVM, however, there is a significant tendency for
the performance to become worse as the ratio increases. Because LC-CRF retains the
temporal dependency information of the trajectory, it performs better than the unstructured
MC-SVM model. When the splitting ratio is small, the discriminative models dominate the
generative ones. But when the ratio is larger than 30%, HMM-MultEM outperforms the
discriminative ones, which shows the advantage of generative models when observed data are
sparse.
Table 2b and Fig. 5b give the venue labeling results on the smaller TWT-NY dataset, where
the overall performance is somewhat worse than on FQ-TKY. The impact of the test data
splitting ratio is again insignificant for the generative models but remarkable for the
discriminative models, as on the FQ-TKY dataset. An unexpected and interesting finding
is that the MC-SVM model outperforms the others to varying degrees even when the splitting
ratio approaches 0.5. This may be because the total number of venues in the TWT-NY dataset
is smaller than in FQ-TKY, so that the relatively small amount of observed data is just
sufficient to fit an unstructured multi-class classifier with lower capacity. In other words,
the higher-complexity trajectory models with the chain-structure modelling assumption may be
under-fitting when there are fewer venues.

4.4 Successive place prediction

Experiments on Problem 2. Successive Place Prediction defined in Section 3.2, focused on
next-place prediction here, are carried out based on the aforementioned predictive
probability Eq. 16 in Section 3.5 for HMM-MultEM and HMM-GausEM. The datasets described in
Section 4.1 are used again, and we hold out the last venue of each trajectory for prediction.
For each sequence $s \in S$, the first $N_s - 1$ successive locations $y := \langle y_1, \cdots, y_{N_s-1} \rangle$
are used as the trajectory for training the models, and the last location $y_{N_s}$ is masked for
prediction evaluation. The candidate sets consist of all the observed venues in the
corresponding datasets.

Due to the large number of potential location candidates, and the stringent demand
in location recommendation systems that the right venue be recommended as near the top
positions as possible, we design two position-sensitive evaluation metrics, averaged
Precision@n and Recall@n, which are similar to the metrics used in [47, 49], to assess
the next-place prediction performance,

$$\mathrm{Precision@}n = \frac{1}{|S|} \sum_{s \in S} \frac{1}{n} \sum_{i=1}^{n} 1(y_{N_s}, r_s, i) \qquad (17)$$

$$\mathrm{Recall@}n = \frac{1}{|S|} \sum_{s \in S} 1(y_{N_s}, r_s, n) \qquad (18)$$

where $1(y_{N_s}, r_s, n)$ is the indicator of whether the right venue $y_{N_s}$ emerges in the nth
position in the venue candidates ranking list $r_s$ for the trajectory $s$.
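
A small sketch of the two metrics follows. It rests on an interpretation that should be read as an assumption: in Eq. 17 the inner sum is taken over exact positions 1 to n, so it equals 1 when the right venue appears anywhere in the top n, and Recall@n is taken to count a hit anywhere in the top n, which matches the cumulative behavior expected of a recall curve. The function names are hypothetical.

```python
def recall_at_n(targets, rankings, n):
    """Fraction of trajectories whose held-out venue appears within the top n candidates."""
    hits = sum(1 for target, ranking in zip(targets, rankings) if target in ranking[:n])
    return hits / len(targets)


def precision_at_n(targets, rankings, n):
    """Averaged Precision@n: one relevant venue per trajectory, so precision = hit / n."""
    return recall_at_n(targets, rankings, n) / n


if __name__ == "__main__":
    held_out = ["a", "b", "c"]
    ranked = [["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"]]
    for n in (1, 2, 3):
        print(n, recall_at_n(held_out, ranked, n), precision_at_n(held_out, ranked, n))
```

Because each trajectory has exactly one held-out venue, Precision@n under this reading is Recall@n divided by n, so it decays quickly as n grows while Recall@n can only increase.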
Moreover, in order to manifest the capability of our approach on recommending or
predicting next locations, we compare it with the following sequential location recommen-
dation baselines.
– Popularity (Abbr. PopRec): The popularity based recommendation method assumes
that the popular places will be more likely to be visited, thus it utilizes the frequency
of location occurrence for predicting, i.e. p(yNs |y) ∝ Count(yNs ). This approach is
effective and easy to realize in practice.
– Markov Model (Abbr. MM): Markov Model harvests the previous most likely explicit
transition patterns for future potential transition prediction [15, 22]. But the higher
order Markov model seem to be thankless job in this context, which is pointed out by
[45]. Thus the candidate locations are ranked based on the 1st-order Markov transition
probability, i.e. p(yNs |y) ∝ Count(yNs −1 → yNs ).
– Additive Markov Chain (Abbr. AMC): To exploit sequential influence for location recommendation, [71] weights the additive contribution of all previous locations to the sequential prediction, which seems helpful and prominent. Therefore, we sequentially predict the next place with probability

$$p(y_{N_s} | y) = \frac{\sum_{i=1}^{N_s - 1} W(y_i) \cdot TP(y_i \to y_{N_s})}{\sum_{i=1}^{n} W(y_i)},$$

where $TP(y_i \to y_{N_s})$ is the transition probability and $W(y_i) = 2^{-\alpha (n - i)}$ is the weight decay factor; the larger $\alpha$ is, the higher the decay rate.
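The three baselines can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names and toy trajectories are hypothetical, the transition probability TP is assumed to be a row-normalised transition count, and n is assumed to equal the number of previously visited venues (N_s − 1).

```python
from collections import Counter, defaultdict


def fit_counts(trajectories):
    """Count venue occurrences and first-order transitions prev -> next."""
    popularity = Counter()
    transitions = defaultdict(Counter)
    for traj in trajectories:
        popularity.update(traj)
        for prev, nxt in zip(traj, traj[1:]):
            transitions[prev][nxt] += 1
    return popularity, transitions


def transition_prob(transitions, prev, nxt):
    """TP(prev -> nxt), assumed here to be a row-normalised count (0 if unseen)."""
    row_total = sum(transitions[prev].values()) if prev in transitions else 0
    return transitions[prev][nxt] / row_total if row_total else 0.0


def poprec_scores(popularity):
    """PopRec: score(v) proportional to Count(v)."""
    return dict(popularity)


def mm_scores(transitions, history, candidates):
    """MM: score(v) proportional to Count(last visited venue -> v)."""
    last = history[-1]
    return {v: transitions[last][v] if last in transitions else 0 for v in candidates}


def amc_scores(transitions, history, candidates, alpha=1.0):
    """AMC: decayed, normalised sum of transition probabilities from all past venues."""
    n = len(history)  # assumption: n = N_s - 1, the number of past venues
    weights = [2.0 ** (-alpha * (n - i)) for i in range(1, n + 1)]
    norm = sum(weights)
    return {
        v: sum(w * transition_prob(transitions, y, v) for w, y in zip(weights, history)) / norm
        for v in candidates
    }


def rank(scores):
    """Sort candidates by descending score (ties broken by name for determinism)."""
    return sorted(scores, key=lambda v: (-scores[v], v))


# Toy training trajectories (hypothetical venue ids).
train = [
    ["home", "cafe", "office"],
    ["home", "office", "gym"],
    ["cafe", "office", "gym"],
    ["home", "cafe", "office"],
]
popularity, transitions = fit_counts(train)
candidates = sorted(popularity)
history = ["home", "cafe"]

print(rank(poprec_scores(popularity)))
print(rank(mm_scores(transitions, history, candidates)))
print(rank(amc_scores(transitions, history, candidates, alpha=1.0)))
```

Fitting `fit_counts` on one user's trajectories would give a personal variant, and fitting it on all users' trajectories a global one.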
In addition, we consider the discrepancy between personal and crowd-level (global) mobility semantics discussed in Section 4.2, and add the prefixes “P-” and “G-” to the algorithm abbreviations accordingly to distinguish the variants compared in the next-place prediction experiments on the two datasets. Except for HMM-GausEM, which uses geo-coordinate data in both training and testing, all algorithms are fed venue-Id data. Note that the candidate set of geo-coordinates is much larger than that of venue Ids, which makes it difficult for HMM-GausEM to recall the right venues.
We first tune the topic number for HMM-MultEM and HMM-GausEM on the two datasets (FQ-TKY: K = 60 and K = 80; TWT-NY: K = 50 and K = 60, respectively). In terms of the accuracy metrics specified in Eqs. 17 and 18, the next-place prediction performance compared with the baseline models is shown in Fig. 6. Most of the findings are sensible and convincing; a few are somewhat unexpected, yet still explicable.
The first finding is that the personalized popularity-based recommendation, “P-PopRec” (the circle-marked solid blue curve in Fig. 6a and c), is a simple but sharp method for

[Fig. 6 Successive place prediction performance. Panels (a) and (c): Averaged Recall@n versus ranking positions (0 to 100); panels (b) and (d): Averaged Precision@n versus ranking positions (0 to 30). Curves in each panel: G-PopRec, G-MM, G-AMC, G-HMM-GausEM, G-HMM-MultEM, P-PopRec, P-MM, P-AMC, P-HMM-GausEM, P-HMM-MultEM.]

the next venue prediction. P-PopRec is even better than the models that consider sequential transition semantics in some cases. In contrast, G-PopRec performs poorly. This makes sense because users’ radius of gyration is usually limited; consequently, their next venues are usually places they have frequently visited before. Global popularity, by contrast, does not help.
The second finding is that HMM-GausEM performs worse in both the personal and the global transition semantics cases on both the FQ-TKY and TWT-NY datasets. This is again caused by the data modeling resolution. The other probabilistic sequential model, HMM-MultEM, together with MM and AMC, is a promising model for venue prediction; moreover, its performance depends significantly on whether personal or global consecutive transition mobility semantics are used. With excellent overall performance, P-HMM-MultEM is the best model on the FQ-TKY dataset and the runner-up on TWT-NY, second only to G-AMC, as shown by the purple circle-marked curve in Fig. 6a and c. This success is attributed to the personal hidden intention assumption. The global hidden state model, G-HMM-MultEM, however, cannot gain this benefit.
Thirdly, the models without the hidden state assumption, AMC and MM, directly harvest the explicit inter-venue transition frequencies for next venue recommendation. As indicated by the red curves in Fig. 6 for the FQ-TKY dataset, the two AMC models show similar performance, and P-AMC recalls the venues slightly earlier than G-AMC at the top positions.
The MM models perform similarly to each other and slightly worse than the AMC models, and all of them are inferior to P-HMM-MultEM.
For the AMC and MM models on the TWT-NY dataset in Fig. 6c, G-AMC and G-MM outperform P-AMC and P-MM, respectively. There are two reasons. The first is the small volume of TWT-NY together with the observations from Fig. 3e and f, i.e., the individual trajectory samples are too sparse to carry enough mobility regularity for high-quality prediction, so the crowd’s global mobility semantics remedy this weakness. The other is the functional difference between the Twitter and Foursquare communities: Twitter features consecutive tweet-reply interaction behaviors, and AMC is intrinsically good at modeling higher-order Markov dependency. The performance of P-HMM-MultEM lies between G-AMC and G-MM, and is far better than P-AMC and P-MM. This fact confirms once again the superiority of hidden state based generative HMMs for sparse mobility data. We anticipate that high-order HMMs will be even more promising.
Lastly, an important supplement is the precision at the top positions in the location prediction problem. On the FQ-TKY dataset, which is better suited to geospatial mobility semantics analysis, P-HMM-MultEM, P-MM, and P-AMC all perform well (see Fig. 6b), and the P-PopRec model, which does not consider transitions, also works. G-AMC is the outstanding winner on the TWT-NY dataset in Fig. 6d.

4.5 Evaluating the scalability

Here, we first compare variational inference with traditional inference for HMMs, and then evaluate the performance of the stochastic approximation approaches, i.e., SVI and minibatch SVI. In particular, we focus on time consumption, and illustrate the trade-offs in choosing the learning rate and decay coefficient, as well as the impact of the minibatch size. The experimental data come from the widely used Gowalla dataset (abbreviated GWOLA), which involves 6.4 million check-ins in total. We reorganize them into more than 1 million trajectories (with a 1:9 split), according to the users and the check-in time intervals. As the performance metric, we adopt the hold-out likelihood


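$$p(y_{test} | y_{train}) = \int p(y_{test} | \pi_0, A, \Phi)\, p(\pi_0, A, \Phi | y_{train})\, d\pi_0\, dA\, d\Phi,$$

where $\pi_0$, $A$, and $\Phi$ denote the initial state distribution, the transition matrix, and the emission parameters of the HMM. This is also known as the predictive likelihood, which is widely used for measuring the performance of probabilistic models.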
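The integral above is generally intractable. One common way to approximate it, sketched below, is Monte Carlo averaging over parameter samples drawn from the (approximate) posterior, with each inner likelihood computed by the forward algorithm. This is an illustrative sketch for the discrete (multinomial) emission case and not necessarily the estimator used in the experiments; the function names and toy parameters are hypothetical, and `B` stands for the emission matrix.

```python
import math


def log_likelihood(seq, pi0, A, B):
    """Forward algorithm with per-step scaling: log p(seq | pi0, A, B).

    pi0[k]: initial state probabilities, A[j][k]: transition probabilities,
    B[k][v]: emission probabilities over discrete venue ids v.
    """
    K = len(pi0)
    alpha = [pi0[k] * B[k][seq[0]] for k in range(K)]
    log_lik = 0.0
    for obs in seq[1:]:
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [
            sum(alpha[j] / scale * A[j][k] for j in range(K)) * B[k][obs]
            for k in range(K)
        ]
    return log_lik + math.log(sum(alpha))


def holdout_log_likelihood(test_seqs, posterior_samples):
    """Monte Carlo estimate of log p(y_test | y_train).

    posterior_samples: list of (pi0, A, B) tuples, assumed to be draws from
    p(pi0, A, B | y_train) or from a variational approximation to it.
    """
    per_sample = [
        sum(log_likelihood(seq, pi0, A, B) for seq in test_seqs)
        for pi0, A, B in posterior_samples
    ]
    # Log of the average likelihood, computed stably (log-sum-exp trick).
    peak = max(per_sample)
    mean_ratio = sum(math.exp(v - peak) for v in per_sample) / len(per_sample)
    return peak + math.log(mean_ratio)


# Two hypothetical posterior draws for K = 2 hidden topics and 3 venue ids.
sample_a = ([0.6, 0.4], [[0.7, 0.3], [0.2, 0.8]], [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
sample_b = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[1 / 3] * 3, [1 / 3] * 3])
test_seqs = [[0, 1, 2], [2, 2, 0, 1]]
print(holdout_log_likelihood(test_seqs, [sample_a, sample_b]))
```

The per-step scaling in `log_likelihood` keeps the forward variables from underflowing on long trajectories, which matters when sequence lengths vary widely.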

4.5.1 Stochastic variational inference brings about scalability

As a prominent focus of this paper, we highlight time efficiency and scalability. In Fig. 7a, we first compare batch structured variational inference (BVI) with two other state-of-the-art inference methods for HMMs, i.e., Expectation Maximization (EM) and Gibbs sampling. At the initial stage, BVI clearly outperforms Gibbs sampling and EM, and the optimization-oriented BVI and EM are better than the simulation-based Gibbs sampling. At the very late stage of our experiment, Gibbs sampling achieves slightly better predictive power than the other two. Furthermore, from the posterior variance perspective, BVI reduces the posterior variance best, EM has the worst variance, and Gibbs sampling lies in between. The reason is that EM

[Fig. 7 Scalability performance of variational inference approaches. All panels plot hold-out log-likelihood against time span (unit: seconds). (a) BVI versus EM and Gibbs sampling; (b) BVI versus SVI with Topic = 20 and Topic = 30, with the SVI time span on the top axis and the BVI time span on the bottom axis; (c) SVI with τ = 0.2 and κ = 0.6, 0.7, 0.8, 0.9; (d) MBVI with Topic = 20 and MB = 10 or MB = 20 versus SVI with Topic = 20.]

parameterizes the model directly and takes the optimal parameters under the maximum likelihood principle, which is prone to getting trapped in local optima. From the full Bayesian viewpoint, in contrast, BVI imposes suitable priors on the model parameters, and its predictions effectively average over the parameter space in which EM picks a single point estimate. Gibbs stochastic simulation is a time-consuming inference method, but in theory it approaches the exact posterior without bias given unlimited time.
For stochastic optimization inference, the space saving of SVI is conspicuous, owing to the intrinsic virtue of our algorithm design, i.e., its sequence-by-sequence online style. Fig. 7b gives the hold-out prediction performance of batch structured variational inference versus its stochastic version (SVI). Note that we use different scales on the top and bottom axes of Fig. 7b to accommodate the huge gap in time consumption between SVI and BVI. Remarkably, to reach the predictive ability that takes BVI almost twelve scans of the whole dataset, SVI needs only several hundred sequences, a small fraction of the whole dataset. Admittedly, SVI has the weakness of high variance, especially in the initial stage. As the algorithm proceeds, however, we believe this is not a big issue given a proper step-size configuration and the many available variance reduction techniques.
4.5.2 Impact of forgetting rate

For almost all Robbins-Monro approximation approaches, the parameter settings are decisive for the algorithm’s performance. There are two parameters in Eq. 14, the forgetting rate κ and the decay coefficient τ. A bigger τ down-weights the early noisy updates more heavily, which is indeed a variance reduction trick at the starting stage. The impact of κ is more complicated and interesting. As shown in Eq. 14, the step sizes become very small as the algorithm proceeds, so the influence of the κ and τ configuration on the performance is trivial in later iterations. For this reason, we show the hold-out log-likelihood at the very early stage of SVI in Fig. 7c, fixing τ and the topic number but varying κ. Before 1.5 seconds in Fig. 7c, all curves exhibit large variance, and this is particularly obvious for the red curve with the smallest κ = 0.6. After 1.5 seconds, the blue and cyan curves are relatively stable and perform slightly worse than the red and green ones. This shows that a bigger κ means more conservative exploration in the stochastic natural gradient direction but more stable progress. These phenomena coincide with the physical meaning of the forgetting rate, i.e., how quickly old information is forgotten. The smaller the rate is, the more quickly old information is forgotten, in other words, the more aggressive the exploration along the stochastic gradient.
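Assuming Eq. 14 takes the standard Robbins-Monro step-size form used for SVI in [27], ρ_t = (t + τ)^(−κ), the following sketch illustrates the two effects described above: a smaller κ keeps the step sizes larger, so old information is forgotten faster, while a larger τ shrinks the earliest steps. The function names and the chosen values are illustrative only.

```python
def step_size(t: int, tau: float, kappa: float) -> float:
    """Assumed form of the schedule: rho_t = (t + tau) ** (-kappa).

    With kappa in (0.5, 1] the Robbins-Monro conditions hold:
    sum_t rho_t diverges while sum_t rho_t ** 2 converges.
    """
    return (t + tau) ** (-kappa)


def blend(old_param: float, new_estimate: float, rho: float) -> float:
    """Stochastic update: keep (1 - rho) of the old value, take rho of the new."""
    return (1.0 - rho) * old_param + rho * new_estimate


# Step sizes at iterations 1, 10, and 100 for several forgetting rates.
for kappa in (0.6, 0.7, 0.8, 0.9):
    steps = [step_size(t, tau=1.0, kappa=kappa) for t in (1, 10, 100)]
    print(kappa, [round(s, 4) for s in steps])
```

At any fixed iteration the printed step size shrinks as κ grows, which corresponds to the more conservative but more stable behaviour reported for the larger forgetting rates.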

4.5.3 Impact of minibatch size

The minibatch technique in stochastic optimization algorithms is also a trade-off between the efficiency and the stability of the performance improvement. Figure 7d illustrates the influence of the batch size on the inference algorithms’ performance during the first dozens of seconds. The interval between two successive error bars corresponds to the time of one update with the respective batch size. We have trimmed some error fluctuations that are hard to visualize at the beginning of the green line, i.e., the SVI algorithm with batch size equal to 1. Even with a minibatch of size 20, the algorithm, represented by the blue curve, mitigates the variance to a large extent in Fig. 7d, at the cost of relatively slower but more valuable parameter updates. That is, algorithms with a slightly bigger minibatch size achieve more stable performance improvement while keeping acceptable time efficiency compared with the fully batched variational inference algorithms. As is well known, a stable hold-out log-likelihood means consistent predictive performance, which is urgently desired in practical application scenarios. From this viewpoint, minibatch inference is a promising choice in our geospatial data application context, where sequence lengths are very heterogeneous. The key point is that the batch size should be selected according to the data and domain-specific features.
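To make the trade-off concrete, the following sketch shows a generic minibatch stochastic variational update for one global parameter, in the style of [27]: the expected sufficient statistics of the sampled sequences are averaged, rescaled to the size of the full dataset, added to the prior, and blended into the current estimate with step size ρ. It is a schematic under simplifying assumptions (a single Dirichlet-style parameter vector, sufficient statistics given as precomputed counts) rather than the paper's full structured update, and all names and toy data are hypothetical.

```python
import random


def minibatch_update(lam, prior, batch_stats, num_sequences, rho):
    """One stochastic natural-gradient step on a global variational parameter.

    lam:           current variational parameter (list of floats)
    prior:         prior hyperparameter (same length as lam)
    batch_stats:   expected sufficient statistics, one row per sampled sequence
    num_sequences: total number of sequences N in the training set
    rho:           step size in [0, 1]
    """
    dim = len(lam)
    batch_mean = [sum(row[d] for row in batch_stats) / len(batch_stats) for d in range(dim)]
    # Intermediate estimate: pretend the whole dataset looks like the minibatch.
    lam_hat = [prior[d] + num_sequences * batch_mean[d] for d in range(dim)]
    return [(1.0 - rho) * lam[d] + rho * lam_hat[d] for d in range(dim)]


# Toy statistics: N sequences, each contributing counts over D topics.
random.seed(0)
N, D = 1000, 4
all_stats = [[random.randint(0, 2 * (d + 1)) for d in range(D)] for _ in range(N)]
prior = [1.0] * D
exact = [prior[d] + sum(row[d] for row in all_stats) for d in range(D)]  # full-batch target


def run(batch_size, steps=200, seed=1):
    """Run minibatch updates and return the relative error with respect to `exact`."""
    rng = random.Random(seed)
    lam = list(prior)
    for t in range(1, steps + 1):
        batch = rng.sample(all_stats, batch_size)
        lam = minibatch_update(lam, prior, batch, N, rho=(t + 1.0) ** -0.7)
    return sum(abs(a - b) for a, b in zip(lam, exact)) / sum(exact)


for size in (1, 10, 20):
    print(size, run(size))
```

Each update touches only `batch_size` sequences, so its cost grows with the batch size, while the intermediate estimate `lam_hat` becomes less noisy; with the whole dataset as one batch and ρ = 1 the update reduces to the full-batch value.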

5 Related works

Mining knowledge from location-based services and geospatial-temporal data has recently become an appealing research direction. Human trajectories show a high degree of temporal and spatial regularity [25]. Check-in patterns in location-based social networks are investigated by [16, 17, 33, 51, 54] from temporal, spatial, social, and even economic aspects. In addition, GPS trajectories [73] and geo-tagged photos [2, 75] are employed to detect frequent trip and travel patterns. Besides GPS and camera data, sensor data collected from accelerometers, gyroscopes, microphones, Bluetooth, phone towers, and communication
logs, in the works [8, 11, 40], are used to characterize mobile phone users’ mobility and spatio-temporal interaction patterns. Related research results and methods are also widely applied to geographical function partitioning, and to information security and privacy protection. In urban computing [23, 72], taxicab traces are also utilized to infer mobility patterns. Insightfully, [34] quantifies the distinct characteristics of mobile phone movements and taxicab trips, and [46] investigates fundamental contrasts between manual check-in data that originates from location-based social networks and automatic check-in data that can be collected through various sensors. A variety of location data have thus been analyzed, but most of these works mine explicit frequent patterns and neglect the high-level and elastic mobility semantics, inevitably suffering from the performance degradation caused by data noisiness and uncertainty.
Analyzing trajectory data [74] and forecasting individual mobility behavior have been comprehensively researched. Relying on Fano’s inequality, [45, 56] find the theoretical maximum predictability of the Markov model [15]. Taking high-order Markov dependency into account, nth-order Markov chains [22], the additive Markov chain proposed by [71], and the user-category mixed Markov chain model [3] are used for movement prediction. A family of hidden Markov models has also been applied to this task, such as HMM for category forecast and Mixed HMM for location prediction in [62], the hidden semi-Markov model [6] for mobility data analysis, and the nonparametric HDP-HMM [31] for indoor abnormal activity recognition. Besides, decision trees [48], regression and M5 trees with explicitly extracted features [49], and a proximity model synthesizing geography and friendship [5] are extensively investigated for next place prediction as well.
The semantic trajectory [1] has attracted both academia [70] and industry, and semantic knowledge can boost many mobility applications [59]. There are seminal works on trajectory semantics annotation [50, 60] for dense mobile data via a predefined taxonomy or ontology [12, 57, 61] of trajectory behaviors. The sequential pattern mining algorithm PrefixSpan [26] has been widely used in [66, 67] to mine semantic trajectory patterns for location prediction. Reduced to a classification problem, semantic place label prediction has been boosted by feature engineering research, such as spatial and temporal features [7, 14], explicit pattern and implicit relatedness features [63], and combinations of them [36, 76]. A discriminative probabilistic sequential model, Conditional Random Fields, has been extensively adopted for robot and human activity recognition [58] and for labeling mobility events [4, 42].
Some emerging research applies probabilistic modelling to location-based applications. Probabilistic topic modeling [10] is the de facto core building block for web media modeling. There are direct applications of the topic model prototype [19, 43] and indirect augmented LDA variants [28, 65] in location-based services. Utilizing the topical assumption, the works [30, 37] model geographic information only, while [29, 64, 69] enrich it with text features, to generate location recommendations. Besides, the interesting work [68] combines mobility semantics and topical location semantics to discover the different functional zones of a city. A hybrid approach combining Markov and topic models, named the Photographer Behavior Model [39], is presented to handle the trip planning problem, and seems to coincide with our modelling intentions. However, we coherently formulate our problem and perform inference in the consistent language of probabilistic graphical models.
As for scalable inference methods for probabilistic models, especially variational Bayesian inference, SVI [27] is a landmark work in this field. Some variational inference techniques [21, 32] have been proposed for hidden Markov models and time series, promoting the stochastic approximation approach. Different from ours, the sub-chain styled stochastic inference [21] is designed for estimation on very long series.
All in all, geospatial human mobility semantics is indeed of great practical significance. However, the mainstream literature focuses separately on frequent trajectory pattern mining, semantic venue categories, or trajectory semantics ontologies, and few works address both geospatial topics and topical transition semantics simultaneously with a probabilistic modeling approach. To the best of our knowledge, solving the scalability issue and releasing the potential of big geospatial human mobility data remain largely unexplored.

6 Conclusions

Large-scale geospatial web content and human mobility data call for scalable processing methods. Moreover, to address data noisiness and uncertainty, high-level geospatial mobility semantics with probabilistic modelling are more appealing and flexible than explicit trajectory patterns. To this end, we propose a hidden Markov model with two different emission probabilities, Multinomial and Gaussian distributions, for two typical geo-data formats, to express geospatial mobility semantics. More importantly, we propose a stochastic approximation inference method, i.e., stochastic structured variational inference, to overcome the formidable scalability issue. In the experiments, two mobility application problems, semantic venue labeling and successive place prediction, are used to demonstrate the application value of the discovered human mobility semantics. Furthermore, in terms of scalability, we compare our stochastic variational inference with traditional inference methods, and give some practical guidance on applying the two versions of stochastic variational inference. The results support our analysis.
In the future, we plan to incorporate the periodic factor into the probabilistic modeling framework, to better reveal the temporal semantics of human daily mobility regularity. Furthermore, as observed in our practice, global and personal mobility semantics can complement each other with their respective advantages. Therefore, organically uniting them is promising for boosting diverse mobility applications, such as the two problems of semantic venue labeling and successive venue prediction. Last but not least, we want to derive a more expressive and elastic venue modeling mechanism than venue Id and geo-coordinates to capture the natural multi-modality of geo-data, which can better bridge high-level geospatial topics and the informative but somewhat noisy mobility data.

Acknowledgements This work is supported partly by the China 973 program (No. 2014CB340305, 2015CB358700), by the National Natural Science Foundation of China (No. 61421003), and by the Beijing Advanced Innovation Center for Big Data and Brain Computing.

References

1. Albanna BH, Moawad IF, Moussa SM, Sakr MA (2015) Semantic Trajectories: a survey from modeling
to application. Springer International Publishing, Cham, pp 59–76
2. Arase Y, Xie X, Hara T, Nishio S (2010) Mining people’s trips from large scale geo-tagged photos. In:
Proceedings of the international conference on multimedia, MM ’10. ACM, New York, pp 133–142
3. Asahara A, Maruyama K, Sato A, Seto K (2011) Pedestrian-movement prediction based on mixed
markov-chain model. In: Proceedings of the 19th ACM SIGSPATIAL international conference on
advances in geographic information systems. ACM, pp 25–33
4. Assam R, Seidl T (2014) Context-based location clustering and prediction using conditional random
fields. In: Proceedings of the 13th international conference on mobile and ubiquitous multimedia. ACM,
pp 1–10
5. Backstrom L, Sun E, Marlow C (2010) Find me if you can: Improving geographical prediction with
social and spatial proximity. In: Proceedings of the 19th international conference on world wide web,
WWW ’10. ACM, New York, pp 61–70
6. Baratchi M, Meratnia N, Havinga PJ, Skidmore AK, Toxopeus BA (2014) A hierarchical hidden semi-
markov model for modeling mobility data. In: Proceedings of the ACM international joint conference on
pervasive and ubiquitous computing. ACM, p 2014
7. Baumann P, Kleiminger W, Santini S (2013) The influence of temporal and spatial features on the
performance of next-place prediction algorithms. In: Proceedings of the 2013 ACM international joint
conference on pervasive and ubiquitous computing. ACM, pp 449–45
8. Bayir MA, Eagle N, Demirbas M (2009) Discovering spatiotemporal mobility profiles of cellphone
users. In: Proceedings of the 10th IEEE international symposium on a world of wireless, mobile and
multimedia networks (WoWMoM 2009), pp 1–9
9. Beal MJ (2003) Variational algorithms for approximate Bayesian inference. PhD thesis, University of London
10. Blei DM (2012) Probabilistic topic models. Commun ACM 55(4):77–84
11. Blondel VD, Decuyper A, Krings G (2015) A survey of results on mobile phone datasets analysis. EPJ
Data Sci 4(1):1
12. Bogorny V, Kuijpers B, Alvares LO (2009) St-dmql: a semantic trajectory data mining query language.
Int J Geogr Inf Sci 23(10):1245–1276
13. Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of
COMPSTAT’2010. Springer, pp 177–186
14. Chang C-W, Fan Y-C, Wu K-C, Chen AL (2014) On the semantic annotation of daily places: a machine-
learning approach. In: Proceedings of the 4th international workshop on location and the web. ACM,
pp 3–8
15. Chen M, Liu Y, Yu X (2014) Nlpmm: a next location predictor with markov modeling. In: Pacific-Asia
conference on knowledge discovery and data mining. Springer, pp 186–197
16. Cheng Z, Caverlee J, Lee K, Sui DZ (2011) Exploring millions of footprints in location sharing services.
ICWSM 2011:81–88
17. Cho E, Myers SA, Leskovec J (2011) Friendship and mobility: user movement in location-based social
networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery
and data mining. ACM, pp 1082–1090
18. Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based vector
machines. J Mach Learn Res 2(Dec):265–292
19. Farrahi K, Gatica-Perez D (2011) Discovering routines from large-scale human locations using proba-
bilistic topic models. ACM Trans Intell Syst Technol (TIST) 2(1):3
20. Forney GD (1973) The viterbi algorithm. Proc IEEE 61(3):268–278
21. Foti N, Xu J, Laird D, Fox E (2014) Stochastic variational inference for hidden Markov models. In:
Advances in neural information processing systems, pp 3599–3607
22. Gambs S, Killijian M-O, del Prado Cortez MN (2012) Next place prediction using mobility Markov
chains. In: Proceedings of the first workshop on measurement, privacy, and mobility, MPM ’12. ACM,
New York, pp 3:1–3:6
23. Ganti R, Srivatsa M, Ranganathan A, Han J (2013) Inferring human mobility patterns from taxicab loca-
tion traces. In: Proceedings of the 2013 ACM international joint conference on pervasive and ubiquitous
computing, UbiComp ’13. ACM, New York, pp 459–468
24. Gelman A, Carlin JB, Stern HS, Rubin DB (2014) Bayesian data analysis, vol 2. Chapman & Hall/CRC
Boca Raton, FL, USA
25. Gonzalez MC, Hidalgo CA, Barabasi A-L (2008) Understanding individual human mobility patterns.
Nature 453(7196):779–782
26. Han J, Pei J, Mortazavi-Asl B, Pinto H, Chen Q, Dayal U, Hsu M (2001) Prefixspan: Mining sequen-
tial patterns efficiently by prefix-projected pattern growth. In: Proceedings of the 17th international
conference on data engineering, pp 215–224
27. Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference. J Mach Learn Res
14(1):1303–1347
28. Hong L, Ahmed A, Gurumurthy S, Smola AJ, Tsioutsiouliklis K (2012) Discovering geographical topics
in the twitter stream. In: Proceedings of the 21st international conference on world wide web, WWW
’12. ACM, New York, pp 769–778
29. Hu B, Ester M (2013) Spatial topic modeling in online social media for location recommendation. In:
Proceedings of the 7th ACM conference on recommender systems. ACM, pp 25–32
30. Hu B, Jamali M, Ester M (2013) Spatio-temporal topic modeling in mobile social media for location
recommendation. In: IEEE 13th international conference on data mining. IEEE, p 2013
31. Hu DH, Zhang X-X, Yin J, Zheng VW, Yang Q (2009) Abnormal activity recognition based on hdp-hmm
models. In: IJCAI, pp 1715–1720
32. Johnson M, Willsky A (2014) Stochastic variational inference for bayesian time series models. In:
Proceedings of the 31st international conference on machine learning (ICML-14), pp 1854–1862
33. Jurdak R, Zhao K, Liu J, AbouJaoude M, Cameron M, Newth D (2015) Understanding human mobility
from twitter. PLoS ONE 10(7):e0131469
34. Kang C, Sobolevsky S, Liu Y, Ratti C (2013) Exploring human movements in Singapore: a compar-
ative analysis based on mobile phone and taxicab usages. In: Proceedings of the 2nd ACM SIGKDD
international workshop on urban computing, UrbComp ’13. ACM, New York, pp 1:1–1:8
35. Khreich W, Granger E, Miri A, Sabourin R (2012) A survey of techniques for incremental learning of
hmm parameters. Inf Sci 197:105–130
36. Krumm J, Rouhana D, Chang M-W (2015) Placer++: Semantic place labels beyond the visit. In: 2015
IEEE international conference on pervasive computing and communications (PerCom). IEEE, pp 11–19
37. Kurashima T, Iwata T, Hoshide T, Takaya N, Fujimura K (2013) Geo topic model: joint modeling of
user’s activity area and interests for location recommendation, ACM
38. Kurashima T, Iwata T, Irie G, Fujimura K (2010) Travel route recommendation using geotags in photo
sharing sites, ACM
39. Kurashima T, Iwata T, Irie G, Fujimura K (2010) Travel route recommendation using geotags in photo
sharing sites. ACM, New York, pp 579–588
40. Laurila JK, Gatica-Perez D, Aad I, Bornet O, Do T-M-T, Dousse O, Eberle J, Miettinen M, et al. (2012)
The mobile data challenge: Big data for mobile computing research. In: Pervasive computing, number
EPFL-CONF-192489
41. Li Z, Wang J, Han J (2012) Mining event periodicity from incomplete observations. In: Proceedings of
the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM
42. Liao L, Fox D, Kautz H (2007) Hierarchical conditional random fields for gps-based activity recognition.
In: Robotics research. Springer, pp 487–506
43. Long X, Jin L, Joshi J (2012) Exploring trajectory-driven local geographic topics in foursquare. In:
Proceedings of the ACM conference on ubiquitous computing. ACM, p 2012
44. Lovász L, Plummer MD (2009) Matching theory, vol 367. American Mathematical Soc.
45. Lu X, Wetter E, Bharti N, Tatem AJ, Bengtsson L (2013) Approaching the limit of predictability in
human mobility. Sci Rep 3
46. Malmi E, Do TMT, Gatica-Perez D (2012) Checking in or checked in: comparing large-scale manual and
automatic location disclosure patterns. In: Proceedings of the 11th international conference on mobile
and ubiquitous multimedia. ACM, p 26
47. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University
Press, New York
48. Monreale A, Pinelli F, Trasarti R, Giannotti F (2009) Wherenext: a location predictor on trajectory
pattern mining. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge
discovery and data mining. ACM, pp 637–646
49. Noulas A, Scellato S, Lathia N, Mascolo C (2012) Mining user mobility features for next place prediction
in location-based services. In: 2012 IEEE 12th international conference on data mining. IEEE, pp 1038–
1043
50. Parent C, Spaccapietra S, Renso C, Andrienko G, Andrienko N, Bogorny V, Damiani ML, Gkoulalas-
Divanis A, Macedo J, Pelekis N, et al. (2013) Semantic trajectories modeling and analysis. ACM Comput
Surv (CSUR) 45(4):42
51. Preoţiuc-Pietro D, Cohn T (2013) Mining user behaviours: a study of check-in patterns in location based
social networks. In: Proceedings of the 5th annual ACM web science conference. ACM
52. Rabiner L (1989) A tutorial on hidden markov models and selected applications in speech recognition.
Proc IEEE 77(2)
53. Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat
54. Scellato S, Noulas A, Lambiotte R, Mascolo C (2011) Socio-spatial properties of online location-based
social networks. In: Adamic LA, Baeza-Yates RA, Counts S (eds) ICWSM. The AAAI Press
55. Scott SL (2002) Bayesian methods for hidden Markov models. J Am Stat Assoc 97(457)
56. Song C, Qu Z, Blumm N, Barabási A-L (2010) Limits of predictability in human mobility. Science
327(5968):1018–1021
57. Spaccapietra S, Parent C, Damiani ML, de Macedo JA, Porto F, Vangenot C (2008) A conceptual view
on trajectories. Data Knowl Eng 65(1):126–146
58. Vail DL, Veloso MM, Lafferty JD (2007) Conditional random fields for activity recognition, ACM
59. Xiao X, Zheng Y, Luo Q, Xie X (2010) Finding similar users using category-based location history.
ACM, New York, pp 442–445
60. Yan Z, Chakraborty D, Parent C, Spaccapietra S, Aberer K (2013) Semantic trajectories: mobility data
computation and annotation. ACM Trans Intell Syst Technol (TIST) 4(3):49
61. Yan Z, Macedo J, Parent C, Spaccapietra S (2008) Trajectory ontologies and queries. Trans GIS
12(s1):75–91
62. Ye J, Zhu Z (2013) What your next move: user activity prediction in location-based social networks. In:
Proceedings of the SIAM international conference on data mining siam. SIAM
63. Ye M, Shou D, Lee W-C, Yin P, Janowicz K (2011) On the semantic annotation of places in
location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on
knowledge discovery and data mining. ACM, pp 520–528
64. Yin H, Cui B, Chen L, Hu Z, Zhang C (2015) Modeling location-based user rating profiles for
personalized recommendation. ACM Trans Knowl Discov Data (TKDD) 9(3):19
65. Yin Z, Cao L, Han J, Zhai C, Huang T (2011) Geographical topic discovery and comparison. In:
Proceedings of the 20th international conference on world wide web. ACM, pp 247–256
66. Ying JJ-C, Lee W-C, Tseng VS (2014) Mining geographic-temporal-semantic patterns in trajectories for
location prediction. ACM Trans Intell Syst Technol 5(1):2:1–2:33
67. Ying JJ-C, Lee W-C, Weng T-C, Tseng VS (2011) Semantic trajectory mining for location prediction.
In: Proceedings of the 19th ACM SIGSPATIAL international conference on advances in geographic
information systems. ACM, pp 34–43
68. Yuan J, Zheng Y, Xie X (2012) Discovering regions of different functions in a city using human mobility
and pois. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery
and data mining. ACM, pp 186–194
69. Yuan Q, Cong G, Zhao K, Ma Z, Sun A (2015) Who, where, when, and what: a nonparametric Bayesian
approach to context-aware recommendation and search for twitter users. ACM Trans Inf Syst (TOIS)
33(1):2
70. Zahir Irani P, Elragal A, El-Gendy N (2013) Trajectory data mining: integrating semantics. J Enterp Inf
Manag 26(5):516–535
71. Zhang J-D, Chow C-Y, Li Y (2014) LORE: exploiting sequential influence for location recommendations.
In: Proceedings of the 22nd ACM SIGSPATIAL international conference on advances in geographic
information systems. ACM, pp 103–112
72. Zheng Y, Liu Y, Yuan J, Xie X (2011) Urban computing with taxicabs. In: Proceedings of the 13th
international conference on ubiquitous computing, UbiComp ’11. ACM, New York, pp 89–98
73. Zheng Y, Zhang L, Xie X, Ma W-Y (2009) Mining interesting locations and travel sequences from GPS
trajectories. In: Proceedings of the 18th international conference on world wide web, WWW ’09. ACM,
New York, pp 791–800
74. Zheng Y, Zhou X (2011) Computing with spatial trajectories. Springer Science & Business Media
75. Zheng Y-T, Zha Z-J, Chua T-S (2012) Mining travel patterns from geotagged photos. ACM Trans Intell
Syst Technol (TIST)
76. Zhu Y, Zhong E, Lu Z, Yang Q (2013) Feature engineering for semantic place prediction. Pervasive Mob
Comput 9(6):772–783

Xiaohui Guo is a Ph.D. student in the School of Computer Science and Engineering, Beihang University,
Beijing, China. His research interests include machine learning, data mining, and service-oriented computing.
His machine learning research focuses mainly on probabilistic graphical modeling and scalable inference,
Bayesian sparse learning, and their applications to recommender systems, computational advertising, and
geospatial-temporal data analysis.

Richong Zhang received his B.Sc. and M.A.Sc. degrees from Jilin University, Changchun, China,
in 2001 and 2004, respectively. In 2006, he received his M.Sc. degree from Dalhousie University. He received his Ph.D.
from the School of Information Technology and Engineering, University of Ottawa. He is currently an associate
professor in the School of Computer Science and Engineering, Beihang University, Beijing, China. His
research interests include recommender systems, knowledge graphs, and crowdsourcing.

Xudong Liu received the Ph.D. degree in computer application technology from Beihang University, Beijing,
China. He is a professor and doctoral supervisor at Beihang University. His research interests mainly
include middleware technology and applications, service-oriented computing, trusted network computing,
and network software development.

Jinpeng Huai is a professor in the School of Computer Science and Engineering at Beihang University
and a vice-minister of the Ministry of Industry and Information Technology of the People’s Republic of China.
Prof. Huai is an academician of the Chinese Academy of Sciences. He previously served as Chief Scientist on the
Steering Committee for Advanced Computing Technology Subject for the National High-Tech Program (863).
His research interests include big data computing, distributed systems, virtual computing, service-oriented
computing, trustworthiness, and security.
