sensors
Article
A Data Cleaning Method for Big Trace Data Using
Movement Consistency
Xue Yang 1, Luliang Tang 1,*, Xia Zhang 2 and Qingquan Li 3
1 State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing,
Wuhan University, Wuhan 430079, China; yangxue@whu.edu.cn
2 School of Urban Design, Wuhan University, Wuhan 430070, China; xiazhang@whu.edu.cn
3 College of Civil Engineering, Shenzhen University, Shenzhen 518060, China; liqq@szu.edu.cn
* Correspondence: tll@whu.edu.cn; Tel.: +86-139-9568-3555
Received: 31 January 2018; Accepted: 6 March 2018; Published: 9 March 2018
Abstract: Given the popularization of GPS technologies, the massive volume of spatiotemporal GPS
traces collected by vehicles is becoming a new kind of big data source for urban geographic
information extraction. The growing volume of the dataset, however, creates processing and
management difficulties, while the low quality generates uncertainties when investigating human
activities. Based on the error distribution law and position accuracy of GPS data,
we propose in this paper a data cleaning method for this kind of spatial big data using movement
consistency. First, a trajectory is partitioned into a set of sub-trajectories using the movement
characteristic points. In this process, GPS points indicating that the motion status of the vehicle has
changed from one state into another are regarded as the movement characteristic points.
Then, GPS data are cleaned based on the similarities of GPS points and the movement consistency
model of the sub-trajectory. The movement consistency model is built using the random sample
consensus algorithm based on the high spatial consistency of high-quality GPS data. The proposed
method is evaluated based on extensive experiments, using GPS trajectories generated by a sample of
vehicles over a 7-day period in Wuhan city, China. The results show the effectiveness and efficiency
of the proposed method.
Keywords: data cleaning; big data; vehicle trajectory; movement consistency modeling
1. Introduction
Nowadays, big data are everywhere, from sensors that monitor traffic loads to the flood of tweets
and Facebook ‘likes’. Researchers use volume, velocity, variety, value, and veracity to characterize the
key properties of those big data [1]. In contrast to volume, velocity, variety, and value, the fifth
‘V’ of big data, veracity, is increasingly recognized as a key dimension when making big data
operational in various applications [2]. Big GPS trace data generated by vehicles also have the
five ‘V’ characteristics [3] and provide us with an unprecedented window into the dynamics of
urban areas [4–9]. However, the growing volume of spatial data brings significant challenges for the
management of data processing. In addition, a large amount of low-quality data mixed in the raw
dataset increases the uncertainty of knowledge mining. Therefore, data cleaning plays a crucial role in
the research field of information science [10–12].
In this paper, we propose an efficient method for big trace data cleaning. On the basis of our
previous work [12], the proposed method further develops and refines the theory of GPS data
cleaning. To preserve the movement consistency of moving objects, the entire trajectory is first partitioned into
a set of sub-trajectories by movement characteristic points. Those characteristic points are extracted
from trajectories based on the changes in the motion status. Then, GPS data are cleaned based on the
Sensors 2018, 18, 824; doi:10.3390/s18030824 www.mdpi.com/journal/sensors

similarities of GPS points and the movement consistency model of the sub-trajectory. The movement
consistency model is built using the random sample consensus algorithm based on the high spatial
consistency of high-quality GPS data. Moreover, the accuracy of cleaned data can be controlled by
tuning the threshold of similarities of GPS data and the movement consistency model. The proposed
method is evaluated based on extensive experiments, using GPS trajectories generated by a sample of
vehicles over a 7-day period in Wuhan city, China. The results show the effectiveness and efficiency of
the proposed method. In summary, the contributions of this research include: (1) High-quality GPS
data can be extracted from the raw dataset using the proposed data cleaning method; (2) The method
of vehicle movement consistency modeling is proposed using GPS trajectories; (3) The discussion of
the relationship between the accuracy of the cleaned GPS data and the similarity threshold provides
a possible way to extract GPS data based on the estimation accuracy; (4) The amount of GPS data
can be compressed after data cleaning, which can greatly decrease the storage space, as well as the
computing time.
2. Related Work
Based on previous studies, approaches to GPS time sequence data cleaning can be
broadly classified into statistical/quantitative based and logical/constraint based [13–16].
The statistical/quantitative methods of vehicle GPS data cleaning have been widely applied to identify
and clean GPS data as they are less susceptible to error stemming from sampling intervals. For example,
a high density of GPS points suggests a high probability that a road is present, whereas a low density
indicates that vehicles deviate far away from the road [17–19]. Therefore, low-density points are
defined as outliers. To remove these outliers, several studies [17] have sorted all the data points in
ascending order according to their distance from the median and chose 95% of the sorted data points
as the experimental data. Kernel density was applied to compute the density of each GPS point
and then all the low-density points were removed [18]. In [19], researchers proposed an adaptive
density optimization method to automatically recognize outliers and then removed those outliers.
In addition, an RGCPK (region growing clustering with prior knowledge) method was also proposed
to delete outliers from raw traces based on the motion tendency of vehicle traces [20]. However,
the existing methods proposed in articles [17–20] still cannot remove outliers mixed in the high-density
points cluster.
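The median-distance filter attributed to [17] above can be illustrated with a minimal sketch. It assumes planar (x, y) coordinates and a coordinate-wise median, neither of which is specified in the text; the function name `keep_closest_to_median` and the `keep_ratio` parameter are hypothetical.

```python
import math
from statistics import median


def keep_closest_to_median(points, keep_ratio=0.95):
    """Keep the share of points that lie closest to the coordinate-wise median.

    points: list of (x, y) tuples in a planar coordinate system (assumption).
    keep_ratio: fraction of points retained; 0.95 mirrors the 95% in the text.
    """
    if not points:
        return []
    mx = median(x for x, _ in points)
    my = median(y for _, y in points)
    # Sort in ascending order of distance from the median position.
    ranked = sorted(points, key=lambda p: math.hypot(p[0] - mx, p[1] - my))
    keep = max(1, int(round(keep_ratio * len(ranked))))
    return ranked[:keep]
```

As the paragraph above notes, a filter of this kind only discards points far from the bulk of the data; an inaccurate point that happens to fall inside a dense cluster survives it.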
Trajectory filtering is a typical example of the logical/constraint methods for GPS data quality
management. This method has been applied to improve GPS data position accuracy, including adaptive
Kalman filtering for INS/GPS [21] and particle filtering. Authors in [22] argued that they can apply
various filtering techniques to a trajectory to smooth the noise and potentially decrease the error in
the measurements. They also gave a detailed introduction on how to implement a filtering algorithm
to smooth a trajectory using the measuring position, speed, and heading of a GPS tracking point
in a trajectory. However, methods such as the Kalman filter and the particle filter have the
shortcomings of high complexity and computational overhead [22].
Unlike previous approaches that clean GPS trajectories based on clustering or filtering algorithms,
we propose a GPS data cleaning approach through the adjustment of movement consistency of GPS
data. The following section on the data cleaning model describes our method for raw GPS data
cleaning. Subsequent sections discuss the experimental results and conclusions.
3. Preliminaries
3.1. Spatial Big Data: Vehicle GPS Data

Vehicle GPS data as a major part of spatial big data [23] record the position, gathering time,
heading, speed, and other movement attributes of moving objects. In general, the sampling rates of
those vehicle trajectories usually range from 1 s to 60 s or even longer. The GPS data cleaning method
proposed in this paper focuses on finding high-quality GPS data, also known as high-accuracy GPS
data, from the raw GPS database. In reality, the position accuracy of the GPS data is different because
of the types of GPS receivers, collection environment, techniques (e.g., single-point positioning, precise
point positioning, and difference positioning), etc. For instance, the position accuracy of raw GPS
traces collected by taxis using single-point positioning technique is about 10–15 m in Wuhan, while
raw GPS data generated by smartphone applications on some mobile phones have a 3–5 m accuracy.
Beyond that, the accuracy of GPS data collected by the same GPS receiver also displays a difference in
different environments (e.g., open area, semi-sheltered area, and sheltered area) [24,25]. At the same
time, because of the influence of the error distribution law of the GPS data [26], the accuracy of each
GPS point of the trajectory is likely to be different. For instance, if the accuracy of a GPS dataset is
about 10 m, the accuracy of one part of such GPS data is higher than 10 m, while another portion
of the GPS data shows a lower accuracy. Therefore, a raw crowdsourcing GPS database has both
low-accuracy and high-accuracy GPS data; a trajectory in such a database also has both low-accuracy
and high-accuracy GPS data. Although most commercial GPS receivers usually implement strong
filtering techniques to obtain very smooth tracking results, a considerable amount of crowdsourced
GPS data generated by low-end GPS devices are still spotty.
3.2. Discussion: Movement Consistency of Vehicle GPS Data

The GPS data records the movement of moving objects; the higher the accuracy of the GPS data,
the more realistic is the moving pattern it describes. As we know, in the real world, vehicles always
keep moving in a straight direction except for changing lanes or turning at intersections. Therefore,
trajectories generated by those vehicles show a very smooth result when its accuracy is high, as shown
in Figure 1. Figure 1a,b respectively illustrate the DGPS (Differential Global Positioning System) data
with 0.1 m accuracy and its synchronous GPS data with 10 m accuracy collected by a mapping car.
The model of the GPS receiver and base receiver are Trimble_R9 and NetR9, respectively. The ground
truth of one of the trajectories is obtained by field measurements. As we can see from Figure 1a,
the DGPS data truly reflect the movement of the mapping car; however, the GPS data cannot paint the
true path of the mapping car because of the interference of some low-accuracy GPS points. Meanwhile,
by comparing with the ground truth, we found that the positions of GPS points vacillate around the
ground truth and some high-accuracy GPS points keep a high consistency in position and direction,
as shown in Figure 1a. The DGPS data, by contrast, show a very smooth result, and most of the points
are either very near to or are at the ground truth.
Figure 1. Movement consistency and position accuracy of the GPS data. (a) Differential Global
Positioning System (DGPS) data overlay with the ground truth; (b) GPS data overlay with the
ground truth.
Through the comparative results above, the high-accuracy GPS points of the trajectory present
a high consistency of the movement. Based on this observation, the key techniques of GPS data
cleaning from the raw crowdsourced database are to construct the consistency model of GPS points
based on such consistency of high-accuracy GPS data.
4. GPS Data Cleaning Method Based on Movement Consistency
4.1. Overview
According to the analysis discussed in the previous section, the data cleaning method proposed
in this paper has two steps: trajectory segmentation and movement consistency modeling, as shown
in Figure 2.
Figure 2. The methodology of GPS data cleaning: (a) the raw GPS trace data; (b) how the
sub-traces are partitioned using a characteristic point p_6; (c) the consistency model generated for
each sub-trace; and (d) the results of cleaning based on the similarity between the consistency model
and sub-trace points.
Step 1. The whole trajectory is partitioned into a set of sub-trajectories based on the movement
characteristic constraints, as shown in Figure 2a,b. These split points, also called characteristic
points, are the starting and ending points of each sub-trajectory.
Step 2. The movement consistency model of each sub-trajectory is constructed using the random
sample consensus algorithm based on the high spatial consistency of high-quality GPS data,
as shown in Figure 2c,d. The movement consistency model is regarded as the linear position
reference for cleaning points; the more similar the GPS points are to the movement consistency
model, the more precise are the GPS points.
This section presents a detailed introduction of each process.
4.2. Trajectory Segmentation Based on the Changes in Motion Status of Vehicles
Trajectory segmentation is a preparatory work in spatiotemporal data mining [27]. For example,
Gonzales et al. [28] identified critical points in various GPS trajectories to perform their mode
classification study. In general, the whole trajectory is divided into several sub-trajectories based on
the movement characteristic constraints such as position, time interval, velocity, etc. [29,30]. In this
paper, we focus on GPS data cleaning based on the movement consistency. The cleaning rule of the
proposed method is based on the premise that a moving object keeps moving on the same road in
the same direction. Thus, trajectory segmentation aims to determine the characteristic points where
the position or direction of a trajectory changes rapidly and then splits the trajectory at the
detected characteristic points.
4.2.1. The Principle of Trajectory Segmentation

The partitioning constraint factors in trajectory segmentation include position and angle, and are
termed verdis_k and angdis_k in the following definitions:
Definition 1 (Position interference verdis_k). Let T_i = (p_1, p_2, …, p_n) denote the trajectory of an object
moving from p_1 to p_n. For any tracking point p_k ∈ T_i, k = 1, 2, …, n, the vector composed of p_i and p_{i+1}
represents the present movement, i = 1, 2, …, n, and p_{i+2}′ is the projection of p_{i+2} onto the vector of p_i and
p_{i+1}. The distance between p_{i+2} and p_{i+2}′ is called the position interference verdis_k, as shown in Figure 3.
Definition 2 (Angle jamming angdis_k). Let T_i = (p_1, p_2, …, p_n) denote the trajectory of an object moving
from p_1 to p_n, as shown in Figure 4. For tracking points (p_{i+1}, p_{i+2}) ∈ T_i, i = 1, 2, …, n, the vector composed
of p_{i+1} and p_{i+2} is the present movement, and the angle between p_i p_{i+1} and p_{i+1} p_{i+2} is the angle jamming
value angdis_k, as shown in Figure 3.
Definition 3 (Partitioning termination thresholds a_1 and a_2). The variables a_1 and a_2 respectively represent
the partitioning termination thresholds in distance and angle.
Figure 3. Trajectory partitioning based on position and angle constraints.
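The two quantities in Definitions 1 and 2 can be sketched as plain geometry. The sketch below assumes planar (x, y) coordinates, treats the "vector of p_i and p_{i+1}" as the infinite line through both points, and reports angles in degrees; the source fixes none of these choices, and the function names are illustrative.

```python
import math


def verdis(p_i, p_i1, p_i2):
    """Distance from p_i2 to its projection on the line through p_i and p_i1."""
    dx, dy = p_i1[0] - p_i[0], p_i1[1] - p_i[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        # Degenerate movement vector: fall back to the point-to-point distance.
        return math.hypot(p_i2[0] - p_i[0], p_i2[1] - p_i[1])
    cross = dx * (p_i2[1] - p_i[1]) - dy * (p_i2[0] - p_i[0])
    return abs(cross) / norm


def angdis(p_i, p_i1, p_i2):
    """Unsigned angle in degrees between p_i -> p_i1 and p_i1 -> p_i2."""
    ux, uy = p_i1[0] - p_i[0], p_i1[1] - p_i[1]
    vx, vy = p_i2[0] - p_i1[0], p_i2[1] - p_i1[1]
    if (ux == 0 and uy == 0) or (vx == 0 and vy == 0):
        return 0.0  # no heading change can be measured for a zero-length step
    return abs(math.degrees(math.atan2(ux * vy - uy * vx, ux * vx + uy * vy)))
```

Using `atan2` on the cross and dot products avoids the loss of precision that `acos` shows for nearly parallel vectors, which is the common case on straight road segments.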
The main idea of trajectory segmentation for GPS data cleaning is to check the value of verdis_k and
angdis_k, k = 1, 2, …, n, with respect to the present movement. This algorithm is introduced as follows:
Step 1. input the trajectory T_i = (p_1, p_2, p_3, …, p_n);
Step 2. initialize the partitioning parameters: the characteristic points C, c_1, startIndex, currIndex,
length, a_1, and a_2, and set c_1 = p_1, startIndex = 1, length = 1;
Step 3. set currIndex = startIndex + length. If currIndex < n, go to Step 4; otherwise, go to Step 8;
Step 4. set j = startIndex + 2;
Step 5. calculate verdis_j and angdis_j. If verdis_j > a_1 || angdis_j > a_2, go to Step 6; otherwise, go to Step 3;
Step 6. push p_j into C and set startIndex = j − 1, j = j + 1;
Step 7. if j < n, go to Step 5; otherwise, go to Step 3;
Step 8. push p_n into C, and return C.
Based on this segmentation algorithm, a trajectory is divided into several sub-trajectories if any
one of verdis_k and angdis_k, k = 1, 2, …, n, with respect to the present movement exceeds its partitioning
termination threshold (a_1 or a_2). The sub-trajectories will be regarded as the basic unit for the
remainder of the cleaning. It should be noted that the characteristic points are stored and managed
separately from the sub-trajectories after cleaning since they could be used for trajectory compression
or abnormal behavior detection.
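The segmentation procedure can be approximated by a single pass over the trajectory. The sketch below is one interpretation rather than the authors' implementation: it compares each point p_j with the movement defined by its two predecessors, takes a_1 as the distance threshold and a_2 as the angle threshold in degrees (following Definition 3), and uses 0-based indices. The helper names are hypothetical.

```python
import math


def _deviation(a, b, c):
    """Return (verdis, angdis in degrees) of point c relative to the move a -> b."""
    ux, uy = b[0] - a[0], b[1] - a[1]
    vx, vy = c[0] - b[0], c[1] - b[1]
    norm = math.hypot(ux, uy)
    if norm == 0.0 or (vx == 0 and vy == 0):
        return 0.0, 0.0  # repeated positions carry no movement information
    cross = ux * vy - uy * vx
    ver = abs(cross) / norm
    ang = abs(math.degrees(math.atan2(cross, ux * vx + uy * vy)))
    return ver, ang


def characteristic_indices(traj, a1, a2):
    """Indices of characteristic points; the first and last point always count."""
    n = len(traj)
    if n < 3:
        return list(range(n))
    cuts = [0]
    for j in range(2, n):
        ver, ang = _deviation(traj[j - 2], traj[j - 1], traj[j])
        if ver > a1 or ang > a2:
            cuts.append(j)
    if cuts[-1] != n - 1:
        cuts.append(n - 1)
    return cuts


def split_at(traj, cuts):
    """Sub-trajectories that start and end at consecutive characteristic points."""
    return [traj[s:e + 1] for s, e in zip(cuts, cuts[1:])]
```

For example, an L-shaped track that heads east and then turns north yields one characteristic point just after the corner, so `split_at` returns two sub-trajectories that share that point.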
4.2.2. Segmentation Threshold Determination

The distance and angle thresholds (a_1 and a_2) are used to determine whether the tracking point
has departed from the centerline of the original route. In general, a GPS point is considered as a turning
point if the vertical distance and angle between its two adjacent GPS vectors exceed the maximum
width of the road or the minimal angle of the traffic turn in a city. These turning points could be
considered as characteristic points that indicate the moving object has changed the moving route or
direction. Beyond that, for different types of trajectories, different a_1 and a_2 values should be set for
trajectory partitioning relative to their different shapes and unique characteristics. For a trajectory,
the more complicated the shape, the more characteristic points are found in that partition. Thus,
this study defines two deciding factors to determine the partitioning termination threshold for each
trajectory. Specifically, the first deciding factor is a global range of distance and angle in trajectory
partitioning for all GPS data, and its value is decided by the knowledge of traffic law in a city.
The second deciding factor is determined by the shape complexity of a trajectory. Both of those factors
combine to determine a specific partitioning termination threshold for each trajectory as follows:
$$a_1 = \lambda_1 + g(\beta_1) \tag{1}$$
$$a_2 = \lambda_2 + g(\beta_2) \tag{2}$$
where λ_1 and λ_2 are the variables of the first deciding factor in the aspects of distance and angle,
respectively. In our study, the values of λ_1 and λ_2 equate with the maximum width of the road and the
minimum angle of the traffic turn in a city, respectively. The functions g(β_1) and g(β_2) describe the
relationship between the partitioning scale and the shape complexity of trajectories in the aspects of
distance and angle, respectively. The variables β_1 and β_2 represent the shape complexity of a trajectory
in aspects of distance and angle, respectively.
Given the significant inverse correlation between the shape complexity and the partitioning
termination threshold of a trajectory, the higher the values of β_1 and β_2, the lower the values of
g(β_1) and g(β_2). Furthermore, the values of the partitioning termination thresholds g(β_1) and g(β_2)
must be sensitive to the shape complexity β_1 and β_2 of trajectories within the set range. When the
values of the shape complexity β_1 and β_2 of a trajectory are over a set range, then the partitioning
thresholds g(β_1) and g(β_2) have less variation in distance and angle thresholds, respectively. Therefore,
on this basis, combining the previous work for estimating the shape complexity of trajectories [31,32],
a logarithmic function, one of the elementary functions, is used to model the relationship of β_1 and g(β_1)
and β_2 and g(β_2) as follows:
$$g(\beta_1) = \log_a \beta_1 \tag{3}$$
$$g(\beta_2) = \log_a \beta_2 \tag{4}$$
where 'a' is the base of the functions g(β_1) and g(β_2), 0 < a < 1. To always keep the values of
a_1 and a_2 positive, the absolute values of g(β_1) and g(β_2) must be smaller than the values of λ_1 and
λ_2. In previous research, movement parameters are usually used to describe the complexity of
trajectories [33]. In the movement feature set, classic descriptive statistics of movement parameters,
which include the mean, standard deviation, and skewness of moving speed, the turning angle,
and straightness index, are extracted from trajectories as basic movement features. In this paper,
the standard deviations of projection distance and turning angle are used to represent the complexity
of trajectory in position (β_1) and direction (β_2), respectively, as follows:
$$\beta_1 = \sqrt{\frac{1}{n-2}\sum_{i=2}^{n-1}\left(\left|p_i p_i'\right| - \mu_{dis}\right)^2} \tag{5}$$
where
$$\mu_{dis} = \frac{1}{n-2}\sum_{i=2}^{n-1}\left|p_i p_i'\right|$$
$$\left|p_i p_{i+1}\right| = \sqrt{(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2}$$
$$\left|p_1 p_n\right| = \sqrt{(x_1 - x_n)^2 + (y_1 - y_n)^2}$$
$$\beta_2 = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n-1}\left(\angle\left(\overrightarrow{p_i p_{i+1}}, \overrightarrow{p_1 p_n}\right) - \mu_{ang}\right)^2} \tag{6}$$
where
$$\mu_{ang} = \frac{1}{n-1}\sum_{i=1}^{n-1}\angle\left(\overrightarrow{p_i p_{i+1}}, \overrightarrow{p_1 p_n}\right)$$
The complexity of the trajectory is positively associated with the values of β_1 and β_2, and higher
values of β_1 and β_2 indicate a higher complexity of the trajectory in distance and angle. As shown
in Figure 4, given a trajectory Tr_i = (p_1, p_2, …, p_n), p_i is the tracking point in the trajectory Tr_i and
p_i = (x_i, y_i), i = 1, 2, …, n, p_1, and p_n are respectively the start and end points of Tr_i. The variables β_1
and β_2 are computed by Equations (5) and (6), where ∠(\overrightarrow{p_i p_{i+1}}, \overrightarrow{p_1 p_n}) represents the angle between the
vector p_i p_{i+1} (denoted as \overrightarrow{p_i p_{i+1}}) and the vector p_1 p_n (denoted as \overrightarrow{p_1 p_n}), i = 1, 2, …, n. At the same time, to
avoid the extreme values of β_1 and β_2 by a looping trajectory, it is necessary to compare the positions
of p_1 and p_n first. If the starting point p_1 overlaps with the ending point p_n, then p_n is replaced by p_{n−1}.
This process is repeated from p_n to p_2 until a point is found that does not overlap with the starting
point p_1.
Figure 4. The complexity of trajectory in direction and position.
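Equations (1) to (6) can be combined into a short numerical sketch. It follows the prose convention (β_1 for position, β_2 for direction) and makes several assumptions that the source does not state: planar coordinates, angles in degrees, a default logarithm base of 0.5, and a looping trajectory handled by dropping trailing points that coincide with p_1. The function names and the λ values in the test are illustrative only.

```python
import math
from statistics import pstdev


def shape_complexity(traj):
    """Return (beta1, beta2) for a trajectory given as a list of (x, y) tuples.

    beta1: std. deviation of interior-point distances to the baseline p_1 -> p_n.
    beta2: std. deviation of the angles (degrees) between each step and the baseline.
    """
    p1 = traj[0]
    end = len(traj) - 1
    while end > 0 and traj[end] == p1:  # looping trajectory: step back from p_n
        end -= 1
    if end < 2:
        raise ValueError("need three points and an end point distinct from p_1")
    pts = traj[: end + 1]
    bx, by = pts[-1][0] - p1[0], pts[-1][1] - p1[1]
    base_len = math.hypot(bx, by)
    dists = [abs(bx * (y - p1[1]) - by * (x - p1[0])) / base_len
             for x, y in pts[1:-1]]
    angles = []
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        vx, vy = x1 - x0, y1 - y0
        angles.append(abs(math.degrees(math.atan2(bx * vy - by * vx,
                                                  bx * vx + by * vy))))
    # pstdev divides by the sample count, matching 1/(n-2) and 1/(n-1) above.
    return pstdev(dists), pstdev(angles)


def partition_thresholds(beta1, beta2, lam1, lam2, base=0.5):
    """a_k = lambda_k + log_base(beta_k), with 0 < base < 1."""
    if not 0.0 < base < 1.0:
        raise ValueError("base must lie in (0, 1)")
    if beta1 <= 0.0 or beta2 <= 0.0:
        raise ValueError("the logarithm is undefined for zero complexity")
    g1, g2 = math.log(beta1, base), math.log(beta2, base)
    if abs(g1) >= lam1 or abs(g2) >= lam2:
        raise ValueError("|g(beta)| must stay below lambda to keep a_k positive")
    return lam1 + g1, lam2 + g2
```

Because the base lies between 0 and 1, a complexity above 1 gives a negative g(β) and therefore a threshold tighter than λ, so more complex trajectories are split more readily.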
4.3. GPS Data Cleaning Based on Movement Consistency

4.3.1. The Consistency Model Construction for Each Sub-Trajectory
The sub-trajectory reflects the tendency of moving objects as they keep moving on the same road in the same direction. On this basis, and combining the discussions in Section 4.2, we find that high-accuracy vehicle tracking data are highly consistent in position and direction. For instance, tracking points of vehicles with high position accuracy always cluster together along the centerline of each lane, while also having similar headings. Thus, in this paper, we propose using this movement consistency to find high-quality GPS data from the raw GPS database. Specifically, the movement consistency model is defined as a directed line segment that belongs to the straight line l, as shown in Figure 5. Since a trajectory has been segmented into a set of sub-trajectories based on trajectory segmentation, the GPS tracking points of each sub-trajectory keep similar headings except for a few curves. Therefore, the construction of the movement consistency model of each sub-trajectory equates with the generation of the straight line l. At present, the least squares method is the most commonly used method for parameter estimation. However, the estimated parameters from a least squares model can be corrupted by outliers. To avoid the effects of these outliers, we use the Random Sample Consensus (RANSAC) algorithm to find GPS points with high consistency in position and heading and then get the consistency model of each sub-trajectory.
Sensors 2018, 18, 824 8 of 16
Figure 5. Construction of the consistency model by using RANSAC.
Given a sub-trajectory $STr_i = (p_i, p_{i+1}, \ldots, p_{i+t})$, $p_k = (x_k, y_k)$, $k = i, i+1, \ldots, i+t$, $STr_i \in Tr_i$, assume that the consistency model of $STr_i$ belongs to the straight line l. With $(x_0, y_0)$ a point that the consistency model goes through and $b_0$ and $b_1$ the coefficients of the straight line l, the line is written in parametric form as:

$$
\begin{cases}
x = x_0 + b_0 t \\
y = y_0 + b_1 t
\end{cases}
\tag{7}
$$
The estimated model in the RANSAC algorithm is termed M* and has the same form as Equation (7). The threshold τ defines whether a GPS tracking point pi conforms to model M*. The number of iterations is set as N, and the parameter s represents the number of data elements required to fit M*. The concrete procedure for finding position points with high consistency using the RANSAC algorithm was obtained from [34].
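As a rough illustration of how a line of the form of Equation (7) can be estimated with RANSAC, the sketch below fits a 2D line to a list of planar points. It is not the procedure of [34]: it assumes s = 2, checks consistency in position only (the heading check mentioned in the text is omitted), uses a fixed iteration count, and the names `ransac_line`, `n_iter`, and `seed` are hypothetical.

```python
import math
import random


def ransac_line(points, tau, n_iter=200, seed=0):
    """Fit a line x = x0 + b0*t, y = y0 + b1*t to 2D points with RANSAC.

    points : list of (x, y) tuples in a planar coordinate system (metres)
    tau    : maximum perpendicular distance for a point to count as an inlier
    Returns ((x0, y0, b0, b1), inliers); (b0, b1) is a unit direction vector.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iter):
        # Minimal sample: s = 2 points define a candidate line.
        (xa, ya), (xb, yb) = rng.sample(points, 2)
        dx, dy = xb - xa, yb - ya
        norm = math.hypot(dx, dy)
        if norm == 0.0:
            continue  # coincident points do not define a direction
        b0, b1 = dx / norm, dy / norm
        # Perpendicular distance via the 2D cross product with the unit direction.
        inliers = [
            (x, y) for x, y in points
            if abs((x - xa) * b1 - (y - ya) * b0) <= tau
        ]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (xa, ya, b0, b1), inliers
    return best_model, best_inliers
```

A least squares refit on the returned inliers is a common follow-up step, but it is left out here to keep the sketch short.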
4.3.2. Discussion of Similarity and Consistency Model for GPS Data Cleaning
The consistency model of each sub-trajectory is constructed based on the movement consistency of high-accuracy GPS points. Thus, for a sub-trajectory, the value of the similarity between a GPS point and its consistency model relates directly to the level of position accuracy of that point. In this study, the similarity evaluation between a GPS point and the consistency model in distance and direction is defined as Equation (8) by consulting the previous methods [35]:

$$
\mathrm{sim}(p_t, G) = \omega_1 e^{-|p_t p_t'|} + \omega_2 e^{-(1 - \cos\theta_t)} \tag{8}
$$

where $|p_t p_t'|$ is the distance between $p_t$ and its projection point $p_t'$ on the consistency model, $\theta_t$ is the angle between the heading angle of $p_t$ and the direction of the consistency model, and $\omega_1$ and $\omega_2$ are the weights of the vertical distance and angle, with $\omega_1 + \omega_2 = 1$. The similarity of GPS measurements and the consistency model ranges from 0 to 1.
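Equation (8) can be evaluated directly once a line model is available. The sketch below is a minimal illustration under several assumptions that the text does not fix: coordinates are planar and in metres, the model is given as `(x0, y0, b0, b1)` with a unit direction vector, and headings are in degrees measured counter-clockwise from the x axis. The default weights are the values 0.91 and 0.09 reported later in Section 5.2; the function name `similarity` is hypothetical.

```python
import math


def similarity(point, heading_deg, model, w1=0.91, w2=0.09):
    """Similarity between a GPS point and a line model, after Equation (8).

    point       : (x, y) position of the GPS point
    heading_deg : heading of the point (assumed convention: degrees,
                  counter-clockwise from the x axis)
    model       : (x0, y0, b0, b1) with (b0, b1) a unit direction vector
    w1, w2      : weights of the distance and angle terms, w1 + w2 = 1
    """
    x0, y0, b0, b1 = model
    x, y = point
    # Distance from the point to its projection on the line.
    dist = abs((x - x0) * b1 - (y - y0) * b0)
    # Angle between the point heading and the line direction.
    theta = math.radians(heading_deg) - math.atan2(b1, b0)
    return w1 * math.exp(-dist) + w2 * math.exp(-(1.0 - math.cos(theta)))
```

A point lying on the line and heading along it scores 1, while a distant point travelling in the opposite direction scores close to 0.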
Based on the results of similarity calculation, the high-quality GPS data are detected by setting different similarity thresholds. All cleaned GPS points will be joined back into a long trajectory and be used as raw material for information mining (e.g., road network generation, traffic flow detection, human mobility pattern mining, etc.). The similarity threshold determines the smoothness and quality of cleaned GPS points, and each similarity should correspond to an estimation accuracy of GPS data. However, since there are still many uncertainties in movement consistency construction, it is very difficult to obtain the definite relation between the similarity and the estimation accuracy of GPS data. In this paper, we use the relation of similarity (denoted as Sim) and position deviation (denoted as ε, also called an estimation accuracy) between GPS data and the ground truth to estimate the similarity threshold. The detailed analysis of the relation between Sim and ε by using GPS data in the real world is discussed in the next section.
5. Experimental Study
5.1. Experimental Dataset

To test the performance of our method, we experimented with real trajectory datasets. The experimental trajectory data were collected by several shuttle vehicles in Wuhan. These shuttle vehicles were equipped with a GPS logger (model: Trimble_R9), several smartphones (model: MDM6610, UBX-G6010-ST, MTK-MT6627, etc.), a hand-held GPS (model: SIRF systems), and an inertial measurement unit (model: POS310PCS) that recorded two kinds of traces, GPS and synchronized DGPS traces. It must be stressed that one GPS point corresponds to one DGPS point and all points represent the position of a moving object with different positional accuracies. The position accuracies of the GPS and DGPS data in an urban area were about 10–15 m and 0.05–0.1 m, respectively. The sampling interval for these data was 1 s. The time interval between two adjacent tracking points on a trajectory was not more than 360 s; otherwise, storing the trajectory was restarted from the position where it exceeded the set value. The data collection period for the shuttle vehicles was 7 days. We obtained about 140 million GPS and DGPS points, as shown in Figure 6.
Figure 6. Experimental data. (a) DGPS and synchronized GPS trajectories collected by IMU/DGPS systems and GPS loggers; (b) DGPS and synchronized GPS trajectories collected by IMU/DGPS systems, smartphones, and hand-GPS.
In our study, the highest accuracy of the cleaned data reached the meter level. The position
accuracy of trajectories generated by the IMU/DGPS system reached the centimeter level. Therefore,
in a follow-up experiment, the synchronized high-accuracy DGPS traces were regarded as ground
truth to validate the effectiveness of the proposed method. The raw low-accuracy GPS data will be
considered as the experimental data.
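Because each GPS point has one synchronized DGPS point, the position deviation of a GPS point can be taken as its distance to the matching ground-truth point. The sketch below assumes that both traces are already projected into the same planar coordinate system in metres and are aligned index by index; the function name `position_deviations` is hypothetical.

```python
import math
import statistics


def position_deviations(gps, dgps):
    """Per-point distance (m) between GPS points and their DGPS ground truth.

    gps, dgps : equally long lists of (x, y) tuples, where gps[i] and
                dgps[i] are the synchronized observations of one instant.
    """
    if len(gps) != len(dgps):
        raise ValueError("GPS and DGPS traces must be synchronized point by point")
    return [
        math.hypot(gx - dx, gy - dy)
        for (gx, gy), (dx, dy) in zip(gps, dgps)
    ]


# Mean and standard deviation of the kind reported for each device:
deviations = position_deviations([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)])
mean_dev = statistics.mean(deviations)
std_dev = statistics.pstdev(deviations)
```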
5.2. Parameters Discussion
The constants λ1 and λ2, and the base ‘a’ for partitioning threshold determination, are necessary for trajectory partitioning. Based on the above, the value of λ1 equates with the maximum range of road width and λ2 depends on the turning angle of vehicles in a city. The experimental trace data were collected in Wuhan. Based on the construction rule of the roads, the maximum width of the one-way road in the experimental region was about 17.5 m, so the value of λ1 was set as 17.5 m. As the minimum angle of a traffic turn in China is about 60° and the heading error in the GPS data is about 5°–15°, we set λ2 to 45°. The value for base ‘a’ in Equations (1) and (2) ranges from 0 to 1 and affects the minimum and maximum values of g(β1) and g(β2). Based on Equations (1) and (2), the functions g(β1) and g(β2) decrease as the trajectory complexity values β1 and β2 increase, and are less than zero if β1 and β2 are both greater than 1. To always keep the values of a1 and a2 positive, the absolute minimum values of g(β1) and g(β2) must be smaller than the constants λ1 and λ2. Figure 7 shows the changing rules of g(β1) and g(β2) with the specific base under different values of β1 and β2. The base ‘a’ ranges between about 0 and 1.
Figure 7. Discussion of base ‘a’ in Equation (2).
In Figure 7, the smaller the base ‘a’, the smaller will be the g(β1) and g(β2) change. As the constants λ1 and λ2 are equal to 17.5 m and 45° in this paper, base ‘10−1’ was selected as the value of ‘a’ in Equation (2). After trace partitioning (Figure 8a), the sub-trajectories are regarded as raw data and cleaned based on the movement consistency model. For consistency model construction, the value of τ was set as 0.1 m according to the accuracy requirement; other parameters such as N are self-adaptive (Figure 8b).
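The text does not say how N adapts. One widely used rule in RANSAC implementations, shown here purely as an assumption about what "self-adaptive" could mean, chooses N so that at least one all-inlier sample is drawn with a given confidence: N = log(1 − p) / log(1 − w^s), where w is the current inlier ratio, s is the sample size, and p is the desired confidence. The function name `adaptive_iterations` and the default confidence of 0.99 are hypothetical.

```python
import math


def adaptive_iterations(inlier_ratio, s=2, confidence=0.99):
    """Number of RANSAC iterations needed to draw at least one all-inlier
    sample of size `s` with probability `confidence`.

    inlier_ratio : estimated fraction of inliers, strictly between 0 and 1
    """
    if not 0.0 < inlier_ratio < 1.0:
        raise ValueError("inlier_ratio must lie strictly between 0 and 1")
    p_good_sample = inlier_ratio ** s
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))
```

Under this rule, N shrinks as the best model found so far explains more of the sub-trajectory, so clean sub-trajectories need only a handful of iterations.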
Figure 8. Experimental consistency model construction results for all vehicle movement data. (a) The consistency model construction results for the entire dataset; (b) the results of a part of the vehicle movement data.
A similarity evaluation model is used to calculate the similarity between GPS tracking points and the movement consistency model. This similarity evaluation model is used not only for estimating the similarity of GPS points and the consistency model but also for the cleaning threshold discussion. Both applications evaluate the similarity between a GPS point and a high-accuracy spatial reference in terms of distance and angle. In this paper, the weights of distance and angle in the similarity evaluation model are estimated using the correlation between the distance and angle and the measuring errors of GPS data [20]. The experimental results show that the weights in the similarity evaluation model are 0.91 and 0.09, respectively.
We use the evaluation model are 0.91
linear regression and 0.09,
analysis of the respectively.
similarity and the position deviation of GPS
We use the
measurements to linear
derivate regression
the relationanalysis
of Simofand the ε.similarity
With the and resulttheofposition
multipledeviation of GPS
linear regression
measurements to derivate the relation of Sim and ε. With the result of multiple
analysis, the relation of similarity (Sim) and position deviation (ε) between GPS data and the ground linear regression
analysis,
truth theexponential
fits an relation of similarity
model, as(Sim)shown and
in position
Equationdeviation
(9): (ε) between GPS data and the ground
truth fits an exponential model, as shown in Equation (9):
Sim  aeb  c (9)
Sim = aebε + c (9)
The values of parameters a, b, c in Equation (9) are determined by weights of the similarity evaluation model. The cleaning threshold with the specific estimation accuracy is obtained based on Equation (9). Based on plenty of experiment data and analyzing results, the correlation coefficient R for Sim and ε is about 0.942 when the values of a, b, c in Equation (9) for GPS data with 10–15 m accuracy are set as 1, −0.263, 0, respectively. Figure 9 shows the result of exponential regression of similarity and position accuracy of two different datasets collected in different environments with the same overall position accuracy. ‘Dataset 1’ and ‘Dataset 2’ were collected in an urban area on a shadowed road and a semi-shadowed road, respectively. The models of GPS receivers for collecting ‘Dataset 1’ and ‘Dataset 2’ were Trimble R9 and SIRF systems, respectively. The ground truths of these two datasets were obtained based on the CORS system by assembling the GPS receivers and CORS system together. Based on the similarity of Equation (9), we can get some cleaning thresholds by tuning the value of ε. Figure 10 shows the cleaned results of GPS points from raw GPS traces of two datasets, with its estimation accuracy set as 3 m; that is, ε equals 3 m.
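With the reported parameters a = 1, b = −0.263, c = 0, Equation (9) turns an estimation accuracy ε into a similarity threshold. The sketch below is illustrative only: the function names are hypothetical, and the rule that a point is kept when its similarity is at least the threshold is an assumption, since the text does not spell out the comparison.

```python
import math


def similarity_threshold(eps, a=1.0, b=-0.263, c=0.0):
    """Similarity threshold for an estimation accuracy eps (m), Equation (9)."""
    return a * math.exp(b * eps) + c


def clean(points_with_sim, eps):
    """Keep the points whose similarity reaches the threshold for eps.

    points_with_sim : list of (point, similarity) pairs
    Assumed rule: a point is kept if similarity >= threshold.
    """
    threshold = similarity_threshold(eps)
    return [point for point, sim in points_with_sim if sim >= threshold]
```

For ε = 3 m the threshold is exp(−0.789), roughly 0.45, and a smaller ε gives a stricter (higher) threshold.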
Figure 9. Linear regression results for similarity and position accuracy of the GPS data. (a) ‘Dataset 1’ (7604 GPS points) was collected in an urban area on a shadowed road; (b) ‘Dataset 2’ (6543 GPS points) was collected in an urban area on a semi-shadowed road.
Figure 10. The cleaning results of all vehicle movement data at an estimation accuracy of 3 m. (a) The results of the entire dataset; (b) the results of part of the vehicle movement data.
5.3. Quantitative Evaluation and Discussion
To evaluate the effectiveness of the proposed method, we implemented it on the vehicle movement datasets collected in the real world. The position accuracy of those raw GPS datasets is different since the performance of the GPS devices varies. Based on field testing, the average values of the position accuracy of vehicle trajectories collected by Trimble R9, hand-held GPS, and smartphones are about 5.1 m (4.1), 5.0 m (3.6), and 9.1 m (4.7), respectively. The numerical values in parentheses are the standard deviations of each category. The raw datasets were then cleaned depending on different cleaning thresholds that were determined by the values of estimation accuracy. The experimental results for three different GPS datasets are displayed in Table 1. According to the figures given by Table 1, the accuracy of the cleaned GPS data is greatly improved and its size is reduced compared with the raw dataset, though there is still a difference between the estimation accuracy and the real accuracy of the cleaned GPS data. In addition, based on the experimental results of three tested datasets, the accuracy of the cleaned GPS data also depends on the accuracy of the raw dataset itself. It is still a challenge for us to identify the high-accuracy GPS data from the raw datasets if there is no high-accuracy GPS data in the data in the first place. To further illustrate this point, we analyzed the distribution of accuracy for the cleaned data extracted from the vehicle trajectories collected by Trimble R9, as shown in Figure 11.
Table 1. Evaluation of the cleaned data from different datasets.

Trajectory           Estimation       Proportion of the   Accuracy of GPS data    Accuracy of GPS data
acquisition device   accuracy ε (m)   cleaned data (%)    after cleaning          after cleaning
                                                          (average value, m)      (standard deviation, m)
Trimble R9           2                46.62               2.1                     1.0
Trimble R9           3                58.87               2.9                     1.2
Trimble R9           4                66.30               3.5                     1.8
Trimble R9           5                72.48               4.1                     1.8
Hand-held GPS        2                36.86               2.0                     0.8
Hand-held GPS        3                41.38               2.4                     1.2
Hand-held GPS        4                46.32               2.9                     1.3
Hand-held GPS        5                48.76               3.7                     2.3
Smartphones          2                27.43               3.8                     2.4
Smartphones          3                32.69               4.8                     2.9
Smartphones          4                40.23               5.1                     3.0
Smartphones          5                48.11               5.6                     3.3
Figure 11. Comparison of the position accuracy of the cleaned data and raw GPS data in different estimation accuracy levels.
In Figure 11, the thick green solid line represents the proportion of raw GPS data in several ranges of position accuracy; the other solid lines show the proportion of cleaned data with different estimation accuracies. We observe that the proportion of GPS points that satisfy changing demands for position accuracy generally increases as the estimation accuracy falls. Although the average value and standard deviation of cleaned data based on the estimation accuracy illustrate that the proposed method is effective, a small percentage of low-position accuracy points beyond the estimation accuracy still exists in the cleaned dataset. For example, a very small subset of GPS points with 4 m accuracy is still mixed in the cleaned data when the estimation accuracy is set to about 1 m. Experimental results demonstrate that it is very difficult to find GPS data at 1 m position accuracy. The reason why the proposed method cannot strictly identify data based on the estimation accuracy is complex. The most important issue is that the GPS error follows a stable distribution; raw GPS points of a sub-trajectory include some high-accuracy points and low-accuracy points. The consistency model constructed using the RANSAC algorithm is considered as the position reference to identify the accuracy of GPS data, but sometimes the position of the consistency model may be wrong, especially when there are only low-accuracy points in the sub-trajectory. In addition, the similarity threshold for cleaning is derived from the relation between GPS data and DGPS data, but there are still a lot of uncertainties caused by the collection environment, devices, techniques, etc. In future work, we will address this problem.
To evaluate the performance of the proposed method, we conducted a qualitative comparison of position accuracy for cleaned data based on the methods discussed in the related work section (e.g., the RGCPK [20], the ADOM [19], the Kernel density method [18], and the Kalman filtering method [17]) and our method. These comparisons of the quality of cleaned data used datasets that were collected by vehicles equipped with Trimble R9. Table 2 shows the highest position accuracy results for the cleaned data from the test datasets using these methods.
Table 2. Comparisons of the previous methods for GPS data cleaning (vehicle trajectories collected by Trimble R9).

Methods for GPS data cleaning     Mean value of the accuracy    Standard deviation of the accuracy
                                  of the cleaned data (m)       of the cleaned data (m)
Method proposed in this paper     2.1                           1.0
RGCPK                             2.5                           1.2
ADOM                              4.5                           3.2
KDE                               4.6                           3.3
KF                                3.8                           7.8
According to these results, the method proposed in this paper achieved
the highest extraction accuracy when compared to the four other methods. Although RGCPK
can also extract high-accuracy GPS data from the raw dataset, the results using RGCPK required
prior knowledge to calculate the clustering threshold and filtering standard [20]. The comparison
experiment also shows that methods such as ADOM and KDE (Kernel density method) can only
remove low-density GPS points. However, sometimes the low-density GPS points do not equal the
low-accuracy GPS points, so the cleaning effect is limited [18,19]. The KF (Kalman filtering method)
is effective when the trajectory data are particularly noisy [17]. It is usually used to correct GPS data
rather than to extract high-quality GPS data from raw datasets. Thus, the accuracy of the cleaned data
derived using the filtering method was lowest in comparison with the other methods. Analyzing from
practical applications (e.g., road network generation), the high-quality GPS data found from the raw
datasets by using our method not only improves the position accuracy of road network extraction
results but can also be used to detect lane-based road information.
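As a rough illustration of the two statistics reported in Table 2, the sketch below computes the mean and standard deviation of position errors for a set of cleaned points. It assumes that accuracy is measured as the planar distance in metres between each cleaned point and a matching ground-truth position in a projected coordinate system, and that the population standard deviation is used; the function names and the sample coordinates are hypothetical and are not taken from the experiments.

```python
import math
from statistics import mean, pstdev


def position_errors(cleaned, reference):
    """Planar distance (m) between each cleaned point and its reference position.

    Assumption: both lists hold (x, y) tuples in metres and are index-aligned.
    """
    return [
        math.hypot(cx - rx, cy - ry)
        for (cx, cy), (rx, ry) in zip(cleaned, reference)
    ]


def accuracy_summary(cleaned, reference):
    """Return (mean error, population standard deviation of the error)."""
    errors = position_errors(cleaned, reference)
    return mean(errors), pstdev(errors)


# Hypothetical toy data, not values from the paper.
cleaned = [(0.0, 3.0), (10.0, 0.0), (20.0, 4.0), (30.0, 0.0)]
reference = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 1.0)]

mean_err, std_err = accuracy_summary(cleaned, reference)
print(f"mean error = {mean_err:.2f} m, standard deviation = {std_err:.2f} m")
```

A lower mean indicates that the retained points lie closer to the reference, while a lower standard deviation indicates that the retained points are more uniform in quality.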
6. Conclusions
Nowadays, the growing volume of spatial big data not only creates processing and management
difficulties but also adds uncertainty to knowledge mining. Unlike previous approaches that clean
GPS data based on clustering or filtering algorithms, in this paper, we proposed a method that cleans
GPS data using the movement consistency of the data. Vehicle GPS data cleaning based on
movement consistency consists of two steps: trajectory segmentation and
consistency model construction. First, the whole trajectory is partitioned into a set of sub-trajectories by
characteristic points, which are extracted from the trajectory based on constraints
on moving distance or direction. Then, GPS data are cleaned based on the similarities of GPS points and
the movement consistency model of the sub-trajectory. The movement consistency model is built using
the random sample consensus algorithm based on the high spatial consistency of high-quality GPS
data. Moreover, the accuracy of the cleaned data can be controlled by tuning the threshold on the
similarity between the GPS data and the local consistency model. The proposed method was evaluated in extensive
experiments using GPS trajectories generated by a sample of vehicles over a 7-day period in Wuhan,
China. Although the experimental results show the effectiveness and efficiency of the proposed
method, several shortcomings remain.
When the position accuracy of the raw GPS data is too low, the proposed method cannot find
enough high-quality data in the original database according to the cleaning threshold, which is
calculated by the estimation accuracy. In addition, the GPS data in this paper were collected at a high
sampling rate by test vehicles. In the real world, however, the sampling rate of most GPS data is
not very high, and such sparse datasets make data cleaning more difficult. In future work,
we will address these shortcomings and continue to improve the method proposed here.
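The two steps summarized above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation evaluated in this paper: characteristic points are detected with a direction constraint only, the consistency model of each sub-trajectory is assumed to be a straight line fitted by random sample consensus, and the similarity test is reduced to a perpendicular-distance tolerance. All function names, thresholds, and coordinates are hypothetical.

```python
import math
import random


def heading(p, q):
    """Direction of travel from p to q, in radians."""
    return math.atan2(q[1] - p[1], q[0] - p[0])


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def split_by_direction(points, max_turn):
    """Cut the trajectory wherever the heading changes by more than max_turn."""
    if len(points) < 3:
        return [list(points)]
    segments, current = [], [points[0], points[1]]
    for i in range(2, len(points)):
        turn = angle_diff(heading(points[i - 2], points[i - 1]),
                          heading(points[i - 1], points[i]))
        if turn > max_turn:
            segments.append(current)
            current = [points[i - 1]]  # characteristic point opens the next segment
        current.append(points[i])
    segments.append(current)
    return segments


def point_line_distance(p, a, b):
    """Perpendicular distance from p to the infinite line through a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return abs(dy * (p[0] - a[0]) - dx * (p[1] - a[1])) / math.hypot(dx, dy)


def ransac_inliers(points, tol, iterations, rng):
    """Return the largest set of points within tol of a line through two samples."""
    best = []
    for _ in range(iterations):
        a, b = rng.sample(points, 2)
        if a == b:
            continue  # duplicate coordinates cannot define a line
        inliers = [p for p in points if point_line_distance(p, a, b) <= tol]
        if len(inliers) > len(best):
            best = inliers
    return best


def clean_trajectory(points, max_turn, tol, iterations=50, seed=0):
    """Segment the trajectory, then keep only the RANSAC inliers of each segment."""
    rng = random.Random(seed)
    kept = []
    for segment in split_by_direction(points, max_turn):
        if len(segment) < 3:
            kept.extend(segment)  # too short to fit a model, kept unchanged
        else:
            kept.extend(ransac_inliers(segment, tol, iterations, rng))
    cleaned = []
    for p in kept:  # a shared characteristic point appears twice in a row
        if not cleaned or cleaned[-1] != p:
            cleaned.append(p)
    return cleaned


# Hypothetical L-shaped trajectory (metres) with one drifting point per leg.
trajectory = [(0.0, 0.0), (10.0, 0.0), (20.0, 4.0), (30.0, 0.0), (40.0, 0.0),
              (50.0, 0.0), (50.0, 10.0), (50.0, 20.0), (54.0, 30.0),
              (50.0, 40.0), (50.0, 50.0)]
cleaned = clean_trajectory(trajectory, max_turn=math.radians(60), tol=1.0)
print(cleaned)
```

In this toy case the 90-degree corner is treated as a characteristic point, and the two points that lie 4 m off their legs are discarded. Tightening or loosening `tol` plays the role of the cleaning threshold: a smaller tolerance keeps fewer but more consistent points.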
Acknowledgments: The authors would like to sincerely thank the anonymous reviewers for their constructive
comments and valuable suggestions to improve the quality of this article. This work was funded by the National
Key R&D Program of China (2017YFB0503604, 2016YFE0200400), the National Natural Science Foundation
of China (Nos. 41671442, 41571430, 41271442), and the Joint Foundation of Ministry of Education of China
(6141A02022341).
Author Contributions: Xue Yang and Luliang Tang conceived and designed the algorithms of big trace data
cleaning method presented in this paper. Xue Yang performed the experiments and wrote the paper. Xia Zhang
and Qingquan Li contributed analysis tools and to the paper’s refinement.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. McAfee, A. Big data. The Management Revolution. Harv. Bus. Rev. 2012, 90, 61–67.
2. Saha, B.; Srivastava, D. Data quality: The other face of big data. In Proceedings of the 2014 IEEE
30th International Conference on Data Engineering (ICDE), Chicago, IL, USA, 31 March–4 April 2014;
pp. 1294–1297.
3. Chatzimilioudis, G.; Konstantinidis, A.; Laoudias, C.; Zeinalipour-Yazti, D. Crowdsourcing with smartphones.
IEEE Internet Comput. 2012, 16, 36–44. [CrossRef]
4. Zheng, Y.; Xie, X.; Ma, W.Y. GeoLife: A Collaborative Social Networking Service among User, Location and
Trajectory. IEEE Data Eng. Bull. 2010, 33, 32–39.
5. Van der Spek, S.; van Schaick, J.; de Bois, P.; de Haan, R. Sensing Human Activity: GPS Tracking. Sensors
2009, 9, 3033–3055. [CrossRef] [PubMed]
6. Tang, L.A.; Yu, X.; Gu, Q.; Han, J.; Jiang, G.; Leung, A.; Porta, T.L. A framework of mining trajectories from
untrustworthy data in cyber-physical system. ACM Trans. Knowl. Discov. Data 2015, 9, 16. [CrossRef]
7. Rakthanmanon, T.; Campana, B.; Mueen, A.; Batista, G.; Westover, B.; Zhu, Q.; Zakaria, J.; Keogh, E.
Addressing big data time series: Mining trillions of time series subsequences under dynamic time warping.
ACM Trans. Knowl. Discov. Data 2013, 7, 10. [CrossRef]
8. Castro, P.S.; Zhang, D.; Chen, C.; Li, S.; Pan, G. From taxi GPS traces to social and community dynamics:
A survey. ACM Comput. Surv. (CSUR) 2013, 46, 17. [CrossRef]
9. Tang, L.; Kan, Z.; Zhang, X.; Sun, F.; Yang, X.; Li, Q. A network Kernel Density Estimation for linear features
in space-time analysis of big trace data. Int. J. Geogr. Inf. Sci. 2016, 30, 1717–1737. [CrossRef]
10. Song, J.H.; Jee, G.I. Performance Enhancement of Land Vehicle Positioning Using Multiple GPS Receivers in
an Urban Area. Sensors 2016, 16, 1688. [CrossRef] [PubMed]
11. Ertan, G.; Oğuz, G.; Yüksel, B. Evaluation of Different Outlier Detection Methods for GPS Networks. Sensors
2008, 8, 7344–7358.
12. Yang, X.; Tang, L. Crowdsourcing big trace data filtering: A partition-and-filter model. In Proceedings of
the International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, Prague,
Czech Republic, 11–19 July 2016; pp. 257–262.
13. Fan, H.; Zipf, A.; Fu, Q.; Neis, P. Quality assessment for building footprints data on OpenStreetMap. Int. J.
Geogr. Inf. Sci. 2014, 28, 700–719. [CrossRef]
14. Berti-Equille, L.; Dasu, T.; Srivastava, D. Discovery of complex glitch patterns: A novel approach to
quantitative data cleaning. In Proceedings of the 2011 IEEE 27th International Conference on Data
Engineering, Washington, DC, USA, 11–16 April 2011; pp. 733–744.
15. Hellerstein, J.M. Quantitative Data Cleaning for Large Databases. White Paper, United Nations Economic
Commission for Europe (UNECE). 2008. Available online: http://db.cs.berkeley.edu/jmh/ (accessed on 4
March 2018).
16. Bohannon, P.; Fan, W.; Geerts, F.; Jia, X.; Kementsietsidis, A. Conditional functional dependencies for data
cleaning. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, Istanbul,
Turkey, 11–15 April 2007; pp. 746–755.
17. Cong, G.; Fan, W.; Geerts, F.; Jia, X.; Ma, S. Improving data quality: Consistency and accuracy. In Proceedings
of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, 23–27 September 2007;
pp. 315–326.
18. Chen, Y.; Krumm, J. Probabilistic modeling of traffic lanes from GPS traces. In Proceedings of the 18th
SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA,
2–5 November 2010; pp. 81–88.
19. Wang, J.; Rui, X.; Song, X.; Tan, X.; Wang, C.; Raghavan, V. A novel approach for generating routable road
maps from vehicle GPS traces. Int. J. Geogr. Inf. Sci. 2015, 29, 69–91. [CrossRef]
20. Tang, L.; Yang, X.; Kan, Z.; Li, Q. Lane-Level Road Information Mining from Vehicle GPS Trajectories Based
on Naïve Bayesian Classification. ISPRS Int. J. Geo-Inf. 2015, 4, 2660–2680. [CrossRef]
21. Tang, L.; Yang, X.; Dong, Z.; Li, Q. CLRIC: Collecting Lane-Based Road Information Via Crowdsourcing.
IEEE Trans. Intell. Transp. Syst. 2016, 17, 2552–2562. [CrossRef]
22. Mohamed, A.H.; Schwarz, K.P. Adaptive Kalman filtering for INS/GPS. J. Geodesy 1999, 73, 193–203.
[CrossRef]
23. Jiang, Z.; Shekhar, S. Spatial Big Data Science; Springer International Publishing: New York, NY, USA, 2017.
24. Lee, W.; Krumm, J. Trajectory preprocessing. In Computing with Spatial Trajectories; Zheng, Y., Ed.; Springer:
New York, NY, USA, 2011; pp. 3–33.
25. Gupta, M.; Gao, J.; Aggarwal, C.; Han, J. Outlier detection for temporal data. Synth. Lect. Data Min.
Knowl. Discov. 2014, 5, 129. [CrossRef]
26. Parkinson, B.W.; Enge, P.; Axelrad, P.; Spilker, J.J. Global Positioning System: Theory and Applications I.
1996. Available online: https://ci.nii.ac.jp/naid/10012561387/ (accessed on 4 March 2018).
27. Rasetic, S.; Sander, J.; Elding, J.; Nascimento, M.A. A trajectory splitting model for efficient spatio-temporal
indexing. In Proceedings of the International Conference on Very Large Data Bases, Trondheim, Norway,
30 August–2 September 2005; pp. 934–945.
28. Gonzalez, P.A.; Weinstein, J.S.; Barbeau, S.J.; Labrador, M.A.; Winters, P.L.; Georggi, N.L.; Perez, R.
Automating mode detection for travel behaviour analysis by using global positioning systems-enabled
mobile phones and neural networks. IET Intell. Transp. Syst. 2010, 4, 37–49. [CrossRef]
29. Lee, J.; Han, J.; Li, X. Trajectory outlier detection: A partition-and-detect framework. In Proceedings of the
2008 IEEE 24th International Conference on Data Engineering Workshop, Cancun, Mexico, 7–12 April 2008;
pp. 140–149.
30. Zhang, L.; Wang, Z. Trajectory Partition Method with Time-Reference and Velocity. J. Converg. Inf. Technol.
2011, 6, 134–142.
31. Fisher, N.I. Statistical Analysis of Circular Data; Cambridge University Press: Cambridge, UK, 1993.
32. Nams, V.O. Using animal movement paths to measure response to spatial scale. Oecologia 2005, 143, 179–188.
[CrossRef] [PubMed]
33. Li, X. Using complexity measures of movement for automatically detecting movement types of unknown
GPS trajectories. Am. J. Geogr. Inf. Syst. 2014, 3, 63–74.
34. Derpanis, K.G. Overview of the RANSAC Algorithm. Image Rochester N. Y. 2010, 4, 2–3.
35. Li, H.; Shen, I.F. Similarity measure for vector field learning. In Proceedings of the Advances in Neural
Networks—ISNN 2006, Chengdu, China, 28 May–1 June 2006; pp. 436–441.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
