0160-5682/16
www.palgrave.com/journals
Keywords: big data; predictive analytics; case-based reasoning; data stream; delay detection; real-time analytics
2. Methodology
This section reviews the CBR and presents the proposed
methods for the early detection of vessel delays. The term
‘early’ as used here has the following two meanings: (1) prior
to vessel departure and (2) real time. In Section 2.1, we review
the traditional CBR and propose a new CBR framework to
incorporate real-time data streams. The methods for
detecting a vessel delay prior to vessel departure and in real
time are presented in Sections 2.2 and 2.3, respectively.
2.1. A new CBR that incorporates real-time information
CBR is a reasoning method that solves new problems based on the solutions of similar past cases. According to Aamodt and Plaza (1994), CBR generally operates on a 4R cycle with the steps of Retrieve, Reuse, Revise, and Retain, as depicted in Figure 1. Each step is summarized below.
• Retrieve: Obtain a new case description, measure the similarity of the new case to previous cases stored in a case base (or memory) with their known solutions, and retrieve one or multiple similar cases from the case base.
• Reuse: Attempt to reuse the known solutions of the retrieved cases.
• Revise: Revise the reused solutions by applying them to the initial problem or by assessing them with the assistance of a domain expert.
• Retain: Retain the new case description and its solution as a new case in the case base.
Using the above 4R cycle, the system learns how to solve new cases that may occur in the future. For this reason, CBR is considered to be a subfield of machine learning (Aamodt, 1993).
As depicted in Figure 1, the data source for CBR is a case base, which contains previous cases. In principle, CBR is designed to use only stored historical data; thus, it is inherently required to search all data in the case base in order to retrieve similar cases. Hence, CBR has been recognized as an approach that is not appropriate for the management of data streams, as a data stream, which is an ordered sequence of instances, can only be read once or a small number of times with limited computing or storage capabilities.
The proposed method enhances the traditional CBR in order to employ real-time data streams in addition to historical data through the addition of a new phase called Refine, as depicted in Figure 2. The new CBR cycle then consists of 5Rs: the traditional 4Rs and Refine. The Refine phase conducts real-time analytics by utilizing similar previous cases along with real-time vessel tracking data. Based on the enhanced CBR, this paper proposes a data-driven method for the early detection of vessel delays in which both historical shipping data and real-time S-AIS vessel tracking data are combined.
Figure 2 New CBR cycle of 5Rs: the traditional 4Rs and Refine.
2.2. Detection of vessel delays prior to vessel departure
In the CBR framework, historical shipping data are regarded as the case base. An important step in the CBR cycle is the retrieval of "good" cases that can be used to solve the new case. In this study, a "good case" refers to a case that is similar to the new case to the greatest extent possible. A question then arises regarding how similarity is defined. Many different approaches to assessing similarity have been discussed in the
Sungil Kim et al—Early detection of vessel delays using combined
research communities associated with CBR and machine learning (Lopez De Mantaras et al, 2005). This paper defines similar cases as previous cases whose feature values are identical to those of the new case with regard to selected features, considering that most features are categorical variables or date-time variables in the applications in this paper. Here, the features are known prior to the departure of a vessel; they are not related to the real-time tracking of the vessel. Some examples of features include the vessel name, carrier, consignee, lane code, and arrival port. Usually, a case involves hundreds of features (variables); however, not all are important. In order to select the important features, this paper uses the measure of "variable importance" for the classification and regression tree (CART) algorithm (Breiman et al, 1984) implemented in the rpart function in R 3.1.0 (Therneau and Atkinson, 2014). The measure quantifies the contribution of each variable to the splits in a given tree to minimize the degree of node impurity by awarding a score between 0% and 100%. Larger values of the variable importance correspond to more important variables. The variables whose scores are greater than 1% have been selected as important in this paper.
Assume that $p$ features, i.e., $X_1, \ldots, X_p$, are selected and sorted in a descending order of variable importance. The values of the $p$ selected features of a new case can be obtained as follows:
$$X_1 = x_1,\; X_2 = x_2,\; \ldots,\; X_p = x_p. \tag{1}$$
As explained above, cases similar to the new case can be retrieved from a given case base by searching for previous cases that satisfy the condition in Equation (1). The number of retrieved similar cases may vary from zero to many for each new case. In order to avoid a situation in which no similar case is retrieved, the condition in Equation (1) can be loosened by removing the least important feature. If no similar case is retrieved after this adjustment, the next least important feature can be removed until at least one similar case is retrieved. For example, if $p = 3$, the condition for similar cases can be made less restrictive in the following order:
$$\{X_1 = x_1, X_2 = x_2, X_3 = x_3\} \to \{X_1 = x_1, X_2 = x_2\} \to \{X_1 = x_1, X_3 = x_3\} \to \{X_2 = x_2, X_3 = x_3\} \to \{X_1 = x_1\} \to \{X_2 = x_2\} \to \{X_3 = x_3\}.$$
The extracted similar cases will be used for the detection of vessel delays in real time. This is described in Section 2.3.
Although this paper focuses more on the real-time detection of vessel delays as the vessel is traveling, predicting if a vessel arrives on time prior to its departure may also be of interest. For this purpose, a CBR-based method with a feedback loop is also proposed. Suppose that the prediction of a certain case was incorrect. Such a case should then be revised before it is retained in the case base. Otherwise, the case may be used and therefore provide an incorrect prediction in the future. In order to reflect the performance of the prediction, a feedback loop that considers the outputs from the logistic system and adjusts the parameters of the method is also proposed. Among the 5Rs in Figure 2, the Revise and Retain phases are related to the feedback loop, which is depicted in Figure 3. With the feedback loop, the delay of a vessel is predicted as follows.
Figure 3 A feedback loop for the Revise and Retain phases.
The index set of retrieved similar cases for a particular new case $c$ is denoted as follows:
$$SC_c = \{\, i : (x_i, y_i),\ i = 1, \ldots, \ell, \text{ are retrieved similar cases of a new case } c \,\}, \tag{2}$$
where $x_i = (x_{i1}, \ldots, x_{ip})$ is a vector of the values of the $p$ selected features, where $x_{ij}$ is the value of the $j$th selected feature of the $i$th similar case, and $y_i \in \{0, 1\}$ is the solution for the $i$th similar case: $y_i = 0$ if a vessel arrives on time, and $y_i = 1$ if a vessel arrives late. Note that $y_i$ for $i \in SC_c$ is known, as similar cases are retrieved from the case base (historical shipping data).
Then, the solution for the new case $c$, denoted by $y_c$, is obtained as follows:
$$y_c = \begin{cases} 1, & \text{if } \dfrac{1}{\ell} \sum_{i \in SC_c} w_i\, I(y_i = 1) > \alpha, \\ 0, & \text{otherwise,} \end{cases} \tag{3}$$
where $\ell$ is the size of $SC_c$, $I$ is an indicator function, $\alpha$ is a threshold, and $w_i$ is a weight parameter that is determined from the feedback loop as follows:
$$w_i = \begin{cases} \rho, & \text{if the } i\text{th case is incorrect,} \\ 1, & \text{otherwise,} \end{cases} \tag{4}$$
where $\rho$ ($< 1$) represents the penalty for the $i$th case. Note that $w_i$ is known in advance from the case base.
In general, $\alpha = 0.5$ is reasonable, but it can be set higher for more conservative decisions. Note that a single vessel may be in charge of shipping multiple similar cases. If this is the case, similar cases belonging to the same vessel should be regarded as a single case in the calculation of Equation (3), because the delay of a new case should be determined using the delay status of the vessel to which the new case belongs.
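The retrieval rule around Equation (1) and the weighted vote in Equations (3) and (4) can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the dictionary layout of a case, the function names, the toy case base, and the default penalty value rho = 0.5 are all assumptions made for illustration, and the step of collapsing similar cases from the same vessel into a single case is omitted for brevity.

```python
from itertools import combinations


def relaxation_order(p):
    """Yield feature-index subsets (0 = most important), strictest first.

    For p = 3 the order is (0,1,2), (0,1), (0,2), (1,2), (0,), (1,), (2,),
    which matches the relaxation order given for p = 3 in the text.
    """
    for size in range(p, 0, -1):
        yield from combinations(range(p), size)


def retrieve(case_base, new_x):
    """Return cases that match new_x on the strictest subset with any match."""
    for subset in relaxation_order(len(new_x)):
        hits = [c for c in case_base
                if all(c["x"][j] == new_x[j] for j in subset)]
        if hits:
            return hits
    return []


def predict_delay(similar, alpha=0.5, rho=0.5):
    """Equation (3): return 1 (late) if the weighted late share exceeds alpha.

    Equation (4): a case whose earlier prediction was incorrect gets weight
    rho (< 1); every other case gets weight 1. rho = 0.5 is an assumed value.
    """
    if not similar:
        return None  # no similar case could be retrieved
    score = sum((rho if c["incorrect"] else 1.0) * (c["y"] == 1)
                for c in similar) / len(similar)
    return 1 if score > alpha else 0


# Hypothetical case base: x = (vessel, carrier, arrival port), y = 1 if late.
case_base = [
    {"x": ("V1", "C1", "P1"), "y": 1, "incorrect": False},
    {"x": ("V1", "C1", "P2"), "y": 1, "incorrect": True},
    {"x": ("V2", "C1", "P1"), "y": 0, "incorrect": False},
]

# No exact match exists, so the least important feature (port) is dropped.
similar = retrieve(case_base, ("V1", "C1", "P3"))
print(len(similar), predict_delay(similar))
```

In this toy example the full condition retrieves nothing, the condition on the first two features retrieves two late cases, and the weighted share (1 + 0.5)/2 = 0.75 exceeds the threshold of 0.5, so the new case is flagged as delayed.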
Journal of the Operational Research Society
Figure 5 Illustration of the tracking patterns of delayed cases: (a) delay at the transhipment port, (b) delay near the arrival port. [Both panels plot longitude against time (s); the marked interval is 1 day.]
binary classification studies. The F-measure is defined as the harmonic mean of Precision and Recall, as follows:
$$\text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$
and a larger value of F-measure is preferred. Figure 8 shows the values of the F-measure with various $\alpha$ values, together with the values for Recall and Precision. Based on the results in Figure 8, the value of $\alpha$ can be determined as $\alpha = 0.5$, which corresponds to the largest F-measure value.
In order to evaluate the relative performance of the proposed method, several competing methods are compared in terms of accuracy using the six training-test sets. A decision tree (DT) (Murthy, 1998), K-nearest neighbors (KNN) with $K \in \{5, 10, 20\}$ and the Euclidean distance for measuring the similarity between two cases (Boriah et al, 2008), and linear regression (LR) (Neter et al, 1996) of the lead time are considered. The results are summarized in Table 1, demonstrating that the proposed method outperforms other methods. Across all training-test sets, the proposed method consistently produces higher accuracy levels than the other methods. For the proposed method, Set 1 provides higher accuracy than Set 2, which indicates that the proposed model has been trained better with fewer but more recent data. This is possibly due to the improvement in the data quality in 2013 compared to that in 2012. This indicates that the proposed method can be enhanced by using more up-to-date data.
Next, more intensive experiments are conducted under a setting that can mimic the real situation in which the delay status of a new vessel is predicted using similar past cases in a case base. Afterwards, the vessel becomes an old case and is newly retained in the case base. It is assumed that the period from September 1, 2013 to November 11, 2013 is not yet observed at the beginning of the experiments. That is, only the set of historical shipping data whose arrival dates at arrival ports are prior to September 1, 2013 is regarded as the training data at the beginning, and a total of 60,310 cases whose departure dates vary from September 1, 2013 to November 11, 2013 are predicted. The case with the earliest departure date is predicted first, after which the other cases are predicted sequentially according to their departure dates. Once a case is predicted, the accuracy of the prediction is evaluated and the case is then newly retained in the case base. In this way, the training data are accumulated gradually as new cases are retained after their arrivals, and this process is continued until the last case (i.e., with the departure date of November 11, 2013) is predicted.
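The predict-then-retain procedure described above, together with the F-measure, can be sketched in a few lines of Python. This is an illustrative sketch only: the field names, the toy data, and the stand-in predictor `majority_predict` are hypothetical, and each case is retained immediately after it is predicted rather than after its arrival, which is a simplification of the setting in this paper.

```python
def f_measure(y_true, y_pred):
    """Harmonic mean of Precision and Recall for the late class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # Precision and/or Recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rolling_evaluation(history, stream, predict):
    """Predict cases in departure order, retaining each one afterwards."""
    case_base = list(history)
    y_true, y_pred = [], []
    for case in sorted(stream, key=lambda c: c["departure"]):
        y_hat = predict(case_base, case)
        y_true.append(case["y"])
        y_pred.append(y_hat)
        # Retain the case with its true outcome and a flag recording whether
        # the prediction was incorrect (the input to the revising scheme).
        case_base.append({**case, "incorrect": y_hat != case["y"]})
    return y_true, y_pred


def majority_predict(case_base, case):
    """Hypothetical stand-in predictor: majority vote over same-carrier cases."""
    same = [c["y"] for c in case_base if c["carrier"] == case["carrier"]]
    return int(sum(same) > 0.5 * len(same)) if same else 0


# Toy data; departure is an ordinal day index, y = 1 if the case arrived late.
history = [
    {"carrier": "A", "departure": 0, "y": 1},
    {"carrier": "B", "departure": 0, "y": 0},
]
stream = [
    {"carrier": "A", "departure": 2, "y": 1},
    {"carrier": "B", "departure": 1, "y": 1},
    {"carrier": "B", "departure": 3, "y": 1},
]

y_true, y_pred = rolling_evaluation(history, stream, majority_predict)
print(y_pred, f_measure(y_true, y_pred))
```

Replacing `majority_predict` with a retrieval-and-vote predictor, and comparing runs with and without the `incorrect` flag, would correspond to the scenario comparison discussed in this section.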
In the evaluation of the performance of the proposed method [...] phase but without the revising phase, and (3) predictions with neither the retaining nor the revising phase. In the revising phase, the proposed revising scheme in Equation (4) is used [...]
Figure 8 F-measure values with various values of α (using Set 1a). [Plot of the value of measurement against α; curves: Recall, Precision, F-measure.]
Figure 9 Comparison of the cumulative accuracy level among the three scenarios for the Revise and Refine phases.
the beginning, because very few cases had been newly retained due to their long lead times. From approximately September 20, 2013, the second scenario performed worse than the other two. This occurred because the predicted cases began to move into the case base, but some incorrectly predicted cases distorted the future prediction. Compared to the third scenario, this demonstrated that incorrect information is worse than no information. However, from approximately October 15, 2013, the performance of the third scenario degraded dramatically. This resulted from no new case being retained in the case base, and it led to a lack of capability to predict the delay of BoLs correctly. In contrast, the first scenario with both the revising and retaining phases performed best over the entire time period, although the difference from the second scenario was not significant.
3.2. Real-time detection of vessel delays
The proposed method for the real-time detection of vessel delays is illustrated with real data. For this illustration, a single BoL was selected from among 100,000 BoLs in 2014. The selected BoL was to transport a container from Weihai port, China, to Long Beach port, USA. Figure 10 depicts the vessel tracking data from the port-to-port route on the map. Through the satellite-based automatic identification system (S-AIS), the real-time vessel tracking data were collected from March 15, 2014 to April 4, 2014. On average, the S-AIS sent the vessel tracking data every 30 minutes (varying from approximately every 10 minutes to a couple of hours). For the real-time detection of vessel delays, the above real-time tracking data were used in combination with the historical data of Set 1 as described in Section 3.1.
Before the real-time tracking data were used in the system, the baseline $D_B(t)$ for the route of interest should be constructed with similar historical S-AIS data for on-time arrival vessels using Equation (6). Whenever the tracking data are updated, the delay status is computed immediately using Equation (7). The range of $t$ in $d(t)$ is from the time when the vessel departed from the departure port to the time when it arrived at the arrival port, i.e., $0 \le t \le T$, where $T$ is the lead time of the corresponding baseline.
As described in Section 2.3, as soon as the event of $d(t) > 0$ is detected, the detection system determines the risk of a delay by visualizing the magnitude of $d(t)$ with colors between green and red. In this case, the delay was detected as soon as the vessel was behind the baseline, as depicted in Figure 11. From the colors, it could clearly be seen that the vessel had reduced its speed near the arrival port, Long Beach port, for some reason.
Using the Refine phase proposed in Section 2.3, a refined similar case is identified from the set of similar cases $SC_c$ to predict the time-distance pattern of the target case $c$ in the future and to understand the causes of the delay. Recall that $SC_c$ is selected in advance using the historical shipping data. In Figure 12, the patterns of similar cases in $SC_c$ are represented by green dotted lines, along with the baseline as the black solid line and the target case $c$ as the red dashed line. As expected, the $D(t)$ patterns of similar cases in $SC_c$ are similar to that of the target case $c$ because they have identical values for the important features. From the set of $SC_c$, the refined similar case c is identified using Equation (8) for each $t$ when $d(t) > 0$. In Figure 13, the red circles represent the real-time tracking data and the black line is its baseline. The coordinates of the red circles on the y-axis represent the actual location of the vessel at travel time $t$. We detected $d(t) > 0$ at $t = 1{,}605{,}606$ (in seconds) for the first time; after that time, $d(t)$ remained positive according to the subsequently updated
[...] Equation (8), that is, the best-matching pattern $D_c(t)$ at the time of $t_d = 1{,}669{,}916$. Although c arrived at the arrival port on time, it had been selected because its slowing trend near the [...] $D_c(t)$ using Equation (9), as indicated by the blue dotted line in Figure 13. Moreover, the highly probable cause of a delay was congestion at the arrival port. Because the cause of the delay prior to the vessel arrival is now known, the delay can be [...]
Figure 10 Vessel tracking data from Weihai port, China, to Long Beach port, U.S.A.
Figure 11 Determining the risk of a delay using colors near Long Beach port.
Figure 13 Prediction of the D(t) pattern using a refined similar case: the observed tracking data, its baseline, the pattern of a refined similar case, and the predicted pattern of the target case are indicated by red circle, black solid line, blue solid line, and [...] [Plot of longitude against time (s).]
scheme for the Revise and Retain phases. Our numerical study confirmed that an elaborated feedback scheme is required for efficient information flows in logistics systems. In order to detect the delay of a vessel in real time, we proposed a real-time analytics-based approach that combines historical data and real-time data. Through the proposed new phase of CBR called Refine, the delay can be detected in real time and the movement patterns of a vessel can be predicted until its arrival. Moreover, highly likely causes of a delay can be identified; thus, the delay at the arrival port can be appropriately managed. The primary contribution of this paper is that it proposes a data-driven method that detects vessel delays in multiple stages (prior to vessel departure and on the open seas) using both historical shipping data and real-time tracking data in the framework of refined case-based reasoning. In this paper, we only utilize historical shipping data and real-time S-AIS vessel tracking data. Including other various types of data such as weather and social data will be investigated in future research.

Acknowledgments—This research was partially supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2015R1C1A1A02037090). Also, this work was supported by the 2016 Research Fund (1.160093.01) of UNIST (Ulsan National Institute of Science & Technology).

References
Aamodt A (1993). Explanation-driven retrieval, reuse, and learning of cases. In: EWCBR-93: First European Workshop on Case-Based Reasoning, pp. 279–284.
Aamodt A and Plaza E (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications 7(1):39–59.
Boriah S, Chandola V and Kumar V (2008). Similarity measures for categorical data: A comparative evaluation. In: Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, pp. 243–254.
Breiman L, Friedman J, Stone CJ and Olshen RA (1984). Classification and Regression Trees. CRC Press: Boca Raton.
De Mantaras RL, McSherry D, Bridge D, Leake D, Smyth B, Craw S, Faltings B, Maher ML, Cox MT, Forbus K et al. (2005). Retrieval, reuse, revision and retention in case-based reasoning. The Knowledge Engineering Review 20(3):215–240.
Eriksen T, Høye G, Narheim B and Meland BJ (2006). Maritime traffic monitoring using a space-based AIS receiver. Acta Astronautica 58(10):537–549.
Eriksen T, Skauen AN, Narheim B, Helleren O, Olsen Ø and Olsen RB (2010). Tracking ship traffic with space-based AIS: Experience gained in first months of operations. In: Waterside Security Conference (WSS), 2010 International, IEEE, pp. 1–8.
Fancello G, Pani C, Pisano M, Serra P, Zuddas P and Fadda P (2011). Prediction of arrival times and human resources allocation for container terminal. Maritime Economics & Logistics 13(2):142–173.
Fawcett T (2006). An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874.
Feixiang Z (2011). Mining ship spatial trajectory patterns from AIS database for maritime surveillance. In: 2nd IEEE International Conference on Emergency Management and Management Sciences (ICEMMS), 2011, IEEE, pp. 772–775.
Ge Y, Kong PY, Tham CK and Pathmasuntharam JS (2007). Connectivity and route analysis for a maritime communication network. In: 6th International Conference on Information, Communications & Signal Processing, 2007, IEEE, pp. 1–5.
Holsten S (2009). Global maritime surveillance with satellite-based AIS. In: OCEANS 2009-EUROPE, IEEE, pp. 1–4.
Høye GK, Eriksen T, Meland BJ and Narheim BT (2008). Space-based AIS for global maritime traffic monitoring. Acta Astronautica 62(2):240–245.
Murthy SK (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery 2(4):345–389.
Neter J, Kutner MH, Nachtsheim CJ and Wasserman W (1996). Applied Linear Statistical Models, vol. 4. Irwin: Chicago.
Pallotta G, Vespe M and Bryan K (2013a). Traffic route extraction and anomaly detection from AIS data. In: International COST MOVE Workshop on Moving Objects at Sea, Brest, France.
Pallotta G, Vespe M and Bryan K (2013b). Vessel pattern knowledge discovery from AIS data: A framework for anomaly detection and route prediction. Entropy 15(6):2218–2245.
Pani C, Fadda P, Fancello G, Frigau L and Mola F (2014). A data mining approach to forecast late arrivals in a transhipment container terminal. Transport 29(2):175–184.
Pani C, Vanelslander T, Fancello G and Cannas M (2015). Prediction of late/early arrivals in container terminals–A qualitative approach. European Journal of Transport and Infrastructure Research 15(4):536–550.
Ristic B, La Scala B, Morelande M and Gordon N (2008). Statistical analysis of motion patterns in AIS data: Anomaly detection and motion prediction. In: 11th International Conference on Information Fusion, 2008, IEEE, pp. 1–7.
Riveiro M and Falkman G (2011). The role of visualization and interaction in maritime anomaly detection. In: IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, vol. 7868.
Therneau TM and Atkinson EJ (2014). An introduction to recursive partitioning using the RPART routines. Rochester: Mayo Foundation.
Vespe M, Visentini I, Bryan K and Braca P (2012). Unsupervised learning of maritime traffic patterns for anomaly detection. In: Data Fusion & Target Tracking Conference (DF&TT 2012): Algorithms & Applications, 9th IET, IET, pp. 1–5.

Received 18 March 2015; accepted 4 August 2016