
Journal of the Operational Research Society (2016) © 2016 The Operational Research Society. All rights reserved.

0160-5682/16

www.palgrave.com/journals

Early detection of vessel delays using combined historical and real-time information
Sungil Kim1, Heeyoung Kim2* and Yongro Park3
1 School of Management Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea; 2 Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea; and 3 Data Analytics Lab, R&D Center, Samsung SDS, Seoul, Republic of Korea
In ocean transportation, detecting vessel delays in advance or in real time is important for fourth-party logistics
(4PL) in order to fulfill the expectations of customers and to help customers reduce delay costs. However, the
early detection of vessel delays faces the challenges of numerous uncertainties, including weather conditions, port
congestion, booking issues, and route selection. Recently, 4PLs have adopted advanced tracking technologies
such as satellite-based automatic identification systems (S-AISs) that produce a vast amount of real-time vessel
tracking information, thus providing new opportunities to enhance the early detection of vessel delays. This paper
proposes a data-driven method for the early detection of vessel delays: in our new framework of refined case-
based reasoning (CBR), real-time S-AIS vessel tracking data are utilized in combination with historical shipping
data. The proposed method also provides a process of analyzing the causes of delays by matching the tracking
patterns of real-time shipments with those of historical shipping data. Real data examples from a logistics
company demonstrate the effectiveness of the proposed method.
Journal of the Operational Research Society (2016). doi:10.1057/s41274-016-0104-4

Keywords: big data; predictive analytics; case-based reasoning; data stream; delay detection; real-time analytics

*Correspondence: Heeyoung Kim, Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea. E-mail: heeyoungkim@kaist.ac.kr

1. Introduction

In ocean transportation, avoiding vessel delays is a critical issue for fourth-party logistics (4PL) to allow them to fulfill the expectations of customers and to help their customers reduce the costs incurred through vessel delays. However, shipping and delivering large, long-lead-time orders on time is a challenge in the port and waterborne transport sector, as ocean transportation functions under numerous uncertainties, such as weather conditions, port congestion, booking issues, route selection, and others. While not all delays can be prevented, they can be minimized through appropriate management strategies, such as delay forecasting and real-time monitoring. In order to manage delays and enhance the shipping visibility of logistics systems, 4PLs have recently adopted advanced technologies such as satellite-based automatic identification systems (S-AISs), which produce a vast amount of real-time vessel tracking information (Høye et al, 2008; Pallotta et al, 2013b).
According to the logistics system of the company that is considered in this paper, at any given time, 1400 to 1500 vessels are in service, handling around 12,000 bills of lading (BoLs) through approximately 1200 routes. A bill of lading is a document issued by a carrier that details a shipment of merchandise and provides a title for that shipment to a specified party. In this paper, BoL documents are regarded as shipping data. From these vessels, the logistics system should manage approximately 360,000 S-AIS tracking points per day. The system manages more than 100,000 BoLs per year, and the total number of routes related to those BoLs exceeds 4000. These numbers increase over time.
However, the collected tracking data of such a large size have rarely been used for the early detection of delays, whereas there have been many studies of surveillance for safety of vessels and protection of the environment (Holsten, 2009), traffic monitoring for improved port planning and operation (Eriksen et al, 2006; Eriksen et al, 2010), anomaly detections in traffic patterns for risk management activities (Vespe et al, 2012; Pallotta et al, 2013a), and route pattern recognition for efficient operations management (Ge et al, 2007; Ristic et al, 2008; Feixiang, 2011). Furthermore, in the literature, there has been very little research on vessel delay predictions. There have been a few attempts to solve these problems by exploring and understanding historical arrivals using machine learning algorithms. Fancello et al (2011) and Pani et al (2014) used a neural network algorithm and a regression tree algorithm, respectively, to forecast late arrivals
in a Mediterranean port. Pani et al (2015) conducted a comparative study of three algorithms (logistic regression, classification tree, and random forest) using vessel arrival data on delays/advances at the individual vessel level.
In this paper, we propose a new data-driven method for the early detection of vessel delays using real-time S-AIS vessel tracking data. The proposed method utilizes real-time vessel tracking data in combination with historical shipping data to detect vessel delays in advance or in real time, for which a modified framework of case-based reasoning (CBR) is proposed. The proposed method differs from earlier methods of vessel arrival prediction in that it uses real-time vessel tracking data in addition to historical shipping data, whereas previous methods only use historical shipping data.

Figure 1 4R (Retrieve, Reuse, Revise, and Retain) of the traditional CBR cycle.

2. Methodology
This section reviews CBR and presents the proposed methods for the early detection of vessel delays. The term ‘early’ as used here has two meanings: (1) prior to vessel departure and (2) in real time. In Section 2.1, we review the traditional CBR and propose a new CBR framework that incorporates real-time data streams. The methods for detecting a vessel delay prior to vessel departure and in real time are presented in Sections 2.2 and 2.3, respectively.

2.1. A new CBR that incorporates real-time information

CBR is a reasoning method that solves new problems based on the solutions of similar past cases. According to Aamodt and Plaza (1994), CBR generally operates on a 4R cycle with the steps of Retrieve, Reuse, Revise, and Retain, as depicted in Figure 1. Each step is summarized below.
• Retrieve: Obtain a new case description, measure the similarity of the new case to previous cases stored in a case base (or memory) with their known solutions, and retrieve one or multiple similar cases from the case base.
• Reuse: Attempt to reuse the known solutions of the retrieved cases.
• Revise: Revise the reused solutions by applying them to the initial problem or by assessing them with the assistance of a domain expert.
• Retain: Retain the new case description and its solution as a new case in the case base.
Using the above 4R cycle, the system learns how to solve new cases that may occur in the future. For this reason, CBR is considered to be a subfield of machine learning (Aamodt, 1993).
As depicted in Figure 1, the data source for CBR is a case base, which contains previous cases. In principle, CBR is designed to use only stored historical data; thus, it inherently has to search all data in the case base in order to retrieve similar cases. Hence, CBR has been regarded as an approach that is not appropriate for the management of data streams, because a data stream, which is an ordered sequence of instances, can only be read once or a small number of times under limited computing or storage capabilities.
The proposed method enhances the traditional CBR in order to employ real-time data streams in addition to historical data through the addition of a new phase called Refine, as depicted in Figure 2. The new CBR cycle then consists of 5Rs: the traditional 4Rs and Refine. The Refine phase conducts real-time analytics by utilizing similar previous cases along with real-time vessel tracking data. Based on the enhanced CBR, this paper proposes a data-driven method for the early detection of vessel delays in which both historical shipping data and real-time S-AIS vessel tracking data are combined.

Figure 2 New CBR cycle of 5Rs: the traditional 4Rs and Refine.

2.2. Detection of vessel delays prior to vessel departure

In the CBR framework, historical shipping data are regarded as the case base. An important step in the CBR cycle is the retrieval of "good" cases that can be used to solve the new case. In this study, a "good case" refers to a case that is similar to the new case to the greatest extent possible. A question then arises regarding how similarity is defined. Many different approaches to assessing similarity have been discussed in the
research communities associated with CBR and machine learning (Lopez De Mantaras et al, 2005). This paper defines similar cases as previous cases whose feature values are identical to those of the new case with regard to selected features, considering that most features are categorical variables or date-time variables in the applications in this paper. Here, the features are known prior to the departure of a vessel; they are not related to the real-time tracking of the vessel. Some examples of features include the vessel name, carrier, consignee, lane code, and arrival port. Usually, a case involves hundreds of features (variables); however, not all are important. In order to select the important features, this paper uses the measure of "variable importance" for the classification and regression tree (CART) algorithm (Breiman et al, 1984) implemented in the rpart function in R 3.1.0 (Therneau and Atkinson, 2014). The measure quantifies the contribution of each variable to the splits in a given tree to minimize the degree of node impurity by awarding a score between 0% and 100%. Larger values of the variable importance correspond to more important variables. The variables whose scores are greater than 1% have been selected as important in this paper.
Assume that p features, i.e., $X_1, \ldots, X_p$, are selected and sorted in descending order of variable importance. The values of the p selected features of a new case can be obtained as follows:

$$X_1 = x_1,\; X_2 = x_2,\; \ldots,\; X_p = x_p. \tag{1}$$

As explained above, cases similar to the new case can be retrieved from a given case base by searching for previous cases that satisfy the condition in Equation (1). The number of retrieved similar cases may vary from zero to many for each new case. In order to avoid a situation in which no similar case is retrieved, the condition in Equation (1) can be loosened by removing the least important feature. If no similar case is retrieved after this adjustment, the next least important feature can be removed until at least one similar case is retrieved. For example, if $p = 3$, the condition for similar cases can be made less restrictive in the following order: $\{X_1 = x_1, X_2 = x_2, X_3 = x_3\} \to \{X_1 = x_1, X_2 = x_2\} \to \{X_1 = x_1, X_3 = x_3\} \to \{X_2 = x_2, X_3 = x_3\} \to \{X_1 = x_1\} \to \{X_2 = x_2\} \to \{X_3 = x_3\}$. The extracted similar cases will be used for the detection of vessel delays in real time. This is described in Section 2.3.
Although this paper focuses more on the real-time detection of vessel delays as the vessel is traveling, predicting whether a vessel arrives on time prior to its departure may also be of interest. For this purpose, a CBR-based method with a feedback loop is also proposed. Suppose that the prediction of a certain case was incorrect. Such a case should then be revised before it is retained in the case base. Otherwise, the case may be reused and produce an incorrect prediction in the future. In order to reflect the performance of the prediction, a feedback loop that considers the outputs from the logistics system and adjusts the parameters of the method is also proposed. Among the 5Rs in Figure 2, the Revise and Retain phases are related to the feedback loop, which is depicted in Figure 3. With the feedback loop, the delay of a vessel is predicted as follows.
The index set of retrieved similar cases for a particular new case c is denoted as follows:

$$SC_c = \{\, i : (x_i, y_i),\ i = 1, \ldots, \ell, \text{ are retrieved similar cases of a new case } c \,\}, \tag{2}$$

where $x_i = (x_{i1}, \ldots, x_{ip})$ is a vector of the values of the p selected features, where $x_{ij}$ is the value of the jth selected feature of the ith similar case, and $y_i \in \{0, 1\}$ is the solution for the ith similar case: $y_i = 0$ if a vessel arrives on time, and $y_i = 1$ if a vessel arrives late. Note that $y_i$ for $i \in SC_c$ is known, as similar cases are retrieved from the case base (historical shipping data).
Then, the solution for the new case c, denoted by $y_c$, is obtained as follows:

$$y_c = \begin{cases} 1, & \text{if } \dfrac{1}{\ell} \displaystyle\sum_{i \in SC_c} w_i \cdot I(y_i = 1) > \alpha, \\ 0, & \text{otherwise,} \end{cases} \tag{3}$$

where $\ell$ is the size of $SC_c$, I is an indicator function, $\alpha$ is a threshold, and $w_i$ is a weight parameter that is determined from the feedback loop as follows:

$$w_i = \begin{cases} q, & \text{if the } i\text{th case is incorrect,} \\ 1, & \text{otherwise,} \end{cases} \tag{4}$$

where $q$ ($< 1$) represents the penalty for the ith case. Note that $w_i$ is known in advance from the case base.
In general, $\alpha = 0.5$ is reasonable, but it can be set higher for more conservative decisions. Note that a single vessel may be in charge of shipping multiple similar cases. If this is the case, similar cases belonging to the same vessel should be regarded as a single case in the calculation of Equation (3), because the delay of a new case should be determined using the delay status of the vessel to which the new case belongs.

Figure 3 A feedback loop for the Revise and Retain phases.
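
The relaxation of the matching condition in Equation (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `relaxation_order` and `retrieve_similar_cases`, the dictionary representation of a case, and the toy field names are all hypothetical. Subsets are tried from the full feature set down to single features, keeping more important features first, which reproduces the order given for p = 3.

```python
from itertools import combinations


def relaxation_order(features):
    """Yield feature subsets from most to least restrictive.

    `features` must be sorted in descending order of variable importance.
    """
    for size in range(len(features), 0, -1):
        for subset in combinations(features, size):
            yield subset


def retrieve_similar_cases(case_base, new_case, features):
    """Return (similar_cases, subset_used) for the first subset that matches."""
    for subset in relaxation_order(features):
        matches = [
            case for case in case_base
            if all(case.get(name) == new_case.get(name) for name in subset)
        ]
        if matches:
            return matches, subset
    return [], ()  # nothing matched even on a single feature


# Hypothetical toy data; the field names are illustrative only.
case_base = [
    {"vessel": "A", "carrier": "X", "arrival_port": "P1", "late": 1},
    {"vessel": "B", "carrier": "X", "arrival_port": "P2", "late": 0},
]
new_case = {"vessel": "C", "carrier": "X", "arrival_port": "P2"}
similar, used = retrieve_similar_cases(
    case_base, new_case, ["vessel", "carrier", "arrival_port"]
)
print(used, [case["vessel"] for case in similar])
```

Exact matching is used because the text defines similar cases by identical feature values; a distance-based similarity would need a different retrieval routine.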
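
The weighted vote in Equations (3) and (4) can be illustrated with a short Python sketch. The function name `predict_delay` and the representation of each similar case as a pair (y_i, was_incorrect) are assumptions made for this example. The step of collapsing cases that belong to the same vessel into a single case is left out, because the text does not specify how the collapsed case is formed.

```python
def predict_delay(similar_cases, alpha=0.5, q=0.5):
    """Apply the weighted vote of Equation (3).

    similar_cases: list of (y_i, was_incorrect) pairs, where y_i is 1 for a
    late arrival and 0 for an on-time arrival, and was_incorrect marks a case
    whose own prediction was wrong (weight q < 1, Equation (4)).
    Returns (y_c, score), where score is the weighted share of late cases.
    """
    if not similar_cases:
        raise ValueError("at least one similar case is required")
    ell = len(similar_cases)
    total = 0.0
    for y_i, was_incorrect in similar_cases:
        w_i = q if was_incorrect else 1.0   # Equation (4)
        total += w_i * (1 if y_i == 1 else 0)
    score = total / ell
    return (1 if score > alpha else 0), score


# Hypothetical example: three late cases (one previously mispredicted)
# and one on-time case.
cases = [(1, False), (1, True), (0, False), (1, False)]
print(predict_delay(cases))             # default threshold alpha = 0.5
print(predict_delay(cases, alpha=0.7))  # more conservative threshold
```

Raising `alpha` makes the rule flag fewer cases as delayed, which matches the remark that a higher threshold gives more conservative decisions.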

2.3. Real-time detection of vessel delays

Combining the information from past cases with other information is an important issue in CBR (Aamodt and Plaza, 1994). In particular, real-time data streams are difficult to combine, because the computing and storage resources are too limited to manage real-time data streams in practice. Due to this difficulty, real-time analytics procedures should be simple and fast.
This paper proposes an enhanced CBR with the addition of a new phase, Refine, to the traditional CBR. That is, the new CBR cycle consists of 5Rs: the traditional 4Rs and Refine. The Refine phase conducts real-time analytics by utilizing both similar previous cases and real-time vessel tracking data. Based on the enhanced CBR, this paper proposes a data-driven method for the early detection of vessel delays combining both historical shipping data and real-time S-AIS vessel tracking data. The proposed method consists of four steps, as illustrated in the middle column of Figure 4: (1) comparing the travel distance, (2) determining the risk of a delay, (3) refining similar cases, and (4) analyzing the causes of delays. Each step is detailed in the following sections.

Figure 4 Flowchart of the Refine phase for real-time analytics.

2.3.1. Comparing the travel distance In the first step, the real-time travel distance of a new vessel is compared with the average travel distance of on-time arrival vessels. Note that the travel distance of a new vessel is updated in real time when the real-time tracking data arrive, and the average travel distance of the on-time arrival vessels, referred to as the baseline, is calculated for each port-to-port route in advance using the historical tracking data. Therefore, the travel distance of case k from a departure port is defined using

$$D_k(t) \in \mathbb{R}, \tag{5}$$

where t is the travel time ($0 \le t \le T$) from the departure port and T is the total travel time (or lead time) of the route to which case k belongs. The total travel time of each route varies from within one day to 20 to 30 days. For the route to which case k belongs, the baseline $D_B(t)$ of the route can be constructed using historical tracking data as follows:

$$\forall t,\quad D_B(t) = \frac{1}{n} \sum_{k} D_k(t), \quad \text{for } k \in OC, \tag{6}$$

where OC is the index set of the on-time arrival cases that belong to the route of interest and n is the size of OC. Note that the baseline $D_B(t)$ is a route-specific function of time and, therefore, needs to be determined for each route.
For a new vessel, the travel distance is updated in real time from continuously incoming tracking data. The time interval of incoming tracking data varies from a few minutes to hours. Usually, it is more frequent near ports. Consider the travel distance, $D_c(t)$, of a target case c and its baseline $D_B(t)$ in Equation (6). Then, the difference between $D_c(t)$ and $D_B(t)$ presents the delay status in real time. For travel time t, the delay status, $d(t) \in \mathbb{R}$, is defined as follows:

$$d(t) = \max\bigl(0,\; D_B(t) - D_c(t)\bigr). \tag{7}$$

A positive value of $d(t)$ indicates that the new vessel is behind schedule, while a zero value of $d(t)$ indicates that it travels normally or is ahead of schedule.

2.3.2. Determining the risk of a delay In the second step, the detection system visualizes $d(t)$ in Equation (7) and determines the risk if the difference becomes severe. Here, "severe" refers to a delay of more than one day. Depending on the magnitude of $d(t)$, the risk of a delay can be presented in different colors: green for $d(t) = 0$, colors between green and red for $0 < d(t) < d_1$, red for $d(t) \ge d_1$, and black for $t > T_B$, where $d_1$ and $T_B$ denote a one-day delay and the total travel time, respectively, according to the baseline $D_B(t)$ in Equation (6). Therefore, $t > T_B$ indicates that the vessel of the new case c is en route despite the fact that it should have already arrived at the arrival port according to its baseline. This coloring approach helps practitioners notice the degree of a delay as it occurs (Riveiro and Falkman, 2011).

2.3.3. Refining similar cases In the third step, if the system determines the risk, the case whose time-distance tracking pattern is most similar to that of the new case c is extracted from $SC_c$ in Equation (2). Note that $SC_c$, which is the set of similar cases for c, is identified in advance from the historical data, as described in Section 2.2; moreover, the tracking pattern of each similar case is obtained in advance from the historical tracking data. The most similar case in terms of the time-distance pattern is located among the set of similar cases that were already identified as similar in terms of the selected features. That is, similar cases that are identified using the historical tracking data are refined using additional real-time tracking data in the Refine phase.
Formally, the refined similar case, which is denoted by $c^*$, is found by solving the following optimization problem:

$$c^* = \underset{k \in SC_c}{\arg\min}\ \min_{-t_d + r \,\le\, d \,\le\, T_B - t_d}\ \sum_{t_d - r \,\le\, t \,\le\, t_d} \bigl(D_k(t + d) - D_c(t)\bigr)^2, \tag{8}$$

where $SC_c$ and $D_k$ are defined in Equations (2) and (5), respectively, $T_B$ denotes the total travel time of the baseline $D_B(t)$ from Equation (6), $t_d$ is the time at which the delay of c is detected according to Equation (7), d is a time-shift parameter that is considered to align two tracking patterns, and r is the duration of time for which the tracking pattern of c is compared with that of the cases in $SC_c$. For example, r can be chosen as the travel time for on-time arrival vessels (according to the baseline $D_B(t)$) from a transshipment port to an arrival port. Then, $D_{c^*}(t)$ is the most similar pattern for $D_c(t)$ during the time period $t_d - r \le t \le t_d$. Note that $c^*$ is specific to $t_d$. The most similar case for c can change over time as more updated tracking information is incorporated.
In principle, $D_k(t)$ is the time-distance pattern and the slope of $D_k(t)$, i.e., $dD_k(t)/dt$, indicates the speed of a vessel at travel time t. Figure 5 illustrates the tracking patterns of two delayed cases that have an identical baseline $D_B(t)$. In Figure 5, the longitude is regarded as the unit of $D(t)$ for simplicity, as vessels move along lines of longitude in both examples. $D_B(t)$ and $D_c(t)$ are represented using black and red lines, respectively, and the upper and lower blue dotted lines represent the arrival and transshipment ports, respectively. It can be seen that the current shipment (red line) arrives later than predicted by the baseline information (black line) because it intersects the top dashed line of the arrival port to the right of the black line. The shipment in Figure 5(a) arrived late because the transshipment was later than predicted by the baseline information. In contrast, the shipment in Figure 5(b) did not have a problem in transshipping, but the vessel reduced its speed near the arrival port. These examples demonstrate that different circumstances for delays can be understood from different tracking patterns. By analyzing the tracking patterns, the causes of the delays can be categorized into several types for each route. If a similar case whose $D(t)$ pattern behaved similarly in the past can be located, it will be helpful to understand the cause of the delay and to predict the behavior of the tracking patterns in the future.

Figure 5 Illustration of the tracking patterns of delayed cases: (a) delay at the transshipment port, (b) delay near the arrival port. Both panels plot longitude against time (s), with a one-day interval marked.

2.3.4. Analyzing the causes of a delay In the last step, the cause of the delay of target case c is analyzed by looking at the refined similar case $c^*$ in Equation (8). By leveraging the information in $c^*$, including the cause of the delay, the impact of the delay can be reduced. In general, international freight is intermodal freight transport; that is, it involves the transportation of freight in an intermodal container or vehicle, using multiple modes of transportation (e.g., rail, ship, truck). Therefore, ocean freight is part of an end-to-end intermodal freight transport process. By analyzing the cause of the delay, the impact of the delay on the subsequent mode of transportation can be minimized. Moreover, predictive analytics can be conducted in real time. At time $t_d$ when the delay is detected, we can predict the future distance of c, i.e., $D_c(t_d + h)$ for $h > 0$, using the following equation:

$$D_c(t_d + h) = D_{c^*}(t_d + h + d^*), \tag{9}$$

where

$$d^* = \underset{-t_d + r \,\le\, d \,\le\, T_B - t_d}{\arg\min}\ \sum_{t_d - r \,\le\, t \,\le\, t_d} \bigl(D_{c^*}(t + d) - D_c(t)\bigr)^2. \tag{10}$$

Here, r and $T_B$ are as defined after Equation (8). Note that $d^*$ in Equation (10) is a time-shift parameter that minimizes the distance between $D_c(t)$ and $D_{c^*}(t + d^*)$.
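
Equations (5) to (7) and the coloring rule of Section 2.3.2 can be sketched in Python as below. This is only an illustration under stated assumptions: tracks are hypothetical lists of (time, distance) points sorted by time, linear interpolation is assumed so that tracks sampled at irregular intervals can be averaged at a common time, and the one-day threshold `d1` is passed in as a parameter because the text does not say how it is derived from the baseline. Giving black priority over the other colors is also an assumption.

```python
from bisect import bisect_right


def distance_at(track, t):
    """Linearly interpolate the travel distance D(t) from sorted (time, distance) points."""
    times = [point[0] for point in track]
    if t <= times[0]:
        return float(track[0][1])
    if t >= times[-1]:
        return float(track[-1][1])
    j = bisect_right(times, t)           # first point strictly after t
    (t0, d0), (t1, d1) = track[j - 1], track[j]
    return d0 + (d1 - d0) * (t - t0) / (t1 - t0)


def baseline_at(on_time_tracks, t):
    """Equation (6): average distance of the on-time cases of a route at time t."""
    return sum(distance_at(track, t) for track in on_time_tracks) / len(on_time_tracks)


def delay_status(on_time_tracks, new_track, t):
    """Equation (7): d(t) = max(0, D_B(t) - D_c(t))."""
    return max(0.0, baseline_at(on_time_tracks, t) - distance_at(new_track, t))


def risk_color(delay, t, d1, total_time):
    """Map the delay status to the color classes of Section 2.3.2."""
    if t > total_time:
        return "black"                   # should already have arrived
    if delay == 0:
        return "green"
    if delay >= d1:
        return "red"
    return "between green and red"


# Hypothetical route with two on-time cases and one slower new vessel.
on_time = [[(0, 0), (10, 100)], [(0, 0), (10, 80)]]
new_vessel = [(0, 0), (10, 60)]
d = delay_status(on_time, new_vessel, 5)
print(d, risk_color(d, t=5, d1=20.0, total_time=10))
```

Because each update only needs the latest tracking point and a precomputed baseline, the check stays cheap, in line with the requirement that real-time procedures be simple and fast.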

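
The Refine step in Equations (8) to (10) can be sketched as a brute-force search over cases and time shifts. The sketch below assumes, purely for illustration, that every tracking pattern has been resampled onto an integer time grid 0, 1, ..., T_B, so that a pattern is a list indexed by time step; the function names and the toy tracks are hypothetical. Ties are broken in favor of the smallest shift, which the text does not specify.

```python
def shift_cost(candidate, target, t_d, r, d):
    """Sum of squared differences over t_d - r <= t <= t_d for time shift d."""
    return sum((candidate[t + d] - target[t]) ** 2 for t in range(t_d - r, t_d + 1))


def best_shift(candidate, target, t_d, r, T_B):
    """Inner minimization: search shifts -t_d + r <= d <= T_B - t_d."""
    if not 0 <= r <= t_d <= T_B:
        raise ValueError("expected 0 <= r <= t_d <= T_B")
    shifts = range(-t_d + r, T_B - t_d + 1)
    d_star = min(shifts, key=lambda d: shift_cost(candidate, target, t_d, r, d))
    return d_star, shift_cost(candidate, target, t_d, r, d_star)


def refine(similar_tracks, target, t_d, r, T_B):
    """Equation (8): pick the case c* (and its shift d*) with the smallest cost."""
    results = {k: best_shift(track, target, t_d, r, T_B)
               for k, track in similar_tracks.items()}
    c_star = min(results, key=lambda k: results[k][1])
    return c_star, results[c_star][0]


def predict_distance(similar_tracks, target, t_d, r, T_B, h):
    """Equation (9): D_c(t_d + h) = D_{c*}(t_d + h + d*).

    The caller must make sure t_d + h + d* is within the stored pattern.
    """
    c_star, d_star = refine(similar_tracks, target, t_d, r, T_B)
    return similar_tracks[c_star][t_d + h + d_star]


# Hypothetical patterns on a grid 0..7; the new case has stalled since t = 2.
similar_tracks = {
    "k1": [0, 10, 20, 30, 30, 30, 40, 50],
    "k2": [0, 10, 20, 20, 20, 20, 30, 40],
}
target = [0, 10, 20, 20, 20]            # observed up to t_d = 4
print(refine(similar_tracks, target, t_d=4, r=2, T_B=7))
print(predict_distance(similar_tracks, target, t_d=4, r=2, T_B=7, h=2))
```

The search is exhaustive, which is affordable here because the candidate set has already been narrowed to the similar cases retrieved in Section 2.2.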
3. Real data examples

The proposed method is applied to real data collected from the integrated logistics workplace platform of a logistics company over a period of 28 months, from January of 2012 to April of 2014. The company launched this platform and put it into use in 2012. Since that time, the data collection scheme has improved and become more systematic, and the data quality has also improved in terms of data accuracy and completeness. The data consist of 184,083 cases based on distinct BoLs containing 85 variables. These BoLs are associated with ocean transportation along more than 4000 local and international routes worldwide. With the real data, the proposed methods for predicting delays prior to vessel departure (as proposed in Section 2.2) and for detecting delays in real time (as proposed in Section 2.3) are validated in Sections 3.1 and 3.2, respectively.

3.1. Detection of vessel delays prior to vessel departure

The real data described above are divided into two sets based on the issue dates of the BoLs. These are Set A for training, from January of 2012 to February of 2014, and Set B for testing, from March of 2014 to April of 2014. Using Set A, two training sets are prepared in order to determine if the predictive power of the proposed method is improved using more up-to-date information. One is the total amount of data of Set A for 26 months, called Set 2, and the other is a subset of Set A in which older data issued in 2012 are removed; this is referred to as Set 1. For testing, three sets, each with 4000 cases, are randomly sampled from Set B; they are called Set a, Set b, and Set c. Thus, a total of six combinations of training and test sets are considered, referred to as Set 1a, Set 1b, Set 1c, Set 2a, Set 2b, and Set 2c.
A single BoL is defined as delayed if its actual arrival date is later than the expected arrival date. The expected arrival date is calculated by adding the lead time of the route to the departure date. In order to detect a delay prior to the departure of the vessel, the following ten important features were selected using Set 1 based on the measure of variable importance as described in Section 2.2: the vessel name at the departure port, the name of the main vessel traveling along the longest lane, the bill of distribution code, the code of the main lane (i.e., the longest lane), the consignee ID, the carrier ID, the third-party logistics ID, the final destination, the arrival port, and the feeder vessel name. The same set of features was identified as important using Set 2.
Based on the selected features, similar cases are extracted from all training sets. For most BoLs, fewer than 20 similar cases were identified, as illustrated in Figure 6 for Set 1a.

Figure 6 Histogram of the number of similar cases (using Set 1a).

These similar cases are then used to predict the delay status of a new case in the testing sets. As presented in Equation (3), if the ratio of delayed cases to all similar cases is greater than a predetermined delay threshold ($\alpha$), the corresponding new case is predicted to be delayed. For example, for Set 1a, the vessel delays were detected in advance at an accuracy level of 0.721 with $\alpha = 0.5$. Here, accuracy is defined as the ratio of correctly predicted cases to all cases that are predicted as delayed or not. Figure 7 compares the performance of the proposed method with various values of $\alpha$. Including the accuracy level, five different performance measures were considered. Type I indicates the Type I error, which is the probability that a case is predicted as delayed when it is in fact not delayed. Type II denotes the Type II error, which is the probability that a case is predicted as not delayed when it is in fact delayed. The sum of the accuracy, Type I error, and Type II error should be 1. The next two measures are Recall and Precision. Recall denotes the conditional probability that a case is correctly predicted as delayed given all truly delayed cases, while Precision denotes the conditional probability that a case is correctly predicted as delayed given all cases predicted as delayed. The test results demonstrate that as $\alpha$ increases, the Type I error decreases, whereas the Type II error increases. Moreover, as $\alpha$ increases, Recall decreases and Precision increases.
Based on these observations, the right choice of $\alpha$ can be determined by practitioners considering the applicable practical concerns. For example, if the first priority of the practitioners is to improve the accuracy, the value of $\alpha$ is determined as $\alpha = 0.5$ from Figure 7. Another practical way to determine the value of $\alpha$ is to use the F-measure (Fawcett, 2006), a commonly used measure of selection accuracy in
binary classification studies. The F-measure is defined as the harmonic mean of Precision and Recall, as follows:

$$\text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

and a larger value of F-measure is preferred. Figure 8 shows the values of the F-measure with various $\alpha$ values, together with the values for Recall and Precision. Based on the results in Figure 8, the value of $\alpha$ can be determined as $\alpha = 0.5$, which corresponds to the largest F-measure value.

Figure 7 Performance evaluation in terms of five different performance measures (Accuracy, Type I, Type II, Recall, and Precision) with various values of $\alpha$ (using Set 1a). The vertical axis is the value of the measurement (0.0 to 1.0) and the horizontal axis is $\alpha$ (0.2 to 0.8).

In order to evaluate the relative performance of the proposed method, several competing methods are compared in terms of accuracy using the six training-test sets. A decision tree (DT) (Murthy, 1998), K-nearest neighbors (KNN) with $K \in \{5, 10, 20\}$ and the Euclidean distance for measuring the similarity between two cases (Boriah et al, 2008), and linear regression (LR) (Neter et al, 1996) of the lead time are considered. The results are summarized in Table 1, demonstrating that the proposed method outperforms the other methods. Across all training-test sets, the proposed method consistently produces higher accuracy levels than the other methods. For the proposed method, Set 1 provides higher accuracy than Set 2, which indicates that the proposed model has been trained better with fewer but more recent data. This is possibly due to the improvement in the data quality in 2013 compared to that in 2012. This indicates that the proposed method can be enhanced by using more up-to-date data.
Next, more intensive experiments are conducted under a setting that mimics the real situation in which the delay status of a new vessel is predicted using similar past cases in a case base. Afterwards, the vessel becomes an old case and is newly retained in the case base. It is assumed that the period from September 1, 2013 to November 11, 2013 is not yet observed at the beginning of the experiments. That is, only the set of historical shipping data whose arrival dates at arrival ports are prior to September 1, 2013 is regarded as the training data at the beginning, and a total of 60,310 cases whose departure dates vary from September 1, 2013 to November 11, 2013 are predicted. The case with the earliest departure date is predicted first, after which the other cases are predicted sequentially according to their departure dates. Once a case is predicted, the accuracy of the prediction is evaluated and the case is then newly retained in the case base. In this way, the training data are accumulated gradually as new cases are retained after their arrivals, and this process is continued until the last case (i.e., with the departure date of November 11, 2013) is predicted.
In the evaluation of the performance of the proposed method under this setting, the revising scheme in the feedback loop in Figure 3 is validated and the effects of retaining additional cases when predicting the delay status are also investigated. Three scenarios are considered: (1) predictions with both revising and retaining phases, (2) predictions with the retaining phase but without the revising phase, and (3) predictions with neither the retaining nor the revising phase. In the revising phase, the proposed revising scheme in Equation (4) is used with $q = 0.5$. Figure 9 compares the cumulative accuracy level along the arrival date of the vessels for the three scenarios. Here, the cumulative accuracy refers to the average of accuracy values calculated based on all prior predictions, not just one prediction at the time point of the arrival date. The black, red, and blue lines indicate the first (revise(o)/retain(o)), second (revise(x)/retain(o)), and third (revise(x)/retain(x)) scenarios, respectively. These lines are erratic at the beginning, as there are insufficient values of accuracy accumulated at that time point due to the short time period. Furthermore, the three scenarios yielded very little difference in the accuracy at

Table 1 Comparison of the performances of competing methods in terms of accuracy (%)

Method            Set 2a   Set 2b   Set 2c   Set 1a   Set 1b   Set 1c
Proposed method   71.2     72.1     71.3     72.1     72.8     71.9
DT                61.6     61.1     62.1     61.0     63.5     59.0
5NN               59.4     56.3     55.8     55.9     56.7     59.5
10NN              57.4     53.8     56.7     57.4     55.4     58.3
20NN              56.8     58.4     57.4     59.7     56.0     60.7
LR                66.1     66.0     65.6     66.5     66.1     65.5
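The margins implied by Table 1 can be checked with a few lines of arithmetic. The Python sketch below only re-uses the accuracy values printed in the table; the variable names are illustrative, and the column order follows the table header as extracted.

```python
# Accuracy (%) per method, in the column order of Table 1:
# Set 2a, Set 2b, Set 2c, Set 1a, Set 1b, Set 1c.
accuracy = {
    "Proposed method": [71.2, 72.1, 71.3, 72.1, 72.8, 71.9],
    "DT": [61.6, 61.1, 62.1, 61.0, 63.5, 59.0],
    "5NN": [59.4, 56.3, 55.8, 55.9, 56.7, 59.5],
    "10NN": [57.4, 53.8, 56.7, 57.4, 55.4, 58.3],
    "20NN": [56.8, 58.4, 57.4, 59.7, 56.0, 60.7],
    "LR": [66.1, 66.0, 65.6, 66.5, 66.1, 65.5],
}


def mean(values):
    return sum(values) / len(values)


proposed = accuracy["Proposed method"]
set2_mean = round(mean(proposed[:3]), 2)  # mean over Set 2a, 2b, 2c
set1_mean = round(mean(proposed[3:]), 2)  # mean over Set 1a, 1b, 1c

# Margin of the proposed method over the best competing method, per column.
competitors = [v for name, v in accuracy.items() if name != "Proposed method"]
margins = [
    round(p - max(other[i] for other in competitors), 1)
    for i, p in enumerate(proposed)
]

print(set1_mean, set2_mean, min(margins), max(margins))
```

In every column the strongest competitor is LR, so the margins are the gaps between the proposed method and LR.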

Figure 8 F-measure values with various values of α (using Set 1a). [Plot: recall, precision, and F-measure (the value of measurement) against α from 0.2 to 0.8.]

Figure 9 Comparison of the cumulative accuracy level among the three scenarios for the Revise and Retain phases.
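The cumulative accuracy plotted in Figure 9 can be read as a running mean of per-prediction correctness, ordered by arrival date. The Python sketch below follows that reading; it is an interpretation of the definition given in the text rather than the authors' code, and `cumulative_accuracy` is a hypothetical helper name.

```python
from itertools import accumulate
from typing import Iterable, List


def cumulative_accuracy(correct_flags: Iterable[bool]) -> List[float]:
    """Running fraction of correct predictions, in arrival-date order."""
    running_hits = accumulate(int(flag) for flag in correct_flags)
    # After the n-th prediction, divide the hits so far by n.
    return [hits / n for n, hits in enumerate(running_hits, start=1)]


# Early values swing widely because each one averages only a few predictions.
print(cumulative_accuracy([True, False, True, True]))
```

With only a handful of predictions, a single miss moves the curve by a large step, which matches the erratic start of the three lines.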
the beginning, because very few cases had been newly retained due to their long lead times. From approximately September 20, 2013, the second scenario performed worse than the other two. This occurred because the predicted cases began to move into the case base, but some incorrectly predicted cases distorted the future prediction. Compared to the third scenario, this demonstrated that incorrect information is worse than no information. However, from approximately October 15, 2013, the performance of the third scenario degraded dramatically. This resulted from no new case being retained in the case base, and it led to a lack of capability to predict the delay of BoLs correctly. In contrast, the first scenario with both the revising and retaining phases performed best over the entire time period, although the difference from the second scenario was not significant.

3.2. Real-time detection of vessel delays

The proposed method for the real-time detection of vessel delays is illustrated with real data. For this illustration, a single BoL was selected from among 100,000 BoLs in 2014. The selected BoL was to transport a container from Weihai port, China, to Long Beach port, USA. Figure 10 depicts the vessel tracking data from the port-to-port route on the map. Through the satellite-based automatic identification system (S-AIS), the real-time vessel tracking data were collected from March 15, 2014 to April 4, 2014. On average, the S-AIS sent the vessel tracking data every 30 minutes (varying from approximately every 10 minutes to a couple of hours). For the real-time detection of vessel delays, the above real-time tracking data were used in combination with the historical data of Set 1 as described in Section 3.1.

Before the real-time tracking data were used in the system, the baseline D_B(t) for the route of interest should be constructed with similar historical S-AIS data for on-time arrival vessels using Equation (6). Whenever the tracking data are updated, the delay status is computed immediately using Equation (7). The range of t in d(t) is from the time when the vessel departed from the departure port to the time when it arrived at the arrival port, i.e., 0 ≤ t ≤ T, where T is the lead time of the corresponding baseline.

As described in Section 2.3, as soon as the event of d(t) > 0 is detected, the detection system determines the risk of a delay by visualizing the magnitude of d(t) with colors between green and red. In this case, the delay was detected as soon as the vessel was behind the baseline, as depicted in Figure 11. From the colors, it could clearly be seen that the vessel had reduced its speed near the arrival port, Long Beach port, for some reason.

Using the Refine phase proposed in Section 2.3, a refined similar case is identified from the set of similar cases SC_c to predict the time-distance pattern of the target case c in the future and to understand the causes of the delay. Recall that SC_c is selected in advance using the historical shipping data. In Figure 12, the patterns of similar cases in SC_c are represented by green dotted lines, along with the baseline as the black solid line and the target case c as the red dashed line. As expected, the D(t) patterns of similar cases in SC_c are similar to that of the target case c because they have identical values for the important features. From the set SC_c, the refined similar case c* is identified using Equation (8) for each t when d(t) > 0. In Figure 13, the red circles represent the real-time tracking data and the black line is its baseline. The coordinates of the red circles on the y-axis represent the actual location of the vessel at travel time t. We detected d(t) > 0 at t = 1,605,606 (in seconds) for the first time; after that time, d(t) remained positive according to the subsequently updated
Figure 10 Vessel tracking data from Weihai port, China, to Long Beach port, U.S.A.

Figure 11 Determining the risk of a delay using colors near Long Beach port.

Figure 13 Prediction of the D(t) pattern using a refined similar case: the observed tracking data, its baseline, the pattern of a refined similar case, and the predicted pattern of the target case are indicated by red circles, a black solid line, a blue solid line, and a blue dotted line, respectively. [Plot: longitude against time (s).]
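The detection step illustrated in Figure 11 can be sketched in a few lines of Python. Equation (7) is not reproduced in this section, so the form used here, d(t) = D_B(t) - D(t) with the baseline linearly interpolated at each observation time, is an assumption, as are the helper names and the saturation scale `d_max`. Positions are taken as longitude values, following the axes of Figures 12 and 13.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def interpolate(times: Sequence[float], values: Sequence[float], t: float) -> float:
    """Piecewise-linear interpolation over sorted (time, value) samples."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)  # first sample strictly after t
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def delay_status(base_t, base_pos, t: float, observed_pos: float) -> float:
    # ASSUMPTION: stand-in for Equation (7); positive when behind the baseline.
    return interpolate(base_t, base_pos, t) - observed_pos


def first_detection(base_t, base_pos, track: List[Tuple[float, float]]) -> Optional[float]:
    """Return the first observation time at which d(t) > 0, or None."""
    for t, pos in track:
        if delay_status(base_t, base_pos, t, pos) > 0:
            return t
    return None


def risk_color(d: float, d_max: float) -> Tuple[int, int, int]:
    """Map d onto an RGB color from green (d <= 0) to red (d >= d_max)."""
    frac = min(max(d / d_max, 0.0), 1.0)
    return (round(255 * frac), round(255 * (1 - frac)), 0)
```

Each new S-AIS message would be passed through `delay_status`, and the resulting magnitude mapped to a color for display; the choice of `d_max` controls how quickly the display saturates to red.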


tracking information. This indicates that the vessel continued to remain behind the baseline. In order to predict the future pattern of D(t) and to understand the causes of the delay of the vessel, the refined similar case is found. The blue solid line in Figure 13 is the refined similar case c* identified using Equation (8), that is, the best-matching pattern D_{c*}(t) at the time of t_d = 1,669,916. Although c* arrived at the arrival port on time, it had been selected because its slowing trend near the arrival port is similar to that of c. The location parameter to align D_c(t) and D_{c*}(t) was computed and found to be d = 90,000 using Equation (10). Leveraging the pattern information from c*, practitioners can predict the future pattern of D_c(t) using Equation (9), as indicated by the blue dotted line in Figure 13. Moreover, the highly probable cause of a delay was congestion at the arrival port. Because the cause of the delay prior to the vessel arrival is now known, the delay can be managed appropriately in advance in order to fulfill the expectations of customers and to reduce the costs due to the delay.

Figure 12 The patterns of the baseline (black solid), target case (red dashed), and similar cases (green dotted): the patterns of similar cases are similar to that of the target case, because the similar cases were retrieved from the case base in order to have features similar to those of the target case. [Plot: longitude against time (s).]

4. Conclusion

This paper proposes a novel methodology for the early detection of vessel delays that combines historical information and real-time information. We have used the term ‘early’ with the following two meanings: ‘prior to vessel departure’ and ‘real time.’ In order to predict whether or not a vessel will be delayed prior to its departure, we propose a case-based reasoning (CBR) method for categorical data with a feedback

scheme for the Revise and Retain phases. Our numerical study confirmed that an elaborated feedback scheme is required for efficient information flows in logistics systems. In order to detect the delay of a vessel in real time, we proposed a real-time analytics-based approach that combines historical data and real-time data. Through the proposed new phase of CBR called Refine, the delay can be detected in real time and the movement patterns of a vessel can be predicted until its arrival. Moreover, highly likely causes of a delay can be identified; thus, the delay at the arrival port can be appropriately managed. The primary contribution of this paper is that it proposes a data-driven method that detects vessel delays in multiple stages (prior to vessel departure and on the open seas) using both historical shipping data and real-time tracking data in the framework of refined case-based reasoning. In this paper, we only utilize historical shipping data and real-time S-AIS vessel tracking data. Including other various types of data such as weather and social data will be investigated in future research.

Acknowledgments: This research was partially supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2015R1C1A1A02037090). Also, this work was supported by the 2016 Research Fund (1.160093.01) of UNIST (Ulsan National Institute of Science & Technology).

Received 18 March 2015; accepted 4 August 2016
