
IEEE ICC 2015 - Workshop on Cloud Computing Systems, Networks, and Applications (CCSNA)

Learning from Cloud Latency Measurements

Pavol Mulinka and Lukas Kencl


Czech Technical University in Prague
{pavol.mulinka,lukas.kencl}@fel.cvut.cz

Abstract—Measuring, understanding, troubleshooting and optimizing various aspects of a Cloud Service hosted in remote datacenters is a vital, but non-trivial task. Carefully arranged and analyzed periodic measurements of Cloud-Service latency can provide strong insights into the service performance. A Cloud Service may exhibit latency and jitter which may be a compound result of various components of the remote computation and intermediate communication. We present methods for automated detection and interpretation of suspicious events within the multi-dimensional latency time series obtained by CLAudit, the previously presented planetary-scale Cloud-Service evaluation tool. We validate these methods of unsupervised learning and analyze the most frequent Cloud-Service performance degradations.

I. INTRODUCTION

Measuring and understanding various aspects of Cloud-Service performance is a difficult and important task, from the perspective of both its users and its operators. In this paper we focus on Cloud-Service latency, understood as a combination of the intra-Cloud latency, the application delay and the Internet content-delivery time.

Cloud-Service latency time series can be an important indicator for rapid detection of service disruption or degradation, as well as for its troubleshooting and root-cause identification. This information, especially about parts beyond one's control, can be vital to many players involved in the Cloud Computing ecosystem: to Cloud-infrastructure and Cloud-service providers for troubleshooting and optimizing their infrastructure (network connectivity in particular), to Cloud tenants for provider selection and contract supervision, and to Cloud-service end-users for service benchmarking, selection and quality evaluation. Latency analysis is vital because, especially from the user or tenant perspective, service latency might easily be the only measurable information at hand about the actual Cloud-service performance.

Our detection and interpretation tool works with active latency measurements, using globally distributed probes to servers in the Cloud, obtained in our prior work, the CLAudit [1] planetary-scale Cloud Latency Auditing platform. While understanding the origins and symptoms of Cloud-Service latency is difficult, fast and automatic identification and interpretation of suspicious events on the basis of latency time series is an equally hard task, better suited to machine analysis. The objective is to detect suspicious events digressing from normal behavior that can be acted upon as appropriate by operators, tenants or users. Defining such automated detection and interpretation methods is the subject of this work.

The presented methods fall into the category of unsupervised learning. The observed latency-measurement time series do not provide any information about the actual events in the datacenter or along the communication path. The Cloud-Service ecosystem can thus be seen as a black box with only limited insight into its internals. We are therefore not in possession of any ground-truth data, and it is impossible to evaluate parameters such as false positives.

Nevertheless, the multi-faceted nature of the CLAudit measurements allows us to draw some conclusions. As the measurements are executed from multiple geographical vantage points and on multiple protocol layers, we can exploit time-series interrelation to pinpoint likely causes of the service degradation (datacenter process, remote/local network, etc.). As "normal" behavior of a Cloud Service is not defined, we set our expectations based on past measurements and design algorithms to identify suspicious events as deviations from them. We experiment with multiple metrics, focusing either on latency exceeding a certain threshold, on increasing jitter, or on a changing latency distribution pattern over a time window. We sketch the design of a sliding-window-based automated tool that could provide real-time indication of service degradation based on latency measurements, together with an event-interpretation database.

The main contributions of this work are as follows:
(i) Detection metrics. We present and evaluate three metrics that can be applied to the client-Cloud latency measurement time series and that indicate occurrences of suspicious events in the client-Cloud communication.
(ii) Interpretation guidelines. Machine-based interpretation of the detected events is possible thanks to the presented interpretation methods over the multi-dimensional indication vectors of detected events. We present examples of such events, including those exhibiting interrelation across multiple time series.
(iii) Latency measurements analysis. Thanks to the above tools, we present a breakdown of the frequency and persistence of the various types of incidents as detected in the latency data. This gives a statistical overview of the type of issues users of contemporary Cloud-Service platforms might be experiencing.

II. RELATED WORK

Latency measurements and anomaly detection are well-known concepts in telecommunications and networking. Many studies have focused on methods of network-latency measurement and estimation at a single layer of the OSI model, not considering the geographic location of the nodes [2], [3]. Techniques were proposed for anomaly detection using traffic-flow pattern analysis [4], [5], [6]. Only recently has attention been given to Cloud latency from the user perspective [7]. We focus on the user perspective specifically, studying Cloud-latency measurements at multiple layers of the OSI model and across multiple geographic vantage points.




Many commercial or open-source monitoring solutions exist, ranging from monitoring of servers, networks and services, such as Nagios [8] or MRTG [9], to monitoring of clouds: Cedexis [10], GWOS [11], CloudHarmony [12], CloudClimate [13], Eucalyptus [14], IBM SmartCloud Monitoring [15] or the OpenStack monitoring tools [16]. However, these solutions focus solely on gathering and evaluating data from within the cloud. Solutions monitoring clouds from the user perspective, e.g. the Nimsoft Cloud User Experience Monitor [17], Dynatrace [18] or CloudSpectator [19], already exist, but they focus solely on data gathering and lack mechanisms for detection and interpretation of suspicious events. Research-community-driven projects are needed to apply machine-learning techniques to the gathered data for further analysis.

Previous Cloud and network studies focused on passive monitoring of volumes of transferred data [20], of a CDN deployed in PlanetLab [21] - PlanetSeer [22], or on active monitoring of the availability of ICMP and HTTP services [23]. Their common goal was to measure and aggregate the suspicious events and anomalous behavior, whereas we focus on proposing a framework for distinguishing suspicious events based on their geographical location and the affected OSI layer.

A recent cloud monitoring survey [24] summarizes state-of-the-art solutions in the field of Cloud monitoring, but little attention is given to the detection of suspicious events from latency measurements. Latency as a metric is used for performance evaluation [25], [26] and benchmarking [27], but not for suspicious-event detection. Time-series studies [28], [29], [30] focused on similarity search using wavelets and statistical methods. In contrast, as we lack a definition of normal cloud behavior, we focus on searching for deviations from its empirically defined parameter values.

III. CLAUDIT PRIOR-ART SHORT OVERVIEW

This section provides a very short overview of CLAudit, the globally-distributed cloud latency measurement platform, which is the source of the cloud-latency data analyzed in this paper. Details of CLAudit are described in [1]. CLAudit consists of two major components: the globally-distributed probing Client nodes, and the responding Application servers, deployed within the Cloud datacenters.

Clients play the role of real end-users of Cloud applications and work with identical protocols. Client nodes are geographically distributed throughout the whole world using the PlanetLab [21] global research network (see Fig. 1), so that various end-user perspectives can be evaluated. Clients are centrally instructed to send probes against applications hosted on the Application servers to obtain protocol delays from the respective RTTs (see Fig. 2). Probes exist using various ISO/OSI layer protocols, in two forms depending on whether a further connection to a database is used or not (see Table I). CLAudit thus reveals protocol-level behavioral differences and benchmarks the obtained results. For example, TCP/SQL storage (on-premise and remote) is included in our testbed, which simulates a multi-tier Web 2.0 application. Client nodes are deployed in triplets for the purposes of redundancy and anomaly-discovery verification.

Application servers are client-request-serving machines deployed in the Cloud data centers, provisioned in the same way as in a production environment. Application servers are geographically distributed across a global set of data centers to enable evaluation of distributed applications. Microsoft Windows Azure [31] serves as the platform for hosting the Application servers and Windows Azure SQL hosts the database.

CLAudit measurements are gathered periodically every 3 minutes and are publicly available at the CLAudit website [32].

Fig. 1. Map of the CLAudit prototype deployment throughout the world. The central Monitor is represented by a magnifier. Client nodes are represented by laptops, web servers by the globe icons and database servers by the cylinder icons. Client nodes are deployed in redundant fashion in each region.

Fig. 2. Selected CLAudit-measured variables. Sequence diagram of the client-web-server-database interactions (TCP SYN / SYN-ACK / ACK, HTTP GET, SQL QUERY / REPLY, HTTP RESPONSE) along with three selected elementary round-trip time variables and two composite variables (TCP_WS_RTT, TCP_DB_RTT, HTTP_RTT, SQL_RTT, OVERALL_LATENCY), as measured by CLAudit.

TABLE I. Measurement description

measurement type | description | database server
tcp2ws | RTT between client TCP SYN and TCP SYN-ACK from web server | no
tracert | RTT between client ICMP Echo Request and ICMP Time Exceeded from the farthest visible network hop in the target datacenter | no
tcp2db | RTT between TCP SYN from web server and TCP ACK from db server | yes
http2ws(db) | RTT between client simple HTTP request and HTTP response from web server, including a response from a db server | yes
http2ws(null) | same as http2ws(db), except without a db server | no
overall(db) | total RTT between client TCP SYN and HTTP response from web server, including a response from a db server | yes
overall(null) | same as overall(db), except without a db server | no
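To make the probe semantics of Table I concrete, the following Python sketch illustrates the client-side principle of a tcp2ws-style and an http2ws-style probe by timing a TCP handshake and a simple HTTP exchange. It is an illustration of the probing idea only, not the CLAudit implementation (see [1]); the target host is a placeholder.

# Illustrative client-side probe sketch; not the CLAudit implementation.
# The target host below is a placeholder, not an actual CLAudit server.
import socket
import time
import urllib.request

def tcp2ws_probe(host, port=80, timeout=5.0):
    """Approximate tcp2ws: time the TCP handshake with the web server."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0  # RTT in milliseconds

def http2ws_probe(url, timeout=5.0):
    """Approximate http2ws: time a simple HTTP request/response exchange."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    host = "example.com"  # placeholder web server
    print("tcp2ws RTT [ms]:", round(tcp2ws_probe(host), 1))
    print("http2ws RTT [ms]:", round(http2ws_probe("http://" + host + "/"), 1))

Timing the completed connect call approximates the SYN/SYN-ACK round trip; CLAudit itself additionally records the tracert and database-side RTTs listed in Table I.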


IV. SOLUTION ARCHITECTURE

In this section we describe the techniques for post-processing the CLAudit latency measurements. We partition the post-processing into two sequential tasks: detection of suspicious events, and interpretation of these events. This process is a variation of unsupervised machine learning - we possess no ground-truth data about datacenter or network disruptions, yet we search for indications of such events and, in the interpretation phase, use the multi-dimensional characteristics of the measurement time series to estimate the cause of a disruption.

A. Detection

Our focus is on detecting occurrences of suspicious events within the latency measurement time series X = {x_i}, where x_i = 0 denotes an unreachable server. As suspicious events have various characteristics [1], we use three different metrics operating over a sliding window W = {w_j}. Each window contains a subsequence of x_i of size S and overlap O ∈ {0, 1, ..., S−1}, denoted as window(S, O, X). To address the k-th element of the j-th window we use the notation w_j(k), k ∈ {1, 2, ..., S}. The algorithm calculates the metrics over W, comparing them to empirically determined thresholds (see Section V). Each metric uses a different approach to profile regular behavior and mark other events as suspicious. We use a latency threshold, the coefficient of variation and a histogram as the bases for the metrics (see Fig. 3). The outputs of the metric variants Y = {y_j} may be combined into the final output R = {r_j}.

Fig. 3. Example of the MLT, CV and Histogram metrics computed over a latency measurements sample. a) Maximum Latency Threshold metric, b) CV metric, c) Histogram metric.
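The following minimal Python sketch fixes the window(S, O, X) notation used throughout this section. It assumes that consecutive windows share O samples, i.e. the window advances by S − O samples; the actual CLAudit post-processing scripts may differ in detail.

# Minimal sliding-window sketch for window(S, O, X): windows of size S,
# consecutive windows overlapping by O samples (step = S - O).
def window(S, O, X):
    step = S - O
    return [X[i:i + S] for i in range(0, len(X) - S + 1, step)]

# Example: S = 4, O = 2 over a toy latency series (ms); 0 denotes "unreachable".
X = [135, 131, 140, 0, 138, 250, 260, 133]
W = window(4, 2, X)   # [[135, 131, 140, 0], [140, 0, 138, 250], [138, 250, 260, 133]]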
1) Maximum Latency Threshold (MLT) Metric: For pseudocode see Algorithm 1. The main idea is based on the assumption that under steady-state conditions, defined by the technical and physical specifications of the path between the node and the server, the measured latency should be close to its minimum value x_min. Due to generally unstable networking and computing conditions, we concede that within a given window a fraction of measurements N might fluctuate above the value of x_min multiplied by the given coefficient MLT. This implies that we use two conditions to detect suspicious events. First, a condition to mark x_i as suspicious:

x'_i = 1 ⇐⇒ ( x_i > x_min ∗ MLT ),   (1)

the output of which is a boolean representation of X, denoted X' = {x'_i}. Second, a condition that marks the window w'_j (w'_j ∈ W', w'_j(k) ∈ {0, 1}) as suspicious if the number of suspicious values within it is greater than N ∈ {0, 1, ..., S−1}:

y_j = 1 ⇐⇒ ( Σ_{k=1}^{S} w'_j(k) > N ).   (2)

As the minimum latency changes over time, the x_min value must be adjusted. Auxiliary mechanisms must be implemented so that the algorithm can continuously detect suspicious events, e.g. periodically resetting the x_min value. This includes specifying the criteria under which a new minimum x_min could be elected. We did not design these mechanisms and criteria, as the paper focuses on the framework proposal and its evaluation, not on real-world adjustment.

input: X, S, O, N, MLT
Output: Y
begin
  X' = 0;
  x_min = x_0;
  for i = 1 to length(X) do
    if x_min > x_i and x_i > 0 then
      x_min = x_i;
    else if x_i >= (x_min ∗ MLT) or x_i = 0 then
      x'_i = 1;
  end
  W' = window(S, O, X');
  for j = 0 to length(W') do
    y_j = ( Σ_{k=1}^{S} w'_j(k) > N )
  end
end
Algorithm 1. MLT metric pseudocode
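A runnable Python sketch of the MLT metric along the lines of Algorithm 1 is given below. Here the threshold N is taken as an absolute count of suspicious samples per window, and the window() helper from the earlier sketch is repeated to keep the example self-contained.

# Sketch of the MLT (Maximum Latency Threshold) metric, following Algorithm 1.
def window(S, O, X):
    # sliding windows of size S with overlap O (step S - O), as sketched above
    return [X[i:i + S] for i in range(0, len(X) - S + 1, S - O)]

def mlt_metric(X, S, O, N, MLT):
    """Return per-window booleans y_j; a sample x_i = 0 means the server was unreachable."""
    x_min = X[0]
    X_bool = [0] * len(X)
    for i in range(1, len(X)):
        if 0 < X[i] < x_min:
            x_min = X[i]                      # track the running minimum latency
        elif X[i] >= x_min * MLT or X[i] == 0:
            X_bool[i] = 1                     # sample marked as suspicious
    return [sum(w) > N for w in window(S, O, X_bool)]

# Example: a window is suspicious if more than N = 2 of its S = 4 samples
# exceed MLT = 1.8 times the running minimum (or are unreachable).
y = mlt_metric([135, 131, 140, 0, 138, 250, 260, 300], S=4, O=2, N=2, MLT=1.8)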
2) CV Metric (Coefficient of Variation): For pseudocode see Algorithm 2. This metric is based on the premise that an increased coefficient of variation cv(w_j) = σ(w_j) / E(w_j) indicates instability. The tolerable extent is given by a threshold CV:

y_j = 1 ⇐⇒ ( cv(w_j) > CV ).   (3)

input: X, S, O, CV
Output: Y
begin
  W = window(S, O, X);
  for j = 0 to length(W) do
    y_j = ( cv(w_j) > CV );
  end
end
Algorithm 2. CV metric pseudocode
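Analogously, a sketch of the CV metric of Algorithm 2, using the population standard deviation and treating all-zero windows as non-suspicious (a choice made in this sketch, not prescribed by the algorithm):

# Sketch of the CV (coefficient of variation) metric, following Algorithm 2.
import statistics

def window(S, O, X):
    # sliding windows of size S with overlap O (step S - O), as sketched above
    return [X[i:i + S] for i in range(0, len(X) - S + 1, S - O)]

def cv_metric(X, S, O, CV):
    """Mark window w_j as suspicious if cv(w_j) = sigma(w_j) / E(w_j) exceeds CV."""
    result = []
    for w in window(S, O, X):
        mean = statistics.mean(w)
        sigma = statistics.pstdev(w)            # population standard deviation
        cv = sigma / mean if mean > 0 else 0.0  # guard against all-zero windows
        result.append(cv > CV)
    return result

# Example with the threshold CV = 0.5 used in Section V:
y = cv_metric([135, 131, 140, 138, 250, 260, 133, 129], S=4, O=0, CV=0.5)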


3) Histogram Metric: For pseudocode see Algorithm 3. We assume that measurements of data centers in the Cloud have data-center-specific outputs over short time spans unless the conditions are altered, which indicates a suspicious event. We represent the measurements by histograms computed from W and search for deviations from historic measurements, represented by the histogram H_train, that would indicate such an alteration of conditions. We put emphasis on the most recent measurements in the window evaluation by using a low-pass filter that combines H_train with the histogram H_test of the most recent non-suspicious window, weighed by a parameter α ∈ (0, 1) (see Algorithm 3). The outcome is compared to the current H_test using the root mean square error (RMSE). We use percentiles p_L and p_H, computed from an initial tuning set of measurements x_i, i ∈ {1, 2, ..., tune}, as a demarcation range R_dem of the histogram values, which has B elements. The percentile ranks L and H are calculated as (100/B) and (100 − 100/B) to set the size of the histogram edge elements. The threshold HIST is used to decide whether the window is marked as suspicious:

y_j = 1 ⇐⇒ ( RMSE(H_train, H_test) > HIST ).   (4)

input: X, S, O, L, H, B, HIST, α, tune
Output: Y
begin
  W = window(S, O, X);
  R_dem = <p_L(x_i), p_H(x_i)>, i = 0, 1, ..., tune;
  H_train = histogram(w_0, R_dem, B);
  for j = 0 to length(W) do
    H_test = histogram(w_j, R_dem, B);
    if RMSE(H_train, H_test) > HIST then
      y_j = 1;
    else
      H_train = α ∗ H_test + (1 − α) ∗ H_train;
    end
  end
end
# function histogram(w, R_dem, B) - computes a histogram over w with B elements defined by R_dem
Algorithm 3. Histogram metric pseudocode
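A sketch of the Histogram metric along the lines of Algorithm 3, using numpy for the histogram and the RMSE; the demarcation range R_dem is taken from the p_L and p_H percentiles of an initial tuning prefix of the series, and the low-pass update follows the pseudocode above.

# Sketch of the Histogram metric, following Algorithm 3.
import numpy as np

def window(S, O, X):
    # sliding windows of size S with overlap O (step S - O), as sketched above
    return [X[i:i + S] for i in range(0, len(X) - S + 1, S - O)]

def histogram_metric(X, S, O, B, HIST, alpha, tune):
    """Compare each window's histogram against a low-pass-filtered training histogram."""
    L, H = 100.0 / B, 100.0 - 100.0 / B
    p_low, p_high = np.percentile(X[:tune], [L, H])   # demarcation range R_dem
    edges = np.linspace(p_low, p_high, B + 1)         # B histogram elements

    def hist(w):
        counts, _ = np.histogram(w, bins=edges)
        return counts.astype(float)

    W = window(S, O, X)
    h_train = hist(W[0])
    result = []
    for w in W:
        h_test = hist(w)
        rmse = float(np.sqrt(np.mean((h_train - h_test) ** 2)))
        if rmse > HIST:
            result.append(True)                       # distribution changed: suspicious
        else:
            result.append(False)
            h_train = alpha * h_test + (1 - alpha) * h_train  # low-pass update
    return result

# Example call with the parameters selected in Section V (B = 20, HIST = 2.2, alpha = 0.8);
# latency_series stands for one CLAudit measurement time series.
# y = histogram_metric(latency_series, S=10, O=0, B=20, HIST=2.2, alpha=0.8, tune=200)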
B. Interpretation

The CLAudit architecture provides measurements from the user perspective - considering the cloud service as a complete black box. Thus, we can only make educated estimates as to the interpretation of the measured data and the detected events. However, given the multi-dimensional (protocol layer, location, machine) perspective of the measurements, we are able to confidently identify and separate various causes of suspicious changes in the latency measurements, such as a network failure, a data-center problem, etc. In the following we show a list of events together with the expected footprint of such an event in the multi-dimensional detection array, as determined by the detection algorithm metrics presented in the previous subsection. The process of result interpretation is separated into the creation of a Measurement snapshot and its comparison to an Interpretation database:

1) Measurement snapshot: The outputs of the algorithm metrics are combined into 2D arrays called snapshots, which contain information about the type of the measurements, the location of the nodes and servers, etc., and the outputs from each metric. Each snapshot corresponds to a single w_j in the latency measurements of a single server. The outputs from the metrics are combined into the final output R = {r_j}, which is then compared to the Interpretation database, described in the next paragraph. For an example of a measurement snapshot see Fig. 4a).

Fig. 4. Event Tree. a) example of a measurement snapshot compared to the Interpretation database; b) general structure of an event tree; c) assessment of underlying protocols in relation to the triggering event http2ws(null), with interpretation: HTTP SERVICE ISSUE; d) part of the Interpretation database created according to the event tree.

2) Event Tree and Interpretation database: The Interpretation database is a multidimensional array that consists of a series of interpretable suspicious events and their interpretations, constructed from the Event Tree. Interpretable suspicious events are denoted as triggering and the suspicious events used for their interpretation as underlying. The interpretations include weak positives. Under a weak positive we understand a suspicious event not supported by underlying OSI-layer measurement detections, e.g. a tcp2ws event is detected but http2ws is not.

In the following we use the notation from Table I. We created an Event Tree using the geographic location of the affected nodes and a top-to-bottom OSI-model approach, by designating the triggering events {http2ws(null), http2ws(db), overall(null), overall(db)} and analyzing the dependencies of the detected events on the underlying protocols {tracert, tcp2ws, tcp2db, http2ws(null), http2ws(db)}, where {http2ws(null), http2ws(db)} can be either triggering or underlying for {overall(null), overall(db)}. See the examples in Fig. 4b)-d).
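To illustrate how the Interpretation database is consulted, the following sketch encodes a few rules in the spirit of Fig. 4d as a mapping from a triggering http2ws(null) event and its underlying detections to an interpretation; it is a small illustrative subset, not the actual database.

# Illustrative subset of the Interpretation database (cf. Fig. 4d); not the full rule set.
def interpret_http2ws_null(tracert, tcp2ws):
    """Interpret a triggering http2ws(null) event from its underlying detections."""
    if tracert and tcp2ws:
        return "connectivity issue"
    if tracert:
        return "path issue"
    if tcp2ws:
        return "data connection issue"
    return "http service issue"   # no underlying detection: the HTTP service itself is suspect

# Example: tcp2ws is also suspicious in the same window, tracert is clean.
print(interpret_http2ws_null(tracert=False, tcp2ws=True))   # -> data connection issue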


V. PERFORMANCE EVALUATION

We tested the proposed solution on the historic measurements in [1]. Testing was divided into the Detection part, where we determined the optimal parameters for the metrics, and the Interpretation part, where we analyzed the relations among the results from all measurement types.

A. Simulation setup

The measurements comprise the 4 types and their combinations, from 6 nodes to 4 servers, in combination with 5 databases (including null, see Table I). Each type is recorded in 480 measurement samples per day (consecutive samples are ∼3 min apart) of median and minimum values, over 28 days. The initial 20% of the measurements were not used in the statistics, but for the initial adjustment of the metric parameters, i.e. the selection of x_min and the creation of a representative H_train and R_dem. The emphasis was put on the future feasibility of real-time suspicious-event detection, therefore we focused on S ∈ (5; 20). The solution was implemented using Python scripts. For validation purposes we encourage the reader to visit the CLAudit website [32], where all input data is available.

B. Results

1) Detection: Parameter tuning. Each metric uses specific parameters that must be adjusted to achieve the best results. To understand the impact of each parameter on the metric outcome we created a series of heatmaps that show the dependency on the parameters (see the selected example in Fig. 5). We decided to choose combinations marking 5%−15% of all measurements as suspicious, to analyze ≈ 10% of the data and to avoid a high number of weak positive detections - those not supported by underlying OSI-layer detections. The heatmaps confirm that changes in the overlap parameter O have little impact on all metrics and we thus decided to omit it. The CV metric heatmap shows a high number of detected suspicious events for lower values of the window size S; this is due to the high impact of varying samples x_i on the value of cv(w_j) in a small measurement window. As the simulation ought to reflect a real-time detection scenario, it is desirable to have high granularity through a small size of S. S > 5 has no significant impact on the CV metric and the MLT metric does not significantly depend on the size of S. We used S = 10 (30 min) for the adjustments of the Histogram metric. The Histogram metric heatmaps show that with an increasing number of histogram elements B, the RMSE lowers and setting HIST gets more difficult. We set B = 20 and HIST = 2.2, as this combination had the highest success rate in marking observed suspicious events in the measurements. Based on the above, in the further analysis we set the parameters as follows: ALL: S = 10, O = 0; MLT: N = 50%, MLT = 1.8; CV: CV = 0.5; Histogram: B = 20, HIST = 2.2, α = 0.8. We set the threshold values based on the highest success rate in marking observed suspicious events in the measurements.

For the interpretation we used a Results condition: r_j = 1 ⇔ (y_j^MLT + y_j^CV + y_j^HIST ≥ 2), meaning that a window is marked suspicious if it is marked suspicious by at least two metrics. As a result, out of the 41235 time-series windows, 2818 (≈ 6.8%) were marked as suspicious.

Metrics relevance. We need to verify whether each metric merits its own existence, i.e. that each metric individually marks distinct windows. About 7% of all windows were marked as suspicious by a combination of metrics, 62% were not marked by any metric, leaving 31% marked by exactly one metric (Fig. 6). Fig. 6 and 7 show that the distribution of measurement types with detected events differs significantly per metric. Using multiple metrics in conjunction with the Results condition thus restricts suspicious-event marking, giving a plausible amount of detections. At the same time, the Results condition should perhaps be avoided and a single metric preferred when searching for specific event types.
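The Results condition itself reduces to a simple per-window vote; a minimal sketch:

# Sketch of the Results condition: r_j = 1 iff at least two metrics mark window j.
def results_condition(y_mlt, y_cv, y_hist):
    return [int(a) + int(b) + int(c) >= 2 for a, b, c in zip(y_mlt, y_cv, y_hist)]

# Example: window 2 is marked by MLT and HIST only -> suspicious.
r = results_condition([0, 0, 1], [0, 1, 0], [0, 0, 1])   # -> [False, False, True]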
Fig. 5. Algorithm Metric Heatmaps. Heatmaps show the dependency of the fraction of detected events on the metric parameters. 1) MLT metric: 1a) MLT and S (O = 0, N = 50%), 1b) MLT and O (S = 10, N = 50%). 2) CV metric: 2a) CV and S (O = 0), 2b) CV and O (S = 10). 3) Histogram metric: 3a) HIST and B (α = 0.8, S = 10), 3b) HIST and α (B = 20, S = 10).

Fig. 6. Suspicious event detection by algorithm metric. a) shows the ratio between suspicious and non-suspicious r_j, where 62% of all windows W were not marked by any algorithm metric, 31% were marked by only one metric and 7% were marked as suspicious; b) shows a detailed view of the r_j marked as suspicious, of which 0.13% are a combination of all metrics and the rest a combination of two metrics (mlt & hist 6%, mlt & cv 1%, cv & hist 0.52%).

Fig. 7. Normalized distribution of the detected events among measurement types (tracert, tcp2ws, tcp2db, http2ws(null), http2ws(db), overall(null), overall(db)) per individual metric. a) MLT shows an even ratio among measurements linked to databases. b) CV shows a relatively even ratio among measurements prone to jitter. c) HIST shows a majority ratio of the db measurement (tcp2db), caused by its long-term character in conjunction with leaps in the measurements. As the ratios differ significantly, we conclude that each metric detects a different set of deviations from standard behavior.


Fig. 8. Persistence of suspicious events. Histogram of consecutive event windows per measurement type; the maximum detected persistency per type is: A http2ws(db) 53, B http2ws(null) 14, C overall(db) 24, D overall(null) 12, E tcp2db 960, F tcp2ws 12, G tracert 15.

Persistence of detected suspicious events was measured per measurement type (see Fig. 8). We are searching for events that are a deviation from normal behavior, therefore any long-term character of the detected events would be an undesirable algorithm feature. However, most of the detected events have a short-term character, i.e. a duration ≤ 30 min = 1 × w_j, with the exception of tcp2db. Tcp2db latency measurements have a stable long-term character, with only small deviations in proportion to the overall latency, but with large deviations in proportion to the tcp2db measurements themselves. Therefore, tcp2db events marked as suspicious according to the metric detection conditions also exhibit a long-term character. This could be caused by the discrepant nature of intra- and inter-cloud communication compared to client-server communication, or by the periodic oscillation of the latency of a multi-tier Cloud application described in [1].
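Persistence as plotted in Fig. 8 can be obtained as the run lengths of consecutive suspicious windows in R; a minimal sketch:

# Sketch: persistence = lengths of runs of consecutive suspicious windows in R.
def persistence(r):
    runs, current = [], 0
    for marked in r:
        if marked:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

# Example: two suspicious episodes, lasting 1 and 3 windows respectively.
print(persistence([0, 1, 0, 0, 1, 1, 1, 0]))   # -> [1, 3]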
Concurrent detections from geographically dispersed clients: Fig. 9 shows a histogram associating the number of geographically dispersed clients detecting a suspicious event within the same time window and measurement type. We distinguished the combinations of databases for http2ws(db) and overall(db). Most of the non-database-dependent suspicious events are detected at a single client (indicating a local or a network issue), whereas the database-dependent suspicious events are detected at multiple clients concurrently (indicating a data-center issue). This implies that the proposed geographically dispersed detection approach is feasible and allows us to pinpoint the location of the issue and the affected protocols, or to create statistics of the performance of cloud services per client location.

Fig. 9. Concurrent detections from geographically dispersed clients. The histogram shows event detections from multiple clients in the same time window. The measurement-type axis has separate http2ws(db) and overall(db) sections to distinguish combinations of concurrent db detections made in one time window in combination with an http2ws connection to a data center.

2) Interpretation: We analyze the detected events to show how the multi-dimensional vector space created by the geographically dispersed vantage points can be used in conjunction with the multi-layer protocol analysis to interpret the results. Based on the findings in Fig. 8 we focused on the triggering events not linked to db measurements. Fig. 10 shows the results R where the triggering events overall(null) and http2ws(null) were partitioned according to the underlying events identified in the same w_j. We conclude that the majority of suspicious events perceived by users in overall(null) measurements (Fig. 10a) are caused by http service issues, as the majority of the http2ws(null) events (Fig. 10b) do not occur concurrently with underlying suspicious events.

Fig. 10. Interpretation of suspicious events in relation to a triggering event. Pie charts show the distribution of suspicious events in relation to the a) overall(null) [177 local, 12 global issues] and b) http2ws(null) [455 local, 26 global issues] events, and their interpretation. A local issue represents an event detected at one client, a global issue an event detected at multiple clients, in one or more measurements concurrently. For overall(null): local - 9% only overall(null) (weak positive), 76.7% http2ws(null) (http service issue), 0.5% tracert & tcp2ws (connectivity issue), 6.9% tcp2ws & http2ws (data connection issue), 0.5% tracert & tcp2ws & http2ws(null) (connectivity issue); global - 6.3% http2ws(null) (http service issue). For http2ws(null): local - 91.5% only http2ws(null) (http service issue), 0.2% tracert (path issue), 2.7% tcp2ws (data connection issue), 0.2% tracert & tcp2ws (connectivity issue); global - 5.2% only http2ws(null) (http service issue), 0.2% tracert (path issue).

The examples in Fig. 11 and 12 show different results of the event interpretation thanks to the multi-dimensional character of the measurement data. Fig. 11 shows a suspicious event detected at a single client in multiple measurement types (layers), indicating a local issue - a particular case of the tcp2ws event from Fig. 10b). On the contrary, Fig. 12 shows a suspicious event detected by multiple geographically dispersed clients, indicating a datacenter issue, as in the detections shown in Fig. 9. The examples show how the interpretation of the triggering event is divided across different dimensions, where we investigate the affected location and the underlying events.

Fig. 11. Example of an isolated (local) suspicious event. a) tcp2ws from all clients, b) http2ws(null) from all clients, c) tcp2ws and http2ws(null) from client au marked as suspicious, d) results of the metrics for tcp2ws and the final result R, e) results of the metrics for http2ws and the final result R.


Fig. 12. Example of a concurrent suspicious event. a) http2ws(db) from all clients (us, ru, jp, cz, br, au) to the same data center in combination with the same db, b) final results R of the metric detections for each client (a green line indicates a window not marked as suspicious, as either one or no metric marked it).
VI. CONCLUSION

We have presented mechanisms for detection and interpretation of suspicious events in cloud-service latency-measurement time series. We have learned that numerous disturbances of steady cloud-service latency occur, splitting into groups with differing explanations. We show that building an interpretation tree, allowing for fast determination of an event's cause, is feasible and helpful, and that using multiple metrics, each having its own merit, brings further insight. These metrics can now easily be applied to other network-trace time series, with a better understanding of the nature of the anomalies they capture.

Our event interpretation is largely speculative; however, even the limited information allows in most cases an educated estimation of the problem cause, based on the multi-dimensional character of the measurements. Verification with ground-truth data from actual datacenter operation, confirming the above estimation, would do much to improve the validity of this research and is going to be the subject of our next work. For comparison, we also plan to experiment with other cloud-computing providers' infrastructure.

As further work in the area of latency-measurement interpretation, we intend to focus on incorporating more advanced machine-learning techniques into the proposed framework, on providing the automated analysis in real time, and on means of utilizing the outcomes for immediate optimization of the user or datacenter processes (e.g. alerts, relocation, process re-mapping, etc.). We encourage the research community to exploit the freely available traces at [32] to validate and/or improve the presented work.

ACKNOWLEDGEMENTS

We thank Cisco Systems, Inc., for generously supporting this work as part of its Collaborative Research Program.

REFERENCES

[1] O. Tomanek and L. Kencl, "CLAudit: Planetary-scale cloud latency auditing platform," in Proceedings of IEEE CloudNet, 2013.
[2] K. P. Gummadi, S. Saroiu, and S. D. Gribble, "King: Estimating latency between arbitrary internet end hosts," in Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, 2002, pp. 5-18.
[3] K. Johnson, J. Carr, M. Day, and M. Kaashoek, "The measured performance of content distribution networks," Computer Communications, vol. 24, pp. 202-206, 2001.
[4] J. D. Brutlag, "Aberrant behavior detection in time series for network monitoring," in USENIX LISA, 2000, pp. 139-146.
[5] F. Feather, D. Siewiorek, and R. Maxion, "Fault detection in an ethernet network using anomaly signature matching," in SIGCOMM, 1993, pp. 279-288.
[6] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, "Sketch-based change detection: Methods, evaluation, and applications," in IMC, 2003, pp. 234-247.
[7] R. Minnear, "Latency: The Achilles heel of cloud computing," in Computing Journal, 2011.
[8] Nagios. [Online]. Available: http://www.nagios.org/
[9] MRTG. [Online]. Available: http://oss.oetiker.ch/mrtg/
[10] Cedexis. [Online]. Available: http://www.cedexis.com/
[11] GWOS. [Online]. Available: http://www.gwos.com/
[12] CloudHarmony. [Online]. Available: https://cloudharmony.com/
[13] CloudClimate. [Online]. Available: http://www.cloudclimate.com/
[14] Eucalyptus. [Online]. Available: https://www.eucalyptus.com/
[15] IBM SmartCloud Monitoring. [Online]. Available: http://www-03.ibm.com/software/products/en/ibmsmarmoni
[16] OpenStack monitoring. [Online]. Available: http://www.openstack.org/
[17] Nimsoft Cloud User Experience Monitor. [Online]. Available: https://cloudmonitor.nimsoft.com/
[18] Dynatrace. [Online]. Available: http://www.dynatrace.com/
[19] CloudSpectator. [Online]. Available: http://cloudspectator.com/
[20] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing network-wide traffic anomalies," in SIGCOMM, 2004, pp. 219-230.
[21] PlanetLab. [Online]. Available: http://www.planet-lab.org/
[22] M. Zhang et al., "PlanetSeer: Internet path failure monitoring and characterization in wide-area services," in OSDI, 2004.
[23] Z. Hu et al., "The need for end-to-end evaluation of cloud availability," Computer Science, vol. 8342, pp. 119-130, 2014.
[24] G. Aceto, A. Botta, W. Donato, and A. Pescapè, "Cloud monitoring: A survey," Computer Networks, vol. 57, pp. 2093-2115, 2013.
[25] S. Ostermann et al., "A performance analysis of EC2 cloud computing services for scientific computing," Cloud Computing, vol. 34, pp. 115-131, 2010.
[26] Z. Hill and M. Humphrey, "A quantitative analysis of high performance computing with Amazon's EC2 infrastructure: The death of the local cluster?" in Grid Computing, 10th IEEE/ACM International Conference, 2009, pp. 26-33.
[27] E. Walker, "Benchmarking Amazon EC2 for high-performance scientific computing," LOGIN, vol. 33, pp. 18-23, 2008.
[28] K.-P. Chan and A. W.-C. Fu, "Efficient time series matching by wavelets," in Data Engineering, 1999, pp. 126-133.
[29] I. Popivanov and R. J. Miller, "Similarity search over time-series data using wavelets," in Data Engineering, 2002, pp. 212-221.
[30] J. Lin and Y. Li, "Finding structural similarity in time series data using bag-of-patterns representation," Computer Science, vol. 5566, pp. 461-477, 2009.
[31] Windows Azure. [Online]. Available: http://www.windowsazure.com
[32] CLAudit. [Online]. Available: http://claudit.feld.cvut.cz/claudit/data.php

