
Received: 20 July 2020 Revised: 21 September 2020 Accepted: 3 November 2020

DOI: 10.1002/spe.2934

EXPERIENCE REPORT

Aggregating data center measurements for availability analysis

Élisson da Silva Rocha1 Leylane G. F. da Silva1 Guto L. Santos1


Diego Bezerra1 André Moreira1 Glauco Gonçalves2
Maria Valéria Marquezini3 Amardeep Mehta4 Mattias Wildeman4
Judith Kelner1 Djamel Sadok1 Patricia T. Endo1

1 Universidade de Pernambuco, Pernambuco, Brazil
2 Universidade Federal Rural de Pernambuco, Pernambuco, Brazil
3 Ericsson Telecomunicações S.A, São Paulo, Brazil
4 Ericsson Sweden, Stockholm, Sweden

Correspondence
Patricia Takako Endo, Universidade de Pernambuco, Pernambuco, Brazil.
Email: patricia.endo@upe.br

Summary
A data center infrastructure is composed of heterogeneous resources divided into three main subsystems: IT (processor, memory, disk, network, etc.), power (generators, power transformers, uninterruptible power supplies, distribution units, among others), and cooling (water chillers, pipes, and cooling towers). This heterogeneity brings challenges for collecting and gathering data from several devices in the infrastructure. In addition, extracting relevant information is another challenge for data center managers. While seeking to improve cloud availability, monitoring the entire infrastructure with a variety of (open source and/or commercial) advanced monitoring tools, such as Zabbix, Nagios, Prometheus, CloudWatch, AzureWatch, and others, is required. It is common to use several monitoring systems concurrently to collect real-time data for data center components from different subsystems. Such an environment brings an inherent challenge stemming from the need to aggregate and organize all the collected infrastructure data and measurements. This first step is necessary prior to obtaining any valuable insights for decision-making. In this paper, we present the Data Center Availability (DCA) System, a software system that is able to aggregate and analyze data center measurements aimed toward the study of DCA. We also discuss the DCA implementation and illustrate its operation by monitoring a small university research laboratory data center. The DCA System is able to monitor different types of devices using the Zabbix tool, such as servers, switches, and power devices, and to automatically identify the failure-time seasonality and trends present in the data collected from different devices of the data center.

KEYWORDS
availability, data center, data collection, software system, Zabbix



1 INTRODUCTION

A data center infrastructure is composed of heterogeneous resources which are organized in three main subsys-
tems: (i) the IT subsystem, which refers to networking devices, storage devices, interconnection cables, servers and
associated hardware and software including processors, memory, disks, networks, and fans; (ii) the power sub-
system, composed of all equipment intended for supplying power to the data center such as generators, power
cables, power transformers, uninterruptible power supplies (UPS), power distribution units (PDU), among oth-
ers; and (iii) finally the cooling subsystem, which is composed of water chillers, computer room air condition-
ers (CRACs), pipes, and cooling towers. Cooling components are used to maintain the temperature inside the
data center at recommended levels. Factors such as the data center complexity and the criticality of the hosted services can demand efficient and extensive monitoring, which includes identifying abnormal behavior of the resources.1 From the perspective of data center operators, it is essential to understand failures and their causes to
avoid several problems such as financial losses, service interruptions, decreased productivity, and damaged business
reputations.2,3
Nowadays, operators rely on many strategies to monitor data center resources, such as standards, protocols, and purpose-built software. According to Reference 4, standards are essential in data center management in order to simplify
management of the large diversity of data center components (servers, appliances, storage, network devices, etc.). They aim
to prevent and reduce the impact of critical events, to improve workflow management, to improve capacity planning
through key performance indicators (KPI), and to facilitate general performance monitoring.5
There is a variety of free open source or commercial monitoring and/or alerting systems to support data center oper-
ator’s activities. Examples of these include Zabbix, Nagios, Prometheus, CloudWatch, AzureWatch.6-8 These monitoring
systems provide information that helps controlling the environment, gathering real-time information related to data cen-
ter resources as CPU consumption, RAM memory usage, disk space, UPS input and output power, among several other
metrics. In this case, it is common to concurrently use many monitoring systems to collect real-time data of data center
components at different subsystems.
Nonetheless, such a variety of data center monitoring solutions introduces significant heterogeneity in tools and data formats across data center vendors, mostly across subsystems. It is common to find four or five different monitoring or automation systems running concurrently in a data center, each handling different critical systems independently.9
Thus, data center heterogeneity brings challenges mainly with the need to adopt different solutions leading to the usage
of various approaches to collect and export data. This makes it difficult to correlate problems, failures, and performance
issues by processing distinct resources and subsystems.
Some initiatives have been making efforts to mitigate this problem with the development of new standards and proto-
cols. As an example, we can cite the DMTF Redfish standard, which offers a specification and schema for system management. This set of standards includes the definition of resource properties and actions, redundancy information, and the relationships between resources and services. It is intended to cover additional data center subsystems, namely power and cooling.10
However, in the meantime and until such standards are commercially supported, data center managers are faced with
the use of existing commercial and open source islands of solutions.
The adoption of several isolated management and device- or system-specific tools results in numerous data sources and formats that are often difficult to integrate. Managers need solutions that provide global performance
evaluations, correlate different events, and optimize entire services. Unified access to data center subsystems and their
granular resources is essential for successful data center management.
In this work, we propose the DCA (Data Center Availability) System, a software system able to (i) collect and aggre-
gate data center measurements from different monitoring tools and (ii) perform component failure analysis. The proposed system integrates data through a template strategy that unifies data structures. After standardization,
data may be aggregated and analyzed by subsequent management and processing modules. As a proof of concept, we
present an actual implementation of the DCA System running over a small data center at a campus laboratory. We
will also illustrate, through examples, how the methodology can be effectively used to collect insights about device
failures.
This work is organized as follows: in Section 2, we describe in detail our methodology by presenting its require-
ments and main modules. In Section 3, we present the implementation of the methodology using the Zabbix
tool. In Section 4, we present the related work. Finally, in Section 5, we conclude this work, highlighting our main contributions.

2 THE DCA SYSTEM

In this section, we present the DCA System that aggregates data center measurement data prior to analysis and extracts
relevant information for data center managers. We highlight that the DCA System was designed to meet the following requirements:

I. Support multiple data center management tools that apply to different data center software and hardware across IT, power, and cooling components;
II. Support seamless plug-in for new management tools;
III. Monitor and analyze a wide spectrum of data (IT, power, and cooling).

In order to meet these requirements, we proposed the architecture shown in Figure 1. As one may note, the DCA System consists of three main modules: Aggregator Collector Modules (ACMs), Aggregator, and Data Analyzer. In the next subsections, we explain each module methodologically, and in Section 3 we describe in detail how the modules work together to satisfy the established requirements. Note that, in this current version, the DCA System does not include data visualization and storage, as there are many open source tools that can easily be used to handle these tasks.

2.1 Aggregator collector modules

An ACM is an intermediate module that sits between a specific monitoring system and the DCA Information System
(composed of Aggregator and Data Analyzer). The main roles of an ACM are:

I. Collect measurements of data center components captured from specific monitoring systems (such as databases,
system log files, network protocol logs, application logs, APIs, etc.);
II. Parse the component’s data into new templates with standard attributes (as described in Section 3.1); and
III. Send the collected data to an aggregator (Section 2.2) that groups the data to be analyzed by the Data Analyzer
(Section 2.3) of DCA System.

F I G U R E 1 The Data Center Availability (DCA) System architecture composed of three main modules: Aggregator Collector Modules
(ACMs), Aggregator, and Data Analyzer. Note that the Aggregator and the Data Analyzer compose the DCA Information Core [Color figure
can be viewed at wileyonlinelibrary.com]

Note that an ACM must define templates with sets of standard attributes in order to facilitate the DCA System's task of analyzing the collected data. Furthermore, these templates also allow the DCA System to faithfully represent and save the relevant information related to the monitored components in the data center infrastructure. For more details about the adopted templates and their structures, we refer the reader to Section 3.1.
Considering that the main ACM task is converting component information gathered by the diverse monitoring tools into the DCA System template, whenever another tool is integrated into the DCA System, a new ACM (or backend plugin) for the respective tool must be developed. Depending on the monitoring data format that this tool provides,
the template information could be extracted automatically through Named-Entity Recognition techniques.11 For example,
if the input format is a log, the fields presented in the DCA template have to be extracted automatically from the raw text
using machine learning techniques.12

2.2 Aggregator

Data center infrastructure is currently monitored using a myriad of monitoring systems (such as Zabbix, Prometheus,
CloudWatch, etc.). As a result, a single data center component (e.g., a computer system) could also be monitored by more
than one of those tools. For instance, a tool may collect data about computational resource consumption of a server (CPU
and memory utilization) whereas another one may concurrently monitor the temperature of the same server.
All data related to a given component must be aggregated in order to provide a holistic view and enable more detailed
analysis of components. Therefore, the main role of the Aggregator module will be to integrate, in a single component
representation, the measurement data and configuration information obtained from a variety of monitoring systems.
The Aggregator will work with the component’s data (in a template format) obtained by the ACM. Once all the data
(see the Component structure in Table 1) related to a data center component is aggregated, it is then stored in a database.
This data will be stored with the respective measurement time (present in the DCA template) resulting in streams of time
series information. This representation facilitates data visualization for operators and data analysis performed by the Data
Analyzer, for example, to determine the presence of trends and seasonality.

T A B L E 1 Template of standard attributes


Structure / attribute Type Description

Component Represents any data center equipment or hardware


ID object Uniquely identify a component
componentType Type Type of component (CRAC, disk, generator, PDU, etc.)
monitoring<Measurement> Monitoring data from a component
iLinks<Links> Internal links to other components
eLinks<Links> External links to other components
Measurement Represents any measurement data from a component
measurementType Type Type of measurement data (disk usage, CPU temperature, etc.)
values<RawData> Measurement values obtained from monitoring tool
Links Represents the relationship between components
redundancyType Type Type of redundancy (see Table 2)
IDs<object> ID(s) of component(s) that it is connected to
Type Represents a type (component, measurement, and redundancy)
ID object Uniquely identify a type
name object Friendly name of the type
RawData Represents the raw data of a monitored component
time Long Collection time as a timestamp
value Double Raw value

2.3 Data analyzer

The Data Analyzer module provides valuable insights for decision-making extracted from the raw data collected from the
data center infrastructure. Therefore, the Data Analyzer’s main roles are to periodically retrieve the information stored by the Aggregator module; to perform the needed preprocessing on the stored data; and to apply techniques that identify patterns in the data in order to provide relevant information for the data center manager.
In addition, the measurement time is a required attribute in our template (see Section 3.1). Treating the data as time series, the Data Analyzer can use standard time series techniques to evaluate various aspects of the data, such as periodicity
analysis, trend detection, making predictions, etc. For example, considering CPU consumption of a server group, the Data
Analyzer may perform periodicity tests to identify whether the consumption increases to 90% every 3 days for no apparent
reason. In this case, the data center manager can be notified and invited to perform a more careful analysis and identify
the possible cause of this increase in consumption.

3 IMPLEMENTATION: A USE CASE WITH ZABBIX

In this section, we present the implementation of the proposed methodology centered around the Zabbix management
tool1 as a source of measurement data. Zabbix is software for real-time monitoring of different metrics generated by different devices, such as servers, virtual machines, and network devices. This information is stored in a database and can be gathered by external tools using the Zabbix API2 that exposes such data.
Firstly, we present the templates that the monitoring tools must comply with in order to send the data collected
(Section 3.1). Next, we describe the ACM implementation for the Zabbix tool (Section 3.2). Finally, the implementation of the Aggregator module (Section 3.3), the API developed for integration with the Thingsboard tool (Section 3.4), and the Data Analyzer module (Section 3.5) are detailed.

3.1 Templates of standard attributes

In this subsection, we describe the templates and their standard attributes defined to allow the analysis of the data center
infrastructure components. Please note that these templates are independent of the monitoring tool and contain all the
required attributes to represent any data center component as well as its temporally labeled measurement data.
Table 1 presents the description of the standard attributes present in our templates. These templates refer to five structures: Component, Measurement, Links, Type, and RawData.
The Component structure applies to any data center equipment or hardware, such as CPU, disk, server, fan, power
supply, router, switch, CRAC, generator, among others. It consists of five attributes: component’s ID, which uniquely
identifies a component (e.g. the component ID in a given monitoring system); the componentType, used to represent the
correspondent type of the component (e.g. CRAC, disk, generator, etc.); monitoring, that is a set of any monitoring data
types from a component (e.g., memory usage, fan state, free disk space, UPS battery level, generator oil level, etc.); iLinks,
which establishes internal relationships between this resource and other components (e.g., a server would have links to
its CPU, network interface, and disk associated to it); and eLinks, a set of external links to other related components (e.g.,
an eLink can be used for instance to establish a distribution path of interconnected resources in a power subsystem such
as generator, UPS, and PDU).
The Measurement structure represents any measurement data from a component and consists of only two attributes.
The measurementType uniquely qualifies the type of the measurement data (e.g., disk usage, CPU temperature, etc.),
whereas the values attribute represents a set of specific measurement values obtained by the monitoring tool. Note that this last attribute is of the object type, in order to allow developers to define their own object structure (if required), with attributes according to their needs.
The Links structure represents the relationship between internal and external components. It also consists of only two attributes. The redundancyType represents the type of redundancy used between components, which refers to the type of arrangement of duplicated components needed for the system to operate correctly. In Table 2, the type of

1 https://www.zabbix.com/features
2 https://www.zabbix.com/documentation/4.0/pt/manual/api

T A B L E 2 Redundancy types

ID Name

1 N
2 N+M
3 K/N

redundancy relates to a specific ID, where “1” represents no redundancy, “2” represents redundancy of M components,
and “3” represents a redundancy where K of the N components should be available for the system operation. Finally, the
IDs attribute represents a set of one or more other components’ ID that the specific component is connected to.
The Type structure classifies different components, measurements, and links to each other. It consists of two attributes:
the ID which uniquely identifies a type; and the name, that is a user friendly name of the type. Please note that Component
and Measurement structures include an attribute of the structure Type, in order to uniquely identify different compo-
nents in a data center (disk, fan, generator, CRAC, etc.) and different measurements (free disk space, UPS battery level,
etc.), respectively. Besides, in Links structure, we have the redundancyType attribute as a Type structure to represent the
different redundancy types (see Table 2).
Lastly, the RawData structure associates time-series data obtained from a monitored resource to its corresponding component. It also consists of two attributes: time, the collection timestamp; and value, the raw value. We note that a
list of RawData structures is used as an attribute in the Measurement structure in order to hold the measurement times
and values obtained from the monitoring tool.

3.2 Aggregator collector module

As explained previously, an ACM developed for a given monitoring tool collects data and adapts it to the template presented in Section 3.1. In this subsection, we present how an ACM for data collection from the Zabbix monitoring tool was implemented. The development of ACMs for other monitoring tools (e.g., Prometheus) needs to follow the standards and interfaces of the respective tools (APIs, libraries, and protocols). In this paper, we focus on Zabbix, but new ACM versions can be developed in future work.

3.2.1 ACM Zabbix

The ACM Zabbix module was developed using the Python programming language version 3.63 to collect data from servers,
network switches, UPS, among others, through the API provided by Zabbix. This process of collecting data and forwarding
it to the DCA System is illustrated in Figure 2.
Three input configuration files are used by the ACM Zabbix module. The Zabbix configuration file contains data
needed to access the Zabbix instance (USER_ZB, IP_ZB, PORT_ZB, and PASSWORD_ZB) and to establish communica-
tion with the Aggregator module (IP_AGG and PORT_AGG) (see Table 3), as well as the period of time for sending ACM
data to the Aggregator module (COLLECTION_PERIOD).
The Host item type file configures the host item types that Zabbix should collect. Examples of such configuration include CPU, memory, hardware, system, disk, ping, toner, printing, APACHE, bind, NET, serial, battery, input, power, and output. When the file field is left blank, it is assumed that all types must be collected by default (see Table 4). Finally, the
Host configuration file defines which hosts will be monitored by the ACM (see Table 5). If the user needs to monitor
all components already monitored by Zabbix, only the word "all" should be defined in the configuration file. On the other hand, if the user needs to specify the hosts to be polled by the ACM, the user must inform the names of such hosts in the file through the id_server registered in Zabbix. If the user does not have access to the content present in Zabbix, the function that collects the IDs of all hosts present in Zabbix must be executed before configuring the Zabbix ACM.
3 https://www.python.org/psf/

FIGURE 2 Aggregator Collector Module Zabbix components [Color figure can be viewed at wileyonlinelibrary.com]

T A B L E 3 Connection file structure


Attribute Description

USER_ZB Zabbix User name


IP_ZB Zabbix IP address
PORT_ZB Zabbix Port
PASSWORD_ZB Zabbix Password
IP_AGG Aggregator IP address
PORT_AGG Aggregator Port
COLLECTION_PERIOD Request interval in seconds

T A B L E 4 Host item monitoring types file structure


Attribute Description

ID Uniquely identify a type


type Name of the type

T A B L E 5 Hosts monitoring file structure


Attribute Description

ID_server Uniquely identifies a Zabbix host

* all: when filled with "all", all hosts already monitored by Zabbix must be obtained

As shown in Figure 2, the ACM loads all information from the input configuration files and, after that, periodically requests the collected data from the Controller (as specified by the input file). The Controller collects the data and fits it into the necessary template. Next, it runs a clustering algorithm in order to help with component-type identification. Finally, the Controller sends the collected data to the DCA System, that is, the Aggregator and Data Analyzer modules.
Regarding data collection, the Controller calls the Service module, which creates a connection to Zabbix through its
API and performs requests to collect information about the monitored components, following the configuration present
in the respective file. When data is received, the relevant information is extracted to fit the template previously defined
(see Section 3.1), in order to be sent forward.
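As an illustration, the sketch below shows how such a Service module might query the Zabbix JSON-RPC API to authenticate and list the items of the configured hosts. It is a minimal sketch, not the actual ACM code: the endpoint path and the user.login/item.get methods belong to the public Zabbix API, while the connection constants and host IDs are hypothetical placeholders standing in for the values read from the configuration files (Table 3).

```python
# Minimal sketch (not the actual ACM code) of querying the Zabbix JSON-RPC API.
# The connection constants below are hypothetical placeholders; in the ACM they
# come from the Zabbix configuration file (Table 3).
import requests

IP_ZB, PORT_ZB = "192.0.2.10", 80
USER_ZB, PASSWORD_ZB = "Admin", "zabbix"
ZABBIX_URL = f"http://{IP_ZB}:{PORT_ZB}/api_jsonrpc.php"

def zabbix_call(method, params, auth=None):
    """Send one JSON-RPC request to the Zabbix API and return its 'result' field."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": 1, "auth": auth}
    response = requests.post(ZABBIX_URL, json=payload)
    response.raise_for_status()
    return response.json()["result"]

# Authenticate and obtain an API token.
token = zabbix_call("user.login", {"user": USER_ZB, "password": PASSWORD_ZB})

# Collect the items monitored for the hosts listed in the host configuration file.
host_ids = ["10105", "10106"]  # hypothetical id_server values
items = zabbix_call("item.get",
                    {"hostids": host_ids, "output": ["itemid", "name", "key_", "lastvalue"]},
                    auth=token)
```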
Through the Zabbix API, it is not possible to collect the type of the component (server, switch, UPS, etc.) and this infor-
mation is mandatory for the template. To overcome this issue, the Controller starts the Cluster module that automatically
groups related components based on the similarity of the monitored data.
To create the clusters, all components and the information collected about them through the Zabbix API (such as disk usage, CPU, temperature, etc.) are considered. The component data are text data, which we need to convert into numeric

FIGURE 3 Template definition (JSON example)

vectors to create the clusters. Thus, we applied the Word2Vec technique, which is a widely used algorithm to extract
low-dimensional representations of words.13 Afterward, these vectors are used to create the component clusters using
the k-means algorithm.14 The k-means technique aims to divide the elements (component vector representations) into k groups. To obtain the optimal number k of clusters, our proposal varies the k value by analyzing and optimizing the Within-Cluster-Sum-of-Squares (WCSS) metric. The WCSS metric calculates the distance between the points and their
centroid, as shown in Equation (1).

\mathrm{wcss} = \sum_{j=1}^{k} \sum_{i \in \mathrm{cluster}\ j} (X_i - Y_j)^2, \qquad (1)

where Y_j represents the centroid of cluster j and X_i represents a point in cluster j.
Our strategy computes wcss for increasing values of K and, when the ratio between the current wcss and the reference wcss (i.e., the wcss calculated for K=1) falls below 0.05, the corresponding K value is considered optimal, since wcss decreases only marginally as K increases beyond this point. Finally, we group the hosts by similarity of monitored data into the K groups defined by the algorithm.
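The fragment below is a minimal sketch of this K-selection rule, assuming the component vectors have already been produced (for example, by Word2Vec over the component descriptions); random vectors are used only as a stand-in, and scikit-learn's inertia_ attribute plays the role of the WCSS.

```python
# Sketch of the WCSS-ratio rule for choosing K: increase K until the WCSS drops
# below 5% of the reference WCSS computed for K = 1. Random vectors stand in for
# the Word2Vec embeddings of the component descriptions.
import numpy as np
from sklearn.cluster import KMeans

def choose_k(vectors, ratio_threshold=0.05, k_max=20):
    """Return (k, cluster labels) using the WCSS-ratio stopping rule."""
    reference_wcss = KMeans(n_clusters=1, n_init=10, random_state=0).fit(vectors).inertia_
    for k in range(2, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
        if model.inertia_ / reference_wcss < ratio_threshold:
            return k, model.labels_          # gains become marginal beyond this K
    return k_max, model.labels_

component_vectors = np.random.rand(50, 16)   # stand-in for the real embeddings
k, labels = choose_k(component_vectors)
```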
After this step, the third and last task is to send the data to the DCA System (Aggregator module). The data is trans-
formed into JSON and sent by a POST request to the Aggregator, at the IP address and port configured in the Zabbix configuration file.
For an example of a JSON document created by the ACM, see Figure 3. In this figure, one can see a server (ServerB, ID 123) that has information about its CPU, memory, and disk components.
Among the JSON data, one can see that the component with ID 123 does not contain monitoring data; however, there are three components connected to it, with IDs 321, 322, and 323 (see the iLinks). The component with ID 321 contains monitoring data for the cpu_load resource. This component has no internal links to any other one but has an external link (eLinks) to ServerB (ID 123).
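Since Figure 3 is not reproduced here, the snippet below gives a hedged sketch of what such a payload could look like and of the POST that forwards it to the /aggregator/component endpoint (Table 6). The field names follow the template of Table 1, but the exact serialization, the timestamps, and the IP/port constants are illustrative assumptions.

```python
# Hedged sketch of an ACM payload for the ServerB example and of the POST request
# to the Aggregator endpoint of Table 6. Values and addresses are illustrative.
import requests

IP_AGG, PORT_AGG = "192.0.2.20", 8080        # placeholders for the configured Aggregator

components = [
    {   # ServerB carries no monitoring data itself, only internal links.
        "ID": 123,
        "componentType": {"ID": 1, "name": "server"},
        "monitoring": [],
        "iLinks": [{"redundancyType": {"ID": 1, "name": "N"}, "IDs": [321, 322, 323]}],
        "eLinks": [],
    },
    {   # The CPU component holds the cpu_load measurements and points back to ServerB.
        "ID": 321,
        "componentType": {"ID": 2, "name": "cpu"},
        "monitoring": [{
            "measurementType": {"ID": 10, "name": "cpu_load"},
            "values": [{"time": 1583020800000, "value": 0.42}],
        }],
        "iLinks": [],
        "eLinks": [{"redundancyType": {"ID": 1, "name": "N"}, "IDs": [123]}],
    },
]

requests.post(f"http://{IP_AGG}:{PORT_AGG}/aggregator/component", json=components)
```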

T A B L E 6 REST API Functions


Method Function Endpoint Description

POST registerComponent /aggregator/component To receive data from Aggregator Collector Module


GET getAllComponents /aggregator/component Get all received components
GET getComponentById /aggregator/component/{id} Get a specific component by id

FIGURE 4 Aggregator components [Color figure can be viewed at wileyonlinelibrary.com]

3.3 Aggregator module

The Aggregator module is composed of a REST API developed with Spring Boot4 and all implemented functions are shown in Table 6.
The Aggregator components are shown in Figure 4. The Aggregator module is responsible for mapping all data from
ACM module according to the template syntax (see Section 3.1).
The data received from the ACM module, through the POST method, should be processed and stored later using
the registerComponent function. Additionally, this data may be recovered through the getAllComponents and
getComponentById functions, to gather all components and to obtain a specific component by ID, respectively.
As shown in Figure 4, the Aggregator receives data from the Zabbix ACM, through a REST API. The received data is
processed and stored in the Thingsboard platform,5 using the Cassandra database. Thingsboard is an open-source IoT platform used for data collection, processing, visualization, and device management. More details about the used
Thingsboard API functions are presented in Section 3.4.
The Thingsboard platform works with two main abstractions: devices and telemetry data. Devices represent the com-
ponents (servers, UPS, networking switches, etc.) of our template, which are identified by name and type; telemetry data is associated with a specific measurement (key) and a timestamp in our template.
The Aggregator Controller verifies the component links. If a component includes iLink data, this implies that the component is a device and that all elements mentioned in its iLinks should be converted into monitoring data to be saved as telemetry data in the Thingsboard platform; see the example shown in Figure 5. The component with ID "A" has two iLink IDs: "B" and "C". These two components, "B" and "C", refer to the monitoring data of component "A". Therefore, component "A" should be saved as a device, whereas the other components should be saved as its telemetry data.
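The Aggregator itself is implemented in Java with Spring Boot (Section 3.3); the Python-style sketch below only illustrates the mapping rule just described, under the assumption that components arrive as dictionaries following the template of Table 1: a component carrying iLinks becomes a Thingsboard device, and the linked components become its telemetry entries.

```python
# Illustrative sketch (the real module is Java/Spring Boot): turn a component that
# carries iLinks into a Thingsboard device plus a list of telemetry entries.
def split_device_and_telemetry(component, all_components):
    linked_ids = [cid for link in component.get("iLinks", []) for cid in link["IDs"]]
    device = {"name": str(component["ID"]), "type": component["componentType"]["name"]}
    telemetry = []
    for linked in all_components:
        if linked["ID"] not in linked_ids:
            continue
        for measurement in linked["monitoring"]:
            key = measurement["measurementType"]["name"]      # e.g., "cpu_load"
            for sample in measurement["values"]:
                telemetry.append({"ts": sample["time"], "values": {key: sample["value"]}})
    return device, telemetry
```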

4 https://spring.io/projects/spring-boot
5 https://thingsboard.io/

FIGURE 5 JSON example of iLinks and eLinks

The Aggregator Controller also verifies the set of components to register. If a component already exists in the Thingsboard platform, only its measurement data should be recorded. In case the component does not exist, a new instance is inserted and registered, as well as its measurement data. Each measurement associated with a specific component is differentiated from the next one by a key in the Thingsboard platform. At the end, the Aggregator Controller calls the Analyzer module to perform the analyses on the collected time series, detailed in Section 3.5.

3.4 Thingsboard API

All methods and functions used to establish communication with the Thingsboard API are presented in Table 7. These methods are invoked by the Aggregator and the Analyzer modules. Before any request to the Thingsboard API, a Login request should be made to authenticate the connection through the OAuth authentication standard.6 The functions setDevice and setTelemetry are used to store data in the Thingsboard platform: the former is responsible for registering component data and the latter for registering measurement data, both obtained from the template.
In order to recover data, one can use getDeviceByName, getDevice, getKeys, and getTelemetry. The getDeviceByName function verifies whether a specific device exists in the Thingsboard platform before any new device registration takes place. The getDevice function searches for n devices on the Thingsboard platform, where n is passed as a parameter indicating how many devices one wants to retrieve. The getKeys function searches for all measurementType values registered for a device, and the getTelemetry function fetches the desired measurementType values. The functions getDevice, getKeys, and getTelemetry are used to get the time series that will be tested in the Analyzer.
Figure 6 shows an interaction diagram that demonstrates how the DCA System operates against the Thingsboard API. The figure starts with the Aggregator receiving a list of components from the ACM. With this data, the Aggregator calls Login on Thingsboard, which returns a token to access the tool. After Login, it checks whether each device is

6 https://oauth.net/

T A B L E 7 Thingsboard API functions


Method Function Endpoint Description Module

POST Login /auth/login Authentication in Thingsboard API Aggregator Analyzer


POST setDevice /device{?accessToken} Register a device in Thingsboard Aggregator
GET getDeviceByName /tenant/devices?deviceName Recover a device given a name Aggregator
GET getDevice /tenant/devices{?type,textSearch, Recover devices given a limit Analyzer
idOffset,textOffset,limit}
POST setTelemetry /plugins/telemetry/{entityType}/ Register measurement data from Aggregator
{entityId}/timeseries/{scope} device in Thingsboard
GET getKeys /plugins/telemetry/{entityType}/ Recover keys from a given device Analyzer
{entityId}/keys/timeseries
GET getTelemetry /plugins/telemetry/{entityType}/ Recover measurement data from a Analyzer
{entityId}/values/timeseries{?keys} given device

FIGURE 6 Diagram of interaction of the Data Center Availability system with the Thingsboard API

already registered on Thingsboard via getDeviceByName; if it does not exist, setDevice is called for registration. Then, the Aggregator sends the telemetry data for each device via setTelemetry. At the end of the procedure, the Data Analyzer is launched, which creates a connection with Thingsboard through Login. The Analyzer requests all registered devices with getDevice and all keys for each device with getKeys. Finally, with all devices and keys, the Analyzer requests the telemetry list with getTelemetry, which will be used for further analysis.
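A hedged sketch of two of these calls (Login and setTelemetry) is shown below, using the endpoints listed in Table 7. The base URL, the /api prefix, the credentials, and the device identifier are assumptions that depend on the Thingsboard installation; recent Thingsboard versions expect the JWT token in an X-Authorization header.

```python
# Sketch of the Login and setTelemetry calls of Table 7. Base URL, credentials,
# and device id are hypothetical; header and prefix conventions may vary with the
# Thingsboard version.
import requests

TB_BASE = "http://thingsboard.example.org:8080/api"   # hypothetical address

# Login: authenticate and obtain a JWT token for the remaining requests.
login = requests.post(f"{TB_BASE}/auth/login",
                      json={"username": "tenant@thingsboard.org", "password": "tenant"})
token = login.json()["token"]
headers = {"X-Authorization": f"Bearer {token}"}

# setTelemetry: register one measurement for a device (entityType = DEVICE).
device_id = "c1b2a3d4-0000-0000-0000-000000000000"    # hypothetical device id
requests.post(f"{TB_BASE}/plugins/telemetry/DEVICE/{device_id}/timeseries/ANY",
              headers=headers,
              json={"ts": 1583020800000, "values": {"cpu_load": 0.42}})
```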

3.5 Data analyzer

The Data Analyzer is the module of the DCA System that analyzes the collected time series of data center components and calculates key performance indicators related to failures in the monitored environment. In this work, since Zabbix does not have a measure that maps directly to availability, we assume that when there is no measured data in a given time interval for a given component, that component is considered to have experienced a failure during this time interval. Although this seems to be a naive approach, since the lack of data can occur due to a multitude of causes such as a network

F I G U R E 7 Data Analyzer components: Seasonality test and trend test components require data preprocessing and are available
through an API [Color figure can be viewed at wileyonlinelibrary.com]

failure, a monitoring software failure, or an actual component failure, we employ this approach as we seek to focus on explaining and validating the Analyzer services, such as the trend and seasonality analyses.
As explained previously, the Aggregator module processes the stored data as time series. Therefore, we implemented
two tests to evaluate the failures of the monitored components: the seasonality and the trend tests. The seasonality test
(see Section 3.5.1) is performed to understand if there is a periodic behavior in the system regarding the observed failures,
whereas the trend test (see Section 3.5.2) checks whether there is an increasing or decreasing behavior in the duration of
these failures.
The Data Analyzer module was developed using the Java programming language and its entire process can be seen
in Figure 7. The process is divided into four steps, where the first two steps are general preprocessing and the third and
fourth stages represent the actual tests that we perform, which can be executed concurrently.
The Data Analyzer gathers information of data center components stored in the Thingsboard platform. The returned
JSON-based file contains all measurements regarding individual components or groups of components. Each measurement consists of a tuple with the date and timestamp, and the respective measured value.
Using this JSON structure, the Data Analyzer retrieves the specific component’s time series (step 1) and per-
forms the first data preprocessing to define the sampling period of the data series. This is a necessary step because
each monitoring system aggregated by the DCA System can use a different sampling period for each different
component.
In order to define the sampling period of a time series X with length N, one needs to define the time between subsequent measurements. Thus, we compose a sequence Y, where each element Y_i of this sequence, as shown in Equation (2), is the difference between the times at which two subsequent samples X_{t_i} and X_{t_{i+1}} are taken. Afterward, the sampling period is defined as the mode of Y.

Y_i = |t_i - t_{i+1}|. \qquad (2)

After obtaining the sampling period, the time series is transformed into a binary time series that represents the pres-
ence or absence of data in the original time series (step 2). In other words, a sample is assigned the value 1 if there is data at the corresponding sample point and 0 otherwise.
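A minimal sketch of these two preprocessing steps is given below. It assumes jitter-free integer timestamps (in seconds) so that the mode of the gaps is well defined and the observed times fall exactly on the reconstructed grid; the real module works on the timestamps returned by Thingsboard.

```python
# Sketch of steps 1 and 2: sampling period as the mode of the gaps between
# consecutive timestamps, then a 0/1 presence series on a regular grid.
# Assumes jitter-free integer timestamps in seconds.
import numpy as np
from collections import Counter

def to_binary_series(timestamps):
    """Return (sampling_period, binary_series) for a sorted list of sample times."""
    gaps = np.diff(timestamps)                                   # Y_i = |t_i - t_{i+1}|
    sampling_period = Counter(gaps.tolist()).most_common(1)[0][0]
    grid = np.arange(timestamps[0], timestamps[-1] + sampling_period, sampling_period)
    observed = set(timestamps)
    # 1 where a measurement exists for the grid point, 0 where data is missing.
    binary = np.array([1 if t in observed else 0 for t in grid])
    return sampling_period, binary
```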
Using this binary time series, we next check whether there is a sufficient number of missing-data intervals. There should be at least 30 missing-data intervals in order to guarantee the statistical efficiency of the employed tests. If this is not the case, the process ends, because the number of samples is considered insufficient to perform the tests, and it will be executed again when new data are received from the ACM. On the other hand, when there is enough missing data, the process calls services to perform steps 3 and 4.
The services were developed using the Flask framework15 in the Python programming language, version 3.6. We chose
to implement these tests as services in order to leverage the scalability of the DCA System, since it is simple to scale
self-contained stateless services. Moreover, this architecture allows including further tests and evaluations in future ver-
sions of the system. In addition, we can use different technologies, frameworks and languages to develop such tests, as
long as they can be accessed through a service interface.

FIGURE 8 Seasonality test methodology

3.5.1 Seasonality test

The seasonality test is performed to check whether one or more failure periods are exhibited in the time series, which would indicate that the component is subject to nonrandom failures that deserve deeper investigation by the data center manager. To find the periods automatically, we adapted the methodology presented in References 16 and 17, performing some changes due to the different data shape we are considering. Figure 8 shows the seasonality test pipeline we implemented.
The seasonality test receives two parameters, the binary time series (step 2) and the sampling period (step 1), and initially goes through the first process, which is the low-pass filter. This filter is needed because, since we are working with a square wave (the binary time series), the periodogram tends to present false peaks at the odd harmonics of the fundamental frequencies/periods of the signal. A sixth-order low-pass filter at 1 Hz is used in our case.
After the low-pass filtering stage, the periodogram of the time series is computed (the periodogram function of the SciPy package7 was used). The periodogram gives the squared magnitude of the Fourier coefficient (named power) at each frequency. This way, a high-powered frequency (period) can be considered a relevant frequency in the time series. The function receives as inputs the binary time series and the sampling frequency in Hz. The sampling frequency was calculated as follows:
\mathrm{frequency} = \frac{1}{SP}, \qquad (3)

where SP is the sampling period provided in seconds.


An example of a periodogram is shown in Figure 9. Figure 9(A) presents an example of the original data, while Figure 9(B) is its periodogram. In this example, considering a sampling period of 1 s, one can observe a pattern in the data every 4 s (≈0.25 Hz).
In order to define which frequencies can be considered relevant in a time series, it is necessary to calculate a power
threshold that must distinguish the noise of the Fourier Transform from the dominant frequencies (or conversely, the
dominant periods).
To calculate the power threshold (following Reference 17), we shuffle the binary time series randomly, calculate its corresponding periodogram, and save the greatest power found. This process is performed 100 times in order to obtain the distribution of the maximum power of the noise in the series. When randomly permuted, the time series loses its periodic components. The 99th percentile of this distribution is chosen as the power threshold, that is, the value that determines which frequencies are considered relevant.
With the power threshold defined, we select the frequencies from the periodogram whose power is greater than the
power threshold (see Figure 9). These frequencies are the first candidates for the periods of the original time series. If no power is greater than the power threshold, we consider the sequence non-periodic.
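The sketch below outlines this front end of the seasonality test: low-pass filtering, periodogram computation with scipy.signal.periodogram, and the shuffle-based power threshold. The normalized filter cutoff and the fixed random seed are assumptions made only to keep the example self-contained; the paper's pipeline uses a sixth-order filter at 1 Hz.

```python
# Hedged sketch of the seasonality-test front end: filter, periodogram, and
# shuffle-based threshold. The normalized cutoff (0.5 of Nyquist) is a placeholder
# for the 1 Hz sixth-order filter described in the text.
import numpy as np
from scipy.signal import butter, filtfilt, periodogram

def candidate_frequencies(binary_series, sampling_period, n_shuffles=100):
    fs = 1.0 / sampling_period                      # Equation (3)
    b, a = butter(6, 0.5)                           # sixth-order low-pass filter
    filtered = filtfilt(b, a, binary_series.astype(float))

    freqs, power = periodogram(filtered, fs=fs)

    # Threshold: 99th percentile of the maximum power over random shuffles of the
    # binary series (a shuffled series has no periodic component).
    rng = np.random.default_rng(0)
    max_powers = [periodogram(rng.permutation(binary_series.astype(float)), fs=fs)[1].max()
                  for _ in range(n_shuffles)]
    threshold = np.percentile(max_powers, 99)

    return freqs[power > threshold]                 # first candidate frequencies
```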
Afterward, we used Density Clustering to group similar periods. Approximate periods can be given as relevant by the
periodogram due to spectral leakage in the Fourier Transform, which causes power to disperse over the spectrum around
the central frequency. For instance, a candidate period of 24 h and another period of 23 h and 30 min can probably be considered a single period.

7 https://www.scipy.org/

FIGURE 9 Periodogram result [Color figure can be viewed at wileyonlinelibrary.com]

F I G U R E 10 Clustering pseudocode

The clustering algorithm is shown in Figure 10. It receives the list of possible periods found in the periodogram, in ascending order, and returns a list of possible periods based on the cluster centroids.
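Since the pseudocode of Figure 10 is not reproduced here, the fragment below is only an assumption-based sketch of one simple way to obtain such centroids: merge candidate periods whose relative gap to the previous value is small and return the mean of each group.

```python
# Assumption-based sketch of grouping near-duplicate candidate periods (the paper's
# density clustering of Figure 10 may differ): merge values whose relative gap to
# the previous one is below a tolerance and return each group's mean as a centroid.
def cluster_periods(periods, rel_tolerance=0.05):
    """periods: candidate periods in ascending order; returns cluster centroids."""
    clusters, current = [], [periods[0]]
    for p in periods[1:]:
        if (p - current[-1]) / current[-1] <= rel_tolerance:
            current.append(p)            # close to the previous value: same cluster
        else:
            clusters.append(current)     # gap too large: start a new cluster
            current = [p]
    clusters.append(current)
    return [sum(c) / len(c) for c in clusters]
```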
Finally, these candidate centroids are checked against the autocorrelation function (ACF) of the filtered series, which measures the similarity of a series at different lags (periods). In this case, a high autocorrelation indicates a potential candidate period. In order to capture whether a centroid has a high autocorrelation, it is checked whether the centroid lies on a hill (where the ACF has downward concavity) or in a valley (where the concavity is upward).
Figure 11 shows an example of the ACF of a signal with some candidate centroids (green and red points). For each centroid, a quadratic function is fitted to the ACF in the vicinity of the centroid (the range from N/2 to N + N/2, where N is the centroid value). The centroid is validated (green points) if the fitted function has a negative second-degree term, that is, downward concavity; otherwise, if the concavity is upward, the centroid is discarded (red points). Finally, the validated points are returned as strong candidates for periods in the time series.
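This validation step can be sketched as follows: compute the ACF of the binary series, fit a quadratic over the window from N/2 to N + N/2 around each candidate N, and keep the candidate only when the second-degree coefficient is negative. The plain NumPy ACF estimator is an assumption; any equivalent autocorrelation routine would do.

```python
# Sketch of the concavity check: quadratic fit of the ACF around each candidate
# period N (window N/2 .. N + N/2); keep candidates with downward concavity.
import numpy as np

def autocorrelation(series):
    """Sample autocorrelation of a 1-D series for all lags."""
    x = series - series.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    return acf / acf[0]

def validate_periods(binary_series, candidate_periods):
    acf = autocorrelation(binary_series.astype(float))
    validated = []
    for n in candidate_periods:
        lo, hi = int(n // 2), min(int(n + n // 2), len(acf) - 1)
        lags = np.arange(lo, hi + 1)
        a, _, _ = np.polyfit(lags, acf[lo:hi + 1], deg=2)   # quadratic fit
        if a < 0:                                           # downward concavity: keep it
            validated.append(n)
    return validated
```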

3.5.2 Trend test

The trend test checks whether there is an increasing or decreasing trend in the failure times of the monitored components.
To perform this test, we followed the process shown in Figure 12.

F I G U R E 11 Autocorrelation example [Color figure can be viewed at wileyonlinelibrary.com]

F I G U R E 12 Trend test methodology [Color figure can be viewed at wileyonlinelibrary.com]

First, we calculate the failure times from the binary time series (step 2 of Figure 12). A failure time consists of a period in which the binary time series equals zero, that is, a period where the component was unavailable. Next, we check whether there are enough failure periods (we require at least 30 periods); if so, the test continues; otherwise, it stops because it is not possible to perform the trend test.
A linear regression (step 3 in Figure 12) is performed on the failure time data, and the angular coefficient (slope) is used to perform the statistical test.
Our statistical test has a null hypothesis stating that the angular coefficient is equal to 0; hence, the alternative hypothesis is that the coefficient is different from 0, evaluated at a 95% confidence level. The statistical test was performed using a Python library, statsmodels.18
This statistical test returns a p-value. If the p-value is greater than 0.05, we cannot reject the null hypothesis; therefore, we cannot guarantee that there is a trend in the failures of the given time series. However, if the p-value is less than 0.05, we can reject the null hypothesis and state that there is a trend in the time series. To find out whether the trend is increasing or decreasing, we look at the sign of the angular coefficient: if it is negative, the trend is decreasing; otherwise, it is increasing.
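A minimal sketch of this test is shown below: failure durations are extracted as runs of zeros in the binary series, regressed on their index with statsmodels OLS, and the slope's p-value decides the outcome. The helper names are illustrative; in the DCA System this logic is exposed as a Flask service.

```python
# Sketch of the trend test: failure durations (runs of zeros) regressed on their
# index; the slope's p-value (statsmodels OLS) decides whether a trend exists.
import numpy as np
import statsmodels.api as sm

def failure_durations(binary_series, sampling_period):
    """Length of each run of zeros, in the same unit as sampling_period."""
    durations, run = [], 0
    for v in binary_series:
        if v == 0:
            run += 1
        elif run:
            durations.append(run * sampling_period)
            run = 0
    if run:
        durations.append(run * sampling_period)
    return durations

def trend_test(durations, alpha=0.05):
    if len(durations) < 30:                     # not enough failures for the test
        return "insufficient data"
    x = sm.add_constant(np.arange(len(durations)))
    result = sm.OLS(np.asarray(durations, dtype=float), x).fit()
    slope, p_value = result.params[1], result.pvalues[1]
    if p_value >= alpha:
        return "no significant trend"
    return "increasing" if slope > 0 else "decreasing"
```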

3.6 Examples

In this section, we present two scenarios in order to illustrate how our DCA System, as implemented, can be applied.

3.6.1 Scenario A

Our Scenario A consists of six distinct devices: three servers (Carcará, Cão, and Hyena), two switches (SW0A and SW0F), and one UPS (WBRC-II). These devices are located at the GPRT university lab and

F I G U R E 13 Devices saved on Thingsboard [Color figure can be viewed at wileyonlinelibrary.com]

F I G U R E 14 Carcará monitoring data related to the system.cpu.load.avg metric [Color figure can be viewed at wileyonlinelibrary.com]

are being monitored by the Zabbix tool. In this scenario, the actual monitoring data, from February 29, 2020 to March 13, 2020, was used and no manual failure was added. Figure 13 depicts the Thingsboard platform and the six devices we are analyzing.
In this section, we present step-by-step examples of applying the seasonality and trend tests to two devices: the Carcará server and the SW0F switch.

Carcará server
The Carcará server has several monitored metrics, and we selected system.cpu.load.avg, which stores the average CPU load, to be used in the tests. The data was monitored between February 29, 2020 and March 13, 2020, and its respective time series is shown in Figure 14. The transformation of the collected time series into a binary time series of failures is presented in Figure 15.
With the Carcará binary time series, the entire process previously described in Section 3.5.1 is performed. First, it passes through the low-pass filter, then the periodogram is calculated, the power threshold is defined, and the first possible periods are found and grouped by density clustering. Once the candidate periods are identified, the ACF of the binary time series is calculated and the concavity test of each candidate period is carried out, selecting only those that lie in downward concavities (green dots), as shown in Figure 16. The red dots indicate the candidate periods rejected by the concavity test.
Table 8 summarizes the periods found after performing each step. After clustering the candidates indicated by the
periodogram, 19 candidates are presented, but only 6 candidates are confirmed by the autocorrelation test. In this case,

F I G U R E 15 Binary time series of the Carcará server [Color figure can be viewed at wileyonlinelibrary.com]

F I G U R E 16 Autocorrelation with possible periods of Carcará server [Color figure can be viewed at wileyonlinelibrary.com]

T A B L E 8 Possible periods found related to the Carcará server

Step Quantity Possible periods (in minutes)

Periodogram and density clustering 19 [180, 226, 281, 349, 412, 488, 594, 716, 878, 1007, 1119, 1259, 1452, 1679, 2015, 2518, 3358, 5037, 10,074]
Autocorrelation and concavity test 6 [716, 1259, 1452, 1679, 3358, 10,074]

the method indicates the presence of consistent periodic failures repeating at periods of about 12, 21, 24, and 28 h, as well as at periods of about 2 days and 1 week.
The trend test is also performed on the data. Figure 17 shows the trend test result after a linear regression is applied to the data and the statistical hypothesis test is executed (𝛼 = −0.099 and p-value = 0.562), at a 95% confidence level.
From the tests, we can state that the Carcará server presents six possible periods of seasonal failures and does not show an increasing or decreasing trend in its failure times.

SW0F switch
The SW0F switch also presents several monitored metrics and we selected, for this example, the ifHCInOctets.13 metric, which is the packet entry counter on port 13 of the switch. Similarly, the data was monitored between February 29, 2020 and March 13, 2020 and is shown in Figure 18. The respective binary time series of failures is presented in Figure 19.
After the first steps of the seasonality test, we obtained 23 candidate periods of failure, which are shown in Table 9. This table also shows that the autocorrelation and concavity tests confirm only five candidate periods, as shown in Figure 20. The selected periods occur at about 1.6, 21, and 24 h. Longer periods of failure are detected at 1.7 days and 2.3 days.

F I G U R E 17 Time (in minutes) of each failure with linear regression related to the Carcará server [Color figure can be viewed at wileyonlinelibrary.com]

F I G U R E 18 SW0F switch monitoring data related to the ifHCInOctets.13 metric [Color figure can be viewed at wileyonlinelibrary.com]

F I G U R E 19 Binary time series of the SW0F switch [Color figure can be viewed at wileyonlinelibrary.com]

T A B L E 9 Possible periods found by steps of the SW0F switch

Step Quantity Possible periods (in minutes)

Periodogram and density clustering 23 [98, 115, 135, 176, 208, 244, 291, 356, 467, 561, 695, 807, 916, 1008, 1120, 1259, 1440, 1679, 2015, 2519, 3358, 5038, 10076]
Autocorrelation and concavity test 5 [98, 1259, 1440, 2519, 3358]

F I G U R E 20 Autocorrelation with possible periods of SW0F switch [Color figure can be viewed at wileyonlinelibrary.com]

F I G U R E 21 Time (in minutes) of each failure with linear regression related to the SW0F switch [Color figure can be viewed at wileyonlinelibrary.com]

From the trend test shown in Figure 21, one can see that the linear regression (red line) presents a decay across the failures, but analyzing the angular coefficient (𝛼 = −0.613) at a 95% confidence level, we cannot confirm that there is a decreasing trend in failure times, as the p-value is 0.378 and, as seen earlier, to reject the null hypothesis and conclude that there is an increasing or decreasing trend, the p-value must be less than 0.05.
Therefore, we can state that the SW0F switch presents five possible periods of seasonal failures and does not show an increasing or decreasing trend in its failure intervals.

3.6.2 Scenario B

Since the actual data collected at our lab, considered in the previous scenario, did not present trends, we considered, for
the sake of completeness, a second scenario to perform the tests on synthetic data presenting artificial trends in failure

F I G U R E 22 Binary time series of the Scenario B [Color figure can be viewed at wileyonlinelibrary.com]

T A B L E 10 Possible periods found by steps of Scenario B

Step Quantity Possible periods (in minutes)

Periodogram and density clustering 30 [247, 298, 366, 468, 564, 678, 806, 960, 1156, 1375, 1666, 2000, 2356, 2774, 3244, 3823, 4524, 5377, 6364, 7475, 8580, 9533, 10725, 12257, 14300, 17160, 21450, 28600, 42900, 85800]
Autocorrelation and concavity test 8 [468, 564, 5377, 6364, 7475, 12257, 14300, 17160]

time. This scenario does not come from our real environment, but it was kept close to the data presented in Figures 15 and 19. The binary time series of failures was synthetically generated following these steps:

1. We generate a list of 1's with size 86,400 (equivalent to 60 days of monitoring at a one-minute sampling period), which represents the whole time series without missing data;
2. Next, we generate "missing data situations" (failures) in the time series. We generate a list of 30 integers, drawn from a uniform distribution and sorted in increasing order, that represent the failure sizes; the increasing order imposes the trend. Based on this list, we create lists of 0's to be inserted into the time series created in the previous step;
3. We also create a growing list of integers, also drawn from a uniform distribution, that represents where in the time series each fault will be inserted. Each number on this list represents the timestamp where the failure starts; and
4. Finally, we replace the 1's of the original binary time series by the lists of 0's according to the list of timestamps created.

The synthetic binary time series generated for this scenario can be seen in Figure 22; a minimal code sketch of the generation steps listed above is given below.
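The sketch follows those steps; the ranges of the uniform draws and the random seed are assumptions, since the paper does not report them.

```python
# Sketch of the Scenario B generator: 86,400 ones (60 days at one-minute sampling),
# 30 failure lengths sorted in increasing order (imposing the trend), 30 increasing
# insertion points, and the corresponding stretches zeroed out. Draw ranges and the
# seed are assumptions.
import numpy as np

rng = np.random.default_rng(42)
series = np.ones(86_400, dtype=int)

failure_lengths = np.sort(rng.integers(5, 200, size=30))     # increasing durations
start_points = np.sort(rng.integers(0, 80_000, size=30))     # increasing positions

for start, length in zip(start_points, failure_lengths):
    series[start:start + length] = 0                          # insert a failure
```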
We defined a collection periodicity of one minute for this generic equipment. Based on this information, we performed the seasonality and trend tests.
From the seasonality test, we found 30 possible periods of failure after the periodogram and density clustering steps,
which are presented in Table 10. Once the concavity tests are performed and each possible period is evaluated, we can
see that only eight periods were selected as possible periods presenting seasonality, as shown in Figure 23.
Figure 24 shows the results of the trend test, indicating the time of each failure (blue dots) and the linear regression (red line). By the statistical hypothesis test (𝛼 = 99.816 and p-value = 1.114 × 10−41), at a 95% confidence level, we can affirm that there is an increasing trend in the failure times.

4 RELATED WORK

This section presents a set of existing tools commonly used for log evaluation and for data center monitoring. Since
monitoring tools generate several logs about the data center software and hardware resources, aggregating and extracting
relevant information becomes a complex task. The following existing tools (most of them open source) can be used to analyze the log data collected from data center monitoring tools, aggregate information from many sources, and present it to the data center manager. Each tool discussed in this section resembles some module of the DCA System

F I G U R E 23 Autocorrelation with possible periods of Scenario B [Color figure can be viewed at wileyonlinelibrary.com]

F I G U R E 24 Time of each failure with linear regression related to the Scenario B [Color figure can be viewed at wileyonlinelibrary.com]

but fails to provide its whole supported functionality. Due to this fact, we present these existing tools according to the similarity they portray to the modules of our solution.
Logstash is an open-source server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and sends it to one or more destinations, the most popular being Elasticsearch8 . Logstash supports
a variety of inputs from different common sources all at the same time. It works through an event processing pipeline
that has three stages: inputs, filters, and outputs. The inputs stage generates events, filters modify them, and outputs
ship them elsewhere. Inputs and outputs support codecs that enable encoding and decoding data as it enters or exits the
pipeline without having to use a separate filter. The most commonly used inputs are: file, which reads from a filesystem; syslog, which listens on port 514 for syslogd messages; redis, which reads from a Redis server; and beats, which processes events sent by Beats9 . It also supports several plugins that read from other sources. Filters are the intermediary processing in the pipeline. The most common ones are: grok, which parses and structures arbitrary text; mutate, which performs transformations, like renaming fields; and geoip, which adds geographic information based on IP addresses. Logstash also supports plugins for filter processing. Output is the final phase, Elasticsearch being the most commonly used one. Logstash also
supports file output and other plugins.
Filebeat10 was designed to address some of the weaknesses of Logstash. As part of the Beats family, it is a lightweight log shipper that is usually used to push data to Logstash or directly to Elasticsearch. Filebeat reads and forwards log lines and, if interrupted, remembers where it left off when it comes back online. It was designed to be simple and comes with internal modules, like Apache, Nginx, MySQL, and more, that simplify collection, parsing, and visualization of log formats from these commonly used tools.
Fluentd11 is another shipper that tries to structure data as JSON, allowing it to unify all facets of processing log data: collecting, filtering, buffering, and outputting logs across multiple sources and destinations. Fluentd has a flexible plugin

8 https://www.elastic.co/logstash
9 https://www.elastic.co/downloads/beats
10 https://www.elastic.co/beats/filebeat
11 https://www.fluentd.org/architecture

T A B L E 11 Comparative among tools


Tool Data collect Data aggregation Data analysis

Logstash X X
Filebeat X
Fluentd X
Logagent X
Rsyslogd X
Elasticsearch X X
DCA System X X X

system that gives the developer community the means to introduce new functionalities. The plugin programming language is Ruby, a simple language with a gentle learning curve. It offers more than 500 community-contributed plugins with different degrees of maturity. Fluentd is written in the C language in combination with Ruby, requiring few system resources. For tighter memory requirements, it offers the Fluent Bit module, which is to Fluentd what Filebeat is to Logstash, while also supporting delegation, as Logstash does.
Logagent is a shipper from Sematext12 . It is a lightweight, open source tool that comes with out-of-the-box extensible log parsing, on-disk buffering, secure transport, and bulk indexing to Elasticsearch and Sematext cloud. It can also read from and write to Kafka, an open-source publish-and-subscribe platform for streams of records. Logagent also addresses the main weakness of Logstash, having a low memory footprint and CPU overhead, making it suitable to be deployed on edge nodes. It was designed to be a Logstash alternative. Logagent can mask sensitive data like dates of birth, credit card numbers, etc., and add GeoIP information, like Logstash. It supports pluggable inputs and outputs, including many third-party modules. It accepts, for instance, Heroku and CloudFoundry logs.
Rsyslogd13 is the default syslog daemon found on most Linux distributions, but it can do more than just pick up logs and write these messages to text files. Rsyslogd can accept inputs from a wide variety of sources, transform them, and send the results to diverse destinations, including Elasticsearch, Logstash, and Kafka. It offers high performance and a strong enterprise focus, but it can also scale down to small systems. Rsyslogd has a modular design: new functionalities can be dynamically loaded from modules, which may be written by third-party projects14. Its main classes are input, output, parser, and message modification modules. The last can be used to anonymize fields and add GeoIP information, among other supported functionalities. Rsyslogd can be used as a simple router on resource-constrained machines at the network edge with limited bandwidth, supporting delegation in the same way as Logstash. It has a grammar-based parsing module that scales very well, outperforming regex-based parsers such as grok, used by Logstash and Logagent.
Elasticsearch15 is an engine that provides real-time search and analytics for all types of data. Elasticsearch also allows information to be aggregated in order to discover trends and patterns in the data being analyzed. It offers the speed and flexibility to handle data in use cases such as storing and analyzing logs, metrics, and security events, automating business workflows, and modeling the behavior of data in real time through machine learning.
We performed a comparative analysis of the tools found and our DCA System. As illustrated in Table 11, Filebeat, Fluentd, Logagent, and Rsyslogd perform data collection, similarly to the ACM module. Logstash provides both data collection and aggregation, and Elasticsearch performs data aggregation and analysis. Only the DCA System covers all three features, through its three modules (ACM, Aggregator, and Analyzer), delivering a complete service over an extended processing pipeline. The DCA System benefits from flexibility in dealing with many data types due to the ACM design, which is purposely tailored to adjust the input to the proposed time series data template. In addition, we believe that failures in monitored systems should not be treated as isolated problems, since one failure can influence others. Based on this, only our system performs trend and seasonality tests on the detected failures, looking for patterns and behaviors in the data to assist data center managers.
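As a concrete illustration of this kind of test, the sketch below decomposes a daily series of failure counts into trend and seasonal components with statsmodels; it only exemplifies the type of analysis the Analyzer performs, not the module's actual implementation, and the synthetic data and weekly period are assumptions.

```python
# Illustrative sketch (assumptions: synthetic daily failure counts and a
# weekly period); the DCA Analyzer's actual implementation may differ.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic daily failure counts with a weekly pattern and a slow upward trend.
rng = np.random.default_rng(42)
days = pd.date_range("2020-01-01", periods=120, freq="D")
failures = (2 + 0.02 * np.arange(120)
            + np.tile([0, 0, 1, 0, 0, 3, 2], 18)[:120]
            + rng.poisson(1, 120))
series = pd.Series(failures, index=days)

# Additive decomposition into trend, seasonal, and residual components.
result = seasonal_decompose(series, model="additive", period=7)

# Simple strength measures: share of variance explained by each component.
resid = result.resid.dropna()
trend_strength = max(0.0, 1 - resid.var() / (result.trend.dropna() + resid).var())
seasonal_strength = max(0.0, 1 - resid.var() / (result.seasonal.loc[resid.index] + resid).var())
print(f"trend strength: {trend_strength:.2f}, seasonal strength: {seasonal_strength:.2f}")
```

Strength values close to 1 would indicate a pronounced trend or weekly seasonality in the failure series, which is the kind of pattern reported to data center managers.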
We found few academic studies in the literature on data center monitoring tools. Kutare et al.19 proposed Monalytics, a system for monitoring and analyzing large-scale data centers; the paper presents both the Monalytics architecture and an implementation.

12 https://sematext.com/logagent/
13 https://www.rsyslog.com/doc/v8-stable/index.html
14 https://www.rsyslog.com/doc/v8-stable/configuration/modules/index.html
15 https://www.elastic.co/pt/elasticsearch/

Monalytics provides templates for data collection and representation (aggregation), similar to the DCA System. The authors evaluated fault detection and reaction with Monalytics using a fault injection methodology, showing that the system is able to detect and react to failures. Although Monalytics is close to the DCA System, it does not allow higher-level data analysis, such as time series analysis.
The work in Reference 20 presents the use of the Application-Layer Traffic Optimization (ALTO) protocol in a cloud management system. This system can collect data about data center networks through the ALTO protocol for many purposes, such as monitoring delay and loss. The authors highlight that the collected information can be used for intelligent virtual machine placement in cloud environments. However, this system is limited to the network monitoring of data centers. Yu et al.21 also propose a tool for network monitoring, named the Distributed and Collaborative Monitoring system, which allows switches to collaboratively perform flow monitoring tasks and balance the load.
Herodotou et al.22 presented a system that applies statistical data mining techniques to large-scale active monitoring data with the purpose of determining a ranked list of suspect causes. The failure detection and localization approach is based on a series of statistical models that produce a ranked list of devices and links that can impact network availability. The authors evaluated their proposal in the Windows Azure production environment to prove its effectiveness in terms of localization accuracy, precision, and time to localization, using network data from incidents over the preceding 3 months. Although their system is very close to the DCA tool, our proposal is extensible to monitoring not only data center availability, but also power consumption, temperature status, and resource utilization, among others. In addition, our methodology can monitor different aspects of data center infrastructures, such as the power, cooling, and IT subsystems.
The work proposed by Brondolin et al.23 presents a tool for monitoring the power consumption of applications running in containers. Its main purpose is to provide per-application power consumption information to data center managers. The DCA tool can provide this information, and much more, to data center managers by simply adjusting the ACM to different power monitoring tools.

5 CONCLUSIONS AND FUTURE WORKS

Understanding data center infrastructure behavior is essential for vendors to avoid problems such as service interruptions and financial losses. For this purpose, many strategies to monitor data center infrastructures have emerged in recent years. The presence of different approaches and techniques within a single environment introduces heterogeneity in the monitoring context, mainly in terms of resource relationships and subsystem levels.
To solve this heterogeneity problem, some initiatives are working on the standardization of monitoring solutions. However, these efforts remain at an early stage. Thus, data center operators need alternative solutions to deal with the heterogeneity present in the monitoring tools available on the market. In this work, we presented the DCA System to aggregate data center measurement data for availability analysis. This software system meets the three requirements previously stated. Regarding those requirements, we highlight that the templates of standard attributes (see Table 1) are as generic as possible and are therefore able to: support multiple data center management tools (first requirement); support new management systems (second requirement); and contain all the attributes required to represent any data center component, thus being able to deal with components in the IT, power, and cooling subsystems (third requirement).
The ACMs also allow the DCA System to satisfy the first two requirements, since supporting multiple and new management tools only requires developing an ACM interface for the specific management tool, keeping the remaining modules (Aggregator and Data Analyzer) unmodified. Moreover, as the ACM follows the standard template, the Aggregator and Data Analyzer are able to aggregate and analyze the collected data, respectively.
The Aggregator also contributes to the first requirement (support multiple data center management tools), since it aggregates into a single entity all the data collected for a component monitored by different monitoring systems.
Lastly, the Data Analyzer contributes to achieving the third requirement (monitor and analyze a wide spectrum of data (IT, power, and cooling)), since it generates models representing the interconnection and redundancy of the data center components, which are ultimately analyzed to provide availability results per component and for the whole infrastructure.
The DCA System is easily installed in an environment already configured with the Zabbix monitoring tool: it is only necessary to install the DCA System modules and adjust their configuration files. If the environment is monitored by another tool, an ACM module must be developed for that tool.

As future work, we have the following plans: (i) add compression methods to the ACM to decrease the network overhead of JSON transmissions; (ii) add a database to the ACM module for larger scenarios; (iii) evaluate different serialization formats, such as MessagePack, CBOR, and Smile; (iv) improve the Analyzer module by adding new functions to obtain more insights about time series data in terms of availability analysis, such as stationarity analysis; and (v) implement alerts tailored to data center managers' preferences.
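As a rough illustration of the potential gain behind items (i) and (iii), the sketch below compares the encoded size of a measurement record in JSON and MessagePack; the record fields are hypothetical and do not correspond to the actual ACM template.

```python
# Illustrative sketch only: compares JSON and MessagePack payload sizes for a
# hypothetical measurement record (field names are not the ACM template).
import json
import msgpack  # pip install msgpack

record = {
    "device": "switch-01",
    "subsystem": "IT",
    "metric": "availability",
    "value": 1,
    "timestamp": "2020-05-19T10:00:00Z",
}

json_bytes = json.dumps(record).encode("utf-8")
msgpack_bytes = msgpack.packb(record)

print(len(json_bytes), "bytes as JSON")
print(len(msgpack_bytes), "bytes as MessagePack")
```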

ACKNOWLEDGEMENTS
This work was supported by the Research, Development and Innovation Center, Ericsson Telecomunicações S.A., Brazil.

ORCID
Diego Bezerra https://orcid.org/0000-0002-9933-091X
Patricia T. Endo https://orcid.org/0000-0002-9163-5583

REFERENCES
1. Chircu A, Sultanow E, Baum D, Koch C, Seßler M. Visualization and machine learning for data center management. Paper presented
at: Proceedings of the INFORMATIK 2019: 50 Jahre Gesellschaft für Informatik–Informatik für Gesellschaft (Workshop-Beiträge). Bonn,
Germany; 2019. Gesellschaft für Informatik eV.
2. Endo PT, Santos GL, Rosendo D, et al. Minimizing and managing cloud failures. Computer. 2017;50(11):86-90.
3. Mendonça J, Andrade E, Endo PT, Lima R. Disaster recovery solutions for IT systems: a systematic mapping study. J Syst Softw.
2019;149:511-530.
4. Ferreira L, Endo PT, Rosendo D, et al. Standardization efforts for traditional data center infrastructure management: the big picture. IEEE
Eng Manag Rev. 2020;48(1):92-103.
5. Demetriou DW, Calder A. Evolution of data center infrastructure management tools. ASHRAE J. 2019;61(6):52-58.
6. Petruti CM, Puiu BA, Ivanciu IA, Dobrota V. Automatic management solution in cloud using NtopNG and Zabbix. Paper presented at:
Proceedings of the 2018 17th RoEduNet Conference: Networking in Education and Research (RoEduNet) 2018 September 6. Cluj-Napoca,
Romania; 2018:1-6; IEEE.
7. Renita J, Elizabeth NE. Network’s server monitoring and analysis using Nagios. Paper presented at: Proceedings of the 2017 International
Conference on Wireless Communications, Signal Processing and Networking (WiSPNET) 2017 March 22. Chennai, India; 2017:1904-1909;
IEEE.
8. Sukhija N, Bautista E. Towards a framework for monitoring and analyzing high performance computing environments using Kubernetes
and Prometheus. Paper presented at: Proceedings of the 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced &
Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation.
Leicester, UK; 2019.
9. Perspectives Industry Mashup: monitoring data center power and cooling simultaneously; 2017. https://www.datacenterknowledge.com/
archives/2017/03/24/mashup-monitori%ng-data-center-power-and-cooling-simultaneously. Accessed May 19, 2020.
10. Goncalves G, Rosendo D, Ferreira L, et al. A standard to rule them all: redfish. IEEE Commun Stand Magaz. 2019;3(2):36-43.
11. Li J, Sun A, Han J, Li C. A survey on deep learning for named entity recognition. IEEE Trans Knowl Data Eng. 2020. https://doi.org/10.1109/TKDE.2020.2981314.
12. Liang H, Sun X, Sun Y, Gao Y. Text feature extraction based on deep learning: a review. EURASIP J Wirel Commun Netw. 2017;2017(1):1-12.
13. Ji S, Satish N, Li S, Dubey PK. Parallelizing word2vec in shared and distributed memory. IEEE Trans Parall Distrib Syst.
2019;30(9):2090-2100.
14. Likas A, Vlassis N, Verbeek JJ. The global k-means clustering algorithm. Pattern Recog. 2003;36(2):451-461.
15. Grinberg M. Flask Web Development: Developing Web Applications with Python. Sebastopol, CA: O’Reilly Media, Inc; 2018.
16. Vlachos M, Yu P, Castelli V. On periodicity detection and structural periodic similarity. Paper presented at: Proceedings of the 2005 SIAM
international conference on data mining 2005 April 21. Newport Beach, California; 2005:449-460; Society for Industrial and Applied
Mathematics.
17. Puech T, Boussard M, D'Amato A, Millerand G. A fully automated periodicity detection in time series. Paper presented at: Proceedings of the International Workshop on Advanced Analysis and Learning on Temporal Data 2019 September 20; 2019:43-54; Springer, Cham.
18. Seabold S, Perktold J. Statsmodels: econometric and statistical modeling with python. Paper presented at: Proceedings of the 9th Python
in Science Conference 2010 June 28 Vol. 57. Austin, Texas; 2010:61.
19. Kutare M, Eisenhauer G, Wang C, Schwan K, Talwar V, Wolf M. Monalytics: online monitoring and analytics for managing large scale
data centers. Paper presented at: Proceedings of the 7th International Conference on Autonomic Computing 2010 June 7. Washington,
DC; 2010:141-150.
20. Scharf M, Voith T, Roome W, Gaglianello B, Steiner M, Hilt V, Gurbani VK. Monitoring and abstraction for networked clouds. Paper
presented at: Proceedings of the 2012 16th International Conference on Intelligence in Next Generation Networks 2012 October 8. Berlin,
Germany; 2012:80-85; IEEE.
21. Yu Y, Qian C, Li X. Distributed and collaborative traffic monitoring in software defined networks. Paper presented at: Proceedings of the
3rd Workshop on Hot Topics in Software Defined Networking 2014 Aug 22. Chicago, IL; 2014:85-90.

22. Herodotou H, Ding B, Balakrishnan S, Outhred G, Fitter P. Scalable near real-time failure localization of data center networks. Paper
presented at: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2014; August
24. New York, New York, USA; 2014:1689-1698.
23. Brondolin R, Sardelli T, Santambrogio MD. Deep-mon: dynamic and energy efficient power monitoring for container-based infras-
tructures. Paper presented at: Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops
(IPDPSW) 2018 May 21. Vancouver, British Columbia, Canada; 2018:676-684; IEEE.

How to cite this article: da Silva Rocha É, G. F. da Silva L, Santos GL, et al. Aggregating data center
measurements for availability analysis. Softw Pract Exper. 2020;1–25. https://doi.org/10.1002/spe.2934
