
MONITORING PLATFORM ASSESSMENT

ABSTRACT:
This document presents a brief comparison and assessment for the Monitoring
Platform (from now on called MP), with the goal of gaining a clearer understanding of
possible future requirements and improvements.
The MP project is an interesting challenge, and as a Team we should provide an easy
and accessible solution, fully configurable from the WebUI. Integration with
OpenStack is also an important factor. The MP should cover the different
monitoring branches:
Availability Monitoring
Performance Metric Monitoring
Security Event Monitoring
Log Analysis Monitoring
The different monitoring technologies should be merged together behind a unified web
interface, identified as the MP Dashboard. Starting from the Dashboard, the user should be
guided to the resources she/he wants to monitor. This offers a balance between the
usage and priority of the different monitoring branches. Focusing on only one branch and then
building the other branches as add-ons should be avoided, as we don't know the final user's
priorities with respect to the different technologies.

SELF-ASSESSMENT:
Several strategic aspects should be defined, as they can strongly influence the decision.
Common strategic questions are:
Should we develop a platform to be proposed and hopefully included in
OpenStack?
Should we install the fastest plug-and-play OSS solution?
What is the short-, medium- and long-term strategic vision for our MP?
Are we looking for near real-time data processing, or is a delay in data
processing and alerting acceptable?
If we want to develop our own solution from scratch or keep developing the
existing one, is it our intention to release the software as OSS?

GENERAL CONSIDERATIONS:
The analysis had to provide some useful feedback in just one week. For this
reason the information may be inaccurate, and some tools may have been excluded for
reasons such as:
Fully developed in a programming language not used by the Team
Inactive community
Closed source project
Product developed by only one or two people
Strong influence from a single company
Some ratings from http://www.ohloh.net were taken into consideration as a reference and
source of information.

SWIFT REQUIREMENT:
System level stats
The obvious stuff here: CPU load, memory usage, free disk space. Notification of disk
failure is particularly important; we'll have a lot of disks and we want failed ones replaced in
good time. Any standard system (collectd?) will do the job here.
Swift system stats
Swift Recon produces a collection of statistics about the Swift instance that are particularly
important to measure.
Swift Recon works on a pull model; the monitoring system needs to query swift recon to
get the latest stats.
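The pull model described above can be sketched as follows. This is only an illustration: the port (6000), the `/recon/<check>` URL layout and the shape of the `unmounted` payload are assumptions about a typical Swift deployment and should be checked against the installed version; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def recon_url(host, port, check):
    """Build the URL for one recon check (assumed layout: /recon/<check>)."""
    return "http://{}:{}/recon/{}".format(host, port, check)


def fetch_recon(host, port, check, timeout=5.0):
    """Pull one recon document from a storage node and decode its JSON body."""
    with urlopen(recon_url(host, port, check), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def count_unmounted(devices):
    """Count devices reported as not mounted.

    Assumes a list of dicts such as {"device": "sdb1", "mounted": False}.
    """
    return sum(1 for device in devices if not device.get("mounted", True))
```

A poller in the MP would call `fetch_recon` for every storage node at each check interval and feed the result into storage and alerting.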
Swift instance stats
The Swift code is heavily instrumented with a lot of different statsd metrics. We want all
of these to show up in the monitoring platform. Using statsd in some way is also a
requirement; we don't want to alter the Swift code to use a different system, so the
monitoring platform will need to work with statsd.
We want all the metrics to be available in the monitoring platform, preferably without any
manual steps. We'll also want some storage of metrics to use when diagnosing problems.
They must also be available for graphing and alerting.
What we want from the monitoring platform
Stats handling and storage
Alerting
We need to know when something has gone wrong. For example, disk or node failures
should trigger an alert in the form of an email, a red box on a web page, or both. We'll also
want to trigger alerts based on trends, such as a rise in async-pending operations or high
load averages.
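Trend-based alerting of this kind can be sketched with a least-squares slope over the most recent samples. This is a minimal illustration, not a recommendation of a specific algorithm; the function names and the idea of one sample per check interval are assumptions.

```python
def trend_slope(samples):
    """Least-squares slope of equally spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(samples) / float(n)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def rising_alert(samples, max_slope):
    """Return True when the metric grows faster than max_slope per interval."""
    return trend_slope(samples) > max_slope
```

For example, `rising_alert(async_pending_history, 5.0)` would fire when the async-pending count grows by more than five per check interval on average, even if no absolute threshold has been crossed yet.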

GENERAL REQUIREMENTS:
Graphing
A dashboard showing key current statistics about the Swift cluster could be very useful for
the operations teams. Data such as request timings, CPU load etc could provide a useful
at-a-glance health-check for the system.
We'll also want some way to graph other metrics for use when diagnosing problems in the
system itself.

Licensing:

- The main core component of the MP should be released under an Open Source license
(*GPL*, BSD, MIT, Apache).
Development:

- The Monitoring Platform should be developed in a programming language that is
already used by the Team. The development platforms preferred by the Cloud R&D Team are:
Java (preferred)
C
Python (preferred)
Ruby
The Monitoring Platform should be developed in one of the mentioned
programming languages.
Platform:

- The MP should use a Unix flavor (Linux preferred) as its OS platform. The MP should be
able to monitor Windows-family operating systems.
Data Collection:
The data from servers/nodes/devices should be collected using one or more of the
following technologies:
SNMP
Agent (Ganglia or technology specific agent)
Syslog
Puppet/Facter/Mcollective
Rely on AMQP technologies (RabbitMQ preferred)
CSV
XML/YAML
Data Storage:
Data should preferably be stored in a NoSQL DB technology such as HBase,
MongoDB, Redis or OpenTSDB (HBase based). At least 2 different DB
technologies should be supported. One of them should be SQL.
The data should be easy to replicate in both low- and high-latency environments
The Platform should be fully redundant at every component level
It is worth taking a look at Ambari (HBase related): http://incubator.apache.org/ambari/
A DB abstraction layer would be of great value, so that our platform can be
DB independent (SQLAlchemy, Hibernate).

Data Export:
Export event data (using syslog, AMQP, LogStash)

Web User Interface:


Dynamic Dashboard
Live Search
Log file search
Send commands to multiple hosts
Groups Access Control
HTTP/REST Interface
Inventory (information related to the hardware)
Maps Support
Full Control: everything should be configured through the WebUI.
We should be able to drag and drop a HostGroup from an LDAP query and
automatically import it into the MP, or import the whole structure of the specified
LDAP server.
The Availability, Performance and Security Monitoring interfaces should be unified as
much as possible.
Right-click functionality on the map for automatically cloning or moving a cluster to
another datacenter (interaction with our Deployment Stack)
Integration:
The MP should provide JSON/REST API to interact with other technologies.
It should easily be integrated with Remedy, JIRA and the current SMS platform.
Integration with the LDAP protocol (Active Directory supports LDAP)
It is fundamental to monitor every single component of the OpenStack Cloud suite
Technology:
Services/OS AVAILABILITY data collection
Service/OS PERFORMANCE data collection
Preferably, the agent installed on the servers should push the data rather than be polled.
Nodes Auto Discovery
Trending
Triggers / Alerts
Average Computation for automatic thresholds
The MP core should have been actively developed for at least 5 years (a fork from another
project is OK).
The MP should be updated and released at least once every year
The MP should have a serious company behind it, with professional engineers
working on it.
Notification Flood Prevention
Flapping detection
OSSEC Integration (Security Metrics and analysis)
Plugins should set the threshold values automatically, comparing the maximum
available resources against the resources currently in use.
Additional plugins/checks should be developed in Python, Ruby, Java or C.
A smartphone app should already be available or easy to build (using the API)
Components should be independent. This means we should be able to use a different
collector or a different technology if we want to.
Alert triggering per average and/or current absolute values
Log Analysis should be performed as additional source of information
Possibly compatible with / mentioned in OpenStack
Extensive documentation
The simplest possible architecture
Active community: forum and ticketing systems should be available.
Trigger alerts using different thresholds from different metrics (which can come from
different services or different nodes/clusters).
Ability to query data based on: regexp on metric name, metric value, hostname,
etc.
Ability to compute metrics from different source metrics
Alerting per deployment failure
Ability to specify your own aggregation way (let's say, avg, low, high, current or
custom function) AKA Interpolated data representation.
Exact data representation
There should be a public big repository for plugins, templates and new checks
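The "average computation for automatic thresholds" requirement in the list above can be sketched as a threshold derived from recent history instead of a hand-set value. The formula (mean plus `k` population standard deviations) and the function names are assumptions chosen for illustration; the document does not prescribe a specific method.

```python
import statistics


def auto_threshold(history, k=3.0):
    """Derive an alert threshold from past samples: mean + k * std deviation."""
    if len(history) < 2:
        raise ValueError("need at least two samples to derive a threshold")
    return statistics.mean(history) + k * statistics.pstdev(history)


def breaches(current, history, k=3.0):
    """Return True when the current value exceeds the derived threshold."""
    return current > auto_threshold(history, k)
```

The same history could also feed an absolute-value check, which matches the requirement to trigger alerts on averages and/or current absolute values.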

INFRASTRUCTURE ARCHITECTURE:
Every component should be deployed on a dedicated node
Every component on a dedicated node should be redundant
Distributed Platform: the sharding of monitored host groups should be
automatic. This can be arranged per HostGroup (cluster), by lowest latency or by
geographic location.
A specific search engine should be used to search any type of information
(Sphinx, any other option?)
The infrastructure should scale to multiple data centres in both low- and
high-latency environments
Web services should be easy to load balance.
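Automatic sharding of host groups across monitoring nodes, as required in the list above, could for instance be done with rendezvous (highest-random-weight) hashing. This is one possible technique, not something the requirements mandate; the node and group names are hypothetical.

```python
import hashlib


def _score(node, group):
    """Deterministic pseudo-random weight for a (node, host group) pair."""
    key = "{}|{}".format(node, group).encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16)


def assign(group, nodes):
    """Pick the monitoring node that owns a host group."""
    if not nodes:
        raise ValueError("no monitoring nodes available")
    return max(nodes, key=lambda node: _score(node, group))
```

With this scheme every component computes the same owner without coordination, and removing a node only moves the host groups that node owned. Sharding by latency or geographic location would need an extra filtering step before the hash.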
OPEN STRATEGIC TOPICS:
How many services will we supervise in the next 12 months?
What should the service check frequency be?
How will the metrics be collected (protocol definitions, at least 2)?
How long do we want to retain performance data?
How long do we want to retain availability and status data?
Passive or active check method?

Minimum Resources to Monitor (needs to be extended or reduced):


OS:
memory usage details
CPU, any single state and per core monitoring (usr, sys, idl, wai, hiq, siq)
Context Switch, Interrupt
Page I/O
Procs: Runnable, Blocked, New
Disks: Total R/W
Most expensive memory process
Most expensive cpu process
System Load 1, 5, 15
aio stats (asynchronous I/O)
filesystem stats (open files, inodes)
ipc stats (message queue, semaphores, shared memory)
file lock stats (posix, flock, read, write)
raw stats (raw sockets)
vm stats (hard pagefaults, soft pagefaults, allocated, free)
most expensive block I/O process
process using the most CPU time
process with highest total latency
process with the highest average latency
Alerting dependencies mapping (to avoid being flooded by alerts)
Puppet integration should be easy
NETWORK:
Network stats: per interface stats such as throughput, errors, per protocol stats,
connection details, number of connections and per IP connections (important for
country traffic profiling and Business Intelligence)
socket stats (total, tcp, udp, raw, ip-fragments)
tcp stats (listen, established, syn, time_wait, close)
udp stats (listen, active)
unix stats (datagram, stream, listen, active)

SECURITY:
At some point, we'll have to meet the PCI-DSS standard requirements.
File (System) Integrity Checking (PCI-DSS sections 11.5, 10.5.5)
Log Monitoring (security perspective, i.e. brute-force detection; PCI-DSS
section 10 as a whole)
Rootkit Detection
Policy Enforcement Checking (weak password detection)

OPENSTACK SWIFT:

Disk utilization
Monitor how much space is available from Swift's perspective; this is distinct from a
system-level view.
Un-mounted drives
Monitors drive failures; Swift unmounts a drive when it has a problem.
Async-pending
An async-pending happens when a container listing update fails. If these levels are
high, the cluster is degraded.
vm stats (hard pagefaults, soft pagefaults, allocated, free)
per disk transactions per second (tps) stats
per disk utilization in percentage
per disk utilization in megabytes
per filesystem disk usage
Swift Object Server: ensure all servers in the cluster have the same copy of the object ring
Swift Dispersion: dispersion analysis and check that all copies of objects are OK
Swift FS check: upload, download and delete a file in a Swift Container to check
that it works correctly
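The Swift FS check (upload, download, delete) can be sketched as a round-trip probe. The `put`/`get`/`delete` interface and the `InMemoryStore` stand-in are hypothetical and exist only so the sketch runs on its own; a real check would wrap an actual Swift client behind the same three methods.

```python
import hashlib


class InMemoryStore:
    """Stand-in for a Swift client; real code would talk to the cluster."""

    def __init__(self):
        self._objects = {}

    def put(self, container, name, data):
        self._objects[(container, name)] = data

    def get(self, container, name):
        return self._objects[(container, name)]

    def delete(self, container, name):
        del self._objects[(container, name)]


def roundtrip_check(store, container, name, payload):
    """Upload, download, compare checksums and delete; True means healthy."""
    expected = hashlib.sha256(payload).hexdigest()
    try:
        store.put(container, name, payload)
        try:
            downloaded = store.get(container, name)
        finally:
            # Always try to remove the probe object, even if the download failed.
            store.delete(container, name)
    except Exception:
        return False
    return hashlib.sha256(downloaded).hexdigest() == expected
```

Any exception during upload, download or delete is treated as a failed check, so the result maps directly onto an OK/CRITICAL availability state.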

OPENSTACK NOVA:
Check for one or more flavors
Possibility to list servers
One or more images available
One or more security_groups available

OPENSTACK KEYSTONE (Identity Service):
Check whether it is possible to get a token and whether a public URL is declared for
that service.

OPENSTACK GLANCE:
Check that the minimum number of images and the desired image names are present in Glance.

TECHNOLOGIES COMPARISON:

Nagios:
Intro:
The standard Open Source Software for monitoring.
Pros:
Excellent engine for Availability Monitoring.
Huge publicly available database of plugins/checks
Very active Open Source community (1M users)
Actively developed
Very fast and performant (developed in C)
Mentioned in the OpenStack documentation
Good know-how in the Team
Can be integrated with Ganglia
Can easily monitor OpenStack resources
Can Monitor every resource in RabbitMQ
Cons:
No HTTP/REST Interface
No Live Search available
Distributed monitoring and high availability can be achieved, but a lot of work
is required
Achieving performance and scalability can be difficult
Limited DB choice (natively no NoSQL DB is supported)
No integration with LDAP/AD for authentication (difficult to achieve)
Text-based configuration. Managing multiple monitoring nodes requires
considerable effort
Lack of automatic service discovery
Poor reports
Closed development scheme
Not developed in one of the Team's preferred programming languages
What can be taken?
The Nagios core daemon is probably the most advanced availability engine in the
community. The plugin database compatibility is a big plus and can save a lot
of effort in the check development process
Icinga
Intro:
The very first Nagios Fork in the OSS Community
Pros:
All the Nagios Pros
Ability to send commands to multiple hosts simultaneously
Pretty much everything can be done from the Web Interface
Live Search support
Web interface in AJAX (PHP)
HTTP / REST Interface
Export in XML via REST API
Public GIT Repo
Can be integrated with Ganglia
Can be integrated with Graphite
Can easily monitor OpenStack resources
Mentioned on the OpenStack website
Cons:
To date no NoSQL DB is natively supported (MySQL, PSQL, Oracle)
Not developed in one of the Team's preferred programming languages
RRDTool data storage (however, it can be integrated with Graphite for exact data
representation)
What can be taken?
Live Search
Compound commands
LDAP/AD Interface
HTTP / REST Interface
Export in XML via REST API
Probably no specific Team experience, but as the solution is Nagios based,
relevant know-how should be available

Shinken
Intro:
A Nagios rewrite, written entirely in Python.
Pros:
All the Nagios Pros
Ability to send commands to multiple hosts simultaneously
Pretty much everything can be done from the Web Interface
Full featured Web Interface
Interface with LDAP/AD
LiveStatus API (JSON)
Can export and store data to MongoDB
Can send all logs to Splunk, Graylog2, Logstash, Elasticsearch, Banana
Public GIT Repo
Can be integrated with Graphite
Developed in Python
Support any Nagios Plugin
Auto Discovery
Distributed and Redundant Architecture
Supports Redis and Memcached as data retention modules
Can easily monitor OpenStack resources
Mentioned on the OpenStack website

Cons:
Not sure if it can be integrated with Ganglia (probably with check_ganglia)
Shinken is not a Nagios fork, but rather a Nagios rewrite in Python. The project
is relatively new (2010)
Live Search is not supported, but you can export the data to Logstash or
Elasticsearch and perform searches from there. It is also possible to use Sphinx,
depending on the DB used

What can be taken?
Pretty much everything
It needs to be verified whether Ganglia can be used as a collection method. Probably yes, as
it can be used as a Nagios check.
Probably no specific Team experience, but as the solution is Nagios based,
relevant know-how should be available

Graphite
Intro:
The most widely used Open Source software for exact data representation and graph
rendering
Pros:
Real-Time Graphing with exact data representation
Developed in Python
Components can be deployed in a distributed way (scalability and flexibility)
Django Web Framework
AMQP can be used for application data routing, but no documentation is
available.
Very I/O Efficient
Integrated with LDAP
Very customizable Dashboard
Advanced graphing (use cairo as rendering engine)
Data can be displayed/exported in different format, including JSON
Easily integrated with other monitoring platforms
(https://graphite.readthedocs.org/en/latest/tools.html)
Cons:
Data collection must be done by external add-ons (there are a lot of them)
Not sure how to use different DB technologies for data storage (according to the docs only
Whisper is supported)
No native alerting support

What can be taken?
Graphite is a great added value as a performance graphing component. I'd define
Graphite as part of the MP solution rather than the complete MP solution itself.

Zod Bundle
Intro:
The Current MP used in the R&D Cloud
Pros:
Ability to query current value of metrics
Filter criteria include pattern matching and string manipulation on host, metric,
cluster and value

Ability to do comparisons between metric values / times etc. using the operators
> < >= <= !=
String manipulation and formatting to generate output of live query data
Ability to generate report tables (using live query) which may be permalinked and
saved for later use
Ability to view graph of historic data of any persisted metric returned by live
query
Alerting notification, filtering, and hiding of alerts
Persistence for up to 1 year
Http user interface
Metric Aggregation Operations across a cluster (avg, min, max)
Computed metrics (from multiple metrics across a host)
Adding new hosts and services is done through the Deployment Platform (Puppet in
our case).
Zod now supports StatsD

Cons:
Data produced is limited by the method of interpolation (avg only: 5 mins for 1
day, 1 hour for 10 days, 1 day for 1 year). This type of aggregation is not relevant
in some cases, e.g. I might want to see the max request time
Very limited historical data analysis
No Open Source community available
No Web User Interface for configuration management
Any improvement or bug fix relies on our internal effort
Developed in Clojure
No role-based access (single role)
Limited reporting (time series, historical downtime or uptime computation)
Zod will only operate on float types for checks / historical aggregation /
computing etc.
Adding a new report is time consuming, as plain HTML/JavaScript needs to be written.
Trending not available
Limited Documentation Available
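The first limitation in the list above (average-only interpolation) can be illustrated with a small downsampling sketch. The sample values and function names are invented for illustration; the point is only that the aggregation function should be selectable.

```python
def downsample(values, bucket_size, aggregate):
    """Collapse consecutive samples into buckets using the given function."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    return [aggregate(values[i:i + bucket_size])
            for i in range(0, len(values), bucket_size)]


def average(bucket):
    """Arithmetic mean of one bucket."""
    return sum(bucket) / float(len(bucket))


# Hypothetical request times in milliseconds, with one slow request.
request_ms = [100, 100, 100, 100, 900, 100]
by_avg = downsample(request_ms, 3, average)
by_max = downsample(request_ms, 3, max)
```

Here the second bucket averages to roughly 367 ms, while the `max` aggregation keeps the 900 ms outlier visible. Passing the aggregation as a function also covers the "custom function" case from the general requirements.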

What can be taken?
Live Query and all the Pros.

Ganglia
Intro:

Very efficient and scalable system monitoring tool. It can also be used as a single
component for data collection.
Pros:
Used in the current production monitoring solution
Data exported via XML (over TCP)
Supports UDP multicast and unicast for external data representation
Distributed federation model (scalability and redundancy)
Can easily be extended by adding new plugins
Can be integrated easily with HBase (dedicated appendix in the book
Monitoring with Ganglia)
Can be engineered to read and export local log files (it needs to be verified whether
it would be possible to send local logs)
Can be integrated with sFlow
Single independent components
Widely used in the OSS Community
Active Development
Can be integrated with Graphite (Carbon component)
The agent can be used standalone to collect data and can be used to perform a
minimum of local analysis (Real Time)
Cons:
Uses RRDTool for data representation (interpolated data representation; however, it
can send the data in XML)
Using UDP is not good for bulk data transfer (which means possible reliability
issues when transferring file content)
What can be taken?

Cacti
Intro:

Performance Monitoring Tool.


Pros:
User Role Management
Pretty much everything configurable from the Web UI
Flexible and easy to configure Graphing
Installation and configuration can be done in a few minutes
Mentioned in OpenStack (used to monitor the performance of the OpenStack public-facing
infrastructure:
http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=7&select_first=true)
Cons:
Limited DB choice (MySQL)
Limited source collection method (SNMP only)
Difficult to scale (but with one server you can easily monitor 40k resources)
Difficult to achieve advanced features like discovery
Developed in PHP, which is not one of the Team's preferred languages
Doesn't natively provide graphing (available with an add-on)
What can be taken?
User Role

Zenoss
Intro:

Pros:
Supports Nagios plugins
90% developed in Python, 10% in Java (matches Team experience)
Auto-discovery of services and hosts
Trending
Trend Prediction
Full control from Web UI
User Access Control
Advanced reporting, graphs and analysis
Easy to implement auto-scaling
Mentioned in OpenStack
Support Distributed monitoring and Inventory
Event Correlation engine
Strong Community
Agent-less
Better oriented for SNMP
Huge amount of ZenPacks (Zenoss plugins to perform checks on target devices)
Active Monitoring only (can be also a Cons)
Using the Amazon Web Service ZenPack you can monitor AWS instances
Cons:
More advanced features (real time picture of complex dynamic network or
automated root cause analysis) are only provided by Non OpenSource modules
Limited DB support (MySQL, ZODB, RRD)
Only two access levels (admin and read only)
What can be taken?
The agent-less approach can be an option (however, you lose near real time)
Event Correlation Engine
Nagios plugin support
In general all the mentioned Pros

Zabbix
Intro:
Zabbix is an enterprise-class Linux monitoring system with an impressive list of
capabilities available out of the box
Pros:
Very easy to install and configure
Can track changes made to specified files (like /etc/passwd) and send alerts
Flexibility on data collection method
Very flexible alerting (Agent, SNMP, IPMI, JMX, SSH, XMPP)
Advanced Web Interface. Pretty much everything can be configured from the UI
Very performant, as the collector and analyzer engine is written in C
Auto Discovery
Distributed Monitoring with centralized web administration
Java Gateway
Strong Community
Actively Developed
Support Trending
Mentioned in OpenStack website
Used to monitor OpenStack
API support
Cons:
Not developed in one of the Team's preferred languages (Java, Python)
What can be taken?

OpsView Core
Intro:
Based on Nagios.
Pros:
Everything configurable from the Web Interface
Integrated with other performance monitoring like MRTG and NMIS
All the Nagios Pros
REST API
Cons:
Only MySQL is supported as DB
Developed in Perl, which is not one of the Team's preferred languages
Scaling can be a problem, as you may be forced to buy the commercial modules
from Opsera (the company that owns Opsview)
What can be taken?
Very complete and easy-to-use Web Interface

OpenNMS
Intro:
Award Winning Enterprise OSS Monitoring
Pros:
Easily integrated with Jira and ticketing systems
Very scalable. It can collect 1.2M data points in 5 minutes via SNMP.
Can handle Syslog messages
Developed in Java
Multiple data collection methods (SNMP, HTTP, JMX, WMI, XMP, XML,
NSClient, JDBC)
Good reporting. An integration with JasperReports creates high-level reports from
the database and collected performance data
Nodes and Service Discovery
Can be integrated with Provisioning System.
Python utilities are available to match functionalities
Active development
big community support
REST API
Trending support
Cons:
Not Mentioned in OpenStack
Only PostgreSQL is supported as DB (however, there is ongoing development to integrate
Hibernate, which will make OpenNMS database independent)
What can be taken?
Scalability Scheme
Syslog Handling

Munin
Intro:
Performance Metric Tool
Pros:
Trending
Trend Prediction
Cons:
No auto-discovery
No IP SLA reports
No syslog support
Limited alerting/triggering
Uses RRD only for data representation
Developed in Perl (not a Team preferred language)
Not scalable
Web UI only for viewing information

What can be taken?
Trending
Trend Alerting

NetXMS:
Intro:
Enterprise multi-platform network management and monitoring system
Pros:
Advanced security features: access control, encryption, different authentication
methods
Native Java and C API
Nagios plugin compatibility
Built-in interface with help desk systems
Agent and server can be extended with third-party or in-house modules
Scalability (Proxy mode support to add other NetXMS agent)
Full Control Web Interface
Map support from UI
Auto-discovery (OSI Layer 2 and 3)
Data collection with SNMP and Agent
Policy Based event processing
Event forwarding to external source
Can send alarms to one or more operators
Event correlation rules (reduce the number of chained alerts)
Built-in scripting for automation and management
Dedicated Management Console
Support Trend and Trend Prediction
Written in C, C++, Java
Can send commands to multiple hosts
Actively Developed
Active community

Cons:
No NoSQL DB support (Oracle, PSQL, MySQL, MSSQL)
What can be taken?
Everything from the Pros

Collectd
Intro:
Data Collection Component
Pros:
Very efficient (is a pure C daemon)
Cryptography support for data non-repudiation
Plugins can be developed in Python and Java
Big quantity of plugins
With an official plugin, can store data in Graphite (Carbon) over TCP (keeps the
connection open to minimize connection overhead)
With an external plugin, can store data points in OpenTSDB (API to HBase,
https://github.com/dotcloud/collectd-opentsdb/)
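The Graphite (Carbon) integration mentioned in the Pros relies on Carbon's plaintext protocol: one `path value timestamp` line per data point, by default on TCP port 2003. The sketch below shows that wire format with a sender that keeps one connection open; the class name and the metric path in the usage note are hypothetical, and this is not collectd's own implementation.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one data point in Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return "{} {} {}\n".format(path, value, int(timestamp))


class CarbonSender:
    """Keeps a single TCP connection open to avoid per-metric connect cost."""

    def __init__(self, host, port=2003):
        self._sock = socket.create_connection((host, port))

    def send(self, path, value, timestamp=None):
        self._sock.sendall(carbon_line(path, value, timestamp).encode("ascii"))

    def close(self):
        self._sock.close()
```

For example, `carbon_line("servers.web01.load.shortterm", 0.42, 1356998400)` produces the line `servers.web01.load.shortterm 0.42 1356998400`.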
Cons:
collectd is basically a poller. It doesn't push data, therefore the architecture needs
to be designed accordingly: polling a large number of servers can be more
complicated, as multiple instances of collectd would probably be needed to collect
data efficiently
What can be taken?
Cryptography
Plugin compatibility
Storing data points in OpenTSDB (API to HBase)

Ceilometer
Intro:
Pros:
Cons:
What can be taken?

StatsD
Intro:
A network daemon that runs on the Node.js platform and listens for statistics, like
counters and timers, sent over UDP and sends aggregates to one or more
pluggable back end services
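The behaviour described in the intro (receive `name:value|type` lines over UDP, aggregate, flush to a back end) can be sketched as below. This is a simplified illustration and not the StatsD implementation: only counters (`c`) and timers (`ms`) are handled, the `Aggregator` class is hypothetical, and the timed flush to a back end is left out.

```python
import socket
from collections import defaultdict


class Aggregator:
    """Accumulates counters and timers between two flushes."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.timers = defaultdict(list)

    def handle(self, line):
        """Parse one 'name:value|type[|@rate]' line; False if malformed."""
        try:
            name, rest = line.strip().split(":", 1)
            fields = rest.split("|")
            value, kind = float(fields[0]), fields[1]
            rate = float(fields[2][1:]) if len(fields) > 2 and fields[2].startswith("@") else 1.0
        except (ValueError, IndexError):
            return False
        if kind == "c":
            self.counters[name] += value / rate  # scale sampled counters back up
        elif kind == "ms":
            self.timers[name].append(value)
        return True

    def flush(self):
        """Return the aggregates for this interval and reset the state."""
        out = dict(self.counters)
        for name, samples in self.timers.items():
            out[name + ".count"] = float(len(samples))
            out[name + ".mean"] = sum(samples) / len(samples)
            out[name + ".upper"] = max(samples)
            out[name + ".lower"] = min(samples)
        self.counters.clear()
        self.timers.clear()
        return out


def serve_udp(aggregator, host="127.0.0.1", port=8125):
    """Blocking receive loop; 8125 is the conventional StatsD port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        datagram, _ = sock.recvfrom(65535)
        for line in datagram.decode("utf-8", "replace").splitlines():
            aggregator.handle(line)
```

A collector embedding this logic would call `flush()` on a fixed interval and forward the resulting dictionary to whichever back end is configured.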
Node.js:
Node.js is a platform built on Chrome's JavaScript runtime for easily building
fast, scalable network applications. Node.js uses an event-driven, non-blocking
I/O model that makes it lightweight and efficient, perfect for data-intensive
real-time applications that run across distributed devices.
Pros:
Fully Integrated with OpenStack
Can be embedded easily in other collection software (like collectd)
Simple and understandable Design
Supports a large number of back ends (e.g. Graphite)
Actively Developed
Community Support

Cons:
Developed in JavaScript (not sure about scalability)
The range of metric types that can be collected is not as extensive as in other
available tools
What can be taken?
OpenStack integration
StatsD can easily be taken and embedded in another collector
Back-end support

Net-SNMP
Intro:
Data Collection Component
Pros:
Standard for SNMP monitoring
Very portable, can be installed on any OS
Developed in C, very efficient
Supports extensions, so everything can be monitored via SNMP
SNMP is the most supported standard for metric collection
Most of the technologies in the market use net-snmp as embedded daemon
Cons:
What can be taken?
A serious MP should be able to collect data using the SNMP protocol, so
net-snmp should be included as the SNMP daemon.

OSSEC
Intro:
OSSEC is a full platform to monitor and control your systems. It mixes together
all the aspects of HIDS (host-based intrusion detection), log monitoring and
SIM/SIEM in a simple, powerful and open source solution
Pros:
(Near) real-time processing, as part of the data point processing can be executed
locally (agent)
Scalability, as the processing can be distributed on the servers (agent)
Full control from WebUI and centralized management
File Integrity Check
Log Security Monitoring
Rootkit Detection
Supports multiple ways of collecting data (agent, syslog)
Support hundreds of different technologies
Cons:
It cannot be used as a log management solution, as it just analyzes local logs and
sends events accordingly
What can be taken?
Verify whether it can be integrated with another technology for data collection (should be
easy, as the analysis events are written to an alert log file)
Security is very important and a big added value rarely seen in other solutions
All the Pros.

Dstat
Intro:
Data collection component. Combines vmstat, iostat, ifstat, netstat information
and more.
Pros:
Written totally in Python
Enable/order counters as they make most sense during analysis/troubleshooting
You can get useful stats such as the processes using the most CPU, the most RAM, blocking
I/O or with the highest latency, file lock stats, per process open files etc., plus network stats,
disk stats, and a lot of per application stats (MySQL, VMware ESX,
mail servers, DNS)
Very extensible through plugins
Can be integrated with other tools like collectd
Can export data locally in CSV format
Cons:
It is just a local tool, it doesn't send data anywhere
It can only export data in CSV format
Relies on other technologies to send data anywhere
What can be taken?
It is the best System Engineering troubleshooting tool so far
All the top resource usage stats are of great value.
Exact data representation.

Foglight
Intro:
Dell Commercial Monitoring Solution
Pros:
Cons:
What can be taken?

CloudWatch
Intro:
Amazon AWS Monitoring Solution
Pros:
Full control from WebUI
Full control from API
Probably one of the few (if not the only one) fully integrated with the Amazon
World
Auto-scaling support
Support custom metrics
Cons:
It cannot be used outside the Amazon world

What can be taken?
SaaS Business Model
API and WebUI Full Control
Auto-scaling Concept

OpenTSDB
Intro:
OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top
of HBase. OpenTSDB was written to address a common need: store, index and
serve metrics collected from computer systems (network gear, operating systems,
applications) at a large scale, and make this data easily accessible and graphable.
Pros:
Scales across multiple datacenters
Relies heavily on HBase, so no SPoF
Easily integrated with load balancers (Varnish recommended) for scalability
Strong business case (developed by StumbleUpon)
A collector is provided: tcollector (http://opentsdb.net/tcollector.html) is efficient,
as it pushes the metrics to OpenTSDB and there's no poller doing the job.
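The push model mentioned for tcollector comes down to sending one text line per data point to OpenTSDB, in the form `put <metric> <timestamp> <value> <tag=value ...>`. The sketch below only formats such a line; the function name is hypothetical, and sorting the tags is a choice made here for deterministic output, not something OpenTSDB requires.

```python
def put_line(metric, timestamp, value, tags):
    """Format one data point as an OpenTSDB 'put' command line."""
    if not tags:
        raise ValueError("OpenTSDB expects at least one tag per data point")
    tag_text = " ".join("{}={}".format(key, tags[key]) for key in sorted(tags))
    return "put {} {} {} {}\n".format(metric, int(timestamp), value, tag_text)
```

For example, `put_line("sys.cpu.user", 1356998400, 42.5, {"host": "web01", "cpu": "0"})` yields `put sys.cpu.user 1356998400 42.5 cpu=0 host=web01`. A collector would write these lines to an open TCP connection to the TSD.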

Cons:
Young project
Needs to be integrated with other tools; it just receives data (tcollector) and organizes
it as time series using HBase.
What can be taken?
The project has built up experience in using HBase to store massive quantities
of data at high rates. Analyzing the concept of idempotent functions can be an added
value
For a wide public comparison:
http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems

FINAL CONSIDERATIONS AND OPEN DISCUSSIONS:


The Team shares a common vision. Everybody wants to create a world-class
monitoring solution. There is deep know-how and a strong will to
achieve this.
As a general approach, we should get the best from our past experiences, learn from past
errors and take the most useful components we can find in the Open Source community,
then innovate, integrate and improve them. However, it is important to leverage the efforts,
technologies and platforms already available and developed internally.

There's no Open Source software product that can satisfy all our requirements. According to
our strategy this can be an excellent opportunity for Dell. Additional time and effort will also be
required to work on and extend an Open Source solution to meet the requirements, with
the final goal of having a unique world-class Monitoring Platform.
The MP should support at least 3 ways to collect data (e.g. Ganglia, StatsD, Net-SNMP).
The same approach can be used for the API (at least 2 APIs) and for the DB (MySQL, NoSQL,
HBase).
The MP needs to be integrated with the Deployment Framework currently used (Puppet).
This is a capability that only the most advanced products have, and it is fundamental for
auto-scaling and de-scaling.
POSSIBLE SCENARIOS:
Use an existing Open Source product:
At Enterprise grade, the highest standard in the market are achieved by the following
solutions:
Zabbix
Zenoss
OpenNMS
Icinga (Nagios based)
Shinken (Nagios based)
NetXMS
OSSEC (For security alerting and monitoring)
If the Cloud R&D Team wants to propose its own solution to OpenStack, that would be
difficult to achieve using a full-featured, well-known Open Source product.
For a short-term strategy, an existing Open Source product can be a reasonable option.
One of the products mentioned could be used; given the Team's experience and
most-used programming languages, a possible choice could be Zenoss or
Shinken. However, these products have their own limitations to take into consideration.
Feedback on the current monitoring solution: ZOD
Zod has unique features that are very difficult to find in the market. The Team has developed
impressive know-how on the matter, and considerable effort and energy are spent on
continuous development and improvement, which makes Zod a better MP every day.
However, there are a few issues. The main problem with Zod is that it lacks some fairly basic
features as well as some advanced ones. Adding simple things like new checks and
computations is very time consuming.

The Web User Interface is read-only and lacks very basic reporting functionality.
If the Team wants to keep going with Zod development, there are a number of
things that really should be improved, as the Team spends too many resources on implementing
very basic things like reporting and configuration. As an example, plain HTML and/or
JavaScript has to be written to files to add reporting.
The main programming language used by Zod is Clojure, and the Team doesn't have deep
knowledge of or skills in it.
The data is stored as plain RRD files in the local filesystem. This can cause scalability
issues unless a common shared storage solution is used to store and retrieve data from
multiple servers.
In general, Zod shares the best features of the most advanced MP solutions, but it also shares
the most critical issues of less advanced products.
If the best choice for the Company is to keep going with Zod, the issues mentioned
absolutely need to be fixed or improved. Improving the WebUI, storing the information in a
scalable and reliable fashion, and porting the Clojure code to Python or any other
programming language preferred by the Team are fundamental to achieving a world-class
solution with Zod.
Develop a new solution based on past experiences and learning from past errors:
If the Team thinks the best option for the Company is to develop new software from
scratch, the following items need to be taken into consideration:
Dell already has a commercial monitoring product: Foglight
At the same time, Dell is not offering Open Source monitoring software, so this can
be a good opportunity to do so. Also, releasing Zod or any other product as OSS
can, as a strategy, conflict less with Foglight.
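In case Dell proposes the MP software to OpenStack, it should be released under
an Open Source license. OpenStack itself is released under the Apache License 2.0,
which makes that license the natural candidate in this scenario; the LGPLv2.1+ is
another license that corporations use when releasing code.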
The knowledge of the Team should be leveraged in developing the new solution, so
most probably the Development platform could be Python and Java and the storage
platform could be HBase.
The Web Interface should give full control on both the administration and operations
sides.
As OpenStack is a priority for the Company, the development of new functionality
and integrations should be consistent with this priority.
The best Pros from the MP software comparison should be included as features.

The feedback about Zod could still be valid in this context.

Develop a new solution based on an existing Open Source software:


In this context, the following considerations should probably be taken into account when
choosing the software:
Developing software that serves the interests of a single Company should be avoided.
This is even more important if the main contributor to the software is a
competitor.
Time and effort need to be spent on any available Open Source solution to implement
capabilities that match the Team's requirements.
The improvements should be released back to the Open Source community. In this case,
Dell's image and perception will be strengthened. We will also have the Open
Source workforce implementing more capabilities, reducing resource costs.