Ganglia & Nagios

Maciej Lasyk
11. Sesja Linuksowa
Wrocław, 2014-04-06

Maciej Lasyk, Ganglia & Nagios 1/25


Ganglia... what?
Ganglia: clusters (groups) of neurons found outside the central nervous system

Just a little about monitoring

- the need for monitoring

- measuring availability

- measuring performance

- gathering additional metrics

Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
MTTD (Mean Time to Diagnose)
The average time it takes to diagnose the problem
MTTR (Mean Time to Repair)
The average time it takes to fix a problem
MTTF (Mean Time to Failure)
The average time the service behaves correctly
MTBF (Mean Time Between Failures)
The average time between successive failures of the service

A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
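
A quick worked example of the formula above, as a minimal Python sketch. The function name and the sample figures (in hours) are illustrative assumptions, not values from the talk.

```python
def availability(mttf: float, mttd: float, mttr: float) -> float:
    """Return A = MTTF / (MTTF + MTTD + MTTR); all arguments share one time unit."""
    mtbf = mttf + mttd + mttr  # MTBF as decomposed in the formula above
    if mtbf <= 0:
        raise ValueError("MTBF must be positive")
    return mttf / mtbf


# Hypothetical figures in hours: 999 h of correct behaviour,
# 0.25 h to diagnose and 0.75 h to repair each failure.
print(f"{availability(999.0, 0.25, 0.75):.4%}")  # prints 99.9000%
```

Note that shortening MTTD (better monitoring) raises A just as much as shortening MTTR by the same amount.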

What should we monitor?

- hardware housing
- devices
- storage
- network
- hosts
- software (very deep hole)

Think dependencies!

When an outage hits us: don't panic!

- Notifications
- Escalations
L1 <-> L2 <-> L3 <-> L4 lol ;)
desktop support / devs / ops / networking / storage / middleware / dc / security
- The clock is ticking, so it should be simple
- What if a cell phone is offline or someone is out?

Monitoring: notification issues

- false positives

- major events

- failover notifications?

- tolerance & critical thresholds

Monitoring: reporting

- baseline

- correlation between incidents and change management

- trending info

- reporting

Monitoring: good practices

- don't NIH! (avoid Not Invented Here syndrome)
- DVCS
- testing envs
- think usability!
- passive checks
- automate, don't hardcode
- security

Last but not least...

Quis custodiet ipsos custodes?

(Who will guard the guards?)

Nagios recap

Hosts / Services / Contacts

- hosts, host groups
- services, service groups
- templates
- time periods
- host and service dependencies
- regular expressions

Checks and states


- frequencies & thresholds
- scheduling downtimes
- outages and flapping
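
The threshold idea can be sketched as a tiny Nagios-style check. The exit codes 0/1/2/3 are the standard Nagios plugin convention; the function names, the metric label and the threshold values are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a Nagios state; higher values are worse."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_output(label, value, warn, crit):
    """Return (state, one-line output); perfdata follows the '|' separator."""
    state = evaluate(value, warn, crit)
    if state == UNKNOWN:
        return state, f"UNKNOWN - no data for {label}"
    return state, f"{STATE_NAMES[state]} - {label} is {value} | {label}={value};{warn};{crit}"


state, line = plugin_output("load_one", 3.2, warn=4.0, crit=8.0)
print(line)
# A real plugin would finish with sys.exit(state) so Nagios sees the state.
```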

Notifications
- periods
- groups
- which states to be notified about?
- escalations / rotations

Maciej Lasyk, Ganglia & Nagios 10/25


Nagios recap

Notifications
- periods
- groups
- which states to be notified about?
- escalations / rotations
- custom notification methods

Monitoring remotes

- NRPE daemons

- checks via SSH

Web interface
- tactical overview
- availability reports
- trends
- network maps


Networking recap
- Unicast
- Multicast
- Broadcast
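
To make the three delivery modes from the networking recap concrete, here is a small sketch using only Python's standard `ipaddress` module. The function name and the sample addresses are assumptions for illustration; 239.2.11.71 is used because it is the multicast group commonly found in stock gmond.conf files, so check your own configuration.

```python
import ipaddress


def delivery_mode(address: str, network: str) -> str:
    """Classify a destination IPv4 address as unicast, multicast or broadcast."""
    ip = ipaddress.ip_address(address)
    net = ipaddress.ip_network(network)
    if ip.is_multicast:
        return "multicast"  # one sender, every host that joined the group
    if ip == net.broadcast_address or str(ip) == "255.255.255.255":
        return "broadcast"  # one sender, every host on the segment
    return "unicast"  # one sender, one receiver


for addr in ("192.168.1.10", "239.2.11.71", "192.168.1.255"):
    print(addr, "->", delivery_mode(addr, "192.168.1.0/24"))
```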


Ganglia: what is it?

Problems of big scale:

20k hosts with a zillion metrics, probed every 10 seconds

It is fully redundant (until you break it)

It is very scalable

Regexp searches and ad hoc creation of views :)

Ganglia architecture

Ganglia topologies
- Default multicast topology
- Deaf / mute multicast topology
- Unicast topology
- Gmetad topology
- Gmetad HA topology (active-active)
- Gmetad hierarchical topology


Ganglia RRDcached

Ganglia sFlow

Ganglia web (views)
- grid view
- cluster view
- physical view
- host view
- compare hosts


Ganglia web (events)

Events have a JSON-based API

Think integration with any app :)
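
As a hedged illustration of such an integration, the sketch below only builds the request URL. The endpoint path `api/events.php` and the parameters `action`, `start_time`, `summary` and `host_regex` are assumptions based on how ganglia-web's events API is commonly described; verify them against your ganglia-web version. The host name and event text are made up.

```python
from urllib.parse import urlencode


def event_url(base_url, summary, host_regex, start_time="now"):
    """Build a URL asking ganglia-web to record an event (assumed API shape)."""
    query = urlencode({
        "action": "add",
        "start_time": start_time,
        "summary": summary,
        "host_regex": host_regex,
    })
    return f"{base_url.rstrip('/')}/api/events.php?{query}"


url = event_url("http://ganglia.example.com/ganglia", "Deploy v1.2", "web")
print(url)
# To actually send it: urllib.request.urlopen(url), then inspect the JSON reply.
```

A deploy script could call this at the start of a release, so the event shows up as a marker on the graphs of matching hosts.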


Ganglia web (dashboards)

- Create view -> apply as dashboard

- Create dashboard from XML

- Generate graphs and add to views

Ganglia web (graphs)

Ganglia metrics
- base / extended metrics
- own modules
- c / c++
- mod_python
- spoofing
- gmetric
- gmetric4j / java
- Which to choose? gmetric / Python / C/C++?

Ganglia and logfiles?

ganglia-logtailer

- https://bitbucket.org/maplebed/ganglia-logtailer

- parses logfiles (in real time)

- pushes data to Ganglia (via gmetric)

- yes, it is based on specific log formats

- but it is open source, so poke around ;)
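
Pushing a value via gmetric, as the list above mentions, boils down to one command-line call per metric. A minimal sketch, assuming the `gmetric` binary is on the PATH; the helper name, metric name and value are hypothetical, and this is not how ganglia-logtailer itself is implemented.

```python
import shlex


def gmetric_command(name, value, metric_type="float", units=""):
    """Return the argv list for one gmetric call."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        cmd += ["--units", units]
    return cmd


cmd = gmetric_command("http_5xx_per_min", 3, metric_type="uint32", units="errors/min")
print(shlex.join(cmd))
# On a host with Ganglia installed: subprocess.run(cmd, check=True)
```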

So... Nagios + Ganglia!

3 ways to integrate:

- ganglia-web/nagios (PHP & bash based)
  https://github.com/ganglia/ganglia-web

- ganglia-nagios-bridge (Python & cron based)
  https://github.com/ganglia/ganglia-nagios-bridge

- check_ganglia_metric (Python)
  https://github.com/ganglia/ganglia_contrib

Nagios + Ganglia: ganglia-web/nagios

https://github.com/ganglia/ganglia-web
Sending Nagios data to Ganglia:
- service_perfdata_command
Or replace Nagios checks with Ganglia!
- Check heartbeat.
- Check a single metric on a specific host.
- Check multiple metrics on a specific host.
- Check multiple metrics across a regex-defined range of hosts.

Nagios pulls info from Ganglia via HTTP

Nagios + Ganglia: ganglia-nagios-bridge

- https://github.com/ganglia/ganglia-nagios-bridge

- Python script, run e.g. from crontab

- pulls data from Ganglia XML via sockets

- parses the XML

- sends data to Nagios

- Nagios handles only passive checks
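
For readers new to passive checks: one common way to hand a result to Nagios is the `PROCESS_SERVICE_CHECK_RESULT` external command. The sketch below only formats such a line. Whether ganglia-nagios-bridge submits results this way is not stated in the talk, and the host, service and timestamp are made up.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


line = passive_service_result("web01", "cpu_idle", 0, "OK - cpu_idle is 93.5",
                              now=1396742400)
print(line, end="")
# A real submitter would append this line to the Nagios command file
# (a named pipe whose path is set by command_file in nagios.cfg).
```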

Nagios + Ganglia: check_ganglia_metric

- https://pypi.python.org/pypi/check_ganglia_metric/

- basically a Nagios plugin

- pulls data from Ganglia XML via sockets

- check_ganglia_metric.py \
--gmetad_host=gmetad-server.example.com \
--metric_host=host.example.com --metric_name=cpu_idle
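
"Pulls data from Ganglia XML via sockets" can be sketched roughly as follows. This is not the plugin's actual code: the helper names are hypothetical, port 8651 is assumed to be gmetad's XML port (check `xml_port` in your gmetad.conf), and the sample document is a hand-written, heavily trimmed stand-in for a real dump.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host, port=8651, timeout=5.0):
    """Read the whole XML dump that gmetad writes to a TCP client."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks)


def metric_value(xml_bytes, metric_host, metric_name):
    """Return the VAL attribute of one metric on one host, or None."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        if host.get("NAME") != metric_host:
            continue
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                return metric.get("VAL")
    return None


# Hand-written sample so the parsing step can be tried without a server.
SAMPLE = b"""<GANGLIA_XML VERSION="3.6.0" SOURCE="gmetad">
<GRID NAME="demo"><CLUSTER NAME="web">
<HOST NAME="host.example.com" IP="192.0.2.10">
<METRIC NAME="cpu_idle" VAL="93.5" TYPE="float" UNITS="%"/>
</HOST></CLUSTER></GRID></GANGLIA_XML>"""

print(metric_value(SAMPLE, "host.example.com", "cpu_idle"))  # prints 93.5
```

Against a live server the two pieces combine as `metric_value(fetch_xml("gmetad-server.example.com"), "host.example.com", "cpu_idle")`.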

Nagios + Ganglia

Which integration should I use?

Seriously: try them yourself and test

Freenode #ganglia

https://lists.sourceforge.net/lists/listinfo/ganglia-general

Sources?
- Monitoring with Ganglia book
- also nagios.org
- and Web Operations book
- plus some experience ;)

Thank you :)

http://maciek.lasyk.info/sysop

maciek@lasyk.info

@docent-net
