You are on page 1of 5

SystemMonitoringUsingNAGIOS,Cacti,andPrism.

ThomasDavis,NERSCandDavidSkinner,NERSC

ABSTRACT: NERSC uses three Opensource projects and one internal project to providesystemsfaultandperformancemonitoring.ThesethreeprojectsareNAGIOS, Cacti,andPrism.

1.Introduction
Nagios , Cacti, and Prism are part of the core OperationalMonitoringstrategyatNERSC.

2.WhatisNagios?
Nagiosisafaultmonitoringsoftwarepackagewith pluginsupport.Thepluginsupportallowsonetocreate methodsofmonitoringthatarespecifictoasystem,or business/contractlogic. 2.1Where/HowdoIgetit? You can download it from http://www.nagios.org/. SomeLinux distributions (ie,Fedora)alreadyhaveitin therepositories,allowingabinaryinstall. 2.2Howdoesitwork? By using 'plugins', it can monitor, notify, and acknowledgewhenfaultsoccur. Canreturn3possible exit states OK, WARNING, and CRITICAL. The plugincanalso return atext message, statingwhat the conditionmeans. 2.2WhatcanIdowithit? Youcanwatchfornetwork,process,hardwareand softwarefaults.Youcanalsocollectdataforprocessing byotherutilities. Notifications can be sent based on date, time, or severity.

3.1Where/HowdoIgetCacti? Cacti is available from http://cacti.net. Some distributions (ie, Fedora) also supply a version intheir repositories.However,andimportantarchitectureplugin featureofcactihastobepatchedin. 3.2HowdoesCactiwork? Cacti uses Round Robin Databases (RRD) and MySQL database technologies to store collected information. MySQL and PHP is used to provide a graphical, web based interface to the RRD databases. Rrddatabasetechnologywaspopularizedinthewidely knownMRTGgraphingproject. Cacti takes MRTG configuration to a whole new levelitprovidesa'clickandpoint'interfacetoRRD's. Ausercaneasilydefine,graph,andtrackovertimemany different devices without editing a configuration file usinganeditor. 3.3WhatcanIdowithit? Cactiisextendable,anduserdriven.Thelimitsare

4.WhatisPRISM?
Prism is a project that collects data from varios NAGIOS systems using RDF, and then displays the resultsinanaggregatedform. 4.1HowdoIgetPRISM? 4.2Howdoesitwork? UsingRDF/XML/RSSfeedsfromnagios,itcollects, processes,andcreatesacolorcoded,simplifieddisplayof systemstatus,allowingquick,easyvisualnotificationof faults.

3.WhatisCacti?
Cacti is performance monitoring tool based on a LAMP stack (Linux/Apache/MySQL/PHP) and RRD (Round Robin Database). It can collect, manage and displaygraphsofcollecteddata.

CUG2009Proceedings1of5

4.2WhatcanIdowithit? Prism allows the grouping of several NAGIOS (in NERSC'scase,9displays)intoonedisplay.Thisallows 'ataglance'statuschecks.

DDN I/O operations, collected from a DDN controllerport.

5.ExamplesofNagios/Cacti/PRISMusage
5.1NagiosfaultmonitoringofDDNcontrollersand drives.

DDNRead/Write/Verify/Rebuilddataratesgraph

Asnapshotfromnagios,showingtheaDDNdrive fault. Thisinformationiscollectedusingtelnet,driven byaPerl/Expectscript.Thescriptiscapableofdetecting system faults, system temperatures, and other errors. This plugin makes finding and repairing DDN faults easier. 5.2Cactidatacollection,processingandgraphingof DDNcontrolleranddriveperformance. Cacti is used to collect I/O statistics from the controllers. Areasofdatacollection areI/Ooperation statistics,I/Oread/write/verify/rebuilddatarates,andI/O blocksizes,anddrivetemperatures.Foreachcontroller, agraphforeachport,plusagraphfortotalisgenerated. The data is collected using either the DDN api or telnet/command line scraping. Examples of several differentgraphsare:

DDNWriteI/OSize,histogramstyle. 5.3ThermalmonitoringofaCrayXT4system. Several plugins have been created to monitor the thermal environment in themachine room. The nodes monitoredfortemperatureareCraycabinets,airhandling units(AHU),andDDNcontrollersanddrives. The system has a set point that is used to send a WARNING email message when the temperature rises pastacertainpoint,andaCRITICALemailmessageata higher point. These two points are chosen to allow operations/staff to time to react and either fix the temperatureproblemordoagracefulshutdown.Thisis to prevent an automatic overtemp shutdown from occuring. Theothergoal ofthismonitoringistohelp provide information on variances in the machine room floortemperatures,andprovidelongterminformationfor futureplanning.

CUG2009Proceedings2of5

An example of a alert message from NAGIOS, whichismonitoringtheCraycabinetsusingtheXTWM servicefromtheSMW:

Noticeinthisemailmessage,thatitcontainsseveral bitsofinformation: 1) Typeofalert. 2) Sourceofalert 3) Date&Timeofalert. 4) Temperatureofsystem Whenthetemperatureofthesystemreturnsbackto normal,, nagios will send another email message notificationofrecovery:

The system also collects the temperature information, and stores into a MySQL database. This database is queried by Cacti, to generate a graph of temperatures,likethis:

Thisgraphshowstheaveragetemperatureovera1 yearspan,reducedtoa1dayaveragefroma5minute average. Nagios and Cacti arealso combined, togenerate a screenthatallows'ataglance'statusthermalinformation ofthesystem.Thisimageshowsthatonecanseethereis no major temperature problems on the machine floor. However,onecanseethatcertainpartsofthefloorare warmerthanotherparts.

CUG2009Proceedings3of5

the NAGIOS plugin looks at the XTCLI information generatedontheSMWbasedonchassis,andfaultsona nodefailure. Thepluginworksbyusingdesignfeature ofNagios,inthat youcangofrom anOKstate,toan CRITICAL, then to an WARNING state, and only generate one email message. The plugin does the following: 1. AcronjobisrunontheSMW,thatgeneratesa xtclifile.ThisfileistransferedtotheNAGIOS monitoringboxforprocessing. 2. Apluginprocessthisfile,tofindinformationon achassisbasis. 3. Ifanodeflagstatechanges,aCRITICALalertis issued.Thedownnodeisrecordedintoastatus fileforthatchassis. 4. Onthe next pass.The pluginwill now emit a warningstate,unlessanothernodeflagchanges state. All acknowledgements, recoveries, and warning states email notifications are supressed. Only CRITICALstateemailsaresentforthisplugin. Using this method allows the operations center to acknowledge the failed node inNAGIOS. Again, this getsbackto'ataglance'stateinformationtheNagios tatical display will show nonacknowledged node failures. AnexamplefromNAGIOS:

The following is an example of a thermal event. Where a chiller had problems. This was caught, and operations/staff had a chance to correct the problem beforethesystemreachedanovertemperatureshutdown.

5.4MonitoringusingXTCLIaCrayXT4system. This plugin was written to watch for down nodes, andthentologthemtoadatabase,generateanalert,and then also display the status in NAGIOS. Several problemshadtobedealtwith,startingwithhowdoyou monitor9k+nodes?Thesimplifiedansweris,youdon't. Thetechniquethatisusedbythisplugandappearsto workismonitoronachassisbasis. Whatthismeansis

5.5NetworkWeathermappluginofCacti. One of the nagios/cacti systems collects data from severalswitches/routersinthenetwork,andgeneratesa network weather map. This weather map is useful to

CUG2009Proceedings4of5

figure out usage patterns ie, is someone/something transferlargeamountsofdatafromtheNERSCGlobal Filesystem(NGF)tothemassstoragesystem?Ifso,the networkweathermapcanshowthis.Theweathermaphas alsobeenusedinthepasttoshowpoorlyutilizedormis configuredroutesandlinks.

Acknowledgments
The Cray staff at NERSC, for being patient with some of the questions I have asked in trying to get performance and fault monitoring in place. The operationsstaff,fortheiradoptingoftheCacti thermal pageforFranklinbeforeIevencompletedit,andtheir patienceasIbrokeandfixeditonseveraloccasions.

AbouttheAuthors
Thomas Davis works for NERSC, EMail: tdavis@nersc.gov David Skinner works for NERSC, EMail: dskinner@nersc.gov 5.6Prism

ThisimageshowsPrisminacompletelyacknowledged state.

7.Conclusion
Nagios, Cacti, and Prism are 3tools that are used everydayatNERSC. Theyprovidevalubleinsightinto system's status in a standardized way the reduces staff training. These tools have provided data intoareas of NERSC that was assumed to be correctly configured, allowing a more proactive instead of reactive managementstyleofsystemstobeused.

CUG2009Proceedings5of5

You might also like