Comparative Review of OSS NMS Tools
Synopsis
Nuts-and-bolts network and systems management is currently unfashionable. The emphasis is far more on processes that implement service management, driven by methodologies and best practices such as the Information Technology Infrastructure Library (ITIL). Nonetheless, all service management disciplines ultimately rely on a way to determine some of the following characteristics of systems and networks:
The commercial marketplace for systems and network management offerings tends to be dominated by the big four: IBM, HP, CA and BMC. Each has large, modular offerings which tend to be very expensive. Each has grown its portfolio by buying up other companies and then performing some level of integration between their respective branded products. One can argue that the resulting offerings tend to be "marketechtures" rather than architectures.

This paper looks at Open Source software that addresses the same requirements. Offerings from Netdisco, Cacti and The Dude are examined briefly, followed by an in-depth analysis of Nagios, OpenNMS and Zenoss.

This paper is aimed at two audiences. For a discussion on systems management selection processes and an overview of three main open source contenders, read the first few chapters. The last few chapters then provide a product comparison. For those who want lots more detail on Nagios, OpenNMS and Zenoss, the middle sections provide in-depth discussions with plenty of screenshots.
Table of Contents
1 Defining Systems Management
1.1 Jargon and processes
1.2 Systems Management for this paper
2 Systems management tools
2.1 Choosing systems management tools
2.2 The advantages of Open Source
3 Open Source management offerings
4 Criteria for Open Source management tool selection
4.1 General requirements
4.1.1 Mandatory Requirements
4.1.2 Desirable Requirements
4.2 Defining network and systems management
4.2.1 Network management
4.2.2 Systems management
4.3 What is out of scope?
5 A quick look at Cacti, The Dude and netdisco
5.1 Cacti
5.2 netdisco
5.3 The Dude
6 Nagios
6.1 Configuration - Discovery and topology
6.2 Availability monitoring
6.3 Problem management
6.3.1 Event console
6.3.2 Internally generated events
6.3.3 SNMP TRAP reception and configuration
6.3.4 Nagios notifications
6.3.5 Automatic responses to events - event handlers
6.4 Performance management
6.5 Nagios summary
7 OpenNMS
7.1 Configuration - Discovery and topology
7.1.1 Interface discovery
7.1.2 Service discovery
7.1.3 Topology mapping and displays
7.2 Availability monitoring
7.3 Problem management
7.3.1 Event console
7.3.2 Internally generated events
7.3.3 SNMP TRAP reception and configuration
7.3.4 Alarms, notifications and automations
7.4 Performance management
7.4.1 Defining data collections
7.4.2 Displaying performance data
7.4.3 Thresholding
7.5 Managing OpenNMS
7.6 OpenNMS summary
8 Zenoss
8.1 Configuration - Discovery and topology
8.1.1 Zenoss discovery
8.1.2 Zenoss topology maps
8.2 Availability monitoring
8.2.1 Basic reachability availability
8.2.2 Availability monitoring of services - TCP/UDP ports and windows services
8.2.3 Process availability monitoring
8.2.4 Running commands on devices
8.3 Problem management
8.3.1 Event console
8.3.2 Internally generated events
8.3.3 SNMP TRAP reception and configuration
8.3.4 email/pager alerting
8.3.5 Event automations
8.4 Performance management
8.4.1 Defining data collection, thresholding and graphs
8.4.2 Displaying performance data graphs
8.5 Zenoss summary
9 Comparison of Nagios, OpenNMS and Zenoss
9.1 Feature comparisons
9.1.1 Discovery
9.1.2 Availability monitoring
9.1.3 Problem management
9.1.4 Performance management
9.2 Product high points and low points
9.2.1 Nagios goodies and baddies
9.2.2 OpenNMS goodies and baddies
9.2.3 Zenoss goodies and baddies
9.3 Conclusions
10 References
11 Appendix A Cacti installation details
Service Support which includes the:
• Service Desk function
• Incident management process
• Problem management process
• Configuration management process
• Change management process
• Release management process

Service Delivery which includes the:
• Service Level management process
• Capacity management process
• IT Service Continuity management process
• Availability management process
• Financial management for IT services
Key to the core of configuration management and the entire ITIL framework is the concept of the Configuration Management Database (CMDB), which stores and maintains Configuration Items (CIs) and their interrelationships.

The art of systems management is defining what is important: what is in scope and, perhaps more importantly, what is currently out of scope. The science of systems management is then to effectively, accurately and reliably provide data to deliver your systems management requirements. The devil really is in the detail here. A comprehensive systems management tool that delivers a thousand metrics out of the box but which is unreliable and/or not easily configurable is simply a recipe for a project that is delivered late and over budget.
different piecemeal tools for different projects, especially when the cost of building and maintaining skills and educating users is taken into account.

Tool integration is a large factor in the successful rollout of systems management. The concept of a single Configuration Management Database (CMDB) that all tools feed and use is key to this.

A good tool delivers useful stuff easily out of the box and provides a standard way to then provide local customisation.

At its most basic, the tool is a compiler or interpreter (C, bash, ...) and the customisation is writing programs from scratch. At the complex end of the spectrum, the tool may be a large suite of modules from one of the big four commercial suppliers, IBM, HP, CA and BMC. At the really complex end is where you have several of the big commercial products involved in addition to home-grown programs.
• Ease of use: not just what demos well but what implements well in your environment
• Skills necessary to implement the requirements versus skills available
• Requirements for and availability of user training
• Cost: all of it, not just licences and tin but also evaluation time, maintenance, training, ...
• Support: from supplier and/or communities
• Scalability
• Deployability: management server(s) ease of installation and agent deployment
• Reliability
• Accountability: the ability to sue/charge the vendor if things go wrong
V1 (1988) still most prevalent. Significant potential security and performance issues.
Of the Open Source management solutions available, some are excellent point solutions for specific niche requirements. MRTG (Multi Router Traffic Grapher), written by Tobi Oetiker, is an excellent example of a compact application that uses SNMP to collect and log performance information and display it graphically. If that satisfies your requirement, don't look any further, but it will not help you with defining and collecting problems from different devices and then managing those problems through to resolution.

An enhancement of MRTG is RRDTool (Round Robin Database Tool), again from Tobi Oetiker. It is still fundamentally a performance tool, gathering periodic, numeric data and displaying it, but RRDTool has a database at its heart. The size of the database is predetermined on creation and newer data overwrites old data after a predetermined interval. RRD can be found embedded in a number of other Open Source management offerings (Cacti, Zenoss, OpenNMS).

A further enhancement from RRDTool is Cacti, which provides a complete front end to RRDTool. A backend MySQL relational database can be used behind the Round Robin databases; data sources can be pretty well any script in addition to SNMP; and there is user management included. This is still a performance data collection and display package, not a multi-discipline, framework, systems management solution.

Moving up the scale of features and complexity, some offerings are slanted more towards network management (netdisco, The Dude); others towards systems management (Nagios).

Some aim to encompass a number of systems management disciplines with an architecture based around a central database (Nagios, Zenoss, OpenNMS).

Some are extremely active projects with hundreds of appends to mail lists per month (Nagios, Zenoss, OpenNMS, Cacti); others have a regular but smaller community with hundreds of mail list appends per year (netdisco).

Some are purely Open Source projects, typically licensed under the GNU GPL (MRTG, RRDTool, Cacti) or BSD license (netdisco); some have free versions (again typically under GPL) with extensions that have commercial licences (Zenoss). In addition to free licences, several products offer support contracts (Zenoss, Nagios, OpenNMS).
Most are available on several versions of Linux; MRTG, RRDTool and Cacti are also available for Windows. The Dude is basically a Windows application but can run under WINE on Linux.

Most have a web-based GUI supported on Open Source browsers. OpenNMS can only display maps by using Internet Explorer.
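The round-robin behaviour described for RRDTool above (a database whose size is fixed at creation, with newer samples overwriting the oldest) can be illustrated with a toy sketch. This is not RRDTool's actual storage format or API; the class name `RoundRobinArchive` and the sample values are hypothetical, and real RRDTool also consolidates data, which is omitted here.

```python
from collections import deque


class RoundRobinArchive:
    """Toy fixed-size store: once full, each new sample evicts the oldest."""

    def __init__(self, slots):
        # maxlen makes the deque discard the oldest entry automatically
        self.samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        self.samples.append((timestamp, value))

    def values(self):
        return [value for _, value in self.samples]


# Hypothetical data: five samples at 300-second steps into a 3-slot archive
rra = RoundRobinArchive(slots=3)
for step, octets in enumerate([100, 140, 90, 210, 175]):
    rra.update(step * 300, octets)

print(rra.values())  # only the three most recent samples remain
```

The storage requirement is therefore known up front, which is the property that makes this style of database attractive for long-running performance collection.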
Configuration
• Automatic, controllable discovery of network Layer 3 (IP) devices
• Topology display of discovered devices
• Support for SNMP V1, V2 and preferably, V3
• Ability to discover devices that do not support ping
• Ability to discover devices that do not support SNMP
• Central, open database to store information for these devices
• Ability to add to this information
• Ideally, ability to discover and display network Layer 2 (switch) topology

Availability monitoring
• Customisable "ping test" for all discovered devices and interfaces
• SNMP availability test for devices that do not respond to ping (eg. comparison of SNMP Interface administrative status with Interface operational status)
• Simple display of availability status of devices, preferably both tabular and graphical
• Events raised when a device fails its availability test
• Ability to monitor infrastructure of network devices (eg. CPU, memory, fan)
• Differentiation between device/interface down and network unreachable

Problem
• Events to be configurable for any discovered device
• Central events console with ability to prioritise events
• Ability to categorise events for display to specific users
• Ability to receive and format SNMP traps for SNMP V1, V2 and preferably, V3
• Customisation of actions in response to events, both manual actions and automatic responses
• Ability to correlate events to find root-cause problems (eg. failure of a router device is root cause of all interface failure events for that device)
Performance
Configuration
• Automatic, controllable discovery of Windows and Unix devices
• Topology display of discovered devices
• Support for SNMP V1, V2 and preferably, V3
• Ability to discover devices that do not support ping
• Ability to discover devices that do not support SNMP
• Central, open database to store information for these devices
• Ability to add to this information

Availability monitoring
• Customisable "ping test" for all discovered devices
• Availability test for devices that do not respond to ping (eg. comparison of SNMP Interface administrative status with Interface operational status, support for ssh tests)
• Ability to monitor customisable ports on a device (eg. tcp/80 for http servers)
• Ideally the ability to monitor "applications" (eg. ssh/snmp access to monitor for processes, wget to retrieve web pages)
• Simple display of availability status of devices, preferably both tabular and graphical
• Events raised when a device fails any availability test
• Ability to monitor basic system metrics: CPU, memory, disk space, processes, services (eg. the SNMP Host Resources MIB)

Problem
• Events to be configurable for any discovered device
• Central events console for network and systems management events with ability to prioritise events
• Ability to categorise events for display to specific users
• Ability to receive and format SNMP traps for SNMP V1, V2 and preferably, V3
• Ability to monitor Unix syslogs and Windows Event Logs and generate customisable events
• Ideally the ability to monitor any text logfile and generate customisable events
• Customisation of actions in response to events, both manual actions and automatic responses
• Ability to correlate events to find root-cause problems (eg. single-point-of-failure router is root cause of availability failure for all devices in a network)

Performance
• Regular, customisable monitoring of SNMP MIB variables, both standard and enterprise-specific, with data storage and ability to threshold values to generate events
• Ability to import any MIB
• Ability to browse any MIB on any device
• Ability to gather performance data by methods other than SNMP (eg. ssh)
• Customisable graphing of performance data
5.1 Cacti
Cacti is a niche tool for collecting, storing and displaying performance data. It is a comprehensive front end to RRDTool, including the concept of user management. Although the default method of data collection is SNMP, other data collectors, typically scripts, are possible.

Data collection is very configurable and is driven by the Cacti Poller process which is called periodically by the Operating System scheduler (cron for Unix). The default polling interval is 5 minutes.

Devices need to be manually added using the Cacti web-based GUI. Basic information such as hostname, SNMP parameters and device type should be supplied. Depending on the device type selected (eg. ucd/net SNMP Host, Cisco Router), one or more default graph templates can be associated with a device along with one or more default SNMP data queries. In addition to the web-based GUI, configuration of Cacti can be done by Command Line, using PHP, which is a general-purpose scripting language especially suited for web development.

Cacti now has support for SNMP V3.

For high-performance polling, Spine (formerly cactid) can replace the base cmd.php polling engine. The user manual suggests that Spine could support polling intervals of less than 60 seconds for at least 20,000 data sources.

Cacti is supported on both Unix and Windows platforms.

Get the Cacti User Manual from http://www.cacti.net/downloads/docs/pdf/manual.pdf.

Cacti has a very active user forum with hundreds of appends per month. There is also a documented release roadmap going forward to 2nd quarter 2009.

Here are a few screenshots of Cacti to give a feel for the product.
Figure 1: Cacti main Devices panel
Figure 2: Cacti graph of interface traffic
Figure 3: Cacti graph of memory for device bino
5.2 netdisco
netdisco was created at the University of California, Santa Cruz (UCSC), Networking and Technology Services (NTS) department. It is interesting as a network management configuration offering. It uses SNMP and Cisco Discovery Protocol (CDP) to try and automatically discover devices. Unlike most other management offerings, netdisco is Layer 2 (switch) aware and can both display switch ports and optionally provide access to control switch ports.

It provides an inventory of devices that you can sort either by OS or by device model, displaying all ports for a device. It also has the ability to provide a network map. User management is included so you can restrict who is allowed to actively manage devices. There is good provision of both command line interface and web-based GUI.

netdisco is supported on various platforms; it was originally developed on FreeBSD; I built it on a CentOS 4 platform.

If your requirement is strictly for network configuration management and your devices respond suitably to netdisco then this might be worth a try. I found it very quirky as to what it would discover. It appears very dependent on the SNMP system sysServices variable to decide whether a device supports network layer 2 and 3 protocols; if a device did not provide sysServices or didn't indicate layer 2/3, then netdisco would not discover it. I also had very few devices supporting Cisco CDP so the automatic discovery didn't work well for me. Although there is a file where you can manually describe the topology, this would be a huge job in a sizeable network if you had to hand-craft a significant amount of the network topology.

This project is not nearly so active as some of the other offerings discussed here (around 500 appends to the users mail list in 2007) but there seems to be a steady flow. Building the system was a fair marathon but the documentation is reasonably good.

Here are some screenshots of the main device inventory panel, plus the details of a router and the details of a switch.
Figure 4: Netdisco main device inventory display
Figure 5: Netdisco details of router device
Figure 6: Netdisco details of a switch device, including ports
6 Nagios
Nagios evolved in 2002 out of an earlier systems management project called NetSaint, which had been around since the late 1990s. It is far more a systems management product, rather than a network management product. It is available to build on most flavours of Linux/Unix and the installation has become much easier over the years. The Nagios Quickstart document is reasonably comprehensive (although it misses a few prerequisites that I found necessary like gd, png, jpeg, zlib, net-snmp and their related development packages). I downloaded and built Nagios 3.0.1 on a SuSE 10.3 platform (hostname nagios3), and had it working inside half a day.

To start the Web Interface, point your browser at http://nagios3/nagios/. The Quickstart document has you create some user ids and passwords; the default logon for the Web console is nagiosadmin with the password you specified during installation.

Here is a screenshot of the Nagios "Tactical Overview" display.
Figure 7: Nagios Tactical Overview screen
Figure 8: Nagios hosts.cfg top-level definitions
Host availability parameters are shown in the screenshot above:
Figure 9: Nagios hosts.cfg showing host template definitions
Figure 10: Nagios hosts.cfg file showing "real" host definitions
Figure 11: Nagios hosts.cfg host group definitions
Host groups are also used in the GUI to display data based on host groups.
Figure 12: Nagios Hostgroup summary
Whenever changes have taken place to any configuration file, the command:

    /etc/init.d/nagios reload

should be used. This does not stop and start the Nagios processes (use stop | start | restart | status to control the background processes); the reload parameter simply re-reads the configuration file(s). There is also a handy command to verify that your configuration files are legal and consistent, before actually performing the reload:

    /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

All objects to be managed need defining in the Nagios configuration files; there is no form of automatic discovery. However, the ability to create object templates, and thus an object hierarchy, makes definitions flexible and easy once you have defined your hierarchies.
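The verify-before-reload habit described above can be wrapped in a small helper. The following is a minimal sketch, assuming the installation paths quoted in this paper and assuming that `nagios -v` returns a non-zero exit status when the configuration is rejected; the function name `verify_then_reload` and the injectable `run` argument are hypothetical conveniences, not part of Nagios.

```python
import subprocess

# Paths as quoted in this paper; adjust for your own installation
NAGIOS_BIN = "/usr/local/nagios/bin/nagios"
NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg"
INIT_SCRIPT = "/etc/init.d/nagios"


def verify_then_reload(run=subprocess.call):
    """Reload Nagios only if the pre-flight configuration check passes.

    `run` takes an argument list and returns an exit status. It is
    injectable so the logic can be exercised without Nagios installed.
    """
    if run([NAGIOS_BIN, "-v", NAGIOS_CFG]) != 0:
        return False  # configuration rejected: leave the running daemon alone
    return run([INIT_SCRIPT, "reload"]) == 0
```

Calling `verify_then_reload()` with no arguments would execute the real commands, so it needs the same privileges as running them by hand.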
Figure 13: Nagios Status map
other community plugins available, or you can write your own. The official plugins should be installed alongside the base Nagios. The executables can be found in /usr/local/nagios/libexec (use <pluginname> --help for usage on each plugin). The official plugins include:

check_ping       configurable ping test with warning & critical thresholds
check_snmp       generic SNMP test to get MIB OIDs & test return values
check_ifstatus   check SNMP ifOperStatus against ifAdminStatus for all Administratively up interfaces
check_ssh        check that the ssh port can be contacted on a remote host
check_by_ssh     use ssh to run command on remote host
check_nt         check Windows parameters (disk, cpu, services, etc.). Needs NSClient++ agent installed on Windows targets
check_nrpe       check remote Linux parameters (disk, cpu, processes, etc.). Needs NRPE agent installed on Unix/Linux target
Nagios has two separate concepts, host monitoring and service monitoring, and there is a known relationship between the state of the host and the state of its services.

Host monitoring is a reachability test and will generally use the check_ping Nagios plugin. If you have devices that support SNMP but do not support ping (perhaps because there is a firewall in the way that blocks ping), then the check_ifstatus plugin works well: it tests all interfaces on a device and compares the SNMP administrative status with the operational status. Host monitoring is defined in the Nagios configuration files with the check_command stanza, where typically this is defined at a high level of the host definition hierarchy but can be overridden for subgroups or specific hosts. For example, in hosts.cfg:
define host {
        host_name       group-100-a1
        use             host_172.31.100   ;Inherits from this parent class
        parents         group-100-r2      ;This is n/w route to device
        alias
        address
        check_command
}
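The comparison that the text attributes to check_ifstatus (administrative status against operational status for every interface) can be sketched as follows. This is an illustration of the idea only, not the plugin's source; the function name `interface_check` and the input dictionary are hypothetical. The numeric values are the IF-MIB encodings (1 = up, 2 = down) and the return codes follow the usual Nagios plugin convention.

```python
# Conventional Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# IF-MIB encoding used by both ifAdminStatus and ifOperStatus
UP, DOWN = 1, 2


def interface_check(interfaces):
    """interfaces: dict of name -> (ifAdminStatus, ifOperStatus).

    An interface is a problem only when it is administratively up but
    not operationally up; administratively down interfaces are ignored.
    """
    if not interfaces:
        return UNKNOWN, "no interface data returned"
    failed = sorted(name for name, (admin, oper) in interfaces.items()
                    if admin == UP and oper != UP)
    if failed:
        return CRITICAL, "down: " + ", ".join(failed)
    return OK, "all administratively up interfaces are operational"


# Hypothetical device: eth1 is deliberately shut down, eth2 has failed
status, message = interface_check({
    "eth0": (UP, UP),
    "eth1": (DOWN, DOWN),
    "eth2": (UP, DOWN),
})
print(status, message)
```

Because the test relies on SNMP rather than ICMP, it still gives a usable reachability answer for devices behind a firewall that blocks ping.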
Figure 14: Nagios Host Detail display
Figure 15: Nagios service.cfg top-level objects
Figure 16: Nagios services.cfg showing specific services
Figure 17: Nagios Service detail
Figure 18: Nagios Event Log
Figure 19: Nagios Configuration for Alert Histogram
Figure 20: Nagios Alert Histogram for host group-100-r1

The Alert Summary menu option can provide various reports, specific to hosts or services.

Figure 21: Nagios Alert Summary configuration options

Limiting the report to a specific host, group-100-r1, produces the following report.

Figure 22: Nagios Alert Summary for group-100-r1
Host parameters
check_interval       default 5 mins (check interval when host OK)
retry_interval       default 1 min (check interval when host non-OK)

Service parameters
max_check_attempts   default 3 (number of attempts before HARD event)
Figure 23: Nagios Event Log showing hard and soft events
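The relationship between max_check_attempts and hard/soft events can be made concrete with a simplified model. This is a sketch of the counting rule only, assuming that a failure stays SOFT until the number of consecutive failed checks reaches max_check_attempts; the function name `classify` is hypothetical and real Nagios state handling has further cases (flapping, host/service dependencies) that are ignored here.

```python
def classify(results, max_check_attempts=3):
    """Label each check result as OK, SOFT or HARD.

    results: sequence of booleans, True meaning the check passed.
    A failure becomes HARD once the run of consecutive failures
    reaches max_check_attempts; any success resets the count.
    """
    labels = []
    failures = 0
    for passed in results:
        if passed:
            failures = 0
            labels.append("OK")
        else:
            failures += 1
            labels.append("HARD" if failures >= max_check_attempts else "SOFT")
    return labels


print(classify([True, False, False, False, False, True]))
```

With the default of 3, the first two failures are soft and the third is the one that would raise a hard event.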
notifications_enabled                    global on/off parameter
Service notification_options (w,u,c,r)   specifies notifications on service warning, unknown, critical, recovery events
Host/service notification_period         notifications only sent during this period (eg. 24x7, workdays, ...)
Host/service notification_interval       if notification already sent, problem still extant and notification_interval has elapsed, then send another notification
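How these notification parameters combine can be sketched as a single decision function. This is a simplified illustration under stated assumptions, not Nagios's actual algorithm: the function `should_notify` and its arguments are hypothetical, the interval is assumed to be greater than zero, and escalations, contacts and flapping are left out.

```python
def should_notify(state, options, in_period, minutes_since_last,
                  notification_interval, notifications_enabled=True):
    """Decide whether a notification would go out under the rules above.

    state: one-letter event type, e.g. 'w', 'u', 'c' or 'r'
    options: the notification_options string, e.g. 'w,u,c,r'
    in_period: True if the current time lies inside notification_period
    minutes_since_last: None if nothing has been sent for this problem yet
    """
    if not notifications_enabled:
        return False                      # global switch is off
    if state not in options.split(","):
        return False                      # this event type is not selected
    if not in_period:
        return False                      # outside the permitted period
    if minutes_since_last is None:
        return True                       # first notification for the problem
    return minutes_since_last >= notification_interval


# Hypothetical case: critical service, last notified 45 minutes ago
print(should_notify("c", "w,u,c,r", True, 45, 60))
```

The example prints False because the 60-minute interval has not yet elapsed since the previous notification.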
Figure 24: Nagios Default contact definition
Figure 25: Nagios Host Notifications
There is a global parameter, enable_event_handlers, which must take the value 1 (true) before any automation can take place.

There are two global parameters, global_host_event_handler and global_service_event_handler, which can be used to run commands on all host/service events. These might be used, say, to log all events to an external file.

In addition, individual hosts and services (or groups of either) can have their own event_handler directive and their own event_handler_enabled directive. Note that if the global enable_event_handlers is off then no individual host/service will run event handlers. Individual event handlers will run immediately after any global event handler.

Typically, an event handler will be a script or program, defined in the Nagios commands.cfg file, to run any external program. The following parameters will be passed to the event handler:

For Services:  $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$
For Hosts:     $HOSTSTATE$, $HOSTSTATETYPE$, $HOSTATTEMPT$

Event handler scripts will run with the same user privilege as that which runs the nagios program.

Sample event handler scripts can be found in the contrib/eventhandlers/ subdirectory of the Nagios distribution. Here is the sample submit_check_result command:
Figure 26: Nagios Sample submit_check_result command for event handler from contrib directory
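To show how the three service macros listed above might be used once they arrive as arguments, here is a minimal sketch of an event handler's decision logic. It is not one of the scripts from the contrib directory: the function names and the three action labels are hypothetical, and the policy (act only on a HARD CRITICAL state) is one reasonable choice among several.

```python
def choose_action(state, state_type, attempt):
    """Decide what a service event handler should do.

    The arguments mirror $SERVICESTATE$, $SERVICESTATETYPE$ and
    $SERVICEATTEMPT$ as they would be passed on the command line.
    `attempt` is accepted for completeness but unused by this policy.
    """
    if state != "CRITICAL":
        return "ignore"   # OK, WARNING and UNKNOWN need no repair here
    if state_type == "HARD":
        return "repair"   # retries exhausted: take corrective action
    return "log"          # SOFT failure: record it and wait for retries


def main(argv):
    """argv: [state, state_type, attempt], all strings."""
    state, state_type, attempt = argv
    action = choose_action(state, state_type, int(attempt))
    print("%s/%s attempt %s -> %s" % (state, state_type, attempt, action))
    return action


main(["CRITICAL", "SOFT", "2"])
main(["CRITICAL", "HARD", "3"])
```

Since handlers run with the privileges of the nagios user, any real repair step would need to be something that user is permitted to do.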
Note: either use the command parameter for data processing when the data is retrieved, or use the datafile for later processing.

process data file every <n> seconds
Nagios command to process data

Figure 27: Nagios Performance parameters in nagios.cfg

process_perf_data=1     1 = data collection on, 0 = data collection off

Figure 28: Nagios Performance data collected into /tmp/serviceperfdata
The most recent performance data gathered for hosts and services can also be seen from the Host Detail or Service Detail menu options.
Figure 29: Nagios Performance data highlighted DNS Check service
It is also possible to run checks on remote hosts by installing the NRPE agent (available for both Unix/Linux and Windows hosts) and the required Nagios plugins on the remote system. The check_nrpe plugin must also be installed on the Nagios system. This allows plugins designed to be run local to the Nagios system to be run on remote hosts. With NRPE agents, checks are run on a scheduled basis, initiated from the Nagios system.

Another alternative is to install the NSCA addon to remote systems. This permits remote machines to run their own periodic checks and report the results back to Nagios, which can be defined as passive service checks.

The event subsystem of Nagios is less powerful and configurable than some of the other offerings; it has less focus on an event console but includes more information about host and service events from other menus. Nagios has no easy built-in way to collect and process SNMP TRAPs.

If you want lots of performance graphs then Nagios alone is not going to deliver easily.

In summary, Nagios seems good for monitoring a relatively small number of systems, provided you don't need historical performance reporting.
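For the passive service checks mentioned above, the result that Nagios finally processes is a one-line external command. The sketch below formats such a line, assuming the standard `PROCESS_SERVICE_CHECK_RESULT` external command syntax; the helper name `passive_service_result`, the service name and the sample values are hypothetical, and the transport (NSCA or a direct write to the command file) is deliberately left out.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one passive service check result as an external command line.

    return_code follows the plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    `now` lets a caller supply the timestamp; it defaults to the current time.
    """
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


# Hypothetical result reported by a remote machine for host group-100-a1
line = passive_service_result("group-100-a1", "Disk", 1, "85% used",
                              now=1200000000)
print(line, end="")
```

The host and service named in the line must already be defined in the Nagios configuration, with passive checks accepted for that service.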
7 OpenNMS
OpenNMS presents itself as "the first Enterprise-grade network management platform developed under the Open Source model". It is a Java application that runs under several flavours of Linux. A VMware Virtual Machine (VM) is also available with the latest release of OpenNMS, which makes initial evaluation very easy without having to go through a full build process. There is also an online demo system, which appears to be monitoring real kit and gives a good first taste of the product.

The following section is based on the VM download, which is OpenNMS 1.5.93 based on Mandriva; it worked very easily. The VM was set up for DHCP but I modified the Operating System files to use a local fixed address, with the VM network bridged to my local environment.

To access the OpenNMS Web Console, point your browser at http://opennms:8980/opennms/. The default logon id is admin with a password of admin.

Here is a screenshot of the main default window of OpenNMS.
Figure 30: Main default window for OpenNMS
                <end>10.0.0.254</end>
        </include-range>
        <include-range>
                <begin>172.30.100.1</begin>
                <end>172.30.100.10</end>
        </include-range>
        <specific>10.191.101.1</specific>
</discovery-configuration>
In the above example, ping discovery will start 300,000 ms (5 minutes) after OpenNMS has started up; the discovery process will be restarted every 86,400,000 ms (24 hours); 1 ping will be sent per second; the timeout for a ping will be 800 ms and there will be 3 ping retries before the discovery process gives up on an address. All devices on the Class C 10.0.0.0 network will be polled (with only 2 retries but a 3 second timeout). The 10 devices 172.30.100.1 through 10 will be polled with the default characteristics. The specific node 10.191.101.1 will be polled.

All that the discovery process does is to generate "new suspect" events that are then used by other OpenNMS processes. If the device does not respond to this ping polling then it will not be added to the OpenNMS database.

Another way to generate such events (say for a box that does not respond to ping) is to use a provided Perl script:
/opt/opennms/bin/send-event.pl --interface <ipaddr> uei.opennms.org/internal/discovery/newsuspect
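To get a feel for what the discovery configuration described above implies, the include-range and specific entries can be expanded into a target list and the sweep time estimated from the one-ping-per-second rate. This is a rough sketch, not OpenNMS code: the helper `expand` is hypothetical, only the range and the specific address that survive in the listing are used, and the estimate is a lower bound that ignores retries and timeouts.

```python
import ipaddress


def expand(begin, end):
    """Return every IPv4 address from begin to end inclusive, as strings."""
    first = int(ipaddress.IPv4Address(begin))
    last = int(ipaddress.IPv4Address(end))
    return [str(ipaddress.IPv4Address(n)) for n in range(first, last + 1)]


# Values taken from the configuration fragment above
ranges = [("172.30.100.1", "172.30.100.10")]
specifics = ["10.191.101.1"]

targets = [ip for begin, end in ranges for ip in expand(begin, end)]
targets += specifics

packets_per_second = 1  # the rate quoted in the text
minimum_sweep_seconds = len(targets) / packets_per_second
print(len(targets), "addresses; at least", minimum_sweep_seconds, "seconds per sweep")
```

Adding a full Class C range to the same calculation shows why the packet rate, rather than the 24-hour restart time, tends to govern how quickly a first discovery pass completes.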
It is possible to apply protocols to specific address ranges or exclude protocols from address ranges (the default is inclusion).
<protocol-plugin protocol="ICMP"
                 class-name="org.opennms.netmgt.capsd.IcmpPlugin"
                 scan="on" user-defined="false">
        <protocol-configuration scan="off" user-defined="false">
                <range begin="172.31.100.1" end="172.31.100.15"/>
                <property key="timeout" value="4000"/>
                <property key="retry" value="3"/>
        </protocol-configuration>
</protocol-plugin>
When testing SNMP, capsd makes an attempt to receive the sysObjectID MIB-2 variable (.1.3.6.1.2.1.1.2.0). If successful, then extra discovery processing takes place.

First, three threads are generated to collect the data from the SNMP MIB-2 system tree and the ipAddrTable and ifTable tables. If, for some reason, the ipAddrTable or ifTable are unavailable, the process stops (but the SNMP system data may show up on the node page).

Second, all of the IP addresses in the ipAddrTable are run through the capsd capabilities scan. Note that this is regardless of how management is configured in the configuration file. This only happens on the initial scan and on forced rescans. On normal rescans (by default, every 24 hours), IP addresses that are "unmanaged" in capsd are not polled.

Third, every IP address in the ipAddrTable that supports SNMP is tested to see if it maps to a valid ifIndex in the ifTable. If this is true, the IP address is marked as a secondary SNMP interface and is a contender for becoming the primary SNMP interface.
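The third step above, selecting which addresses become secondary SNMP interfaces, amounts to a join between the ipAddrTable and the ifTable. The sketch below illustrates that rule only; it is not capsd's implementation, and the function name, the data structures and the sample addresses are hypothetical.

```python
def secondary_snmp_interfaces(ip_addr_table, if_indexes, snmp_capable):
    """Return the addresses eligible to become the primary SNMP interface.

    ip_addr_table: dict of IP address -> ifIndex, as read from ipAddrTable
    if_indexes:    set of ifIndex values present in ifTable
    snmp_capable:  set of IP addresses that answered SNMP during the scan
    """
    return sorted(ip for ip, if_index in ip_addr_table.items()
                  if ip in snmp_capable and if_index in if_indexes)


# Hypothetical device with three addresses
candidates = secondary_snmp_interfaces(
    ip_addr_table={"10.0.0.1": 1, "10.0.0.2": 2, "10.0.0.3": 9},
    if_indexes={1, 2},
    snmp_capable={"10.0.0.1", "10.0.0.3"},
)
print(candidates)
```

Here 10.0.0.2 is dropped because it did not answer SNMP, and 10.0.0.3 is dropped because its ifIndex has no matching ifTable row.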
Figure 31: OpenNMS node detail for a switch showing switch ports
Thefirststanzaincapsdconfiguration.xmldefinesservicepollingparameters:
<capsd-configuration rescan-frequency="86400000"
    initial-sleep-time="300000"
    management-policy="managed"
    max-suspect-thread-pool-size="6"
    max-rescan-thread-pool-size="3"
    abort-protocol-scans-if-no-route="false">
Figure 32: OpenNMS Node List of discovered nodes
Figure 33: OpenNMS node detail for group100r1
<monitor service="DominoIIOP" class-name="org.opennms.netmgt.poller.DominoIIOPMonitor"/>
<monitor service="ICMP" class-name="org.opennms.netmgt.poller.IcmpMonitor"/>
<monitor service="Citrix" class-name="org.opennms.netmgt.poller.CitrixMonitor"/>
<monitor service="LDAP" class-name="org.opennms.netmgt.poller.LdapMonitor"/>
<monitor service="HTTP" class-name="org.opennms.netmgt.poller.HttpMonitor"/>
<monitor service="HTTP-8080" class-name="org.opennms.netmgt.poller.HttpMonitor"/>
<monitor service="HTTP-8000" class-name="org.opennms.netmgt.poller.HttpMonitor"/>
<monitor service="HTTPS" class-name="org.opennms.netmgt.poller.HttpsMonitor"/>
<monitor service="SMTP" class-name="org.opennms.netmgt.poller.SmtpMonitor"/>
<monitor service="DHCP" class-name="org.opennms.netmgt.poller.DhcpMonitor"/>
<monitor service="DNS" class-name="org.opennms.netmgt.poller.DnsMonitor"/>
<monitor service="FTP" class-name="org.opennms.netmgt.poller.FtpMonitor"/>
<monitor service="SNMP" class-name="org.opennms.netmgt.poller.SnmpMonitor"/>
<monitor service="Oracle" class-name="org.opennms.netmgt.poller.TcpMonitor"/>
<monitor service="Postgres" class-name="org.opennms.netmgt.poller.TcpMonitor"/>
<monitor service="MySQL" class-name="org.opennms.netmgt.poller.TcpMonitor"/>
<monitor service="Sybase" class-name="org.opennms.netmgt.poller.TcpMonitor"/>
<monitor service="Informix" class-name="org.opennms.netmgt.poller.TcpMonitor"/>
<monitor service="SQLServer" class-name="org.opennms.netmgt.poller.TcpMonitor"/>
<monitor service="SSH" class-name="org.opennms.netmgt.poller.TcpMonitor"/>
<monitor service="IMAP" class-name="org.opennms.netmgt.poller.ImapMonitor"/>
<monitor service="POP3" class-name="org.opennms.netmgt.poller.Pop3Monitor"/>
<monitor service="NSClient" class-name="org.opennms.netmgt.poller.NsclientMonitor"/>
<monitor service="NSClientpp" class-name="org.opennms.netmgt.poller.NsclientMonitor"/>
target interfaces
services including the polling frequency
30 threads are available for polling. The basic event that is generated when a poll fails is called "NodeLostService". If more than one service is lost, multiple NodeLostService events will be generated. If all the services on an interface are down, instead of a NodeLostService event, an "InterfaceDown" event will be generated. If all the interfaces on a node are down, the node itself can be considered down, and this section of the configuration file controls the poller behaviour should that occur. If a "NodeDown" event occurs and node-outage status=on then all of the InterfaceDown and NodeLostService events will be suppressed and only a NodeDown event will be generated. Instead of attempting to poll all the services on the down node, the poller will attempt to poll only the critical service. Once the critical service returns, the poller will then resume polling the other services.
Note in the following screenshot that six services have been discovered on the 10.0.0.95 interface of the node called deodar.skills1st.co.uk, of which four are monitored. The two interfaces on the 172.16 network have been detected through SNMP queries but there is no monitoring of any services on these networks. There are no current issues with deodar and availability has been 100% over the last 24 hours.
Figure 34: OpenNMS node detail with monitored services
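The roll-up of poll failures into NodeLostService, InterfaceDown and NodeDown events described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not OpenNMS code; the function name `outage_events`, the `node_outage_on` flag and the dictionary layout are assumptions made for the example.

```python
def outage_events(node, node_outage_on=True):
    """Return the outage events raised for one node after a polling cycle.

    `node` maps an interface address to a dict of {service name: poll succeeded}.
    Each event is a (name, interface, service) tuple. Hypothetical helper.
    """
    def interface_down(services):
        # An interface counts as down when none of its services answered.
        return not any(services.values())

    # With node-outage processing on, one NodeDown replaces all lower-level events.
    if node_outage_on and all(interface_down(s) for s in node.values()):
        return [("NodeDown", None, None)]

    events = []
    for interface, services in node.items():
        if interface_down(services):
            events.append(("InterfaceDown", interface, None))
        else:
            for service, poll_ok in services.items():
                if not poll_ok:
                    events.append(("NodeLostService", interface, service))
    return events


partial = {
    "10.0.0.95": {"ICMP": True, "HTTP": False, "SSH": False},
    "172.16.1.1": {"ICMP": False},
}
dead = {"10.0.0.95": {"ICMP": False}, "172.16.1.1": {"ICMP": False}}
print(outage_events(partial))
print(outage_events(dead))
```

In the first call two NodeLostService events and one InterfaceDown event are produced; in the second only a single NodeDown event is returned, which mirrors the suppression behaviour described for node-outage status=on.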
OpenNMS includes a standard set of Availability reports. They can be selected from the Reports menu:
Figure 35: OpenNMS Availability reports menu
Here is a sample:
Figure 36: OpenNMS Overall service availability report
Figure 37: OpenNMS Events menu
The Advanced Search option provides several ways to filter events. By default, Outstanding events are displayed (i.e. events that have not been Acknowledged).
Figure 38: OpenNMS Advanced Event Search options
Note that if you wish to search on severity, you have to specify an exact severity; you cannot specify "severity greater than ...".
Figure 39: OpenNMS display of All events
Figure 40: OpenNMS Event detail for event 139192
a label to uniquely identify the event
a text label for the event used in the Web GUI
description of the event
summary of the event, where the "dest" parameter is one of:
    log to events database and display in web GUI
    log to database but don't display in web GUI
    don't log to database or web GUI
    don't log or display but do pass to other daemons (e.g. for notification)
    trapd to discard TRAPs, no processing whatsoever
severity
alarm-data    create an alarm for this event with:
    reduction-key    fields to compare to determine duplicate event
    alarm-type
    auto-clean
Many of the tags can use data substituted from the event. These are documented on the OpenNMS wiki:
Figure 41: OpenNMS event parameters that can be substituted
Here is an example event from the default eventconf.xml:
Figure 42: OpenNMS event definition for nodeLostService
The different severities available can be seen by selecting the Severity Legend option from the top of an events list.
Figure 43: OpenNMS event severity legend
Note that there is no separate file to configure alarms; it is simply done with the <alarm-type> tag in eventconf.xml.
OpenNMS comes with a huge number of events pre-defined. To make eventconf.xml much more manageable, inclusion files can be specified at the end, such as:
<event-file>events/NetSNMP.events.xml</event-file>
The events subdirectory currently has around 100 files in it! For performance reasons, it makes sense to edit eventconf.xml and remove any <event-file> stanzas that are not relevant for your organisation.
Also note that the whole OpenNMS system must be recycled in order for changes to eventconf.xml to take effect!
Figure 44: OpenNMS Unknown trap appears in the Events list
Clicking on the event ID gives the detail of the event which shows all the information that arrived with the TRAP.
Figure 45: OpenNMS Event detail for an unformatted TRAP
Figure 46: OpenNMS Definition in default.events.xml for an unknown specific trap
uei source host snmphost nodeid interface service id(OID) specific generic
The above code snippet will match if the third parameter has a value of "2" or "3" and the fourth parameter has a value of "2" or "3". It is also possible to use regular expressions when matching varbind values.
Again, note that the order in which events are listed is very important. Put the most specific events first.
Here is an example definition that includes matching a varbind with a regular expression. Note the <vbvalue> matches any string that contains either "Bad" or "bad". Extra stanzas have also been added for <operinstruct> help (which provides a web link on one line and plain text on the second), a <mouseovertext> tag (which doesn't appear to work) and a tag to run an automatic action (a shell script) whenever this event occurs.
Figure 47: OpenNMS Configuration of specific TRAP with varbind matching a regular expression
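The two kinds of varbind matching described above (a list of allowed values per parameter, and a regular expression) can be mimicked in Python to check what a given TRAP would match. This is a sketch of the matching logic only; the function names and the pattern `[Bb]ad` are assumptions chosen to reproduce the behaviour described in the text, and it says nothing about the exact syntax OpenNMS expects inside <vbvalue>.

```python
import re

# Assumed pattern: matches any string that contains either "Bad" or "bad".
BAD_PATTERN = re.compile(r"[Bb]ad")


def third_and_fourth_match(varbinds):
    """True if the 3rd and 4th varbind values are each "2" or "3".

    `varbinds` is the list of varbind values in the order they arrived.
    """
    allowed = {"2", "3"}
    return len(varbinds) >= 4 and varbinds[2] in allowed and varbinds[3] in allowed


def contains_bad(value):
    """True if the varbind value contains "Bad" or "bad" anywhere."""
    return BAD_PATTERN.search(value) is not None


print(third_and_fourth_match(["x", "y", "2", "3"]))
print(contains_bad("Fan status is Bad"))
```

Because the first matching definition wins, a definition using a test such as `contains_bad` has to be listed before a more general definition for the same TRAP.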
Figure 48: OpenNMS Alarms display
Alarms are defined as part of an event definition in eventconf.xml and its include files. It uses the <alarm-data> tag where:
reduction-key
alarm-type
At present (July 10th, 2008), acknowledging events has no effect on related alarms, and vice versa. Note that the concepts of Acknowledging and Clearing are completely different. An operator can acknowledge an event or an alarm, and then "owns" it. This does not clear the event (i.e. remove it entirely from the events database).
Automatic actions can be configured for an event using the <autoaction> tag but this can only run an executable and it runs on every occurrence of the event (which may not be what you want!).
OpenNMS's concept of automation, however, is triggered from alarms rather than events. Automation is the concept of actions being performed on a scheduled basis, provided the correct triggers exist. An <automation> tag includes:
Figure 49: OpenNMS Default definitions for automations in vacuumd.xml
Figure 50: OpenNMS Definition of selectResolvers trigger in vacuumd.xml
... and the clearProblems action:
Figure 51: OpenNMS Definition of clearProblems action in vacuumd.xml
The trigger is keyed on the field alarmType=2. Note that the first version of the action is commented out; the clear-uei element is now deprecated in the <alarm-data> tag and only the clear-key element on the "good news" event is used to match against the reduction-key element of the "bad news" event, setting the severity to 2 (i.e. Cleared). Also note from the <automation> tag that cosmicClear will run every 30 seconds.
If users need to be notified of an event then OpenNMS provides email and pager notifications out-of-the-box, run by the notifd daemon. It is also possible to create other notification methods such as SNMP TRAPs or an arbitrary external program. There are several related configuration files in /opt/opennms/etc:
Figure 52: OpenNMS Example entries in destinationPaths.xml
The <name> tag specifies a user or group of users defined in OpenNMS. The <command> tag specifies a method that must be defined in notificationCommands.xml. Note that escalations are possible.
When an event is received for which a notification is required, OpenNMS "walks" the destination path. We say that the destination path is "walked" because it is often a series of actions performed over time and not necessarily just a single action (although it can be). The destination path continues to be walked until all notifications and escalations have been sent or the notification is acknowledged (automatically or by manual intervention).
Out-of-the-box, the only destinationPath that is configured is for javaEmail to the Admin group of users.
The notifications.xml file specifies what events trigger notifications and to whom. Here is an example from the default file:
Figure 53: OpenNMS Extract of notifications from notifications.xml
Figure 54: OpenNMS notifd-configuration.xml with auto-acknowledgements for notifications
Note that if a device has several interfaces that:
Figure 55: OpenNMS collectd-configuration.xml as shipped
statement per package and can then use multiple <specific>, <include-range> and <exclude-range> statements to define which interfaces this package applies to. You can also use the <include-url> tag to specify a file with a list of interfaces.
There is only one data collection service defined for OpenNMS out-of-the-box, in collectd-configuration.xml: the SNMP service. It will run every 5 minutes (300,000 ms) and will collect the MIB variables specified in the collection called "default", specified in datacollection-config.xml. The <service> stanza can also specify values for SNMP timeouts, retries and port number which would override the default values in snmp-config.xml.
The package definition can also use the <outage-calendar> tag to specify scheduled downtime for devices, during which data collection will be suspended. This should be used to prevent lots of failed SNMP collection events. Outage periods are defined in the poll-outages.xml file.
Obviously you can specify different packages with different address ranges, collection intervals and with different collection keys. You can also specify data collectors other than SNMP, such as NSClient, JMX and HTTP. See http://blogs.opennms.org/?p=242 for a note on using an HTTP data collector.
The datacollection-config.xml file defines one or more SNMP data collections that Tarus Balog (the prime developer behind OpenNMS) calls a "scheme", to differentiate it from the package defined in the collectd-configuration file. These schemes bring together OIDs for collection, into groups and the groups are mapped to systems. The systems are mapped to interfaces by a device's systemOID. In addition, each "scheme" controls how the data will be collected and stored.
Fundamentally, OpenNMS uses RRDTool (Round Robin Database Tool) to store performance data. This paper is not a tutorial on RRDTool so please follow the reference to RRD at the end of this paper for more information.
The basis of RRD is that a fixed amount of space is allocated for a given database when it is created. It holds data for a given period of time, say 1 month, 1 year, etc. The sampling interval is known so you know how many datapoints will go into the database and hence how much space is required. Once the database is full, newer datapoints will replace the oldest ones, cycling around.
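The round-robin idea in the last paragraph (a fixed number of slots, with the newest datapoint overwriting the oldest once the archive is full) can be shown with a bounded queue. This is a toy model for illustration, not how RRDTool stores data on disk; the class name `RoundRobinArchive` is an assumption made for the example.

```python
from collections import deque


class RoundRobinArchive:
    """Toy fixed-size archive: once full, each new value evicts the oldest."""

    def __init__(self, rows):
        # deque(maxlen=...) discards items from the opposite end when full.
        self._rows = deque(maxlen=rows)

    def update(self, value):
        self._rows.append(value)

    def values(self):
        return list(self._rows)


archive = RoundRobinArchive(rows=3)
for sample in [10, 20, 30, 40, 50]:
    archive.update(sample)
print(archive.values())  # only the three newest samples remain
```

The storage used never grows beyond the number of rows chosen at creation time, which is the property that makes RRD sizing predictable.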
Figure 56: OpenNMS datacollection-config.xml collection and RRD parameters
The <rrd> stanza specifies how data will be stored in a Round Robin Archive (RRA). The snapshot shown in the figure above specifies:
<rrd step="300">
    data to be saved every 5 minutes, per step
RRA:AVERAGE:0.5:1:2016
    create an RRA with values AVERAGE'd over 1 step (i.e. this data is raw, not consolidated). The RRA will have 2016 rows representing 7 days of data (5 minute steps = 12/hour * 24 hours * 7 days = 2016). Consolidate the samples provided 0.5 (half) of them are not UNKNOWN (otherwise the consolidated value will be UNKNOWN)
RRA:AVERAGE:0.5:12:1488
    create an RRA with values AVERAGE'd over 12 steps (i.e. this data is consolidated over 1 hour). The RRA will have 1488 rows representing 2 months of data (1 hour consolidations * 24 hours * 62 days = 1488). Consolidate the samples provided 0.5 (half) of them are not UNKNOWN (otherwise the consolidated value will be UNKNOWN)
RRA:AVERAGE:0.5:288:366
    create an RRA with values AVERAGE'd over 288 steps (i.e. this data is consolidated over 288 * 5 min steps = 1 day). The RRA will have 366 rows representing 1 year of data (1 day consolidations * 366 days = 366). Consolidate the samples provided 0.5 (half) of them are not UNKNOWN (otherwise the consolidated value will be UNKNOWN)
RRA:MAX:0.5:288:366
    create an RRA with MAX values consolidated daily and keep 1 year of data
RRA:MIN:0.5:288:366
    create an RRA with MIN values consolidated daily and keep 1 year of data
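The retention arithmetic for each RRA line can be checked mechanically: the time covered by one row is the step multiplied by the number of steps consolidated, and the total retention is that multiplied by the number of rows. The helper names below are hypothetical; only the RRA strings and the 300 second step come from the configuration discussed above.

```python
SECONDS_PER_DAY = 86400


def parse_rra(spec):
    """Split 'RRA:CF:xff:steps:rows' into (cf, xff, steps, rows)."""
    tag, cf, xff, steps, rows = spec.split(":")
    if tag != "RRA":
        raise ValueError(f"not an RRA definition: {spec}")
    return cf, float(xff), int(steps), int(rows)


def rra_coverage(step_seconds, spec):
    """Return (seconds covered by one row, days of data the RRA retains)."""
    _cf, _xff, steps, rows = parse_rra(spec)
    seconds_per_row = step_seconds * steps
    return seconds_per_row, seconds_per_row * rows / SECONDS_PER_DAY


for spec in ("RRA:AVERAGE:0.5:1:2016",
             "RRA:AVERAGE:0.5:12:1488",
             "RRA:AVERAGE:0.5:288:366"):
    per_row, days = rra_coverage(300, spec)
    print(f"{spec}: one row per {per_row} s, {days:g} days retained")
```

With a 300 second step this reproduces the 7 day, 62 day and 366 day figures quoted for the three AVERAGE archives.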
Figure 57: OpenNMS Asset Info link for a device
The resulting page includes the Node ID at the top.
Figure 58: OpenNMS Asset information page, including Node ID
Figure 59: OpenNMS GUI Admin page for specifying interfaces to collect data from
Most of the contents of datacollection-config.xml is defining groups and systems:
groups
systems
Figure 60: OpenNMS group definitions in datacollection-config.xml
Unfortunately OpenNMS does not have a MIB compiler so all MIB OIDs need to be manually specified in this file (the good news is that there are lots there out-of-the-box). Once groups of MIB variables are declared, system stanzas say which group(s) are to be collected for any device whose systemOID matches a particular pattern.
Each SNMP MIB variable consists of an OID plus an instance. Usually, that instance is either zero (0) or an index to a table. At the moment, OpenNMS only understands a small number of table indices (for example, the ifIndex index to the ifTable and the hrStorageIndex to the hrStorageTable). All other instances have to be explicitly configured.
The ifType parameter can be used to specify the sort of interfaces to collect from. Legal values are:
all       collect from all interface types
ignore    used when the value would be the same for all interfaces, e.g. CPU utilisation for a Cisco router
OpenNMS understands four types of variables to collect on: gauge, timeticks, integer, octetstring. Note that RRD only understands numeric data.
Figure 61: OpenNMS systems definitions in datacollection-config.xml
Figure 62: OpenNMS Report categories available out-of-the-box
KSC Performance, Nodes, Domains
Following the Resource Graphs link provides access to many standard reports.
Figure 63: OpenNMS Standard performance reports
The standard performance reports display various collected values for one particular node which you choose from the menu provided. The different categories provide:
Figure 64: OpenNMS Standard Resource graphs available for a selected node
Here is part of the node-level performance data set of graphs.
Figure 65: OpenNMS partial display of the node-level performance data graphs
Figure 66: OpenNMS KSC Reports menu
Selecting a node and clicking "View child resources" results in a menu of report categories.
Figure 67: OpenNMS Report categories available for customised reports
If you select the Node-level Performance Data option and the "Choose child resource" button then each of the MIB variables collected can be displayed and selected.
Figure 68: OpenNMS Selecting prefabricated reports to include in a customised report
7.4.3 Thresholding
The thresholding capability in OpenNMS has changed fairly significantly over time; see http://www.opennms.org/index.php/Thresholding#Merge_into_collectd. for a good explanation.
Pre OpenNMS 1.3.10, collectd collected data and threshd performed thresholding, as two separate processes. This design used a "range" parameter in threshd-configuration.xml to get around problems caused by the asynchronous nature of collectd and threshd.
OpenNMS 1.3.10 merged the thresholding functionality into collectd and introduced a new parameter into collectd-configuration.xml:
<parameter key="thresholding-group" value="default-snmp"/>
<parameter key="thresholding-enabled" value="true"/>
Here is the default collectd-configuration.xml:
Figure 69: OpenNMS Default collectd-configuration.xml
The lack of any thresholding parameter implies that thresholding is disabled.
... and the default threshd-configuration.xml:
Figure 70: OpenNMS Default threshd-configuration.xml
Figure 71: OpenNMS Modified collectd-configuration.xml to enable thresholds
threshd-configuration.xml can be modified with different packages of thresholding to apply to different ranges of nodes.
Figure 72: OpenNMS Modified threshd-configuration.xml
Figure 73: OpenNMS Modified thresholds.xml for CCsnmp group and raddlesnmp group
The attributes of a threshold are:
type: A "high" threshold triggers when the value of the data source exceeds the "value", and is rearmed when it drops below the "rearm" value. Conversely, a "low" threshold triggers when the value of the data source drops below the "value", and is rearmed when it exceeds the "rearm" value. "relativeChange" is for thresholds that trigger when the change in data source value from one collection to the next is greater than "value" percent.
expression: A mathematical expression involving data source names which will be evaluated and compared to the threshold values. This is used in "expression" thresholding (supported from 1.3.3).
ds-name: The name of the variable to be monitored. This matches the name in the alias parameter of the MIB statement in datacollection-config.xml.
ds-type: Data source type. "node" for node-level data items, and "if" for interface-level items.
ds-label: Data source label. The name of the collected "string" type data item to use as a label when reporting this threshold. Note: this is a data item whose value is used as the label, not the label itself.
value: The value that must be exceeded (either above or below, depending on whether this is a high or low threshold) in order to trigger. In the case of relativeChange thresholds, this is the percent that things need to change in order to trigger (e.g. 'value="1.5"' means a 50% increase).
rearm: The value at which the threshold will reset itself. Not used for relativeChange thresholds.
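The interplay of "value" and "rearm" for high and low thresholds is a simple hysteresis: once triggered, a threshold stays quiet until the data source has crossed back past the rearm level. The sketch below models only that behaviour as described in the attribute list; the class name `Threshold` and its return strings are assumptions made for the example, and relativeChange and expression thresholds are deliberately left out.

```python
class Threshold:
    """Toy high/low threshold with a rearm level (hysteresis)."""

    def __init__(self, kind, value, rearm):
        if kind not in ("high", "low"):
            raise ValueError("kind must be 'high' or 'low'")
        self.kind, self.value, self.rearm = kind, value, rearm
        self.armed = True

    def evaluate(self, sample):
        """Return 'triggered', 'rearmed' or None for one collected sample."""
        if self.kind == "high":
            breached, recovered = sample > self.value, sample < self.rearm
        else:
            breached, recovered = sample < self.value, sample > self.rearm
        if self.armed and breached:
            self.armed = False
            return "triggered"
        if not self.armed and recovered:
            self.armed = True
            return "rearmed"
        return None


cpu = Threshold("high", value=90, rearm=75)
print([cpu.evaluate(s) for s in [80, 95, 97, 80, 70, 95]])
```

Note that the samples of 97 and 80 produce nothing: the threshold has already fired and has not yet dropped below the rearm level of 75, so no repeat event is raised.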
Figure 74: OpenNMS Threshold events from various devices in the raddle network
For those who prefer not to edit XML configuration files, the OpenNMS Admin menu provides a GUI way to create and modify thresholds.
Figure 75: OpenNMS Admin menu
Selecting the Manage Thresholds option displays all thresholds currently configured in thresholds.xml.
Figure 76: OpenNMS Configuring thresholds through the Admin menu
Using the Edit button permits modification of an existing threshold.
Figure 77: OpenNMS Modifying thresholds through the Admin GUI
to Figure 75: OpenNMS Admin menu for a list of the areas which can be configured this way.
8 Zenoss
Zenoss is a third Open Source, multi-function systems and network management tool. Unlike Nagios and OpenNMS, there is a free, core offering (which does seem to have most things you need), and Zenoss Enterprise that has extra add-on goodies, high availability configurations, distributed management server configurations and various support contract offerings which include some education. For a comparison of the free and fee alternatives, try http://www.zenoss.com/product/#subscriptions.
Zenoss offers configuration discovery, including layer 3 topology maps, availability monitoring, problem management and performance management. It is based around the ITIL concept of a Configuration Management Database (CMDB), the "Zenoss Standard Model". Zope Enterprise Objects (ZEO) is the back-end object database that stores the configuration model, and Zope is the web application development environment used to display the console. The relational MySQL database is used to hold current and historical events.
Zenoss 2.2 has recently been released which provides "stack" builds: complete bundles including Zenoss and all its prerequisites. These stack installers are available for a wide variety of Linux platforms; standard RPM and source formats are also available. For easy evaluation, a VMware appliance can be downloaded, ready to go.
I tried both the VMware build and the 2.2 stack install for SuSE 10.3; both were relatively painless. The rest of this section is based on the 2.2 stack installation on a machine whose hostname is zenoss.
To access the Web console, point your browser at http://zenoss:8080. The default user is admin with a password of zenoss. The default dashboard is completely configurable but this screenshot is close to the default.
Figure 78: Zenoss default dashboard
Figure 79: Zenoss device classes
Figure 80: Zenoss Server Device classes
Figure 81: Zenoss Linux Server devices
Figure 82: Zenoss zProperties for the Device class (part 1)
Figure 83: Zenoss zProperties for the Device class (part 2)
Figure 84: Zenoss zProperties for the Device class (part 3)
The left-hand menus of the web console provide an Add Device option (nothing is discovered automatically, out-of-the-box).
Figure 85: Zenoss Add Devices dialogue
Figure 86: Zenoss Networks class with drop-down menu
Once the presence of a network has been discovered, devices can automatically be discovered on that network; this uses a "spray ping" mechanism. There is a drop-down menu from the top left corner of the Networks page (which works fine for simple Class C networks). Although the GUI does manage to display subnetworks accurately, even if the subnet mask is not on a byte boundary, the Discover Devices menu does not honour the subnet mask. However, a good feature of Zenoss is that there is a command line (CLI) for virtually everything and the CLI for device discovery on a network does honour supplied netmasks. For example:
zendisc run --net 10.0.0.0/24
Note that the Zenoss discovery algorithm is very dependent on getting routing tables using SNMP and the Zenoss server must support SNMP itself.
For devices that do not support ping but do support SNMP, they can be added manually with the Add Device menu. The zProperties of the device (or class of
Figure 87: Zenoss Network Map showing 4 hops from group100r1
ping tests
service tests
process tests and Windows Services tests
Figure 88: Zenoss Collectors (Monitors) overview
The devices being monitored are shown at the bottom of the screen. To change any of these parameters, use the Edit tab. The defaults for availability monitoring are:
8.2.2 Availability monitoring of services - TCP / UDP ports and windows services
Service monitoring for TCP / UDP ports and Windows services is configured through the Services menu.
Figure 89: Zenoss Services menu
Figure 90: Zenoss Windows services
Even more IP services come configured out-of-the-box. There are two subclasses of IP services, Privileged and Registered; either can monitor either TCP or UDP ports.
Figure 91: Zenoss Privileged IP services
Again, note the Count column. Clicking on the service name shows where the service has been detected:
Figure 92: Zenoss devices running the domain (DNS) service on TCP 53 or UDP 53
The fact that a service has been detected does not imply that it is being monitored for availability (the default, out-of-the-box, is that nothing is monitored). The Monitor column for devices shows whether active monitoring is taking place (and hence events potentially being generated). The Monitor field in the top part of the window shows the global default for this service.
To turn on service monitoring globally for a particular service, use the Services menu to find the service in question. You can then use either the zProperties tab or the Edit tab to change the Monitor global default to True (the default, as shipped, is False).
To turn on service monitoring for a specific device, access the main page for a device and open the OS tab. Under the IP Services section, click on the Name column header to see services detected. Click on the service name which brings up the service status window for the device where the Monitor field can be changed; don't forget to click the Save button. Note that the Monitored box in the IP Services heading bar can be used to toggle the display between detected services and monitored services.
Note that the drop-down menu to Add IpService is driven by typing in a partial match of the service name you want; the subsequent drop-down then shows configured services that match your selection.
Figure 93: Zenoss Processes with drop-down menu
Supply a process name and it will be added to the list. To modify the definition of the process, click on the process name and select the Edit tab.
Figure 94: Zenoss dialogue for modifying process definition
To modify the zProperties of a process, use the zProperties tab.
Figure 95: Zenoss zProperties for the firefox process
Figure 96: Zenoss Add OS Process monitoring to a specific device
Note that there are currently (July 4th, 2008) a couple of bugs to do with process monitoring whereby processes disappear from the OS tab of a device and / or show the wrong status (tickets #3408, #3399, #3270). To mitigate against these, the zenprocess daemon should be stopped and restarted whenever modifications have been made to do with processes. You can use the GUI by choosing Settings and selecting the Daemons tab.
Temporarily, it would also be wise to use the menu for the process and select to Lock the process from Deletion.
More sophisticated availability monitoring can be implemented using standard zCollectorPlugins; note that these are modelling plugins as distinct from performance plugins. zCollector plugins are applied to device classes or devices through the zProperties tab; use the Edit link alongside zCollectorPlugins to show or modify the plugins applied and available.
Figure 97: Zenoss zCollectorPlugins
Figure 98: Zenoss default plugins for class /Devices/Server/Windows
Figure 99: Zenoss zCollectorPlugins for device group100r1.class.example.org
Figure 100: Zenoss Device More submenu
Figure 101: Zenoss Device Manage submenu
Figure 102: Zenoss Commands provided out-of-the-box
From a device's main page, there is a submenu to Run Commands.
Figure 103: Zenoss Run Commands for a particular device
Figure 104: Zenoss Event Manager configuration
Figure 105: Zenoss Event Console
Figure 106: Zenoss Event classes and subclasses
To modify the context of any event, select the event and use the zProperties tab.
Figure 107: Zenoss zProperties for the event class /Event/Status/OSProcess
To change the event mapping, select the event class and use the Mappings tab.
Figure 108: Zenoss Event mapping
The Edit tab allows editing of any of these fields.
Figure 109: Zenoss menu to create a new event class
Figure 110: Zenoss menu to create Alerting Rule
Figure 111: Zenoss Editing alerting rule
The email or pager message of the Alerting Rule is configured by the Message tab and the Schedule tab can be used to create different alerting rules at different times.
Figure 112: Zenoss Alerting rule message format
Global parameters for email and paging, along with other useful parameters, can be defined from the Settings left-hand menu.
Figure 113: Zenoss Settings parameters
The out-of-the-box email notifications provide handy links back to Zenoss to manipulate the event that is being reported on.
Figure 114: Zenoss email generated by event notification, including links
Figure 115: Zenoss Event Command definition
Figure 116: Zenoss All Templates showing all defined performance templates
Figure 117: Zenoss zProperties showing zDeviceTemplate
The default, out-of-the-box, is that the device template called Device is bound to each device discovered. As noted in the previous screenshot, there are several templates called Device. The Device template for the class /Devices simply collects sysUpTime. The template called Device for /Devices/Server collects a number of parameters supported by the net-snmp MIB. The template called Device for /Devices/Server/Windows collects various MIB values from the Informant MIB.
For each template name Zenoss searches first the device itself and then up the Device Class hierarchy looking for a template with that name. Zenoss uses the first template that it finds with the correct name, ignoring others with the same name that might exist further up the hierarchy.
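The template search order just described (the device first, then each parent class in turn, stopping at the first match) is essentially a walk up a path. The following sketch illustrates that lookup with plain dictionaries; the function `find_template`, the data layout and the template contents are assumptions made for the example and are not the Zenoss API.

```python
def find_template(name, device_templates, device_class, class_templates):
    """Return (where found, template) for the nearest template called `name`.

    `device_templates` holds templates defined on the device itself;
    `class_templates` maps a device class path to its own templates.
    Returns None when no template of that name exists anywhere.
    """
    if name in device_templates:
        return "device", device_templates[name]

    path = device_class
    while path:
        templates = class_templates.get(path, {})
        if name in templates:
            return path, templates[name]
        # Drop the last path component, e.g. /Devices/Server/Linux -> /Devices/Server
        path = path.rsplit("/", 1)[0]
    return None


class_templates = {
    "/Devices": {"Device": ["sysUpTime"]},
    "/Devices/Server": {"Device": ["net-snmp values"]},
    "/Devices/Server/Windows": {"Device": ["Informant MIB values"]},
}

print(find_template("Device", {}, "/Devices/Server/Linux", class_templates))
print(find_template("Device", {}, "/Devices/Network", class_templates))
```

A Linux server has no Device template of its own class, so the /Devices/Server one is used and the /Devices one above it is ignored; a network device falls through to the top-level /Devices template.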
Figure 118: Zenoss Bind Templates menu
Be aware that when selecting templates to bind, you need to select all the templates you want bound (use the Ctrl key to select multiples).
So, what do these templates actually provide?
Templates contain three types of sub-objects:
Data sources: what data to collect and method to use, e.g. MIB OID
Thresholds: expected bounds for data and events to raise if breached
Graph definitions: how to graph the data points
Figure 119: Zenoss Device template for /Devices/Server
Figure 120: Zenoss Data Source memAvailReal
Note that there is a useful Test button to check your OID against a node that Zenoss knows about. However, beware that this Test button appears to use snmpwalk under the covers so if a MIB OID has multiple instances then the snmpwalk will return values successfully. When zenperfsnmp actually collects data, it requires the correct instance as well as the correct MIB OID. If your test is successful but you subsequently see empty graphs with a message of "Missing RRD file" then the problem is likely to be that the MIB instance is incorrect.
Data sources can be added or deleted with the drop-down Add DataSource and Delete DataSource menus.
Thresholds can be applied to any of the data points collected, along with events to generate if the threshold is breached.
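The difference between the Test button succeeding and the collector finding nothing comes down to walk versus get semantics: a walk returns everything beneath an OID, whereas a get needs the full OID including the instance. The sketch below simulates both against an in-memory table; it does not talk SNMP, and the agent contents and function names are made up for the illustration (the OID used is ifInOctets from the standard ifTable).

```python
# Simulated agent: ifInOctets (.1.3.6.1.2.1.2.2.1.10) for interfaces 1 and 2.
AGENT = {
    ".1.3.6.1.2.1.2.2.1.10.1": 1200,
    ".1.3.6.1.2.1.2.2.1.10.2": 88000,
}


def snmp_walk(agent, oid):
    """Return every variable that lives beneath `oid`."""
    prefix = oid + "."
    return {k: v for k, v in agent.items() if k.startswith(prefix)}


def snmp_get(agent, oid):
    """Return the value for an exact OID plus instance, or None if absent."""
    return agent.get(oid)


column = ".1.3.6.1.2.1.2.2.1.10"
print(snmp_walk(AGENT, column))        # a walk of the column finds both instances
print(snmp_get(AGENT, column))         # a get without the instance finds nothing
print(snmp_get(AGENT, column + ".2"))  # a get with the instance succeeds
```

A data source configured with only the column OID would therefore pass a walk-based test yet collect no data, which matches the "Missing RRD file" symptom described above.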
Figure 121: Zenoss Threshold on CPU collected data
Figure 122: Zenoss Drop-down menu for data thresholds
Figure 123: Zenoss Performance template graph definition
Figure 124: Zenoss Performance graphs for eth1 interface on bino
Figure 125: Zenoss Performance graphs available under the Perf tab for bino
Note that the Reports left-hand menu also provides access to various reports, including performance reports.
Figure 126: Zenoss Reports menu
Following the Performance Reports link provides access to all performance reports for all devices.
Figure 127: Zenoss Performance Reports menu
Zenoss will use SNMP to gain status and performance information from a device but it also has ssh and telnet as alternatives, for those devices where SNMP is inappropriate.
The Quick Start Guide gets you running fast and the Admin Guide provides what it says: a reasonably comprehensive Administrator's Guide. There is also a book by Michael Badger, published June 2008, "Zenoss Core Network and System Monitoring", which is well worth the investment (available both in paper and in electronic format). However, one feels that there is so much more in the detail of Zenoss that one needs to know and can find no information on!
My only real negative comment on Zenoss, other than the lack of detailed technical information, is that it is a rapidly evolving product and it feels rather buggy. The current (August 2008) poll on the zenoss-users forum for input to Zenoss 2.3 has many requesters with code reliability and better documentation at the top of their lists!
There are advocates for and against "agentless" monitoring. Personally, I don't believe in agentless. Once you have got past ping then you have to have some form of agent to do monitoring. The question is, should a management paradigm use an agent that is typically part of a box build (like ssh, SNMP or WMI for Windows), or should the management solution provide its own agent, like Nagios provides NRPE (and most of the commercial management products come with their own agents). If your management system wants its own agents, you then have the huge problem of how you deploy them, check they are running, upgrade them, etc, etc. OpenNMS and Zenoss have a strong dependency on SNMP although Zenoss also supports ssh and telnet monitoring, out-of-the-box (if your environment permits these). SNMP may be old and "Simple", but all three products support SNMP V3 (for those who are worried about the security of SNMP) and virtually everything has an SNMP agent available.
The other form of agentless monitoring basically comes down to port sniffing for services. Whilst this can work fine for smaller installations, the n-squared nature of lots of devices and lots of services doesn't scale too well. All three products do port sniffing so it comes down to how easy it is to configure economic monitoring.
9.1.1 Discovery
 | Nagios | OpenNMS | Zenoss
Node discovery | Config file for each node | Config file with include / exclude ranges | GUI, CLI and batch import from text or XML file
Automatic discovery | No | Yes - nodes within configured n/w ranges | Yes - networks & nodes
Interface discovery | Possible through config file | Yes - including switch ports | Yes - including switch ports
Discover nodes that don't support ping | Yes - use check_ifstatus plugin | Yes - send_event.pl | Yes - use SNMP, ssh or telnet
SQL Database | No | | mySQL & Zope ZEO
Service (port) discovery | | | Yes - TCP and UDP
Application discovery | Yes - use plugin (TCP, UDP, ....) | | Yes - with ssh, zenPacks or plugins
Nagios Supports NRPE/ NSClient L3topology map L2topology map Yes Yes
OpenNMS Possible
Zenoss
V1,2&3 No No
Agent technology | Generally relies on Nagios plugins deployed | SNMP out-of-the-box; customised plugins possible | SNMP, ssh client, WMI for Windows, ZenPacks to be deployed
Availability reports | Yes | Yes | Yes
auto-actions on events
good news / bad news correlation on alarms and notifications | good news / bad news correlation on events and notifications
Deduplication | No automatic repeat count mechanism but events do not continue to be raised for existing problems | Yes
Service / host dependencies | Yes
Root cause analysis | UNREACHABLE status for devices behind network single point of failure. Also, host / service dependencies. | Outages / Path outages | Yes
No No
NSClient, JMX, HTTP
OpenNMS Yes
Zenoss
Yes - lots provided OOTB | No | No
No pdf documentation. Wiki - hard to find detailed information.

Architecture good, based around object-oriented CMDB database
Topology map (up to 4 hops)
Lots of plugins & zenPacks available
email notifications include URL links back to Zenoss
Commercial version available
Good Quick Start manual, Administrators manual and book
Supports Nagios & Cacti plugins

Implementation feels buggy
No MIB browser
No way to change colours of events
Commercial version available
Lots of things undocumented when you get down to details
9.3 Conclusions
Whattochoose?Backtoyourrequirements! Forsmallish,systemsmanagementenvironments,Nagiosiswelltestedandreliable withahugecommunitybehindit.Foranythingmorethansimplepingchecksplus SNMPchecks,bearinmindthatyoumayneedawaytoinstallremotepluginson targethosts.Notificationsarefairlyeasytosetupbutifyouneedtoproduceanalysis onyoureventlogthenNagiosmaynotbethebestchoice. OpenNMSandZenossarebothextremelycompetentproductscoveringautomatic discovery,availabilitymonitoring,problemmanagementandperformance managementandreporting.Zenosshassometopologymappingandhasbetter documentationbutthecodefeelslessreliable.OpenNMScurrentlyhasarather messyarchitecturearoundevents,alarmsandnotifications,thoughthisissaidtobe underreview.Ialsostruggletobelievethatyouhavetorecyclethewholeof OpenNMSifyouhavechangedaconfigurationfile!Thecodefeelsverystablethough. Mychoice,hopingferventlythatcodereliabilityanddocumentationimproves,is Zenoss.
10 References
1. itSMF Pocket Guide: IT Service Management - a Companion to ITIL, IT Service Management Forum
2. Multi Router Traffic Grapher (MRTG) by Tobi Oetiker, http://oss.oetiker.ch/mrtg/
3. RRDtool, high performance data logging and graphing system for time series data, http://oss.oetiker.ch/rrdtool/
4. netdisco, network management application, http://www.netdisco.org/
5. The Dude network monitor by MikroTik, http://www.mikrotik.com/thedude.php
6. nagios, host, service and network monitoring program, http://www.nagios.org/
7. Zenoss, network, systems and application monitoring, http://www.zenoss.com/
8. OpenNMS, distributed network and systems management platform, http://www.opennms.org/
9. cacti, network graphing solution, http://www.cacti.net/
10. SNMP Requests For Comment (RFCs), http://www.ietf.org/rfc.html
11. V1 RFCs 1155, 1157, 1212, 1213, 1215
12. V2 RFCs 2578, 2579, 2580, 3416, 3417, 3418
13. V3 RFCs 2578-2580, 3416-18, 3411, 3412, 3413, 3414, 3415
MySQL (5.0.4522)
No PHP5 configuration was done, as the files documented in the installation guide did not exist.
Configuration of Apache2 required no modifications in /etc/apache2/conf.d/php5.conf.
Cacti was installed using the standard SuSE YaST mechanism.
Create the MySQL database by:
    cd /usr/share/cacti
    mysql --user=root -p (and supply the root password when prompted)
    create database cacti;
    source cacti.sql;
    GRANT ALL ON cacti.* TO cactiuser@localhost IDENTIFIED BY
You need to manually create the Operating System user cactiuser with password cacti.
When pointing your web browser at http://<yourserver>/cacti/ ensure that you include the trailing slash. Use a web logon of admin, password admin.
Ensure that apache2 and mysql are either manually started (/etc/init.d/<name> start) or start them automatically at system start using chkconfig.
Ensure that the cactiuser userid can execute the /usr/share/cacti/poller.php script that is run by /etc/crontab. Also ensure that the directory that the RRD data is written to (/var/lib/cacti) is writeable by this user.
cacti.log is in /var/log/cacti.
I found (through /var/log/messages) that poller.php was being run twice, once in /etc/crontab as cactiuser and once in /etc/cron.d/cacti as user wwwrun. Comment out the line in /etc/cron.d/cacti and check again that cactiuser can write to the data files in /var/lib/cacti.
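The duplicate poller problem described above can be spotted without trawling /var/log/messages. The following is a minimal sketch in Python, not part of Cacti: the function name `active_poller_jobs` is an assumption, and the cron lines in `sample` are illustrative schedules rather than the entries shipped by the SuSE package. It lists every uncommented cron line that mentions poller.php, so more than one hit suggests the script is being scheduled twice.

```python
from typing import Dict, List


def active_poller_jobs(cron_files: Dict[str, str], script: str = "poller.php") -> List[str]:
    """Return 'path: line' for every uncommented cron line that mentions the script."""
    hits: List[str] = []
    for path, text in cron_files.items():
        for line in text.splitlines():
            entry = line.strip()
            # Skip blank lines and lines that have been commented out.
            if not entry or entry.startswith("#"):
                continue
            if script in entry:
                hits.append(f"{path}: {entry}")
    return hits


# Illustrative cron contents (hypothetical schedules) mirroring the situation above.
sample = {
    "/etc/crontab": "*/5 * * * * cactiuser php /usr/share/cacti/poller.php > /dev/null 2>&1\n",
    "/etc/cron.d/cacti": "*/5 * * * * wwwrun php /usr/share/cacti/poller.php > /dev/null 2>&1\n",
}

jobs = active_poller_jobs(sample)
if len(jobs) > 1:
    print("poller.php is scheduled more than once:")
    for job in jobs:
        print("  " + job)
```

On a real system the dictionary would be filled by reading the two cron files, for example with `pathlib.Path(path).read_text()`. Once the line in /etc/cron.d/cacti is commented out, only the /etc/crontab entry should remain in the result.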
The initial console page is a good starting point to add devices to monitor and associated graphs.