Nagios, configuration

Revision 01 01 02 02 Date 30th of August 05 By T. Sluyter Changes Initial creation Reviewed Reviewed

Summary
This document provides detailed information on all the intricacies of configuring Nagios. It tells you about setting up the basic server, as well as configuring the server for the monitoring of various clients. For an explanation of the basic functioning of Nagios, please refer to “NAGIOS basic guide”, which is available on Sharepoint.

Table of Contents
Installing the server software
Restoring the server software
The Nagios user
Configuration files
Building hosts and hosts groups
Setting up contacts and notification
Setting up the clients
Configuring for each UNIX client
Writing your own NRPE scripts
Configuring for each Windows client

Installing the server software
You should never have to reinstall the Nagios server, unless the following scenario ever takes place:
1. The current server dies in such a horrible way that neither its boot disk nor the mirror is usable.
2. The backups of the current server have become completely unreadable.
3. The off-site backup copies have all been destroyed by a hurricane, a tidal wave or the Apocalypse (in which case you will probably not need the Nagios server anyway).

However, should you ever be interested in how the original server was built, I would like to nudge you towards the documentation provided along with Nagios. It provides excellent descriptions of how to compile all of the software and how to get things up and running. Unfortunately there were a few caveats which I experienced while following their procedures, so I will include them here:

• Compiling the Nagios base software requires four additional libraries to be installed on your server:
    GD (available from boutell.com/gd as source code)
    libpng (available from Sunfreeware.com as a package)
    libjpg (available from Sunfreeware.com as a package)
    cgiwrap-3.9 (available from Sunfreeware.com as a package)
• Compiling GD may run into problems with ar, the library archiver. Edit the file called libtool and change the line which sets $AR to /usr/ccs/bin/ar.
• When compiling Nagios the configure script may complain that it still cannot find the libraries for GD, jpeg and png. To solve this you need to log out and back in.
• Compiling the Nagios plugins also complains about ar. In this case the problem is easily solved by adding /usr/ccs/bin to your $PATH variable.
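The $PATH change from the last caveat can be sketched as follows. This is only an illustration: path_add is a made-up helper name, and all it does is avoid adding the same directory twice when the line ends up in a login profile.

```shell
#!/bin/sh
# Sketch: append a directory to a PATH-style string only if it is missing.
# path_add is a hypothetical helper name, not part of Nagios or Solaris.
path_add() {
    current="$1"
    dir="$2"
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;        # already present
        *)          printf '%s\n' "$current:$dir" ;;   # append at the end
    esac
}

# Make ar visible before compiling the plugins.
PATH=$(path_add "$PATH" /usr/ccs/bin)
export PATH
```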

Restoring the server software
Luckily restoring the server software is a lot easier. Just make sure that your new server has GD, libpng, libjpg and cgiwrap installed, along with Apache (naturally). Then restore /usr/local/nagios from tape. This contains all the files required for Nagios. Once that is done, recreate the nagios user account (see next chapter) and you should be set!

The Nagios user
The Nagios software requires a separate, non-root user account for its operation. This user account will be used to perform all of the checks and to communicate across the network. The nagios account is automatically created during a Jumpstart installation of a new server. The account's password should also be set automatically, but if it does not work, set the password to the value stored in Password Safe, using the passwd command.


If the user account does not exist, you may add it manually to the system.

Add the following line to /etc/passwd:

    nagios:x:1550:1550:Nagios monitoring:/usr/local/nagios:/bin/bash

Add the following lines to /etc/group:

    nagios::1550:
    nagiocmd::1551:nagios,nobody

Add the following line to /etc/shadow:

    nagios:WeF3zX7JlpZVk:12991:7:56:7:::
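Before adding anything by hand it is worth checking which of these entries are really missing. The sketch below is one way to do that; has_entry is a hypothetical helper name, and it assumes the files use the usual colon-separated layout with the account or group name in the first field.

```shell
#!/bin/sh
# Sketch: report whether a colon-separated account file (passwd, group or
# shadow format) has an entry whose first field is exactly the given name.
# has_entry is a hypothetical helper name.
has_entry() {
    file="$1"
    name="$2"
    awk -F: -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$file"
}

# Usage on the server, run before adding any lines by hand:
#   has_entry /etc/passwd nagios   || echo "nagios user missing"
#   has_entry /etc/group  nagios   || echo "nagios group missing"
#   has_entry /etc/group  nagiocmd || echo "nagiocmd group missing"
```

Matching on the whole first field keeps nagios from being mistaken for nagiocmd, which a plain grep would not guarantee.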

Configuration files
First off, let me say that there are three things you should really mind when modifying the Nagios configuration files.
1. Always work as the user nagios, or else you'll be screwing up file permissions.
2. Always make a backup copy of the file you're working in by copying it to a file with the same name, but with the date appended. For example: services.cfg.20050823.
3. After making changes and before restarting Nagios to activate them, make sure that you validate the new configuration files. Run $NAGIOS-BIN/nagios -v /usr/local/nagios/etc/nagios.cfg.

All configuration files are located in /usr/local/nagios/etc. A number of these files you will only edit during the initial setup of the Nagios server, after which you'll most probably not touch them again. The remaining files are used for monitoring.

File             Moni?  Purpose
cgi              N      Configures the dynamic web pages behind the Nagios web interface.
checkcommands    Y      Defines the Nagios commands which are used in the check_command field in services.cfg.
contactgroups    Y      Builds groups from the people defined in contacts.cfg, to allow for detailed notification.
contacts         Y      Defines people and administrators who can be contacted through e-mail or pagers. Required for notification.
dependencies     Y      Defines dependencies between services and hosts. If the top level service goes down, so do its dependent services.
escalations      Y      Not used.
hostextinfo      Y      Provides extended information for each host defined in hosts.cfg. Includes coordinates for the Status Map feature.
hostgroups       Y      Builds groups from the hosts defined in hosts.cfg. Allows for nicer division of the status screens.
hosts            Y      Defines the hosts you would like to monitor.
htpasswd.users   N      Defines the user accounts which have access to the web interface.
misccommands     N      Defines commands which Nagios can use under water. For instance used to define notification commands.
nagios           N      The main configuration file for the Nagios processes.
resource         N      Sets up user macros, like $USER1$.
services         Y      Defines all of the metrics you would like to monitor.
timeperiods      Y      Creates time windows which can be used in multiple locations. These are used to define when certain actions should or should not be undertaken.

The following graphic summarizes all the cross-dependencies between the monitoring-related configuration files. The arrows between the files point in the direction in which information 'flows'. For instance, information from hosts.cfg is reused in hostgroups.cfg.

In the following chapters I will cover the basic configuration of Nagios monitoring using these files. However, almost every file will allow you to make more modifications than the ones I'm describing. Please read through the Nagios on-line documentation for more details. Each configuration file is covered in excruciating detail over there.
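The backup and validation rules from the start of this chapter can be wrapped in two small shell functions. Treat this as a sketch: backup_cfg and validate_cfg are made-up names, and /usr/local/nagios/bin is only an assumed location for the nagios binary.

```shell
#!/bin/sh
# Sketch of the backup-then-validate routine. backup_cfg and validate_cfg are
# hypothetical helper names; NAGIOS_BIN is an assumed location for the binary.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin}"
NAGIOS_ETC="${NAGIOS_ETC:-/usr/local/nagios/etc}"

# Copy a file to <name>.<YYYYMMDD>, e.g. services.cfg.20050823.
backup_cfg() {
    cp -p "$1" "$1.$(date +%Y%m%d)"
}

# Run the Nagios pre-flight check on the main configuration file.
validate_cfg() {
    "$NAGIOS_BIN/nagios" -v "$NAGIOS_ETC/nagios.cfg"
}

# Typical session, run as the nagios user:
#   backup_cfg "$NAGIOS_ETC/services.cfg"
#   vi "$NAGIOS_ETC/services.cfg"
#   validate_cfg && echo "configuration OK, safe to restart Nagios"
```

Using cp -p keeps the ownership and timestamps of the original file on the backup copy, which fits the advice to work as the nagios user.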

Building hosts and hosts groups
In order to set up basic monitoring you will first need to define the clients you would like to keep an eye on. The minimal requirement for this is an entry in the hosts.cfg file. The definition for a basic host looks like this:

    # 'nl-ams99a-das02' host definition
    define host{
        use                     generic-host
        host_name               nl-ams99a-das02
        alias                   DAS server Lab A, IPTV project
        address                 nl-ams99a-das02-01
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
    }

The first line, starting with the hash, is a standard comment line used to indicate the start of a new host definition. These are not required, but make the file easier to read. Since we're defining objects in a similar way to a lot of object oriented languages, each host's definition must be contained within curly braces and be preceded by a directive of some sort (in this case define host).

The use keyword indicates that Nagios should include any parameters used in the generic-host definition with this one. For more information on this keyword, look up the section on “inheritance” of the on-line documentation.

The host_name keyword is used to define the name that will be used to indicate this specific host in the Nagios GUI. It does not have to be the system's real host name (as known in DNS) since it is not used for communicating with the host. The address keyword however is used for this purpose and thus should contain either the host's IP address or its real host name. Also, the alias can be used as a form of comment field to add extra information about the host (it does not need to resemble any form of hostname).

The check_command field defines which Nagios command should be used to ascertain whether a host is available on the network. This consists of nothing more than a simple ping, so it tells you only whether the host is reachable.

Finally, the last four lines of the definition control retries and notification. max_check_attempts sets how many times the check is retried before the host is considered down (ten in the example above). notification_interval sets how long Nagios waits before repeating a notification for a host that is still down (120 minutes here, assuming the default interval length of one minute) and notification_period names the time window during which notifications may be sent (the whole day). The notification_options field defines that notification should be sent if the host status changes to Down, Unreachable or Recovered. More options are available and details can be found in the online documentation.

Now that we've defined our hosts, we can continue by grouping them into logical divisions. In our case I've chosen to group the servers by the Lab they are in, for example:

    # 'Lab-Z' host group definition
    define hostgroup{
        hostgroup_name  Lab-Z
        alias           Lab Z
        contact_groups  dtv-admins
        members         nl-ams99z-jst01,nl-ams99z-fwm,nl-ams99z-a02,nl-ams99z-a01
    }

The hostgroup_name and alias fields function quite like the comparable fields in hosts.cfg. They respectively define the name to be used within the Nagios GUI and provide additional space for comments. The contact_groups field defines which people to notify in the case of trouble with any of the group's members.

Which brings us to the members field, which is a comma separated list of host_name entries from hosts.cfg. This line indicates which hosts should be considered part of the host group in question. In case you were wondering: it is entirely possible for a host to be part of multiple host groups.
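Since every name in a members list has to match a host_name in hosts.cfg, a typo there is an easy mistake to make. The following sketch lists members that have no matching host definition. The function names are made up for this example, and it assumes one directive per line with the whole members list on a single line; nagios -v remains the authoritative check.

```shell
#!/bin/sh
# Sketch: list hostgroup members that have no matching host_name definition.
# All three function names are hypothetical. Assumes one directive per line
# and a members list that fits on a single line.

# Print every host_name value found in a hosts.cfg-style file.
defined_hosts() {
    awk '$1 == "host_name" { print $2 }' "$1"
}

# Print every member of every hostgroup in a hostgroups.cfg-style file.
group_members() {
    awk '$1 == "members" {
        $1 = ""                      # drop the keyword, keep the list
        gsub(/[ \t]/, "")            # remove any blanks around the commas
        n = split($0, m, ",")
        for (i = 1; i <= n; i++) if (m[i] != "") print m[i]
    }' "$1"
}

# Members that are referenced but never defined, one per line.
undefined_members() {
    hosts_file="$1"
    groups_file="$2"
    group_members "$groups_file" | while read -r member; do
        defined_hosts "$hosts_file" | grep -Fqx "$member" || echo "$member"
    done
}

# Usage:
#   undefined_members /usr/local/nagios/etc/hosts.cfg /usr/local/nagios/etc/hostgroups.cfg
```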

Setting up contacts and notification
Just to start with a little extra information that will help you correlate things: the services.cfg file contains a contact_groups field for every service defined, quite like the example below. This field defines which (groups of) people should be alerted in case of a failure. This information, coupled with the notification_X fields, tells Nagios who to call, during which times of the day.

    contact_groups          dtv-admins
    notification_interval   240
    notification_period     24x7
    notification_options    c,r

In the example above you will see that the group of people in dtv-admins will be warned about services going into Critical or Recovered state, during the full 24 hour day and that they will be reminded about the failure every four hours. The contact_groups field relates to an entry in the contactgroups.cfg file. In this case the entry referred to is the following.

    define contactgroup{
        contactgroup_name   dtv-admins
        alias               DTV Lab Administrators
        members             tsluyter,rvvloten
    }

Predictably the contactgroup_name field defines the label which is called from other configuration files within Nagios and once again the alias field is a simple commentary field. The members list contains names defined in contacts.cfg. For example:

    define contact{
        contact_name                    tsluyter
        alias                           Thomas Sluyter
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    c,r,w,u
        host_notification_options       d,r,u
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           nvrossum@ugceurope.com
    }

The definition above describes myself and tells Nagios that I should be notified by e-mail (at nvrossum@ugceurope.com), during the full 24 hour day. Notification should be sent out in case a service becomes Critical, Warning, Recovered or Unknown, or when a host goes Down, Recovered or Unreachable.

You will notice that both the contact and the service configuration files contain lines pertaining to notification options. These allow you to set up flexible notification. For example, let's say that service S sends out notifications when it enters the following states: Warning, Critical and Recovered. Now, let's assume that you only want person X to receive notifications for Critical alerts. In that case the lines involved would become:

    S (services.cfg):  notification_options           c,r,w
    X (contacts.cfg):  service_notification_options   c
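Put differently, a contact only hears about a state when both the service and the contact list it. The sketch below works out that overlap for the S and X example; effective_options is a made-up helper name and the function is only meant to illustrate the filtering, not to mirror how Nagios implements it.

```shell
#!/bin/sh
# Sketch: which states reach a contact when both the service and the contact
# filter on notification options. effective_options is a hypothetical helper.
effective_options() {
    service_opts="$1"   # notification_options from services.cfg, e.g. "c,r,w"
    contact_opts="$2"   # service_notification_options from contacts.cfg
    result=""
    for opt in $(printf '%s' "$service_opts" | tr ',' ' '); do
        case ",$contact_opts," in
            *",$opt,"*) result="${result:+$result,}$opt" ;;
        esac
    done
    printf '%s\n' "$result"
}

effective_options "c,r,w" "c"        # service S and person X: prints c
effective_options "c,r" "c,r,w,u"    # dtv-admins service, tsluyter: prints c,r
```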

Setting up the clients
UNIX clients are monitored using the NRPE software, which receives requests from the Nagios server and performs them locally. This requires that both the Nagios check scripts and the NRPE package are installed. Detailed instructions for the installation of this software can be found in the “NAGIOS unix plugin” manual. Windows clients require the NSClient software, which basically is a Windows version of NRPE and the Nagios check scripts rolled into one. Unfortunately it is not nearly as versatile as the UNIX software. Detailed instructions on the installation of the software can be found in “NAGIOS windows plugin” manual.

Configuring for each UNIX client
Currently, each new UNIX client added to Nagios will get at least a dozen different monitors assigned to it. In order to get all of this set up we'll need to add one code block, per monitor, per host to the services.cfg file. It's not possible to discuss all of the monitors that we use at the DTV Labs, since there are just too many. Instead I will cover a number of them, with at least one from each category. For a list of all monitors currently set up in Nagios, please refer to the “NAGIOS monitored metrics” document.

Our services.cfg file is set up in such a way that each code block reuses a number of characteristics from a basic template (which is located at the top of the file). This template is called generic-service and is called using the use field. For example:

    define service{
        use                     generic-service
        host_name               nl-ams99z-fwm
        service_description     WEBMIN
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          dtv-admins
        notification_interval   240
        notification_period     24x7
        notification_options    c,r
        check_command           check_http!10000
    }

The meaning of some of the fields in each monitor definition should now be obvious to you, so I'll explain the “new” ones. The service_description field carries the label that will be used in the Nagios GUI (and which is also referenced in the “NAGIOS monitored metrics” document). The is_volatile field is rarely set to 1, since this is only used for very specialised monitors. max_check_attempts defines how many times Nagios checks the metric, in case the first attempt returns a state other than OK. The fields normal_check_interval and retry_check_interval define how often (in minutes) the normal check is run and how quickly the aforementioned retries follow each other should a NOK be detected. Finally there is the check_command field which defines how Nagios should verify the functionality of the service.

We use three different categories of monitors:
1. Scripts that run on the Nagios master server.
2. Scripts that run on the Nagios client.
3. Scripts that run on the Nagios client and which require parameters.

1. Scripts that run on the master
Monitors that run locally on the Nagios master server are the easiest to configure since you don't need to set up NRPE or figure out a lot of parameters. In most cases these monitors involve the checking of TCP/IP ports and such. For example:

    check_command           check_http!10000

In this case the Nagios process on the master server calls the check_http script which is located in /usr/local/nagios/libexec. The exclamation mark is used to separate the command and its parameters, so Nagios can properly process them. In this case the first parameter sent to check_http (compare it to $1 in shell scripting) is 10000, which refers to a port number. So Nagios checks whether the host in question has a webserver running on port 10000.

2. Scripts that run on the client
Most monitors that we write ourselves take this form. We'll build a shell script which performs a number of tests on application processes and that in the end returns a specific Nagios exit code. More information on writing these scripts can be found in later chapters. For example:

    check_command           check_nrpe!check_fwm

This line in services.cfg refers to a line in another configuration file, namely /usr/local/etc/nrpe.cfg on the remote client system. In this case the line in nrpe.cfg would look like this:

    command[check_fwm]=/usr/local/nagios/libexec/check_fwm

It tells the NRPE daemon about a command called check_fwm which may be called from the Nagios master. Whenever NRPE receives a request to run command check_fwm, it should run /usr/local/nagios/libexec/check_fwm and return the exit code back to Nagios.

3. Scripts that run on the client and need parameters
These scripts are usually part of the set that came with Nagios. They allow you to monitor a certain metric and configure the warning and critical levels. It must be noted that almost every Nagios script comes with some form of usage guide. Just run the script with the -h or --help parameter to read it. Let's say that the services.cfg file refers to the following commands which are configured in a client's nrpe.cfg file:

    command[check_load]=/usr/local/nagios/libexec/check_load -w 2.00,1.30,1.30 -c 2.50,1.80,1.80

In this case the check_load script is called to verify the load on the CPU. The -w flag is used to set the values for the Warning level (in this case a 1 minute average of 2.00, and 5 and 15 minute averages of 1.30). The -c flag is used in a similar sense to set the Critical levels (in this case 2.50 over a 1 minute period and 1.80 over periods of 5 and 15 minutes).
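To make the threshold idea concrete, here is a rough sketch of how three load averages could be compared against warning and critical triples. It is not the source of check_load: load_state is a hypothetical name, and treating a value that equals the threshold as a breach is an assumption on my part.

```shell
#!/bin/sh
# Sketch of the threshold logic behind a check such as check_load. This is an
# illustration only, not the actual plugin. load_state is a hypothetical
# helper; return codes follow the Nagios plugin convention
# (0 = OK, 1 = Warning, 2 = Critical).
load_state() {
    loads="$1"   # measured averages, "l1,l5,l15"
    warn="$2"    # warning thresholds, same order
    crit="$3"    # critical thresholds, same order
    awk -v l="$loads" -v w="$warn" -v c="$crit" 'BEGIN {
        split(l, L, ","); split(w, W, ","); split(c, C, ",")
        state = 0
        for (i = 1; i <= 3; i++) {
            # Assumption: reaching a threshold already counts as a breach.
            if (L[i] + 0 >= C[i] + 0) { state = 2 }
            else if (L[i] + 0 >= W[i] + 0 && state < 1) { state = 1 }
        }
        exit state
    }'
}

# Usage with the thresholds from nrpe.cfg above:
#   load_state "2.10,1.00,1.00" "2.00,1.30,1.30" "2.50,1.80,1.80"; echo $?
```

The worst of the three comparisons wins, so a single average over its critical level is enough to make the whole check Critical.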

    command[check_backup_run]=/usr/local/nagios/libexec/check_file_age -w 86400 -c 172800 -f /var/log/backup.log

The check_file_age script predictably checks whether a certain file has been updated within the defined time frame. In this case -w and -c take numbers which represent seconds of file age. The script should issue a Warning if the log file is older than 24 hours and a Critical should it not be updated in 48 hours.

    command[check_backup_ok]=/usr/local/nagios/libexec/check_log -F /var/log/backup.log -O /var/log/backup.log.processed -q NOK

We are very lucky that Nagios comes with a log file processor (called check_log). It allows you to keep track of a log file and see whether nasty errors pop up. In this case we are monitoring a log for one of our own applications. The application was modified in such a way that it would start each line with either an “OK” or a “NOK”, depending on whether everything was okay or not. The -q flag in this case stands for “query”.
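When a monitor misbehaves it helps to trace a check_command from services.cfg to the command line that NRPE will actually run. The sketch below does that by hand; check_field and nrpe_command are made-up names, and the lookup assumes the command[...] lines in nrpe.cfg start in the first column, as in the examples in this chapter.

```shell
#!/bin/sh
# Sketch: follow a check_command such as "check_nrpe!check_fwm" to the command
# line NRPE would run. Both helper names are hypothetical.

# Print field N (1 = command name, 2 = first argument, ...) of a check_command.
check_field() {
    printf '%s\n' "$1" | cut -d'!' -f"$2"
}

# Print the command line configured for a given name in an nrpe.cfg-style file.
nrpe_command() {
    cfg="$1"
    name="$2"
    sed -n "s/^command\[$name\]=//p" "$cfg"
}

# Usage on a client:
#   name=$(check_field 'check_nrpe!check_fwm' 2)
#   nrpe_command /usr/local/etc/nrpe.cfg "$name"
```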

Writing your own NRPE scripts
In the future you may be asked to add a monitor for some new application. In that case you'll probably have to write your own command script, since you're probably the first person to want to monitor said application. As an example I will include one of our own scripts and provide explanations throughout the text. We will be taking a look at check_fwm which was mentioned earlier.

    #!/usr/bin/bash
    #
    # Firewall-1 process monitor plugin for Nagios
    #
    # You may have to change this, depending on where you installed
    # your Nagios plugins
    PATH="/usr/bin:/usr/sbin:/bin:/sbin"
    LIBEXEC="/usr/local/nagios/libexec"
    # Name of this script, used in the usage text below
    PROGNAME=`basename $0`
    . $LIBEXEC/utils.sh

We start off by setting some environment variables. The utils.sh script, which is part of Nagios, sets a number of these, among which are $STATE_OK and $STATE_CRITICAL. These are all required in order to make our own script report the proper status back to Nagios.

    print_usage() {
        echo "Usage: $PROGNAME"
        echo "Usage: $PROGNAME --help"
    }

    print_help() {
        echo ""
        print_usage
        echo ""
        echo "Firewall-1 monitor plugin for Nagios"
        echo ""
        echo "This plugin is not developed by the Nagios group."
        echo "Please do not e-mail them for support on this plugin"
        echo ""
        echo "For contact info, read the plugin itself..."
    }

    while test -n "$1"; do
        case "$1" in
            --help) print_help; exit $STATE_OK;;
            -h)     print_help; exit $STATE_OK;;
            *)      print_usage; exit $STATE_UNKNOWN;;
        esac
    done

Then, out of courtesy, we add a few details on the basic usage of this script. This will help future administrators of our environment to understand what we have built.

    check_processes() {
        PROCESS="0"
        # PROCLIST="cpd fwd fwm cpwd cpca cpmad cplmd cpstat cpshrd cpsnmpd"
        PROCLIST="cpd fwd fwm cpwd cpca cpmad cpstat cpsnmpd"
        for PROC in `echo $PROCLIST`; do
            if [ `ps -ef | grep $PROC | grep -v grep | wc -l` -lt 1 ]; then
                PROCESS=1
            fi
        done
        if [ $PROCESS -eq 1 ]; then
            echo "FWM NOK - One or more processes not running"
            exitstatus=$STATE_CRITICAL
            exit $exitstatus
        fi
    }

Finally, we get to the first checks! We define the subroutine check_processes which can be called later on in the script. Basically, it takes a list of process names and checks whether they are all up and running. If one of them appears to be missing a trigger is set and the script exits with a Critical state.

    check_ports() {
        PORTS="0"
        PORTLIST="256 257 18183 18184 18187 18190 18191 18192 18196 18264"
        for NUM in `echo $PORTLIST`; do
            if [ `netstat -an | grep LISTEN | grep $NUM | grep -v grep | wc -l` -lt 1 ]; then
                PORTS=1
            fi
        done
        if [ $PORTS -eq 1 ]; then
            echo "FWM NOK - One or more ports not listening."
            exitstatus=$STATE_CRITICAL
            exit $exitstatus
        fi
    }

Along the same train of thought we also define the check_ports subroutine, which takes a list of port numbers and checks whether they are available and listening for new connections. If one happens to be unavailable the script returns a Critical message back to Nagios.

    check_processes
    check_ports
    echo "FWM OK - Everything running like it should"
    exitstatus=$STATE_OK
    exit $exitstatus

And at the end of the script we actually reach the beginning. Both subroutines are called and run and if neither of them fails, an OK status is returned to Nagios.

So, after all of that we can come to the following conclusion. In order to monitor something we need to:
1. Define a number of characteristics which can be regularly checked.
2. Concoct a method to check these characteristics.
3. If such a check fails, we need to exit with code $STATE_CRITICAL.
4. If all checks are okay, we need to exit while returning $STATE_OK.
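The four steps above can be boiled down to a small skeleton for new scripts. Take it as a starting point under a few assumptions: missing_items and report are made-up names, the APP label is a placeholder, and the state values are repeated here only so that the sketch runs without utils.sh. It also matches names as whole lines, because a plain grep for fwm in the output of ps -ef would also match the running check_fwm script itself.

```shell
#!/bin/sh
# Skeleton for a home-grown check, following the four steps above. The state
# values mirror those set by utils.sh; they are defined here only so that the
# sketch runs on its own. missing_items and report are hypothetical names.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Steps 1 and 2: the characteristic is "every required name is present".
# Reads the observed names from standard input, one per line, and prints the
# required names that were not seen. Whole-line matching avoids false hits
# such as fwm matching check_fwm.
missing_items() {
    observed=$(cat)
    for item in $1; do
        printf '%s\n' "$observed" | grep -Fqx "$item" || printf '%s\n' "$item"
    done
}

# Steps 3 and 4: turn the result into a status line and a return code.
report() {
    if [ -n "$1" ]; then
        echo "APP NOK - missing: $(echo $1)"
        return $STATE_CRITICAL
    fi
    echo "APP OK - Everything running like it should"
    return $STATE_OK
}

# Usage at the end of a real plugin (process names taken from check_fwm):
#   report "$(ps -e -o comm= | missing_items "cpd fwd fwm")"
#   exit $?
```

Keeping the check and the reporting in separate functions makes it possible to try the logic with a fixed list of names before the script is hooked up to NRPE.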

Configuring for each Windows client
Currently, each new Windows client added to Nagios will get at least five different monitors assigned to it. In order to get all of this set up we'll need to add one code block, per monitor, per host to the services.cfg file. The setup of monitors for Windows hosts is exactly the same as the monitors for UNIX boxen. So if you need more details, please refer to the chapter “Configuring for each UNIX client”.

Instead of using NRPE, the Windows clients rely upon NRPE_nt which is a Windows port of the same software. Unfortunately NRPE_nt cannot communicate with check_nrpe 1.9 (used per default for the UNIX clients), so we had to add version 2.0 of the binary as well. This gets called through the check command check_nrpe2.

1. Scripts that run on the master
For example:

    check_command           check_tcp!5666

In this case the Nagios process on the master server calls the check_tcp command to verify that port 5666 is open and awaiting connections (which is the port used by NRPE and NRPE_nt).

2. Scripts that run on the client (with/without parameters)
For example:

    check_command           check_nrpe2!check_load

This line in services.cfg refers to a line in another configuration file, namely C:\NRPEnt\nrpe.cfg on the remote client system. In this case the line in nrpe.cfg would look like this:

    command[check_load]=C:\NRPEnt\plugins\cpuload_nrpe_nt 70 85

It tells the NRPE_nt daemon about a command called check_load which may be called from the Nagios master. Whenever NRPE_nt receives a request to run command check_load, it should run C:\NRPEnt\plugins\cpuload_nrpe_nt and return the exit code back to Nagios.
