Understanding distributed monitoring using Nagios

There are many situations in which you may want more than one Nagios instance monitoring your IT infrastructure. One reason can be firewall rules that force checks to be made from within local networks. Another could be the need to load balance checks across machines because of latency or the sheer number of checks. Yet another is the need to monitor machines in different physical locations from separate servers, so that local administrators can check what is wrong within the local infrastructure even if the links to the central servers are temporarily down.

Regardless of the reason, you may want or need to split the execution of checks across multiple computers. This type of setup might sound complicated and hard to configure, but it is not as hard as it seems. All that’s necessary is to set up multiple Nagios instances along with the NRDP agents or daemons.

There are subtle differences in how various instances need to be configured. Usually, there are one or more Nagios instances that report information to a central Nagios instance. An instance that reports information to another Nagios machine will be referred to as a slave. A Nagios instance that receives reports from one or more slaves will be called a master.

Let’s consider a simple organization that has four branch offices and a headquarters. Each branch office is connected to the main office and has a local set of computers. A typical scenario is that a local instance of Nagios monitors the computers and routers in a single branch. The results are then sent to the central Nagios server over the NRDP protocol. These are the slave Nagios instances. If a connection to one of the branches is broken, the local administrators will continue to have access to the status of the local machines, even though this information cannot be propagated to the master Nagios server at that time. Setting up the services on the central Nagios server to use freshness checks will cause the central Nagios server to generate an alert when no results are received within a predetermined time frame. Combining this with parent configurations will allow Nagios to accurately determine the root cause of the problems.
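
A parents declaration on the master ties each branch host to the link it depends on, which is what lets Nagios tell a down host from an unreachable one. The following is a minimal sketch, assuming a generic-host template and hypothetical branch1-router and branch1-server names:

define host 
{ 
    use         generic-host 
    host_name   branch1-server 
    address     10.1.0.10 
    parents     branch1-router 
} 

With this in place, if the branch link goes down, Nagios can flag branch1-server as UNREACHABLE rather than DOWN, pointing at the router as the root cause.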

The following diagram shows how a typical setup in a multiple-branch configuration is done. It shows the network topology: which machines are checked by which Nagios server, and how this information is reported to the central Nagios server.

[Diagram: branch networks monitored by slave Nagios instances, with results reported to the central master Nagios server over NRDP]

In this example, each branch has a Nagios slave server that monitors and logs information on the local computers. This information is then propagated to the master Nagios server.

Introducing obsessive notifications

Monitoring IT infrastructure using multiple Nagios instances requires a way to send information from the slave servers to one or more master servers. This can be done with event handlers that are triggered when a service or host state changes; however, this approach has a huge drawback: it requires setting up an event handler for each host and service object. Another disadvantage is that event handlers are only triggered on actual state changes, not after every check.

Nagios offers another way to do this: obsessive notifications. These provide a mechanism to run a command every time a host or service check result is received, regardless of whether it is a passive or an active check result. This functionality is also set up system-wide, which means that the object definitions do not need to be changed in any way for Nagios to send information about their status.

Setting up obsessive notifications requires a couple of changes in your configuration. The first one is to define a command that will be run for each notification. An example of this is shown as follows:

define command 
{ 
    command_name  send-ocsp 
    command_line  $USER1$/send-ocsp 192.168.0.1 $SERVICESTATE$ 
                  $HOSTNAME$ '$SERVICEDESC$' '$SERVICEOUTPUT$' 
} 

The command_line directive needs to be entered as a single line in your configuration file. Also, put the actual IP address of your central Nagios server in place of 192.168.0.1 in the preceding example.

We now need to write commands that simply pass the results to the other server over NRDP.

A sample script is as follows:

#!/bin/sh 
 
# args: nrdp-server state hostname svcname output 
 
URL=http://$1/nrdp/ 
TOKEN=cu8Eiquasoomeiphahpa 
 
# map status to exit code 
STATE=3 
case "$2" in 
  OK) 
    STATE=0 
    ;; 
  WARNING) 
    STATE=1 
    ;; 
  CRITICAL) 
    STATE=2 
    ;; 
esac 
 
/opt/nagios/bin/send_nrdp.php \ 
  --url=$URL --token=$TOKEN --host="$3" --service="$4" \ 
  --state=$STATE --output="$5" 
 
exit 0 

The script passes the information to be sent to the master Nagios instance to the send_nrdp.php script. This requires the NRDP client to be set up on the slave Nagios machine. Installing the NRDP client is described in more detail in Chapter 9, Passive Checks and NRDP.

The TOKEN variable in the script should be set to a valid token defined in the NRDP server configuration file.
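
For reference, in a default NRDP server installation the tokens are listed in the authorized_tokens array in config.inc.php; a minimal sketch using the token from the script above:

$cfg['authorized_tokens'] = array( 
  "cu8Eiquasoomeiphahpa", 
); 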

The script first converts the status from text (OK, WARNING, or CRITICAL) to the exit code required by the send_nrdp.php script. It then passes the name of the host and service, the exit code, and the output from the check.

The following are the required parameters along with the sample values that should be set in the main Nagios configuration file (nagios.cfg):

obsess_over_services=1 
ocsp_command=send-ocsp 

The command name should match the name in the command definition.

That’s it! After reloading your Nagios configuration, the send-ocsp command will be run every time a service check result comes in.
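
How you reload depends on how Nagios was installed; a sketch of two common options follows (the lock file path is an assumption based on the /var/nagios directory used elsewhere in this chapter, so adjust it to your installation):

systemctl reload nagios 
# alternatively, signal the running daemon directly 
kill -HUP $(cat /var/nagios/nagios.lock) 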

Configuring Nagios to send host status information is very similar to setting up the sending of service status information. The first thing to do is to define the command that will be run for each notification, which is as follows:

define command 
{ 
    command_name  send-ochp 
    command_line  $USER1$/send-ochp 192.168.0.1 
    $HOSTSTATE$ $HOSTNAME$ '$HOSTOUTPUT$' 
} 

Note that the command_line directive in the preceding example needs to be specified in a single line.

The script to send host status information will look almost exactly like the one for sending service status information, except that the actual command sent over NRDP is generated a bit differently. It also converts the status from text to an exit code and passes the hostname, exit code, and output from the check to the send_nrdp.php script; passing only the hostname (without a service name) indicates that it is a host check result:

/opt/nagios/bin/send_nrdp.php \ 
  --url=$URL --token=$TOKEN --host="$3" \ 
  --state=$STATE --output="$4" 
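
Putting it all together, a complete send-ochp script might look as follows; this is a sketch modeled on the send-ocsp script shown earlier, with host states here mapped to exit codes 0 for UP, 1 for DOWN, and 2 for UNREACHABLE:

#!/bin/sh 

# args: nrdp-server hoststate hostname output 

URL=http://$1/nrdp/ 
TOKEN=cu8Eiquasoomeiphahpa 

# map status to exit code; unexpected values are treated as DOWN 
STATE=1 
case "$2" in 
  UP) 
    STATE=0 
    ;; 
  DOWN) 
    STATE=1 
    ;; 
  UNREACHABLE) 
    STATE=2 
    ;; 
esac 

/opt/nagios/bin/send_nrdp.php \ 
  --url=$URL --token=$TOKEN --host="$3" \ 
  --state=$STATE --output="$4" 

exit 0 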

In order for Nagios to send notifications to another Nagios instance, we need to enable obsessing over hosts and specify the actual command to use.

Here are some sample directives in the main Nagios configuration file (nagios.cfg):

obsess_over_hosts=1 
ochp_command=send-ochp 

Restart Nagios after these changes have been made to the configuration. Once it restarts, Nagios will begin sending notifications to the master server.

It is a good idea to check the nagios.log file to see whether notifications are being sent out after a check has been made. By default, the file is in the /var/nagios directory.
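
Assuming log_external_commands is enabled on the master, each passive result that arrives over NRDP should be logged as a PROCESS_SERVICE_CHECK_RESULT (or PROCESS_HOST_CHECK_RESULT) external command, so a quick check on the master could be:

grep PROCESS_SERVICE_CHECK_RESULT /var/nagios/nagios.log 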

If the notifications are not received, it may be a good idea to make the scripts responsible for sending messages log this information, either in the system log or in a separate log file. This is very helpful for debugging cases where the notifications sent out by slave Nagios instances get lost.

Writing information to the system log can be done using the logger command (for more details, refer to http://linux.die.net/man/1/logger). The following example shows how to write information to the log:

logger --priority info --tag nagios \ 
  "Sending host $3 state $STATE ($4)" 

This will log the same data that is sent using the send_nrdp.php script, so it can be looked up later if needed.

Configuring Nagios instances

Setting up multiple servers to monitor infrastructure using Nagios is not trivial, but it is not too hard either. It requires a slightly different approach compared with setting up a single machine, in particular for the configuration of hosts and services. It is also necessary to set up all slave and master servers correctly, each in a slightly different way.

Distributed monitoring requires a more mature change control and versioning process for Nagios configurations. This is necessary because both the central Nagios server and its branches need to have a partial or complete configuration available, and these need to be in sync across all machines.

Usually, it is recommended that you make the slave servers check both service and host status. It is also recommended that you disable service checks on the master Nagios server, but keep host checks enabled. The reason is that host checks are not usually scheduled; they are performed only when a service check returns a WARNING, CRITICAL, or UNKNOWN status. Therefore, the load required to check only the hosts is much lower than the load required to perform regular service checks. In some cases, it is best to also disable host checks on the master, for example, if the slaves already perform host checks regularly or if security policies disallow checks made by the central server.

To maintain Nagios configurations, we recommend that you set up a versioning system such as Git (http://git-scm.com/), Subversion (http://subversion.tigris.org/), or Mercurial (http://www.mercurial-scm.org/). This will allow us to keep track of all the Nagios changes and make it much easier to apply configuration changes to multiple machines.
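
For example, placing an existing configuration directory under Git takes only a few commands; the /etc/nagios path is an assumption, so use wherever your configuration actually lives:

cd /etc/nagios 
git init 
git add . 
git commit -m "Initial Nagios configuration" 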

We can store and manage the configuration similarly to how we did it previously. Hosts, services, and the corresponding groups should be kept in directories separate for each Nagios slave, for example, hosts/branch1 and services/branch1. All other types of objects, such as contacts, time periods, and check commands, can be kept in global directories and reused in all branches, for example, single contacts, timeperiods, and commands directories.
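
Such a layout might look as follows; the branch and file names are purely illustrative:

hosts/branch1/hosts.cfg 
services/branch1/services.cfg 
hosts/branch2/hosts.cfg 
services/branch2/services.cfg 
contacts/contacts.cfg 
timeperiods/timeperiods.cfg 
commands/commands.cfg 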

It’s also a good idea to create a small system to deploy the configuration to all the machines, along with the ability to test a new configuration before applying it in production. This can be done with a small number of shell scripts, as shown in the sketch below. When dealing with multiple computers, locations, and Nagios instances, doing everything manually is error prone and becomes problematic over the long term: the system becomes unmanageable, and out-of-sync configurations between the slave and master Nagios instances can lead to errors in actual checks.
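
As a minimal sketch, such a script could verify the configuration first and deploy it only if verification succeeds; the branch1-nagios host name and the paths are assumptions:

#!/bin/sh 

# verify the configuration before deploying it 
/opt/nagios/bin/nagios -v /etc/nagios/nagios.cfg || exit 1 

# copy the configuration to a slave and reload Nagios there 
rsync -a --delete /etc/nagios/ branch1-nagios:/etc/nagios/ 
ssh branch1-nagios systemctl reload nagios 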

A very popular tool recommended for this purpose is cfengine (http://www.cfengine.com/). Other tools that can be used for automating configuration deployment include Chef (http://www.getchef.com/), Puppet (http://www.puppetlabs.com/), and Ansible (http://www.ansible.com/). They can be used to automate configuration deployment and to ensure that Nagios is up to date on all the machines. They also allow for customization; for example, slave servers can receive a set of files different from the set on the master server. If you are already familiar with such tools, we recommend that you use them to manage Nagios deployments. If not, try them out and choose the one that best suits you.

The first step in creating a distributed environment is to set up the master Nagios server. This will require you to install Nagios from a binary distribution or build it from sources. Details related to Nagios installation are described in Chapter 2, Installing Nagios 4.

The main changes to a single-machine Nagios setup for a master server are made in the main Nagios configuration file, nagios.cfg. This file must contain cfg_dir directives for the objects related to all of the slave servers; otherwise, the master Nagios instance will ignore reports related to hosts that it does not know about.
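
For example, assuming the directory layout described earlier is rooted at /etc/nagios, the master's nagios.cfg could contain:

cfg_dir=/etc/nagios/hosts/branch1 
cfg_dir=/etc/nagios/services/branch1 
cfg_dir=/etc/nagios/hosts/branch2 
cfg_dir=/etc/nagios/services/branch2 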

We’ll also need to make sure that Nagios accepts passive check results for services and that the master Nagios instance does not independently perform active checks. To do this, set the following options in the main Nagios configuration file on the master server:

check_external_commands=1 
accept_passive_service_checks=1 
execute_service_checks=0 

If you also want to rely on passive check results for host checks, you will also need to add the following lines to your main Nagios configuration:

accept_passive_host_checks=1 
execute_host_checks=0 

You will also need to set up the NRDP server on the master Nagios instance. Details of how to set this up are described in Chapter 9, Passive Checks and NRDP.

The next step is to set up the first slave server that will report to the master Nagios instance. This also means that you will need to set up Nagios from a binary or source distribution and configure it properly.

All of the slave Nagios instances also need to have the send_nrdp.php script from the NRDP package in order to communicate changes to the master instance. It is also a good idea to check whether sending dummy reports about an existing host and an existing service works correctly.
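
Such a manual test could look like the following, reusing the token from earlier and the linuxbox02/SSH objects used as examples later in this chapter; the host and service must exist in the master's configuration:

/opt/nagios/bin/send_nrdp.php \ 
  --url=http://192.168.0.1/nrdp/ --token=cu8Eiquasoomeiphahpa \ 
  --host=linuxbox02 --service=SSH \ 
  --state=0 --output="Test check result" 

If the master accepts the report, the service status in its web interface should change shortly afterwards.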

All of the slave instances need to be set up to send obsessive notifications to the master Nagios server. This includes setting up the OCSP and OCHP commands and enabling them in the main Nagios configuration file (obsessive notifications were described earlier in this chapter, in the Introducing obsessive notifications section).

After setting everything up, it's best to run the notification commands directly from the command line (an example follows) to see whether everything works correctly. Next, restart the slave Nagios server. After that, it is a good idea to check the Nagios logs to see whether the notifications are being sent out.
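
For example, assuming the send-ocsp script was installed in the directory that $USER1$ points to (here taken to be /opt/nagios/plugins), a manual invocation might be:

/opt/nagios/plugins/send-ocsp 192.168.0.1 OK linuxbox02 'SSH' 'SSH OK - 0.1s response time' 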

It would also be a good idea to write down or automate all the steps needed to set up a Nagios slave instance. Setting up the master is done only once, but large networks may require you to set up a large number of slaves.

Performing freshness checking

We have now set up distributed monitoring, and the slave Nagios instances should report their results to the master Nagios daemon. Everything should work fine, and the main web interface should report up-to-date information for all of the hosts and services being monitored.

Unfortunately, this is not always the case. Network connectivity can go down, or the NRDP client or server might fail temporarily. At that point, the master Nagios instance may not even know about it.

Since the master Nagios instance is not itself responsible for monitoring the IT infrastructure, it needs to rely on other systems to do it. The configuration we have set up so far does not take into account the situation where check results are not sent to the master instance.

Nagios offers a way to monitor whether results have come in within a certain period of time. If no report comes in within that period, we can tell Nagios to treat this as a critical state and warn the administrators about it. This makes sense because obsessive notifications are sent out very frequently; if a service is scheduled to be checked every 15 minutes and no notification has come in within the previous hour, this most likely indicates a problem with some part of the distributed monitoring setup.

Implementing this in the master Nagios configuration requires a slightly different approach from the one described in the previous section, which was to disable service checks completely. In this case, it is necessary to enable service checks (and host checks, if needed) globally in the nagios.cfg file, which is why all of the affected services and/or hosts need to have their active checks reconfigured for the new approach to work correctly.

For the reasons given earlier, all of the services and/or hosts that receive notifications from slave Nagios instances need to be defined differently in the master configuration than in the slave configurations.

The first change is that active checks for these objects need to be enabled but not scheduled, that is, the normal_check_interval option should not be set. In addition, the check_freshness and freshness_threshold options need to be specified. The first option enables monitoring of whether results are up to date; the second specifies the number of seconds after which results should be considered outdated.

This means that Nagios will only run an active check if there has been no passive check result for the specified period of time. It is very important that the host and service definitions on both the master and slave instances have the same value for the check_period directive; otherwise, the master Nagios instance may raise false alerts for services that are only checked during specific time periods. An example is the workinghours time period, where no results arrive on weekends.

For example, the following service definition will accept passive checks, but will report an error if they are not present:

  define service 
  { 
    use                             generic-service 
    host_name                       linuxbox02 
    service_description             SSH 
    check_command                   no-passive-check-results
    
    check_freshness                 1
    freshness_threshold             43200

    active_checks_enabled           1 
    passive_checks_enabled          1 
  } 

The freshness_threshold option specifies the number of seconds after which an active check should be performed. In this case, it is set to 12 hours.

It is also necessary to define a command that will run if no passive check results have been provided.

The following command will use the check_dummy plugin to report an error:

  define command 
  { 
    command_name         no-passive-check-results 
    command_line         $USER1$/check_dummy 2 "No passive check results" 
  } 

It is important to make sure that all of the services and/or hosts are defined this way, so that only dummy checks that report problems (and not actual active checks) are performed. This is different from our previous approach, which made sure active checks were never performed.

The main drawback of this approach is that it makes managing the configurations on the master and slave instances more difficult. The configuration for the master Nagios instance needs to contain services with only the dummy freshness checks, whereas the slave configurations need to have complete check definitions in place.
