Understanding escalations in Nagios

A common obstacle to resolving problems is that ownership of a host or service may be blurred. Often no single person is responsible for a host or service, which makes things harder. It is also typical for a service to have subtle dependencies on other things that are, by themselves, too small to be monitored by Nagios. In such cases, it is good to include lower management in the escalations so that they can focus on problems that have not been resolved in a timely manner.

Here is a good example: a database server might fail because a small Perl script that runs before the actual startup to clean things up has entered an infinite loop. The owner of the machine gets notified. But who should fix it? The script owner? Or perhaps the database administrator? Often this ends with different teams each assuming someone else should resolve it: programmers waiting on database administrators and vice versa.

Escalations are a great way to handle such complex problems. In the previous example, if the problem has not been resolved after two hours, the IT team coordinator or manager would be notified. Another hour later, they would get another e-mail. At that point, they would schedule an urgent meeting with the developer who owns the script and the database administrator to discuss how the problem could be solved.

Of course, in real-world scenarios, escalating to management alone would not solve all problems. However, situations often need a coordinator who will take care of communicating issues between teams and try to find a company-wide solution. Business-critical services also require much more attention. In such cases, it is a real benefit for the company to have an escalation ladder that can be followed for all major problems.

Setting up escalations

Nagios offers many ways to set up escalations, depending on your needs. Escalations do not need to be sent out just after a problem occurs; doing so would create confusion and could get in the way of smaller problems being solved. Usually, escalations are set up so that additional people are informed only if a problem has not been resolved after a certain amount of time.

From a configuration point of view, all escalations are defined as separate objects. There are two types of objects—hostescalation and serviceescalation. Escalations are configured so that they start and stop being active along with the normal host or service notifications. This way, if you change the notification_interval directive in host or service definition, the times at which escalations start and stop will also change.

A sample escalation for a company’s main router is as follows:

  define hostescalation {
    host_name             mainrouter
    contact_groups        it-management
    first_notification    2
    last_notification     0
    notification_interval 60
    escalation_options    d,u,r
  }

This defines an escalation for the host mainrouter. The escalation causes the it-management contact group to also be notified, starting with the second notification. It sends notifications when the host is in the DOWN or UNREACHABLE state, as well as when it recovers. Details of how escalations work are described in the next section of the chapter.

The following table describes all available directives for defining a host escalation. Items in bold are required when specifying an escalation.

| Option | Description |
| --- | --- |
| host_name | Specifies a list of all hosts that the escalation should be defined for, separated by commas |
| hostgroup_name | Specifies a list of all host groups that the escalation should be defined for, separated by commas; all hosts inside these host groups will have the escalation defined for them |
| contacts | List of all contacts that should receive notifications related to this escalation, separated by commas; at least one contact or contact group needs to be specified for each escalation |
| contact_groups | List of all contact groups that should receive notifications related to this escalation, separated by commas; at least one contact or contact group needs to be specified for each escalation |
| first_notification | The number of notifications after which this escalation becomes active; how Nagios handles notifications and escalations is described in more detail in the next section of the chapter |
| last_notification | The number of notifications after which this escalation stops being active; setting this to 0 causes notifications to be sent until the host recovers from the problem; how Nagios handles notifications and escalations is described in more detail later in the chapter |
| notification_interval | Specifies the number of minutes between sending notifications related to this escalation |
| escalation_period | Specifies the time period during which this escalation should be valid; if not specified, this defaults to 24 hours a day, 7 days a week |
| escalation_options | Specifies the host states for which notifications should be sent, separated by commas; this can be one or more of the following: d (host DOWN state), u (host UNREACHABLE state), r (host recovery, UP state) |
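As a sketch of how the less common directives combine, the following escalation targets a host group rather than a single host, adds an individual contact, and is limited to a time period. The names core-routers, jdoe, and workhours are assumptions made for illustration; they would need matching hostgroup, contact, and timeperiod definitions elsewhere in the configuration.

  # Sketch only: core-routers, jdoe and workhours are hypothetical names
  define hostescalation {
    hostgroup_name        core-routers
    contacts              jdoe
    contact_groups        it-management
    first_notification    3
    last_notification     6
    notification_interval 30
    escalation_period     workhours
    escalation_options    d,u
  }

With these values, the escalation would be active from the third through the sixth notification, and only for the DOWN and UNREACHABLE states.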

Service escalations are defined in a very similar way to host escalations. You can specify one or more hosts or host groups, as well as a single service description. The service escalation is then associated with this service on all hosts mentioned in the host_name and hostgroup_name attributes.

The following is an example of a service escalation for an OpenVPN check on the company’s main router:

  define serviceescalation
  {
    host_name             mainrouter
    service_description   OpenVPN
    contact_groups        it-management
    first_notification    2
    last_notification     0
    notification_interval 60
    escalation_options    w,c,r
  }

This defines an escalation for the OpenVPN service running on the host mainrouter. The escalation causes the it-management contact group to also be notified, starting with the second notification. It sends notifications when the service is in the WARNING or CRITICAL state, as well as when it recovers. Details of how escalations work are described in the next section of the chapter.

The following table describes all available directives for defining a service escalation. Items in bold are required when specifying an escalation.

| Option | Description |
| --- | --- |
| host_name | Specifies a list of all hosts that the escalation should be defined for, separated by commas |
| hostgroup_name | Specifies a list of all host groups that the escalation should be defined for, separated by commas; all hosts inside these host groups will have the escalation defined for them |
| service_description | The service for which the escalation is being defined |
| contacts | List of all contacts that should receive notifications related to this escalation, separated by commas; at least one contact or contact group needs to be specified for each escalation |
| contact_groups | List of all contact groups that should receive notifications related to this escalation, separated by commas; at least one contact or contact group needs to be specified for each escalation |
| first_notification | The number of notifications after which this escalation becomes active; how Nagios handles notifications and escalations is described in more detail later in the chapter |
| last_notification | The number of notifications after which this escalation stops being active; setting this to 0 causes notifications to be sent until the service recovers from the problem; how Nagios handles notifications and escalations is described in more detail later in the chapter |
| notification_interval | Specifies the number of minutes between sending notifications related to this escalation |
| escalation_period | Specifies the time period during which this escalation should be valid; if not specified, this defaults to 24 hours a day, 7 days a week |
| escalation_options | Specifies the service states for which notifications should be sent, separated by commas; this can be one or more of the following: r (service recovery, OK state), w (service WARNING state), c (service CRITICAL state), u (service UNKNOWN state) |
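To illustrate the hostgroup_name directive in a service escalation, here is a minimal sketch that escalates the same service on every host in a group. The host group vpn-gateways is a hypothetical name, and the sketch assumes each host in that group has a service whose description is OpenVPN.

  # Sketch only: vpn-gateways is a hypothetical host group
  define serviceescalation {
    hostgroup_name        vpn-gateways
    service_description   OpenVPN
    contact_groups        it-management
    first_notification    2
    last_notification     0
    notification_interval 60
    escalation_options    c,u
  }

Here the escalation covers only the CRITICAL and UNKNOWN states, so the it-management group would not receive WARNING or recovery notifications through it.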

Understanding how escalations work

Let’s consider the following configuration—a service along with two escalations:

  define service
  {
    use                   generic-service
    host_name             mainrouter
    service_description   OpenVPN
    check_command         check_openvpn_remote
    check_interval        15
    max_check_attempts    3
    notification_interval 30
    notification_period   24x7
  }
  # Escalation 1
  define serviceescalation {
    host_name             mainrouter
    service_description   OpenVPN
    first_notification    4
    last_notification     8
    contact_groups        it-escalation1
    escalation_options    w,c
    escalation_period     24x7
    notification_interval 15
  }
  # Escalation 2
  define serviceescalation {
    host_name             mainrouter
    service_description   OpenVPN
    first_notification    8
    last_notification     0
    contact_groups        it-escalation2
    escalation_options    w,c,r
    escalation_period     24x7
    notification_interval 120
  }

In order to show how the escalations work, let’s take an example—a failing service. A service fails for a total of 16 hours and then recovers—for the clarity of the example, we’ll skip the soft and hard states and the timing required for hard state transitions.

Service notifications are set up so that the first notification is sent out 30 minutes after the failure. They are then repeated every 30 minutes, as set by notification_interval, so the second notification is sent 1 hour after the actual failure, the third after 1.5 hours, and so on. The service also has two escalations defined for it.

Escalation 1 is first triggered along with the fourth service notification. It stops being active after the eighth service notification about the failure. It only reports problems, not recovery, because escalation_options is set to w,c, which covers the WARNING and CRITICAL states. The interval for this escalation is 15 minutes.

Escalation 2 is first triggered along with the eighth service notification and never stops, because the last_notification directive is set to 0. It reports both problems and recovery: escalation_options is set to w,c,r, which covers the WARNING and CRITICAL states as well as recovery. The interval for this escalation is two hours.

The diagram below shows when both escalations are sent out:

[Diagram: timeline of the service notifications and both escalations]

Notifications for the service itself are sent out 0.5, 1, 1.5, 2 … hours after the initial service failure.

Escalation 1 becomes active after 2 hours, which is when the fourth service notification is sent out. The last notification related to Escalation 1 is sent out 4 hours after the initial failure, which is when the eighth service notification is sent out. It is sent every 15 minutes, so a total of nine notifications related to Escalation 1 are sent out.

Escalation 2 becomes active after 4 hours, which is when the eighth service notification is sent out. Its problem notifications are sent every two hours, at 4, 6, 8, 10, 12, and 14 hours after the failure. The last notification related to Escalation 2 is sent out when the service recovers after 16 hours and concerns the actual problem resolution, so a total of seven notifications related to Escalation 2 are sent out.

Escalations can be defined to be independent of each other—there is no reason why Escalation 2 cannot start after the sixth service notification is sent out. There are also no limits on the number of escalations that can be set up for a single host or service.
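As an illustration of overlapping escalations, a third escalation could begin at the sixth notification while Escalation 1 is still active. This is only a sketch: the contact group it-escalation3, the interval, and the choice to escalate only CRITICAL states are hypothetical and not part of the configuration shown earlier.

  # Hypothetical Escalation 3: starts while Escalation 1 (notifications 4 to 8) is still active
  define serviceescalation {
    host_name             mainrouter
    service_description   OpenVPN
    first_notification    6
    last_notification     0
    contact_groups        it-escalation3
    escalation_options    c
    escalation_period     24x7
    notification_interval 60
  }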

The main point is that escalations should be defined reasonably, so that they don’t flood management or other teams with problems that would be solved without their involvement anyway.

Escalations can also be used to contact different people for a certain set of objects, based on time periods. If an escalation has the first_notification option set to 1 and the last_notification option set to 0, then all notifications related to this escalation will be sent out exactly in the same way as notifications for the service itself.

For example, the regular IT staff may handle problems most of the time, but during holidays notifications about problems should also go to the CritSit team. In that case, you can simply define an escalation saying that during the holidays time period, the CritSit group should also be notified about problems, starting with the first notification. The following is an example that is based on the OpenVPN service defined earlier:

  define serviceescalation
  {
    host_name             mainrouter
    service_description   OpenVPN
    first_notification    1
    last_notification     0
    contact_groups        CritSit
    escalation_period     holidays
    notification_interval 30
    escalation_options    w,c,r
  }

The escalation above applies to the OpenVPN service defined earlier. Please note that the notification_interval option is set to the same value (30 minutes) in both the service and the escalation.
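The holidays time period referenced by this escalation has to be defined as a timeperiod object. A minimal sketch follows, assuming Nagios 3 or later, which supports calendar-date entries in time periods. The alias and the two dates are placeholders for illustration, not a complete holiday calendar.

  # Sketch only: the alias and dates are placeholders
  define timeperiod {
    timeperiod_name       holidays
    alias                 Company holidays
    december 25           00:00-24:00
    january 1             00:00-24:00
  }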
