Tuning Nagios Load Checks

[See also: check_load initial values cheat sheet].

The standard Nagios plugins include a "check_load" command which will raise a warning or error if the load averages for the target machine exceed some threshold. A little while ago the ops manager and I were discussing what those thresholds should be.

The usage for the check_load command is as follows:

Usage: check_load -w WLOAD1,WLOAD5,WLOAD15 -c CLOAD1,CLOAD5,CLOAD15

Without looking at the source I'm pretty sure that the program is either just opening a pipe to uptime or using the /dev/proc file system to read the load averages for the past 1, 5 and 15 minutes. Should be safe to assume then that Nagios' concept of load is exactly the same as uptime's and that the figures are ultimately is coming from the kernel scheduler. (Note: Yup. :-) Just checked the source.)

So the first question is: what does "load" actually measure?

Unix and Linux Load

Breifly, when Unix machines report their "load" (usually through uptime, top or who) they are reporting a weighted average of the number of processes either running or waiting for the CPU (Linux will also count processes that may be blocked waiting on I/O). This average is calculated over 1, 5 and 15 minutes (hence the three values) based on values that are sampled every 5 seconds (on Linux at least). Dr Neil Gunther has written more than you might ever want to know about how those load averages are calculated and what they mean. It's an excellent series of articles (see also the inevitable Wikipedia article).

So assuming we have a single-core CPU, a load value of "1.0" would suggest that the CPU has been 100% utilised over whatever reporting period that figure was calculated for. A load of "2.0" would mean that whenever one process had the CPU there was another that was forced to wait. However, if we have 2 cores, the same "2.0" load value would suggest that both processes got the CPU time they needed, while a load of "1.0" would suggest the CPU had only been at 50% capacity.

On a simple web server, running a single 2-core CPU a load average of "2.0, 1.0, 0.5" suggests that, over the last minute, the CPU has been 100% utilised; over the last 5 minutes it's been 50% utilised; and over the last 15 minutes, it's been 25% utilised. Halve those values if 4 cores are available and double them if only one is in the system.

You can see then that sensible threshold values for warning and critical states requires you to consider how many CPUs and CPU cores your system has. You're therefore probably going to want to set your thresholds per machine or at least set them differently for each different type of configuration.

For example, one of our Solaris boxes has 12 cores so a load of "6.0" is nothing to be concerned about. However that same load figure on another, single-core box might be worthy of a warning or even critical alert, depending on how sensitive we were to process queue lengths on that box. Except if that box is a Linux box with a lot of I/O and slow devices (like a tape drive) and is counting processes that are sitting idle and waiting for an I/O operation to finish. And what is the application running on it? Is it threaded? How is your kernel counting threads in that total -- or is it just counting processes?

Setting the Check_Load Thresholds

So determining an appropriate warning and critical set of threshold values for check_load will depend on what you think a reasonable process queue length will be; how your specific system treats threads; how your applications on that system behave (and their expected responsiveness levels); and how many CPUs / cores your system has. Oh -- and your performance targets or SLAs.

This is why experienced admins use a time honoured, complicated heuristic process to set an initial value and then continually adjust that value based on the correlation of alerts raised and actual performance and hence user impact.

In other words: we rub our bellies and take a guess and then change the values if we get too many or too few alerts. We're experienced sysadmins -- how much time do you think we have? :-)

In our case, for web servers, we decided that over 5 and 15 minute periods we expect spare capacity on the box -- but we only want to be alerted if the box is basically maxing out on CPU over a significant period. Over 1 minute we expect the occasional spike and don't really want an alert unless it's way beyond expectations. We're using Apache with no threading so 1 load point = 1 process using or waiting for CPU.

We've set warning levels for 15 minute load average at number of CPU cores times 2 (plus one!). For 5 minutes increase the threshold by 5. For one minute, increase it by 5 again. Critical threshold starts at number of CPU cores times 4 and then follows the same pattern for the 5 and 1 minute warning.

Here's a sample nrpe.cfg config file for a web server with 2 cores:

command[check_load]=/path/check_load -w 15,10,5 -c 30,25,20

It's important to actually test this set up. Use ApacheBench or JMeter or similar tool to get your load average up and test performance under those thresholds to see if it's acceptable. If your application is unacceptably slow from a user perspective at lower load values then lower your thresholds.

More Information

I've put together a little check_load cheat sheet that has some initial values for some common configurations. It might be a useful starting point if you're just starting to configure check_load in your Nagios environment.

[Note: This post has been edited since initial publication.]

Thursday, July 10, 2008