4.2 Forced Host Checks vs. Periodic Reachability Tests
Nagios carries out service checks regularly, but host checks only when needed. Although the check_interval parameter provides a way of forcing regular host checks, there is no real reason to do this. There is, however, one reason not to: continual host checks have a considerable impact on the performance of Nagios.
If you nevertheless want to check the reachability of a host regularly, it is better to use a ping-based service check (see Section 6.2, page 88). This also gives you further information, such as response times and possible packet loss, which provides indirect clues about the network load or possible network problems. A host check, on the other hand, still issues an OK even if many packets go missing and the network performance is catastrophic. What is involved here--as the name "host check" implies--is only reachability in principle and not the quality of the connection.
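The difference between the two kinds of check can be sketched in a few lines of Python. This is only an illustration, not how Nagios or its ping plugin is implemented: the function names are hypothetical, and the thresholds (100 ms and 20% loss for a warning, 500 ms and 60% loss for a critical state) are assumed example values.

```python
def host_check(sent: int, received: int) -> str:
    """Reachability in principle: a single reply is enough for OK."""
    if sent <= 0:
        raise ValueError("at least one packet must be sent")
    return "OK" if received > 0 else "CRITICAL"


def ping_service_check(sent: int, received: int, rta_ms: float,
                       warn=(100.0, 20.0), crit=(500.0, 60.0)) -> str:
    """Quality of the connection: compare the round-trip average (ms)
    and the packet loss (%) with warning and critical thresholds.
    The default thresholds are assumed example values."""
    if sent <= 0:
        raise ValueError("at least one packet must be sent")
    loss_pct = 100.0 * (sent - received) / sent
    if received == 0 or rta_ms >= crit[0] or loss_pct >= crit[1]:
        return "CRITICAL"
    if rta_ms >= warn[0] or loss_pct >= warn[1]:
        return "WARNING"
    return "OK"


# Seven of ten packets lost: still reachable in principle,
# but the quality of the connection is poor.
print(host_check(10, 3))
print(ping_service_check(10, 3, rta_ms=40.0))
```

With seven of ten packets lost, the host-style check still reports OK, while the threshold-based check reports a problem.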
Nagios uses plugins for the host and service checks. They provide four different return values (cf. Table 6.1 on page 85): 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN).
The return value UNKNOWN means that the plugin itself did not run properly, perhaps because of wrong parameters. When you start a plugin, you can normally specify the conditions under which it issues a warning or a critical state.
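The return value convention can be illustrated with a minimal plugin-style function. This is a sketch under assumptions, not a real Nagios plugin: the function name evaluate and its three arguments (a measured value, a warning threshold, and a critical threshold) are hypothetical.

```python
# Return values as described in the text
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(argv):
    """Return (return_value, message) for the arguments <value> <warn> <crit>.

    Hypothetical interface: wrong or missing parameters yield UNKNOWN,
    because the check itself could not be carried out.
    """
    try:
        value, warn, crit = (float(arg) for arg in argv)
    except ValueError:
        # Raised for non-numeric input and for a wrong number of arguments
        return UNKNOWN, "UNKNOWN - usage: <value> <warn> <crit>"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - value {value} >= {warn}"
    return OK, f"OK - value {value}"


# A real plugin would print the message and exit with the return value.
for args in (["10", "80", "90"], ["85", "80", "90"], ["abc", "80", "90"]):
    code, message = evaluate(args)
    print(code, message)
```

The thresholds are passed in when the check is started, so the same plugin can be reused with different limits for different services.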
Nagios determines the states of services and hosts from the return values of the plugin. The states for services are the same as the return values OK, WARNING, CRITICAL and UNKNOWN. For the hosts the picture is slightly different: the UP state describes a reachable host, DOWN means that the computer is down, and UNREACHABLE refers to the state of nonreachability, where Nagios cannot test whether the host is available or not, because a parent is down (see Section 4.1, page 72).
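As a rough sketch, the mapping from return values to states might look as follows. The function names are hypothetical, and the host logic is deliberately simplified: it assumes that every non-zero return value counts as a failed host check and that the state of the parents is known as a single boolean.

```python
# Service states mirror the plugin return values one to one
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def service_state(return_value: int) -> str:
    """Translate a plugin return value into a service state."""
    # Assumption of this sketch: unexpected values are treated as UNKNOWN
    return SERVICE_STATES.get(return_value, "UNKNOWN")


def host_state(return_value: int, parent_down: bool) -> str:
    """Translate a host check result into UP, DOWN, or UNREACHABLE.

    Simplification: any non-zero return value is treated as a failure.
    """
    if return_value == 0:
        return "UP"
    # If a parent is down, the host itself cannot be tested
    return "UNREACHABLE" if parent_down else "DOWN"


print(service_state(2))                   # CRITICAL
print(host_state(2, parent_down=False))   # DOWN
print(host_state(2, parent_down=True))    # UNREACHABLE
```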
In addition to this, Nagios makes a distinction between two types of state: soft state and hard state. If a problem occurs for the first time (that is, if there was nothing wrong with the state of a service until now) then the program categorizes the new state initially as a soft state and repeats the test several times. It may be the case that the error state was just a one-off event that was eliminated a short while later. Only if the error continues to exist after multiple testing is it then categorized by Nagios as a hard state. Administrators are informed only of hard states, because messages involving short-term disruptions that disappear again immediately afterwards only add to an unnecessary flood of information.
In our example the chronological sequence of states of a service can be illustrated quite simply. A service with the following parameters is used for this purpose:
define service{
    host_name               proxy
    service_description     DNS
    ...
    normal_check_interval   5
    retry_check_interval    1
    max_check_attempts      5
    ...
}
normal_check_interval specifies at what interval Nagios should check the corresponding service as long as the state is OK or if a hard state exists--in this case, every five minutes. retry_check_interval defines the interval between two service checks during a soft state--one minute in the example. If a new error occurs, then Nagios will take a closer look at the service at shorter intervals.
max_check_attempts determines how often the service check is to be repeated after an error has first occurred. If max_check_attempts has been reached and if the error state continues, Nagios inspects the service again at the intervals specified in normal_check_interval.
Figure 4.4 represents the chronological progression in graphic form: the illustration begins with an OK state (which is always a hard state). Normally Nagios will repeat the service check at five-minute intervals. After ten minutes an error occurs; the state changes to CRITICAL, but this is initially a soft state. At this point in time, Nagios has not yet issued any message.
Now the system checks the service at the intervals specified in retry_check_interval, which here means every minute. After a total of five checks (max_check_attempts) with the same result, the state changes from soft to hard. Only now does Nagios inform the relevant people. The tests are now repeated at the intervals specified in normal_check_interval.
In the next test the service is available again, so its state changes from CRITICAL to OK. Since an OK state is always a hard state, Nagios does not follow up this change with further tests at shorter intervals.
The transition of the service to the OK state after an error in the hard state is referred to as a hard recovery. The system informs the administrators of this (if it is configured to do so) as well as of the change between various error-connected hard states (such as from WARNING to UNKNOWN). If the service recovers from an error soft state to the normal state (OK)--also called a soft recovery--the administrators will, however, not be notified.
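The sequence described above can be reproduced with a small simulation. This is a simplified sketch and not the actual Nagios scheduler: the names simulate and Event are hypothetical, the run is assumed to start in a hard OK state, any non-OK result counts toward the attempts, and repeated notifications for an unchanged hard state are ignored.

```python
from dataclasses import dataclass

OK, SOFT, HARD = "OK", "SOFT", "HARD"


@dataclass
class Event:
    minute: int       # time of the check
    state: str        # result of the check
    state_type: str   # SOFT or HARD
    attempt: int      # current attempt, out of max_check_attempts
    notified: bool    # whether a notification would be sent


def simulate(results, normal_check_interval=5, retry_check_interval=1,
             max_check_attempts=5):
    """Replay a list of check results and return one Event per check."""
    events = []
    minute, attempt = 0, 1
    last_hard, prev_type = OK, HARD   # assumed starting point: hard OK
    for state in results:
        if state == OK:
            # An OK state is always a hard state
            state_type, attempt = HARD, 1
        elif prev_type == HARD and last_hard != OK:
            # The error is already hard and remains hard
            state_type, attempt = HARD, max_check_attempts
        else:
            # New error, or a soft error that is still being retried
            attempt = 1 if prev_type == HARD else attempt + 1
            state_type = HARD if attempt >= max_check_attempts else SOFT
        # Notify only when the hard state changes: hard error, change
        # between hard error states, or hard recovery
        notified = state_type == HARD and state != last_hard
        if state_type == HARD:
            last_hard = state
        events.append(Event(minute, state, state_type, attempt, notified))
        minute += (retry_check_interval if state_type == SOFT
                   else normal_check_interval)
        prev_type = state_type
    return events


timeline = ["OK", "OK"] + ["CRITICAL"] * 5 + ["OK"]
for e in simulate(timeline):
    note = "notify" if e.notified else ""
    print(f"{e.minute:3d} min  {e.state:8s} {e.state_type}  "
          f"{e.attempt}/5  {note}")
```

With the parameters from the example, the checks fall at minutes 0, 5, 10, 11, 12, 13, 14, and 19. Notifications are sent only at minute 14 (the CRITICAL state becomes hard) and at minute 19 (hard recovery). If the service had recovered during one of the retries, the sketch would record a soft recovery without any notification.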
Although the notification system ignores soft states and recoveries from soft states, Nagios still records them in the Web interface and in the log files. In the Web front end, soft states can be identified by the fact that a value such as 2/5 is listed in the column Attempts. This means that max_check_attempts expects five attempts, but only two have been carried out so far. With a hard state, max_check_attempts is listed twice at the corresponding position, which in the example is therefore 5/5.
For the administrator, the duration of the error state in the column Duration of the Web interface is more important than whether the state is still "soft" or already "hard". It allows a better judgment of how large the overall problem may be.
For services that are not available because the host is down, the entry 1/5 in the column Attempts would appear, since Nagios does not repeat service checks until the entire host is reachable again. The failure of a computer can be more easily recognized by its color in the Web interface: the service overview in Figure 4.3 (page 66) marks the failed host in red; if the computer is reachable, the background remains gray.