Home » » Chapter 4: Nagios Basics ( Page1)

Chapter 4: Nagios Basics ( Page1)

Written By Giai phap ma nguon mo on Tuesday, July 19, 2011 | 9:35 PM

Chapter 4 - from the book Nagios: System and Network Monitoring by Wolfgang Barth -- Reprinted by permission from No Starch Press and Open Source Press.  Available at booksellers now.  Full book details are at the bottom of the article.
Nagios Basics
The fact that a host can be reached, in itself, has little meaning if no service is running on it on which somebody or something relies.  Accordingly, everything in Nagios revolves around service checks.  After all, no service can run without a host. If the host computer fails, it also cannot provide the desired service.
Things get slightly more complicated if a router, for example, is brought into play, which lies between users and the system providing services.  If this fails, the desired service may still be running on the target host, but it is nevertheless no longer reachable for the user.
Nagios is in a position to reproduce such dependencies and to precisely inform the administrator of the failure of an important network component, instead of flooding the administrator with irrelevant error messages concerning services that cannot be reached.  An understanding of such dependencies is essential for the smooth operation of Nagios, which is why Section 4.1 will look in more detail at these dependencies and the way Nagios works.
Another important item is the state of a host or service.  On the one hand Nagios allows a much finer distinction than just "ok" or "not ok"; on the other hand the distinction between (soft state) and (hard state) means that the administrator does not have to deal with short-term disruptions that have long since disappeared by the time the administrator has received the information.  These states also influence the intensity of the service checks.  How this functions in detail is described in Section 4.3.
4.1 Taking into Account the Network Topology
How Nagios handles dependencies of hosts and services can be best illustrated with an example.  Figure 4.1 represents a small network in which the Domain Name Service on proxy is to be monitored.
Figure 4.1: Topology of an example network
The service check always serves as the starting point for monitoring that is regularly performed by the system.  As long as the service can be reached, Nagios takes no further steps; that is, it does not perform any host checks.  For switch1, switch2, and proxy, such a check would be pointless anyway, because if the DNS service responds to proxy, then the hosts mentioned are automatically accessible.
If the name service fails, however, Nagios tests the computer involved with a host check, to see whether the service or the host is causing the problem.  If proxy cannot be reached, Nagios might test the parent hosts entered in the configuration (Figure 4.2).  With the parents host parameter, the administrator has a means available to provide Nagios with information on the network topology.
Figure 4.2: The order of tests performed after a service failure.
When doing this, the administrator only enters the direct neighbor computer fo each host on the path to the Nagios server as the parent.1 Hosts that are allocated in the same network segment as the Nagios server itself are defined without a parent.  For the network topology from Figure 4.1, the corresponding configuration (reduced to the host name and parent) appears as follows:
define host{
    host_name  proxy
 ...
   
  parents    switch2
}

define host{
    host_name  switch2
   ...
   
  parents    switch1
}

define host{
    host_name  switch1
      ...
}

switch1 is located in the same network segment as the Nagios server, so it is therefore not allocated a parent computer.  What belongs to a network segment is a matter of opinion: if you interpret the switches as the segment limit, as is the case here, this has the advantage of being able to more closely isolate a disruption. But you can also take a different view and interpret an IP subnetwork as a segment.  Then a router would form the segment limit; in our example, proxy would then count in the same network as the Nagios server.  However, it would no longer be possible to distinguish between a failure of proxy and a failure of switch1 or switch2.
Figure 4.3: Classification of individual network nodes by Nagios.
If switch1 in the example fails, Figure 4.3 shows the sequence in which Nagios proceeds: first the system, when checking the DNS service on proxy, determines that this service is no longer reachable (1).  To differentiate, it now performs a host check to see what the state of the proxy computer is (2).  Since proxy cannot be reached, but it has switch2 as a parent, Nagios also subjects switch2 to a host check (3).  If this switch also cannot be reached, the system checks its parent, switch1 (4).
If Nagios can establish contact with switch1, the cause for the failure of the DNS service on proxy can be isolated to switch2.  The system accordingly specifies the states of the host: switch1 is UP, switch2 DOWN; proxy, on the other hand, is UNREACHABLE.  Through a suitable configuration of the Nagios messaging system (see Section 12.3 on page 217) you can use this distinction to determine, for example, that the administrator is informed only about the host that is in the DOWN state and represents the actual problem, but not about the hosts that are dependent on the down host.
In a further step, Nagios can determine other topology-specific failures in the network (so-called network outages).  proxy is the parent of gate, so gate is also represented as UNREACHABLE (5).  gate in turn also functions as a parent; the Internet server dependent on this is also classified as "UNREACHABLE".
This "intelligence", which distinguishes Nagios, helps the administrator all the more, the more hosts and services are dependent on a failed component.  For a router in the backbone, on which hundreds of hosts and services are dependent, the system informs administrators of the specific disruption, instead of sending them hundreds of error messages that are not wrong in principle, but are not really of any help in trying to eliminate the disruption.
Share this article :

0 comments:

Post a Comment