Chapter 6 Monitoring the System: Monitoring Tools
The HP XC System Software includes the shownode metrics command in addition to such standard Linux monitoring commands as the following:
You can use these administrative commands from any node to determine the health of an individual node. Information about these commands is available from their corresponding manpages. The HP XC system also includes the Nagios Web-based utility for system monitoring and commands such as the shownode command. These are discussed in this chapter.
The shownode metrics command, which can be issued from any node in the HP XC system, lets you monitor the status of all the nodes in the system. The following arguments to the shownode metrics command monitor the node status:
For more information, see “Displaying System Statistics” and shownode(8).
The HP XC System Software uses the Nagios open source monitoring application to gather and display system statistics, such as processor load and disk usage. Nagios is a system and network health monitoring application. It watches the hosts and services that you specify and alerts you when problems occur or are resolved. On the HP XC system, Nagios is integrated with Supermon for monitoring capabilities.
The HP XC system automatically configures the Nagios environment based on the configuration of the system. The autoconfiguration is based on the information in the HP XC configuration and management database (cmdb), and the Nagios configuration is updated whenever the HP XC database changes. Autoconfiguration includes setting up Nagios configuration templates mapped to the configuration for both hosts and services.
Nagios reports information collected using the Supermon infrastructure. The data collected by Supermon includes both system performance metrics and environment data, such as fan, temperature, and power supply status. This data is collected on a regular basis by the Supermon facility.
Table 6-1 lists the services monitored by Nagios and the type of function monitored for each service.
Table 6-1 Services Monitored by Nagios
The HP XC Nagios configuration is designed so that you can customize it as needed. You can find the complete documentation for customizing Nagios on the Nagios Web site:
The Nagios system has a Web interface for the information gathered. The Web interface is available over a secure connection. Enter the following URL in your browser to access the Nagios main window:
Figure 6-2 illustrates the Nagios main window. You can choose any of the options on the left navigation bar. These options are shown in Figure 6-3.
After you choose an option from the window, you are prompted for a login and a password. This login and password are established when the HP XC system is configured. Usually, the login name is nagiosadmin. The Nagios passwords are maintained in the /opt/hptc/nagios/etc/htpasswd.users file. Use the htpasswd command on this file to add a user, delete a user, or change a user's password.
Nagios offers various views of the HP XC system. These views include the following:
For administrators of systems with tens of nodes, the Service Detail view provides a good overview of the system. It lists the Nagios hosts and shows their status. Figure 6-4 is an example of the Nagios Service Detail view.
The Service Problems view is more useful for administrators of systems with hundreds or thousands of nodes. It provides a practical overview of the system, including the number of Nagios hosts that are up or down and Nagios service status totals. Figure 6-5 is an example of the Nagios Service Problems view.
Both views offer color coding so that you can detect problems at a glance. The HP XC System Software provides plug-ins that monitor these and other system statistics.
The HP XC System Software incorporates the Nan notification aggregator and delimiter for the Nagios paging system. Nan is an open source supplement to the Nagios application. Nagios can send large quantities of messages, especially when the system is starting up, shutting down, or experiencing a failure. The Nan front-end utility overcomes the problem of multiple messages by collecting, batching, and reformatting these messages so that they are sent in a controlled manner. You can configure how multiple concatenated notifications are sorted so that the most important notifications appear at the top.
The Nan utility consists of the following:
When the nand daemon receives a notification from Nagios, it starts a timer. If the notification is a PROBLEM or RECOVERY, the default time is 300 seconds; if the notification is an ACKNOWLEDGEMENT, the default time is 600 seconds. Subsequent Nagios notifications are queued in the /opt/hptc/nagios/var/nanqueue directory until the specified time elapses. Then the nand daemon sends a condensed message based on the following criteria:
The default values specified in the nand.conf and nanc.conf configuration files are appropriate for most HP XC systems; however, you can change these values to suit your installation. Follow these steps when updating either or both of these configuration files.
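To illustrate the timing behavior that these default values control, the following is a minimal in-memory sketch of the queue-and-timer logic described above, not the nand implementation. The class names, the priority field, and the output format are hypothetical; the real daemon queues notifications on disk in /opt/hptc/nagios/var/nanqueue and takes its settings from nand.conf. Only the 300-second and 600-second defaults come from this section.

```python
from dataclasses import dataclass

# Default hold times in seconds, as stated in this section.
DEFAULT_WINDOWS = {"PROBLEM": 300, "RECOVERY": 300, "ACKNOWLEDGEMENT": 600}


@dataclass
class Notification:
    kind: str          # PROBLEM, RECOVERY, or ACKNOWLEDGEMENT
    host: str
    message: str
    priority: int = 0  # hypothetical field: higher values sort first


class Aggregator:
    """Hypothetical stand-in for the batching behavior of the nand daemon."""

    def __init__(self, windows=None):
        self.windows = dict(windows or DEFAULT_WINDOWS)
        self.queue = []
        self.deadline = None  # no timer is running until a notification arrives

    def receive(self, note, now):
        # The first notification starts the timer; later ones are only queued.
        if self.deadline is None:
            self.deadline = now + self.windows[note.kind]
        self.queue.append(note)

    def flush_if_due(self, now):
        # Return one condensed message once the timer has elapsed, else None.
        if self.deadline is None or now < self.deadline:
            return None
        # Stable sort: equal priorities keep their arrival order.
        batch = sorted(self.queue, key=lambda n: -n.priority)
        self.queue = []
        self.deadline = None
        return "\n".join(f"{n.kind} {n.host}: {n.message}" for n in batch)
```

Passing a different windows dictionary to Aggregator plays the role that editing the timer values in the configuration file plays for the real daemon.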
Supermon is a highly scalable, high-speed cluster monitoring system. Supermon provides all required node statistics to the Nagios subsystem. System statistics are tiered, aggregated, and stored in the HP XC System Software database. The Supermon components consist of the kernel modules that collect the statistics, the mond and supermond daemons, and the script that loads and configures the daemons.
The data collected by Supermon includes system performance metrics and environment data, such as fan, temperature, and power supply status. This data is collected on a regular basis.
The syslog service runs on each node in the HP XC system. The syslog daemons capture log information and send it to an aggregator regional node. Regional nodes are assigned to each client node. The syslogng_forward service on each regional node enables the node to act as a log aggregator for the global node. Log information is gathered, consolidated, and forwarded to the global node; the global node is not necessarily the head node.
Nagios has a syslog plug-in, check_syslogAlerts, that applies a set of rules against all the events in the consolidated log file and generates alerts for those events that match one of the rules. The rules reside in the /opt/hptc/nagios/etc/syslogAlertRules file. You can modify this rules file to add rules.
You can use the collectl utility to collect data on the nodes of the HP XC system. As a development or debug tool, the collectl utility typically gathers more detail more frequently than Supermon. The collectl utility does have some overhead, but for most situations, it consumes less than 0.1 percent of the CPU and has minimal effect on user applications. However, even this low level can have a significant impact on some applications, so use the collectl utility with care.
The collectl utility also enables you to play back the data either as raw ASCII characters or in a plot form, which can be used to display the data with GnuPlot or Microsoft Excel. Figure 6-6 shows one example of a graph plotted from CPU data collected by the collectl utility. Example 6-1 illustrates the collectl utility's ASCII output.
You can use any of the following methods to run the collectl utility:
The default action of this utility is to collect data at 10-second intervals and to display the data in ASCII characters on the terminal screen. Example 6-1 shows the invocation and first record reported from the collectl utility. The information has been edited to fit horizontally on the page.
Example 6-1 Using the collectl Utility from the Command Line
The collectl utility provides alternate output formats:
For a discussion of the options to the collectl utility and a description of its output, see collectl(1).
After it is enabled, the collectl utility can be run as a service. You can use the service command to stop and start the collectl service. You can also obtain the current status of this service, as shown in the following example:
The collectl service is set up to collect normally reported summary data and to write it in a compressed text file in the /var/log/collectl directory. The actions of the collectl service are specified by the /opt/hptc/config/services/collectl.ini file. By default, the collectl service gathers information on the following subsystems:
The collectl(1) manpage discusses running the collectl utility as a service.
You can run the collectl utility as one job in a batch job submission. In a batch job submission, the purpose of the collectl utility is to monitor the node while the batch job processes. You must modify the job submission script, as follows:
Another alternative is to log in to one of the compute nodes used by the application, and run the collectl utility on the command line.
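For the batch job approach described above, the general pattern is to start the monitor, run the application, and stop the monitor when the application finishes. The following is a minimal sketch of that pattern as a Python wrapper, assuming the job script is allowed to launch helper processes; it is not the HP XC job script modification itself. The function name and the sample collectl arguments in the comment are illustrative assumptions; check collectl(1) for the options appropriate to your site.

```python
import subprocess


def run_with_monitor(monitor_cmd, job_cmd, stop_timeout=10):
    """Run job_cmd while monitor_cmd runs alongside it.

    monitor_cmd and job_cmd are argument lists, for example
    ["collectl", "-f", "/tmp/collectl"] (illustrative arguments only)
    and ["./my_app"] (hypothetical application). Returns the job's
    exit status.
    """
    # Start the monitor first so that it covers the whole job.
    monitor = subprocess.Popen(monitor_cmd)
    try:
        # Block until the job completes, then report its exit status.
        return subprocess.run(job_cmd).returncode
    finally:
        # Stop the monitor even if the job failed to start or raised.
        monitor.terminate()
        try:
            monitor.wait(timeout=stop_timeout)
        except subprocess.TimeoutExpired:
            # The monitor ignored the termination request; force it.
            monitor.kill()
            monitor.wait()
```

Stopping the monitor in the finally clause keeps a failed job from leaving a stray monitoring process on the compute node.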