HP XC System Software : Administration Guide > Chapter 6 Monitoring the System

Monitoring Tools


The HP XC System Software includes the shownode metrics command in addition to such standard Linux monitoring commands as the following:

  • ps

  • sar

  • top

  • uptime

  • vmstat

  • w

You can use these administrative commands from any node to determine the health of an individual node. Information for these commands is available from their corresponding manpages.

The HP XC system also includes the Nagios Web-based utility for system monitoring and such commands as the shownode command. These are discussed in this chapter.

Commands for Monitoring Node Status

The shownode metrics command, which can be issued from any node in the HP XC system, provides the ability to monitor the status of all the nodes in the system.

The following arguments to the shownode metrics command monitor the node status:

  • shownode metrics cpus

  • shownode metrics cputotals

  • shownode metrics load

  • shownode metrics mem

  • shownode metrics paging

  • shownode metrics sensors

  • shownode metrics swap

For more information, see “Displaying System Statistics” and shownode(8).
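
To capture all of these reports in one pass, the arguments listed above can be run in a loop. The following is a minimal sketch: the function name show_all_metrics is hypothetical, and only the shownode metrics arguments themselves come from this section.

```shell
# Hypothetical helper: run every "shownode metrics" argument listed in this
# section, printing a banner before each report.
show_all_metrics() {
    for metric in cpus cputotals load mem paging sensors swap; do
        echo "=== shownode metrics $metric ==="
        shownode metrics "$metric"
    done
}
```

Calling show_all_metrics from any node prints the reports one after another; redirect the output to a file to keep a snapshot.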

Nagios

The HP XC System Software uses the Nagios Open Source monitoring application to gather and display system statistics, such as processor load and disk usage. Nagios is a system and network health monitoring application. It watches hosts and services that you specify and alerts you when problems occur or are resolved. On the HP XC system, Nagios is integrated with Supermon for monitoring capabilities.

The HP XC system automatically configures the Nagios environment based on the configuration of the system. The autoconfiguration is based on the information in the HP XC configuration and management database (cmdb). The configuration is updated as a result of changes to the HP XC database. Autoconfiguration includes setting up Nagios configuration templates mapped to the configuration for both hosts and services.

Nagios reports information collected using the Supermon infrastructure. The data collected by Supermon includes both system performance metrics and environment data, such as fan, temperature, and power supply status. This data is collected on a regular basis by the Supermon facility.

Table 6-1 lists the services monitored by Nagios and the type of function monitored for that service.

Table 6-1 Services Monitored by Nagios

Service    Function

Apache HTTPS Server

Monitors the Web server providing the Nagios Web interface

Configuration Monitor

Periodically generates and updates configuration display information for all nodes in the HP XC system (see “configuration” below)

configuration

Configuration information reported for this node

Environment

Reports on this node's sensor status. Depending on the node type, all available “live” sensors are reported. Select the status information URL for detailed information.

LSF Failover Monitor

Monitors the LSF master daemon and reports its status. Causes LSF master failover, if required.

Load Average

Reports this node's most recently collected load average. Alerts are generated based on thresholds defined in /opt/hptc/nagios/etc/nagios_vars.ini

Nagios Monitor

Reports on the status of the Nagios master and monitor daemons across the HP XC system. Nagios daemons run only on service nodes. Smaller systems may have only a single master on the head node.

NodeInfo

Reports and alerts based on this node's process counts (total, user, and zombie processes) and system uptime.

PING Interconnect

Interconnect ping check

Power

Reports and alerts based on this node's power status and management port ping status.

Power Monitor

Collects and gathers the power status for this monitor/masters set of managed nodes (domain). Individual node status is displayed through “Power” status above.

Resource Monitor

Collects and gathers resource (squeue) information for this monitor/masters set of managed nodes (domain). Individual node status is displayed through “Resource Status” below.

Resource Status

Reports and alerts based on this node's resource usage.

Root key synchronization

Verifies root ssh configuration files are synchronized across the HP XC system.

SLURM Monitor

Collects and gathers resource (sinfo) information for this monitor/masters set of managed nodes (domain). Individual node status is displayed through “SLURM Status” below

SLURM Status

Reports and alerts based on this node's SLURM status

Supermon Metrics Monitor

Gathers Supermon metrics for this monitor/masters set of managed nodes (domain). Load average, environmental, and node information data is collected through this plug-in and stored in the management database.

Syslog Alert Monitor

Monitors the consolidated log based on patterns in the /opt/hptc/nagios/etc/syslogAlertRules file. Individual per-node results are reported through “Syslog Alerts” below.

Syslog Alerts

Reports and alerts based on this node's syslog alert matches.

System Free Space

System free space reported by Supermon for this node.

clusternecs1-1

Reports on Procurve switch status including available sensor information as well as checks each port for low speed connections.

 

The HP XC Nagios configuration is designed so that you can customize it as needed. You can find the complete documentation for customizing Nagios on the Nagios Web site:

www.nagios.org

The Nagios system has a Web interface for the information gathered. The Web interface is available over a secure connection. Enter the following URL in your browser to access the Nagios main window:

https://fully-qualified-HP_XC-hostname/nagios

Figure 6-2 illustrates the Nagios main window.

Figure 6-2 Nagios Main Window

You can choose any of the options on the left navigation bar. These options are shown in Figure 6-3.

Figure 6-3 Nagios Menu (Truncated)

After you choose an option from the window, you are prompted for a login name and a password. This login name and password are established when the HP XC system is configured. Usually, the login name is nagiosadmin.

The Nagios passwords are maintained in the /opt/hptc/nagios/etc/htpasswd.users file. Use the htpasswd command to manipulate this file to add a user, to delete a user, or to change the user password.
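
As a minimal sketch of these operations: when given a password file and a user name, the htpasswd command prompts for a password and either adds the user or updates the existing entry, and its -D option deletes an entry. The wrapper function names below are hypothetical, and the -D option is assumed to be present in your version of htpasswd; see htpasswd(1) to confirm.

```shell
# Password file named in this section.
NAGIOS_HTPASSWD=/opt/hptc/nagios/etc/htpasswd.users

# Add a user, or change the password of an existing user (prompts for it).
nagios_set_password() {
    htpasswd "$NAGIOS_HTPASSWD" "$1"
}

# Delete a user (assumes an htpasswd version that provides -D).
nagios_delete_user() {
    htpasswd -D "$NAGIOS_HTPASSWD" "$1"
}
```

For example, nagios_set_password nagiosadmin changes the password for the usual login name.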

Nagios offers various views of the HP XC system. These views include the following:

  • Tactical Overview

  • Service Detail

  • Host Detail

  • Hostgroup Overview

  • Hostgroup Summary

  • Hostgroup Grid

  • Servicegroup Overview

  • Servicegroup Summary

  • Servicegroup Grid

  • Status Map

  • 3-D Status Map

  • Service Problems

  • Host Problems

  • Network Outages

  • Downtime

  • Process Info

  • Performance Info

  • Scheduling Queues

For administrators of systems composed of tens of nodes, the Service Detail view provides a good overview of the system. It lists the Nagios hosts and shows their status. Figure 6-4 is an example of the Nagios Service Detail view.

Figure 6-4 Nagios Service Detail View

The Service Problems view is more useful for administrators of systems composed of hundreds or thousands of nodes. It provides a practical overview of the system, including the number of Nagios hosts that are up or down and Nagios service status totals. Figure 6-5 is an example of the Nagios Service Problems view.

Figure 6-5 Nagios Service Problems View

Both views offer color coding so that you can detect problems at a glance.

Note:

The term Hosts on the Nagios window refers to any entity with an IP address, not just nodes.

For example, if an HP XC system consists of four nodes and two switches, Nagios monitors all of them and reports on the status of six hosts. SFS is also an example of a Nagios host; Nagios finds the name of the SFS server and displays its data.

Keep this in mind when using the Nagios application.

The HP XC System Software provides plug-ins that monitor these and other system statistics.

Ensure that the following services are available:

  • httpd

    The Nagios application continues to run even if this service is stopped, but you will not be able to use the Web interface.

  • nagios

  • supermon

    The Nagios application continues to run, but it is unable to gather supermon metrics without this service.

  • mond

    The Nagios application continues to run, but it is unable to gather supermon metrics without this service.
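
One way to confirm that these services are available is to query each one with the service command. The following is a minimal sketch: the function name check_nagios_services is hypothetical, and it assumes that service name status returns a zero exit status when the service is running and that each service can be queried on the node where you run the function.

```shell
# Hypothetical helper: report the state of each service that Nagios relies on.
# Returns nonzero if any of them is not running.
check_nagios_services() {
    missing=0
    for svc in httpd nagios supermon mond; do
        # Assumption: a zero exit status from "service ... status" means running.
        if service "$svc" status >/dev/null 2>&1; then
            echo "$svc: running"
        else
            echo "$svc: not running"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Because the function returns a nonzero status when any service is down, it can be used in a condition, for example check_nagios_services || echo "Nagios monitoring is degraded".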

Nan Notification Aggregator and Delimiter

The HP XC System Software incorporates the Nan notification aggregator and delimiter for the Nagios paging system. Nan is an open source supplement to the Nagios application.

Nagios can send large numbers of messages, especially when the system is starting up, shutting down, or experiencing a failure. The Nan front-end utility overcomes this problem by collecting, batching, and reformatting these messages so that they are sent in a controlled manner. You can configure how multiple concatenated notifications are sorted so that the most important notifications appear at the top.

The Nan utility consists of the following:

  • A daemon, nand

    This daemon is started on the Nagios master node.

  • A client, nanc

    The nanc client is configured as the Nagios email command.

  • The nand daemon configuration file, /opt/hptc/nagios/etc/nand.conf

  • The nanc client configuration file, /opt/hptc/nagios/etc/nanc.conf

When the nand daemon receives a notification from Nagios, it starts a timer. If the notification is a PROBLEM or RECOVERY, the default time is 300 seconds; if the notification is an ACKNOWLEDGEMENT, the default time is 600 seconds. Subsequent Nagios notifications are queued in the /opt/hptc/nagios/var/nanqueue directory until the specified time elapses. Then the nand daemon sends a condensed message based on the following criteria:

  • The delivery method: pager, e-mail, and so on

    If the delivery method is a pager, the number of messages influences the format of the condensed message.

  • The corresponding destination address: the pager telephone number or the e-mail address

  • The notification type: PROBLEM, ACKNOWLEDGEMENT, or RECOVERY
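
The batching behavior described above can be illustrated with a small sketch. This is not the Nan implementation: the function names and the METHOD|ADDRESS|TYPE|TEXT input format are hypothetical and chosen only for illustration. The sketch shows the default delay for each notification type and how queued notifications might be grouped by delivery method, destination address, and notification type into one condensed line per group.

```shell
# Default delay, in seconds, before a condensed message is sent
# (values taken from the description above).
default_delay() {
    case $1 in
        PROBLEM|RECOVERY) echo 300 ;;
        ACKNOWLEDGEMENT)  echo 600 ;;
        *) return 1 ;;
    esac
}

# Read queued notifications on standard input, one per line, in the
# hypothetical form METHOD|ADDRESS|TYPE|TEXT, and print one condensed
# line per (METHOD, ADDRESS, TYPE) group in first-seen order.
condense_notifications() {
    awk -F'|' '
        {
            key = $1 "|" $2 "|" $3
            if (!(key in count)) order[++n] = key   # remember first-seen order
            count[key]++
            text[key] = text[key] (count[key] > 1 ? "; " : "") $4
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s (%d): %s\n", order[i], count[order[i]], text[order[i]]
        }'
}
```

With this grouping, two PROBLEM notifications for the same e-mail address collapse into a single line that carries a count of 2, while a RECOVERY notification for that address is reported on its own line.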

The default values specified in the nand.conf and nanc.conf configuration files are appropriate for most HP XC systems; however, you can change these values to suit your installation. Follow these steps when updating either or both of these configuration files.

  1. Log in as the superuser on the Nagios master node.

  2. Edit the nanc.conf or nand.conf file.

  3. Use the service command to stop the nagios daemon:

    # service nagios stop

  4. Use the service command to start the nagios daemon again:

    # service nagios start

Supermon

Supermon is a highly scalable, high-speed cluster monitoring system. Supermon provides all required node statistics to the Nagios subsystem. System statistics are tiered, aggregated, and stored in the HP XC System Software database.

The Supermon components consist of the kernel modules to collect the statistics, the mond and supermond daemons, and the script to load and configure the daemons.

The data collected by Supermon includes system performance metrics and environment data, such as fan, temperature, and power supply status. This data is collected on a regular basis.

The syslog and syslog-ng Services

The syslog service runs on each node in the HP XC system. Each syslog daemon captures log information and sends it to a regional node, which acts as an aggregator. A regional node is assigned to each client node.

The syslogng_forward service on each regional node enables the node to act as a log aggregator for the global node. Log information is gathered, consolidated, and forwarded to the global node; the global node is not necessarily the head node.

Nagios has a syslog plug-in, check_syslogAlerts, that applies a set of rules against all the events in the consolidated log file and generates alerts for those events that match one of the rules. The rules reside in the /opt/hptc/nagios/etc/syslogAlertRules file. You can modify this rules file if you want to add rules.

The collectl Utility

You can use the collectl utility to collect data on the nodes of the HP XC system. As a development or debug tool, the collectl utility typically gathers more detail more frequently than the supermon utility. The collectl utility does have some overhead, but for most situations, it consumes less than 0.1 percent of the CPU and has minimal effect on user applications. However, even this low level can have a significant impact on some applications, so use the collectl utility with care.

The collectl utility also enables you to play back the data either as raw ASCII characters or in a plot form, which can be used to display the data with GnuPlot or Microsoft Excel. Figure 6-6 shows one example of a plotted graph based on the collectl utility's collection of CPU data. Example 6-1 provides an illustration of the collectl utility's ASCII output.

Figure 6-6 Plotted Output from the collectl Utility

You can use any of the following methods to run the collectl utility:

  • From the command line

  • As a service

  • In a batch job submission

Running the collectl Utility from the Command Line

The default action of this utility is to collect data at 10-second intervals and to display the data in ASCII characters on the terminal screen. Example 6-1 shows the invocation and first record reported from the collectl utility. The information has been edited to fit horizontally on the page.

Example 6-1 Using the collectl Utility from the Command Line

# collectl
waiting for 10 second sample...

### RECORD    1 >>> n3 <<< (m.n) (date and time stamp) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# USER NICE SYS IDLE WAIT INTR CTXSW PROC RUNQ RUN AVG1 AVG5 AVG15
     0    0   0   99    0 1055    65    0  151   0 0.02 0.04  0.00

# DISK SUMMARY (/sec)
#Reads  R-Merged  R-KBytes   Writes  W-Merged  W-KBytes
     0         0         0        5         7        51

# MEMORY STATISTICS
#<---------------------Physical Memory------------------->
#   TOTAL    USED    FREE    BUFF  CACHED    SLAB  MAPPED
    3965M   1255M   2710M 129800K 920732K  89484K 157068K

<-----------Swap----------><-Inactive-><Pages/sec>
   TOTAL    USED    FREE     TOTAL     IN    OUT
   6141M       0   6141M   368352K      0     51

# NETWORK SUMMARY (/sec)
#InPck  InErr OutPck OutErr   Mult   ICmp   OCmp    IKB    OKB
     8      0      3      0      0      0      0      0      0

# SOCKET STATISTICS
#      <-------------Tcp------------->   Udp   Raw   <---Frag-->
#Used  Inuse Orphan    Tw  Alloc   Mem  Inuse Inuse  Inuse   Mem
  146     33      0    13     51     0     27     1      0     0

# TCP SUMMARY (/sec)
# PureAcks HPAcks   Loss FTrans
         1      0      0      0

The collectl utility provides alternate output formats:

  • Use the --M 1 option to display the output in a single line for a more compressed and easier to read format. Be aware that this option may not produce all the fields.

  • Use the --oT option to timestamp the data.

  • Use the --oD and --od options to provide two formats of a date and timestamp.

For a discussion of the options to the collectl utility and a description of its output, see collectl(1).

Running the collectl Utility as a Service

After it is enabled, the collectl utility can be run as a service. You can use the service command to stop and start the collectl service. You can also obtain the current status of this service, as shown in the following example:

# service collectl status
collectl (pid process_id) is running...

The collectl service is set up to collect normally reported summary data and to write it in a compressed text file in the /var/log/collectl directory.

The actions of the collectl service are specified by the /opt/hptc/config/services/collectl.ini file.

By default, the collectl service gathers information on the following subsystems:

  • CPU

  • Disk

  • Inode and file system

  • Lustre file system

  • Memory

  • Networks

  • Sockets

  • TCP

  • Interconnect

The collectl(1) manpage discusses running the collectl utility as a service.

Running the collectl Utility in a Batch Job Submission

You can run the collectl utility as one job in a batch job submission. In a batch job submission, the purpose of the collectl utility is to monitor the node while the batch job processes. You must modify the job submission script, as follows:

  1. Determine on which node the collectl utility is to be run.

  2. Decide which options you need. Typically, you specify the following:

    • The output file, specified with the -f option.

    • The subsystem data to collect, specified with the -s option. The subsystems include the following:

      CPU                      Memory
      Disk                     Networks
      Inode and file system    NFS V3 data
      Interconnect             TCP

    • The number of seconds in the sampling interval, specified with the -i option.

  3. Start the collectl utility on each node with the ssh utility.

    Be sure to run the collectl utility in the background so that the script does not hang while waiting for the collectl utility to complete.

    Collect the process ID for the collectl utility on each node.

    Allow the collectl utility from 5 to 10 seconds to start and quiesce.

  4. Start the batch job and allow it to complete.

  5. When the batch job completes, stop the collectl process on each node by killing its process ID. The collectl process traps the SIGINT signal and shuts down cleanly.

  6. Copy the files that the collectl process created on the node's disk, and store them in a separate location for later review.

    Delete the files that the collectl process created.
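
Steps 3 through 6 can be sketched as a shell function for the job submission script. This is an illustration under stated assumptions, not the HP XC procedure itself: the function name monitor_job, its argument order, and the SETTLE_SECONDS and FLUSH_SECONDS variables are hypothetical, and the sketch assumes that ssh and scp work without a password between the nodes. Only the -f and -i options and the use of SIGINT come from the steps above; add the -s option to the collectl command line if you want to restrict the subsystems.

```shell
# Hypothetical wrapper for steps 3 through 6.
#   $1  space-separated list of node names
#   $2  directory on each node for collectl output (passed to -f)
#   $3  sampling interval in seconds (passed to -i)
#   $4  local directory in which to archive the collected files
#   $5+ the batch job command and its arguments
monitor_job() {
    nodes=$1 outdir=$2 interval=$3 archive=$4
    shift 4

    # Step 3: start collectl in the background on each node and record its
    # process ID. Redirecting all three streams lets ssh return at once.
    started=""
    for node in $nodes; do
        pid=$(ssh "$node" "collectl -f $outdir -i $interval </dev/null >/dev/null 2>&1 & echo \$!")
        started="$started $node:$pid"
    done
    sleep "${SETTLE_SECONDS:-10}"    # allow collectl to start and quiesce

    # Step 4: run the batch job and remember its exit status.
    job_status=0
    "$@" || job_status=$?

    # Step 5: stop each collectl process with SIGINT so it shuts down cleanly.
    for entry in $started; do
        ssh "${entry%%:*}" "kill -INT ${entry##*:}"
    done
    sleep "${FLUSH_SECONDS:-2}"      # assumed grace period for final writes

    # Step 6: copy the files to the archive, then delete them from the node.
    for entry in $started; do
        node=${entry%%:*}
        mkdir -p "$archive/$node"
        scp -q "$node:$outdir/*" "$archive/$node/" &&
            ssh "$node" "rm -f $outdir/*"
    done
    return "$job_status"
}
```

The function returns the exit status of the batch job, and the files on a node are deleted only if the copy from that node succeeds, so a failed copy does not lose data.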

Another alternative is to log in to one of the compute nodes used by the application, and run the collectl utility on the command line.

© 2003 Hewlett-Packard Development Company, L.P.