HP XC System Software : Administration Guide > Chapter 8 Monitoring the System with Nagios

Nagios Report Generator Utility


The HP XC System Software includes the Nagios report generator utility, nrg. This utility is a command-line interface that you can use to obtain usage information and to diagnose the state of the Nagios engine.

This utility can generate a list of the status of the Nagios hosts and Nagios services, a compact display of the status of the HP XC system, and an analysis of the system state.

The nrg utility reads the Nagios status.log file to help you diagnose and resolve Nagios problems. Example 8-1 shows the two-column analysis that the analyze mode produces. The left column identifies the Nagios hosts affected, and the right column describes the warning or error and suggests possible corrective actions.

Example 8-1 The nrg Utility System State Analysis

# nrg --mode analyze
Nodelist                 Description
----------------------   ---------------------------------------------------
n[3-7] nh                [Environment - NODATA] No sensor data is available
                         for reporting.  Use 'shownode metrics sensors --
                         last 20m node xxxx' for each of these nodes to
                         verify if sensor data has been recently collected.
                         This status is drawn from the same source as the
                         shownode metrics sensors command.  Look at the
                         status of the 'Sensor Collection Monitor' as that
                         plug-in causes the population of this data.

nh                       [Host Monitor - Critical] A significant percentage
                         of nodes are reported as down, you can run
                         check_node_status --list to see what Nagios
                         believes the  state of the nodes are.

nh                       [LSF Failover Monitor - Critical] The LSF demon is
                         reporting as down.  If failover is disabled (try
                         'controllsf show failover'), you can attempt to
                         restart LSF with 'controllsf start'.  If failover
                         is enabled and you see this message, it is likely
                         that all of your nodes with the resource management
                         role are down, or there is a fatal LSF
                         configuration error (look at the LSF log files).

n[3-7] nh                [NodeInfo - ASSUMEDOK] Pending services are normal,
                         they indicate data has not yet been received by the
                         Nagios engine.  Service *may* be fine, but if it
                         continues to pend for more then about 30 minutes it
                         may indicate data is not being collected.

n[3-7]                   [PING Interconnect - Critical] This typically
                         indicates a node is down, however, it could also
                         indicate a non-functioning interconnect if the
                         nodes is up and operational.

nh                       [Resource Monitor - NOOUTPUT] A service has failed
                         to return an output status.  Typically this
                         indicates a plug-in failure.  Run the plug-in
                         directly to observe any error conditions.  In some
                         cases, this exact message is returned from
                         check_nrpe when a nrpe directive is failing to
                         execute a command.  If you can determine which nrpe
                         command is being requested by the associated plug-
                         in (see /opt/hptc/nagios/etc/nrpe_local.cfg for a
                         list) you can test it using the 'check_nrpe -H
                         nodename -c command' plug-in.

nh                       [Sensor Collection Monitor - Critical] Many nodes
                         have returned warning or critical sensor status.
                         If message is 'Service Timeout', collection is
                         taking too long (>5 minutes or so).  This could
                         indicate a problem on one of the nodes in the
                         console_network role (shownode roles
                         console_network) or a problem running ipmitool.
                         Try running the sensors command directly, time
                         /opt/hptc/supermon/bin/sensors

nh                       [Slurm Monitor - Critical] 'sinfo' reported
                         problems with nodes in some partitions,
                         specifically, some nodes may be marked with an '*'
                         which indicates they may be unresponsive to SLURM.
                         Run 'sinfo' for more information.

n[3-7]                   [Slurm Status - Critical] sinfo reported problems
                         with partitions for this node

nh                       [Supermon Metrics Monitor - Critical] The metrics
                         monitor has returned a critical status indicating a
                         number of nodes have reported critical thresholds.
                         If the actual status is 'Service timed out' then
                         the monitor has taken too long to complete a single
                         iteration.  To verify this run the monitor
                         manually: 'time
                         /opt/hptc/supermon/bin/storeMetrics' from the head
                         node to see if it takes more then about 2-3 minutes
                         (max on a large cluster)

nh                       [Syslog Alert Monitor - NOOUTPUT] The
                         check_syslogalerts plug-in failed to return any
                         status.  This could indicate a problem with the
                         consolidated log or resources needed to execute the
                         plug-in.

nh                       [System Event Log Monitor - NRPEUNABLETOREAD]
                         Indicates the remote command request to check the
                         SEL logs for a group of nodes has failed to return
                         any status.  This may indicate a failure of the
                         check_selmon command. The System Event Log Monitor
                         must proxy to a console connected node to collect
                         console related data.  NRPE is used to proxy these
                         requests to a console connected node. These nodes
                         are identified as members of the 'console_network'
                         role Verify that the check_selmon command can run
                         on those nodes, i.e.,  schedule_service --directive
                         check_selmon_for_mh_xxxxxx where xxxxxx is the name
                         of the management hub reporting the failure.  Look
                         at nagios.log and the consolidated.log to see if
                         there are any indications of failures for NRPE

nh                       [System Event Log - IPMICONNECTFAIL] The check_sel
                         plug-in failed to connect to the console port for
                         this node, common cause is the console device cp-
                         xxxxx, is not reachable.  If this is the head node
                         and the head node is externally connected, you may
                         be able to define cp-xxxxx in /etc/hosts using the
                         external IP to allow connectivity.  Sensor 
                         collection may not be possible when using
                         externally connected console ports for head nodes
                         on platforms  that use IPMI to gather sensor
                         information. If this is not the head node then it
                         may indicate a communication problem with the
                         associated console device 'cp-{nodename}'.

n[3-7] nh                [System Free Space - ASSUMEDOK] Pending services
                         are normal, they indicate data has not yet been
                         received by the Nagios engine.  Service *may* be
                         fine, but if it continues to pend for more then
                         about 30 minutes it may indicate data is not being
                         collected.
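
The two-column layout in Example 8-1 is regular enough to post-process with a script. The following is a minimal sketch, not part of the HP XC software, that assumes each finding starts in column 0 with the node list followed by a bracketed `[Service - STATUS]` tag, and that wrapped description lines are indented. The names `Finding`, `parse_analysis`, and `SAMPLE` are hypothetical; the sample text is copied from two entries of Example 8-1.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed layout (inferred from Example 8-1 only):
#   <node list>   [<service> - <STATUS>] <start of description>
#                 <indented continuation lines>
ENTRY_START = re.compile(
    r"^(?P<nodes>\S.*?)\s+\[(?P<service>.+?) - (?P<status>[^\]\s]+)\]\s*(?P<text>.*)$"
)


@dataclass
class Finding:
    nodes: str        # node list exactly as printed, e.g. "n[3-7] nh"
    service: str      # Nagios service or monitor name
    status: str       # status tag, e.g. "Critical" or "NOOUTPUT"
    description: str  # description text with wrapped lines rejoined


def parse_analysis(report: str) -> List[Finding]:
    """Split 'nrg --mode analyze' style text into one record per finding."""
    findings: List[Finding] = []
    for line in report.splitlines():
        match = ENTRY_START.match(line)
        if match:
            findings.append(
                Finding(match["nodes"], match["service"],
                        match["status"], match["text"].strip())
            )
        elif findings and line[:1].isspace() and line.strip():
            # Indented, non-blank line: continuation of the previous finding.
            # Words the tool hyphenated at a line wrap are not rejoined here.
            last = findings[-1]
            last.description = (last.description + " " + line.strip()).strip()
        # Header, separator, and blank lines fall through and are ignored.
    return findings


SAMPLE = """\
Nodelist                 Description
----------------------   ---------------------------------------------------
n[3-7]                   [PING Interconnect - Critical] This typically
                         indicates a node is down, however, it could also
                         indicate a non-functioning interconnect if the
                         nodes is up and operational.

n[3-7]                   [Slurm Status - Critical] sinfo reported problems
                         with partitions for this node
"""

for finding in parse_analysis(SAMPLE):
    print(finding.nodes, "|", finding.service, "|", finding.status)
```

In practice the input would come from a saved copy of the command output rather than an inline string. The parser keeps the node list as printed and does not expand ranges such as `n[3-7]`.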

This utility can generate:

  • A list of nodes according to their severity:

    • Critical

    • Warning

    • Ok

    • Unknown

    • Pending

  • The status of the Nagios hosts.

  • The status of the Nagios services.

  • The status of the Nagios monitors.

  • A list of nodes that are up or down.

For additional information about this utility, see nrg(8). Help is also available by entering the following command:

# nrg --help

    --help
    --verbose - Report more details
    --log|l - logfile, default $statuslog
    --severity [c,w,o,u,p] - default is all
          c - critical, w - warning, o - ok,
          u - unknown,  p - pending
    --hosts - Only list hosts status
    --services - Only list service status
    --monitors - Only list monitor status
    --up   - Only up nodes
    --down - Only down nodes
    --sort t,h,s - Sort by (t)ime, (h)ost, (s)ervice
    --sort [c,w,o,u,p] - Summary mode only (as cwoup as in severity)
    --mode - Report mode: (f)ull, (s)ummary, (r)aw, (w)atch, (a)nalyze
    --html - Generate html instead of plain text
    --prefix pppp - Specify node prefix (for imported status.logs)
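
The options in the help text can be assembled programmatically, for example when a wrapper script calls nrg. The following is a hedged sketch: `build_nrg_command` and its parameters are hypothetical names, and it makes several assumptions that the help text does not confirm. It assumes that `--mode` accepts the spelled-out names (only `analyze` appears in Example 8-1), that `--severity` takes a single letter as a separate argument, and that these options can be combined. Check nrg(8) before relying on any of this.

```python
import shlex
from typing import List, Optional

# Values taken from the nrg --help text above.
MODES = ("full", "summary", "raw", "watch", "analyze")
SEVERITIES = {"c": "critical", "w": "warning", "o": "ok",
              "u": "unknown", "p": "pending"}
SCOPES = ("hosts", "services", "monitors", "up", "down")


def build_nrg_command(mode: Optional[str] = None,
                      severity: Optional[str] = None,
                      scope: Optional[str] = None,
                      html: bool = False) -> List[str]:
    """Return an argv list for nrg; rejects values not in the help text."""
    argv = ["nrg"]
    if mode is not None:
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        argv += ["--mode", mode]
    if severity is not None:
        # Assumption: one severity letter per invocation.
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity!r}")
        argv += ["--severity", severity]
    if scope is not None:
        if scope not in SCOPES:
            raise ValueError(f"unknown scope: {scope!r}")
        argv.append("--" + scope)
    if html:
        argv.append("--html")
    return argv


# Print the command line instead of running it, since nrg exists only
# on an HP XC system.
print(shlex.join(build_nrg_command(mode="summary", severity="c")))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns.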
© 2003 Hewlett-Packard Development Company, L.P.