The HP XC System Software includes the Nagios report generator utility, nrg. This command-line utility lets you obtain usage information and diagnose the state of the Nagios engine. It can generate a list of the status of the Nagios hosts and Nagios services, a compact display of the status of the HP XC system, and an analysis of the system state.

The nrg utility uses the Nagios status.log file to help you determine the solution for a Nagios problem. Example 8-1 shows the two-column analysis. The left column identifies the Nagios host, and the right column describes the warning or error and possible actions for resolving it.

Example 8-1 The nrg Utility System State Analysis
# nrg --mode analyze
Nodelist Description
---------------------- ---------------------------------------------------
n[3-7] nh [Environment - NODATA] No sensor data is available
for reporting. Use 'shownode metrics sensors --
last 20m node xxxx' for each of these nodes to
verify if sensor data has been recently collected.
This status is drawn from the same source as the
shownode metrics sensors command. Look at the
status of the 'Sensor Collection Monitor' as that
plug-in causes the population of this data.
nh [Host Monitor - Critical] A significant percentage
of nodes are reported as down, you can run
check_node_status --list to see what Nagios
believes the state of the nodes are.
nh [LSF Failover Monitor - Critical] The LSF demon is
reporting as down. If failover is disabled (try
'controllsf show failover'), you can attempt to
restart LSF with 'controllsf start'. If failover
is enabled and you see this message, it is likely
that all of your nodes with the resource management
role are down, or there is a fatal LSF
configuration error (look at the LSF log files).
n[3-7] nh [NodeInfo - ASSUMEDOK] Pending services are normal,
they indicate data has not yet been received by the
Nagios engine. Service *may* be fine, but if it
continues to pend for more then about 30 minutes it
may indicate data is not being collected.
n[3-7] [PING Interconnect - Critical] This typically
indicates a node is down, however, it could also
indicate a non-functioning interconnect if the
nodes is up and operational.
nh [Resource Monitor - NOOUTPUT] A service has failed
to return an output status. Typically this
indicates a plug-in failure. Run the plug-in
directly to observe any error conditions. In some
cases, this exact message is returned from
check_nrpe when a nrpe directive is failing to
execute a command. If you can determine which nrpe
command is being requested by the associated plug-
in (see /opt/hptc/nagios/etc/nrpe_local.cfg for a
list) you can test it using the 'check_nrpe -H
nodename -c command' plug-in.
nh [Sensor Collection Monitor - Critical] Many nodes
have returned warning or critical sensor status.
If message is 'Service Timeout', collection is
taking too long (>5 minutes or so). This could
indicate a problem on one of the nodes in the
console_network role (shownode roles
console_network) or a problem running ipmitool.
Try running the sensors command directly, time
/opt/hptc/supermon/bin/sensors
nh [Slurm Monitor - Critical] 'sinfo' reported
problems with nodes in some partitions,
specifically, some nodes may be marked with an '*'
which indicates they may be unresponsive to SLURM.
Run 'sinfo' for more information.
n[3-7] [Slurm Status - Critical] sinfo reported problems
with partitions for this node
nh [Supermon Metrics Monitor - Critical] The metrics
monitor has returned a critical status indicating a
number of nodes have reported critical thresholds.
If the actual status is 'Service timed out' then
the monitor has taken too long to complete a single
iteration. To verify this run the monitor
manually: 'time
/opt/hptc/supermon/bin/storeMetrics' from the head
node to see if it takes more then about 2-3 minutes
(max on a large cluster)
nh [Syslog Alert Monitor - NOOUTPUT] The
check_syslogalerts plug-in failed to return any
status. This could indicate a problem with the
consolidated log or resources needed to execute the
plug-in.
nh [System Event Log Monitor - NRPEUNABLETOREAD]
Indicates the remote command request to check the
SEL logs for a group of nodes has failed to return
any status. This may indicate a failure of the
check_selmon command. The System Event Log Monitor
must proxy to a console connected node to collect
console related data. NRPE is used to proxy these
requests to a console connected node. These nodes
are identified as members of the 'console_network'
role Verify that the check_selmon command can run
on those nodes, i.e., schedule_service --directive
check_selmon_for_mh_xxxxxx where xxxxxx is the name
of the management hub reporting the failure. Look
at nagios.log and the consolidated.log to see if
there are any indications of failures for NRPE
nh [System Event Log - IPMICONNECTFAIL] The check_sel
plug-in failed to connect to the console port for
this node, common cause is the console device cp-
xxxxx, is not reachable. If this is the head node
and the head node is externally connected, you may
be able to define cp-xxxxx in /etc/hosts using the
external IP to allow connectivity. Sensor
collection may not be possible when using
externally connected console ports for head nodes
on platforms that use IPMI to gather sensor
information. If this is not the head node then it
may indicate a communication problem with the
associated console device 'cp-{nodename}'.
n[3-7] nh [System Free Space - ASSUMEDOK] Pending services
are normal, they indicate data has not yet been
received by the Nagios engine. Service *may* be
fine, but if it continues to pend for more then
about 30 minutes it may indicate data is not being
collected.
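
The Environment entry in Example 8-1 suggests running the shownode command once for each node in a compact list such as n[3-7]. The following is a minimal sketch of that step, not part of the HP XC software: expand_nodelist is a hypothetical helper that handles only the single prefix[lo-hi] form shown in the example, and the option is assumed to read --last (the analysis text wraps in the middle of it). The loop only prints the commands so that they can be reviewed before being run.

```shell
# Hypothetical helper: expand a compact node list such as n[3-7] into one
# node name per line. Only the "prefix[lo-hi]" form is handled; any other
# string (for example, nh) is printed unchanged.
expand_nodelist() {
  case "$1" in
    *\[*-*\])
      prefix=${1%%\[*}     # text before "[", for example "n"
      range=${1#*\[}       # "3-7]"
      range=${range%\]}    # "3-7"
      lo=${range%-*}       # lower bound
      hi=${range#*-}       # upper bound
      i=$lo
      while [ "$i" -le "$hi" ]; do
        echo "${prefix}${i}"
        i=$((i + 1))
      done
      ;;
    *)
      echo "$1"
      ;;
  esac
}

# Dry run: print (do not execute) the sensor check for each node.
# "--last 20m" is assumed from the wrapped text in the analysis output.
for node in $(expand_nodelist 'n[3-7]'); do
  echo "shownode metrics sensors --last 20m node $node"
done
```

To run the checks instead of printing them, remove the echo from the loop on a system where shownode is available.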
This utility can also generate a list of nodes according to their severity, or a list of nodes that are up or down.
For additional information about this utility, see nrg(8). Help is also available by entering the following command:
# nrg --help
--help
--verbose - Report more details
--log|l - logfile, default $statuslog
--severity [c,w,o,u,p] - default is all
c - critical, w - warning, o - ok,
u - unknown, p - pending
--hosts - Only list hosts status
--services - Only list service status
--monitors - Only list monitor status
--up - Only up nodes
--down - Only down nodes
--sort t,h,s - Sort by (t)ime, (h)ost, (s)ervice
--sort [c,w,o,u,p] - Summary mode only (as cwoup as in severity)
--mode - Report mode: (f)ull, (s)ummary, (r)aw, (w)atch, (a)nalyze
--html - Generate html instead of plain text
--prefix pppp - Specify node prefix (for imported status.logs)
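
The severity codes and report modes in the help text can be combined on one command line. The sketch below is illustrative only: severity_code and nrg_cmd are hypothetical helpers, and the exact argument forms (the full word summary for --mode, by analogy with the analyze mode used in Example 8-1, and a single letter for --severity) are assumptions to confirm against nrg(8). The helper prints the command line instead of running it.

```shell
# Hypothetical helper: map a readable severity name to the one-letter code
# listed in the nrg help text (c, w, o, u, p).
severity_code() {
  case "$1" in
    critical) echo c ;;
    warning)  echo w ;;
    ok)       echo o ;;
    unknown)  echo u ;;
    pending)  echo p ;;
    *) echo "unrecognized severity: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print an nrg command line for a summary report
# limited to one severity. The argument forms are assumptions; see nrg(8).
nrg_cmd() {
  code=$(severity_code "$1") || return 1
  echo "nrg --mode summary --severity $code"
}

nrg_cmd critical   # prints: nrg --mode summary --severity c
```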