HP XC System Software : Release Notes > Chapter 14 Documentation Notes

Information Omitted From the HP XC System Software Administration Guide

The following sections provide information that was omitted from the HP XC System Software Administration Guide.

System Event Logs

This new functionality was delivered to your HP XC system through the PK01 patch for Version 3.0.

Each hardware platform provided by HP supplies an event logging mechanism that captures platform-specific events so that hardware states and changes can be tracked. The contents of the system event log (SEL) vary by platform, but they typically include the following:

  • Memory ECC errors

  • Power supply failures

  • Voltage problems

Event logs are stored by the firmware and can become full over time. Some hardware models require regular maintenance to clear the logs to avoid losing critical events. In addition, errors that indicate failure or pending failure of a component need to be brought to the operator's immediate attention.

The HP XC system event log functionality provides complete management of all log types of supported HP platforms. Log information is regularly read and archived, and the information is used to generate Nagios alerts when applicable. Logs that approach a critical size are cleared to prevent event loss.

Event logs are typically accessed through the management port, which requires platform-specific and protocol-specific user authentication as well as network access to the console port (cp-nxxx, where nxxx is the node number).

System event log history is captured in the /hptc_cluster/adm/logs/sel/sel-nxxx.log file, where nxxx represents the name of the individual node.
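
As a minimal sketch of how the naming convention above can be used, the following shell functions build the history file path for a given node and print its most recent entries. The function names and the 20-line limit are hypothetical; only the directory and the sel-nxxx.log pattern come from the text above.

```shell
#!/bin/sh
# Build the SEL history file path for a node name such as n12.
# The directory and file name pattern come from the release notes;
# the function names are hypothetical.
sel_log_path() {
    printf '/hptc_cluster/adm/logs/sel/sel-%s.log\n' "$1"
}

# Print the most recent entries for one node, if its log is readable.
show_recent_sel() {
    log=$(sel_log_path "$1")
    if [ -r "$log" ]; then
        tail -n 20 "$log"
    else
        echo "no readable SEL log for $1 at $log" >&2
        return 1
    fi
}
```

For example, show_recent_sel n12 would read /hptc_cluster/adm/logs/sel/sel-n12.log, assuming a node named n12 exists on your system.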

System event logs are managed by the standard logrotate functionality. For more information on this utility, see logrotate(8).

Configuring System Event Logging

This new functionality was delivered to your HP XC system through the PK01 patch for Version 3.0.

System event log and hardware sensor information is gathered using the input interface for platforms without an iLO management port. Some platforms require additional user name and password setup to allow access to the BMC/IPMI connection on the console port. In addition, depending on how your head node is attached to the network, additional password setup may be required.

On HP XC systems whose head node is either an HP CP6000 HP Integrity system or an HP CP4000 HP ProLiant system, you can obtain the sensor and system event log information only remotely.

See “Required Task: Configure the BMC Password On Itanium Systems” for instructions to configure the BMC password on HP Integrity systems.

Modifying System Event Log Rotation and Nagios Alerts

This new functionality was delivered to your HP XC system through the PK01 patch for Version 3.0.

You can change how the system event logs are rotated through the logrotate utility, and you can change the rules that determine when Nagios raises an alert.
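
The following sketch shows what a logrotate stanza for the system event logs might look like and how to check it without rotating anything. The directives are standard logrotate(8) directives, but the stanza and its values are assumptions for illustration, not the configuration shipped with HP XC. The location of the shipped configuration is not stated here. The function name is hypothetical.

```shell
#!/bin/sh
# Write a hypothetical logrotate stanza for the SEL history files.
# The values (weekly, rotate 8) are illustrative assumptions only.
write_sel_stanza() {
    # $1: file to write the stanza to
    cat > "$1" <<'EOF'
/hptc_cluster/adm/logs/sel/sel-*.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
EOF
}

stanza=$(mktemp)
write_sel_stanza "$stanza"

# Dry run: with -d, logrotate reports what it would do and changes nothing.
if command -v logrotate >/dev/null 2>&1; then
    logrotate -d "$stanza" || true
fi
```

Running the dry run first is a cautious way to confirm that a changed stanza parses before it affects the real logs.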

In this release (with all patches installed), Nagios is able to alert you with power, memory, voltage, and automatic system recovery (ASR) messages.

Nagios alert rules are defined in the /opt/hptc/nagios/etc/selRules file. Edit this file if you want to modify the alert rules.
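
Because the format of the selRules file is not described here, a cautious approach is to keep a copy before you edit it. The following is a minimal sketch of that step; the function name and the timestamped .bak suffix are hypothetical conventions, not part of HP XC.

```shell
#!/bin/sh
# Copy a file to <file>.<timestamp>.bak before editing it, and print
# the name of the copy. The function name is hypothetical.
backup_before_edit() {
    src=$1
    [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
    dest="$src.$(date +%Y%m%d%H%M%S).bak"
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}
```

For example, backup_before_edit /opt/hptc/nagios/etc/selRules saves a copy next to the original, which you can restore if the edited rules do not behave as intended.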

Moving SLURM and LSF to Their Backup Nodes

This procedure is not documented in the HP XC System Software Administration Guide, but it will be included in a future version.

To move the SLURM and LSF daemons from their primary node to their backup node (perhaps due to a maintenance need on the primary node), follow this procedure:

  1. Log in to the backup node as root.

  2. Shut down the backup slurmctld daemon:

    # pkill slurmctld

  3. Use the text editor of your choice to edit the /hptc_cluster/slurm/etc/slurm.conf file. Change the value of the ControlMachine attribute to the backup node, and comment out or change the value of the BackupController attribute.

  4. Save your changes and exit the text editor.

  5. Shut down LSF on the primary node (you do not have to be logged into the primary node to do this).

    Shutting down LSF on the primary node will not impact batch jobs, but it will terminate interactive LSF jobs (jobs submitted with the bsub -I option). Therefore, take the appropriate precautions before running this command (either warn users, or close the LSF queues and wait for all jobs to finish).

    # controllsf stop

  6. Log in to the primary SLURM node and shut down the primary slurmctld daemon:

    # pkill slurmctld

  7. On the backup SLURM node, start the primary slurmctld controller:

    # slurmctld

  8. Start LSF locally on the backup node:

    # controllsf start here

  9. Enter the following command if you want to make the backup node the primary node for LSF:

    # controllsf set primary backup_nodename
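
The slurm.conf edit in step 3 can also be sketched as a sed filter, which makes the change easy to review before it is applied. This is only a sketch under stated assumptions: it assumes each attribute sits on its own line, starts in the first column, and is written as Name=value with that exact capitalization. The function name and the node names n15 and n16 are hypothetical.

```shell
#!/bin/sh
# Print a copy of a slurm.conf with ControlMachine pointed at the new
# primary node and any active BackupController line commented out.
# The original file is not modified. The function name is hypothetical.
promote_backup_controller() {
    conf=$1
    new_primary=$2
    sed -e "s/^ControlMachine=.*/ControlMachine=$new_primary/" \
        -e 's/^BackupController=/#BackupController=/' \
        "$conf"
}

# Possible usage (node name n15 is an example only):
#   promote_backup_controller /hptc_cluster/slurm/etc/slurm.conf n15 > /tmp/slurm.conf.new
#   diff /hptc_cluster/slurm/etc/slurm.conf /tmp/slurm.conf.new
```

Writing the result to a separate file and comparing it with diff lets you confirm that only the two intended lines changed before you replace the live configuration.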

If you set another node to be the BackupController for SLURM, you can log in to that node and run the slurmctld command. For this configuration to persist after future runs of the cluster_config command, the resource_management role must be assigned to the new backup node.

To move LSF and SLURM back to the original primary node, follow the same procedure with the assumption that the original primary node is now the backup node, and the original backup node is now the primary node.

© 2003–2007 Hewlett-Packard Development Company, L.P.