| A node boots to local disk and runs through the node
configuration phase (nconfigure) instead of imaging. | An
nconfig starting entry appears in the imaging.log file. | Verify BIOS settings to ensure that the node is set to network
boot and that the correct network adapter is at the top of the boot order. |
| A node hangs while imaging. | You
can determine when a node hangs during imaging by monitoring the imaging.log file,
which is described in “Monitor an Imaging Session”.
To inspect further, set the appropriate console parameter in
the /tftpboot/pxelinux.cfg/default file before booting the node. | Retry the imaging operation. Verify that the network is functioning properly.
|
| A node is dropped out of
the imaging process. | You
can determine when a node drops out of the imaging process by monitoring the imaging.log file. The node might have dropped out because its network speed fell below the acceptable range. The imaging environment includes
ethtool, which queries the speed of the network
connection with the head node; a node is dropped from the imaging process if
the speed is less than 1000 Mb per second. | Configure the minimum acceptable speed by adding ETHSPEED=n to
the kernel command line. If the reported speed of the network device is greater
than n, imaging proceeds. Setting ETHSPEED=0 forces
imaging to occur unconditionally. |
| Disk device not found. | Identified
by monitoring the imaging.log file or watching the console. | Ensure that the disk is working correctly and is properly seated
in the node. |
| The node configuration phase (nconfig) fails, and the
system is left in single-user mode. | Identified by
monitoring the imaging.log file. The system boots
completely, but the sinfo command does not show the node as available. | Correct the cluster configuration using the cluster_config utility.
Then either use the startsys command to reimage the node or
rerun the nconfigure phase: # service nconfig nconfigure |
| A node spontaneously reboots during imaging. | Verified by multiple “starting imaging” messages
in the rsyncd log file. | Verify
hardware, BIOS, and kernel boot option settings. |
| The network boot times out. | The
system boots from local disk and runs nconfigure. You can verify this by
checking messages written to the imaging.log file. | Verify the DHCP settings and the status of the DHCP daemon. Verify network status and connections. Monitor the /var/log/dhcpd.log file for DHCPREQUEST messages
from the client node MAC address. Check the boot order and BIOS settings. Rerun imaging/booting operations with fewer nodes.
|
| A node configuration (nconfigure) operation fails while
attempting to access the configuration and management database on the head
node. | The system is placed in single-user mode. | Ensure that the mysqld daemon is running
on the head node. Verify network connections. Boot fewer nodes in a single operation.
|
| An imaged node boots correctly, but the node hangs
in the autoinstall script waiting for the first multicast operation. | Verify that the node has started imaging by looking for
“imaging_started” messages in the rsyncd log
file. Verify that no “finished” messages are in the imaging.log file. | Ensure that startsys was used to image
the nodes. Check for instances of flamethrower running on the head node: # ps -aef \| fgrep flamethrower |
| Multicast operation fails. | Verify
that the imaging operation has failed by examining the imaging.log file
and looking for multiple retries of flamethrower. | Verify that the network is quiet. A very busy network can
cause dropped multicast UDP packets. Try this: (1) Stop the imaging operation. (2) Verify that no flamethrower daemons are running. (3) Open the /etc/systemimager/flamethrower.conf file. (4) Comment out the line that contains FEC =. (5) Save the file. (6) Retry the imaging operation.
|
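
Two of the checks in the table can be scripted when triaging many nodes at once. The following is a minimal sketch, not the imaging environment's own implementation: the function names (`speed_from_ethtool`, `speed_ok`, `count_dhcprequests`) are hypothetical, the speed parsing assumes the usual `Speed: 1000Mb/s` line printed by ethtool, and the log matching assumes the common dhcpd message form `DHCPREQUEST for <ip> from <mac> via <interface>`. The table does not say what happens when the reported speed equals the threshold; this sketch lets it pass, which is an assumption.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers mirroring two checks from the table.

# speed_from_ethtool: read `ethtool <iface>` output on stdin and print the
# link speed in Mb/s as a bare integer. Prints nothing if no numeric speed
# is reported (for example, when the link is down).
speed_from_ethtool() {
    sed -n 's/^[[:space:]]*Speed:[[:space:]]*\([0-9][0-9]*\)Mb\/s.*$/\1/p'
}

# speed_ok SPEED THRESHOLD: succeed if imaging would proceed.
# A threshold of 0 always passes (as with ETHSPEED=0). Otherwise the speed
# must be known and at least the threshold (boundary case is an assumption).
speed_ok() {
    speed=$1
    threshold=$2
    [ "$threshold" -eq 0 ] && return 0
    [ -n "$speed" ] && [ "$speed" -ge "$threshold" ]
}

# count_dhcprequests MAC LOGFILE: print the number of DHCPREQUEST lines in
# LOGFILE that come from MAC (case-insensitive match on the MAC address).
count_dhcprequests() {
    grep -i 'DHCPREQUEST' "$2" | grep -icF "from $1" || true
}
```

For example, `ethtool eth0 | speed_from_ethtool` prints the link speed of a placeholder interface `eth0`, and `count_dhcprequests 00:11:22:33:44:55 /var/log/dhcpd.log` counts the requests logged for a placeholder MAC address. A count of zero suggests that the node's requests are not reaching the DHCP daemon.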