Our HPC cluster was lucky enough to double in compute capacity recently. Whoop! The new hardware brought with it some significant changes in rack layout and networking fabric. The compute nodes are a combination of Dell R630, DSS1500, and R730 (for the GPU K80 and Intel Phi nodes).
The existing 10GbE CAT6-based core fabric (made up of Dell Force10 S4820T switches) was replaced by a Dell Force10 Z9500 and fiber (the Z9500 has 132 QSFP+ 40GbE ports, which can in turn be broken out into 528 10GbE SFP+ ports).
Physical changes aside (wiring, top-of-rack 40GbE to 10GbE breakout panels, etc.), the above meant we had to change the primary boot device from the add-on 10GbE CAT6-based NIC to the onboard fiber 10GbE NIC (CentOS 7 sees this interface as eno1).
This required two changes at the system BIOS / NIC hardware config level:
- Enable PXE boot on the NIC
- Modify the BIOS boot order
One method to make these two changes in bulk is the Dell OpenManage command line tool racadm, which is what we decided to use.
The following are notes I took while working on a subset of the compute nodes.
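Before touching anything, it's worth confirming remote racadm can actually reach each node's iDRAC. A minimal sanity check using the same IP range and credentials as the loops below; getsysinfo just dumps basic system information, and the grep is only there to trim the output:
for n in {48..10} ; do
ip=172.16.3.${n}
echo "IP: $ip"
# Quick reachability / credentials check against the iDRAC
racadm -r $ip -u root -p xxxxxxx getsysinfo | grep -i "System Model"
done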
Enable PXE on the fiber interface
The first step is to identify the names of the network interfaces. I queried a single node to get the full list of interfaces, then queried the first interface (.1) just for grins to see what settings were available. In this case the first integrated port is referenced as NIC.nicconfig.1 and NIC.Integrated.1-1-1.
# Get list of Nics
racadm -r 172.16.3.48 -u root -p xxxxxxx get nic.nicconfig
NIC.nicconfig.1 [Key=NIC.Integrated.1-1-1#nicconfig]
NIC.nicconfig.2 [Key=NIC.Integrated.1-2-1#nicconfig]
NIC.nicconfig.3 [Key=NIC.Integrated.1-3-1#nicconfig]
NIC.nicconfig.4 [Key=NIC.Integrated.1-4-1#nicconfig]
NIC.nicconfig.5 [Key=NIC.Slot.3-1-1#nicconfig]
NIC.nicconfig.6 [Key=NIC.Slot.3-2-1#nicconfig]
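# Get the settings available on the first integrated port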
racadm -r 172.16.3.48 -u root -p xxxxxxx get nic.nicconfig.1
[Key=NIC.Integrated.1-1-1#nicconfig]
LegacyBootProto=NONE
#LnkSpeed=AutoNeg
NumberVFAdvertised=64
VLanId=0
WakeOnLan=Disabled
Next we can enable PXE boot on NIC.Integrated.1-1-1 for the set of nodes. In order for the change to take effect you have to create a job and then reboot.
for n in {48..10} ; do
ip=172.16.3.${n}
echo "IP: $ip - configuring nic.nicconfig.1.legacybootproto PXE"
# Get Nic config for integrated port 1
racadm -r $ip -u root -p xxxxxxx get nic.nicconfig.1 | grep Legacy
# Set to PXE
racadm -r $ip -u root -p xxxxxxx set nic.nicconfig.1.legacybootproto PXE
# Verify it's set to PXE (pending)
racadm -r $ip -u root -p xxxxxxx get nic.nicconfig.1 | grep Legacy
# Create a job to enable the changes following the reboot
racadm -r $ip -u root -p xxxxxxx jobqueue create NIC.Integrated.1-1-1
# Reboot so that the config job will execute
ipmitool -I lanplus -H $ip -U root -P xxxxxxx chassis power reset
done
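Before moving on to the boot order it's worth checking that the config jobs actually completed. A minimal sketch; racadm jobqueue view lists the jobs on each iDRAC, and the grep pattern is an assumption about the output format, so adjust as needed:
for n in {48..10} ; do
ip=172.16.3.${n}
echo "IP: $ip - job queue"
# List the job queue and pull out the job name / status lines
racadm -r $ip -u root -p xxxxxxx jobqueue view | grep -E "Job Name|Status"
done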
Configure the BIOS boot order
Now that the NIC has PXE enabled and the changes have been applied, the boot order can be modified. If this fails for a node, it most likely means the job failed to run in the previous step; start debugging there.
for n in {48..10} ; do
ip=172.16.3.${n}
echo "IP: $ip - configuring BIOS.biosbootsettings.BootSeq NIC.Integrated.1-1-1,...."
# Get Bios Boot sequence
racadm -r $ip -u root -p xxxxxxx get BIOS.biosbootsettings.BootSeq | grep BootSeq
# Set Bios boot sequence
racadm -r $ip -u root -p xxxxxxx set BIOS.biosbootsettings.BootSeq NIC.Integrated.1-1-1,NIC.Integrated.1-3-1,NIC.Slot.3-1-1,Optical.SATAEmbedded.J-1,HardDisk.List.1-1
# Create a BIOS reboot job so that the boot order changes are applied
racadm -r $ip -u root -p xxxxxxx jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA
done
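Once the nodes have power cycled, the same get command can be re-run to confirm the new boot sequence took effect:
for n in {48..10} ; do
ip=172.16.3.${n}
echo "IP: $ip - verifying boot sequence"
# NIC.Integrated.1-1-1 should now be listed first
racadm -r $ip -u root -p xxxxxxx get BIOS.biosbootsettings.BootSeq | grep BootSeq
done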
Modify the boot interface in BrightCM
We use Bright Computing Cluster Manager for HPC to manage our HPC nodes (it recently replaced Rocks in our environment). Within BrightCM we had to modify the boot interface for the set of compute nodes. BrightCM provides excellent CLI support, hooray!
cmsh -c "device foreach -n node0001..node0039 (interfaces; use bootif; set networkdevicename eno1); commit"
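A quick spot check on a single node confirms the change; this just mirrors the set above with a get (node0001 picked arbitrarily from the range):
cmsh -c "device; use node0001; interfaces; use bootif; get networkdevicename"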
Update the switch port locations in BrightCM
BrightCM keeps track of the switch port to node NIC mapping. One reason for this is to prevent accidentally imprinting the wrong image on nodes that got swapped (i.e. you remove two nodes for service and insert them back into the rack in the wrong location).
First I had to identify the new port number for a node; I chose the node that would be last in sequence on the switch, which happened to show up in BrightCM as port 171. I found this by PXE booting the compute node: once it comes up, BrightCM notices the discrepancy and displays an interface that lets you manually address the issue, something akin to "node0039 should be on switch XXX port ## but it's showing up on switch z9500-r05-03-38 port 171", blah blah blah.
Instead of manually addressing the issue, it can be done via the CLI in bulk (assuming there's a sequence). Each of our nodes has two NICs wired to the Z9500 (node0039 would be on ports 171 and 172), so in the code below I decrement by 2 ports for each node's boot device.
port=171
for n in {39..26} ; do
# node0039 boots from port 171, node0038 from 169, and so on down the rack
cmsh -c "device; set node00${n} ethernetswitch z9500-r05-03-38:${port}; commit ;"
let port=${port}-2
done
unset port
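And a quick spot check that the mapping landed where expected (again just the get counterpart of the set above):
cmsh -c "device; use node0039; get ethernetswitch"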