Thursday, December 1, 2016

HowTo: Restore a GridScaler GPFS Client Node after Reinstalling the Node

I ran into this issue after reinstalling several compute nodes on our cluster shortly after bringing our new DDN GridScaler GPFS storage cluster online.
$ sudo mmstartup -N c0040
Fri Dec  2 03:36:03 UTC 2016: mmstartup: Starting GPFS ...
c0040:  mmremote: determineMode: Missing file /var/mmfs/gen/mmsdrfs.
c0040:  mmremote: This node does not belong to a GPFS cluster.
mmstartup: Command failed. Examine previous error messages to determine cause.

One method I discovered online was to take the affected node off the network (or reboot it), remove it from the GPFS cluster, and then, once it is back on the network (or fully rebooted), add it back, license it and start it.
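As a rough illustration of that remove-and-re-add method, here is a dry-run sketch. It only prints the commands rather than running them, the helper name readd_gpfs_node is made up for this example, and the exact flags should be checked against the man pages for your GPFS release before running anything.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) the commands that would remove a
# reinstalled client from the GPFS cluster and add it back.
# readd_gpfs_node is a hypothetical helper name. The printed commands are
# meant to be reviewed and then run from a healthy cluster member.
readd_gpfs_node() {
  local node="$1"
  echo "mmdelnode -N ${node}"
  echo "mmaddnode -N ${node}"
  echo "mmchlicense client --accept -N ${node}"
  echo "mmstartup -N ${node}"
}

readd_gpfs_node c0040
```

Printing first makes it easy to eyeball the sequence for a batch of nodes before committing to it.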

Later I was introduced to the mmsdrrestore command (portion of the man page below):
mmsdrrestore command

Restores the latest GPFS system files on the specified nodes.

Synopsis

mmsdrrestore [-p NodeName] [-F mmsdrfsFile] [-R remoteFileCopyCommand]
             [-a | -N {Node[,Node...] | NodeFile | NodeClass}]

Availability

Available on all IBM Spectrum Scale editions.

Description

The mmsdrrestore command is intended for use by experienced
system administrators.

Use the mmsdrrestore command to restore the latest GPFS
system files on the specified nodes. If no nodes are specified,
the command restores the configuration information only on the
node on which it is run. If the local GPFS configuration file is
missing, the file that is specified with the -F option from
the node that is specified with the -p option is used
instead. This command works best when used with the
mmsdrbackup user exit. See the following IBM Spectrum
Scale: Administration and Programming Reference topic:
mmsdrbackup user exit.

...

Here's an example of using the command to restore the configuration to node c0040 using primary server gs0 (i.e. one of the NSD servers):
$ sudo mmsdrrestore -p gs0 -N c0040
Fri Dec  2 03:47:06 UTC 2016: mmsdrrestore: Processing node gs0
Fri Dec  2 03:47:08 UTC 2016: mmsdrrestore: Processing node c0040
mmsdrrestore: Command successfully completed

Finally, start GPFS on the client (which also mounts the file system(s) if configured to do so):
$ sudo mmstartup -N c0040
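
The error at the top of this post comes from the missing /var/mmfs/gen/mmsdrfs file, so a reinstalled node can be checked for that file before attempting mmstartup. The following is a minimal sketch, assuming it runs on the client itself. The helper name gpfs_config_missing and its optional root-prefix argument are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether this node has lost its GPFS configuration file.
# gpfs_config_missing is a hypothetical helper; the optional argument is a
# root prefix that exists only so the check can be tried on a scratch tree.
gpfs_config_missing() {
  local root="${1:-}"
  [ ! -f "${root}/var/mmfs/gen/mmsdrfs" ]
}

if gpfs_config_missing; then
  echo "mmsdrfs is missing: run mmsdrrestore for this node before mmstartup"
else
  echo "mmsdrfs is present"
fi
```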

Monday, August 22, 2016

Dell OMSA 8.3 on CentOS 7.2 "Error! Chassis info setting unavailable on this system."

After installing Dell OMSA 8.3 on a new PowerEdge R730xd running CentOS 7.2 x86_64, the omreport chassis info command reports the following (after starting the services):

# omreport chassis info

Error! Chassis info setting unavailable on this system.
First, the solution (Zurd on the mailing list pointed me here: http://lists.us.dell.com/pipermail/linux-poweredge/2016-August/050692.html) followed by the full ticket I sent to the Dell linux-poweredge mailing list.

The solution for CentOS users (and possibly other non-supported distros) is to stop the services, make the following change, then restart the services:
--- /opt/dell/srvadmin/etc/srvadmin-storage/stsvc.ini.orig 2016-08-22 21:28:32.079580254 -0500
+++ /opt/dell/srvadmin/etc/srvadmin-storage/stsvc.ini 2016-08-22 21:20:32.374317823 -0500
@@ -116,7 +116,7 @@
 vil4=dsm_sm_sasvil
 vil5=dsm_sm_sasenclvil
 vil6=dsm_sm_swrvil
-vil7=dsm_sm_psrvil
+; vil7=dsm_sm_psrvil
 vil8=dsm_sm_rnavil

 [SSDSmartInterval]
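
If the same change has to be made on more than one server, it could be scripted with sed instead of editing by hand. This is only a sketch: the helper name disable_psrvil is made up, and it assumes the line reads exactly vil7=dsm_sm_psrvil with no trailing whitespace, as in the diff above.

```shell
#!/usr/bin/env bash
# Sketch: comment out the vil7 line in stsvc.ini, keeping a .orig backup.
# disable_psrvil is a hypothetical helper name; pass the path to stsvc.ini.
disable_psrvil() {
  local ini="$1"
  # "&" re-inserts the matched line after the "; " comment prefix.
  sed -i.orig 's/^vil7=dsm_sm_psrvil$/; &/' "$ini"
}

# Example (as root):
#   srvadmin-services.sh stop
#   disable_psrvil /opt/dell/srvadmin/etc/srvadmin-storage/stsvc.ini
#   srvadmin-services.sh start
```

The .orig backup mirrors the file name used in the diff, so the change is easy to revert.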
Now on to the full details of the issue:
  1. Install Dell OMSA 8.3
  2. # wget -q -O - http://linux.dell.com/repo/hardware/dsu/bootstrap.cgi | bash
    # yum clean all
    # yum -y install kernel-devel kernel-headers gcc dell-system-update
    # yum -y install srvadmin-all
  3. Next check the status of the services (not started)
    # srvadmin-services.sh status
    dell_rbu (module) is stopped
    ipmi driver is running
    dsm_sa_datamgrd is stopped
    dsm_sa_eventmgrd is stopped
    dsm_sa_snmpd is stopped
    ● dsm_om_shrsvc.service - LSB: DSM OM Shared Services
       Loaded: loaded (/etc/rc.d/init.d/dsm_om_shrsvc)
       Active: inactive (dead)
         Docs: man:systemd-sysv-generator(8)
     
    Aug 22 20:43:08 r730xd-srv01.local systemd[1]: Starting LSB: DSM OM Shared Services...
    Aug 22 20:43:08 r730xd-srv01.local dsm_om_shrsvc[5144]: [47B blob data]
    Aug 22 20:43:08 r730xd-srv01.local systemd[1]: Started LSB: DSM OM Shared Services.
    Aug 22 20:43:08 r730xd-srv01.local dsm_om_shrsvc[5144]: tput: No value for $TERM and no -T specified
    Aug 22 20:45:55 r730xd-srv01.local systemd[1]: Stopping LSB: DSM OM Shared Services...
    Aug 22 20:45:55 r730xd-srv01.local dsm_om_shrsvc[8804]: [52B blob data]
    Aug 22 20:45:55 r730xd-srv01.local systemd[1]: Stopped LSB: DSM OM Shared Services.
    Aug 22 20:46:28 r730xd-srv01.local systemd[1]: Stopped LSB: DSM OM Shared Services.
    ● dsm_om_connsvc.service - LSB: DSM OM Connection Service
       Loaded: loaded (/etc/rc.d/init.d/dsm_om_connsvc)
       Active: inactive (dead)
         Docs: man:systemd-sysv-generator(8)
     
    Aug 22 20:43:08 r730xd-srv01.local systemd[1]: Starting LSB: DSM OM Connection Service...
    Aug 22 20:43:08 r730xd-srv01.local dsm_om_connsvc[5145]: [50B blob data]
    Aug 22 20:43:08 r730xd-srv01.local systemd[1]: Started LSB: DSM OM Connection Service.
    Aug 22 20:45:55 r730xd-srv01.local systemd[1]: Stopping LSB: DSM OM Connection Service...
    Aug 22 20:46:02 r730xd-srv01.local dsm_om_connsvc[8844]: [55B blob data]
    Aug 22 20:46:02 r730xd-srv01.local systemd[1]: Stopped LSB: DSM OM Connection Service.
    Aug 22 20:46:29 r730xd-srv01.local systemd[1]: Stopped LSB: DSM OM Connection Service.
  4. Start the services
    # srvadmin-services.sh start
    Starting instsvcdrv (via systemctl):                       [  OK  ]
    Starting dataeng (via systemctl):                          [  OK  ]
    Starting dsm_om_shrsvc (via systemctl):                    [  OK  ]
    Starting dsm_om_connsvc (via systemctl):                   [  OK  ]
  5. Try running the chassis info command
    # omreport chassis info
    Error! Chassis info setting unavailable on this system.
     
    # omreport about
    Product name : Dell OpenManage Server Administrator
    Version      : 8.3.0
    Copyright    : Copyright (C) Dell Inc. 1995-2015 All rights reserved.
    Company      : Dell Inc.
  6. The following are the rpms installed via yum
    # rpm -qa | grep srvadmin
    srvadmin-xmlsup-8.3.0-1908.9058.el7.x86_64
    srvadmin-omacore-8.3.0-1908.9058.el7.x86_64
    srvadmin-server-snmp-8.3.0-1908.9058.el7.x86_64
    srvadmin-oslog-8.3.0-1908.9058.el7.x86_64
    srvadmin-idrac-vmcli-8.3.0-1908.9058.el7.x86_64
    srvadmin-storageservices-snmp-8.3.0-1908.9058.el7.x86_64
    srvadmin-smcommon-8.3.0-1908.9058.el7.x86_64
    srvadmin-omcommon-8.3.0-1908.9058.el7.x86_64
    srvadmin-smweb-8.3.0-1908.9058.el7.x86_64
    srvadmin-racsvc-8.3.0-1908.9058.el7.x86_64
    srvadmin-nvme-8.3.0-1908.9058.el7.x86_64
    srvadmin-storage-cli-8.3.0-1908.9058.el7.x86_64
    srvadmin-storageservices-8.3.0-1908.9058.el7.x86_64
    srvadmin-omilcore-8.3.0-1908.9058.el7.x86_64
    srvadmin-racadm4-8.3.0-1908.9058.el7.x86_64
    srvadmin-isvc-8.3.0-1908.9058.el7.x86_64
    srvadmin-argtable2-8.3.0-1908.9058.el7.x86_64
    srvadmin-racadm5-8.3.0-1908.9058.el7.x86_64
    srvadmin-cm-8.3.0-1908.9058.el7.x86_64
    srvadmin-isvc-snmp-8.3.0-1908.9058.el7.x86_64
    srvadmin-rac4-populator-8.3.0-1908.9058.el7.x86_64
    srvadmin-tomcat-8.3.0-1908.9058.el7.x86_64
    srvadmin-itunnelprovider-8.3.0-1908.9058.el7.x86_64
    srvadmin-storelib-sysfs-8.3.0-1908.9058.el7.x86_64
    srvadmin-storageservices-cli-8.3.0-1908.9058.el7.x86_64
    srvadmin-deng-8.3.0-1908.9058.el7.x86_64
    srvadmin-rac-components-8.3.0-1908.9058.el7.x86_64
    srvadmin-ominst-8.3.0-1908.9058.el7.x86_64
    srvadmin-sysfsutils-8.3.0-1908.9058.el7.x86_64
    srvadmin-rac5-8.3.0-1908.9058.el7.x86_64
    srvadmin-base-8.3.0-1908.9058.el7.x86_64
    srvadmin-idrac-ivmcli-8.3.0-1908.9058.el7.x86_64
    srvadmin-rac4-8.3.0-1908.9058.el7.x86_64
    srvadmin-webserver-8.3.0-1908.9058.el7.x86_64
    srvadmin-standardAgent-8.3.0-1908.9058.el7.x86_64
    srvadmin-storelib-8.3.0-1908.9058.el7.x86_64
    srvadmin-storage-snmp-8.3.0-1908.9058.el7.x86_64
    srvadmin-omacs-8.3.0-1908.9058.el7.x86_64
    srvadmin-racdrsc-8.3.0-1908.9058.el7.x86_64
    srvadmin-idracadm-8.3.0-1908.9058.el7.x86_64
    srvadmin-idrac-snmp-8.3.0-1908.9058.el7.x86_64
    srvadmin-realssd-8.3.0-1908.9058.el7.x86_64
    srvadmin-storage-8.3.0-1908.9058.el7.x86_64
    srvadmin-all-8.3.0-1908.9058.el7.x86_64
    srvadmin-hapi-8.3.0-1908.9058.el7.x86_64
    srvadmin-deng-snmp-8.3.0-1908.9058.el7.x86_64
    srvadmin-server-cli-8.3.0-1908.9058.el7.x86_64
    srvadmin-jre-8.3.0-1908.9058.el7.x86_64
    srvadmin-idrac-8.3.0-1908.9058.el7.x86_64

Wednesday, June 22, 2016

How To: Enable PXE and Configure Boot Order Via Dell RACADM Command

Our HPC cluster was lucky enough to double in compute capacity recently. Whoop! The new hardware brought with it some significant changes in rack layout and networking fabric. The compute nodes are a combination of Dell R630, DSS1500 and R730 (for GPU K80 and Intel Phi nodes).

The existing core 10GbE CAT6 based fabric (made up of Dell Force10 S4820T) was replaced by a Dell Force10 Z9500 and fiber (Z9500 has 132 QSFP+ 40GbE ports that can in turn be broken out into 528 10GbE SFP+ ports).

Physical changes aside (like wiring, top of rack 40GbE to 10GbE breakout panels, etc...), the above meant we had to change the primary boot device from the addon 10GbE CAT6 based NIC to the onboard fiber 10GbE NIC (CentOS 7 sees this interface as eno1).

This required two changes at the system BIOS / NIC hardware config level:
  • Enable PXE boot on the NIC
  • Modify the BIOS boot order
One method to make these two changes in bulk is to use the Dell OpenManage command line tool racadm, which was what we decided to use.

The following are notes I took while working on a subset of the compute nodes.

Enable PXE on the fiber interface

The first step is to identify the names of the network interfaces. I queried a single node to get the full list of interfaces, then queried the specific interface (.1) just for grins to see what settings were available. In this case the first integrated port is referenced as NIC.nicconfig.1 and NIC.Integrated.1-1-1.
# Get list of Nics
racadm -r 172.16.3.48 -u root -p xxxxxxx get nic.nicconfig

NIC.nicconfig.1 [Key=NIC.Integrated.1-1-1#nicconfig]
NIC.nicconfig.2 [Key=NIC.Integrated.1-2-1#nicconfig]
NIC.nicconfig.3 [Key=NIC.Integrated.1-3-1#nicconfig]
NIC.nicconfig.4 [Key=NIC.Integrated.1-4-1#nicconfig]
NIC.nicconfig.5 [Key=NIC.Slot.3-1-1#nicconfig]
NIC.nicconfig.6 [Key=NIC.Slot.3-2-1#nicconfig]

racadm -r 172.16.3.48 -u root -p xxxxxxx get nic.nicconfig.1

[Key=NIC.Integrated.1-1-1#nicconfig]
LegacyBootProto=NONE
#LnkSpeed=AutoNeg
NumberVFAdvertised=64
VLanId=0
WakeOnLan=Disabled

 
Next we can enable PXE boot on NIC.Integrated.1-1-1 for the set of nodes. In order for the change to take effect you have to create a job followed by a reboot.
for n in {48..10} ; do
  ip=172.16.3.${n}
  echo "IP: $ip - configuring nic.nicconfig.1.legacybootproto PXE"
  # Get Nic config for integrated port 1
  racadm -r $ip -u root -p xxxxxxx get nic.nicconfig.1 | grep Legacy
  # Set to PXE
  racadm -r $ip -u root -p xxxxxxx set nic.nicconfig.1.legacybootproto PXE
  # Verify it's set to PXE (pending)
  racadm -r $ip -u root -p xxxxxxx get nic.nicconfig.1 | grep Legacy
  # Create a job to enable the changes following the reboot
  racadm -r $ip -u root -p xxxxxxx jobqueue create NIC.Integrated.1-1-1
  # Reboot so that the configuration job will execute
  ipmitool -I lanplus -H $ip -U root -P xxxxxxx chassis power reset
done 
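
Once the nodes have rebooted, it is worth confirming that the setting actually stuck before moving on. Below is a small sketch of a filter that pulls the LegacyBootProto value out of the racadm output shown earlier. The helper name legacy_boot_proto is made up, and it assumes the KEY=VALUE output format seen above.

```shell
#!/usr/bin/env bash
# Sketch: extract the LegacyBootProto value from "racadm get nic.nicconfig.N"
# output read on stdin. legacy_boot_proto is a hypothetical helper name.
legacy_boot_proto() {
  # Lines look like "LegacyBootProto=NONE"; strip any trailing carriage return.
  awk -F= '/^LegacyBootProto=/ { sub(/\r$/, "", $2); print $2 }'
}

# Example, after the nodes have rebooted and the jobs have run:
#   for n in {48..10} ; do
#     ip=172.16.3.${n}
#     proto=$(racadm -r $ip -u root -p xxxxxxx get nic.nicconfig.1 | legacy_boot_proto)
#     [ "$proto" = "PXE" ] || echo "IP: $ip - LegacyBootProto is '${proto}', expected PXE"
#   done
```

Any node that prints a mismatch is a candidate for checking whether its job ran.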

Configure the BIOS boot order

Now that the NIC has PXE enabled and the changes have been applied, the boot order can be modified. If this fails for a node, it most likely means that the job failed to run in the previous step, so start debugging there.
for n in {48..10} ; do
  ip=172.16.3.${n}
  echo "IP: $ip - configuring BIOS.biosbootsettings.BootSeq NIC.Integrated.1-1-1,...."
  # Get Bios Boot sequence
  racadm -r $ip -u root -p xxxxxxx get BIOS.biosbootsettings.BootSeq | grep BootSeq
  # Set Bios boot sequence
  racadm -r $ip -u root -p xxxxxxx set BIOS.biosbootsettings.BootSeq NIC.Integrated.1-1-1,NIC.Integrated.1-3-1,NIC.Slot.3-1-1,Optical.SATAEmbedded.J-1,HardDisk.List.1-1
  # Create a BIOS reboot job so that the boot order changes are applied
  racadm -r $ip -u root -p xxxxxxx jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA
done
 

Modify the boot interface in BrightCM

We use Bright Computing Cluster Manager for HPC to manage our HPC nodes (this recently replaced Rocks in our environment). Within BrightCM we had to modify the boot interface for the set of compute nodes. BrightCM provides excellent CLI support, hooray!
cmsh -c "device foreach -n node0001..node0039 (interfaces; use bootif; set networkdevicename eno1); commit"
 

Update the switch port locations in BrightCM

BrightCM keeps track of the switch port to node NIC mapping. One reason for this is to prevent accidentally imprinting the wrong image on nodes that got swapped (i.e. you remove two nodes for service and insert them back into the rack in the wrong location).
First I had to identify the new port number for a node; I chose the node that would be last in sequence on the switch, which showed up in BrightCM as port 171. I found this by PXE booting the compute node. Once it comes up, BrightCM notices a discrepancy and displays an interface that allows you to manually address the issue, with a message akin to "node0039 should be on switch XXX port ## but it's showing up on switch z9500-r05-03-38 port 171".
Instead of manually addressing the issue, it can be done via the CLI in bulk (assuming there's a sequence). Each of our nodes has two NICs wired to the Z9500 (node0039 would be on ports 171 and 172), thus in the code below I decrement by 2 ports for each node's boot device.
port=171
for n in {39..26} ; do
  # Assign the current port first (node0039 gets 171), then step down 2 ports
  cmsh -c "device; set node00${n} ethernetswitch z9500-r05-03-38:${port}; commit ;"
  port=$((port - 2))
done
unset port
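
Before committing anything in cmsh, the node-to-port mapping can be printed for review. This sketch assumes the layout described above (node0039 on port 171, each lower-numbered node 2 ports lower); the helper name port_for_node is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compute the Z9500 port for a node, assuming node0039 is on port 171
# and each lower-numbered node sits 2 ports lower.
# port_for_node is a hypothetical helper name.
port_for_node() {
  local n="$1"
  echo $(( 171 - 2 * (39 - n) ))
}

# Print the mapping for review; nothing is sent to cmsh here.
for n in {39..26} ; do
  echo "node00${n} -> z9500-r05-03-38:$(port_for_node "$n")"
done
```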
 

Friday, August 7, 2015

How To: Clear Dell iDRAC Job Queue

I'm in the process of deploying 41 new Dell R630 PowerEdge servers in our HPC environment. To help manage the hardware I'm using a new tool (to us anyways), Dell OpenManage Essentials.

OME requires a Microsoft Windows OS; luckily (since we are a Linux shop) it's a snap to install Windows Server 2012 in KVM.

Some of the functionality provided by OME:

  • Reporting and alerting
  • Firmware upgrades
  • Configuration deployment (BIOS settings, iDRAC, RAID, etc...)
  • Bare metal provisioning
While OME is free, some of the features require a license. I've only been using OME for a couple of days so I haven't had a chance to test all of its features, but I have found that configuration requires a license (ex: ability to push a configuration template out to a node(s)). Firmware upgrades and reporting do not require a license.

The first task to be handled by OME was firmware upgrades on all 41 nodes. My initial attempts failed. Reading through the logs revealed that the remote clients couldn't reach TCP port 1278 on the OME server. Firmware upgrades started deploying after opening TCP port 1278 in the Windows firewall.

Each server had a long list of upgrades including BIOS, iDRAC, and the 6 network cards (mix of 10Gbit and 1Gbit). All of the firmware deployed successfully, with the exception of the Ethernet cards. Grrr, back to scanning the logs.


Results:  
 Downloading Packages.
 Calling InstallFromUri method to Download packages to the iDRAC 
 There are some pending reboot jobs on the iDRAC that maybe block updating the system. It is recommended that you clear all the jobs before updating
 Downloading Package: Network_Firmware_6FD9P_WN64_16.5.20_A00.EXE onto the iDRAC 
 Package download has successfully started and the Job ID is JID_388846411941
 The URI given to the iDRAC to download from: http://192.168.2.69:1278/install_packages/Packages/Network_Firmware_6FD9P_WN64_16.5.20_A00.EXE

Ok, but how do you do this? I didn't see any native way to do this from within OME, so on to Google.

Thanks to this post on Jon Munday's blog, I was able to clear the pending jobs with a little PowerShell for loop action to hit all nodes.

The following command displays the job queue for the range of compute nodes (192.168.2.10 through 192.168.2.50)

For ($i=10; $i -lt 51; $i++) { winrm e cimv2/root/dcim/DCIM_LifecycleJob -u:$USER -p:$PASSWORD -SkipCNcheck -SkipCAcheck -r:https://192.168.2.$i/wsman -auth:basic -encoding:utf-8 }
 
The next command clears the queue. Sorry for the long single line, I don't know if PowerShell supports spanning a command across multiple lines like I can do in Bash:

For ($i=10; $i -lt 51; $i++) { winrm invoke DeleteJobQueue "cimv2/root/dcim/DCIM_JobService?CreationClassName=DCIM_JobService+Name=JobService+SystemName=Idrac+SystemCreationClassName=DCIM_ComputerSystem" '@{JobID="JID_CLEARALL"}' -r:https://192.168.2.$i/wsman -u:$USER -p:$PASSWORD -SkipCNCheck -SkipCACheck -auth:basic -encoding:utf-8 -format:pretty }
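
For Linux shops, a similar result may be achievable with racadm from Bash instead of winrm. The following is an untested dry-run sketch, not the method used above: it only prints the commands, the helper name clear_jobqueue_cmds is made up, and it assumes the racadm on hand supports jobqueue delete with the JID_CLEARALL job ID.

```shell
#!/usr/bin/env bash
# Sketch: print (do not execute) one racadm command per iDRAC that would
# clear its job queue. clear_jobqueue_cmds is a hypothetical helper name;
# $USER and $PASSWORD are printed literally as placeholders for credentials.
clear_jobqueue_cmds() {
  local first="$1" last="$2" i
  for (( i = first; i <= last; i++ )); do
    echo "racadm -r 192.168.2.${i} -u \$USER -p \$PASSWORD jobqueue delete -i JID_CLEARALL"
  done
}

clear_jobqueue_cmds 10 50
```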
 


Friday, May 29, 2015

A Linux User's First Time on a Mac

I'm in the process of switching from a Dell XPS 13 (Project Sputnik) ultrabook to a Macbook Pro 13 (Broadwell i7). The battery life on my current laptop was horrible, an hour of real world use with the screen dimmed and killing running applications like Scrooge. Unfortunately, the XPS solders the battery onboard, so it's not easily replaced (if at all?). It still functions great when near power.

You may be thinking, why would you do such a thing if you are a Linux user?

For starters, the specs for the Macbook Pro 13 (Spring 2015) are better than the XPS for a similar cost (how much is of course open to interpretation):
  • Intel 3.1 GHz Core i7-5557U vs  3.0 GHz Core i7-5500U Processor
  • 16 GB (optional) vs 8 GB RAM
  • Intel Iris 6100 vs Intel HD Graphics 5500
  • PCIe based SSD vs SATA
  • Non-touch Retina Display vs touch UltraSharp QHD+ (personally I don't see the value yet in having a touch screen on a Linux laptop, especially considering the screen doesn't completely fold over)
  • Magsafe Power Adapter
  • Proven battery life
I do most of my work via SSH to the Linux systems, so the workstation doesn't have to be Linux, although it makes life much less painful. I figured, what the heck, let's try a BSD like system that has a history of awesome battery life.

Without further ado, here are some of my experiences using a Mac and OSX for the first time as a Linux user.

Useful Applications

  • Oh My Zsh - Trying out Zsh shell as an alternative for Bash for the first time, pretty darn cool.
  • Caffeine - Useful for temporarily preventing the laptop from sleeping (don't kill my SSH or VPN connections, dangit!)
  • iTerm2 - Really nice terminal replacement for the built-in OSX terminal. Tons of features like built-in Tmux, search, transparent background, its own built-in auto completion (Cmd ;), etc... This part was what got me searching for a new terminal in the first place: "Coming from a Unix world? You'll feel at home with focus follows mouse, copy on select, middle button paste, and keyboard shortcuts to avoid mousing."
  • MagicPrefs - This app lets you configure middle mouse paste functionality on the trackpad (set mine to three finger press)
  • XChat Azure - Excellent IRC client
  • TextWrangler - Extremely nice graphical script editor
  • Microsoft Office 2016 Preview - Because I need it for Office work

Graphical Text Editor

I use vi/vim extensively on Linux and now on my Macbook Pro. That said, I do like to edit in a GUI text editor as well. After a good bit of searching around, TextWrangler is the (free) one I've been most happy using.

The way I understand it, TextWrangler is sort of the little brother to the professional product BBEdit, which adds "its extensive professional feature set including Web authoring capabilities and software development tools". I primarily work with Ruby (shell scripts, not Rails), Perl, Bash, Puppet and other system management type scripting; TextWrangler works very well for these.

One thing I found missing that I use regularly in other editors is the ability to duplicate a line without the cumbersome highlight, copy, paste. Many GUI editors provide this ability using a shortcut like Ctrl + d, or in vi, yy p (yank yank paste).

After searching around in the keyboard shortcuts without luck, a Google search led me to this post which mentioned creating an "AppleScript" to accomplish the task. What tha?

While the code did work, it left both the original and new lines highlighted, which was a bit annoying. I decided I wanted the cursor to remain where it originally was located. By the way, in TextWrangler, the "cursor" is called the "insertion point" both in the documentation and in AppleScript.

So, here's my updated script (my changes are the single lines following each comment):

tell application "TextWrangler"
  tell window 1
    # Get the current position for the cursor so we can place it back
    set cursorLoc to characterOffset of selection
    select line the (startLine of the selection)
    copy (contents of the selection) as text to myText
    set the contents of the selection to myText & myText
    # Move the cursor back to the first column with a character
    select insertion point before character cursorLoc
  end tell
end tell  

I searched all over the place to get a hint how to place the cursor back in its original location. As you can see, AppleScript isn't syntactically like Ruby, Perl, Java, etc... It's pretty funky and totally dynamic based on the application being scripted. TextWrangler provides a dictionary, but I didn't find the contents very helpful.

I finally stumbled on this post that mentions using "characterOffset of selection". Voila! All in all, it's pretty darn cool that the OS provides a simple way to extend the functionality of a GUI app.

Create the script in the AppleScript Editor (either launch it from Spotlight Search (Command Space) or click the script menu next to Help in TextWrangler and "Open Script Editor").

Copy and paste the code above, then save it in the directory
"~/Library/Application\ Support/TextWrangler/Scripts"
as something like DuplicateLine.scpt (the extension will get added automatically).

Next, restart TextWrangler and go to Window -> Palettes -> Scripts, click on DuplicateLine in the list and click Set Shortcut. Set it to whatever, I set mine to Ctrl D. This shortcut is already set to delete line, so I altered that shortcut in preferences to Shift Ctrl D.


Wednesday, July 9, 2014

Using Check_Openmanage with Check_MK via WATO

In an older post I described the steps to integrate check_openmanage Nagios plugin with check_mk. This approach required manually editing the etc/check_mk/main.mk file to configure the extra_nagios_conf and legacy_checks.

This updated guide uses the check_mk WATO (Web Administration Tool) to integrate the check_openmanage check using a feature called "Active Checks".

Here's the guide, hope it helps:

Environment:

Install Check_openmanage

Unless otherwise specified all paths are relative to the site owners home (ex: /opt/omd/sites/mysite)
  1. Make sure your Dell servers have the following SNMP packages installed prior to installing OMSA (if not, it's easy to 'yum remove srvadmin-\*' and 'yum install srvadmin-all'): net-snmp, net-snmp-libs, net-snmp-utils
    • Start the OMSA services 'srvadmin-services.sh start' and then check 'srvadmin-services.sh status' to verify that the snmpd component is running
    • Ensure that snmpd is running and configured
    • Configure the firewall to allow access from your OMD server to udp port 161
  2. change users on your OMD server to the site user: $ su - mysite
  3. Download the latest check_openmanage from http://folk.uio.no/trondham/software/check_openmanage.html to ~/tmp and extract
  4. copy the check_openmanage script to local/lib/nagios/plugins (this defaults to $USER2$ in your commands)
    
    $ cp tmp/check_openmanage-3.7.11/check_openmanage local/lib/nagios/plugins/
    $ chmod +x local/lib/nagios/plugins/check_openmanage
    
  5. copy the PNP4Nagios template
    
    $ cp tmp/check_openmanage-3.7.11/check_openmanage.php etc/pnp4nagios/templates/
    
  6. Test check_openmanage to see that it can successfully query a node
    
    local/lib/nagios/plugins/check_openmanage -H dell-r720xd-01 -p -C MySecretCommunity
    
    OK - System: 'PowerEdge R720xd', SN: 'XXXXXX1', 24 GB ram (6 dimms), 2 logical drives, 14 physical drives|T0_System_Board_Inlet=21C;42;47 T1_System_Board_Exhaust=30C;70;75 T2_CPU1=48C;86;91 T3_CPU2=39C;86;91 W2_System_Board_Pwr_Consumption=126W;0;0 A0_PS1_Current_1=0.6A;0;0 A1_PS2_Current_2=0.2A;0;0 V25_PS1_Voltage_1=240V;0;0 V26_PS2_Voltage_2=240V;0;0 F0_System_Board_Fan1=2280rpm;0;0 F1_System_Board_Fan2=2280rpm;0;0 F2_System_Board_Fan3=2280rpm;0;0 F3_System_Board_Fan4=3000rpm;0;0 F4_System_Board_Fan5=3600rpm;0;0 F5_System_Board_Fan6=3480rpm;0;0
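
    When testing from the command line, the plugin's exit status matters as much as its text output, since that status is what determines the service state. The sketch below maps the standard Nagios plugin exit codes to state names; the helper name nagios_state is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: translate a Nagios plugin exit status into its state name
# (0 OK, 1 WARNING, 2 CRITICAL, anything else UNKNOWN).
# nagios_state is a hypothetical helper name.
nagios_state() {
  case "$1" in
    0) echo OK ;;
    1) echo WARNING ;;
    2) echo CRITICAL ;;
    *) echo UNKNOWN ;;
  esac
}

# Example, run as the site user from the site owner's home:
#   local/lib/nagios/plugins/check_openmanage -H dell-r720xd-01 -p -C MySecretCommunity >/dev/null
#   echo "dell-r720xd-01: $(nagios_state $?)"
```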
    
    

WATO Configuration

  1. Create a Host Group by clicking Host Groups under WATO - Configuration, click New Group (click save when done):
    • Name: omsa
    • Alias: Dell OpenManage
  2. Create a Host Tag by clicking Host Tags under WATO - Configuration, click New Tag Group (click save when done):
    • Internal ID: dellomsa
    • Topic: (leave empty)
    • Choices:
      • Tag ID: omsa
      • Description: Dell OpenManage
  3. Create an Active Check by clicking Host & Service Parameter under WATO - Configuration, click Active Checks, click Classical active and passive Nagios checks (create a new one, click save when done):
    • Folder: Main Directory
    • Host Tags: Select Dell OpenManage is set
    • Service Description: check_openmanage
    • Command Line: $USER2$/check_openmanage -H $HOSTADDRESS$ -p -C MySecretCommunity
    • Check Performance Data
  4. Add the omsa Host Tag to a host running OpenManage with SNMP configured by clicking Hosts under WATO - Configuration, and click the properties editor (pencil icon) for the host (click Save & go to Services when done):
    • Host tags: Dell OpenManage: check Dell OpenManage twice

On the Host services page you should see the new service at the bottom, example:
Custom checks (defined via rule)
Status  Checkplugin   Item              Service Description  Plugin output    
OK      custom        check_openmanage  check_openmanage     OK - System: 'PowerEdge R710', SN: 'XXXXXX1', 24 GB ram (6 dimms), 2 logical drives, 14 physical drives
Click Activate Missing services or Save manual check configuration. Activate the changes and you should start seeing the check within a few minutes and graphs after 10 minutes or so. Hope this helps, and comments are welcome.

Tuesday, June 24, 2014

Replace The Foreman Self Signed Certificate with a Trusted Certificate

I've installed a few Foreman servers to provide provisioning and configuration management (via Puppet). This document will cover the steps to replace the self signed certificate used for the web interface with a trusted certificate.

For those unfamiliar with Foreman and Puppet, here are snippets from both project pages (http://theforeman.org/learn_more.html and http://puppetlabs.com/puppet/what-is-puppet):

Foreman is an open source project that helps system administrators manage servers throughout their lifecycle, from provisioning and configuration to orchestration and monitoring. Using Puppet or Chef and Foreman's smart proxy architecture, you can easily automate repetitive tasks, quickly deploy applications, and proactively manage change, both on-premise with VMs and bare-metal or in the cloud.

Foreman provides comprehensive, interaction facilities including a web frontend, CLI and RESTful API which enables you to build higher level business logic on top of a solid foundation.
Puppet is IT automation software that defines and enforces the state of your infrastructure throughout your software development cycle. From provisioning and configuration to orchestration and reporting, from initial code development through production release and updates, Puppet frees sysadmins from writing one-off, fragile scripts and other manual tasks. At the same time, Puppet ensures consistency and dependability across your infrastructure.

With Puppet, repetitive tasks are automated away, so sysadmins can quickly deploy business applications, scaling easily from tens of servers to thousands, both on-premise and in the cloud.

By default, the Puppet / Foreman server install uses Puppet's own internal CA for issuing SSL certificates. The Foreman install defaults to using the Puppet CA self signed cert for the web interface. The following steps will replace The Foreman's SSL certificate for the user web interface, but will leave the Puppet CA and SSL certs in place for Puppet related work.

I spent a bit of time trying to get this working, but each attempt resulted in a working web interface and broken Puppet master to client communications and Foreman proxy. Essentially, I was changing the SSL certificate entries in too many locations. Dominic on the #theforeman channel on FreeNode IRC directed me to this Google Groups thread that listed the short list of places to make the change. The following steps are based on that post.

  1. Create the SSL key and csr
    sudo su - 
    mkdir /root/Incommon-cert
    cd /root/Incommon-cert
    
    openssl req -out $(hostname -f)-2048.csr -new -newkey rsa:2048 -nodes -keyout $(hostname -f)-2048.key
  2. Copy the contents of the csr to the clipboard and use it to request an InCommon SSL certificate
  3. Once the cert is approved, download the following files to /root/Incommon-cert on the Puppet / Foreman server:
    • as X509 Certificate only, Base64 encoded
    • as X509 Intermediates/root only, Base64 encoded
  4. Rename the files so that we know these are InCommon files
    mv puppet.tld.blah.crt puppet.tld.blah-2048-incommon-cert.crt
    mv puppet.tld.blah_interm.crt puppet.tld.blah-2048-incommon-interm.crt
    chown root:root *.crt
  5. Copy the files to the appropriate directories
    cp puppet.tld.blah-2048.key /var/lib/puppet/ssl/private_keys/
    cp puppet.tld.blah-2048-incommon-cert.crt /var/lib/puppet/ssl/certs/
    cp puppet.tld.blah-2048-incommon-interm.crt /var/lib/puppet/ssl/certs/
    wget https://www.incommon.org/certificates/repository/incommon-ssl.ca-bundle -O /var/lib/puppet/ssl/certs/incommon-ssl.ca-bundle
  6. Set the appropriate permissions and SELinux configs for the key
    cd /var/lib/puppet/ssl/private_keys/
    chown puppet:puppet *.key
    chmod 640 *.key
    chcon -u system_u -r object_r -t puppet_var_lib_t *.key
    
    ls -lZ
    -rw-r-----. puppet puppet system_u:object_r:puppet_var_lib_t:s0 puppet.tld.blah-2048.key
    -rw-r-----. puppet puppet system_u:object_r:puppet_var_lib_t:s0 puppet.tld.blah.pem
  7. Set perms and SELinux for the certs
    cd /var/lib/puppet/ssl/certs/
    chown puppet:puppet *
    chcon -u system_u -r object_r -t puppet_var_lib_t *.crt
    
    ls -lZ
    -rw-r--r--. puppet puppet system_u:object_r:puppet_var_lib_t:s0 ca.pem
    -rw-r--r--. puppet puppet system_u:object_r:puppet_var_lib_t:s0 incommon-ssl.ca-bundle
    -rw-r--r--. puppet puppet system_u:object_r:puppet_var_lib_t:s0 puppet.tld.blah-2048-incommon-cert.crt
    -rw-r--r--. puppet puppet system_u:object_r:puppet_var_lib_t:s0 puppet.tld.blah-2048-incommon-interm.crt
    -rw-r--r--. puppet puppet system_u:object_r:puppet_var_lib_t:s0 puppet.tld.blah.pem
  8. Next edit the various config files
    • /etc/puppet/node.rb: Change the line :ssl_ca to use the new interm cert
      --- /etc/puppet/node.rb.orig 2014-03-24 17:48:09.215000045 -0500
      +++ /etc/puppet/node.rb 2014-06-24 10:24:51.049282905 -0500
      @@ -8,7 +8,8 @@
         :facts        => true,          # true/false to upload facts
         :timeout      => 10,
         # if CA is specified, remote Foreman host will be verified
      -  :ssl_ca       => "/var/lib/puppet/ssl/certs/ca.pem",      # e.g. /var/lib/puppet/ssl/certs/ca.pem
      +  #:ssl_ca       => "/var/lib/puppet/ssl/certs/ca.pem",      # e.g. /var/lib/puppet/ssl/certs/ca.pem
      +  :ssl_ca       => "/var/lib/puppet/ssl/certs/puppet.tld.blah-2048-incommon-interm.crt",      # e.g. /var/lib/puppet/ssl/certs/ca.pem
         # ssl_cert and key are required if require_ssl_puppetmasters is enabled in Foreman
         :ssl_cert     => "/var/lib/puppet/ssl/certs/puppet.tld.blah.pem",    # e.g. /var/lib/puppet/ssl/certs/FQDN.pem
         :ssl_key      => "/var/lib/puppet/ssl/private_keys/puppet.tld.blah.pem"      # e.g. /var/lib/puppet/ssl/private_keys/FQDN.pem
    • /usr/lib/ruby/site_ruby/1.8/puppet/reports/foreman.rb: Change the foreman_ssl_ca = line to use the interm cert
      --- /usr/lib/ruby/site_ruby/1.8/puppet/reports/foreman.rb.orig 2014-03-24 17:44:37.494000046 -0500
      +++ /usr/lib/ruby/site_ruby/1.8/puppet/reports/foreman.rb 2014-06-24 10:28:54.497406986 -0500
      @@ -5,7 +5,8 @@
       # URL of your Foreman installation
       $foreman_url='https://puppet.tld.blah'
       # if CA is specified, remote Foreman host will be verified
      -$foreman_ssl_ca = "/var/lib/puppet/ssl/certs/ca.pem"
      +#$foreman_ssl_ca = "/var/lib/puppet/ssl/certs/ca.pem"
      +$foreman_ssl_ca = "/var/lib/puppet/ssl/certs/puppet.tld.blah-2048-incommon-interm.crt"
       # ssl_cert and key are required if require_ssl_puppetmasters is enabled in Foreman
       $foreman_ssl_cert = "/var/lib/puppet/ssl/certs/puppet.tld.blah.pem"
       $foreman_ssl_key = "/var/lib/puppet/ssl/private_keys/puppet.tld.blah.pem"
    • /etc/httpd/conf.d/05-foreman-ssl.conf: Change three lines SSLCertificateFile, SSLCertificateKeyFile and SSLCertificateChainFile to use the new cert, key and CA bundle respectively
      --- /etc/httpd/conf.d/05-foreman-ssl.conf.orig 2014-06-24 10:30:59.917531640 -0500
      +++ /etc/httpd/conf.d/05-foreman-ssl.conf 2014-06-24 10:32:36.318164714 -0500
      @@ -35,11 +35,18 @@
       
         ## SSL directives
         SSLEngine on
      -  SSLCertificateFile      /var/lib/puppet/ssl/certs/puppet.tld.blah.pem
      -  SSLCertificateKeyFile   /var/lib/puppet/ssl/private_keys/puppet.tld.blah.pem
      -  SSLCertificateChainFile /var/lib/puppet/ssl/certs/ca.pem
      +  SSLCertificateFile      "/var/lib/puppet/ssl/certs/puppet.tld.blah-2048-incommon-cert.crt"
      +  SSLCertificateKeyFile   "/var/lib/puppet/ssl/private_keys/puppet.tld.blah-2048.key"
      +  SSLCertificateChainFile "/var/lib/puppet/ssl/certs/incommon-ssl.ca-bundle"
         SSLCACertificatePath    /etc/pki/tls/certs
         SSLCACertificateFile    /var/lib/puppet/ssl/certs/ca.pem
      +
      +#  SSLCertificateFile      /var/lib/puppet/ssl/certs/puppet.tld.blah.pem
      +#  SSLCertificateKeyFile   /var/lib/puppet/ssl/private_keys/puppet.tld.blah.pem
      +#  SSLCertificateChainFile /var/lib/puppet/ssl/certs/ca.pem
      +#  SSLCACertificatePath    /etc/pki/tls/certs
      +#  SSLCACertificateFile    /var/lib/puppet/ssl/certs/ca.pem
      +
         SSLVerifyClient         optional
         SSLVerifyDepth          3
         SSLOptions +StdEnvVars
  9. Restart the services (foreman-proxy restart probably isn't necessary but may as well)
    service httpd restart
    service foreman-proxy restart
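
Before the restart in step 9, a quick sanity check can confirm that the downloaded certificate actually belongs to the key generated in step 1; a mismatch is a common reason for httpd failing to start. This is a minimal sketch, assuming an RSA key as created above. The helper name cert_matches_key is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if an X509 certificate and an RSA private key share
# the same modulus, i.e. they belong together.
# cert_matches_key is a hypothetical helper name.
cert_matches_key() {
  local cert="$1" key="$2" cert_mod key_mod
  cert_mod=$(openssl x509 -noout -modulus -in "$cert" 2>/dev/null) || return 1
  key_mod=$(openssl rsa -noout -modulus -in "$key" 2>/dev/null) || return 1
  [ -n "$cert_mod" ] && [ "$cert_mod" = "$key_mod" ]
}

# Example (as root on the Puppet / Foreman server):
#   cert_matches_key /var/lib/puppet/ssl/certs/puppet.tld.blah-2048-incommon-cert.crt \
#                    /var/lib/puppet/ssl/private_keys/puppet.tld.blah-2048.key \
#     && echo "cert and key match" || echo "cert and key do NOT match"
```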