Friday, December 16, 2011

VM Guests Take Forever to Shutdown in VMware Workstation 8

I recently had a situation where a user purchased a new i7 + SSD based laptop. Needless to say, the laptop is fast.

As part of the purchase, the user also upgraded to the latest VMware Workstation, version 8.x. The user's RHEL4 guest was transferred over from the old laptop (which ran an older version of Workstation). It powered up and ran just fine on the new hardware and new version of VMware, with the exception of the guest shutdown.

It literally took an hour from the point where RHEL4 issued the halt to the underlying hardware (i.e. the OS portion of the shutdown was done) to a fully stopped virtual guest. From an appearance standpoint, the VM screen turns black and remains that way until the guest fully stops.

The laptop hard drive indicator was also flashing madly during the entire process.

I found suggestions to prevent virus scanners from scanning .vmdk files, but that didn't change the behaviour.

The laptop also uses PGP whole disk encryption. Perhaps Workstation 8 and PGP don't play nicely together? I couldn't find any references, and the old laptop (much slower than this one) also ran PGP with the older VMware Workstation, so PGP didn't seem to be a prime candidate.

I eventually discovered this post on the VMware message boards that provided the solution.

The thread discusses a similar issue happening in VMware Workstation 6.0.4 on Windows XP.

The solution was to add the following lines to either the global VMware config.ini file or each individual guest's .vmx file. Exit VMware Workstation before modifying config.ini or the .vmx files.


prefvmx.minVmMemPct = "100"
mainMem.useNamedFile = "FALSE"
mainMem.partialLazySave = "FALSE"
mainMem.partialLazyRestore = "FALSE"


With that code added to the configuration file, the virtual machine shuts down immediately after the operating system issues the halt.

Placing the code in the config.ini file affects all virtual machines, new and existing, unless the settings are overridden in the individual .vmx files.

For Windows 7, the config.ini file can be found here:
C:\ProgramData\VMware\VMware Workstation\config.ini

For Windows XP it can be found here:
C:\Documents and Settings\All Users\Application Data\VMware\VMware Workstation\config.ini

Friday, November 18, 2011

Dell Optiplex 790 Workstations hang while rebooting with CentOS 6

I'm working on deploying a large number of Dell Optiplex 790 workstations using kickstart and CentOS 6.

During the initial testing I found that the 790s wouldn't completely reboot with CentOS 6 installed, or when booted into the install media. They'd get as far as "Restarting" and hang there.

The solution is to pass an option to the kernel:

reboot=pci


For systems that are already installed, this can be added manually to the grub configuration file (a sketch follows the kickstart steps below). For kickstarting:

1. Add the option in your kickstart file

bootloader --location=mbr --driveorder=sda --append="crashkernel=auto rhgb quiet reboot=pci" --md5pass=$1$.xxxxx


2. During the initial boot off of the CD/DVD, press TAB to edit the boot options (this is all one continuous line, broken into multiple lines for readability)

> vmlinuz initrd=initrd.img ks=http://192.168.1.5/ks/el6/wks1.cfg
    ip=192.168.1.100 netmask=255.255.255.0 gateway=192.168.1.1 nameserver=192.168.1.1
    ksdevice=eth0 reboot=pci
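
For an already-installed system, here is a minimal sketch of appending the option to the GRUB legacy configuration; it assumes the stock /boot/grub/grub.conf location on CentOS 6:

# back up grub.conf, then append reboot=pci to every kernel line
cp /boot/grub/grub.conf /boot/grub/grub.conf.bak
sed -i '/^[[:space:]]*kernel /s/$/ reboot=pci/' /boot/grub/grub.conf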

Thursday, November 10, 2011

Fedora 16 does not Boot if /boot is on Software RAID

In previous versions of Fedora, you could configure /boot to exist on a software RAID device (say, a software mirror); in Fedora 16, however, this results in a failure to boot. It was never a supported configuration, but it used to work.

This is a known "issue" and is explained as follows:

Cannot boot with /boot partition on a software RAID array (Bugzilla: #750794)

Attempting to boot after installing Fedora 16 with the /boot partition on a software RAID array will fail, as the software RAID modules for the grub2 bootloader are not installed. Having the /boot partition on a RAID array has never been a recommended configuration for Fedora, but up until Fedora 16 it has usually worked.

To work around this issue, do not put the /boot partition on the RAID array. Create a simple BIOS boot partition and a /boot partition on one of the disks, and place the other system partitions on the RAID array. Alternatively, you can install the appropriate grub2 modules manually from anaconda's console before rebooting from the installer, or from rescue mode. Edit the file /mnt/sysimage/boot/grub2/grub.cfg and add the lines:

insmod raid
insmod mdraid09
insmod mdraid1x
Now run these commands:

chroot /mnt/sysimage
grub2-install /dev/sda
grub2-install /dev/sdb
Adjust the device names as appropriate to the disks used in your system.

I had a system with a software mirror for /boot that had been reinstalled through Fedora 13, 14, 15 and now 16. As reported, it failed to boot following the F16 install.

Destroying the mirror and creating a simple /dev/sda2 partition for /boot got it booting.
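
If you kickstart these systems, a partitioning sketch along the lines of the documented workaround could look like the following; the disk names, sizes, and file systems here are assumptions for illustration, not values from the bug report:

# BIOS boot partition and a plain /boot on the first disk, everything else on RAID1
part biosboot --fstype=biosboot --size=1 --ondisk=sda
part /boot --fstype=ext4 --size=500 --ondisk=sda
part raid.01 --size=1 --grow --ondisk=sda
part raid.02 --size=1 --grow --ondisk=sdb
raid / --fstype=ext4 --level=1 --device=md0 raid.01 raid.02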

Friday, November 4, 2011

Using check_dell_bladechassis with check_mk

This post builds off of a previous post that documented getting check_openmanage working with check_mk.

In this post we'll add check_dell_bladechassis to the mix to allow monitoring of Dell M1000e blade chassis (via the CMC management card).

This was done on the following system:
Unless otherwise specified, all paths are relative to the site owner's home (ex: /opt/omd/sites/mysite). The check_openmanage configuration shown in this blog post is not necessary for check_dell_bladechassis; I'm including it to help tie this entry to the previous post.
  1. Change users on your OMD server to the site user: $ su - mysite
  2. Download the latest check_dell_bladechassis from http://folk.uio.no/trondham/software/check_dell_bladechassis.html to ~/tmp and extract
  3. Copy the check_dell_bladechassis script to local/lib/nagios/plugins (this defaults to $USER2$ in your commands)
    
    $ cp tmp/check_dell_bladechassis-1.0.0/check_dell_bladechassis local/lib/nagios/plugins/
    $ chmod +x local/lib/nagios/plugins/check_dell_bladechassis
    
  4. Copy the PNP4Nagios template
    
    $ cp tmp/check_dell_bladechassis-1.0.0/check_dell_bladechassis.php etc/pnp4nagios/templates/
    
  5. Test check_dell_bladechassis to see that it can successfully query an M1000e CMC (I've inserted line breaks in the output to make it more readable)
    
    local/lib/nagios/plugins/check_dell_bladechassis -H dell-m1000e-01 -p -C MySecretCommunity
    
    OK - System: 'PowerEdge M1000e', SN: 'XXXXXX', Firmware: '3.03', hardware working fine|
    'total_watt'=1500.000W;0;7928.000 'total_amp'=6.750A;0;0 'volt_ps1'=239.500V;0;0 
    'volt_ps2'=242.750V;0;0 'volt_ps3'=242.750V;0;0 'volt_ps4'=241.750V;0;0 'volt_ps5'=241.750V;0;0 
    'volt_ps6'=242.750V;0;0 'amp_ps1'=1.688A;0;0 'amp_ps2'=1.641A;0;0 'amp_ps3'=0.188A;0;0 
    'amp_ps4'=1.516A;0;0 'amp_ps5'=1.500A;0;0 'amp_ps6'=0.219A;0;0
    
    
  6. Edit the main.mk file to define the command, etc. (I picked up the perfdata_format and monitoring_host settings from an earlier post to the mailing list; I'm not sure whether they are needed)
    
    all_hosts = [
     'dell-m1000e-01|snmp|m1000e|nonpub',
     'dell-r710-01|linsrv|kvm|omsa|nonpub',
     'dell-2950-01|linsrv|omsa|nonpub',
     'hp-srv-01|winsrv|smb', ]
    
    # Are you using PNP4Nagios and MRPE checks? This will make PNP
    # choose the correct template for standard Nagios checks:
    perfdata_format = "pnp"
    #set the monitoring host
    monitoring_host = "nagios"
    
    # SNMP Community
    snmp_default_community = "someCommunityRO"
    
    snmp_communities = [
      ( "MySecretCommunity", ["nonpub"], ALL_HOSTS ),
    ]
    
    extra_nagios_conf += r"""
    
    # ARG1: community string
    define command {
        command_name    check_openmanage
        command_line    $USER2$/check_openmanage -H $HOSTADDRESS$ -p -C $ARG1$
    }
    
    define command {
        command_name    check_dell_bladechassis
        command_line    $USER2$/check_dell_bladechassis -H $HOSTADDRESS$ -p -C $ARG1$
    }
    
    """
    
    legacy_checks = [
      # On all hosts with the tag 'omsa' check Dell OpenManage for status 
      # service description "Dell OMSA", process performance data
      ( ( "check_openmanage!MySecretCommunity", "Dell OMSA", True), [ "omsa" ], ALL_HOSTS ),
      # similar for m1000e
      ( ( "check_dell_bladechassis!MySecretCommunity", "Dell Blade Chassis", True), [ "m1000e" ], ALL_HOSTS ),
    
    ]
    
    
  7. That should be it; re-inventory your M1000e and reload
    
    $ check_mk -II dell-m1000e-01
    $ check_mk -O
    
    
  8. The PHP template has a bug that can be fixed using the patch below (see the first comment for details); a sketch of applying it follows the diff
    
    --- a/check_dell_bladechassis.php 2009-08-04 07:00:15.000000000 -0500
    +++ b/check_dell_bladechassis.php 2011-12-21 14:44:25.488132187 -0600
    @@ -41,7 +41,7 @@
      
      $opt[$count] = "--slope-mode --vertical-label \"$vlabel\" --title \"$def_title: $title\" ";
      
    -        $def[$count] .= "DEF:var$i=$rrdfile:$DS[$i]:AVERAGE " ;
    +        $def[$count] = "DEF:var$i=$rrdfile:$DS[$i]:AVERAGE " ;
             $def[$count] .= "AREA:var$i#$PWRcolor:\"$NAME[$i]\" " ;
             $def[$count] .= "LINE:var$i#000000: " ;
     
    @@ -62,7 +62,7 @@
      
      $opt[$count] = "-X0 --lower-limit 0 --slope-mode --vertical-label \"$vlabel\" --title \"$def_title: $title\" ";
      
    -        $def[$count] .= "DEF:var$i=$rrdfile:$DS[$i]:AVERAGE " ;
    +        $def[$count] = "DEF:var$i=$rrdfile:$DS[$i]:AVERAGE " ;
             $def[$count] .= "AREA:var$i#$AMPcolor:\"$NAME[$i]\" " ;
             $def[$count] .= "LINE:var$i#000000: " ;
     
    @@ -75,6 +75,7 @@
         if(preg_match('/^volt_/',$NAME[$i])){
      if ($visited_volt == 0) {
          ++$count;
    +     $def[$count] = '';
          $visited_volt = 1;
      }
      
    @@ -87,6 +88,7 @@
      
      $def[$count] .= "DEF:var$i=$rrdfile:$DS[$i]:AVERAGE " ;
      $def[$count] .= "LINE:var$i#".$colors[$v++].":\"$NAME[$i]\" " ;
    +
      $def[$count] .= "GPRINT:var$i:LAST:\"%3.2lf $UNIT[$i] last \" ";
      $def[$count] .= "GPRINT:var$i:MAX:\"%3.2lf $UNIT[$i] max \" ";
      $def[$count] .= "GPRINT:var$i:AVERAGE:\"%3.2lf $UNIT[$i] avg \\n\" ";
    @@ -96,6 +98,7 @@
         if(preg_match('/^amp_/',$NAME[$i])){
      if ($visited_amp == 0) {
          ++$count;
    +     $def[$count] = '';
          $visited_amp = 1;
      }
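
    To apply the fix, save the diff to a file and run patch from the templates directory; the saved filename below is just an example:

     $ cd etc/pnp4nagios/templates
     $ patch -p1 < ~/tmp/check_dell_bladechassis.php.patch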
     
    

Hope this helps, and comments are welcome.

Thursday, June 23, 2011

HowTo - Selectively enable service notifications in Check_mk / OMD

Check_mk installed via the Open Monitoring Distribution (OMD) is an extremely powerful combination for monitoring devices on a network.

Notifications from Nagios for the services discovered by Check_mk can overwhelm your inbox and mobile phone if notifications are enabled for all services (the default).

The following configures Nagios via Check_mk in an 'opt-in' manner for service notifications, using extra_service_conf in main.mk.

The following section in main.mk will:
* Enable notifications for "IPMI Sensor Summary" to get temperature alerts
* Enable notifications for "fs_*" to get alerts for all file system disk usage
* Enable notifications for "Memory Used" on a specific server, server1
* Disable all other service notifications

Check_mk evaluates these rules top to bottom and the first match wins, which is why the catch-all "0" rule comes last.


extra_service_conf["notifications_enabled"] = [
  ( "1", ALL_HOSTS, ["IPMI Sensor Summary","ambient_temp"]),
  ( "1", ALL_HOSTS, ["IPMI Sensor Summary","fs_*"]),
  ( "1", ["server1"], ["Memory Used"]),
  ( "0", ALL_HOSTS, ALL_SERVICES),
]
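
After editing main.mk, reload so Nagios picks up the change. To sanity-check what was generated first, check_mk's -N option should dump the Nagios configuration for a host; the host and service names below are just the examples from the list above:

$ check_mk -N server1 | grep -A 8 'Memory Used'
$ check_mk -O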

Wednesday, June 1, 2011

Upgrading VMware ESXi 4.0 to ESXi 4.1 Update 1

The following are notes I took while upgrading several ESXi 4.0 servers (Dell PowerEdge M600) to ESXi 4.1.0 Update 1.

I decided to post these notes because I encountered an error that wasn't specifically identified in a VMware KB article. There is a KB article for a similar issue, as you'll see, but the error message is slightly different.

Here's the VMware KB article that discusses the upgrade options to 4.1:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1022140

The following is the specific video that I used as a reference (it uses esxupdate from the SSH console):
http://www.youtube.com/watch?v=F0wSHPSvmpk&feature=player_embedded#at=57

  1. I updated the ESXi 4.0 servers to the latest patches prior to performing the upgrade to 4.1u1. I'm not sure whether this is required.
  2. Copy the upgrade-from-esxi4.0-to-4.1-update01-348481.zip to the local datastore (I put it in a new directory /vmfs/volumes/datastore1/esxi-upgrade). The upgrade package may be different if your version isn't currently ESXi version 4.0. Also, if you are upgrading from ESX 4.0 make sure to watch the video and follow the initial preupgrade step!
  3. SSH into the ESXi server
  4. Put the ESXi server into maintenance mode via either the vSphere Client GUI or from the command line
    
    # vim-cmd /hostsvc/maintenance_mode_enter
    # vim-cmd /hostsvc/runtimeinfo | grep inMaintenanceMode
       inMaintenanceMode = true, 
    
    
  5. Try to run the update; mine ran for several minutes and eventually errored out:
    
    # esxupdate --bundle=/vmfs/volumes/datastore1/esxi-upgrade/upgrade-from-esxi4.0-to-4.1-update01-348481.zip update
    
    The following problems were encountered trying to resolve dependencies:
       Requested VIB deb_vmware-esx-firmware_4.1.0-1.4.348481 conflicts with the
       host
    
     I found a KB article that covered a similar error and followed its advice to remove an unneeded Cisco Nexus package:
    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1026752
    • Check if the cisco package is installed
      
      # esxupdate query --vib-view | grep cross_cisco | grep installed
      
      cross_cisco-vem-v100-esx_4.0.4.1.1.28-0.5.2                        installed     2009-07-17T17:53:15.448003+00:00 
      
      
    • Remove the package
      
      # esxupdate -b cross_cisco-vem-v100-esx_4.0.4.1.1.28-0.5.2 remove
      
  6. Run the update again; this time it succeeds (it ran for 10 minutes or so)
    
    # esxupdate --bundle=/vmfs/volumes/datastore1/esxi-upgrade/upgrade-from-esxi4.0-to-4.1-update01-348481.zip update
    
    Unpacking vmware-esx-tools-light-4.1.0-1.4.348481.i386.vib          ################################################################################################ [100%]
    Unpacking vmware-esx-firmware-4.1.0-1.4.348481.i386.vib             ################################################################################################ [100%]
    Unpacking cross_oem-vmware-esx-drivers-net-vxge_400.2.0.28.21239-.. ################################################################################################ [100%]
    Unpacking vmware-esx-esxupdate-esxi-4.1.0-0.0.260247.i386.vib       ################################################################################################ [100%]
    Removing packages :vmware-esx-tools-light vmware-esx-viclient       ################################################################################################ [100%]
    Installing packages :deb_vmware-esx-esxupdate-esxi_4.1.0-0.0.260247 ################################################################################################ [100%]
    Installing packages :deb_vmware-esx-firmware_4.1.0-1.4.348481       ################################################################################################ [100%]
    Installing packages :cross_oem-vmware-esx-drivers-net-vxge_400.2... ################################################################################################ [100%]
    Installing packages :deb_vmware-esx-tools-light_4.1.0-1.4.348481    ################################################################################################ [100%]
    
    Running [/usr/sbin/vmkmod-install.sh]...
    ok.
    The update completed successfully, but the system needs to be rebooted for the
    changes to be effective.
    
  7. Exit maintenance mode and reboot
    
    # vim-cmd /hostsvc/maintenance_mode_exit
    # vim-cmd /hostsvc/runtimeinfo | grep inMaintenanceMode
       inMaintenanceMode = false, 
    
    # reboot
    

Thursday, May 26, 2011

Fedora 15 - How To Modify the Clock to Show Date

Fedora 15 was released on 05/24/2011, introducing the world to Gnome 3 (and Gnome Shell).

For those of us used to the Windows 95 look and feel of Gnome 2, the new Gnome is going to take some getting used to.

I'm going to borrow a line from Dos Equis, "I don't always use the graphical desktop in Linux, but when I do, I prefer Gnome"

Here's a quick configuration tip for Gnome 3. One of the first things I noticed is that the clock applet doesn't show the date by default; it only shows the day and time.

I always forget the date and like the clock to show Day Date, Time. The applet doesn't provide a preferences option when clicked, so there isn't a way to make this change within the applet itself (correct me if I'm wrong, please).

There is a way, however: the dconf-editor tool! On my Fedora 15 test workstation, dconf-editor was not installed by default.

$ sudo yum install dconf-editor


Once it's installed, run the application as your user (not root)

$ dconf-editor


The "Configuration Editor" window will open. Click the following in the tree on the left:
  • org
  • gnome
  • shell
  • clock
Click the check box next to "show-date". Your clock applet should now display something like: Thu May 26, 09:58

It's also possible to move the position of the clock and other applets via the dconf-editor tool.

An anonymous comment indicated that the F15 release notes also contain a method of configuring the clock (see section 2.1.1.1.7): http://docs.fedoraproject.org/en-US/Fedora/15/html/Release_Notes/sect-Release_Notes-Changes_for_Desktop_Users.html
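
If you prefer the command line, the same key that dconf-editor exposes can be set with gsettings (assuming the org.gnome.shell.clock schema shown above):

$ gsettings set org.gnome.shell.clock show-date true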

Wednesday, March 23, 2011

Install Lustre Monitoring Tool (LMT) on CentOS 5.5

In this article, I document the steps to build and install LMT and its dependency, Cerebro. The configuration of Cerebro can get pretty complex; in this example we keep it simple to focus on LMT.

I don't cover MySQL configuration yet but plan to do so in the near future.

Cerebro and LMT Build Instructions

Both the build and install systems run CentOS 5.5 x86_64.
  1. Download and build cerebro (http://sourceforge.net/projects/cerebro/files/cerebro/) on your favorite build machine (make sure to set up your ~/rpmbuild directory structure and your ~/.rpmmacros file; a sketch follows these build steps).
    • Download the latest source code (1.12 at this time) to ~/rpmbuild/SOURCES/
    • Download the src.rpm (I found it under the version 1.10 tree) and extract
      
      $ mkdir ~/sources/cerebro
      $ cd ~/sources/cerebro
      $ rpm2cpio cerebro-1.10-1.src.rpm | cpio -idvm
      $ mv cerebro.spec ~/rpmbuild/SPECS/
      
      
    • Modify the cerebro.spec file as follows for version 1.12 (unified diff format)
      
      --- cerebro.spec 2010-04-07 16:17:35.000000000 -0500
      +++ cerebro.spec.new 2011-03-23 14:25:02.654373643 -0500
      @@ -1,12 +1,12 @@
       Name:    cerebro 
      -Version: 1.10
      +Version: 1.12
       Release: 1
       
       Summary: Cerebro cluster monitoring tools and libraries
       Group: System Environment/Base
       License: GPL
      -Source: cerebro-1.10.tar.gz
      -BuildRoot: %{_tmppath}/cerebro-1.10
      +Source: cerebro-1.12.tar.gz
      +BuildRoot: %{_tmppath}/cerebro-1.12
       
       %description
       Cerebro is a collection of cluster monitoring tools and libraries.
      @@ -90,7 +90,7 @@
       Event module to monitor node up/down.
       
       %prep
      -%setup  -q -n cerebro-1.10
      +%setup  -q -n cerebro-1.12
       
       %build
       %configure --program-prefix=%{?_program_prefix:%{_program_prefix}} \
      @@ -157,6 +157,7 @@
       %defattr(-,root,root)
       %doc README NEWS ChangeLog DISCLAIMER DISCLAIMER.UC COPYING
       %config(noreplace) %{_sysconfdir}/init.d/cerebrod
      +%config(noreplace) %{_sysconfdir}/cerebro.conf
       %{_includedir}/*
       %dir %{_libdir}/cerebro
       %{_libdir}/libcerebro*
      
    • Before building the RPM I had to comment out the %_vendor string in my .rpmmacros file; otherwise configure kept adding the vendor to the --target switch
    • Build the RPM; this will build several RPMs, but for the Lustre Monitoring Tool all we need is the cerebro package
      
      $ rpmbuild -ba --sign ~/rpmbuild/SPECS/cerebro.spec
      
      
    • Look at the package info
      
      $ rpm -qpi ~/rpmbuild/RPMS/x86_64/cerebro-1.12-1.x86_64.rpm 
      Name        : cerebro                      Relocations: (not relocatable)
      Version     : 1.12                              Vendor: (none)
      Release     : 1                             Build Date: Wed 23 Mar 2011 02:12:09 PM CDT
      Install Date: (not installed)               Build Host: buildhost01
      Group       : System Environment/Base       Source RPM: cerebro-1.12-1.src.rpm
      Size        : 1039859                          License: GPL
      Signature   : DSA/SHA1, Wed 23 Mar 2011 02:12:09 PM CDT, Key ID xxxx
      Summary     : Cerebro cluster monitoring tools and libraries
      Description :
      Cerebro is a collection of cluster monitoring tools and libraries.
      
      
    • Take a look at the contents of the rpm
      
      $ rpm -qpl ~/rpmbuild/RPMS/x86_64/cerebro-1.12-1.x86_64.rpm 
      /etc/cerebro.conf
      /etc/init.d/cerebrod
      /usr/include/cerebro
      /usr/include/cerebro.h
      ...
      
  2. LMT RPM build
    • Temporarily install cerebro to satisfy the build requirement
      
      $ sudo rpm -Uvh ~/rpmbuild/RPMS/x86_64/cerebro-1.12-1.x86_64.rpm
      
      
    • Install the lua-devel package from EPEL
      
      $ sudo yum install lua-devel
      
      =============================================================================================
       Package                Arch                Version                  Repository         Size
      =============================================================================================
      Installing:
       lua-devel              i386                5.1.4-4.el5              epel               18 k
       lua-devel              x86_64              5.1.4-4.el5              epel               18 k
      Installing for dependencies:
       lua                    i386                5.1.4-4.el5              epel              228 k
       lua                    x86_64              5.1.4-4.el5              epel              229 k
      
      
    • Download the lmt src rpm
      
      $ mkdir ~/sources/lmt
      $ cd ~/sources/lmt
      $ wget http://lmt.googlecode.com/files/lmt-3.1.2-1.src.rpm
      
      $ rpmbuild --rebuild --sign lmt-3.1.2-1.src.rpm
      
      
      $ ls -l ~/rpmbuild/RPMS/x86_64/lmt-*
      lmt-3.1.2-1.el5.myrepo.x86_64.rpm
      lmt-server-3.1.2-1.el5.myrepo.x86_64.rpm
      lmt-server-agent-3.1.2-1.el5.myrepo.x86_64.rpm
      
      
  3. LMT-GUI RPM build
    • Install the prerequisite java-devel
      
      $ sudo yum install java-devel
      
      =======================================================================================================
       Package                        Arch         Version                       Repository             Size
      =======================================================================================================
      Installing:
       java-1.6.0-openjdk-devel       x86_64       1:1.6.0.0-1.16.b17.el5        centos5-updates        12 M
      
      Transaction Summary
      =======================================================================================================
      
    • Download the lmt-gui src rpm and build
      
      $ mkdir ~/sources/lmt-gui
      $ cd ~/sources/lmt-gui
      $ wget http://lmt.googlecode.com/files/lmt-gui-3.0.0-1.src.rpm
      
      $ rpmbuild --rebuild --sign lmt-gui-3.0.0-1.src.rpm 
      
      
      
      $ rpm -qpi ~/rpmbuild/RPMS/x86_64/lmt-gui-3.0.0-1.el5.myrepo.x86_64.rpm 
      Name        : lmt-gui                      Relocations: (not relocatable)
      Version     : 3.0.0                             Vendor: (none)
      Release     : 1.el5.myrepo                 Build Date: Wed 23 Mar 2011 02:44:25 PM CDT
      Install Date: (not installed)               Build Host: build01
      Group       : Applications/System           Source RPM: lmt-gui-3.0.0-1.el5.myrepo.src.rpm
      Size        : 2347300                          License: GPL
      Signature   : DSA/SHA1, Wed 23 Mar 2011 02:44:25 PM CDT, Key ID xxxx
      Packager    : Jim Garlick 
      URL         : http://code.google.com/p/lmt
      Summary     : Lustre Montitoring Tools Client
      Description :
      Lustre Monitoring Tools (LMT) GUI Client
      
      
    • Next I copy the RPMs to our local repository
      
      $ cd ~/rpmbuild/RPMS/x86_64/
      $ cp -a lmt-* cerebro-1.12-1.x86_64.rpm /share/repo/mirror/myrepo/el5/x86_64/RPMS/
      
      $ cd ../../SRPMS
      $ cp -a cerebro-* /share/repo/mirror/myrepo/el5/SRPMS/
      $ cd ~/sources
      $ cp -a lmt/lmt-3.1.2-1.src.rpm lmt-gui/lmt-gui-3.0.0-1.src.rpm /share/repo/mirror/myrepo/el5/SRPMS/
      
    • Rebuild the repodata for the repository
      
      $ createrepo /share/repo/mirror/myrepo/el5/x86_64/
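
    As referenced in step 1, here is a minimal sketch of the rpmbuild tree and ~/.rpmmacros on the build machine; the %_gpg_name value is only a placeholder, needed because of the --sign flag:

      $ mkdir -p ~/rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
      $ echo '%_topdir %(echo $HOME)/rpmbuild' > ~/.rpmmacros
      $ echo '%_gpg_name My Repo Signing Key' >> ~/.rpmmacros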
      

Cerebro and LMT Install Instructions

  1. Install cerebro and lmt-server-agent on the MDSs and OSSs
    
    $ for n in mds-{0..1} oss-{0..2}; do ssh root@lustre-$n yum install -y cerebro lmt-server-agent ; done
    
  2. Install cerebro and lmt-server on the management server
    
    $ ssh root@management-server yum -y install cerebro lmt-server
    
  3. Modify the /etc/cerebro.conf file to look like this (by default the entire file consists of comments; append this to the end)
    • On the Lustre servers
      
      cerebro_metric_server 192.168.0.10
      cerebro_event_server 192.168.0.10
      cerebrod_heartbeat_frequency 10 20
      cerebrod_speak on
      cerebrod_speak_message_config 192.168.0.10
      cerebrod_listen off
      
    • On the management server
      
      cerebrod_heartbeat_frequency 10 20
      cerebrod_speak on
      cerebrod_speak_message_config 192.168.0.10
      cerebrod_listen on
      cerebrod_listen_message_config 192.168.0.10
      cerebrod_metric_controller on
      cerebro_metric_server 192.168.0.10
      cerebrod_event_server on
      cerebro_event_server 192.168.0.10
      
  4. Configure the daemon to start on the servers and management server
    
    $ for n in mds-{0..1} oss-{0..2}; do ssh root@lustre-$n "/sbin/chkconfig cerebrod on && /sbin/service cerebrod start" ; done
    
     $ ssh root@management-server "/sbin/chkconfig cerebrod on && /sbin/service cerebrod start"
    
    
  5. Log in to the management server and verify that it sees all of the servers (this can be run from any of the servers, not just the management server)
    
    $ /usr/sbin/cerebro-stat -m updown_state
    
    MODULE DIR = /usr/lib64/cerebro
    mgmt-srv: 1
    lustre-mds-0: 1
    lustre-mds-1: 1
    lustre-oss-0: 1
    lustre-oss-1: 1
    lustre-oss-2: 1
    
  6. Now run the -l switch to see the available metrics (lmt_mdt, lmt_ost and lmt_osc are added by the lmt-server package)
    
    $ /usr/sbin/cerebro-stat -l
    
    MODULE DIR = /usr/lib64/cerebro
    metric_names
    cluster_nodes
    lmt_mdt
    updown_state
    lmt_ost
    lmt_osc
    
  7. Run the ltop command on the management node to view a top-like output for the OSTs (it will default to the first Lustre file system found unless otherwise specified)
    
    $ ltop
    
    Filesystem: lustre
        Inodes:    209.344m total,     77.286m used ( 37%),    132.057m free
         Space:     42.978t total,     15.931t used ( 37%),     27.047t free
       Bytes/s:  0.000g read,       0.000g write,                 1 IOPS
       MDops/s:  4 open,        2 close,     285 getattr, 0 setattr
                     0 link,        0 unlink,      0 mkdir,         0 rmdir
                     1 statfs, 5 rename,      0 getxattr
    >OST S        OSS   Exp   CR rMB/s wMB/s  IOPS   LOCKS  LGR  LCR %cpu %mem %spc
    0000 F stre-oss-0   131    0     0     0     0  515290   87    0    0  100   41
    0001 F stre-oss-0   131    0     0     0     0  528633  106    0    0  100   41
    0002 F stre-oss-1   131    0     0     0     0  509573   16    0    0  100   35
    0003 F stre-oss-1   131    0     0     0     0  518495   21    0    0  100   36
    0004 F stre-oss-2   131    0     0     0     0  533299   49    0    0  100   34
    0005 F stre-oss-2   131    0     0     0     0  527621   61    0    0  100   35
    

Friday, March 18, 2011

Using check_openmanage with check_mk

Here's my guide to installing check_openmanage in an OMD site, in case it helps anyone:

This was done on the following system:
Unless otherwise specified, all paths are relative to the site owner's home (ex: /opt/omd/sites/mysite)
  1. Make sure your Dell servers have the following SNMP packages installed prior to installing OMSA: net-snmp, net-snmp-libs, net-snmp-utils (if not, it's easy to 'yum remove srvadmin-\*' and then 'yum install srvadmin-all'). A minimal snmpd / firewall sketch follows these steps.
    • Start the OMSA services 'srvadmin-services.sh start' and then check 'srvadmin-services.sh status' to verify that the snmpd component is running
    • Ensure that snmpd is running and configured
    • Configure the firewall to allow access from your OMD server to udp port 161
  2. Change users on your OMD server to the site user: $ su - mysite
  3. Download the latest check_openmanage from http://folk.uio.no/trondham/software/check_openmanage.html to ~/tmp and extract
  4. Copy the check_openmanage script to local/lib/nagios/plugins (this defaults to $USER2$ in your commands)
    
    $ cp tmp/check_openmanage-3.6.5/check_openmanage local/lib/nagios/plugins/
    $ chmod +x local/lib/nagios/plugins/check_openmanage
    
  5. Copy the PNP4Nagios template
    
    $ cp tmp/check_openmanage-3.6.5/check_openmanage.php etc/pnp4nagios/templates/
    
  6. If you are running CentOS 5.5 / RHEL 5.5 or earlier (it's unclear whether this will still be an issue in EL5.6) and you want performance graphs, you'll need to edit the check_openmanage.php template (see this bug: https://bugs.op5.com/bug_view_advanced_page.php?bug_id=4008). Comment out the original condition and replace it:
    
    $ vi etc/pnp4nagios/templates/check_openmanage.php
    
    ##    if(preg_match('/^enclosure_(?.+?)_temp_\d+$/', $NAME[$i], $matches)
    ##       || preg_match('/^e(?.+?)t\d+$/', $NAME[$i], $matches)){
    # This is the fixed line for CentOS 5.5 and prior
         if(preg_match('/^enclosure_(.+?)_temp_\d+$/', $NAME[$i], $matches)){
    
  7. Test check_openmanage to see that it can successfully query a node (ack, I need to update my driver)
    
    local/lib/nagios/plugins/check_openmanage -H dell-r710-01 -p -C MySecretCommunity
    
    Controller 1 [SAS 5/E Adapter]: Driver '3.04.13rh' is out of date|fan_0_system_board_fan_1_rpm=3600;0;0 fan_1_system_board_fan_3_rpm=3600;0;0 fan_2_system_board_fan_4_rpm=3600;0;0 fan_3_system_board_fan_5_rpm=3600;0;0 fan_4_system_board_fan_2_rpm=3600;0;0 pwr_mon_0_ps_1_current=0.6;0;0 pwr_mon_1_ps_2_current=0.4;0;0 pwr_mon_2_system_board_system_level=182;0;0 temp_0_system_board_ambient=21;42;47
    
    
  8. Edit the main.mk file to add tags to the OMSA hosts and the check command (I picked up the perfdata_format and monitoring_host settings from an earlier post to the mailing list; I'm not sure whether they are needed)
    
    all_hosts = [ 'dell-r710-01|linsrv|kvm|omsa|nonpub', 'dell-2950-01|linsrv|omsa|nonpub', 'hp-srv-01|winsrv|smb', ]
    
    # Are you using PNP4Nagios and MRPE checks? This will make PNP
    # choose the correct template for standard Nagios checks:
    perfdata_format = "pnp"
    #set the monitoring host
    monitoring_host = "nagios"
    
    # SNMP Community
    snmp_default_community = "someCommunityRO"
    
    snmp_communities = [
      ( "MySecretCommunity", ["nonpub"], ALL_HOSTS ),
    ]
    
    # other main.mk stuff
    
    extra_nagios_conf += r"""
    
    # ARG1: community string
    define command {
        command_name    check_openmanage
        command_line    $USER2$/check_openmanage -H $HOSTADDRESS$ -p -C $ARG1$
    }
    
    """
    
    legacy_checks = [
      # On all hosts with the tag 'omsa' check Dell OpenManage for status 
      # service description "Dell OMSA", process performance data
      ( ( "check_openmanage!MySecretCommunity", "Dell OMSA", True), [ "omsa" ], ALL_HOSTS ),
    ]
    
  9. That should be it; simply reload and your new check should start working for all hosts tagged with 'omsa'
    
    $ check_mk -O
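
As a companion to step 1, here is a minimal sketch of the snmpd and firewall setup on a managed Dell server; the community string, subnet, and OMD server address are assumptions matching the examples above, not values from the original post:

# echo 'rocommunity MySecretCommunity 192.168.1.0/24' >> /etc/snmp/snmpd.conf
# service snmpd restart
# iptables -I INPUT -p udp -s 192.168.1.5 --dport 161 -j ACCEPT
# service iptables save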
    
    
To make it cleaner, the legacy_checks entry should be able to determine the community string based on the settings in snmp_default_community and snmp_communities.

I've only been testing check_mk for a few days now and am not sure how to do that. (suggestions?)

Hope this helps, and comments are welcome.

Monday, February 21, 2011

Building Mellanox OFED 1.5.2 for Rocks 5.4

Here are my notes from building Mellanox OFED 1.5.2 on Rocks 5.4.

Perform the build steps on a compute node. That way, if the build process (run as root) has a bug, we don't risk having to rebuild the head node.

MLNX_OFED-1.5.2 comes with modules for kernel 2.6.18-194.el5; we are using 2.6.18-194.17.1.el5, so we need to build new kernel modules.

1. Download the ISO file MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5.iso from this page

2. Ensure that the build system is running the correct kernel

# uname -r

2.6.18-194.17.1.el5

3. Mount the ISO and copy the contents to a scratch work area

# mount -t iso9660 -o loop /root/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5.iso /mnt/cdrom 
# mkdir /root/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5
# cp -r /mnt/cdrom/* /root/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5/
# umount /mnt/cdrom
# rm /root/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5.iso

4. Install some dependencies

# yum -y install libtool tcl-devel libstdc++-devel mkisofs gcc-c++ rpm-build

5. Uninstall some RPM files that will fail to uninstall during the ISO build

# yum remove \*openmpi\*

6. Build the new ISO file

# cd /root/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5

# ./docs/mlnx_add_kernel_support.sh -i /root/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5.iso
Note: This program will create MLNX_OFED_LINUX ISO for rhel5.5 under /tmp directory.
      All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
Building OFED RPMs...
Removing OFED RPMs...
Running mkisofs...
Created /tmp/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5.iso

# mkdir /share/apps/mellanox/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5
# mv /tmp/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5.iso /share/apps/mellanox/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5.iso

7. Copy the new files from the iso to the NFS share

# mount -t iso9660 -o loop /share/apps/mellanox/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5.iso /mnt/cdrom
# rsync -a /mnt/cdrom/ /share/apps/mellanox/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5/

# umount /mnt/cdrom

8. List the new kernel modules

# cd /share/apps/mellanox/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5
# find . -name kernel-* | grep 194.17
./x86_64/kernel-ib-1.5.2-2.6.18_194.17.1.el5.x86_64.rpm
./x86_64/kernel-mft-2.6.2-2.6.18_194.17.1.el5.x86_64.rpm
./x86_64/kernel-ib-devel-1.5.2-2.6.18_194.17.1.el5.x86_64.rpm

9. Test the installer on one of the compute nodes

# cd /share/apps/mellanox/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5
# ./mlnxofedinstall --force --hpc

This will automatically update the firmware on the HCA.

10. This OFED can be installed on the compute nodes by adding the section below to extend-compute.xml (note: I normally put other driver updates into this 'post-98-installdrivers' script as well). Also notice the yum install near the end of the script; the MLNX OFED install removes any package containing 'openmpi' in the package name, and that line reinstalls said packages


<file name="/etc/rc.d/rocksconfig.d/post-98-installdrivers" perms="0755">
#!/bin/sh

# Install Mellanox
if [ "$(/sbin/lspci | grep -i connectx)" != "" ] ; then
  /usr/bin/yum -y remove openmpi\* rocks-openmpi\*
  /share/apps/mellanox/MLNX_OFED_LINUX-1.5.2-2.0.0-rhel5.5-2.6.18-194.17.1.el5/mlnxofedinstall --hpc --force

  /sbin/chkconfig --add openibd
  /sbin/chkconfig openibd on
  /sbin/service openibd start
fi

/usr/bin/yum -y install my-custom-openmpi my-custom-application-openmpi

/bin/mv /etc/rc.d/rocksconfig.d/post-98-installdrivers /root/post-98-installdrivers

# Reboot one final time
/sbin/shutdown -r now

</file>
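
Once a node has rebooted with the new OFED, a quick sanity check with the standard OFED utilities confirms the HCA ports are up (run on the compute node):

# ibstat | grep -E 'State|Rate'
# ibv_devinfo | grep state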

Adding Infiniband over IP to Rocks

20120611 - Based on a question to the Rocks mailing list, I'm adding this section to explain how to enable TCP/IP over Infiniband via Rocks. This process should add the IP addresses to the Rocks-managed DNS / hosts files. The IP addresses of my compute-0-x nodes start at 254 and work backwards, so that's what I used for the IB IP addresses.

First add the new network, calling it 'infiniband', or whatever name you'd like:
# rocks add network infiniband subnet=192.168.3.0 netmask=255.255.255.0
# ip=254 && for node in {1..16}; do
   rocks add host interface compute-0-${node} ib0 \
     ip=192.168.3.${ip} subnet=infiniband ;
   let ip=${ip}-1 ;
done
Repeat for the next set of nodes
# ip=238 && for node in {1..16}; do
   rocks add host interface compute-1-${node} ib0 \
     ip=192.168.3.${ip} subnet=infiniband ;
   let ip=${ip}-1 ;
done
And so on... Next, change the sshd configuration on the compute nodes to not use DNS; I have found that ssh to the compute nodes takes close to a minute when this is set to true:
# rocks set attr ssh_use_dns false
Synchronize the configuration

# rocks sync config
Now open the firewall on ib0 (the infiniband network) for all ports and protocols:
# rocks open appliance firewall compute \
   network=infiniband service="all" protocol="all"

# rocks sync host firewall compute

# rocks list host firewall compute-0-1
SERVICE PROTOCOL CHAIN ACTION NETWORK   OUTPUT-NETWORK FLAGS                                COMMENT SOURCE
ssh     tcp      INPUT ACCEPT public     -------------- -m state --state NEW                 ------- G     
all     all      INPUT ACCEPT public     -------------- -m state --state RELATED,ESTABLISHED ------- G     
all     all      INPUT ACCEPT infiniband -------------- ------------------------------------ ------- A     
all     all      INPUT ACCEPT private    -------------- ------------------------------------ ------- G     
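
To verify, list the interface Rocks assigned to a node and test IPoIB connectivity between two nodes; the node names and addresses below simply follow the example above:

# rocks list host interface compute-0-1
# ssh compute-0-1 ping -c 3 192.168.3.253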
Hope this helps

Building Mellanox OFED 1.4 for Rocks 5.3

Here are my notes from building Mellanox OFED 1.4 on a Rocks 5.3 x86_64 cluster utilizing CentOS 5.4 and kernel 2.6.18-128.7.1:

1. Download the ISO file MLNX_OFED_LINUX-1.4-rhel5.3.iso from this page

MLNX_OFED-1.4 comes with modules for kernel 2.6.18-128; we are using 2.6.18-128.7.1, so we need to build new modules.

2. Mount the ISO and copy the contents to a scratch work area

# mount -t iso9660 -o loop /root/MLNX_OFED_LINUX-1.4-rhel5.3.iso /mnt/cdrom 
# mkdir /root/MLNX_OFED_LINUX-1.4
# cp -r /mnt/cdrom/* /root/MLNX_OFED_LINUX-1.4/
# umount /mnt/cdrom

3. Edit the script so that it will work with CentOS (our centos-release says 5.4, but we are still running a 5.3 kernel); this is the script that will build the new ISO file

# cd /root/MLNX_OFED_LINUX-1.4

Index: docs/mlnx_add_kernel_support.sh
===================================================================
--- docs/mlnx_add_kernel_support.sh.orig 2009-12-17 15:51:46.000000000 -0600
+++ docs/mlnx_add_kernel_support.sh 2009-12-17 15:52:00.000000000 -0600
@@ -279,7 +279,7 @@
         redhat-release-5Server-5.2.0.4)
         distro="rhel5.2"
         ;;
-        redhat-release-5Server-5.3.0.3)
+        redhat-release-5Server-5.3.0.3 | centos-release-5-4.el5.centos.1 )
         distro="rhel5.3"
         ;;
         sles-release-10-15.2)
4. Install some dependencies

# yum -y install libtool tcl-devel libstdc++-devel mkisofs gcc-c++

5. Uninstall some RPM files that will fail to uninstall during the ISO build

/bin/rpm --nodeps -e --allmatches openmpi-libs-1.3.2-2.el5 \
 openmpi-devel-1.3.2-2.el5 rocks-openmpi-1.3.2-1 openmpi-libs-1.3.2-2.el5 \
 openmpi-devel-1.3.2-2.el5 openmpi-1.3.2-2.el5 openmpi-1.3.2-2.el5 \
 openmpi-gnu-1.3.3-1.el5.uabeng

6. Build the new ISO file

# cd /root/MLNX_OFED_LINUX-1.4
# ./docs/mlnx_add_kernel_support.sh -i /root/MLNX_OFED_LINUX-1.4-rhel5.3.iso

Note: This program will create MLNX_OFED_LINUX ISO for rhel5.3 under /tmp directory.
      All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
Building OFED RPMs...
Removing OFED RPMs...
Running mkisofs...
Created /tmp/MLNX_OFED_LINUX-1.4-rhel5.3.iso

# mv /tmp/MLNX_OFED_LINUX-1.4-rhel5.3.iso /share/apps/mellanox/MLNX_OFED_LINUX-1.4-rhel5.3-kernel-2.6.18_128.7.1.iso

7. Copy the new files from the iso to the NFS share

# mount -t iso9660 -o loop /share/apps/mellanox/MLNX_OFED_LINUX-1.4-rhel5.3-kernel-2.6.18_128.7.1.iso /mnt/cdrom
# cp -r /mnt/cdrom /share/apps/mellanox/MLNX_OFED_LINUX-1.4-rhel5.3-kernel-2.6.18_128.7.1

# cd /share/apps/mellanox
# find ./MLNX_OFED_LINUX-1.4-rhel5.3-kernel-2.6.18_128.7.1 -name kernel-* | grep x86
./MLNX_OFED_LINUX-1.4-rhel5.3-kernel-2.6.18_128.7.1/x86_64/kernel-ib-1.4-2.6.18_128.7.1.el5.x86_64.rpm
./MLNX_OFED_LINUX-1.4-rhel5.3-kernel-2.6.18_128.7.1/x86_64/kernel-ib-1.4-2.6.18_128.el5.x86_64.rpm
./MLNX_OFED_LINUX-1.4-rhel5.3-kernel-2.6.18_128.7.1/x86_64/kernel-ib-devel-1.4-2.6.18_128.el5.x86_64.rpm
./MLNX_OFED_LINUX-1.4-rhel5.3-kernel-2.6.18_128.7.1/x86_64/kernel-ib-devel-1.4-2.6.18_128.7.1.el5.x86_64.rpm

8. Test the installer on the compute node

# cd /share/apps/mellanox/MLNX_OFED_LINUX-1.4-rhel5.3-kernel-2.6.18_128.7.1
# ./mlnxofedinstall --hpc

This program will install the MLNX_OFED_LINUX package on your machine.
Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed. 
Do you want to continue?[y/N]:y

Uninstalling the previous version of OFED 

Starting MLNX_OFED_LINUX-1.4 installation ... 

Installing mpi-selector RPM 
Preparing...                ########################################### [100%]
   1:mpi-selector           ########################################### [100%]
Installing kernel-ib RPM 
Preparing...                ########################################### [100%]
   1:kernel-ib              ########################################### [100%]
Installing ib-bonding RPM 
Preparing...                ########################################### [100%]
   1:ib-bonding             ########################################### [100%]
Installing mft RPM 
Preparing...                ########################################### [100%]
   1:mft                    ########################################### [100%]
Install user level RPMs: 
Preparing...                ########################################### [100%]
   1:libibverbs             ########################################### [  2%]
   2:libibcommon            ########################################### [  4%]
   3:libibumad              ########################################### [  6%]
   4:opensm-libs            ########################################### [  8%]
   5:librdmacm              ########################################### [ 10%]
   6:openmpi_intel          ########################################### [ 12%]
   7:libibmad               ########################################### [ 14%]
   8:infiniband-diags       ########################################### [ 16%]
   9:openmpi_gcc            ########################################### [ 18%]
  10:mpitests_openmpi_gcc   ########################################### [ 20%]
  11:mpitests_openmpi_pgi   ########################################### [ 22%]
  12:mpitests_openmpi_intel ########################################### [ 24%]
  13:qperf                  ########################################### [ 26%]
  14:perftest               ########################################### [ 28%]
  15:ibutils                ########################################### [ 30%]
  16:libmthca               ########################################### [ 32%]
  17:libmlx4                ########################################### [ 34%]
  18:openmpi_pgi            ########################################### [ 36%]
  19:mstflint               ########################################### [ 38%]
  20:mlnxofed-docs          ########################################### [ 40%]
  21:ofed-scripts           ########################################### [ 42%]
  22:libibverbs             ########################################### [ 44%]
  23:libibcommon            ########################################### [ 46%]
  24:libibumad              ########################################### [ 48%]
  25:mvapich_intel          ########################################### [ 50%]
  26:opensm-libs            ########################################### [ 52%]
  27:librdmacm              ########################################### [ 54%]
  28:libibcommon-devel      ########################################### [ 56%]
  29:libibumad-devel        ########################################### [ 58%]
  30:libibverbs-devel       ########################################### [ 60%]
  31:librdmacm-utils        ########################################### [ 62%]
  32:opensm                 ########################################### [ 64%]
  33:mvapich_gcc            ########################################### [ 66%]
  34:mpitests_mvapich_gcc   ########################################### [ 68%]
  35:mpitests_mvapich_pgi   ########################################### [ 70%]
  36:mpitests_mvapich_intel ########################################### [ 72%]
  37:mvapich_pgi            ########################################### [ 74%]
  38:libibverbs-utils       ########################################### [ 76%]
  39:librdmacm-devel        ########################################### [ 78%]
  40:librdmacm-devel        ########################################### [ 80%]
  41:opensm-devel           ########################################### [ 82%]
  42:opensm-devel           ########################################### [ 84%]
  43:libibumad-devel        ########################################### [ 86%]
  44:libibcommon-devel      ########################################### [ 88%]
  45:libibverbs-devel       ########################################### [ 90%]
  46:libibmad               ########################################### [ 92%]
  47:libmthca               ########################################### [ 94%]
  48:libmlx4                ########################################### [ 96%]
  49:libibmad-devel         ########################################### [ 98%]
  50:libibmad-devel         ########################################### [100%]
Device (15b3:673c):
        0c:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev a0)
        Link Width: 8x
        Link Speed: 2.5Gb/s


Installation finished successfully. 

The firmware version 2.6.0 is up to date. 
Note: To force firmware update use '--force-fw-update' flag.
Configuring /etc/security/limits.conf. 
warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave