HP P2000 G3 FC MSA – troubleshooting a faulty Controller (blinking Fault/Service Required LED)

Setting up a new HP P2000 G3 FC MSA with dual controllers for a small staging environment over the last couple of days, I ran into issues from the word go. The device in question was loaded with 24 SFF disks and two controllers (Controller A and Controller B).

 

On the very first boot we noticed an amber fault LED on the front panel. Inspecting the back of the unit, I saw that Controller A and Controller B were both still flashing their green “FRU OK” LEDs (which, according to the manual, means the controllers are still booting up), even after waiting a number of hours. On Controller A, I could also see a blinking amber “Fault/Service Required” LED. Following the troubleshooting steps in the manuals led nowhere, as the end synopsis was simply to check the event logs. Even the web interface was acting up – I could not see the controllers listed, could not see any disks, and the event logs were completely empty. Clearly there was a larger issue at hand preventing the MSA, and even its web interface, from functioning properly.

To further confuse matters, after shutting down and restarting the device, Controller B started blinking its amber LED instead of Controller A, with both controllers still stuck in their “booting up” state. Refer to the linked LED diagram below: on the top controller in the diagram, the flashing green LED is labelled 6 and the blinking amber LED is labelled 7.

LED Diagram

HP Official documentation

After powering the unit down completely and powering it back up again, the MSA was still stuck in the same state. Powering the unit down once more and removing and reseating both controllers did not help either. Lastly, I powered it all off again, removed Controller A completely, then powered up the device with just Controller B installed. Surprisingly, the MSA booted up perfectly, and LED number 6 (FRU OK) went a nice solid green after a minute or so of booting. No amber LEDs were to be seen. Good news then! Hot-plugging Controller A back in at this stage, with the device powered on, resulted in both controllers reporting a healthy status and all the disks and hardware being detected. As a final test I powered everything off and back on again from a cold start, as it would normally be. Everything worked this time.

 

Here is a photo of the rear of the device once all was resolved showing the solid green FRU OK LEDs on both controllers.

 

 

Bit of an odd one, but it would seem that the two controllers were somehow preventing each other from starting up. Removing one, booting up with the remaining controller, and then hot-plugging the other back in seemed to solve the problem, and at the end of the day all of the hardware was indeed healthy. After this the 24 disks were assigned and carved up into some vdisks to be presented to our ESXi hosts!
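
With the LUNs presented, the last step was a storage rescan on the hosts so the new vdisks showed up. Here is a minimal PowerCLI sketch of that step – the vCenter and cluster names are placeholders, so treat it as a rough guide rather than a copy/paste:

# Rescan all HBAs and VMFS volumes on every host in the cluster so the newly
# presented MSA LUNs are detected (vCenter and cluster names are placeholders)
Connect-VIServer -Server vcenter.lab.local
Get-Cluster -Name 'Staging' | Get-VMHost | Get-VMHostStorage -RescanAllHba -RescanVmfs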

 

Troubleshooting & Fixing VMware Host Profile errors

 

Synopsis

 

Trying to apply a Host Profile created from another host in a cluster today, I got an error message which resulted in only some of the Host Profile actually being applied.

 

A specified parameter was not correct. changedValue.key

 

Error message received after trying to apply profile to Host

 

I thought that the error message looked familiar but couldn’t quite place it at the time, so I left what I was doing to take a look at it again later. On my way home this evening I had a bit of a brainwave – the ESX host I had taken the original profile from was at a slightly different update level (Update 2) to the newer host I was applying the profile to. I also remembered where I had seen the text “changedValue.key” from the error message before – when changing Advanced Settings on a host using PowerCLI! This gave me a good idea of where to look for the issue with this Host Profile – the Advanced Settings section of the profile.
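
As an aside, this is roughly the sort of PowerCLI call I had in mind – reading and changing a host Advanced Setting. The snippet below is just a sketch from memory (the host name and value are placeholders, and these older cmdlets have since been superseded by Get-AdvancedSetting / Set-AdvancedSetting in newer PowerCLI releases):

# Read the advanced setting in question on a host (placeholder host name)
Get-VMHost esxhost01.lab.local | Get-VMHostAdvancedConfiguration -Name Misc.HostAgentUpdateLevel

# Trying to write a read-only key like this one goes through the same
# OptionManager.updateValues call seen in the hostd.log excerpt further down,
# and fails with a similar InvalidArgument / "changedValue.key" style of error
Set-VMHostAdvancedConfiguration -VMHost (Get-VMHost esxhost01.lab.local) -Name Misc.HostAgentUpdateLevel -Value 4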

 

I was fairly sure it was down to a value that differed between the hosts because of their differing update levels, but to confirm this I decided to hit the log files. Opening up /var/log/vmware/hostd.log on the host and navigating down to the time I tried to apply the Host Profile, I found the following (the interesting bit is in the screenshot below, with the full log text in the section after that):

 

Interesting bits of information that help point to the issue in hostd.log

 

[2012-02-20 19:50:07.849 F66966D0 info 'TaskManager'] Task Created : haTask-ha-host-vim.option.OptionManager.updateValues-896
[2012-02-20 19:50:07.853 F66966D0 verbose 'VersionOptionProvider'] Attempt to set readonly option
[2012-02-20 19:50:07.853 F66966D0 info 'App'] AdapterServer caught exception: vmodl.fault.InvalidArgument
[2012-02-20 19:50:07.853 F66966D0 info 'TaskManager'] Task Completed : haTask-ha-host-vim.option.OptionManager.updateValues-896 Status error
[2012-02-20 19:50:07.853 F66966D0 info 'Vmomi'] Activation [N5Vmomi10ActivationE:0x5cf27a98] : Invoke done [updateValues] on [vim.option.OptionManager:ha-adv-options]
[2012-02-20 19:50:07.853 F66966D0 verbose 'Vmomi'] Arg changedValue:
(vim.option.OptionValue) [
   (vim.option.OptionValue) {
      dynamicType = <unset>,
      key = "Misc.HostAgentUpdateLevel",
      value = "2",
   },
   (vim.option.OptionValue) {
      dynamicType = <unset>,
      key = "Misc.HostAgentUpdateLevel",
      value = "2",
   }
]
[2012-02-20 19:50:07.853 F66966D0 info 'Vmomi'] Throw vmodl.fault.InvalidArgument
[2012-02-20 19:50:07.853 F66966D0 info 'Vmomi'] Result:
(vmodl.fault.InvalidArgument) {
   dynamicType = <unset>,
   faultCause = (vmodl.MethodFault) null,
   invalidProperty = "changedValue.key",
   msg = "",
}
[root@hostnamehere vmware]#

 

The cause:

 

So we can see that the Host Profile attempted a “change value” (changedValue) on the key “Misc.HostAgentUpdateLevel”, and this is where the error was thrown, with an “invalidProperty” of changedValue.key. Googling the message “vmodl.fault.InvalidArgument” leads to the VMware SDK Reference Guide, which states that “An InvalidArgument exception is thrown if the set of arguments passed to the function is not specified correctly.” In this case it is happening because the value being changed is actually read-only on the host – the “Attempt to set readonly option” line in the log above confirms this. That makes sense, as the key simply reflects the update level of the host and isn’t something you would normally want to change.

 

The issue here was of course that the original host the profile was based on is at Update 2, whereas the new host having the profile applied is at Update 4. Because the two values differ, Host Profiles tries to change the setting on the new host; since the setting is read-only, the update fails and throws this error message, which (annoyingly) also results in the rest of the Host Profile not being applied. Ideally Host Profiles would simply skip a read-only value like this rather than try to change it.
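
A quick way to see the mismatch between the two hosts is simply to compare their version and build from PowerCLI. The host names below are placeholders:

# Compare the version and build of the reference host and the target host -
# the differing update levels are what make the profile try to change the key
Get-VMHost refhost.lab.local, newhost.lab.local | Select-Object Name, Version, Build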

 

Solution:

 

So the simple solution is to either:

 

  • Take a Host Profile from a host that has the settings you need and is on the same update level as the hosts you will be applying the profile to.
  • Edit the Host Profile and remove the Advanced Setting for “Misc.HostAgentUpdateLevel”.

 

In my case, I was testing the Host Profile on a clean ESX host before using it for other hosts – which meant I only had one new ESX host at this particular update level and therefore couldn’t use the first option (taking the profile from an existing host). So I went to Home -> Host Profiles and edited the Host Profile to get rid of the unnecessary “Misc.HostAgentUpdateLevel” key, like so:

 

Remove the two entries for "Misc.HostAgentUpdateLevel" from the Host Profile

 

After removing the entries referring to this read-only key, I simply re-applied the profile, and this time around all the settings went on as expected with no error message. So to sum it all up: first check that you aren’t taking a Host Profile from a reference host at a different update level than your target hosts (and if you have to, you can resort to manually editing your profile as I did). If you get cryptic errors applying your Host Profiles, check your host log files for more information and clues as to where the issue may lie.
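
For what it’s worth, the re-apply and a compliance check can also be done from PowerCLI rather than the vSphere Client. This is a rough sketch from memory – the host and profile names are placeholders, and the host generally needs to be in maintenance mode before a profile is applied:

# Apply the edited profile to the target host, then check compliance
$vmhost = Get-VMHost -Name newhost.lab.local
$hp     = Get-VMHostProfile -Name "ReferenceHost-Profile"

Set-VMHost -VMHost $vmhost -State Maintenance -Confirm:$false
Apply-VMHostProfile -Entity $vmhost -Profile $hp -Confirm:$false
Test-VMHostProfileCompliance -VMHost $vmhost
Set-VMHost -VMHost $vmhost -State Connected -Confirm:$false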

 

Troubleshooting VMware Update Manager errors

 

Today I was creating an upgrade baseline for some old ESX 4.x hosts to be patched up to a newer update level. I ran into an error whilst uploading an ESX ISO with the new update version and subsequently found myself troubleshooting the issue. I thought I would do a quick post on general things to check when troubleshooting VMware Update Manager.

 

  • First of all, check that you are uploading the correct file / ISO! Note that ESX upgrade baselines work with ISO files, while ESXi upgrade baselines use .zip files. Ensure you are using the correct file and build of ESX or ESXi, depending on which you are planning to use.
  • Consult the log files! Logs are kept in different locations depending on the OS that VUM is running on.
    • Windows XP, 2000, and 2003 – C:\Documents and Settings\All Users\Application Data\VMware\VMware Update Manager\Logs
    • Windows Server 2008 and above – C:\ProgramData\VMware\VMware Update Manager\Logs\
    • ESX update manager logs are kept in – /var/log/vmware/esxupdate.log
      • Use cat /var/log/vmware/esxupdate.log | more to view the log file on ESX from the service console shell or a PuTTY SSH session.
  • The log file in Windows should be named something similar to “vmware-vum-server-log4cpp.log”
  • You should be able to locate the issue by noting the time it occurred in Update Manager, then opening the relevant log file and navigating down to that time. Hopefully the entry there will point you in the right direction – see the sketch just after this list for a quick way of doing this on a Windows VUM server.
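
As an example of that last point, here is a minimal PowerShell sketch for a VUM server on Windows 2008 or later – the path and the timestamp pattern are assumptions and would be adjusted to your environment and the time your error occurred:

# Pull the log lines around the time of the error, with two lines of context either side
$vumLog = 'C:\ProgramData\VMware\VMware Update Manager\Logs\vmware-vum-server-log4cpp.log'
Select-String -Path $vumLog -Pattern '2012-02-14 14:28' -Context 2,2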

 

In my case today, I was trying to upload an ESX 4.0 Update 4 ISO (Complete) to create a new Upgrade Baseline for some older ESX 4.0 hosts. I got an error after uploading the ISO using the new baseline upgrade wizard. See below:

 

Error message after uploading ISO for Host Upgrade Baseline

 

Although the message above in the GUI is not very descriptive, after looking at the log files on the Update Manager server I found an entry which explained what my problem was:

 

Importing classic ESX upgrade package from C:\WINDOWS\TEMP\vum-temp2748786559745325611upload.iso
[2012-02-14 14:28:21:100 'HostUpgradeMetadata' 14028 DEBUG]  [metadata, 682] Stamped MD5: 580834a00621d98be322deb4b31971d8
[2012-02-14 14:28:21:100 'HostUpgradeMetadata' 14028 DEBUG]  [metadata, 559] ComputeISOChecksum started...
[2012-02-14 14:28:25:026 'HostUpgradeMetadata' 14028 DEBUG]  [metadata, 597] ComputeISOChecksum finished...
[2012-02-14 14:28:25:026 'HostUpgradeMetadata' 14028 ERROR]  [metadata, 693] MD5 check failed: f7a4523e2b7312b9b0f5441f8fa1f9d5
[2012-02-14 14:28:25:026 'HostUpgradeMetadata' 14028 ERROR]  [metadata, 721] Integrity check of upgrade ISO failed

 

The problem was that my ISO file was corrupt – the download had seemed to complete just fine from vmware.com, but something must have gone wrong along the way. A quick check of the ISO using an md5sum command-line utility in Windows confirmed that the MD5 hash of the ISO file did not match the MD5 hash listed for the ISO on vmware.com (as pointed out in the log files above).
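
If you don’t have an md5sum utility to hand, the same check can be done natively in PowerShell (Get-FileHash needs PowerShell 4.0 or later). The ISO path below is a placeholder, and the expected value is whatever hash vmware.com lists for your download:

# Compare the MD5 of the downloaded ISO against the hash published on vmware.com
$expected = '580834a00621d98be322deb4b31971d8'    # value listed on the download page
$actual   = (Get-FileHash -Path 'C:\ISOs\esx-4.0-update04.iso' -Algorithm MD5).Hash
if ($actual -eq $expected) { 'Checksum OK' } else { "Checksum mismatch: $actual" }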

 

I downloaded a fresh copy of the ISO, checked the MD5 again to ensure it matched this time, and re-uploaded it to create the new baseline. Everything worked as expected this time around.