vSAN Invalid State Component Metadata Fixed!

Just a quick note to follow on from my post regarding the invalid component metadata on vSAN. This is now fixed by VMware in the latest round of patches.

https://kb.vmware.com/kb/2145347

I recently upgraded my lab servers to the latest round of patches (end June 2017), and the errors which previously appeared after updating the disk firmware no longer show up when I apply VUM patches. Nice!

Get vSAN Invalid State Metadata using PowerShell

I've had some ‘fun’ with our All-Flash vSAN clusters recently, after updating to 6.0 U3 and then applying the newly VMware-certified HBA firmware / driver for our HPE DL380 servers. Every time I have updated / rebooted we end up with invalid metadata:

We’ve had a couple of tickets open with VMware on this issue now, and a fix is still outstanding. It was scheduled for 6.0 U3 but failed a regression test. So for now, every time we patch / reboot we have to go and fix these!

The vSAN health page shown above only shows the affected component. In a Hybrid environment you can remove the capacity disk from the diskgroup and re-add it to resolve this, but for All-Flash you need to remove the entire diskgroup. Our servers have 2 x diskgroups per host, so we need to identify which diskgroup needs destroying.

To discover the diskgroup, you have to identify which capacity disk the affected component resides on. There is a VMware KB article for this – but it never worked on our vSAN nodes, so VMware support provided us with a different set of commands to obtain the information. The VMware KB is HERE.

Now I’ve ended up doing this several times, and decided to pull it into a PowerShell function to make life easier. It returns an object showing the host, Disk UUID, Disk Name & affected components:

The script does require Posh-SSH, as it connects to the host over SSH to obtain the information. You can download this over at the PowerShell Gallery.
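If you don’t already have Posh-SSH, installing it from the Gallery should be as simple as:

Install-Module -Name Posh-SSH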

Here’s the code I put together:

function Get-vSANBadMetadataDisks {
  [cmdletbinding()]
  param(
    $vmhostname,
    $rootcred
  )

  # Prompt for anything not passed in as a parameter
  if (!$vmhostname) {
    $vmhostname = Read-Host "Enter FQDN of esxi host"
  }
  if (!$rootcred) {
    $rootcred = Get-Credential -UserName 'root' -Message 'Enter root password'
  }

  Write-Host "[$vmhostname : Gathering Metadata information]" -ForegroundColor Cyan

  # Requires the Posh-SSH module - connect to the host over SSH as root
  $session = New-SSHSession -ComputerName $vmhostname -Credential $rootcred -AcceptKey
  if (!$session) {
    Write-Warning "error connecting. exiting"
    return
  }

  # List the vSAN disk UUIDs known to LSOM (sed strips the trailing slash)
  $Disks = (Invoke-SSHCommand -SSHSession $session -Command 'vsish -e ls /vmkModules/lsom/disks/ | sed "s/.$//"').Output

  $Report = @()
  foreach ($Disk in $Disks) {
    # Map the vSAN disk UUID back to its NAA identifier, then to the device display name
    $NAA = (Invoke-SSHCommand -SSHSession $session -Command "localcli vsan storage list | grep -B 2 $Disk | grep Displ").Output.Split(':')[1].Trim()
    $Display = (Invoke-SSHCommand -SSHSession $session -Command "esxcli storage core device list | grep -A 1 $NAA | grep Displ").Output.Split(':')[1].Trim()

    # List the components recorded under recoveredComponents for this disk
    $Components = (Invoke-SSHCommand -SSHSession $session -Command "vsish -e ls /vmkModules/lsom/disks/$Disk/recoveredComponents/ 2>/dev/null | grep -v 626 | sed 's/.$//'").Output

    # Build the output object for this disk
    $DiskGroupMetadata = '' | Select-Object VMHost,Disk_UUID,Disk_Name,Components
    $DiskGroupMetadata.VMHost = $vmhostname.Split('.')[0]
    $DiskGroupMetadata.Disk_UUID = $Disk
    $DiskGroupMetadata.Disk_Name = $Display
    $DiskGroupMetadata.Components = $Components

    $Report += $DiskGroupMetadata
  }

  # Tidy up the SSH session before returning the results
  Remove-SSHSession -SSHSession $session | Out-Null
  return $Report | Sort-Object Disk_Name
}

To use this, just pass in the VMHostName and root credential (using Get-Credential), or call the function with no parameters and it will prompt for them. Of course, for this to work you will need SSH access enabled on your hosts, and as far as I know it will not work in lockdown mode.
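For example, something along these lines (the host name here is just a placeholder for one of your vSAN nodes):

$cred = Get-Credential -UserName 'root' -Message 'Enter root password'
Get-vSANBadMetadataDisks -vmhostname 'esxi01.lab.local' -rootcred $cred | Format-Table -AutoSize

You could also loop it over the hosts in a cluster (e.g. Get-Cluster | Get-VMHost) to check everything in one go.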

With the output you can see which capacity disks / diskgroups are affected, so you know which to delete and re-create. As you can see from the first screenshot, I’ve had quite a few of these to do today! 🙁

vSAN is pretty darn good when it’s running – but we do have these challenges when any maintenance is required, which is a little more involved than it seems. That said, I think some of our challenges stem from using All-Flash; our small Hybrid test cluster has always been solid, on the other hand!

Happy Hyper-converging!

UPDATE – Creating Custom EFI Bootable ESXi Installation ISO

** UPDATE ** – VMware have published updated documentation with two different methods to generate the ISO, Located Here.

I have recently been working on a solution to be able to build our ESXi hosts unattended. I know there is plenty of documentation out there on using PXE to kickstart ESXi installations but that unfortunately does not meet my use case – which is to streamline the ESXi installs at our remote locations where we do not have PXE / DHCP networks.

Based on the above, I wanted to put a kickstart script on the ESXi installer ISO which is automatically called (by modifying the boot.cfg file) to install ESXi in a scripted fashion. The process to accomplish this is documented in the VMware Documentation Center, Located Here.
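As a rough sketch of that change (the kickstart file name KS.CFG is just an example): with the ISO contents extracted, drop your kickstart file in the root and point the kernelopt line in boot.cfg (and its EFI counterpart under EFI/BOOT) at it, for example changing:

kernelopt=runweasel

to:

kernelopt=ks=cdrom:/KS.CFG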

My idea to accomplish this seemed sound – I will save the detail for another day – but a major stumbling block I came across was that, after creating the ISO file based on the instructions above, it would not boot on an EFI based host. The same ISO worked fine on a BIOS based machine.

Cue some serious digging around as to why, including a ticket to VMware. It turns out the documentation linked above only creates the boot sector on the ISO for BIOS machines. There is a separate boot entry which has to be created at the ISO generation stage to allow the resulting ISO to boot on both EFI and BIOS.

You can see what boot entries are on the ISO by using the Linux dumpet utility. I won’t go into details on this, but it’s a useful tool for inspecting CDs / ISOs.
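If memory serves (do check the man page), simply pointing it at the ISO is enough to dump the El Torito boot catalog, so you can see whether an EFI entry is present:

dumpet -i customesxi.iso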

It turns out that VMware are not alone on this challenge! After much searching online, it seems many Linux distributions have fallen foul of the same limitation, but I managed to find some articles which led me down the right path to eventually solve this!

To resolve this, you need to change the mkisofs command in the VMware documentation to the following:

mkisofs -relaxed-filenames -J -R -b isolinux.bin -c boot.cat -no-emul-boot -boot-load-size 4 -boot-info-table -eltorito-alt-boot -e efiboot.img -boot-load-size 1 -no-emul-boot -o customesxi.iso .

The key piece is the El Torito alternative boot entry (-eltorito-alt-boot -e efiboot.img) – this allows the ISO to carry a second, EFI-bootable image alongside the BIOS one. EFI firmware generally cannot boot directly from the ISO9660 file system, so the EFI boot files have to be supplied in a separate boot image (the efiboot.img that ships in the ESXi installer contents).

Hope this helps! As for the VMware ticket – they have acknowledged this, and a request has been put in to update the documentation.

iovDisableIR change in ESXi 6.0 U2 P04 causing PSOD on HPE Servers – Updated

So we have had an ‘interesting’ issue at work over the past few weeks!

We have had Gen8 / Gen9 blades in our environment randomly crashing over the last month. We were originally sent down what seems to have been an incorrect path, however it seems we are on the right track now!

Symptoms

HP BL460c Gen8 and Gen9 blades with v2 / v3 processors would randomly crash. There was no specific trigger, but the crashes seemed to be more prevalent during higher I/O periods, such as when backups were running. Initially the PSODs looked like this:

After logging a call with VMware, we were led down a path suggesting the mlx4_core error in the above screenshot was causing the issue. After further investigation, it turned out that after upgrading from vSphere 5.5 to vSphere 6.0 (using VUM) there were mlx4 drivers left behind – which is what was causing the ‘jumpstart dependency error’. Once we removed the stale 5.5 VIBs all was well.
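For reference, identifying and removing the stale VIBs is straightforward from the host shell – something along these lines (the VIB names below are just the typical Mellanox driver names, so check what esxcli actually reports in your image before removing anything):

esxcli software vib list | grep mlx4
esxcli software vib remove -n net-mlx4-en
esxcli software vib remove -n net-mlx4-core

A reboot is needed afterwards for the removal to take effect.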

The root cause as to why the 5.5 drivers remained after the 6.0 upgrade is that, whilst the driver was present in the HP utilities bundle for 6.0, the driver version was not revised, so VUM just ignored it! We have run into this before, and fed back to HP that even if the driver is the ‘same’, the version should be revised (specifically, incremented) to ensure the driver gets updated. This is not an issue if you do a fresh install of vSphere 6.0.

So – we fixed this across the environment (along with some other VIBs – more on this later) and hoped this would be the end of it. Two days later, we got further PSODs, but this time without the mlx4 dependency error!

Back to square one. We updated the logs with VMware, and this time opened a case with HP as well – as the crashes were technically reporting LINT1 / NMI hardware errors. Three days later, we finally got some solid information back – a very interesting discovery!

HP pointed us to this customer advisory – one I had seen before, a long time back. Strange to me, as we have never seen this issue before, and it’s not a setting we change as standard. There was also a specific error called out in the advisory which we were asked to confirm was in the logs:

ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40.

Anyhow – the most crucial piece of information the HP L2 tech gave us is that ESXi 6.0 patch ESXi600-201611401-BG changed the setting referenced in the HP customer advisory from a default of FALSE to TRUE.

After running a script in PowerCLI, it appears that is certainly the case: all the hosts we had running ESXi 6.0 Build 4600944 had the iovDisableIR setting set to TRUE (so Interrupt Remapping is disabled). According to HP, this is what is causing the PSODs.

Digging a little further, iovDisableIR is a parameter which controls IRQ Interrupt Remapping, a feature developed by Intel to improve performance. According to VMware, this feature had its issues originally – particularly with certain Intel chipsets – so they recommend disabling it in certain circumstances. HP, however, do support it and in fact, per their advisory, recommend it is enabled to prevent PSODs.

The interesting piece is that the VMware KB (1030265) linked from the HP Customer Advisory states the error may occur with QLogic HBAs. This is the HBA we use for our FC storage in the environment, which also explains why we have not seen PSODs on our Rackmounts (where they use Direct or SAS attached storage). On our Gen9 hardware, we have not seen PSODs, but instead the QLogic HBA failing, or the host just rebooting, so I believe these are related to the above setting.

So – to resolve this, we need to do the following across all our hosts running ESXi 6.0 Build 4600944 (I have also had word that the same applies to the ESXi 6.5 release):

esxcli system settings kernel set --setting=iovDisableIR -v FALSE

This requires a reboot to take effect. To determine if the host is affected, you can use the following PowerCLI script to gather a report of the current setting.

$Hosts = Get-VMHost
$Report = $Hosts | ForEach-Object {
    # Pull the iovDisableIR kernel setting from each host via esxcli
    $iovdata = ($_ | Get-EsxCli).system.settings.kernel.list() | Where-Object {$_.Name -eq 'iovDisableIR'}
    # Return the host details along with the configured and runtime values of the setting
    $_ | Select-Object Name,Parent,Model,ProcessorType,Version,Build,@{n='iovDisableIR_Conf';e={$iovdata.Configured}},@{n='iovDisableIR_Runtime';e={$iovdata.Runtime}}
}
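If you then need to push the change out to a number of hosts, a rough sketch along the lines below should do it (this uses the Get-EsxCli -V2 interface and is the equivalent of the esxcli command above – test it on one host first, and remember each host still needs a reboot for the runtime value to change):

foreach ($VMHost in Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $VMHost -V2
    # equivalent of: esxcli system settings kernel set --setting=iovDisableIR --value=FALSE
    $esxcli.system.settings.kernel.set.Invoke(@{setting='iovDisableIR'; value='FALSE'})
}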

We are awaiting further information from HP / VMware, who are now collaborating on our cases to determine the root cause, why this default was changed, and what it is attributed to. In the meantime, we have rolled this setting out across our blade environment and will continue to monitor. I will update this post when we know more!

*** Update 15th Feb ***

VMware have now released a KB article on this issue.
VMware KB (2149043)

Word of Warning:

We did some digging on this setting, and found that iovDisableIR is also set to TRUE (Interrupt Remapping disabled) in the initial ESXi 6.5 release. It does not appear to be unique to the HP Custom ISOs.

Home Automation using Home Assistant

First post on my re-launched blog! I hope this time I will keep this up with musings from both my personal and professional worlds with all things computing related!

So the first blog is to introduce the Home Automation platform I am getting to grips with at home, and give a little background as to why I have gone down this route.

Why automate your home?

Many an answer to this question out there; for me, however, it’s a fun and interesting subject, and seeing the results brings great satisfaction. There are plenty of challenges in finding the right balance of home automation. The ability to ‘set and forget’ a number of routines is nice, and without it many things would be ‘forgotten’ around the house – such as switching on the dehumidifier in the morning, and off in the evening.

What do I automate?

Lots of things! Simple things like power sockets, lighting, the home A/V equipment, and some more advanced things like our terrarium (turtle aquarium).

What do you use to automate the home?

There are many different components to my home automation. Here are but a few:

  • Telldus Tellstick Live .NET (433 MHz sender / receiver)
  • 433 MHz switches for lights
    • Home Easy
    • Nexa
  • 433 MHz sockets for plug based devices
    • Home Easy
    • Nexa
    • Conrad
  • Logitech Harmony Ultimate Touch
  • ESP8266 devices (more on this later)

Where does Home Assistant come in?

Home Assistant provides the ‘gel’ between the different devices across the house and other external components, such as weather, sunrise/sunset, and presence detection. These allow the devices across the home to be fully automated based on various conditions. It is an incredibly powerful platform, but can be daunting at first. The extent of device support is vast, and some of it very intricate.

I can use Home Assistant to automatically switch on the outside lights when I arrive home after sunset, then switch them off again after 10 minutes, or turn lights on automatically around sunset, and so on. More on how I ‘gel’ my home together will come in future posts.
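Just to give a flavour, an automation like the outside-lights example above can be expressed in Home Assistant YAML roughly like this (the entity names are made up for illustration – yours will depend on your own devices and trackers):

automation:
  - alias: 'Outside lights on arrival after sunset'
    trigger:
      platform: state
      entity_id: device_tracker.my_phone
      to: 'home'
    condition:
      condition: sun
      after: sunset
    action:
      - service: switch.turn_on
        entity_id: switch.outside_lights
      - delay: '00:10:00'
      - service: switch.turn_off
        entity_id: switch.outside_lights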

Where can I get more information?

The Home Assistant website is the best place to start learning about this platform, and the components supported. Development is very active, and new components are being added monthly! So it’s worth keeping an eye on if there are devices you are looking to automate but are not available yet.

More to come on how I got this set up, and how to get going yourself!