new vSAN cmdlet: Get-VSANObjectHealth

Been a while since I posted, so about time I caught up with some things I have been working on!

I will be posting a series of PowerCLI / Powershell cmdlets to assist with vSAN Day-2 operations. There is plenty out there for creating new clusters, but not much around for helping to troubleshoot / support production environments. Some of these have come from VMworld Europe Hackathon I recently attended in Europe which was a blast! I’ll do my best to credit others who helped contribute to the creation of these cmdlets also!

The first I created Get-VSANObjectHealth is to help obtain a state of the objects in a vSAN cluster. This is a great way to validate how things are in the environment, as it will give you the different status of objects, particularly if they are degraded or rebuilding. The idea of this cmdlet was to integrate into our host remediation script so I could verify that all the objects are healthy prior to moving onto the next host. I trust this verification in PowerCLI more so than VUM attempting to put a host into maintenance and then getting stuck.

Code is below and also posted on GitHub Here

How to use:

Details on the Switches:

– HealthyOnly
Will only return True if all objects are healthy, else will return False

– ShowObjectUUIDs
Will extend the query to include an array of ObjectUUIDs for each health category. Good if you need to investigate specific objects

– UseCachedInfo
Will use the cached vSAN Health data from the vCenter server rather than forcing an update from the cluster. Great if you need a quick check, but not recommended if you need a current picture of object health (such as just after exiting from Maintenance Mode)

Enjoy and please do let me know (either via comments here or GitHub) if there is any other enhancements you would like to see! More cmdlets to come soon.

Get vSAN Invalid State Metadata using PowerShell

Have had some ‘fun’ with our All-Flash vSAN clusters recently, after updating to 6.0 U3, then VMware certifying new HBA firmware / driver for our HPE DL380 servers. Every time I have updated / rebooted we end up with invalid metadata:

We’ve had a couple of tickets with VMware on this issue now, and a fix for this is still outstanding. It was scheduled for 6.0 U3 but failed a regression test. So for now when we patch / reboot we have to go fixing these!

The vSAN health page shown above only shows the affected component. On a Hybrid environment, you can remove the capacity disk from the diskgroup and re-add to resolve this, but for All-Flash you need to remove the entire diskgroup. Our servers have 2 x diskgroups per host, so we need to identify which diskgroup needs destroying.

To discover the diskgroup, you have to identify which capacity disk the affected component resides on. There is a VMware KB article for this – but it never worked on our vSAN nodes, so there was a different set of commands VMware support provided us to obtain these. VMware KB is HERE.

Now I’ve ended up doing this several times, and decided pull this into a PowerShell function to make life easier. It will return an object showing the host, Disk UUID, Disk Name & Affected components:

The script does require Posh-SSH, as it connects to the host over SSH to obtain the information. You can download this over at the PowerShell Gallery.

Here’s the code I put together:

To use this, just pass in the VMHostName and Root Credential (using Get-Credential) or just call the function and it will ask for those and run. Of course for this to work you will need to have SSH access enabled for your hosts, and afaik it will not work in lockdown mode.

With the output you can see which capacity disks / diskgroup is affected to delete and re-create. As you can see from the first screenshot I’ve had quite of few of these to do today! 🙁

vSAN is pretty darn good when it’s running – but we do have these challenges when any maintenance is required which is a little more involved than it seems. That said I think some of our challenges stem form using All-Flash, our small Hybrid test cluster has always been solid on the other hand!

Happy Hyper-converging!

UPDATE – Creating Custom EFI Bootable ESXi Installation ISO

** UPDATE ** – VMware have published updated documentation with 2 different methods to generate the ISO Located Here

I have recently been working on a solution to be able to build our ESXi hosts unattended. I know there is plenty of documentation out there on using PXE to kickstart ESXi installations but that unfortunately does not meet my use case – which is to streamline the ESXi installs at our remote locations where we do not have PXE / DHCP networks.

Based on the above, I wanted to put a kickstart script on the ESXi installer ISO which is automatically called (based on modifying the boot.cfg file) to install ESXi in a scripted fashion. The process to accomplish this is documented in the VMware Documentation Center Located Here

My idea to accomplish this seemed sound – and will save the detail for another day, but a major stumbling block I had come across was after creating the ISO file based on the instructions above, it would not boot on an EFI based host. The same ISO worked fine if you used a BIOS based machine.

Cue some serious digging around as to why, including a ticket to VMware for this, it turns out the documentation linked above will only create the boot sector on the ISO for BIOS machines. There is a separate bootstrap which has to be created at the ISO generation stage which allows the resulting ISO to boot both EFI and BIOS.

You can see what boot sectors are on the ISO by using a linux dumpet utility. I won’t go into details on this, but it’s a useful tool for looking at CDs / ISOs.

It turns out that VMware are not alone on this challenge! After much searching online it turns out that most Linux distributions fall foul of this same limitation, but I had managed to find some articles which led me down the right path to eventually solve this!

to resolve this, you need to change the mkisofs command on the vmware documentation to the following:

mkisofs -relaxed-filenames -J -R -b isolinux.bin -c boot.cat -no-emul-boot -boot-load-size 4 -boot-info-table -eltorito-alt-boot -e efiboot.img -boot-load-size 1 -no-emul-boot -o customesxi.iso .

the key piece is the El Torito boot – this allows the ISO to strap the EFI bootable image. It turns out that EFI does not have support for ISO9660 file systems, so you have to create a GPT disk image containing the EFI images to boot.

Hope this helps! As for the VMware ticket – they have acknowledged this, and a request has been put into update the documentation.

iovDisableIR change in ESXi 6.0 U2 P04 causing PSOD on HPE Servers – Updated

So we have had an ‘interesting’ issue at work on the past few weeks!

We have had Gen8 / Gen9 blades in our environment randomly crashing over the last month. We had originally been sent down what seems an incorrect path, however seem we are on the right track now!

Symptoms

HP BL460c Gen8 and Gen9 blades with v2 / v3 processors, would randomly crash. There was no specific cause for them but it seemed to be more prevalent in higher I/O periods such as backups running. It started out that PSODs looked like this:

After logging a call with VMware, we were led down a path that the mlx4_core error in the above screenshot was causing the issue. After further investigation, it turned out that after upgrading from vSphere 5.5 to vSphere 6.0 (using VUM) there were mlx4 drivers left behind – which is what was causing the ‘jumpstart dependancy error’. Once we removed the bad 5.5 VIBs all was well.

The root cause as to why the 5.5 drivers remained after the 6.0 install, is because whilst the driver was present in the HP utilities bundle for 6.0 – the driver version was not revised, so VUM just ignored this! We have run into this before, and fed back to HP that even if the driver is the ‘same’, the version should be revised (and particularly incremented) to ensure the driver gets updated. This is not an issue if you use a fresh-install of vSphere 6.0.

So – we fixed this across the environment (along with some other VIBs – more on this later) and hoped this would be the end of it. 2 days later, we get further PSODs but this time without the mlx4 dependancy error!

Back to square one. We updated logs with VMware, and this time opened a case with HP – as it’s technically reporting LINT1 / NMI hardware errors. Cue 3 days later, and finally get some solid information back – a very interesting discovery!

HP recommended this customer advisory – One I have seen before and a long time back. Strange to me as we have never seen this issue before, and it’s not a setting we change as standard. There was also a specific error called out in the advisory which we had asked for confirmation this was in the logs:

ALERT: APIC: 1823: APICID 0x00000000 – ESR = 0x40.

Anyhow – the most crucial piece that the HP L2 tech informed us, is that ESXi 6.0 Patch ESXi600-201611401-BG changed the setting in the HP customer advisory from a default of false to true.

After running a script in PowerCLI – it appears that is certainly the case, all the hosts we had running ESXi 6.0 Build 4600944 had the iovDisableIR setting set to TRUE (so it is disabled). This is causing the PSODs according to HP.

Digging a little further, the iovDisableIR is a parameter which handles IRQ Interrupt remapping. This is a feature developed by Intel to improve performance. According to VMware this feature had it’s issues originally – particularly with certain Intel chipsets so recommend disabling it in certain circumstances. However HP do support this and infect per their advisory recommend this is enabled to prevent PSODs. The interesting piece – is where the VMware KB (1030265) linked from the HP Customer Advisory states that the error may occur with QLogic HBAs. This is the HBA we use for our FC storage in the environment, but also explains why we have not seen PSODs in our Rackmounts (were they use Direct or SAS attached storage). But – On our Gen9 hardware, we have not seen PSODs, instead the QLogic HBA failing, or the host just rebooting, so I believe these are related to the above setting.

So – to resolve this, we need to do the following across all our hosts running ESXi 6.0 Build 4600944 (I have also had word this is also the same in the ESXi 6.5 release):

esxcli system settings kernel set –setting=iovDisableIR -v FALSE

This requires a reboot to take effect. To determine if the host is affected, you can use the following PowerCLI script to gather a report of the current setting.

We are awaiting further information from HP/VMware who are now collaborating on our cases to determine the root cause and why this was changed (and is attributed), however have rolled this setting out across our blade environment and will continue to monitor. I will update this post when we know more!

*** Update 15th Feb ***

VMware have now released a KB article on this issue.
VMware KB (2149043)

Word of Warning:

We did some digging on this setting, and found that iovDisableIR has been Disabled (set to TRUE) on the ESXi 6.5 initial release. It does not appear to be configured unique to HP Custom ISO’s