Get vSAN Invalid State Metadata using PowerShell

Have had some ‘fun’ with our All-Flash vSAN clusters recently, after updating to 6.0 U3, then VMware certifying new HBA firmware / driver for our HPE DL380 servers. Every time I have updated / rebooted we end up with invalid metadata:

We’ve had a couple of tickets with VMware on this issue now, and a fix for this is still outstanding. It was scheduled for 6.0 U3 but failed a regression test. So for now when we patch / reboot we have to go fixing these!

The vSAN health page shown above only shows the affected component. On a Hybrid environment, you can remove the capacity disk from the diskgroup and re-add to resolve this, but for All-Flash you need to remove the entire diskgroup. Our servers have 2 x diskgroups per host, so we need to identify which diskgroup needs destroying.

To discover the diskgroup, you have to identify which capacity disk the affected component resides on. There is a VMware KB article for this – but it never worked on our vSAN nodes, so there was a different set of commands VMware support provided us to obtain these. VMware KB is HERE.

Now I’ve ended up doing this several times, and decided pull this into a PowerShell function to make life easier. It will return an object showing the host, Disk UUID, Disk Name & Affected components:

The script does require Posh-SSH, as it connects to the host over SSH to obtain the information. You can download this over at the PowerShell Gallery.

Here’s the code I put together:

To use this, just pass in the VMHostName and Root Credential (using Get-Credential) or just call the function and it will ask for those and run. Of course for this to work you will need to have SSH access enabled for your hosts, and afaik it will not work in lockdown mode.

With the output you can see which capacity disks / diskgroup is affected to delete and re-create. As you can see from the first screenshot I’ve had quite of few of these to do today! 🙁

vSAN is pretty darn good when it’s running – but we do have these challenges when any maintenance is required which is a little more involved than it seems. That said I think some of our challenges stem form using All-Flash, our small Hybrid test cluster has always been solid on the other hand!

Happy Hyper-converging!