Get vSAN Invalid State Metadata using PowerShell

We have had some ‘fun’ with our All-Flash vSAN clusters recently: first updating to 6.0 U3, then applying the new HBA firmware / driver that VMware certified for our HPE DL380 servers. Every time I update or reboot a host, we end up with components reporting invalid metadata.

We’ve had a couple of tickets open with VMware on this issue, and a fix is still outstanding. It was scheduled for 6.0 U3 but failed a regression test. So for now, whenever we patch / reboot, we have to go and fix these by hand!

The vSAN health page shown above only shows the affected component. On a Hybrid environment, you can remove the capacity disk from the diskgroup and re-add to resolve this, but for All-Flash you need to remove the entire diskgroup. Our servers have 2 x diskgroups per host, so we need to identify which diskgroup needs destroying.

To discover the diskgroup, you have to identify which capacity disk the affected component resides on. There is a VMware KB article for this, but it never worked on our vSAN nodes, so VMware support gave us a different set of commands to obtain this information.

I’ve now ended up doing this several times, so I decided to pull it into a PowerShell function to make life easier. It returns an object showing the host, disk UUID, disk name and affected components.

The script does require Posh-SSH, as it connects to the host over SSH to obtain the information. You can download this over at the PowerShell Gallery.
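Before running the function it can help to confirm the module is actually present. A minimal sketch of such a pre-flight check; `Test-ModuleAvailable` is a hypothetical helper name of mine, not part of Posh-SSH or the script below, and the install hint assumes you are happy to install from the PowerShell Gallery for the current user only:

```powershell
# Hypothetical helper: returns $true when a module with this name is installed.
function Test-ModuleAvailable {
  param(
    [Parameter(Mandatory = $true)]
    [string]$Name
  )
  # Get-Module -ListAvailable returns nothing when the module is not installed
  return [bool](Get-Module -ListAvailable -Name $Name)
}

if (Test-ModuleAvailable -Name 'Posh-SSH') {
  Import-Module Posh-SSH
} else {
  Write-Warning "Posh-SSH not found. Try: Install-Module -Name Posh-SSH -Scope CurrentUser"
}
```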

Here’s the code I put together:

function Get-vSANBadMetadataDisks {
  [cmdletbinding()]
  param(
    $vmhostname,
    $rootcred
  )

  if (!$vmhostname) {
    $vmhostname = Read-Host "Enter FQDN of esxi host"
  }
  if (!$rootcred) {
    $rootcred = Get-Credential -UserName 'root' -Message 'Enter root password'
  }
  Write-Host "[$vmhostname : Gathering Metadata information]" -ForegroundColor Cyan
  $session = New-SSHSession -ComputerName $vmhostname -Credential $rootcred -AcceptKey
  if (!$session) {
    Write-Warning "error connecting. exiting"
    return
  }

  # List the LSOM disk UUIDs known to this host (sed strips the trailing character from each entry)
  $Disks = (Invoke-SSHCommand -SSHSession $session -Command 'vsish -e ls /vmkModules/lsom/disks/ | sed "s/.$//"').Output

  $Report = @()
  foreach($Disk in $Disks) {
    # Map the disk UUID to its device name, then look up that device's display name
    $NAA = (Invoke-SSHCommand -SSHSession $session -Command "localcli vsan storage list | grep -B 2 $Disk | grep Displ").output.split(':')[1].trim()
    $Display = (Invoke-SSHCommand -SSHSession $session -Command "esxcli storage core device list | grep -A 1 $NAA | grep Displ").output.split(':')[1].trim()

    # List the recovered components on this disk (the 626 filter is carried over unchanged)
    $Components = (Invoke-SSHCommand -SSHSession $session -Command "vsish -e ls /vmkModules/lsom/disks/$Disk/recoveredComponents/ 2>/dev/null | grep -v 626 | sed 's/.$//'").output
    
    $DiskGroupMetadata = '' | Select-Object VMHost,Disk_UUID,Disk_Name,Components
    $DiskGroupMetadata.VMHost = $vmhostname.split('.')[0]
    $DiskGroupMetadata.Disk_UUID = $Disk
    $DiskGroupMetadata.Disk_Name = $Display
    $DiskGroupMetadata.Components = $Components

    $Report += $DiskGroupMetadata

  }
  
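  # Close the SSH session so it is not left open on the host
  Remove-SSHSession -SSHSession $session | Out-Null

  return $Report | Sort-Object Disk_Name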
}

To use this, pass in the host name (-vmhostname) and the root credential (-rootcred, from Get-Credential), or just call the function with no arguments and it will prompt for both. Of course, for this to work you need SSH access enabled on your hosts, and as far as I know it will not work in lockdown mode.

With the output you can see which capacity disk / diskgroup is affected and needs to be deleted and re-created. As you can see from the first screenshot, I’ve had quite a few of these to do today! 🙁
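To narrow the report down to just the disks that need attention, you can filter on the Components property. This is only a sketch under assumptions: `Select-AffectedDisks` is a hypothetical helper name, and the host name, UUIDs and component IDs in the sample rows are made up purely to show the shape of the data.

```powershell
# Hypothetical helper: pass through only the report rows that list at least one component.
function Select-AffectedDisks {
  param(
    [Parameter(ValueFromPipeline = $true)]
    $Row
  )
  process {
    # @() normalises $null, a single string, or an array into an array
    $components = @($Row.Components | Where-Object { $_ })
    if ($components.Count -gt 0) { $Row }
  }
}

# Made-up sample rows in the same shape the function returns.
# Against a real host you would use something like:
#   $report = Get-vSANBadMetadataDisks -vmhostname 'esx01.example.local' -rootcred (Get-Credential)
$report = @(
  [pscustomobject]@{ VMHost = 'esx01'; Disk_UUID = 'uuid-a'; Disk_Name = 'disk-a'; Components = @('comp-1', 'comp-2') }
  [pscustomobject]@{ VMHost = 'esx01'; Disk_UUID = 'uuid-b'; Disk_Name = 'disk-b'; Components = $null }
  [pscustomobject]@{ VMHost = 'esx01'; Disk_UUID = 'uuid-c'; Disk_Name = 'disk-c'; Components = @() }
)

$affected = @($report | Select-AffectedDisks)
$affected | Select-Object VMHost, Disk_Name, Disk_UUID
```

Only the rows with a non-empty component list come back, which are the capacity disks (and so the diskgroups) to look at.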

vSAN is pretty darn good when it’s running, but we do hit these challenges whenever maintenance is required, which makes it a little more involved than it seems. That said, I think some of our challenges stem from using All-Flash. Our small Hybrid test cluster, on the other hand, has always been solid!

Happy Hyper-converging!

3 thoughts on “Get vSAN Invalid State Metadata using PowerShell”

  1. Pingback: vSAN Invalid State Component Metadata Fixed! | OS Help

  2. Thank you very much for posting this script! I too have had zero success using the cmmds_find command in RVC to locate the disk associated with an invalid component. And destroying an entire disk group to get rid of it seemed insane. With your script we could ID the problem disk and remove that from the group to clear the invalid component.

    • Glad it helped! We are using de-duplication so have no choice but to destroy the entire diskgroup. BTW this is now fixed in the latest 6.0 patches!
