vSAN on HPE Synergy Composable Infrastructure – Part 1

It’s been a while since I last posted – I have been pulled in many different directions by work priorities, so blogging took a temporary back seat! I am now back, and going to blog about a project I am currently working on: building out an extended VDI vSAN environment.

Our existing vSAN environment is running on DL380 G9 rackmounts, which, whilst we had some initial teething issues, have been particularly solid and reliable of late!

We are almost at the point of exhausting the CPU and memory resources for this environment, and are at around 60% utilization on the vSAN datastores across the 3 clusters. With this, it felt natural to expand our vSAN environment as we continue the migration to Windows 10 and manage the explosive growth of the environment – aided by our recent move to Microsoft Azure MFA authentication in place of 2-factor over a VPN connection.

As an organization, we are about to refresh a number of HP Gen8 blades in our datacentre, and in looking at moving to Gen10, knowing that this could be the last generation to support the C7000 chassis, we thought it would be a good time to look at other solutions. This is where HPE Synergy composable infrastructure came in! After an initial purchase of 4 frames, and a business requirement causing us to expand this further, we felt that expanding vSAN could be a good fit into Synergy with the D3940 storage module.

Now that the hardware is in the datacentre and finally racked up, I am going to run a series of blogs on how vSAN looks in HPE Synergy composable infrastructure, our configuration, and some of the Synergy features / automation capabilities which make this easier to implement than on the traditional DL380 Gen9 rackmount hardware we have in place today. Stay tuned, or follow me on Twitter for notifications on this series.

Get AHS Data from HPE iLO4+ using PowerShell

I discovered this possibility a year back, and it’s only now that I have invested the time to get it working! It turned out not to be as challenging as I thought; in fact, the hardest bit was getting the authentication token.
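
For illustration, here is a rough Python sketch of what that token step looks like (my actual function is PowerShell). It targets the standard Redfish session endpoint that iLO 4 exposes on recent firmware; the host name is a placeholder, and the HTTP `post` callable is injected so the logic reads independently of any one HTTP library:

```python
import json

def get_ilo_session_token(post, host, username, password):
    """Create an iLO Redfish session and return its X-Auth-Token.

    `post` is any callable with the shape of requests.post, injected so
    the flow can be shown (and tested) without a live iLO on the network.
    """
    url = "https://%s/redfish/v1/SessionService/Sessions" % host
    resp = post(
        url,
        data=json.dumps({"UserName": username, "Password": password}),
        headers={"Content-Type": "application/json"},
        verify=False,  # iLOs commonly present self-signed certificates
    )
    resp.raise_for_status()
    # iLO returns the session token in the X-Auth-Token response header
    return resp.headers["X-Auth-Token"]
```

With the requests library you would pass requests.post as the `post` argument, then send the returned token in an X-Auth-Token header on subsequent calls (such as the AHS download itself).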

I have written a PowerShell function called Get-AHSData which allows you to gather the AHS data from an HPE iLO 4 or newer. These AHS logs are frequently requested by HPE Support when logging calls for ProLiant servers, and downloading them using the iLO UI can be cumbersome – and involves a mouse!

Get-AHSData allows you to specify the server, iLO credentials, and a start / end date for the log range if necessary; by default it will grab metrics for 1 day. You specify a folder to export to, and it will grab the file and save it there, returning a file list (if you run against multiple iLOs).

Code is out on GitHub Here

Sorry it’s a little long to embed here, and keeping it in GitHub will allow me to iterate it with improvements without having to circle back and update this page.

An example of this running:

Let me know on GitHub if you have any issues, or feel free to fork and improve! I already have a couple of enhancements from my colleagues which I will look to include.

Home Assistant SSL Certificate renewal with uPnP

A slight niggle has been going on with my Home Assistant install for a little while now: SSL certificate renewals were not happening, and I would end up having to renew them manually. The reason I am doing renewals in the first place is that I am using the very capable and free Let's Encrypt certificates to secure my Home Assistant instance.

I am not going to go into the details of how you provision certificates on Home Assistant, as this is already covered very well on the Home Assistant website.

What this post is going to elaborate on is the authentication process for getting / renewing a certificate.

Let's Encrypt has a number of security measures for verifying that you own the system a certificate is being requested for. In terms of HTTP/HTTPS authentication, it will only test against HTTP port 80 or HTTPS port 443, as these are privileged ports in Linux, requiring elevated permissions for an application to claim them for listening.
I would hope most others are like me in disliking the idea of opening / redirecting either of these ports to my Home Assistant server just for a certificate renewal – I would rather it be temporary.
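
As a quick illustration of that privileged-port behaviour (the port numbers here are just examples), a few lines of Python show how binding to a low port fails for an unprivileged process:

```python
import socket

def can_bind(port):
    """Return True if this process can claim the given TCP port for listening."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", port))
        return True
    except PermissionError:
        return False
    finally:
        s.close()

# Run as a normal user on Linux, can_bind(80) and can_bind(443) fail,
# while an unprivileged port such as 8124 (or 0 for "any free port") works.
```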

The certbot tool, which handles the certificate request / renewal process, can be configured to launch an HTTP web server on a custom port; you then redirect either TCP port 80 or 443 (depending on your renewal parameters) to that custom port certbot is listening on. Usually this step is done at your home router.
Again, home router configuration is not something I will go into detail on here, as it varies wildly.

I did not want to leave a forward permanently open so Home Assistant could automate its renewal requests, and my router does not have any capability to automate the configuration of port forwarding. So I decided to look at a different approach.

In comes uPnP – a network protocol which allows discovery and management of network devices. A subset of this is the ability for an application to request that a port be forwarded to it from your Internet router. This is more applicable in home environments than corporate ones, and you likely have applications which do this already. We are going to tag along with those applications and request a port forward temporarily to allow the certificate renewal.
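
Conceptually, the temporary forward looks like the sketch below. It is modelled on the Python miniupnpc module's addportmapping/deleteportmapping API; the upnp object is passed in (in real use you would discover it with miniupnpc.UPnP()), so treat this as an outline rather than my exact script:

```python
from contextlib import contextmanager

@contextmanager
def temporary_port_forward(upnp, external_port, internal_port,
                           proto="TCP", description="certbot renewal"):
    """Ask the Internet gateway to forward external_port to this host's
    internal_port for the duration of the block, then remove the mapping.

    `upnp` is expected to look like a discovered miniupnpc.UPnP object
    (a lanaddr attribute plus addportmapping/deleteportmapping methods).
    """
    upnp.addportmapping(external_port, proto, upnp.lanaddr,
                        internal_port, description, "")
    try:
        yield  # e.g. run certbot while the mapping is live
    finally:
        # Always clean up, even if the renewal fails
        upnp.deleteportmapping(external_port, proto)
```

The try/finally ensures the mapping is removed even when the certificate request fails, so the port is never left open on the router.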

More information about uPnP is available here.
The implementation I have used in the script for certbot is based on Python's miniupnpc module and code written by Silas S. Brown.

My complete Home Assistant configurations are available on GitHub, but extracts are below too.

Configuration is generally the same as the Home Assistant guide here but we will make a few tweaks to step 8 (auto-renew process).

To handle the renewal process, I have created a small script which forwards the port using uPnP, requests the certificate, then removes the port once done. I have customized the process to use port 8124 temporarily, so I do not have to stop / restart Home Assistant to perform the renewal.

So the script looks like this:

Simply replace the ‘certbot-auto’ line in step 8 of the guide with the path to the above script. Also save the script file below to the same location.

The certbot_upnp.sh script being called above is an additional script which performs the actual port forwarding using Python. I customized it from Silas’ code to simplify the request, and have hard-coded the ports in the script for now. You can edit this, and the certbot-auto line, should you wish to use a different port:

NOTE – There are some prerequisites for the uPnP module to work: the script needs the python-miniupnpc library. On Ubuntu, you can install this with sudo apt-get install python-miniupnpc.

I am sure there are much more fluid ways to integrate this and incorporate it into a Home Assistant component, but I am still rather green at Python. I would be interested to see how others would take this and customize it into their workflows!

Flash Marlin 3DP Firmware from Octopi / Raspberry Pi

I thought this would be useful to document here, as it’s something that I can refer back to, and hopefully will help others!

I have an Anet A8 3D Printer for numerous home projects, and I have got so much use out of it since getting it – love the versatility, and have even designed some parts to help around the house!
Anet A8

The printer is managed through the very popular Octoprint. I actually have the dedicated Octopi image running on a Raspberry Pi Zero W which works great and is a very cheap way to get your printer Wi-Fi enabled! It’s pretty much a case of just flashing the image to an SD card, then connecting your printer via a USB cable.
Octoprint Image

With my printer connected to my Pi, whenever I wanted to update the printer’s firmware (which is flashed with Marlin), I would have to disconnect it from the Pi and connect it to a Windows 10 tablet (as my Mac does not play nice with the serial chip on the printer). So I went in search of a better solution, and came across some tips on how to flash the firmware using a Raspberry Pi (or my Octopi!). Details below:

1. Compile your firmware

I’m not going to go into details on how to download or install the firmware – There is plenty of documentation out there on how to do this normally. But I will touch on how to get that firmware ready to install from your Octopi / Raspberry Pi.

The CPU on a Raspberry Pi is rather slow, and doing the full compile on the Pi would take a very long time! Also, as most of the documentation is aimed at configuring Arduino to flash your 3DP firmware, and the Octopi image has no GUI installed, that adds complexity.

So – We can get around this by using your much more powerful PC to compile the firmware. Then it’s just a case of uploading to the Pi and flashing. To compile the firmware into a HEX file:
– Click Sketch -> Export Compiled Binary

This will save the HEX file into the same directory as your ino file. Make a note of where that file is, or have your file explorer open ready for the next step.

2. Upload HEX file to your Octopi / Octoprint

There are plenty of ways you can do this, and you may have a preferred method; I am getting the file onto my Octopi using SCP from my Mac. If you are a Windows user, you can use WinSCP.
I uploaded the file into the /home/pi folder ready for the next step.

Note: The default password for octopi is the same as Raspbian. User: pi / Password: raspberry

SCP example:
scp Marlin.ino.sanguino.hex pi@octopi:/home/pi

3. SSH into Octopi and flash firmware

Next you will need to SSH into your Octopi / Raspberry Pi so we can carry out the steps to install and flash. On Windows you can use PuTTY.

Once logged in, install avrdude – the application which Arduino runs in the background to upload your compiled code, and what we will use to upload the HEX file to the mainboard:

sudo apt-get update
sudo apt-get install avrdude

Now we need to determine what the (Linux) name of the USB Serial port is that the printer is connected to. The best place to see this is in your connection tab of Octoprint:

As above – in my case it’s /dev/ttyUSB0

Whilst you are in Octoprint, make sure you Disconnect from the printer – that way it will release the Serial port for you to flash the board.

And finally – we can now flash the firmware:

avrdude -p m1284p -c arduino -P [USB port from Octoprint] -b 57600 -D -U flash:w:[file you uploaded earlier]:i
In my case:
avrdude -p m1284p -c arduino -P /dev/ttyUSB0 -b 57600 -D -U flash:w:Marlin.ino.sanguino.hex:i

NOTE: The above code is specific to the Anet A8 v1.0 board. If you are using a different setup such as RAMPS, you will need to check which processor is being used and adjust accordingly.

The board will automatically reset when complete, and all being well you should be running the new firmware!

Please note – I do not take any responsibility for any damage to, bricking of, or inoperation of your board. Usually these problems are not terminal if something does go wrong, but it’s too much to explain here.

Happy Printing!

New vSAN cmdlet: Get-VSANObjectHealth

Been a while since I posted, so about time I caught up with some things I have been working on!

I will be posting a series of PowerCLI / PowerShell cmdlets to assist with vSAN Day-2 operations. There is plenty out there for creating new clusters, but not much around for helping to troubleshoot / support production environments. Some of these came out of the VMworld Europe Hackathon I recently attended, which was a blast! I’ll do my best to credit the others who helped contribute to the creation of these cmdlets.

The first cmdlet I created, Get-VSANObjectHealth, helps obtain the state of the objects in a vSAN cluster. This is a great way to validate how things are in the environment, as it will give you the different statuses of objects – particularly whether they are degraded or rebuilding. The idea of this cmdlet was to integrate it into our host remediation script, so I could verify that all objects are healthy prior to moving on to the next host. I trust this verification in PowerCLI more than I trust VUM attempting to put a host into maintenance mode and then getting stuck.

Code is below and also posted on GitHub Here

How to use:

Details on the Switches:

– HealthyOnly
Will only return True if all objects are healthy, else will return False

– ShowObjectUUIDs
Will extend the query to include an array of ObjectUUIDs for each health category. Good if you need to investigate specific objects

– UseCachedInfo
Will use the cached vSAN Health data from the vCenter server rather than forcing an update from the cluster. Great if you need a quick check, but not recommended if you need a current picture of object health (such as just after exiting from Maintenance Mode)

Enjoy, and please do let me know (either via comments here or GitHub) if there are any other enhancements you would like to see! More cmdlets to come soon.

Spinning Connecting page on Home Assistant

Something that had been bothering me a little over the past week or so, and which I only now had the time to investigate, is Home Assistant sitting with a spinning connecting logo.

It turns out the problem was not Home Assistant itself – the script which manages the DNS update of my public IP had stopped, so the browser was trying to connect to the wrong address!

So if this happens to you – check that you are using the right IP to access your Home Assistant instance!

Luckily things were still working fine in the background.

vSAN Invalid State Component Metadata Fixed!

Just a quick note to follow on from my post regarding the invalid component metadata on vSAN. This is now fixed by VMware in the latest round of patches.


I recently upgraded my lab servers to the latest round of patches (end of June 2017), and the errors which had appeared after updating the disk firmware were gone once I applied the VUM patches. Nice!

Get vSAN Invalid State Metadata using PowerShell

Have had some ‘fun’ with our All-Flash vSAN clusters recently, after updating to 6.0 U3 and VMware certifying new HBA firmware / driver for our HPE DL380 servers. Every time we have updated / rebooted, we end up with invalid metadata:

We’ve had a couple of tickets with VMware on this issue now, and a fix is still outstanding. It was scheduled for 6.0 U3 but failed a regression test. So for now, whenever we patch / reboot, we have to go and fix these!

The vSAN health page shown above only shows the affected component. In a Hybrid environment, you can remove the capacity disk from the diskgroup and re-add it to resolve this, but for All-Flash you need to remove the entire diskgroup. Our servers have 2 x diskgroups per host, so we need to identify which diskgroup needs destroying.

To discover the diskgroup, you have to identify which capacity disk the affected component resides on. There is a VMware KB article for this – but it never worked on our vSAN nodes, so VMware support provided a different set of commands to obtain this information. The VMware KB is HERE.

Now, I’ve ended up doing this several times, and decided to pull it into a PowerShell function to make life easier. It will return an object showing the host, disk UUID, disk name & affected components:

The script does require Posh-SSH, as it connects to the host over SSH to obtain the information. You can download this over at the PowerShell Gallery.

Here’s the code I put together:

To use this, just pass in the VMHostName and root credential (using Get-Credential), or just call the function and it will prompt for those and run. Of course, for this to work you will need SSH access enabled on your hosts, and AFAIK it will not work in lockdown mode.

With the output you can see which capacity disks / diskgroup are affected, to delete and re-create. As you can see from the first screenshot, I’ve had quite a few of these to do today! 🙁

vSAN is pretty darn good when it’s running – but we do have these challenges whenever any maintenance is required, which is a little more involved than it seems. That said, I think some of our challenges stem from using All-Flash; our small Hybrid test cluster, on the other hand, has always been solid!

Happy Hyper-converging!

UPDATE – Creating Custom EFI Bootable ESXi Installation ISO

** UPDATE ** – VMware have published updated documentation with 2 different methods to generate the ISO Located Here

I have recently been working on a solution to be able to build our ESXi hosts unattended. I know there is plenty of documentation out there on using PXE to kickstart ESXi installations but that unfortunately does not meet my use case – which is to streamline the ESXi installs at our remote locations where we do not have PXE / DHCP networks.

Based on the above, I wanted to put a kickstart script on the ESXi installer ISO which is automatically called (based on modifying the boot.cfg file) to install ESXi in a scripted fashion. The process to accomplish this is documented in the VMware Documentation Center Located Here

My idea to accomplish this seemed sound – I will save the detail for another day – but a major stumbling block I came across was that, after creating the ISO file based on the instructions above, it would not boot on an EFI-based host. The same ISO worked fine on a BIOS-based machine.

Cue some serious digging around as to why, including a ticket with VMware. It turns out the documentation linked above will only create the boot sector on the ISO for BIOS machines. There is a separate bootstrap which has to be created at the ISO generation stage, which allows the resulting ISO to boot on both EFI and BIOS.

You can see what boot sectors are on the ISO by using the Linux dumpet utility. I won’t go into details on this, but it’s a useful tool for inspecting CDs / ISOs.

It turns out that VMware are not alone in this challenge! After much searching online, I found that most Linux distributions fall foul of the same limitation, but I managed to find some articles which led me down the right path to eventually solve this!

To resolve this, you need to change the mkisofs command in the VMware documentation to the following:

mkisofs -relaxed-filenames -J -R -b isolinux.bin -c boot.cat -no-emul-boot -boot-load-size 4 -boot-info-table -eltorito-alt-boot -e efiboot.img -boot-load-size 1 -no-emul-boot -o customesxi.iso .

The key piece is the El Torito alternate boot entry – this allows the ISO to bootstrap the EFI boot image. It turns out that EFI does not support ISO9660 file systems, so you have to provide a separate disk image (efiboot.img) containing the EFI boot files.

Hope this helps! As for the VMware ticket – they have acknowledged this, and a request has been put in to update the documentation.

iovDisableIR change in ESXi 6.0 U2 P04 causing PSOD on HPE Servers – Updated

So, we have had an ‘interesting’ issue at work over the past few weeks!

We have had Gen8 / Gen9 blades in our environment randomly crashing over the last month. We had originally been sent down what seems to have been an incorrect path; however, it seems we are on the right track now!


HP BL460c Gen8 and Gen9 blades with v2 / v3 processors would randomly crash. There was no specific trigger, but crashes seemed more prevalent during higher I/O periods, such as when backups were running. Initially, the PSODs looked like this:

After logging a call with VMware, we were led down a path suggesting the mlx4_core error in the above screenshot was causing the issue. After further investigation, it turned out that after upgrading from vSphere 5.5 to vSphere 6.0 (using VUM), there were mlx4 drivers left behind – which is what was causing the ‘jumpstart dependency error’. Once we removed the stale 5.5 VIBs, all was well.

The root cause of why the 5.5 drivers remained after the 6.0 install is that, whilst the driver was present in the HP utilities bundle for 6.0, the driver version was not revised, so VUM just ignored it! We have run into this before, and fed back to HP that even if the driver is the ‘same’, the version should be revised (specifically, incremented) to ensure the driver gets updated. This is not an issue if you do a fresh install of vSphere 6.0.

So – we fixed this across the environment (along with some other VIBs – more on this later) and hoped this would be the end of it. Two days later, we got further PSODs, but this time without the mlx4 dependency error!

Back to square one. We updated the logs with VMware, and this time opened a case with HP too – as it’s technically reporting LINT1 / NMI hardware errors. Cue 3 days later, and we finally got some solid information back – a very interesting discovery!

HP recommended this customer advisory – one I had seen before, a long time back. Strange to me, as we have never seen this issue before, and it’s not a setting we change as standard. There was also a specific error called out in the advisory, which we asked HP to confirm was in the logs:

ALERT: APIC: 1823: APICID 0x00000000 – ESR = 0x40.

Anyhow – the most crucial piece of information from the HP L2 tech is that ESXi 6.0 patch ESXi600-201611401-BG changed the setting in the HP customer advisory from a default of false to true.

After running a script in PowerCLI, it appears that is certainly the case: all the hosts we had running ESXi 6.0 build 4600944 had the iovDisableIR setting set to TRUE (so interrupt remapping is disabled). According to HP, this is what is causing the PSODs.

Digging a little further, iovDisableIR is a parameter which handles IRQ interrupt remapping – a feature developed by Intel to improve performance. According to VMware, this feature originally had its issues, particularly with certain Intel chipsets, so they recommend disabling it in certain circumstances. However, HP do support it, and in fact per their advisory recommend it is enabled to prevent PSODs. The interesting piece is that the VMware KB (1030265) linked from the HP customer advisory states the error may occur with QLogic HBAs. This is the HBA we use for our FC storage in the environment, which also explains why we have not seen PSODs on our rackmounts (where we use direct- or SAS-attached storage). On our Gen9 hardware we have not seen PSODs, but instead the QLogic HBA failing, or the host just rebooting – so I believe these are related to the same setting.

So – to resolve this, we need to run the following across all our hosts running ESXi 6.0 build 4600944 (I have also had word this is the same in the ESXi 6.5 release):

esxcli system settings kernel set --setting=iovDisableIR -v FALSE

This requires a reboot to take effect. To determine if the host is affected, you can use the following PowerCLI script to gather a report of the current setting.

We are awaiting further information from HP/VMware, who are now collaborating on our cases to determine the root cause and why this setting was changed. However, we have rolled the setting out across our blade environment and will continue to monitor. I will update this post when we know more!

*** Update 15th Feb ***

VMware have now released a KB article on this issue.
VMware KB (2149043)

Word of Warning:

We did some digging on this setting, and found that iovDisableIR is also Disabled (set to TRUE) in the initial ESXi 6.5 release. It does not appear to be unique to the HP Custom ISOs.