vSAN on HPE Synergy Composeable Infrastructure – Part 2

Firstly, Apologies for the delay in getting the follow up to this series posted. I am getting all these together now to post in quicker succession. Hopefully will have all these posted around VMworld US!

So, this blog post is going to dive into the configuration and components of our vSAN setup using HPE Synergy, based on the overview in Part 1 where I mentioned this would be supporting VDI workloads.

Firstly, a little terminology to help you follow along with the rest of the blog articles:
Frame – Enclosure chassis which can hold up to 12 compute modules, and 8 interconnect modules.
Interconnect – Linkage between blade connectivity and datacentre such as Fibre Channel, Ethernet, and SAS.
Compute Module – Blade server containing CPU, Memory, PCI Expansion cards, and optionally Disk.
Storage Module – Storage chassis within above frame which can hold up to 40 SFF drives (SAS / SATA / SSD). Each storage module occupies 2 compute slots.
Stack – Between 1 and 3 Synergy frames combined together to form a single logical enclosure. Allows sharing of Ethernet uplinks to datacentre with Virtual Connect.
Virtual Connect – HPE technology to allow multiple compute nodes to share a smaller set of uplinks to datacentre networking infrastructure. Acts similar to a switch internal to the Synergy stack.

All of the vSAN nodes are contained within a single Synergy frame or chassis. Main reason behind this is that today, HPE do not support SAS connectivity across multiple frames within a stack, therefore the compute nodes accessing storage must be in the same frame as the storage module. You can mix the density of storage vs. compute within Synergy how you like. So, using a single storage module will leave 10 bays for compute.

Our vSAN configuration is set out as so:
1 x 12000 Synergy Frame with:

2 x SAS Interconnect Modules
2 x Synergy 20Gb Interconnect Link Modules
1 x D3940 Storage module with:

10 x 800GB SAS SSDs for Cache
30 x 3.84TB SATA SSDs for Capacity
2 x I/O Modules

10 x SY480 Gen 10 computer modules with:

768GB RAM (24 x 32GB DDR4 Sticks)
2 x Intel Xeon Gold 6140 CPUs (2.3Ghz x 18 cores)
2 x 300GB SSDs (for ESXi Installation)
Synergy 3820C CNA / Network Adapter
P416ie-m SmartArray Controller

The above frame is actually housed within a logical enclosure or stack containing 3 frames. This means the entire stack shares 2 redundant Virtual Connect Interconnects out to the physical switching infrastructure – but in our configuration these are in a different frame to that containing the vSAN nodes. The stack is interconnected with 20GB Interconnect modules to a pair of master Virtual Connects. For our environment, we have 4 x 40Gbe uplinks to the physical switching infrastructure per stack (2 per redundant Interconnect).

We keep our datacentre networking relatively simple, so all VLANs are trunked through the Virtual Connect switches directly to ESXi. We decided not to have any separation of networking, or internal networking configured within Virtual Connect. Therefore, vSAN replication traffic, and vMotion traffic will traverse out to the physical switching infrastructure, and hairpin back in, however this is of little concern given the bandwidth available to the stack.

That’s all for an overview of the hardware. But do let me know if there is any other detail you would like to see surrounding this topic! The next post will detail how a blade is ‘cabled’ to use the storage and network in a profile.

vSAN on HPE Synergy Composeable Infrastructure – Part 1

It’s been a while since I have posted, been pulled in many different directions with work priorities so blogging took a temporary side-line! I am now back and going to blog about a project I am currently working on to build out an extended VDI vSAN environment.

Our existing vSAN environment is running on DL380 G9 rackmounts, which whilst we had some initial teething issues have been particularly solid and reliable of late!

We are almost to the point of exhausting our CPU and Memory resources for this environment, along with about 60% utilized on the vSAN datastores across the 3 clusters. So with this it felt a natural fit to expand our vSAN environment as we continue the migration to Windows 10, and manage the explosive growth of the environment – aided by recently implementing Microsoft Azure MFA authentication vs 2-factor using a VPN connection.

As an organization, we are about to a refresh a number of HP Gen8 blades in our datacentre, and in looking at going to Gen10 knowing that this could be the last generation to support C7000 chassis, we thought it would be a good time to look at other solutions. This is where HPE Synergy composable infrastructure came in! After an initial purchase of 4 frames, and a business requirement causing us to expand this further – we felt that expanding vSAN could be a good fit into Synergy with the D3940 storage module.

Now we have the hardware in the datacentre and finally racked up, I am going to be going through a series of blogs on how vSAN looks in HPE Synergy composable infrastructure, our configuration, and some of the Synergy features / automation capabilities which make this easier to implement vs the traditional DL380 Gen9 rackmount hardware we have in place today. Stay tuned or follow my twitter handle for notifications for more on this series.

Get AHS Data from HPE iLO4+ using PowerShell

I discovered this possibility a year back, and it’s only now I invested the time to get this working! It turned out not to be as challenging as I thought, in fact the hardest bit was getting the authentication token.

I have written a PowerShell function called Get-AHSData which allows you to gather the AHS data from an HPE iLO 4 or newer. These AHS logs are frequently requested by HPE Support when logging calls for Proliant, and downloading these using the iLO UI can be cumbersome – and involves a mouse!

Get-AHSData will allow you to specify the server, iLO Credentials and a Start / End date for the log range if necessary. By default it will grab metrics for 1 day. You specify a folder to export the file to and it will go grab the file and save it there, returning a file list (if you run against multiple iLOs).

Code is out on GitHub Here

Sorry it’s a little long to embed here, and keeping it in GitHub will allow me to iterate it with improvements without having to circle back and update this page.

An example of this running:

Let me know on GitHub if you have any issues, or feel free to fork and improve! I already have a couple of enhancements from my colleagues which I will look to include.

iovDisableIR change in ESXi 6.0 U2 P04 causing PSOD on HPE Servers – Updated

So we have had an ‘interesting’ issue at work on the past few weeks!

We have had Gen8 / Gen9 blades in our environment randomly crashing over the last month. We had originally been sent down what seems an incorrect path, however seem we are on the right track now!

Symptoms

HP BL460c Gen8 and Gen9 blades with v2 / v3 processors, would randomly crash. There was no specific cause for them but it seemed to be more prevalent in higher I/O periods such as backups running. It started out that PSODs looked like this:

After logging a call with VMware, we were led down a path that the mlx4_core error in the above screenshot was causing the issue. After further investigation, it turned out that after upgrading from vSphere 5.5 to vSphere 6.0 (using VUM) there were mlx4 drivers left behind – which is what was causing the ‘jumpstart dependancy error’. Once we removed the bad 5.5 VIBs all was well.

The root cause as to why the 5.5 drivers remained after the 6.0 install, is because whilst the driver was present in the HP utilities bundle for 6.0 – the driver version was not revised, so VUM just ignored this! We have run into this before, and fed back to HP that even if the driver is the ‘same’, the version should be revised (and particularly incremented) to ensure the driver gets updated. This is not an issue if you use a fresh-install of vSphere 6.0.

So – we fixed this across the environment (along with some other VIBs – more on this later) and hoped this would be the end of it. 2 days later, we get further PSODs but this time without the mlx4 dependancy error!

Back to square one. We updated logs with VMware, and this time opened a case with HP – as it’s technically reporting LINT1 / NMI hardware errors. Cue 3 days later, and finally get some solid information back – a very interesting discovery!

HP recommended this customer advisory – One I have seen before and a long time back. Strange to me as we have never seen this issue before, and it’s not a setting we change as standard. There was also a specific error called out in the advisory which we had asked for confirmation this was in the logs:

ALERT: APIC: 1823: APICID 0x00000000 – ESR = 0x40.

Anyhow – the most crucial piece that the HP L2 tech informed us, is that ESXi 6.0 Patch ESXi600-201611401-BG changed the setting in the HP customer advisory from a default of false to true.

After running a script in PowerCLI – it appears that is certainly the case, all the hosts we had running ESXi 6.0 Build 4600944 had the iovDisableIR setting set to TRUE (so it is disabled). This is causing the PSODs according to HP.

Digging a little further, the iovDisableIR is a parameter which handles IRQ Interrupt remapping. This is a feature developed by Intel to improve performance. According to VMware this feature had it’s issues originally – particularly with certain Intel chipsets so recommend disabling it in certain circumstances. However HP do support this and infect per their advisory recommend this is enabled to prevent PSODs. The interesting piece – is where the VMware KB (1030265) linked from the HP Customer Advisory states that the error may occur with QLogic HBAs. This is the HBA we use for our FC storage in the environment, but also explains why we have not seen PSODs in our Rackmounts (were they use Direct or SAS attached storage). But – On our Gen9 hardware, we have not seen PSODs, instead the QLogic HBA failing, or the host just rebooting, so I believe these are related to the above setting.

So – to resolve this, we need to do the following across all our hosts running ESXi 6.0 Build 4600944 (I have also had word this is also the same in the ESXi 6.5 release):

esxcli system settings kernel set –setting=iovDisableIR -v FALSE

This requires a reboot to take effect. To determine if the host is affected, you can use the following PowerCLI script to gather a report of the current setting.

$Hosts = Get-VMHost
$Report = $Hosts | % {
    $iovdata = ($_ | Get-EsxCli).system.settings.kernel.list() | ? {$_.Name -like 'iovDisableIR'}
    $_ | select name,parent,model,processorType,version,build,@{n='iovDisableIR_Conf';e={$iovdata.Configured}},@{n='iovDisableIR_Runtime';e={$iovdata.Runtime}}
}

We are awaiting further information from HP/VMware who are now collaborating on our cases to determine the root cause and why this was changed (and is attributed), however have rolled this setting out across our blade environment and will continue to monitor. I will update this post when we know more!

*** Update 15th Feb ***

VMware have now released a KB article on this issue.
VMware KB (2149043)

Word of Warning:

We did some digging on this setting, and found that iovDisableIR has been Disabled (set to TRUE) on the ESXi 6.5 initial release. It does not appear to be configured unique to HP Custom ISO’s