vSphere Appliance 6 update 1b issues – Regenerate your certificates!

Recently we decided to upgrade one of our clusters to the latest vSphere/ESXi 6.0 U1b.  While it’s early in the release cycle to apply these fixes to a production system, we’ve been having some issues with this cluster we were hoping the upgrade would resolve.  This cluster has 4 hosts running VSAN for storage, and is primarily used for VDI.  We use App Volumes pretty extensively.  This cluster uses the certs that were generated when the appliance was deployed.

Last Friday night, we mounted the ISO and ran the update.  Following the update, things seemed fine for a while, other than an apparently cosmetic “This node cannot communicate with all the other nodes in the VSAN cluster” warning in the VI native client, which is documented in this VSAN Forum discussion.  However, after approximately 2 hours online, the vpxd service would become unresponsive and the VI Client and Web Client would hang.  It would intermittently come back, but you could only click on one or two things in the VI Client before it became unresponsive again.  VCS servers were unable to execute power operations.  If we rebooted the VCSA, it would be responsive for a few hours before hanging again.  None of this behavior presented itself before the upgrade to U1b.

We opened an SR with VMware support, but after hours of looking at logs the best they could suggest was rebuilding the appliance from scratch and re-importing our database from the broken appliance.  Ultimately, we started looking at SSO as the probable cause of our issues, because our SSO logs appeared much larger than they should be.  At the moment the appliance became unresponsive, the SSO service started failing authentication requests.  Given that VMware’s solution was to rebuild the appliance, we decided we had little to lose by attempting to regenerate all of the certificates using the certificate-manager utility.  After we gave that a go, the problem was resolved.  Our best guess is that one of the solution certificates was, to use the technical term, “borked,” and that U1b either has some new throttling in place or handles broken solution certificates differently than 6.0a.
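
If you want to check whether your own appliance shows the same oversized-SSO-log symptom, something like the following rough Python sketch could help.  It simply walks a log directory and lists the largest files.  The path /var/log/vmware/sso is an assumption about where the SSO logs live on the appliance, and the helper name largest_files is made up for illustration, so adjust both for your environment.  This is not the method VMware support used, just a quick way to eyeball log sizes.

```python
import os


def largest_files(log_dir, top=5):
    """Return up to `top` (size_in_bytes, path) pairs under log_dir, largest first."""
    sizes = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # The file may have been rotated away while we were walking; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    # Assumed location of the SSO logs on the VCSA; change it if yours differs.
    for size, path in largest_files("/var/log/vmware/sso"):
        print(f"{size / (1024 * 1024):8.1f} MiB  {path}")
```

If the directory does not exist, the function just returns an empty list, so it is safe to point at a path you are unsure about.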

We’ll continue troubleshooting with VMware to attempt to determine the underlying cause and will update this post if we learn more.  It seems as though the common wisdom was true, at least this time: “If thouest have performance issues with vSphere, SSO and certificates are thy cause.”