Wrestling with VMware High Availability (HA)

A few months back I ran into a bit of trouble with an upgrade in our corporate VMware cluster that I thought I would share. The plan was to add a new host to the mix and bring everything up to vSphere 4.1 Update 1. It seemed pretty straightforward at the time, but a few unexpected issues sucked up more time than I had planned for.

Now, we have several clusters here in our company, and we often move a host from one cluster to another. Several of the clusters were originally set up to work within the various Active Directory domains. This has been rather annoying when moving a host from one domain to another and having to do all the DNS update foo. It’s much easier to have one private domain to rule them all, so most of our hosts have been updated and moved to this new private domain, which can be resolved from all of the other domains.

This is where the fun actually comes in. The cluster that I was moving the new host into is a holdout in the old namespace. Adding a new node shouldn’t be a big deal, as the new name resolves in vCenter.

Now, here comes the rub.

While upgrading one particular host, I could not enable the HA configuration. It just wouldn’t work, and the only reason given was that it failed to find the primary node. Now, you would think that this was failing on the new host that was added to the cluster. Nope, that one added just fine, no worries there whatsoever. The node that was failing was actually the third host in the group to be upgraded.

Apparently what was happening had nothing to do with the primary node at all. It had to do with the new node, the one named in the private domain. Somewhere deep in the bowels of vSphere, hosts get looked up by their “short name”, meaning esx123456 rather than the fully qualified name. While esx123456.private.domain.com resolved just fine, the short name esx123456 did not, because the rest of the cluster expanded it to esx123456.domain.com, which didn’t exist.
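
To make the failure a little more concrete, here is a minimal Python sketch of how a DNS search list typically expands a short name. The function name candidate_names and the simplified ordering (search domains first, bare name last) are assumptions for illustration only; this is not what vSphere actually runs internally.

```python
def candidate_names(short_name, search_domains):
    """Return the names a resolver would typically try for a short name.

    Simplified model (an assumption, not vSphere's real logic): each
    search domain is appended in order, then the bare name is tried last.
    """
    candidates = [f"{short_name}.{domain}" for domain in search_domains]
    candidates.append(short_name)
    return candidates


# A host still living in the old domain only searches domain.com,
# so it never tries esx123456.private.domain.com.
old_domain_view = candidate_names("esx123456", ["domain.com"])
print(old_domain_view)
```

From the old domain’s point of view, none of the names tried has a record, which would line up with the lookup failing even though the fully qualified private name resolves.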

So my advice to you is this: when changing the domains of hosts in a cluster, make sure every host has DNS entries in both the new and old domains so you can avoid this short name lookup failure.
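
If you want to check this before flipping HA back on, something like the following rough sketch could help. It assumes Python 3 on a machine that uses the same DNS servers as the cluster; the helper names resolves and missing_entries are made up for this example, and the host and domain names are just the placeholders used above.

```python
import socket


def resolves(name, lookup=socket.gethostbyname):
    """Return True if the name resolves, False if the lookup fails."""
    try:
        lookup(name)
        return True
    except OSError:  # socket.gaierror and socket.herror are subclasses
        return False


def missing_entries(short_names, domains, lookup=socket.gethostbyname):
    """List every host.domain combination that does not resolve.

    The lookup argument is injectable so the check can be exercised
    without touching real DNS.
    """
    missing = []
    for short_name in short_names:
        for domain in domains:
            fqdn = f"{short_name}.{domain}"
            if not resolves(fqdn, lookup):
                missing.append(fqdn)
    return missing
```

Calling missing_entries(["esx123456"], ["domain.com", "private.domain.com"]) returns the fully qualified names that still need a DNS record; an empty list means every host is covered in both domains.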