Wednesday, October 13, 2010

Curse of the GrubUpdate: Upgrading from VMWare ESX 3.5 to vSphere 4 - The Experience

So, for the last 4 months my team and I have been working with VMWare to find a solution to an error we were receiving when upgrading from ESX 3.5 to vSphere 4. Every single time we ran the update, which thanks to VMWare was about 15 times, it got to 24%, right after the ISO file finished uploading. Then the status changed to "Running grubupdate . . ." and the installation failed. This is the error I saw in the logs:

grub> find /esx4-upgrade/vmlinuz

Error 15: File not found
grub>
info: END grub output
error: grub cannot find root hd number

You can read about the solution here. Or, you can wade through my diatribe on VMWare support below.

So, after some thorough troubleshooting we submitted a ticket to VMWare. Let me preface this by saying that we have upgraded a bunch of our VMWare ESX 3.5 hosts to vSphere 4.0 without any problems. I really like VMWare's products and have at times received decent support from them. However, for the past 4 months I have felt like I was living in the Twilight Zone.

For the first month, countless support reps asked us to try the upgrade again as our request was passed around. I even had one rep call me to ask for information on the problems I was having upgrading my Windows 2003 virtual machine (seriously, did you even read the ticket?). Anyway, after about three more upgrade attempts for no good reason, I refused to attempt another one until they offered some type of fix that made sense.

Wait 1 month . . .

Finally they got back to me and said that the BIOS version of our server was not supported (even though they admitted our other server that had successfully upgraded had a much older BIOS version). Anyway, I gave it a shot and it didn't work.

Wait another month . . .

After this I was frustrated, so I even tried reinstalling ESX 3.5 while preserving the existing datastores, and the upgrade still failed. Then VMWare said I had a corrupt partition table. I deleted and re-created the datastore partitions and reinstalled, so that every partition on the server had been re-created, and still no luck.

You may ask at this point why I didn't just blow the machine away and start over. Well, let's just say it wasn't an option. We had some production machines on the server and no space anywhere else to put them. So, while deleting and creating partitions, I was constantly jockeying these virtual machines around.

Anyway, I kept troubleshooting on my own because VMWare finally came back and said, "Let us know when you can get your production data off the server so we can fix the partition table, because fixing it could destroy all of your data." Finally, I stumbled across what seemed to me like a probable solution.

It was simple, actually. My grub.conf file was pointing at an extended partition instead of a primary partition. I was able to free up some space and a primary partition, reinstall ESX 3.5 (preserving the existing VMFS datastores) with the boot and system partitions as primary partitions, and successfully upgrade the host.
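
If you want to check your own host for the same condition, here is a rough sketch in Python. It is not something VMWare gave me and not part of the upgrade tooling. It assumes GRUB legacy numbering, where partition indexes 0 to 3 are the primary slots and 4 and up are logical partitions inside an extended partition. The function name and the sample grub.conf contents are made up for illustration; on a real host you would read the actual file instead.

```python
import re

# GRUB legacy counts partitions from 0: indexes 0-3 are the primary slots,
# 4 and up are logical partitions inside an extended partition.
ROOT_RE = re.compile(r"^\s*root\s+\(hd(\d+),(\d+)\)")


def find_logical_roots(grub_conf_text):
    """Return (line_number, disk, partition) for each root entry that
    points at a logical partition (partition index 4 or higher)."""
    hits = []
    for lineno, line in enumerate(grub_conf_text.splitlines(), start=1):
        match = ROOT_RE.match(line)
        if match:
            disk, part = int(match.group(1)), int(match.group(2))
            if part >= 4:
                hits.append((lineno, disk, part))
    return hits


# Hypothetical grub.conf contents, invented for this example only.
SAMPLE = """\
default=0
timeout=10
title VMware ESX Server
    root (hd0,4)
    kernel /vmlinuz-example ro root=LABEL=/
"""

if __name__ == "__main__":
    for lineno, disk, part in find_logical_roots(SAMPLE):
        print(f"line {lineno}: root (hd{disk},{part}) is a logical partition")
```

An empty result only means no root entry uses a logical partition. It does not prove the upgrade will succeed.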

So, after one of the worst (sadly not the worst) support experiences of my life, we have finally finished upgrading all of our hosts at this location. I will post a shorter, more detailed solution and link to it here in case people don't want to read my entire rant.
