Wednesday, October 13, 2010

Curse of the GrubUpdate: Upgrading from VMWare ESX 3.5 to vSphere 4 - The Solution

My last post was a diatribe about the horrible support experience that I had with VMWare on this issue. It provided the solution, but I figured I would write a more pointed and detailed explanation.

The errors we were getting when trying to upgrade one of our VMWare ESX 3.5 hosts to VMWare vSphere 4 were as follows:

Error in Host Update Utility:
Grub update failed

Error in vua.log:
grub> find /esx4-upgrade/vmlinuz
Error 15: File not found
grub>
info: END grub output
error: grub cannot find root hd number

After many months of working with VMWare on this issue, I still did not have a good explanation of what the grubupdate process was or what might be causing it to fail. I got sick of constantly attempting the upgrade process at the request of VMWare even though there had been no changes, or only very insignificant changes, to the system. So, I started to look at the grub files more closely and compare them to those on servers that had upgraded successfully.

The first attempt I made to correct the issue was to re-install ESX 3.5 while maintaining the existing datastores. I did this because I did not have a /var/log partition. I just had a /var partition with a log folder. The reason I thought this might be the problem is that the vSphere 4.0 upgrade always creates a /var/log partition for the ESX 3.5 failover install that you can use to boot 3.5. Anyway, this did not fix the problem.

After some more research, I noticed that all of my other servers that had been successfully upgraded had the following line in the grub.conf:

kernel /vmlinuz-version ro root=/dev/sda2

The server that was failing had the following line:

kernel /vmlinuz-version ro root=/dev/sda7

Well, I noticed sda2 on the upgraded servers was a primary partition and sda7 on the failing server was a logical partition inside an extended partition. I hypothesized that vSphere 4 requires you to have your system partition on a primary partition. Once again, I re-installed 3.5 (maintaining the existing datastores), making sure that I installed the boot and system partitions as primary partitions, and then the upgrade was successful.
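If you want to check a host before attempting the upgrade, the comparison above can be sketched in a few lines of Python. This is only a rough sketch, and it assumes an MBR (msdos) partition table, where partition numbers 1 through 4 are primary and 5 and up are logical partitions inside an extended partition. The function names are my own, not anything from VMWare, and it only understands root=/dev/... style kernel lines like the two shown above.

```python
import re

# Assumption: MBR (msdos) partition table. Numbers 1-4 are primary slots,
# numbers 5 and up are logical partitions inside an extended partition.
ROOT_RE = re.compile(r"\broot=/dev/([a-z]+)(\d+)\b")


def root_partition(kernel_line):
    """Return (disk, partition number) from a grub.conf kernel line, or None."""
    match = ROOT_RE.search(kernel_line)
    if match is None:
        return None  # e.g. root=UUID=... or root=LABEL=... is not handled here
    return match.group(1), int(match.group(2))


def is_logical(partition_number):
    """True if the partition number falls in the logical range on an MBR disk."""
    return partition_number >= 5


# The two kernel lines quoted in this post.
for line in ("kernel /vmlinuz-version ro root=/dev/sda2",
             "kernel /vmlinuz-version ro root=/dev/sda7"):
    disk, number = root_partition(line)
    kind = "logical (inside extended)" if is_logical(number) else "primary"
    print(f"/dev/{disk}{number}: {kind}")
```

Remember that this only tells you where the root partition lives. Whether a logical root partition is really what breaks grubupdate is still just my hypothesis.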

If my hypothesis is true (just because it worked for me does not totally confirm my hypothesis), I cannot believe that this is not documented in the upgrade docs and that tech support was not able to help me find a solution. Anyway, I said enough about that in my previous post.

Curse of the GrubUpdate: Upgrading from VMWare ESX 3.5 to vSphere 4 - The Experience

So, for the last 4 months my team and I have been working with VMWare to find a solution to an error we were receiving upgrading from ESX 3.5 to vSphere 4. Every single time we ran the update (which, thanks to VMWare, was like 15 times), it got to 24%, right after the ISO file finished uploading. Then the status would change to "Running grubupdate . . ." and the installation would fail. This is the error I saw in the logs:

grub> find /esx4-upgrade/vmlinuz
Error 15: File not found
grub>
info: END grub output
error: grub cannot find root hd number

You can read about the solution here. Or, you can wade through my diatribe on VMWare support below.

So, after some thorough troubleshooting we submitted a ticket to VMWare. Let me preface this by saying that we have upgraded a bunch of our VMWare ESX 3.5 hosts to vSphere 4.0 without any problems. I really like VMWare's products and have at times received decent support from them. However, for the past 4 months I have felt like I was living in the twilight zone.

For the first month, we were asked to try the upgrade again by countless support reps as our request was passed around. I even had one rep call me to ask me for information on the problems I was having upgrading my Windows 2003 Virtual Machine (Seriously, did you even read the ticket?). Anyway, after about three attempts to upgrade without reason, I refused to attempt another upgrade until they offered some type of fix that made sense.

Wait 1 month . . .

Finally they got back to me and said that the BIOS version of our server was not supported (even though they admitted our other server that had successfully upgraded had a much older BIOS version). Anyway, I gave it a shot and it didn't work.

Wait another month . . .

After this I was frustrated so I even tried re-installing 3.5 preserving the existing datastores and the upgrade still failed. Then, VMWare said I had a corrupt partition table. I deleted and re-created datastore partitions and reinstalled so that I had re-created every partition on the server and still no luck.

You may ask at this point why I didn't just blow the machine away and start over. Well, let's just say it wasn't an option. We had some production machines on the server and no space anywhere else to put them. So, in deleting and creating partitions I was constantly jockeying these virtual machines around.

Anyway, I kept troubleshooting on my own because VMWare finally came back and said, "Let us know when you can get your production data off the server so we can fix the partition table, because it could destroy all of your data." Finally, I stumbled across what seemed to me like a probable solution.

It was simple, actually. My grub.conf file was pointing at an extended partition instead of a primary partition. I was able to free up some space and a primary partition, reinstall ESX 3.5 (preserving the existing VMFS datastores) with the boot and system partitions as primary partitions, and successfully upgrade the host.

So, after one of the worst (sadly not the worst) support experiences of my life, we have finally finished upgrading all of our hosts at this location. I will post a shorter, more detailed solution and link to it here in case people don't want to read my entire rant.

Friday, October 8, 2010

Thin Clients & Terminal Servers - What to look out for or what are the stand-out issues?

I posted an answer on LinkedIn in response to a question and figured it would make an OK post. The question was "Have you ever done a Thin Client Implementation? What are the stand-out issues?".

In our thin client implementation, we used really cheap HP thin clients (~$185) and Microsoft Terminal Services (read about it here). I think thin clients work well if you have a large number of users that use the same applications (at least in a terminal server/Citrix environment). VMWare VDI may support users with more varied requirements, but licensing on that was a little unclear when we did the analysis.

We currently run over 200 data entry personnel on thin clients (one application that uses very few resources so it is an ideal application for thin clients). We run another 150 call center agents on thin clients also. These users need more resources because they run some web-enabled applications that require more memory and processing power.

I agree with the comments above (I won't steal anyone's thunder, so if you want to see others' answers you can search LinkedIn), but would add that you should disable Windows Error Reporting in any shared Windows environment. This article explains this and has links to configuration documentation. If you ever need it for debugging, you can always re-enable it.

Also, make sure you customize your group policy and login scripts for the terminal servers. You need to trim them down as much as you can because if you have a lot of users logging in at the same time, it can be pretty slow.

Make sure your helpdesk is trained to quickly identify the possible causes of slowdowns. Many times it is just a program with a memory leak, or one stuck on some process, that is slowing down the entire server. If you can quickly identify the user and have them shut down the offending process, you can avoid too many complaints. Also, be proactive and set up performance logging and alerts to notify you of high utilization on the servers.
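To give an idea of what "quickly identify the user" can look like, here is a rough Python sketch that flags the heaviest sessions from a list of per-process samples. Everything in it is an assumption for illustration: the Sample fields, the thresholds, and the sample data are made up, and how you actually collect the numbers (performance counters, a monitoring tool, etc.) is up to you.

```python
from collections import namedtuple

# Hypothetical record of one process on a terminal server (fields are assumptions).
Sample = namedtuple("Sample", "user process memory_mb cpu_percent")


def worst_offenders(samples, memory_limit_mb=500, cpu_limit=50.0):
    """Return samples over either threshold, heaviest memory users first."""
    flagged = [s for s in samples
               if s.memory_mb > memory_limit_mb or s.cpu_percent > cpu_limit]
    return sorted(flagged, key=lambda s: (s.memory_mb, s.cpu_percent), reverse=True)


# Made-up sample data, only to show the shape of the input.
samples = [
    Sample("agent01", "browser.exe", 1200, 35.0),
    Sample("entry17", "dataentry.exe", 40, 1.0),
    Sample("agent09", "browser.exe", 150, 92.0),
]

for s in worst_offenders(samples):
    print(f"{s.user}: {s.process} using {s.memory_mb} MB, {s.cpu_percent}% CPU")
```

The limits you pick will depend on your applications. Our data entry users and call center agents would need very different numbers.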

Finally, make sure your machines are protected (firewall, antivirus, IDS/IPS, etc.). I have spoken to others who have lost entire Citrix/terminal server farms to a virus outbreak. While you get the huge benefit of reduced administrative effort by only having to support a fraction of the machines, you also increase your risk if you lose one or many.

Oh yeah, and there is no DirectX support at all and no microphone (client-to-server audio) with Terminal Services without third-party add-ons.