Thursday, December 16, 2010

IISReset and There Go My Changes

Have you ever made changes in IIS 6 and then issued an IIS reset from the command line or through the GUI and lost all the changes you just made? Oh come on, I can't be the only one. Anyway, apparently IIS 6 caches all of your changes in memory and then writes them to disk automatically at some interval (about 5 minutes according to my testing). So, even though there are websites out there that say the metabase is flushed to disk when you issue a reset through the Internet Information Services Manager (IIS Manager), my testing has shown otherwise. Is it possible that this behavior is specific to my environment? Sure, but I have run the tests on multiple web servers (two different domains and one that was not part of a domain) with the same results.

So, what do you do to make sure this doesn't happen? It is actually very easy. Right-click the web server in IIS Manager, select "All Tasks", and select "Save Configuration to Disk" before you click "Restart IIS . . .". Or, if you want to do it from the command line, run "cscript.exe %SYSTEMROOT%\system32\iiscnfg.vbs /save" before you run IISReset.
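If you restart IIS often, the two command-line steps above can be wrapped in one small script so the save is never forgotten. This is only a sketch: it assumes iiscnfg.vbs returns a non-zero exit code when the save fails (I have not verified that), and the file name SaveAndReset.vbs is made up.

```vbscript
' SaveAndReset.vbs (hypothetical name): flush the metabase, then restart IIS.
Option Explicit
Dim shell, rc

Set shell = CreateObject("WScript.Shell")

' Step 1: write the cached metabase changes to disk.
' Run expands %SYSTEMROOT% itself; the third argument (True) waits for completion.
rc = shell.Run("cscript.exe //nologo ""%SYSTEMROOT%\system32\iiscnfg.vbs"" /save", 1, True)

' Assumption: a non-zero exit code means the save did not succeed.
If rc <> 0 Then
    WScript.Echo "Save failed (exit code " & rc & "). IIS was not restarted."
    WScript.Quit rc
End If

' Step 2: only now restart IIS.
rc = shell.Run("iisreset.exe", 1, True)
WScript.Quit rc
```

Run it with "cscript.exe SaveAndReset.vbs" from a command prompt on the web server.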

The other solution would be to restart all of the services without using IISReset. See this KB article for more details.

Wednesday, October 13, 2010

Curse of the GrubUpdate: Upgrading from VMWare ESX 3.5 to vSphere 4 - The Solution

My last post was a diatribe about the horrible support experience that I had with VMWare on this issue. It provided the solution, but I figured I would write a more pointed and detailed explanation.

The errors we were getting when trying to upgrade one of our VMWare ESX 3.5 hosts to VMWare vSphere 4 were as follows:

Error in Host Update Utility:
Grub update failed

Error in vua.log:
grub> find /esx4-upgrade/vmlinuz
Error 15: File not found
grub>
info: END grub output
error: grub cannot find root hd number

After many months of working with VMWare on this issue, I still did not have a good explanation of what the grubupdate process was or what might be causing it to fail. I got sick of constantly attempting the upgrade process at the request of VMWare even though there had been no change or very insignificant changes to the system. So, I started to look at the grub files more closely and compare them to those on servers that had upgraded successfully.

The first attempt I made to correct the issue was to re-install ESX 3.5 while maintaining the existing datastores. I did this because I did not have a /var/log partition. I just had a /var partition with a log folder. The reason I thought this might be the problem is that the vSphere 4.0 upgrade always creates a /var/log partition for the ESX 3.5 failover install that you can use to boot 3.5. Anyway, this did not fix the problem.

After some more research, I noticed that all of my other servers that had been successfully upgraded had the following line in the grub.conf:

kernel /vmlinuz-version ro root=/dev/sda2

The server that was failing had the following line:

kernel /vmlinuz-version ro root=/dev/sda7

Well, I noticed sda2 on the upgraded servers was a primary partition and sda7 on the failing server was a logical partition inside an extended partition. I hypothesized that vSphere 4 requires you to have your system partition on a primary partition. Once again, I re-installed 3.5 (maintaining the existing datastores), making sure that I installed the boot and system partitions as primary partitions, and then the upgrade was successful.

If my hypothesis is true (just because it worked for me does not totally confirm my hypothesis), I cannot believe that this is not documented in the upgrade docs and that tech support was not able to help me find a solution. Anyway, I said enough about that in my previous post.

Curse of the GrubUpdate: Upgrading from VMWare ESX 3.5 to vSphere 4 - The Experience

So, for the last 4 months my team and I have been working with VMWare to find a solution to an error we were receiving upgrading from ESX 3.5 to vSphere 4. Every single time we ran the update, which thanks to VMWare was like 15 times, we got to 24% right after the ISO file finished uploading, the status changed to "Running grubupdate . . .", and the installation failed. This is the error I saw in the logs:

grub> find /esx4-upgrade/vmlinuz
Error 15: File not found
grub>
info: END grub output
error: grub cannot find root hd number

You can read about the solution here. Or, you can wade through my diatribe on VMWare support below.

So, after some thorough troubleshooting we submitted a ticket to VMWare. Let me preface this by saying that we have upgraded a bunch of our VMWare ESX 3.5 hosts to vSphere 4.0 without any problems. I really like VMWare's products and have at times received decent support from them. However, the past 4 months I feel like I have been living in the twilight zone.

For the first month, we were asked to try the upgrade again by countless support reps as our request was passed around. I even had one rep call me to ask me for information on the problems I was having upgrading my Windows 2003 Virtual Machine (Seriously, did you even read the ticket?). Anyway, after about three attempts to upgrade without reason, I refused to attempt another upgrade until they offered some type of fix that made sense.

Wait 1 month . . .

Finally they got back to me and said that the BIOS version of our server was not supported (even though they admitted our other server that had successfully upgraded had a much older BIOS version). Anyway, I gave it a shot and it didn't work.

Wait another month . . .

After this I was frustrated so I even tried re-installing 3.5 preserving the existing datastores and the upgrade still failed. Then, VMWare said I had a corrupt partition table. I deleted and re-created datastore partitions and reinstalled so that I had re-created every partition on the server and still no luck.

You may ask at this point why I didn't just blow the machine away and start over. Well, let's just say it wasn't an option. We had some production machines on the server and no space anywhere else to put them. So, in deleting and creating partitions I was constantly jockeying these virtual machines around.

Anyway, I kept troubleshooting on my own because VMWare finally came back and said to let them know when we could get our production data off the server so they could fix the partition table, because the fix could destroy all of our data. Finally, I stumbled across what seemed to me like a probable solution.

It was simple actually. My grub.conf file was pointing at a logical partition inside an extended partition instead of a primary partition. I was able to free up some space and a primary partition, reinstall ESX 3.5 (preserving the existing VMFS datastores) with the boot and system partitions as primary partitions, and successfully upgrade the host.

So, after one of the worst (sadly not the worst) support experiences of my life, we have finally finished upgrading all of our hosts at this location. I will post a shorter, more detailed solution and link to it here in case people don't want to read my entire rant.

Friday, October 8, 2010

Thin Clients & Terminal Servers - What to look out for or what are the stand-out issues?

I posted an answer on LinkedIn in response to a question and figured it would make an OK post. The question was "Have you ever done a Thin Client Implementation? What are the stand-out issues?".

In our thin client implementation, we used really cheap HP thin clients (~$185) and Microsoft Terminal Services (read about it here). I think thin clients work well if you have a large number of users that use the same applications (at least in a terminal server/Citrix environment). VMWare VDI may support users with more varied requirements, but licensing on that was a little unclear when we did the analysis.

We currently run over 200 data entry personnel on thin clients (one application that uses very few resources so it is an ideal application for thin clients). We run another 150 call center agents on thin clients also. These users need more resources because they run some web-enabled applications that require more memory and processing power.

I agree with the comments above (I won't steal anyone's thunder, so if you want to see others' answers you can search LinkedIn), but would add that you should disable Windows Error Reporting in any shared Windows environment. This article explains this and has links to configuration documentation. If you ever need it for debugging, you can always re-enable it.

Also, make sure you customize your group policy and login scripts for the terminal servers. You need to trim them down as much as you can because if you have a lot of users logging in at the same time, it can be pretty slow.

Make sure your helpdesk is trained on how to quickly identify what possible causes of slowdowns might be. Many times it is just a program with a memory leak or stuck on some process that is slowing down the entire server. If you can quickly identify the user and have them shut down the offending process, you can avoid too many complaints. Also, be proactive and set up performance logging and alerts to notify you of high utilization on the servers.

Finally, make sure your machines are protected (firewall, antivirus, IDS/IPS, etc.). I have spoken to others that have lost entire Citrix/terminal server farms to a virus outbreak. While you get the huge benefit of reduced administrative effort by only having to support a fraction of the machines, you also increase your risk if you lose one or many.

Oh yeah, and no DirectX support at all and no microphone (client-to-server audio) with Terminal Services without third-party add-ons.

Thursday, September 30, 2010

"Log On To" Why Doesn't Anybody Use It?

I decided to write an article on the "Log On To" feature in Microsoft's Active Directory because I have yet to find others that use this feature. I am not saying you aren't out there, but I think this is a much overlooked feature. We use this feature along with "Limit Login" (see this post) to restrict the computers our users can log in to and limit simultaneous sessions.

The Log On To feature can be found by going to the properties of the user object and selecting the "Account" tab. There is a button on that tab that says "Log On To...". You can use this button to open a dialog that allows you to specify all of the computers a user is allowed to (Have you guessed yet?) log on to.
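If you have a lot of users to check, the same setting can be read from a script instead of the dialog. The sketch below assumes the "Log On To" list is stored in the userWorkstations attribute of the user object as a comma-separated list of computer names. The distinguished name and computer names are made up for illustration.

```vbscript
Option Explicit
Dim objUser, workstations

' Hypothetical distinguished name; replace it with a real user in your domain.
Set objUser = GetObject("LDAP://CN=Jane Doe,OU=Agents,DC=example,DC=com")

' Get raises an error when the attribute has no value, so trap that case.
On Error Resume Next
workstations = objUser.Get("userWorkstations")
If Err.Number <> 0 Then
    workstations = ""
    Err.Clear
End If
On Error GoTo 0

If workstations = "" Then
    WScript.Echo "No restriction: the user can log on to any computer."
Else
    WScript.Echo "Allowed computers: " & workstations
End If

' To restrict the user to two computers (hypothetical names), uncomment:
' objUser.Put "userWorkstations", "PC-0101,PC-0102"
' objUser.SetInfo
```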

Why is this important?

Well, why not? If there are users that only log into one computer every single day, why allow them to log into every single machine on the network?

What does it block (By Design)?

The "Log On To" feature stops the user from logging on to the console of a computer (whether sitting at the machine or through remote control software (Remote Desktop/RDP, PC Anywhere, VNC, etc.)).

What does it block (Undesired results)?

So, the feature is not without problems. If you are using any type of LDAP authentication, you will have to add your LDAP servers to the list of allowed computers. You will also have to add the server that hosts Outlook Web Access if you use Exchange for your mail server. You may also have issues with RADIUS servers and websites that use Integrated Authentication.

What doesn't it block?

You can still use file/printer sharing on servers that are not on the list so you do not need to add your File/Print servers to the list. You do not need to add your Domain Controllers either unless you are using them for LDAP.

Microsoft Limit Login and Login Scripts on x64 Machines

We use Limit Login in our environment and I ran into some issues the other day when we deployed some 64-bit terminal servers at our Beijing, China location. For those unfamiliar with Limit Login, it is a utility provided by Microsoft that allows you to limit the number of simultaneous logins within an Active Directory environment. The utility works by extending the Active Directory schema to store additional information related to logins. Therefore, you do not need to store the information in a separate database as required with past methods. The utility then uses web services and login scripts to update the information in Active Directory. For more information on the utility, please see this article.

We use Limit Login along with the "Log On To" property (see this post) of the Active Directory user object to limit the machines users can log on to and how many simultaneous sessions they are allowed.

Anyway, back to the issues. Once I set up all of the users and configured their user objects to limit the number of simultaneous logins, I performed some tests and noticed it wasn't adding the logins to Active Directory. After some troubleshooting, I noticed that the login scripts were not running correctly. They were getting errors because the objects used to connect to the web services were 32-bit controls. After additional troubleshooting, I found that I needed to run the login script under the 32-bit version of wscript (the Windows Script Host program that runs script files like VBScript in Windows). Apparently, the x64 version of Windows Server 2003 includes two copies of wscript. The default copy is stored in the system32 folder and is actually the 64-bit version (yeah awesome right). The 32-bit version is stored in SysWow64 (again awesome, but I am sure they have their reasons). Anyway, since I needed the script to run under the 32-bit version, I had to create a login script that first determined if the OS was x86 or x64 and then ran the original Limit Login script under the correct version of wscript for that OS.

Here is an example of the login script that calls the Limit Login script:

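' Suppress runtime errors so the script continues past any failure.
On Error Resume Next

Set WshShell = CreateObject("WScript.Shell")

' Read the OS architecture: "x86" on 32-bit Windows; x64 editions report "AMD64".
OsType = WshShell.RegRead("HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment\PROCESSOR_ARCHITECTURE")
If (OsType = "x86") Then
    ' 32-bit OS: the default wscript is already the 32-bit version.
    WshShell.Run "wscript \\SERVERNAME\LLScripts$\lloginscript.vbs", , True
Else
    ' 64-bit OS: use the 32-bit wscript in SysWow64 so the 32-bit controls load.
    WshShell.Run "%windir%\SysWow64\wscript \\SERVERNAME\LLScripts$\lloginscript.vbs", , True
End If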

Friday, July 30, 2010

SOLVED: Cannot Authenticate to IIS Locally using a DNS Name

I worked on a support issue the other day that was really strange and I had a tough time finding the answer even though there are two Microsoft articles that discuss the issue. Here is what was happening:

One of our developers was testing an application on a web server and set it up to only allow Integrated Authentication. He could log onto the site when he was on another computer, but could not log onto the website when he was on the web server itself.

On the local machine, he would get prompted for log on and enter the user name and password a few times before getting the following error:

"HTTP 401.1 - Unauthorized: Logon Failed"

My first few searches to see if anyone else had this problem returned a bunch of irrelevant articles about people that had issues where they could log on locally, but could not log on from outside machines. I learned I was able to log onto the local web server if I used the IP address or the computer name, but was unsuccessful only when I used a DNS name.

I finally got lucky with one of my searches and came across these two MS articles:

Article 1
Article 2

Article 1 explains two methods for solving the issue. Article 2 explains why you should use method 1 instead of method 2. Beware, article 1 says you only need to restart the IIS Admin service after you modify the registry. This is not correct. You need to reboot the server.

Tuesday, June 8, 2010

Login Script Not Working - Curse of the Variant Return Type

We were setting up some new users the other day because we are adding seats to our call center business. The setup was slightly different than our other setups have been for the call center because these agents are working out of our Draper, Utah location. We previously did not have any agents at this location.

Everything was working really well, until we found out the login script was not mapping the drive that the call center agents needed to run the call center application. Of course, we could map the drive manually, but it was super frustrating because we have used variations of the same login script for a long time and have never had any issues. So, I started looking at the script and troubleshooting the issue.

The login script is a pretty simple VBScript that loops through the user's group membership and maps drives based on those groups. I started by adding some messages to the script to make sure it was running. One message box displayed the groups as it looped through and determined drive mappings. For these agents, it displayed one message box, but it was blank where it should have had a group name. The weird part is that the call center supervisor's script was displaying his group membership and mapping the drive correctly. So, I started looking at the differences between the supervisor's user object and the agents' user objects.

The main difference between the user objects was that the call center supervisor was a member of two groups other than 'Domain Users', and, in the spirit of least privilege, the call center agents were members of only one group other than 'Domain Users'. So, for a quick test I added another group to an agent, and, crazy as it sounds, the script started mapping the drives. Great, so the problem was "fixed" (read patched), but the "solution" (read workaround) drove me crazy, so I did some research to find out why the number of groups mattered.

I searched Google for the memberOf attribute of the Active Directory user object, which I was using to get the array of groups, and this article explained it all.

Apparently, memberOf returns an array if you have more than one group (other than 'Domain Users' because 'Domain Users' is the primary group and is not returned by memberOf). However, if you have no groups (other than 'Domain Users'), it returns an empty object. Finally, if you only have one group, it returns a 'String' variable. Seriously, an array, a string, or an empty object.

Long story short, my 'For Each' loop would not work on a String variable, so that is why the script was not running correctly. So, I changed my code to account for the different return types, and the login script worked as designed regardless of how many groups the users were a member of.
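For anyone hitting the same thing, here is a minimal sketch of one way to handle the three return types. It is not my exact production code. The function name GroupsAsArray is made up, and the sketch assumes the script reads the attribute as objUser.memberOf and that the "empty" case arrives as an Empty or Null Variant.

```vbscript
Option Explicit

' Return the memberOf value as an array, however ADSI delivered it.
Function GroupsAsArray(memberOf)
    If IsArray(memberOf) Then
        GroupsAsArray = memberOf          ' two or more groups: already an array
    ElseIf IsEmpty(memberOf) Or IsNull(memberOf) Then
        GroupsAsArray = Array()           ' no groups besides the primary group
    Else
        GroupsAsArray = Array(memberOf)   ' exactly one group: wrap the string
    End If
End Function

Dim objSysInfo, objUser, groupDN

' Bind to the user object of whoever is running the login script.
Set objSysInfo = CreateObject("ADSystemInfo")
Set objUser = GetObject("LDAP://" & objSysInfo.UserName)

' The loop now runs zero, one, or many times without a type error.
For Each groupDN In GroupsAsArray(objUser.memberOf)
    WScript.Echo groupDN   ' replace with the drive-mapping logic
Next
```

Wrapping the value once at the top keeps the rest of the script unchanged: the 'For Each' loop never has to know which of the three types came back.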

I am sure that there are programmers out there that swear by variants, but as a return type, I am not so sure it is the best coding practice. I am sure there are people that disagree, and I look forward to their comments.