Wednesday, March 23, 2011

You Might Not Want to Disable Opportunistic Locking If . . .

1. If you programmatically access large flat-file databases hosted on a file share in Windows 2003, you might not want to disable opportunistic locking ( OpLocks ).
2. OK, I guess I only have one confirmed reason right now.

Problem:
We use Melissa Data COM objects and databases to provide some of the address validation for our customers. One of our developers was making some changes to our processes and noticed that it was taking over a minute for his program to initialize the Canadian address database that was hosted on our production server ( Windows 2003 Enterprise R2 Server x86 ). Usually, this process takes between 1 and 3 seconds. So, I started troubleshooting the issue.

Troubleshooting:
First, I ran Process Monitor to look at the registry, file system, and network access and make sure there wasn't anything abnormal with the communication. We found some issues with missing config files, but it turns out that the files are not required.

Second, I tried copying the database files to another server and accessing them on that server. Once this change was in place, the database initialized in 1-3 seconds as expected. Interesting . . . So, I then started evaluating the differences between the servers and found that one was 64-bit and the other was 32-bit. I also noticed that the servers were on different physical servers. Anyway, I tried this on a few different servers including one that was the same version and patch level, on the same physical host, and using the same physical network adapter. Every server I copied the database files to worked flawlessly and the database initialized in 1-3 seconds.

Third, I ran a wireshark trace and compared the network traffic going to the production server vs. the network traffic to the other servers that were working as expected. When accessing the production server with the issue I found that the communication between the hosts included over 44,000 SMB packets. Most of these packets were only 512 bytes. When accessing the other servers, the communication between the hosts included less than 900 SMB packets and the majority of these packets were 32,768 bytes. Interesting . . . Why was there so much more SMB traffic to the production server?

Fourth, I decided to test adding a new virtual hard drive to the production server. I made this decision because I was still unsure what the problem was, and since we format the drive on our production server with a smaller block size, I wanted to make sure that this difference was not a factor. It wasn't.

Fifth (OK, so to be completely honest there were a few more steps, but they are not worth mentioning here), I had obviously been searching online for possible solution, but I had not found anything that I thought would make a difference. However, after much searching, I found that many posts and articles dealing with SMB referenced registry entries under lanmanserver/parameters.

Solution:
I compared the registry values under the registry key lanmanserver/parameters on a server that was working with the production server. I found that there was a value "EnableOpLocks" on the production server with a value of "0" that did not exist on the other server. After reading about oplocks and how they affected SMB traffic, I decided to test removing this entry which would re-enable the opportunistic locking on that server. Once this change was made and the production server was rebooted, everything worked as expected and the behavior was the same as the other servers.

2 comments:

  1. In case anyone else happens upon this post, here's a warning:

    Enabling oplocks with a flat-file DB will cause data loss, period. You'll be extremely lucky if all you end up with is index corruption, but you're probably going to lose entire tables, if not the database itself.

    Oplocks are enabled by default.
    There was a reason oplocks were disabled.
    Don't change registry keys unless you know why they were changed to begin with.

    ReplyDelete
    Replies
    1. Thanks for your feedback. In our situation where this was a read-only database, we never experienced any corruption. Also, this was not a mission critical database and could have been replaced in a matter of minutes. I do appreciate the information though since the post does not address the possible corruption issues.

      Delete