Connection Pooling is Busted

Posted by MightyZot on Geeks with Blogs See other posts from Geeks with Blogs or by MightyZot
Published on Tue, 09 Mar 2010 04:12:35 GMT Indexed on 2010/03/11 4:40 UTC
Read the original article Hit count: 576

Filed under:

A few weeks ago we started getting complaints about performance in an application that has performed very well for many years.  The application is a n-tier application that uses ADODB with the SQLOLEDB provider to talk to a SQL Server database.  Our object model is written in such a way that each public method validates security before performing requested actions, so there is a significant number of queries executed to get information about file cabinets, retrieve images, create workflows, etc.  (PaperWise is a document management and workflow system.)  A common factor for these customers is that they have remote offices connected via MPLS networks.

Naturally, the first thing we looked at was the query performance in SQL Profiler.  All of the queries were executing within expected timeframes, most of them were so fast that the duration in SQL Profiler was zero.  After getting nowhere with SQL Profiler, the situation was escalated to me.  I decided to take a peek with Process Monitor.  Procmon revealed some “gaps” in the TCP/IP traffic.  There were notable delays between send and receive pairs.  The send and receive pairs themselves were quite snappy, but quite often there was a notable delay between a receive and the next send.  You might expect some delay because, presumably, the application is doing some thinking in-between the pairs.  But, comparing the procmon data at the remote locations with the procmon data for workstations on the local network showed that the remote workstations were significantly delayed.  Procmon also showed a high number of disconnects.

Wireshark traces showed that connections to the database were taking between 75ms and 150ms.  Not only that, but connections to a file share containing images were taking 2 seconds!  So, I asked about a trust.  Sure enough there was a trust between two domains and the file share was on the second domain.  Joining a remote workstation to the domain hosting the share containing images alleviated the time delay in accessing the file share.  Removing the trust had no affect on the connections to the database.

Microsoft Network Monitor includes filters that parse TDS packets.  TDS is the protocol that SQL Server uses to communicate.  There is a certificate exchange and some SSL that occurs during authentication.  All of this was evident in the network traffic.  After staring at the network traffic for a while, and examining packets, I decided to call it a night.  On the way home that night, something about the traffic kept nagging at me.  Then it dawned on me…at the beginning of the dance of packets between the client and the server all was well.  Connection pooling was working and I could see multiple queries getting executed on the same connection and ethereal port.  After a particular query, connecting to two different servers, I noticed that ADODB and SQLOLEDB started making repeated connections to the database on different ethereal ports.  SQL Server would execute a single query and respond on a port, then open a new port and execute the next query.  Connection pooling appeared to be broken.

The next morning I wrote a test to confirm my hypothesis.  Turns out that the sequence causing the connection nastiness goes something like this:

  • Make a connection to the database.
  • Open a result set that returns enough records to require multiple roundtrips to the server.
  • For each result, query for some other data in the database (this will open a new implicit connection.)
  • Close the inner result set and repeat for every item in the original result set.
  • Close the original connection.

Provided that the first result set returns enough data to require multiple roundtrips to the server, ADODB and SQLOLEDB will start making new connections to the database for each query executed in the loop.  Originally, I thought this might be due to Microsoft’s denial of service (ddos) attack protection.  After turning those features off to no avail, I eventually thought to switch my queries to client-side cursors instead of server-side cursors.  Server-side cursors are the default, by the way.  Voila!  After switching to client-side cursors, the disconnects were gone and the above sequence yielded two connections as expected.

While the real problem is the amount of time it takes to make connections over these MPLS networks (100ms on average), switching to client-side cursors made the problem go away.  Believe it or not, this is actually documented by Microsoft, and rather difficult to find.  (At least it was while we were trying to troubleshoot the problem!)  So, if you’re noticing performance issues on slower networks, or networks with slower switching, take a look at the traffic in a tool like Microsoft Network Monitor.  If you notice a high number of disconnects, and you’re using fire-hose or server-side cursors, then try switching to client-side cursors and you may see the problem go away.

Most likely, Microsoft believes this to be appropriate behavior, because ADODB can’t guarantee that all of the data has been retrieved when you execute the inner queries.  I’m not convinced, though, because the problem remains even after replacing all of the implicit connections with explicit connections and closing those connections in-between each of the inner queries.  In that case, there doesn’t seem to be a reason why ADODB can’t use a single connection from the connection pool to make the additional queries, bringing the total number of connections to two.  Instead ADO appears to make an assumption about the state of the connection.

I’ve reported the behavior to Microsoft and am awaiting to hear from the appropriate team, so that I can demonstrate the problem.  Maybe they can explain to us why this is appropriate behavior.  :)

© Geeks with Blogs or respective owner