What can cause a spontaneous EPIPE error without either end calling close() or crashing?
Posted by Hongli on Stack Overflow. Published on 2010-02-10T10:13:38Z.

I have an application that consists of two processes (let's call them A and B), connected to each other through Unix domain sockets. Most of the time it works fine, but some users report the following behavior:
- A sends a request to B. This works. A now starts reading the reply from B.
- B sends a reply to A. The corresponding write() call fails with EPIPE, and as a result B calls close() on the socket. However, A did not close() the socket, nor did it crash.
- A's read() call returns 0, indicating end-of-file. A thinks that B prematurely closed the connection.
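
For comparison, here is a minimal sketch of the ordinary way these two symptoms arise: the peer really did close its end. It is not the application's code. The names `a` and `b` merely stand in for processes A and B, and a `socket.socketpair()` inside one process is assumed in place of the real connection between two processes. Python ignores SIGPIPE by default, so the failed write surfaces as an `OSError` with `errno.EPIPE` rather than as a signal.

```python
import errno
import socket


def write_after_peer_close():
    """Return the errno B sees when writing after A has closed its end."""
    # Hypothetical stand-ins for A and B: a connected Unix stream socket pair.
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    a.sendall(b"request")
    b.recv(1024)           # B consumes the request
    a.close()              # the ordinary cause: A's end goes away
    try:
        b.sendall(b"reply")
        return None        # the write unexpectedly succeeded
    except OSError as e:
        return e.errno
    finally:
        b.close()


def read_after_peer_close():
    """Return what A reads after B has closed its end."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    a.sendall(b"request")
    b.recv(1024)           # B consumes the request, then gives up
    b.close()
    try:
        return a.recv(1024)  # empty bytes is the equivalent of read() returning 0
    finally:
        a.close()


if __name__ == "__main__":
    print(write_after_peer_close() == errno.EPIPE)
    print(read_after_peer_close() == b"")
```

The puzzle in the reports above is that the same two results appear even though neither side appears to have performed the close() that this sketch performs explicitly.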
 
Users have also reported variations of this behavior, e.g.:
- A sends a request to B. This works partially, but before the entire request is sent, A's write() call fails with EPIPE, and as a result A calls close() on the socket. However, B did not close() the socket, nor did it crash.
- B reads a partial request and then suddenly gets an EOF.
 
The problem is I cannot reproduce this behavior locally at all. I've tried OS X and Linux. The users are on a variety of systems, mostly OS X and Linux.
Things that I've already tried and considered:
- Double close() bugs (close() is called twice on the same file descriptor): probably not, since that would result in EBADF errors, and I haven't seen any.
- Increasing the maximum file descriptor limit. One user reported that this worked for him; the rest reported that it did not.
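
The two checks above can be sketched roughly as follows. This is an illustration under stated assumptions, not the application's code: the helper names `double_close_errno` and `fd_limits` are hypothetical, and the double-close check assumes the simplest case, a single thread in which nothing else opens a descriptor between the two close() calls.

```python
import errno
import os
import resource
import socket


def double_close_errno():
    """Close the same descriptor twice and return the errno of the second call."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    fd = a.detach()        # take the raw descriptor so os.close() can be called directly
    os.close(fd)           # first close succeeds
    try:
        os.close(fd)       # second close: the number no longer refers to an open file
        return None
    except OSError as e:
        return e.errno
    finally:
        b.close()


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(double_close_errno() == errno.EBADF)
    print(fd_limits())
```

Logging the value returned by `fd_limits()` on an affected machine would at least show which limit was in force when the error occurred.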
 
What else can possibly cause behavior like this? I know for certain that neither A nor B calls close() on the socket prematurely, and I know for certain that neither of them has crashed, because both A and B were able to report the error. It is as if the kernel suddenly decided to pull the plug on the socket for some reason.