Diagnosing Random Network Lag
Posted by uesp on Server Fault
Published on 2011-11-14T15:54:27Z
Indexed on 2011/11/18 17:53 UTC

I'm having trouble diagnosing some random lag on a six-server LAMP cluster serving a MediaWiki site. While we're serving some 100 pages/sec, the servers themselves are running fine: load is below 0.5, and there are no locked processes, no paging, and no errors being logged.
- Lag is present on all servers and is random: one minute it's fine, the next it's there.
- DNS lookups on the servers are randomly slow. For example, `time nslookup google.com` varies randomly from a few milliseconds to several seconds and sometimes times out entirely. While we use IP addresses internally on the cluster, this may be a symptom of the root issue. We are not running our own DNS server.
- The Apache `server-status` pages randomly lag or time out. Benchmarking using `ab` between servers shows that a few loads sometimes take 3000 ms (almost exactly). Benchmarking `server-status` on the local server itself usually shows no issue (it showed a lag only once among a few hundred tests).
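
The DNS timing test above could be automated to see how often the slow lookups occur and how long they take. What follows is only a minimal sketch in Python 3. It assumes `socket.getaddrinfo` is an acceptable stand-in for `nslookup`; it goes through the system resolver, so a local cache or `/etc/hosts` entry would change the results. The function names, the 1-second "slow" threshold, and the attempt count are illustrative choices, not part of the original tests.

```python
import socket
import time


def time_call(fn, *args):
    """Return (elapsed_seconds, error) for one call to fn(*args)."""
    start = time.monotonic()
    try:
        fn(*args)
        error = None
    except OSError as exc:  # socket.gaierror and socket timeouts are OSError subclasses
        error = exc
    return time.monotonic() - start, error


def summarize(samples, slow_threshold=1.0):
    """Summarize a non-empty list of (elapsed_seconds, error) samples."""
    times = sorted(elapsed for elapsed, _ in samples)
    return {
        "count": len(times),
        "min": times[0],
        "median": times[len(times) // 2],  # upper median for even counts
        "max": times[-1],
        "slow": sum(1 for t in times if t >= slow_threshold),
        "errors": sum(1 for _, err in samples if err is not None),
    }


def probe_dns(hostname="google.com", attempts=20, pause=0.5):
    """Time repeated lookups of hostname through the system resolver."""
    samples = []
    for _ in range(attempts):
        samples.append(time_call(socket.getaddrinfo, hostname, 80))
        time.sleep(pause)
    return summarize(samples)
```

Calling `probe_dns()` on several servers at the same time and comparing the `slow` and `errors` counts might show whether the slow lookups hit all machines at once or each one independently.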
The servers sit behind a switch and a firewall that I don't have any access to, so I don't know their setup or status. We are under heavier than normal load, but 2 Mbps of incoming and 20 Mbps of outgoing traffic shouldn't be stressing the switch or firewall, should it? My feeling is that the problem is the switch/firewall or something upstream of them at the ISP, such as their DNS, but I can't confirm it.
I need some other tests or methods of diagnosing this lag to try and narrow down the ultimate cause.
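
One candidate test would be to time the TCP connect step on its own, separately from the HTTP response. A delay of almost exactly 3000 ms is consistent with a lost SYN being retransmitted after a 3-second initial timeout, which was the default on many Linux kernels of this period; whether that applies to these servers is an assumption. The sketch below is in Python 3. The function names, the 100 ms "slow" cutoff, the tolerance, and the backoff steps (3 s, 9 s, 21 s cumulative) are illustrative, and any host passed in is a placeholder.

```python
import socket
import time


def time_connect(host, port, timeout=30.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def classify(elapsed, rto=3.0, tolerance=0.25):
    """Label one connect time.

    'failed'     -> connection error or timeout
    'retransmit' -> within `tolerance` of a cumulative SYN backoff step
                    (rto, 3*rto, 7*rto), assuming the timeout doubles each retry
    'slow'       -> over 100 ms but not near a backoff step
    'ok'         -> everything else
    """
    if elapsed is None:
        return "failed"
    for step in (rto, 3 * rto, 7 * rto):
        if abs(elapsed - step) <= tolerance:
            return "retransmit"
    return "slow" if elapsed > 0.1 else "ok"


def probe(host, port=80, attempts=50, pause=0.2):
    """Open `attempts` connections to host:port and count each outcome."""
    counts = {"ok": 0, "slow": 0, "retransmit": 0, "failed": 0}
    for _ in range(attempts):
        counts[classify(time_connect(host, port))] += 1
        time.sleep(pause)
    return counts
```

Running `probe(...)` from each server against the others, and against its own address, would show whether `retransmit` results appear only on connections that cross the switch. If they do, that would suggest packet loss on the network path rather than a problem in Apache, though it would not say which device is dropping packets.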
© Server Fault or respective owner