How to find the reason for a weekly downtime on an Ubuntu web server hosted by AWS?

Posted by IceSheep on Server Fault
Published on 2012-09-03T14:20:28Z Indexed on 2012/11/25 11:07 UTC

Filed under: linux | ubuntu | webserver | amazon-web-services | network-monitoring

We started monitoring our web server with Pingdom and found that it is down for a few minutes every Sunday at 0:00 UTC.

The test runs every minute and checks if a successful HTTP response (code 200) is returned on port 80. The test fails due to a timeout (no response after 30 seconds).
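To narrow this down, the same check could be run from the server itself at a shorter interval. The following is only a minimal sketch: the function names `classify_probe` and `probe_once`, the URL `http://localhost/`, the log path and the 10-second interval are all assumptions for illustration, not part of our setup. It mirrors the Pingdom criteria (HTTP 200, 30-second timeout) using curl.

```shell
#!/bin/sh
# Hypothetical local probe mirroring the Pingdom check:
# success means HTTP 200 within 30 seconds.

# classify_probe CURL_EXIT HTTP_CODE -> prints OK, TIMEOUT or FAIL
classify_probe() {
    curl_exit=$1
    http_code=$2
    if [ "$curl_exit" -eq 28 ]; then
        # curl exit code 28 means the operation timed out
        echo "TIMEOUT"
    elif [ "$curl_exit" -eq 0 ] && [ "$http_code" = "200" ]; then
        echo "OK"
    else
        echo "FAIL"
    fi
}

# probe_once URL -> prints one timestamped (UTC) log line
probe_once() {
    url=$1
    # -m 30 caps the whole request at 30 seconds; -w prints the status code
    http_code=$(curl -s -o /dev/null -m 30 -w '%{http_code}' "$url")
    curl_exit=$?
    printf '%s %s curl_exit=%s http=%s\n' \
        "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
        "$(classify_probe "$curl_exit" "$http_code")" \
        "$curl_exit" "$http_code"
}

# Example (placeholder URL, log path and interval):
#   while :; do probe_once http://localhost/ >> /tmp/probe.log; sleep 10; done
```

If the local probe keeps reporting OK while Pingdom times out, that would point toward the network path rather than Apache itself. If both fail, the cause is more likely on the instance.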

Here's what we've already checked – without success:

- Since the web server runs behind a load balancer, I pointed one Pingdom test at the load balancer's public DNS name and another at the web server's public DNS name to find out whether the AWS load balancer is the problem. Both tests return the same result.
- We set up Munin on the web server. Everything looked fine, even after the failure. Since the last failure lasted only 2 minutes, I suppose Munin could not capture a potential problem (it only samples every 5 minutes).
- I have checked /var/log/apache2/error.log and /var/log/syslog for suspicious entries.
- I have checked /etc/cron.weekly and /etc/crontab for suspicious entries.
- I have searched for files created or last modified between 0:00 and 0:15 using this method:

touch -t 201209020000 start
touch -t 201209020015 end
find / -newer start -and ! -newer end

(nothing found)
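The cron check above covered only /etc/cron.weekly and /etc/crontab. Jobs can also live in /etc/cron.d and in per-user crontabs (under /var/spool/cron/crontabs on Ubuntu, readable as root). A rough sketch of a wider scan follows. The function names are hypothetical, and the filter is deliberately simple: it matches only a literal `0 0` in the minute and hour fields with Sunday in the day-of-week field, or `@weekly`. It ignores ranges, lists and step values, and it assumes the server clock is set to UTC.

```shell
#!/bin/sh
# fires_sunday_midnight LINE -> exit status 0 if the crontab line is
# scheduled for Sunday 00:00 (rough filter, see assumptions above).
fires_sunday_midnight() {
    # Split the line into fields; globbing is switched off so that a
    # literal * in the schedule is not expanded to file names.
    # Note: this leaves globbing enabled afterwards.
    set -f
    set -- $1
    set +f
    case "$1" in
        @weekly) return 0 ;;      # cron shorthand for "0 0 * * 0"
        \#*|"")  return 1 ;;      # comment or empty line
    esac
    [ "$1" = "0" ] && [ "$2" = "0" ] || return 1
    case "$5" in
        0|7|[Ss][Uu][Nn]) return 0 ;;   # Sunday as 0, 7 or "sun"
        *) return 1 ;;
    esac
}

# scan_cron_sources -> prints "file: line" for every matching entry
scan_cron_sources() {
    for f in /etc/crontab /etc/cron.d/* /var/spool/cron/crontabs/*; do
        [ -f "$f" ] || continue
        while IFS= read -r line; do
            if fires_sunday_midnight "$line"; then
                printf '%s: %s\n' "$f" "$line"
            fi
        done < "$f"
    done
}

# Example: run as root so the per-user crontabs are readable
#   scan_cron_sources
```

An empty result would not rule out cron entirely, since the filter misses schedules such as `*/5` or `0 0 * * 0,3`, but any hit would be worth a closer look.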

Has anybody experienced a similar problem? Any suggestions on how to track down the cause of this behavior?

The server runs Ubuntu 10.04 LTS on an AWS m1.large instance.

Thanks!
