Forensic Analysis of the OOM-Killer

Posted by Oddthinking on Server Fault See other posts from Server Fault or by Oddthinking
Published on 2010-04-22T02:45:24Z Indexed on 2010/04/22 2:53 UTC
Read the original article Hit count: 441

Filed under:
|
|
|

Ubuntu's Out-Of-Memory Killer wreaked havoc on my server, quietly assassinating my applications, sendmail, apache and others.

I've managed to learn what the OOM Killer is, and about its "badness" rules. While my machine is small, my applications are even smaller, and typically only half of my physical memory is in use, let alone swap-space, so I was surprised. I am trying to work out the culprit, but I don't know how to read the OOM-Killer logs.

Can anyone please point me to a tutorial on how to read the data in the logs (what are ve, free and gen?), or help me parse these logs?

Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 1, exc 2326 0 goal 2326 0...
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): task ebb0c6f0, thg d33a1b00, sig 1
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 1, exc 2326 0 red 61795 745
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 2, exc 122 0 goal 383 0...
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): task ebb0c6f0, thg d33a1b00, sig 1
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 2, exc 383 0 red 61795 745
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): task ebb0c6f0, thg d33a1b00, sig 2
Apr 20 20:03:27 EL135 kernel: OOM killed process watchdog (pid=14490, ve=13516) exited, free=43104 gen=24501.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=4457, ve=13516) exited, free=43104 gen=24502.
Apr 20 20:03:27 EL135 kernel: OOM killed process ntpd (pid=10816, ve=13516) exited, free=43104 gen=24503.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=27401, ve=13516) exited, free=43104 gen=24504.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=29009, ve=13516) exited, free=43104 gen=24505.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=10557, ve=13516) exited, free=49552 gen=24506.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=24983, ve=13516) exited, free=53117 gen=24507.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=29129, ve=13516) exited, free=68493 gen=24508.
Apr 20 20:03:27 EL135 kernel: OOM killed process sendmail-mta (pid=941, ve=13516) exited, free=68803 gen=24509.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=12418, ve=13516) exited, free=69330 gen=24510.
Apr 20 20:03:27 EL135 kernel: OOM killed process python (pid=22953, ve=13516) exited, free=72275 gen=24511.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=6624, ve=13516) exited, free=76398 gen=24512.
Apr 20 20:03:27 EL135 kernel: OOM killed process python (pid=23317, ve=13516) exited, free=94285 gen=24513.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=29030, ve=13516) exited, free=95339 gen=24514.
Apr 20 20:03:28 EL135 kernel: OOM killed process apache2 (pid=20583, ve=13516) exited, free=101663 gen=24515.
Apr 20 20:03:28 EL135 kernel: OOM killed process logger (pid=12894, ve=13516) exited, free=101694 gen=24516.
Apr 20 20:03:28 EL135 kernel: OOM killed process bash (pid=21119, ve=13516) exited, free=101849 gen=24517.
Apr 20 20:03:28 EL135 kernel: OOM killed process atd (pid=991, ve=13516) exited, free=101880 gen=24518.
Apr 20 20:03:28 EL135 kernel: OOM killed process apache2 (pid=14649, ve=13516) exited, free=102748 gen=24519.
Apr 20 20:03:28 EL135 kernel: OOM killed process grep (pid=21375, ve=13516) exited, free=132167 gen=24520.
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 4, exc 4215 0 goal 4826 0...
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): task ede29370, thg df98b880, sig 1
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 4, exc 4826 0 red 189481 331
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): task ede29370, thg df98b880, sig 2
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 5, exc 3564 0 goal 3564 0...
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): task c6c90110, thg cdb1a100, sig 1
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 5, exc 3564 0 red 189481 331
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): task c6c90110, thg cdb1a100, sig 2
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 6, exc 8071 0 goal 8071 0...
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): task d7294050, thg c03f42c0, sig 1
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 6, exc 8071 0 red 189481 331
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): task d7294050, thg c03f42c0, sig 2

Watchdog is a watchdog task, that was idle; nothing in the logs to suggest it had done anything for days. Its job is to restart one of the applications if it dies, so a bit ironic that it is the first to get killed.

Tail was monitoring a few logs files. Unlikely to be consuming memory madly.

The apache web-server only serves pages to a little old lady who only uses it to get to church on Sundays a couple of developers who were in bed asleep, and hadn't visited a page on the site for a few weeks. The only traffic it might have had is from the port-scanners; all the content is password-protected and not linked from anywhere, so no spiders are interested.

Python is running two separate custom applications. Nothing in the logs to suggest they weren't humming along as normal. One of them was a relatively recent implementation, which makes suspect #1. It doesn't have any data-structures of any significance, and normally uses only about 8% of the total physical RAW. It hasn't misbehaved since.

The grep is suspect #2, and the one I want to be guilty, because it was a once-off command. The command (which piped the output of a grep -r to another grep) had been started at least 30 minutes earlier, and the fact it was still running is suspicious. However, I wouldn't have thought grep would ever use a significant amount of memory. It took a while for the OOM killer to get to it, which suggests it wasn't going mad, but the OOM killer stopped once it was killed, suggesting it may have been a memory-hog that finally satisfied the OOM killer's blood-lust.

© Server Fault or respective owner

Related posts about ubuntu

Related posts about linux