How to get more information from the system crash

Posted by viraptor on Server Fault
Published on 2011-01-05T18:03:37Z

I'd like to debug an issue I'm having with a Linux (Debian stable) server, but I'm running out of ideas for how to confirm any diagnosis.

Some background: the servers are DL160-class machines with hardware RAID across two disks. They run a lot of services, mostly utilising the network interface and CPU. There are 8 CPUs, and the 7 "main", most CPU-hungry processes are each bound to one core via CPU affinity. Other background scripts are not pinned anywhere. The filesystem writes ~1.5k blocks/s constantly (rising above 2k/s at peak times). Normal CPU usage on these servers is ~60% on 7 cores and minimal usage on the last one (usually whatever is running in shells).
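For reference, the pinning described above can be double-checked from /proc. This is only a minimal sketch: it assumes the kernel exposes a `Cpus_allowed_list` line in `/proc/<pid>/status`, and the helper names (`parse_cpu_list`, `allowed_cpus`, `read_allowed_cpus`) are made up for illustration.

```python
import re


def parse_cpu_list(spec):
    """Expand a kernel CPU list such as '0-2,5' into a set of CPU numbers."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(status_text):
    """Return the allowed CPUs from the text of /proc/<pid>/status, or None."""
    m = re.search(r"^Cpus_allowed_list:\s*(\S+)", status_text, re.MULTILINE)
    return parse_cpu_list(m.group(1)) if m else None


def read_allowed_cpus(pid):
    """Read /proc/<pid>/status for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return allowed_cpus(fh.read())
```

A process that is really bound to one core should yield a single-element set; anything larger means the affinity setting did not stick for that process.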

What actually happens is that the "main" services start using 100% CPU at some point, mostly in kernel time. After a couple of seconds the load average goes over 400 and we lose any way to connect to the box (a KVM is on its way, but not there yet). Sometimes, but not always, the kernel reports a hung task:

[118951.272884] INFO: task zsh:15911 blocked for more than 120 seconds.
[118951.272955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[118951.273037] zsh           D 0000000000000000     0 15911      1
[118951.273093]  ffff8101898c3c48 0000000000000046 0000000000000000 ffffffffa0155e0a
[118951.273183]  ffff8101a753a080 ffff81021f1c5570 ffff8101a753a308 000000051f0fd740
[118951.273274]  0000000000000246 0000000000000000 00000000ffffffbd 0000000000000001
[118951.273335] Call Trace:
[118951.273424]  [<ffffffffa0155e0a>] :ext3:__ext3_journal_dirty_metadata+0x1e/0x46
[118951.273510]  [<ffffffff804294f6>] schedule_timeout+0x1e/0xad
[118951.273563]  [<ffffffff8027577c>] __pagevec_free+0x21/0x2e
[118951.273613]  [<ffffffff80428b0b>] wait_for_common+0xcf/0x13a
[118951.273692]  [<ffffffff8022c168>] default_wake_function+0x0/0xe
....

This would point at a RAID or disk failure; however, sometimes the tasks are hung in the kernel's gettsc, which would indicate some more general hardware misbehaviour.
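To compare where tasks hang across incidents, the reports can be pulled out of saved kernel logs and grouped by the symbols in their call traces. The following is a rough sketch that assumes the log format shown in the excerpt above; `parse_hung_tasks` and `top_frames` are illustrative names, not existing tools.

```python
import re
from collections import Counter

# Matches e.g. "[118951.272884] INFO: task zsh:15911 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
# Matches e.g. "[<ffffffff804294f6>] schedule_timeout+0x1e/0xad",
# with an optional ":module:" prefix before the symbol.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(?P<module>\w+):)?(?P<symbol>[\w.]+)\+0x"
)


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report, with its call-trace symbols in order."""
    reports = []
    current = None
    for line in log_text.splitlines():
        m = HUNG_RE.search(line)
        if m:
            current = {
                "ts": float(m["ts"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
                "secs": int(m["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        f = FRAME_RE.search(line)
        if f and current is not None:
            current["frames"].append(f["symbol"])
    return reports


def top_frames(reports):
    """Count the first printed call-trace symbol of each report."""
    return Counter(r["frames"][0] for r in reports if r["frames"])
```

Feeding it the saved kernel log from each incident gives a quick tally of whether the hangs cluster in ext3 and journal code or are spread across unrelated paths. The count is only a rough grouping aid, since it looks at the first printed frame and nothing else.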

The box also runs MySQL (almost read-only, 99% cache hit rate), which seems to spawn a lot more threads while the problem is occurring. During the day it handles ~200k q/s (selects) and ~10 q/s (writes).

The host never runs out of memory or swaps, and there are no OOM reports.

We've got many boxes with similar or identical hardware and they all seem to behave this way, but I'm not sure which part fails, so it's probably not a good idea to just grab something more powerful and hope the problem goes away.

The applications themselves don't report anything wrong while they're running. I can safely run anything on the same hardware in an isolated environment. What can I do to narrow down the problem? Where else should I look for an explanation?

© Server Fault or respective owner
