How to get more information from the system crash

Posted by viraptor on Server Fault
Published on 2011-01-05T18:03:37Z

Filed under: server-crashes

I'd like to debug an issue I'm having with a Linux (Debian stable) server, but I'm running out of ideas for how to confirm any diagnosis.

Some background: the servers are DL160-class machines with hardware RAID across two disks. They run a lot of services, mostly utilising the network interface and CPU. There are 8 CPUs, and the 7 "main", most CPU-hungry processes are each bound to one core via CPU affinity. Other background scripts are not pinned anywhere. The filesystem writes ~1.5k blocks/s the whole time (rising above 2k/s at peak times). Normal CPU usage on these servers is ~60% on 7 cores and minimal usage on the last one (usually whatever is running in shells).
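
For readers unfamiliar with the pinning described above, here is a minimal sketch of that layout, assuming Linux and Python's `os.sched_setaffinity`. The PIDs and the helper names `plan_affinity` and `apply_affinity` are hypothetical and only for illustration; the post does not say how the affinity is actually set on these servers.

```python
import os


def plan_affinity(pids, n_cpus):
    """Map each PID to its own core, leaving the last core unassigned.

    Hypothetical helper: mirrors the "7 processes on 8 CPUs" layout.
    """
    usable = n_cpus - 1  # keep the last core free for shells and scripts
    if len(pids) > usable:
        raise ValueError("more processes than dedicated cores")
    return {pid: core for core, pid in enumerate(pids)}


def apply_affinity(plan):
    """Pin each PID to its planned core (Linux only, needs permission)."""
    for pid, core in plan.items():
        os.sched_setaffinity(pid, {core})


if __name__ == "__main__":
    # Made-up PIDs standing in for the 7 "main" processes.
    main_pids = [101, 102, 103, 104, 105, 106, 107]
    print(plan_affinity(main_pids, 8))
```

The plan is computed separately from the call that applies it, so the mapping can be inspected before any process is actually pinned.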

What actually happens is that the "main" services start using 100% CPU at some point, mostly in kernel time. After a couple of seconds, the load average goes over 400 and we lose any way to connect to the box (KVM is on its way, but not there yet). Sometimes, but not always, we see the kernel reporting a hung task:

[118951.272884] INFO: task zsh:15911 blocked for more than 120 seconds.
[118951.272955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[118951.273037] zsh           D 0000000000000000     0 15911      1
[118951.273093]  ffff8101898c3c48 0000000000000046 0000000000000000 ffffffffa0155e0a
[118951.273183]  ffff8101a753a080 ffff81021f1c5570 ffff8101a753a308 000000051f0fd740
[118951.273274]  0000000000000246 0000000000000000 00000000ffffffbd 0000000000000001
[118951.273335] Call Trace:
[118951.273424]  [<ffffffffa0155e0a>] :ext3:__ext3_journal_dirty_metadata+0x1e/0x46
[118951.273510]  [<ffffffff804294f6>] schedule_timeout+0x1e/0xad
[118951.273563]  [<ffffffff8027577c>] __pagevec_free+0x21/0x2e
[118951.273613]  [<ffffffff80428b0b>] wait_for_common+0xcf/0x13a
[118951.273692]  [<ffffffff8022c168>] default_wake_function+0x0/0xe
....
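
Since these reports only show up some of the time, it may help to collect them from saved kernel logs across all affected boxes. The following is a minimal sketch that extracts the header line of each hung-task report, assuming the log keeps the exact message format shown above; the function name `find_hung_tasks` is made up for this example.

```python
import re

# Matches the header line of a hung-task report as it appears in the trace above.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return one dict per hung-task header found in the given log text."""
    reports = []
    for line in log_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            reports.append({
                "uptime": float(match.group("ts")),  # seconds since boot
                "comm": match.group("comm"),         # task name
                "pid": int(match.group("pid")),
                "blocked_secs": int(match.group("secs")),
            })
    return reports


if __name__ == "__main__":
    sample = (
        "[118951.272884] INFO: task zsh:15911 blocked for more than 120 seconds.\n"
        "[118951.273335] Call Trace:\n"
    )
    for report in find_hung_tasks(sample):
        print(report)
```

Grouping the results by task name and uptime across machines could show whether the hangs cluster around particular processes or times.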

This would point at a RAID / disk failure; however, sometimes the tasks are hung in the kernel's gettsc, which would indicate some more general weird hardware behaviour.

The box is also running MySQL (almost read-only, 99% cache hit), which seems to spawn a lot more threads during the system problems. During the day it does ~200kq/s (selects) and ~10q/s (writes).

The host never runs out of memory or swaps, and no OOM reports have been spotted.
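
One way to back up the memory and load observations with data from the moments just before a box becomes unreachable is to snapshot `/proc/loadavg` and `/proc/meminfo` periodically. This is only a sketch under that assumption: the helper names are hypothetical, and how often to sample or where to store the results is left open.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def parse_meminfo(text, keys=("MemFree", "SwapTotal", "SwapFree")):
    """Return selected /proc/meminfo fields, in kB, as a dict."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            values[name] = int(rest.split()[0])
    return values


def snapshot():
    """Read both files once (Linux only) and return the parsed values."""
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    return load, mem


if __name__ == "__main__":
    print(snapshot())
```

A falling `SwapFree` or a steadily rising load average in the last few samples before a hang would show up in such a record even if nobody was logged in at the time.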

We've got many boxes with similar or identical hardware and they all seem to behave this way, but I'm not sure which part is failing, so it's probably not a good idea to just grab something more powerful and hope the problem goes away.

The applications themselves don't report anything wrong while they're running. I can run anything safely on the same hardware in an isolated environment. What can I do to narrow down the problem? Where else should I look for an explanation?
