Can't sync filesystem without reboot

Posted by Fabio on Server Fault
Published on 2013-07-29T16:38:54Z Indexed on 2013/08/02 15:42 UTC

I'm having an issue with a Linux server. About once a week the running MySQL instance hangs and there is no way to stop it fully. If I kill it, it remains in zombie status and init does not reap its PID.

The server is used for staging deployments and some internal tools, so it's not under heavy load. MySQL is the only process in constant use, which is why I think it's the only one that suffers from this issue.

I've searched the system logs for errors, and the only thing I found is this message (repeated a couple of times) in the dmesg output:

[706560.640085] INFO: task mysqld:31965 blocked for more than 120 seconds.
[706560.640198] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[706560.640312] mysqld          D ffff88032fd93f40     0 31965      1 0x00000000
[706560.640317]  ffff880242a27d18 0000000000000086 ffff88031a50dd00 ffff880242a27fd8
[706560.640321]  ffff880242a27fd8 ffff880242a27fd8 ffff88031e549740 ffff88031a50dd00
[706560.640325]  ffff88031a50dd00 ffff88032fd947f8 0000000000000002 ffffffff8112f250
[706560.640328] Call Trace:
[706560.640338]  [<ffffffff8112f250>] ? __lock_page+0x70/0x70
[706560.640344]  [<ffffffff816cb1b9>] schedule+0x29/0x70
[706560.640347]  [<ffffffff816cb28f>] io_schedule+0x8f/0xd0
[706560.640350]  [<ffffffff8112f25e>] sleep_on_page+0xe/0x20
[706560.640353]  [<ffffffff816c9900>] __wait_on_bit+0x60/0x90
[706560.640356]  [<ffffffff8112f390>] wait_on_page_bit+0x80/0x90
[706560.640360]  [<ffffffff8107dce0>] ? autoremove_wake_function+0x40/0x40
[706560.640363]  [<ffffffff8112f891>] filemap_fdatawait_range+0x101/0x190
[706560.640366]  [<ffffffff81130975>] filemap_write_and_wait_range+0x65/0x70
[706560.640371]  [<ffffffff8122e441>] ext4_sync_file+0x71/0x320
[706560.640376]  [<ffffffff811c3e6d>] do_fsync+0x5d/0x90
[706560.640379]  [<ffffffff811c40d0>] sys_fsync+0x10/0x20
[706560.640383]  [<ffffffff816d495d>] system_call_fastpath+0x1a/0x1f
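
To see how often this warning fires and for which tasks, the kernel log can be reduced to one line per event. The following is a minimal sketch rather than a definitive tool: hung_tasks is a hypothetical helper name, and it assumes the message format shown in the trace above.

```shell
#!/bin/sh
# Sketch: summarise hung-task warnings from kernel log text read on stdin.
# hung_tasks is a hypothetical helper name, not an existing tool.
hung_tasks() {
    # Match lines such as:
    #   INFO: task mysqld:31965 blocked for more than 120 seconds.
    # and print "<command> <pid> <seconds>" for each of them.
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}

# Typical use (reading the kernel log may require root):
#   dmesg | hung_tasks | sort | uniq -c
```

Piping the result through sort | uniq -c gives a count per task and PID. Where SysRq is enabled, writing w to /proc/sysrq-trigger asks the kernel to log the stacks of all currently blocked tasks, which can help catch the state before the 120 second warning appears.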

When this happens, the only way to get everything working again is a full reboot, but to do that I have to run this command after manually stopping all running processes:

echo b > /proc/sysrq-trigger

otherwise the normal reboot process hangs forever. I traced the reboot scripts and found that the reboot also hangs on a sync call, this one in /etc/init.d/sendsigs (I'm on Ubuntu):

# Flush the kernel I/O buffer before we start to kill
# processes, to make sure the IO of already stopped services to
# not slow down the remaining processes to a point where they
# are accidentily killed with SIGKILL because they did not
# manage to shut down in time.
sync

I'm almost sure the cause is a hardware issue (the RAID controller?), partly because I have two other machines with the same hardware and software configuration that don't suffer from this, but I can't find any hint in syslog or dmesg. I've also installed the smartmontools and mcelog packages, but neither has reported any issue.

What can I do to track the cause of this issue?
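
One thing that could be watched while a hang is in progress is whether dirty pages are still being written out. This is only a sketch under assumptions: writeback_kb is a hypothetical helper name, and a Writeback value that never drops is a hint that I/O is stuck below the filesystem, not proof of a hardware fault.

```shell
#!/bin/sh
# Sketch: print the Dirty and Writeback counters (in kB) from
# /proc/meminfo-style text read on stdin.
# writeback_kb is a hypothetical helper name.
writeback_kb() {
    # Exact match on the first field, so WritebackTmp is not included.
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2 }'
}

# Typical use, sampling every 5 seconds until interrupted:
#   while sleep 5; do date; writeback_kb < /proc/meminfo; done
```

Logging these samples to a file on a different device (or over SSH to another host) avoids depending on the filesystem that may be blocked.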

It happened again today. Here is the state of the system after triggering a reboot:

init---console-kit-dae---64*[{console-kit-dae}]
     +-dbus-daemon
     +-mcelog
     +-mysqld---{mysqld}
     +-newrelic-daemon---newrelic-daemon---11*[{newrelic-daemon}]
     +-ntpd
     +-polkitd---{polkitd}
     +-python3
     +-rpc.idmapd
     +-rpc.statd
     +-rpcbind
     +-sh---rc---S20sendsigs---sync
     +-smartd
     +-snmpd
     +-sshd---sshd---zsh---sudo---zsh---pstree
     +-sshd---sshd---zsh---sudo---zsh

And here is the status of the sync process:

# ps aux | grep sync
root      3637  0.1  0.0   4352   372 ?        D    05:53   0:00 sync

i.e. uninterruptible sleep (state D).
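
Other processes stuck in the same state can be listed by filtering ps output on the STAT column. A minimal sketch, assuming the procps version of ps; d_state is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: keep the header row plus every row whose second column (STAT)
# starts with D, i.e. processes in uninterruptible sleep.
# d_state is a hypothetical helper name; input is read from stdin.
d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical use; wchan shows the kernel function the process sleeps in,
# when the kernel exposes it:
#   ps -eo pid,stat,wchan:32,args | d_state
```

On kernels that expose it, /proc/<pid>/stack (readable as root) shows the kernel stack of such a process, which may be more detailed than the wchan column.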

Hardware specs as reported by lshw

I think the RAID controller is a fake RAID. I don't usually deal with hardware (and for the record, I don't have physical access to this machine).

description: Computer
product: X7DBP ()
vendor: Supermicro
version: 0123456789
serial: 0123456789
width: 64 bits
capabilities: smbios-2.4 dmi-2.4 vsyscall32
configuration: administrator_password=disabled boot=normal frontpanel_password=unknown keyboard_password=unknown power-on_password=disabled uuid=53D19F64-D663-A017-8922-0030487C1FEE
*-core
   description: Motherboard
   product: X7DBP
   vendor: Supermicro
   physical id: 0
   version: PCB Version
   serial: 0123456789
 *-firmware
      description: BIOS
      vendor: Phoenix Technologies LTD
      physical id: 0
      version: 6.00
      date: 05/29/2007
      size: 106KiB
      capacity: 960KiB
      capabilities: pci pnp upgrade shadowing escd cdboot bootselect edd int13floppy2880 acpi usb ls120boot zipboot biosbootspecification

  *-storage
         description: RAID bus controller
         product: 631xESB/632xESB SATA RAID Controller
         vendor: Intel Corporation
         physical id: 1f.2
         bus info: pci@0000:00:1f.2
         version: 09
         width: 32 bits
         clock: 66MHz
         capabilities: storage pm bus_master cap_list
         configuration: driver=ahci latency=0
         resources: irq:19 ioport:18a0(size=8) ioport:1874(size=4) ioport:1878(size=8) ioport:1870(size=4) ioport:1880(size=32) memory:d8500400-d85007ff

© Server Fault or respective owner
