Strange Recurrent Excessive I/O Wait

Posted by Chris on Server Fault
Published on 2013-10-12T19:26:29Z Indexed on 2013/10/20 21:57 UTC

I know that I/O wait has been discussed many times on this site, but the other topics seem to cover constant I/O latency. The I/O problem on our server is different: it occurs at irregular, short intervals, yet it is ever-present, with massive spikes of up to 20,000 ms await and service times of 2 seconds. The affected disk is /dev/sdb (Seagate Barracuda, for details see below).

At times, the iostat -x output looks like this; the sample is extreme but by no means rare:

iostat (Oct 6, 2013)
  tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
 0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
 0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
16.00      0.00    156.00      9.75     21.89    288.12     36.00     57.60
 5.50      0.00     44.00      8.00     48.79   2194.18    181.82    100.00
 2.00      0.00     16.00      8.00     46.49   3397.00    500.00    100.00
 4.50      0.00     40.00      8.89     43.73   5581.78    222.22    100.00
14.50      0.00    148.00     10.21     13.76   5909.24     68.97    100.00
 1.50      0.00     12.00      8.00      8.57   7150.67    666.67    100.00
 0.50      0.00      4.00      8.00      6.31  10168.00   2000.00    100.00
 2.00      0.00     16.00      8.00      5.27  11001.00    500.00    100.00
 0.50      0.00      4.00      8.00      2.96  17080.00   2000.00    100.00
134.00     0.00   1324.00      9.88      1.32    137.84      4.45     59.60
 0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22.00     44.00    204.00     11.27      0.01      0.27      0.27      0.60
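
A side note on the svctm column: as far as I understand, sysstat does not measure it directly but derives it from %util and the request rate, roughly svctm = (%util / 100 * 1000 ms) / tps. If that holds, the 2000 ms values mostly reflect that only 0.5 requests per second completed while the device was 100% busy. The following Python sketch checks this assumption against some rows of the sample above; the function name is made up for illustration.

```python
def derived_svctm_ms(tps, util_percent):
    """Estimate svctm (ms): busy time per second divided by requests completed per second."""
    if tps == 0:
        return 0.0
    busy_ms_per_second = util_percent / 100.0 * 1000.0
    return busy_ms_per_second / tps


# (tps, %util, reported svctm) copied from rows of the Oct 6 sample
rows = [
    (16.00, 57.60, 36.00),
    (5.50, 100.00, 181.82),
    (2.00, 100.00, 500.00),
    (14.50, 100.00, 68.97),
    (1.50, 100.00, 666.67),
    (0.50, 100.00, 2000.00),
    (22.00, 0.60, 0.27),
]

for tps, util, reported in rows:
    derived = derived_svctm_ms(tps, util)
    print(f"tps={tps:6.2f} util={util:6.2f} reported={reported:8.2f} derived={derived:8.2f}")
```

For these rows the derived and reported values agree to within rounding, which suggests that await and avgqu-sz are the more telling columns here.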

Some more information about the hardware: it is a Dell 1950 III box running Debian, where uname -a reports the following:

Linux xx 2.6.32-5-amd64 #1 SMP Fri Feb 15 15:39:52 UTC 2013 x86_64 GNU/Linux

The machine is a dedicated server that hosts an online game without any databases or I/O-heavy applications running. The core application consumes about 0.8 of the 8 GB of RAM, and the average CPU load is relatively low. The game itself, however, is rather sensitive to I/O latency, and thus our players experience massive in-game lag, which we would like to address as soon as possible.

iostat:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.77    0.01    1.05    1.59    0.00   95.58

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdb              13.16        25.42       135.12  504701011 2682640656
sda               1.52         0.74        20.63   14644533  409684488

Uptime is:

19:26:26 up 229 days, 17:26,  4 users,  load average: 0.36, 0.37, 0.32

Harddisk controller:

01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

Harddisks:

Array 1, RAID-1, 2x Seagate Cheetah 15K.5 73 GB SAS
Array 2, RAID-1, 2x Seagate ST3500620SS Barracuda ES.2 500GB 16MB 7200RPM SAS

Partition information from df:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1            480191156  30715200 425083668   7% /home
/dev/sda2              7692908    437436   6864692   6% /
/dev/sda5             15377820   1398916  13197748  10% /usr
/dev/sda6             39159724  19158340  18012140  52% /var

Some more data samples generated with iostat -dx sdb 1 (Oct 11, 2013)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00    15.00    0.00   70.00     0.00   656.00     9.37     4.50    1.83   4.80  33.60
sdb               0.00     0.00    0.00    2.00     0.00    16.00     8.00    12.00  836.00 500.00 100.00
sdb               0.00     0.00    0.00    3.00     0.00    32.00    10.67     9.96 1990.67 333.33 100.00
sdb               0.00     0.00    0.00    4.00     0.00    40.00    10.00     6.96 3075.00 250.00 100.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     4.00    0.00   0.00 100.00
sdb               0.00     0.00    0.00    2.00     0.00    16.00     8.00     2.62 4648.00 500.00 100.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00   0.00 100.00
sdb               0.00     0.00    0.00    1.00     0.00    16.00    16.00     1.69 7024.00 1000.00 100.00
sdb               0.00    74.00    0.00  124.00     0.00  1584.00    12.77     1.09   67.94   6.94  86.00
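
To pick out the stalled seconds from a longer capture of this output, something like the following Python sketch could be used. It assumes the 12-column layout shown above (other sysstat versions print different columns), and the 500 ms limit as well as all names are arbitrary choices for illustration.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    device: str
    completed_per_s: float  # r/s + w/s
    queue_len: float        # avgqu-sz
    await_ms: float
    util: float             # %util


def parse_iostat_dx(text: str) -> List[Sample]:
    """Parse data rows of 'iostat -dx' in the 12-column layout shown above."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 12 or fields[0].endswith(":"):
            continue  # skip the header line, blank lines and anything unexpected
        v = [float(x) for x in fields[1:]]
        samples.append(Sample(fields[0], v[2] + v[3], v[7], v[8], v[10]))
    return samples


def is_stalled(s: Sample, await_limit_ms: float = 500.0) -> bool:
    """Device saturated and either very slow or completing nothing with requests queued."""
    if s.util < 99.0:
        return False
    nothing_completed = s.completed_per_s == 0 and s.queue_len > 0
    return s.await_ms >= await_limit_ms or nothing_completed


SAMPLE = """\
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00    15.00    0.00   70.00     0.00   656.00     9.37     4.50    1.83   4.80  33.60
sdb               0.00     0.00    0.00    2.00     0.00    16.00     8.00    12.00  836.00 500.00 100.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     4.00    0.00   0.00 100.00
sdb               0.00    74.00    0.00  124.00     0.00  1584.00    12.77     1.09   67.94   6.94  86.00
"""

for sample in parse_iostat_dx(SAMPLE):
    print(sample.device, sample.await_ms, "STALLED" if is_stalled(sample) else "ok")
```

Rows with zero completed requests but a non-empty queue are counted as stalls too, since await is reported as 0.00 for them and a threshold on await alone would miss them.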

Characteristic charts generated with rrdtool can be found here:

iostat plot 1, 24 min interval: http://imageshack.us/photo/my-images/600/yqm3.png/

iostat plot 2, 120 min interval: http://imageshack.us/photo/my-images/407/griw.png/

As we have a rather large cache of 5.5 GB, we thought it might be a good idea to test whether the I/O wait spikes are caused by cache misses. Therefore, we ran sync and then the following to flush the cache and buffers:

echo 3 > /proc/sys/vm/drop_caches

Directly afterwards, the I/O wait and service times virtually went through the roof, and everything on the machine felt like slow motion. Over the next few hours the latency recovered and everything was as before: small to medium lags at short, unpredictable intervals.
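
To see how much memory is page cache, buffers, or dirty data waiting for writeback before and after such a test, the counters in /proc/meminfo can be read. A rough Python sketch follows (Linux only; the helper names and the selection of fields are my own choice for illustration).

```python
import os


def parse_meminfo(text):
    """Return {field: value} for lines like 'Cached:  5767168 kB' (most values are in kB)."""
    result = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            result[name.strip()] = int(parts[0])
    return result


def cache_summary(path="/proc/meminfo"):
    """Read the file and keep only the fields related to caching and writeback."""
    with open(path) as handle:
        info = parse_meminfo(handle.read())
    wanted = ("MemTotal", "Buffers", "Cached", "Dirty", "Writeback")
    return {key: info.get(key) for key in wanted}


# Only runs where /proc/meminfo exists, i.e. on Linux.
if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    for key, value in cache_summary().items():
        print(f"{key:10s} {value} kB")
```

Logging these values once a second next to iostat would show whether the spikes line up with bursts of dirty data being written back.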

Now my question is: does anybody have any idea what might cause this annoying behaviour? Is it the first indication of the disk array or the RAID controller dying, or something that can easily be fixed by rebooting? (At the moment we are very reluctant to reboot, however, because we are afraid that the disks might not come back up again.)

Any help is greatly appreciated.

Thanks in advance, Chris.

Edited to add: we do see one or two processes go into 'D' state in top; one of them is quite often kjournald. If I'm not mistaken, however, this does not indicate the processes causing the latency, but rather those affected by it - correct me if I'm wrong. Does the information about uninterruptibly sleeping processes help us in any way to address the problem?
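
Rather than watching top, the processes in 'D' state could be logged by reading /proc/<pid>/stat, where the state letter follows the command name in parentheses. This is a minimal Python sketch under that assumption (Linux only, function names hypothetical); it only lists the sleepers and does not say whether they cause the latency or suffer from it.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    open_paren = line.index("(")
    close_paren = line.rindex(")")  # the command name itself may contain ')'
    pid = int(line[:open_paren])
    command = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, command, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, command) of all processes currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime or the file is unreadable
        if state == "D":
            found.append((pid, command))
    return found


# Only runs where a /proc directory exists, i.e. on Linux.
if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, command in uninterruptible_processes():
        print(pid, command)
```

Run in a loop with timestamps, the output could be compared against the iostat samples to see which processes block during each spike.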

@Andy Shinn requested smartctl data; here it is:

smartctl -a -d megaraid,2 /dev/sdb yields:

smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST3500620SS      Version: MS05
Serial number:
Device type: disk
Transport protocol: SAS
Local Time is: Mon Oct 14 20:37:13 2013 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     20 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 1236631092
  Blocks received from initiator = 1097862364
  Blocks read from cache and sent to initiator = 1383620256
  Number of read and write commands whose size <= segment size = 531295338
  Number of read and write commands whose size > segment size = 51986460
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 36556.93
  number of minutes until next internal SMART test = 32

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   509271032       47         0  509271079   509271079      20981.423           0
write:         0        0         0         0          0       5022.039           0
verify: 1870931090      196         0  1870931286   1870931286     100558.708           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16   36538                 - [-   -    -]
# 2  Background short  Completed                  16   36514                 - [-   -    -]
# 3  Background short  Completed                  16   36490                 - [-   -    -]
# 4  Background short  Completed                  16   36466                 - [-   -    -]
# 5  Background short  Completed                  16   36442                 - [-   -    -]
# 6  Background long   Completed                  16   36420                 - [-   -    -]
# 7  Background short  Completed                  16   36394                 - [-   -    -]
# 8  Background short  Completed                  16   36370                 - [-   -    -]
# 9  Background long   Completed                  16   36364                 - [-   -    -]
#10  Background short  Completed                  16   36361                 - [-   -    -]
#11  Background long   Completed                  16       2                 - [-   -    -]
#12  Background short  Completed                  16       0                 - [-   -    -]

Long (extended) Self Test duration: 6798 seconds [113.3 minutes]

smartctl -a -d megaraid,3 /dev/sdb yields:

smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST3500620SS      Version: MS05
Serial number:
Device type: disk
Transport protocol: SAS
Local Time is: Mon Oct 14 20:37:26 2013 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     19 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 288745640
  Blocks received from initiator = 1097848399
  Blocks read from cache and sent to initiator = 1304149705
  Number of read and write commands whose size <= segment size = 527414694
  Number of read and write commands whose size > segment size = 51986460
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 36596.83
  number of minutes until next internal SMART test = 28

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   610862490       44         0  610862534   610862534      20470.133           0
write:         0        0         0         0          0       5022.480           0
verify: 2861227413      203         0  2861227616   2861227616     100872.443           0

Non-medium error count:        1

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16   36580                 - [-   -    -]
# 2  Background short  Completed                  16   36556                 - [-   -    -]
# 3  Background short  Completed                  16   36532                 - [-   -    -]
# 4  Background short  Completed                  16   36508                 - [-   -    -]
# 5  Background short  Completed                  16   36484                 - [-   -    -]
# 6  Background long   Completed                  16   36462                 - [-   -    -]
# 7  Background short  Completed                  16   36436                 - [-   -    -]
# 8  Background short  Completed                  16   36412                 - [-   -    -]
# 9  Background long   Completed                  16   36404                 - [-   -    -]
#10  Background short  Completed                  16   36401                 - [-   -    -]
#11  Background long   Completed                  16       2                 - [-   -    -]
#12  Background short  Completed                  16       0                 - [-   -    -]

Long (extended) Self Test duration: 6798 seconds [113.3 minutes]
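
For comparing the two member disks side by side, a few fields can be pulled out of the smartctl text. The sketch below is Python with hypothetical names; the regular expressions assume the SAS output format shown above, and the excerpt is taken from the megaraid,3 output.

```python
import re

# Patterns for a few fields of the SAS-style smartctl output shown above.
FIELDS = {
    "grown_defects": r"Elements in grown defect list:\s*(\d+)",
    "non_medium_errors": r"Non-medium error count:\s*(\d+)",
    "hours_powered_up": r"number of hours powered up\s*=\s*([\d.]+)",
}


def extract_health_fields(smartctl_text):
    """Return the selected fields as floats, or None where a field is missing."""
    values = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, smartctl_text)
        values[name] = float(match.group(1)) if match else None
    return values


excerpt_disk_3 = """\
Elements in grown defect list: 0
  number of hours powered up = 36596.83
Non-medium error count:        1
"""

print(extract_health_fields(excerpt_disk_3))
```

Feeding the full output of both smartctl calls through the same function makes differences between the disks, such as the non-medium error count, easier to spot.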

© Server Fault or respective owner
