3Ware 9650SE RAID-6, two degraded drives, one ECC, rebuild stuck

Posted by cswingle on Server Fault
Published on 2012-06-20T01:10:03Z

This morning I came into the office to discover that two of the drives in a RAID-6 array on a 3ware 9650SE controller were marked as degraded and the controller was rebuilding the array. After reaching about 4%, the rebuild hit ECC errors on a third drive (this may have happened when I attempted to access the filesystem on this RAID and got I/O errors from the controller). Now I'm in this state:

> /c2/u1 show

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u1       RAID-6    REBUILDING     4%(A)   -       -     64K     7450.5    
u1-0     DISK      OK             -       -       p5    -       931.312   
u1-1     DISK      OK             -       -       p2    -       931.312   
u1-2     DISK      OK             -       -       p1    -       931.312   
u1-3     DISK      OK             -       -       p4    -       931.312   
u1-4     DISK      OK             -       -       p11   -       931.312   
u1-5     DISK      DEGRADED       -       -       p6    -       931.312   
u1-6     DISK      OK             -       -       p7    -       931.312   
u1-7     DISK      DEGRADED       -       -       p3    -       931.312   
u1-8     DISK      WARNING        -       -       p9    -       931.312   
u1-9     DISK      OK             -       -       p10   -       931.312   
u1/v0    Volume    -              -       -       -     -       7450.5    
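
For anyone who wants to watch this state from a script rather than by eye, here is a minimal Python sketch that picks the non-OK members out of the listing above. It assumes the column layout is exactly as pasted; the names `SHOW_OUTPUT` and `parse_members` are made up for illustration, and on a live system the text would have to be captured from the controller CLI rather than hard-coded. It also checks that the reported size matches a ten-member RAID-6, where two members' worth of space goes to parity.

```python
# Rows copied from the "/c2/u1 show" listing in the post (header lines omitted).
SHOW_OUTPUT = """\
u1       RAID-6    REBUILDING     4%(A)   -       -     64K     7450.5
u1-0     DISK      OK             -       -       p5    -       931.312
u1-1     DISK      OK             -       -       p2    -       931.312
u1-2     DISK      OK             -       -       p1    -       931.312
u1-3     DISK      OK             -       -       p4    -       931.312
u1-4     DISK      OK             -       -       p11   -       931.312
u1-5     DISK      DEGRADED       -       -       p6    -       931.312
u1-6     DISK      OK             -       -       p7    -       931.312
u1-7     DISK      DEGRADED       -       -       p3    -       931.312
u1-8     DISK      WARNING        -       -       p9    -       931.312
u1-9     DISK      OK             -       -       p10   -       931.312
u1/v0    Volume    -              -       -       -     -       7450.5
"""


def parse_members(text):
    """Return (subunit, status, port, size_gb) for every DISK row."""
    members = []
    for line in text.splitlines():
        fields = line.split()
        # DISK rows have eight whitespace-separated columns in this layout.
        if len(fields) == 8 and fields[1] == "DISK":
            members.append((fields[0], fields[2], fields[5], float(fields[7])))
    return members


members = parse_members(SHOW_OUTPUT)
not_ok = [(name, status, port) for name, status, port, _ in members
          if status != "OK"]
for name, status, port in not_ok:
    print(f"{name} on {port}: {status}")

# RAID-6 spends two members' worth of space on parity.
usable_gb = (len(members) - 2) * members[0][3]
print(f"{len(members)} members, expected usable size {usable_gb:.1f} GB")
```

With the listing above this flags u1-5 (p6), u1-7 (p3) and u1-8 (p9), and the computed usable size agrees with the 7450.5 GB the controller reports.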

The SMART data for the three drives in question shows that the two DEGRADED drives are in good shape (PASSED, with no Current_Pending_Sector or Offline_Uncorrectable errors), but the drive listed as WARNING has 24 uncorrectable sectors.

And, the "rebuild" has been stuck at 4% for ten hours now.

So:

How do I get it to start actually rebuilding? This particular controller doesn't appear to support /c2/u1 resume rebuild, and the only rebuild command that appears to be an option is one that wants to know what disk to add (/c2/u1 start rebuild disk=<p:-p...> [ignoreECC] according to the help). I have two hot spares in the server, and I'm happy to engage them, but I don't understand what it would do with that information in the current state it's in.

Can I pull out the drive that is demonstrably failing (the WARNING drive), when I have two DEGRADED drives in a RAID-6? It seems to me that the best scenario would be for me to pull the WARNING drive and tell it to use one of my hot spares in the rebuild. But won't I kill the thing by pulling a "good" drive in a RAID-6 with two DEGRADED drives?
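
The worry in that question comes down to simple counting, sketched below in Python. RAID-6 keeps two parity blocks per stripe, so a stripe can be reconstructed with at most two members missing. The sketch makes a worst-case assumption that is not confirmed anywhere in this post: that a DEGRADED member contributes no valid data at all. How the 9650SE actually treats such members (for example, whether the first 4% that was rebuilt counts as valid) is unknown here. The names `RAID6_TOLERANCE` and `remaining_margin` are made up for illustration.

```python
RAID6_TOLERANCE = 2  # two parity blocks per stripe


def remaining_margin(statuses, pulled=(), unavailable_states=("DEGRADED",)):
    """Number of further member losses the array could absorb.

    Worst-case assumption: members in unavailable_states supply no data.
    A negative result means some stripes could not be reconstructed.
    """
    unavailable = {name for name, status in statuses.items()
                   if status in unavailable_states}
    unavailable |= set(pulled)
    return RAID6_TOLERANCE - len(unavailable)


# Member states as shown in the listing earlier in the post.
statuses = {
    "u1-0": "OK", "u1-1": "OK", "u1-2": "OK", "u1-3": "OK", "u1-4": "OK",
    "u1-5": "DEGRADED", "u1-6": "OK", "u1-7": "DEGRADED",
    "u1-8": "WARNING", "u1-9": "OK",
}

print("margin now:", remaining_margin(statuses))
print("margin after pulling u1-8:", remaining_margin(statuses, pulled=["u1-8"]))
```

Under that assumption the margin is already zero, and pulling the WARNING drive takes it to -1, which is exactly the concern raised above. If the controller can still read usable data from the DEGRADED members, the real picture may be less bleak than this count suggests.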

Finally, I've seen references in other posts to a bad bug in this controller that causes good drives to be marked as bad, and suggestions that upgrading the firmware may help. Is flashing the firmware a risky operation given the situation? Is it likely to help or hurt the RAID that is rebuilding but stuck at 4%? Am I experiencing this bug in action?

Advice outside the spiritual would be much appreciated. Thanks.

© Server Fault or respective owner