Dell PowerEdge R720 - Corrupted RAID

Posted by BT643 on Server Fault
Published on 2014-05-29T12:13:10Z Indexed on 2014/06/03 15:31 UTC

Apologies in advance for the lengthy question.

We have a Dell PowerEdge R720 server with:

  • 2 x 136GB SAS drives in RAID 1 for the OS (Ubuntu Server 12.04)
  • 6 x 3TB SATA drives in RAID 5 for data

A few days ago we started getting errors when trying to access files on the large RAID 5 partition. We rebooted the server and got a message saying that the RAID controller had found a foreign config. We've had this before, and just needed to use Dell's RAID configuration utility to import the foreign config. Last time this worked, but this time it started doing a disk check and then we got this:

FSCK has returned the following:

"/dev/sdb1 inode 364738 has a bad extended attribute block 7

/dev/sdb1 unexpected inconsistency run fsck manually (i.e without -a or -p options) 

MOUNTALL fsck /ourdatapartition [1019] terminated with status 4

MOUNTALL filesystem has errors /ourdatapartition

errors where found while checking the disk drive for /ourdatapartition

Press F to fix errors, I to Ignore or M for Manual Recovery"

We pressed F to try and fix the errors, but it eventually errored with:

Inode 275841084, i_blocks is 167080, should be 0. Fix? yes

Inode 275841141 has an invalid extend node (blk 2206761006, lblk 0)
Clear? yes

Inode 275841141, i_blocks is 227872, should be 0. Fix? yes

Inode 275842303 has an invalid extend node (blk 2206760975, lblk 0)
Clear? yes

....


Error storing directory block information (inode=275906766, block=0, num=2699516178):         Memory allocation failed

/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****
e2fsck: aborted

/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****
mountall: fsck /ourdatapartition [1286] terminated with status 9
mountall: Unrecoverable fsck error: /ourdatapartition
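For reference, the two exit statuses in the output above (4 and 9) can be read as a bitmask, as described in the fsck(8) man page. The snippet below is only a minimal sketch for decoding them; the name `decode_fsck_status` is hypothetical and not part of any tool.

```python
# Exit-status bits as listed in the fsck(8) man page.
FSCK_STATUS_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Return the meanings of every bit set in an fsck exit status.

    Hypothetical helper for illustration only.
    """
    if status == 0:
        return ["no errors"]
    return [
        meaning
        for bit, meaning in sorted(FSCK_STATUS_BITS.items())
        if status & bit
    ]


if __name__ == "__main__":
    # The two statuses reported by mountall in the post.
    for status in (4, 9):
        print(status, decode_fsck_status(status))
```

Read this way, status 4 means errors were left uncorrected, and status 9 (8 + 1) means some errors were corrected before fsck hit an operational error, which matches the "Memory allocation failed" abort.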

We noticed one of the drive lights was not lit at all, and thought that drive may have failed and be the problem. We replaced the drive with a spare and tried "F" to repair it again, but we just keep getting the same error as above.

In the RAID configuration utility, all drives show as "online" and "optimal".

We do have this data on another replicated server, so we're not worried about "recovering" anything; we just want to get the system back online ASAP.

The server has either 64GB or 32GB of memory (I can't remember off the top of my head), but either way, with a 14TB RAID, I think it may still not be enough.
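As a rough check on the 14TB figure: RAID 5 gives up one drive's worth of space to parity, so six 3TB drives leave five drives of usable capacity. The sketch below assumes each drive is exactly 3 x 10^12 bytes (the decimal "3TB" on the label), which is an assumption rather than a measured value; `raid5_usable_bytes` is a hypothetical helper.

```python
TB = 10**12   # decimal terabyte, as used on drive labels
TIB = 2**40   # binary tebibyte, as reported by many tools


def raid5_usable_bytes(num_drives, drive_bytes):
    """Usable capacity of a RAID 5 array: one drive's worth goes to parity."""
    if num_drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (num_drives - 1) * drive_bytes


if __name__ == "__main__":
    # Assumed drive size: exactly 3 * 10**12 bytes per drive.
    usable = raid5_usable_bytes(6, 3 * TB)
    print(usable / TB)              # capacity in decimal TB
    print(round(usable / TIB, 2))   # capacity in binary TiB
```

Under that assumption the array is 15 TB in decimal units, or about 13.64 TiB, which is roughly where a "14TB" reading would come from.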

Thanks

EDIT - As suggested, I checked the memory usage while fsck was running. After 2 or 3 minutes it looked like this, using up nearly all of our server's memory:

[Image: During FSCK Memory Usage]

When it failed after 5 minutes or so with the error in my post, the memory immediately freed up again:

[Image: After FSCK Error Memory Usage]
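The screenshots are not reproduced here. One way to take a similar measurement is to sample /proc/meminfo while fsck runs. The sketch below parses that file's "Key: value kB" format and counts free memory, buffers and page cache as not in use. The helper names and the sample numbers are made up for illustration and are not this server's values.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def used_fraction(info):
    """Fraction of RAM in use, treating free, buffers and cache as available."""
    available = (
        info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)
    )
    return 1 - available / info["MemTotal"]


# Made-up sample values, NOT taken from the server in the post.
SAMPLE = """MemTotal:       1000 kB
MemFree:         100 kB
Buffers:          50 kB
Cached:           50 kB
"""

if __name__ == "__main__":
    print(used_fraction(parse_meminfo(SAMPLE)))
```

On a Linux machine, the same functions could be fed the contents of /proc/meminfo every few seconds during the fsck run to see how quickly usage climbs before the abort.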

© Server Fault or respective owner
