Search Results

Search found 75 results on 3 pages for 'smartctl'.

  • smartctl short test doesn't seem to complete

    - by Cédric COPY
    I am working on a project that involves automated HDD testing through smartctl. The station works fine on most products, but two specific products fail the smartctl test. Both are WD products (WD2500BUDT series). The smartctl behaviour is quite strange: the test launches without any problem, I wait about 2 minutes (the test length), and when I check smartctl I get no result at all. It is as if I had never launched a test (no failure and no success in the smartctl result). The command returns no error, and there is nothing in syslog. As I said, the test works for other products; thousands of units have passed through it. The main smartctl commands used are:
        smartctl -t short /dev/sdX      # Launch test
        smartctl -l selftest /dev/sdX   # Look at test result
    I have tried:
        smartctl -s on /dev/sdX
    or
        smartctl -o on /dev/sdX
    but that doesn't change anything. The system runs Debian 6.0 with smartctl v5.40 (rev 3124) x86_64, and the HDDs are plugged into a SATA-to-PCI controller. I have 4 HDDs connected at a time. If anyone has hints about this problem, I would appreciate them, because I have no idea how to fix it. Thanks in advance. PS: Not sure if this is a serverfault topic, sorry if I was wrong!

  • smartctl not actually running self tests?

    - by canzar
    I want to run the smartctl self tests to check the health of the drives in my RAID array (PERC 5/i). The array is on sda and comprises six drives. I can check the status using sudo smartctl /dev/sda -d megaraid,0 -a And I see that SMART is available and enabled on all the drives. I have tried to run self tests using sudo smartctl /dev/sda -d megaraid,0 -t short and sudo smartctl /dev/sda -d megaraid,0 -t long I have also tried it on all of the drives 0-5. No matter what I try, when I run: sudo smartctl /dev/sda -d megaraid,0 -l selftest I always get the same result, which seems to always report that I have never run a self test. /dev/sda [megaraid_disk_00] [SAT]: Device open changed type from 'megaraid' to 'sat' ===START OF READ SMART DATA SECTION === SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] From what I read, I should have no problem running the short and long self tests on the array while it is mounted. Does anyone else have experience running these tests on a PERC 5/i raid array who could lend some insight into what is causing the problem? (smartmontools release 5.40 dated 2009-12-09 at 21:00:32 UTC)
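The per-disk commands in the post above can be generated instead of typed by hand. This is a minimal sketch, assuming the six physical disks sit at megaraid IDs 0 through 5 as the post describes. `megaraid_commands` and `log_is_empty` are hypothetical helper names; the first only builds argument lists and runs nothing, and the second only looks for the "No self-tests have been logged" line quoted above.

```python
def megaraid_commands(device, disk_ids, test="short"):
    """Build (start_test, read_log) argument lists for each physical disk."""
    commands = []
    for disk_id in disk_ids:
        dev_type = f"megaraid,{disk_id}"
        start = ["smartctl", "-d", dev_type, "-t", test, device]
        read_log = ["smartctl", "-d", dev_type, "-l", "selftest", device]
        commands.append((start, read_log))
    return commands


def log_is_empty(selftest_output):
    """True if the self-test log output says nothing was ever logged."""
    return "No self-tests have been logged" in selftest_output


# Print the command pairs for disks 0-5 behind /dev/sda (nothing is executed).
for start, read_log in megaraid_commands("/dev/sda", range(6)):
    print(" ".join(start), "&&", " ".join(read_log))
```

Keeping the commands as argument lists means they could later be handed to a process runner without shell quoting issues; whether the controller actually forwards the self-test request is a separate question the sketch does not answer.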

  • smartctl or hddtemp for xvda [on hold]

    - by HST
    I'm trying to check the state of the drives on a remote server running Debian wheezy. I'm using a software RAID10 on top of, I guess, Xen, since the entries in /dev are /dev/xvda and /dev/xvdb. But if I try smartctl -a /dev/xvda I get:
        /dev/xvda: Unable to detect device type
        Smartctl: please specify device type with the -d option.
    I've tried various device type guesses; none work. There is a similar problem with hddtemp, which reports:
        ERROR: /dev/xvda: can't determine bus type (or this bus type is unknown)
    I've searched the smartmontools documentation, but can't find any discussion of virtual disks. How do I get behind the virtualisation to something smartctl or hddtemp can work with?

  • smartctl -t long isn't finishing

    - by xenoterracide
    I have been running smartctl -t long on a drive for about 2 days now, and it seems to be stalled at 10%. The short and conveyance tests both passed. I have to send back 1 of the 2 drives I purchased because I found bad blocks on it with badblocks (none on this drive, and it has made more than one pass already). I'm just wondering if I should be concerned about this.

        smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
        Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

        === START OF INFORMATION SECTION ===
        Device Model:     WDC WD10EARS-00Y5B1
        Serial Number:    WD-WMAV51582123
        Firmware Version: 80.00A80
        User Capacity:    1,000,204,886,016 bytes
        Device is:        Not in smartctl database [for details use: -P showall]
        ATA Version is:   8
        ATA Standard is:  Exact ATA specification draft version not indicated
        Local Time is:    Mon May 10 22:19:52 2010 EDT
        SMART support is: Available - device has SMART capability.
        SMART support is: Enabled

        === START OF READ SMART DATA SECTION ===
        SMART overall-health self-assessment test result: PASSED

        General SMART Values:
        Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
        Self-test execution status: ( 241) Self-test routine in progress... 10% of test remaining.
        Total time to complete Offline data collection: (20100) seconds.
        Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
        SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
        Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
        Short self-test routine recommended polling time: ( 2) minutes.
        Extended self-test routine recommended polling time: ( 231) minutes.
        Conveyance self-test routine recommended polling time: ( 5) minutes.
        SCT capabilities: (0x3031) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

        SMART Attributes Data Structure revision number: 16
        Vendor Specific SMART Attributes with Thresholds:
        ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
          1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           2
          3 Spin_Up_Time            0x0027 131   131   021    Pre-fail Always  -           6408
          4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           12
          5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
          7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
          9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           148
         10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
         11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
         12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           10
        192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           7
        193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           174
        194 Temperature_Celsius     0x0022 106   102   000    Old_age  Always  -           41
        196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
        197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
        198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
        199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
        200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

        SMART Error Log Version: 1
        No Errors Logged

        SMART Self-test log structure revision number 1
        Num Test_Description   Status                   Remaining LifeTime(hours) LBA_of_first_error
        # 1 Conveyance offline Completed without error  00%       99              -
        # 2 Extended offline   Interrupted (host reset) 10%       30              -
        # 3 Short offline      Completed without error  00%       0               -

        SMART Selective self-test log data structure revision number 1
        SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
           1       0       0 Not_testing
           2       0       0 Not_testing
           3       0       0 Not_testing
           4       0       0 Not_testing
           5       0       0 Not_testing
        Selective self-test flags (0x0):
          After scanning selected spans, do NOT read-scan remainder of disk.
        If Selective self-test is pending on power-up, resume after 0 minute delay.
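The `( 241)` shown next to "Self-test execution status" in the output above is a single byte that carries two fields. The sketch below assumes the usual ATA layout, where the high four bits hold a status code (15 meaning a test is in progress) and the low four bits hold the remaining portion in tenths. `decode_selftest_status` is a hypothetical name, not part of smartctl.

```python
def decode_selftest_status(value):
    """Split the self-test execution status byte into its two fields."""
    code = (value >> 4) & 0x0F               # high nibble: status code
    remaining_percent = (value & 0x0F) * 10  # low nibble: tenths remaining
    in_progress = code == 0x0F               # 15 means "routine in progress"
    return code, remaining_percent, in_progress


print(decode_selftest_status(241))  # (15, 10, True)
print(decode_selftest_status(0))    # (0, 0, False)
```

Under that assumption, 241 (0xF1) decodes to "in progress, 10% remaining", which matches the text smartctl printed. For scale, the same output gives a recommended polling time of 231 minutes for the extended test, which is under four hours.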

  • Using smartctl to get vendor specific Attributes from ssd drive behind a SmartArray P410 controller

    - by Lairsdragon
    Recently I deployed some HP servers with SSDs behind a SmartArray P410 controller. While not officially supported by HP, the servers work well so far. Now I would like to get wear-level information, error statistics, etc. from the drives. While the SA P410 supports passthrough of SMART commands to a single drive in the array, I was not able to get the interesting values from the drive. In this case the wear level indicator (attribute ID 233) is of particular interest to me, but it is only present if the drive is directly attached to a SATA controller.

    smartctl on directly connected ssd:

        # smartctl -A /dev/sda
        smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
        Home page is http://smartmontools.sourceforge.net/

        === START OF READ SMART DATA SECTION ===
        SMART Attributes Data Structure revision number: 5
        Vendor Specific SMART Attributes with Thresholds:
        ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
          3 Spin_Up_Time            0x0000 100   000   000    Old_age  Offline In_the_past 0
          4 Start_Stop_Count        0x0000 100   000   000    Old_age  Offline In_the_past 0
          5 Reallocated_Sector_Ct   0x0002 100   100   000    Old_age  Always  -           0
          9 Power_On_Hours          0x0002 100   100   000    Old_age  Always  -           8561
         12 Power_Cycle_Count       0x0002 100   100   000    Old_age  Always  -           55
        192 Power-Off_Retract_Count 0x0002 100   100   000    Old_age  Always  -           29
        232 Unknown_Attribute       0x0003 100   100   010    Pre-fail Always  -           0
        233 Unknown_Attribute       0x0002 088   088   000    Old_age  Always  -           0
        225 Load_Cycle_Count        0x0000 198   198   000    Old_age  Offline -           508509
        226 Load-in_Time            0x0002 255   000   000    Old_age  Always  In_the_past 0
        227 Torq-amp_Count          0x0002 000   000   000    Old_age  Always  FAILING_NOW 0
        228 Power-off_Retract_Count 0x0002 000   000   000    Old_age  Always  FAILING_NOW 0

    smartctl on P410 connected ssd:

        # ./smartctl -A -d cciss,0 /dev/cciss/c1d0
        smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
        Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    (Right, it is completely empty.)

    smartctl on P410 connected hdd:

        # ./smartctl -A -d cciss,0 /dev/cciss/c0d0
        smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
        Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

        Current Drive Temperature: 27 C
        Drive Trip Temperature: 68 C
        Vendor (Seagate) cache information
          Blocks sent to initiator = 1871654030
          Blocks received from initiator = 1360012929
          Blocks read from cache and sent to initiator = 2178203797
          Number of read and write commands whose size <= segment size = 46052239
          Number of read and write commands whose size > segment size = 0
        Vendor (Seagate/Hitachi) factory information
          number of hours powered up = 3363.25
          number of minutes until next internal SMART test = 12

    Am I hunting a bug here, or is this a limitation of the P410's SMART command passthrough?
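Checking whether attribute 233 is present can be automated once the `smartctl -A` text has been captured. The following is a rough sketch, assuming the output still has one attribute per line with the ten columns shown in the post. `parse_attributes` is a hypothetical helper, and the sample rows are copied from the directly connected SSD above.

```python
def parse_attributes(text):
    """Return {id: row dict} for the rows of a `smartctl -A` attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID followed by nine more columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "when_failed": fields[8],
            "raw": " ".join(fields[9:]),
        }
    return attrs


SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
232 Unknown_Attribute 0x0003 100 100 010 Pre-fail Always - 0
233 Unknown_Attribute 0x0002 088 088 000 Old_age Always - 0
"""

attrs = parse_attributes(SAMPLE)
wear = attrs.get(233)
print("attribute 233 missing" if wear is None else wear["value"])
```

Run against the empty output from the P410-attached SSD, the same function returns an empty dictionary, so the lookup for 233 reports it as missing rather than raising an error.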

  • smartctl not returning on HBA that's secure-erasing a different drive

    - by Stu2000
    Whenever I run smartctl -i /dev/sd* where * is a drive plugged into the same host bus adapter as another drive that is currently being erased with an hdparm secure erase command, the smartctl command just hangs and does not return (it is blocked) until the erasure of the other drive has finished. To make matters worse, you can't Ctrl-C out of it. Has anyone else had this issue? Is there another way to retrieve SMART data from a drive that doesn't block? I noticed that I can still use the udevadm command to retrieve the serial and model of the drive, which is useful, but it doesn't appear to have any SMART data. Any information relating to this matter is appreciated, especially if you can tell me another way to retrieve the S.M.A.R.T. data that might work. Regards, Stuart

  • External drive hanging, load average through the roof

    - by Paul Tomblin
    I have an external USB drive, and I run an hourly rsync to it as a backup. This has been working fine for years. This weekend, I got two new 2TB internal drives, and decided it was time to re-install Ubuntu from scratch to clear out all the old cruft. About once a day since the re-install, the backup script hangs hard, usually in the "rm -rf" I do before the rsync. By the time I notice the problem, my load average is in the stratosphere and climbing fast (one time, it was over 150), but anything that doesn't touch the drive seems to be running fine. One thing that I find suspicious is that something, I don't know what, is running a "smartctl" and an "hdparm" command on the USB drive. I'm pretty sure smartctl isn't supposed to run on external drives. I can't figure out what's doing it, either. Here's part of ps auwwfx when it's hung:

        root  7310 0.0 0.0  4248  352 ? D 20:15 0:00 /sbin/hdparm -C /dev/sdd
        root  7808 0.0 0.0 17372 1632 ? D 20:15 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root  8427 0.0 0.0  4248  356 ? D 20:20 0:00 /sbin/hdparm -C /dev/sdd
        root  8925 0.0 0.0 17372 1628 ? D 20:20 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root  9529 0.0 0.0  4248  356 ? D 20:25 0:00 /sbin/hdparm -C /dev/sdd
        root 10026 0.0 0.0 17372 1628 ? D 20:25 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 10655 0.0 0.0  4248  356 ? D 20:30 0:00 /sbin/hdparm -C /dev/sdd
        root 11151 0.0 0.0 17372 1632 ? D 20:30 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 11774 0.0 0.0  4248  356 ? D 20:35 0:00 /sbin/hdparm -C /dev/sdd
        root 12271 0.0 0.0 17372 1628 ? D 20:35 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 12878 0.0 0.0  4248  352 ? D 20:40 0:00 /sbin/hdparm -C /dev/sdd
        root 13374 0.0 0.0 17372 1632 ? D 20:40 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 14011 0.0 0.0  4248  352 ? D 20:45 0:00 /sbin/hdparm -C /dev/sdd
        root 14507 0.0 0.0 17372 1628 ? D 20:45 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 15116 0.0 0.0  4248  352 ? D 20:50 0:00 /sbin/hdparm -C /dev/sdd
        root 15612 0.0 0.0 17372 1632 ? D 20:50 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 16223 0.0 0.0  4248  352 ? D 20:55 0:00 /sbin/hdparm -C /dev/sdd
        root 16734 0.0 0.0 17372 1632 ? D 20:55 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 17345 0.0 0.0  4248  352 ? D 21:00 0:00 /sbin/hdparm -C /dev/sdd
        root 17842 0.0 0.0 17372 1628 ? D 21:00 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 18463 0.0 0.0  4248  352 ? D 21:05 0:00 /sbin/hdparm -C /dev/sdd
        root 18960 0.0 0.0 17372 1628 ? D 21:05 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 19598 0.0 0.0  4248  356 ? D 21:10 0:00 /sbin/hdparm -C /dev/sdd
        root 20096 0.0 0.0 17372 1628 ? D 21:10 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 21280 0.0 0.0  4244  356 ? D 21:15 0:00 /sbin/hdparm -C /dev/sdd
        root 21784 0.0 0.0 17372 1632 ? D 21:15 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 22414 0.0 0.0  4244  356 ? D 21:20 0:00 /sbin/hdparm -C /dev/sdd
        root 22912 0.0 0.0 17372 1628 ? D 21:20 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 23541 0.0 0.0  4244  356 ? D 21:25 0:00 /sbin/hdparm -C /dev/sdd
        root 24038 0.0 0.0 17372 1632 ? D 21:25 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
        root 24658 0.0 0.0  4244  356 ? D 21:30 0:00 /sbin/hdparm -C /dev/sdd
        root 25157 0.0 0.0 17372 1628 ? D 21:30 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd

    Why is this happening, and how can I stop it?
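Every process in the listing above is in state `D` (uninterruptible sleep), and a new pair appears every five minutes. A small sketch for tallying such processes from captured `ps aux`-style text follows. It assumes the standard column order (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND), and `blocked_commands` is a hypothetical helper name.

```python
from collections import Counter


def blocked_commands(ps_output):
    """Count processes in state D (uninterruptible sleep) per command line."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header and anything that does not look like a process row.
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        if fields[7].startswith("D"):
            counts[" ".join(fields[10:])] += 1
    return counts


SAMPLE = """\
root 7310 0.0 0.0 4248 352 ? D 20:15 0:00 /sbin/hdparm -C /dev/sdd
root 7808 0.0 0.0 17372 1632 ? D 20:15 0:00 /usr/sbin/smartctl -a -n standby -A -i /dev/sdd
root 8427 0.0 0.0 4248 356 ? D 20:20 0:00 /sbin/hdparm -C /dev/sdd
"""

for command, count in blocked_commands(SAMPLE).most_common():
    print(count, command)
```

Grouping by full command line shows at a glance which commands pile up against the device; it does not identify the parent that launches them, which would need the process tree from `ps`.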

  • A faulty Caviar Blue hard drive?

    - by Glister
    We have a small "homemade" server running fully updated Debian Wheezy (amd64). One hard drive installed: WDC WD6400AAKS. The motherboard is ASUS M4N68T V2. The usual load: CPU: an average of 20% Each week about 50GB of additional space is occupied. About 47GB of uploaded files and 3GB of MySQL data. I'm afraid that the hard drive may be about to fail. I saw Pre-fail on few places when I ran: root@SERVER:/tmp# smartctl -a /dev/sda smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Model Family: Western Digital Caviar Blue Serial ATA Device Model: WDC WD6400AAKS-XXXXXXX Serial Number: WD-XXXXXXXXXXXXXXXXXXX LU WWN Device Id: 5 0014ee XXXXXXXXXXXXX Firmware Version: 01.03B01 User Capacity: 640,135,028,736 bytes [640 GB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Mon Oct 28 18:55:27 2013 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x85) Offline data collection activity was aborted by an interrupting command from host. Auto Offline Data Collection: Enabled. Self-test execution status: ( 247) Self-test routine in progress... 70% of test remaining. Total time to complete Offline data collection: (11580) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. 
Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 136) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SCT capabilities: (0x303f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 157 146 021 Pre-fail Always - 5108 4 Start_Stop_Count 0x0032 098 098 000 Old_age Always - 2968 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 051 Old_age Always - 0 9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 15445 10 Spin_Retry_Count 0x0032 100 100 051 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 100 051 Old_age Always - 0 12 Power_Cycle_Count 0x0032 098 098 000 Old_age Always - 2950 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 426 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2968 194 Temperature_Celsius 0x0022 111 095 000 Old_age Always - 36 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 160 000 Old_age Always - 21716 200 Multi_Zone_Error_Rate 0x0008 200 200 051 Old_age Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 15444 - Error SMART Read Selective Self-Test Log failed: scsi error aborted 
command Smartctl: SMART Selective Self Test Log Read Failed root@SERVER:/tmp# In one tutorial I read that Pre-fail is an indication of coming failure; in another tutorial I read that this is not true. Can you guys help me decode the output of smartctl? Suggestions on what I should do to ensure data integrity would also be welcome (about 50GB of new data each week, up to 2TB for the whole period I'm interested in). Maybe I will go with 2x2TB Caviar Black in RAID4?
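One way to read the table in the post is to compare each attribute's normalized VALUE with its THRESH. The sketch below assumes the rule documented for smartctl, that an attribute counts as failed when VALUE is less than or equal to THRESH, with the TYPE column (Pre-fail or Old_age) describing only what kind of attribute it is. `failing_attributes` is a hypothetical helper, and the sample rows are copied from the output above.

```python
def failing_attributes(table_text):
    """List (name, type, value, thresh) for rows where VALUE <= THRESH."""
    failing = []
    for line in table_text.splitlines():
        f = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(f) < 10 or not f[0].isdigit():
            continue
        name, value, thresh, kind = f[1], int(f[3]), int(f[5]), f[6]
        if value <= thresh:
            failing.append((name, kind, value, thresh))
    return failing


SAMPLE = """\
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
199 UDMA_CRC_Error_Count 0x0032 200 160 000 Old_age Always - 21716
"""

print(failing_attributes(SAMPLE))  # []
```

For these three rows the list comes back empty, since every VALUE is above its THRESH. The check says nothing about raw counters such as the 21716 recorded for UDMA_CRC_Error_Count, which need to be judged separately.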

  • What is a Linux device name for RAID of sas drives?

    - by flashnik
    I have a RAID1 on a Promise FastTrak TX2650 consisting of 2 SAS drives. What is the Linux device name for them, in the way that sda is the name of the first SATA drive? I have a Windows server, so I can't look it up directly, but I need this information to use smartctl. UPDATE: I found how to access the RAID: smartctl -d scsi sdb (because I also have a SATA drive). But in this case I only get information about the RAID controller, whereas I want information about the drives themselves. Is that possible? Promise's control panel only reports their health status (a boolean), and I want more. Right now I mostly need temperature information.

  • How do I easily repair a single unreadable block on a Linux disk?

    - by Nelson
    My Linux system has started throwing SMART errors in the syslog. I tracked it down and believe the problem is a single block on the disk. How do I go about easily getting the disk to reallocate that one block? I'd like to know what file got destroyed in the process. (I'm aware that if one block fails on a disk others are likely to follow; I have a good ongoing backup and just want to try to keep this disk working.) Searching the web leads to the Bad block HOWTO, which describes a manual process on an unmounted disk. It seems complicated and error-prone. Is there a tool to automate this process in Linux? My only other option is the manufacturer's diagnostic tool, but I presume that'll clobber the bad block without any reporting on what got destroyed. Worst case, it might be filesystem metadata. The disk in question is the primary system partition. Using ext3fs and LVM. Here's the error log from syslog and the relevant bit from smartctl. smartd[5226]: Device: /dev/hda, 1 Currently unreadable (pending) sectors Error 1 occurred at disk power-on lifetime: 17449 hours (727 days + 1 hours) ... Error: UNC at LBA = 0x00d39eee = 13868782 There's a full smartctl dump on pastebin.
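The numbers in the error log above can be cross-checked with a little arithmetic. This sketch assumes 512-byte logical sectors, which the post does not state; `lba_to_byte_offset` and `hours_to_days` are hypothetical helper names.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def lba_to_byte_offset(lba_hex, sector_size=SECTOR_SIZE):
    """Convert a hex LBA string to (decimal LBA, byte offset on the disk)."""
    lba = int(lba_hex, 16)
    return lba, lba * sector_size


def hours_to_days(hours):
    """Split a power-on hour count into (days, leftover hours)."""
    return divmod(hours, 24)


print(lba_to_byte_offset("0x00d39eee"))  # (13868782, 7100816384)
print(hours_to_days(17449))              # (727, 1)
```

Both results agree with the log: 0x00d39eee is LBA 13868782, and 17449 hours is 727 days plus 1 hour. The byte offset is measured from the start of the whole disk, so mapping it to a file would still require the partition start and the LVM layout, which this sketch does not attempt.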

  • Skipping scheduled self-tests and predicting drive EOL

    - by Steve Madsen
    For a few weeks now, smartd has been reporting that it is skipping some of its scheduled self-tests on the weekends: Apr 24 18:29:32 calvin smartd[4758]: Device: /dev/sda, skip scheduled Offline Immediate Test; 40% remaining of current Self-Test. Apr 24 18:29:33 calvin smartd[4758]: Device: /dev/sdb, skip scheduled Offline Immediate Test; 50% remaining of current Self-Test. The drives in this RAID-1 array are set to run an offline test four times a day, a short self-test at 2am every day, and a long self-test on Saturdays at 2am. For some reason, it looks like the long self-test is taking longer, causing the other scheduled tests to be skipped. First question: is this a sign of likely drive failure? Then today, smartd reported that a self-test failed. Here is the output of smartctl -a /dev/sdb: smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF INFORMATION SECTION === Model Family: Seagate Barracuda 7200.8 family Device Model: ST3250823AS Serial Number: 3ND1GNBC Firmware Version: 3.03 User Capacity: 250,059,350,016 bytes Device is: In smartctl database [for details use: -P show] ATA Version is: 7 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Sun Apr 25 13:15:34 2010 EDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 430) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. 
Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 84) minutes. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 047 039 006 Pre-fail Always - 168450357 3 Spin_Up_Time 0x0003 098 098 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 33 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 9 7 Seek_Error_Rate 0x000f 087 060 030 Pre-fail Always - 654745480 9 Power_On_Hours 0x0032 055 055 000 Old_age Always - 40141 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 51 194 Temperature_Celsius 0x0022 037 062 000 Old_age Always - 37 (0 17 0 0) 195 Hardware_ECC_Recovered 0x001a 047 039 000 Old_age Always - 168450357 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0 202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 40131 - # 2 Extended offline Completed: read failure 30% 40129 379795511 # 3 Short offline Completed without error 00% 40084 - # 4 Short offline Completed 
without error 00% 40060 - # 5 Short offline Completed without error 00% 40036 - # 6 Short offline Completed without error 00% 40013 - # 7 Short offline Completed without error 00% 39990 - # 8 Extended offline Completed without error 00% 39977 - # 9 Short offline Completed without error 00% 39919 - #10 Short offline Completed without error 00% 39895 - #11 Short offline Completed without error 00% 39872 - #12 Short offline Completed without error 00% 39848 - #13 Short offline Completed without error 00% 39824 - #14 Short offline Completed without error 00% 39801 - #15 Extended offline Completed without error 00% 39789 - #16 Short offline Completed without error 00% 39754 - #17 Short offline Completed without error 00% 39732 - #18 Short offline Completed without error 00% 39707 - #19 Short offline Completed without error 00% 39683 - #20 Short offline Completed without error 00% 39660 - #21 Short offline Completed without error 00% 39636 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. Given that this drive is about 4.5 years old, I am probably tempting fate by keeping it in service. SMART doesn't seem to get much respect as a reliable way to predict drive failure. What else can I use to get an early indication of drive failure?
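With a self-test log as long as the one above, the failed entry is easy to miss. Here is a minimal sketch for pulling out entries that report an LBA of first error, assuming one log entry per line in the layout smartctl prints. `failed_selftests` is a hypothetical helper; it keeps the test description and status together as one string because the two are separated only by column position.

```python
import re

# "# N  <description and status>  NN%  <lifetime hours>  <LBA or ->"
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def failed_selftests(log_text):
    """Return (num, summary, lifetime_hours, lba) for entries with an error LBA."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        num, summary, _remaining, hours, lba = match.groups()
        if lba != "-":
            failures.append((int(num), summary, int(hours), int(lba)))
    return failures


SAMPLE = """\
# 1 Short offline Completed without error 00% 40131 -
# 2 Extended offline Completed: read failure 30% 40129 379795511
#10 Short offline Completed without error 00% 39895 -
"""

for entry in failed_selftests(SAMPLE):
    print(entry)
```

On the sample rows, only entry 2 is returned, with lifetime hour 40129 and LBA 379795511, the same read failure visible in the post's log.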

  • Why are SMART error rates going down?

    - by Jeff Shattock
    I have a hard drive that's part of a Linux software raid5 array. SMART has reported that its multi_zone_error_rate was 0, then 1, then 3. So I figured I had better start backing up more frequently and prepare to replace the drive. Now, today, the multi_zone_error_rate of that very same drive is back down to 1. It seems that 2 errors unhappened while I wasn't looking. I've also seen similar behaviour by inspecting the syslog on the server:

        Jun 7 21:01:17 FS1 smartd[25593]: Device: /dev/sdc, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
        Jun 7 21:01:17 FS1 smartd[25593]: Device: /dev/sde, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
        Jun 7 21:01:18 FS1 smartd[25593]: Device: /dev/sdg, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
        Jun 8 02:31:18 FS1 smartd[25593]: Device: /dev/sdg, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
        Jun 8 03:01:17 FS1 smartd[25593]: Device: /dev/sdc, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
        Jun 8 03:01:17 FS1 smartd[25593]: Device: /dev/sde, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200

    These are raw values, not the human-useful values that smartctl -a produces, but the behaviour is similar: error rates changing, then undoing the change. None of these is the drive that had the multi_zone weirdness. I haven't seen any problems from the RAID; its most recent scrub (< 24 hours ago) came back totally clean. The only thing I can think of is that the SMART reporting circuitry on the drive isn't working properly all the time. The cables are in tight on the drive and board. What's going on here?
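The "changed, then changed back" pattern in the syslog lines above can be picked out mechanically. The sketch below assumes smartd's message wording exactly as quoted in the post; `attribute_history` and `reverted` are hypothetical helper names.

```python
import re
from collections import defaultdict

CHANGE = re.compile(
    r"Device: (\S+), SMART \w+ Attribute: (\d+) (\S+) changed from (\d+) to (\d+)"
)


def attribute_history(syslog_text):
    """Map (device, attribute name) to the ordered list of (old, new) changes."""
    history = defaultdict(list)
    for match in CHANGE.finditer(syslog_text):
        device, _attr_id, name, old, new = match.groups()
        history[(device, name)].append((int(old), int(new)))
    return history


def reverted(changes):
    """True if the last logged value equals the first value seen."""
    return bool(changes) and changes[0][0] == changes[-1][1]


SAMPLE = """\
Jun 7 21:01:17 FS1 smartd[25593]: Device: /dev/sdc, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Jun 7 21:01:18 FS1 smartd[25593]: Device: /dev/sdg, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Jun 8 02:31:18 FS1 smartd[25593]: Device: /dev/sdg, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Jun 8 03:01:17 FS1 smartd[25593]: Device: /dev/sdc, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
"""

for key, changes in attribute_history(SAMPLE).items():
    print(key, changes, "reverted" if reverted(changes) else "changed")
```

For the sample lines, both /dev/sdc and /dev/sdg end up back at 200 and are reported as reverted. The sketch only tracks the numbers smartd logged and makes no claim about why they moved.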

  • Do I need to be worried about these SMART drive temperatures?

    - by Steve Lorimer
    I have 5 hard drives in a machine sitting in a cupboard. /dev/sda is a 500GB Seagate drive, and is the boot disk. /dev/sd{b,c,d,e} are 2TB drives in a raid6 configuration. smartctl is showing significantly higher temperatures (like ~140 degrees celsius) on the raid drives than the boot drive. Do I need to be worried? /dev/sdb and /dev/sde are new Western Digital Black drives (new=1 week) /dev/sdc and /dev/sdd are 5 year old Hitachi drives /dev/sda [SAT], Temperature_Celsius changed from 40 to 39 /dev/sdc [SAT], Temperature_Celsius changed from 142 to 146 /dev/sdc [SAT], Temperature_Celsius changed from 146 to 142 /dev/sdd [SAT], Temperature_Celsius changed from 142 to 146 /dev/sda [SAT], Airflow_Temperature_Cel changed from 61 to 62 /dev/sda [SAT], Temperature_Celsius changed from 39 to 38 /dev/sde [SAT], Temperature_Celsius changed from 107 to 108 /dev/sdb [SAT], Temperature_Celsius changed from 108 to 109 /dev/sdc [SAT], Temperature_Celsius changed from 146 to 150 /dev/sdc [SAT], Temperature_Celsius changed from 146 to 150 /dev/sda [SAT], Airflow_Temperature_Cel changed from 62 to 61 /dev/sda [SAT], Temperature_Celsius changed from 38 to 39 Update: Adding detailed drive information as per request: /dev/sda =========================== smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.9.10-100.fc17.x86_64] (local build) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate Pipeline HD 5900.2 Device Model: ST3500312CS Serial Number: 5VV47HXA LU WWN Device Id: 5 000c50 02aad5ad6 Firmware Version: SC13 User Capacity: 500,107,862,016 bytes [500 GB] Sector Size: 512 bytes logical/physical Rotation Rate: 5900 rpm Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 2.6, 1.5 Gb/s (current: 1.5 Gb/s) Local Time is: Tue Jun 3 10:54:11 2014 EST SMART support is: Available - device has SMART capability. 
SMART support is: Enabled /dev/sdb =========================== smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.9.10-100.fc17.x86_64] (local build) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Device Model: WDC WD2003FZEX-00Z4SA0 Serial Number: WD-WMC1F1398726 LU WWN Device Id: 5 0014ee 003b8bd25 Firmware Version: 01.01A01 User Capacity: 2,000,398,934,016 bytes [2.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Device is: Not in smartctl database [for details use: -P showall] ATA Version is: ACS-2 (minor revision not indicated) SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s) Local Time is: Tue Jun 3 10:54:11 2014 EST SMART support is: Available - device has SMART capability. SMART support is: Enabled /dev/sdc =========================== smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.9.10-100.fc17.x86_64] (local build) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Hitachi Deskstar 7K3000 Device Model: Hitachi HDS723020BLA642 Serial Number: MN1220F30WSTUD LU WWN Device Id: 5 000cca 369cc9f5d Firmware Version: MN6OA580 User Capacity: 2,000,398,934,016 bytes [2.00 TB] Sector Size: 512 bytes logical/physical Rotation Rate: 7200 rpm Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s) Local Time is: Tue Jun 3 10:54:11 2014 EST SMART support is: Available - device has SMART capability. 
SMART support is: Enabled /dev/sdd =========================== smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.9.10-100.fc17.x86_64] (local build) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Hitachi Deskstar 7K3000 Device Model: Hitachi HDS723020BLA642 Serial Number: MN1220F30WST4D LU WWN Device Id: 5 000cca 369cc9f48 Firmware Version: MN6OA580 User Capacity: 2,000,398,934,016 bytes [2.00 TB] Sector Size: 512 bytes logical/physical Rotation Rate: 7200 rpm Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 2.6, 6.0 Gb/s (current: 1.5 Gb/s) Local Time is: Tue Jun 3 10:54:11 2014 EST SMART support is: Available - device has SMART capability. SMART support is: Enabled /dev/sde =========================== smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.9.10-100.fc17.x86_64] (local build) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Device Model: WDC WD2003FZEX-00Z4SA0 Serial Number: WD-WMC1F1483782 LU WWN Device Id: 5 0014ee 3002d235c Firmware Version: 01.01A01 User Capacity: 2,000,398,934,016 bytes [2.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Device is: Not in smartctl database [for details use: -P showall] ATA Version is: ACS-2 (minor revision not indicated) SATA Version is: SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s) Local Time is: Tue Jun 3 10:54:11 2014 EST SMART support is: Available - device has SMART capability. SMART support is: Enabled

    Read the article

  • Bad motherboard / controller / HDs?

    - by quidpro
    On a leased server, I am running into timing issues with an application that requires precise timing. The server is a dual Xeon E5410 on a Supermicro X7DVL-3 motherboard running CentOS 5.5 x64. The application is timing-sensitive and keeps sensing drift, whether under load or at idle, but especially under load. I did some investigating with atop and dd and found some mind-blowing numbers. Mind you, I am no Linux guru, but something sure seems out of whack. To generate disk activity, I ran:

    dd bs=4096 if=/dev/zero of=/bigtestfile

    Regardless of whether I wrote to sda or sdb, the DSK busy value in atop would go over 100%, at one point peaking at 1700%:

    DSK | sdb | busy 675% | read 0 | write 110 | avio 78 ms |

    Here are the smartctl outputs:

    # smartctl -A /dev/sda
    smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/

    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0007   165   165   021    Pre-fail  Always       -       2750
      4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       21
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000a   200   200   051    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       25831
     10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
    194 Temperature_Celsius     0x0022   116   093   000    Old_age   Always       -       27
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
    199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

    # smartctl -A /dev/sdb
    smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/

    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   180   180   021    Pre-fail  Always       -       3958
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       22
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
      9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       24087
     10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
     11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
    194 Temperature_Celsius     0x0022   122   096   000    Old_age   Always       -       25
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

    Any idea what's wrong here? Bad motherboard? It would seem rare that both drives are going bad (smartctl says they PASS), so that leaves the mobo as the culprit in my eyes.
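    Checking a `smartctl -A` table by eye is error-prone, so here is a rough sketch of how the rows could be parsed and compared against their thresholds. It assumes the ten-column layout shown in this post (ID#, name, flag, VALUE, WORST, THRESH, type, updated, when-failed, raw value); the `Attr` type and both function names are hypothetical, and the sample rows are a subset of the /dev/sda output above.

```python
from typing import NamedTuple


class Attr(NamedTuple):
    id: int
    name: str
    value: int   # normalized current value
    worst: int   # lowest normalized value recorded
    thresh: int  # failure threshold for the normalized value
    type: str    # "Pre-fail" or "Old_age"
    raw: str     # raw value, kept as text because some drives append notes


def parse_attribute_row(row: str) -> Attr:
    """Split one attribute row into its ten whitespace-separated columns."""
    f = row.split(None, 9)  # the tenth field keeps any trailing text intact
    return Attr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[6], f[9].strip())


def at_or_below_threshold(attrs):
    """Return attributes whose normalized value has reached a non-zero threshold."""
    return [a for a in attrs if a.thresh > 0 and a.value <= a.thresh]


sample_rows = [
    "5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0",
    "9 Power_On_Hours 0x0032 065 065 000 Old_age Always - 25831",
    "10 Spin_Retry_Count 0x0012 100 253 051 Old_age Always - 0",
    "197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0",
]

attrs = [parse_attribute_row(r) for r in sample_rows]
flagged = at_or_below_threshold(attrs)
print("attributes at or below threshold:", [a.name for a in flagged] or "none")
```

    None of the sampled rows is at its threshold, which is consistent with the drives reporting a passing status. That says nothing about the controller or motherboard.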

    Read the article

  • How to enable SMART?

    - by Pratik Koirala
    I want to run a SMART test on my drive, but SMART was disabled. So I used

    sudo smartctl -s on /dev/sda

    but the result was:

    smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.0-26-generic] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF ENABLE/DISABLE COMMANDS SECTION ===
    Error SMART Enable failed: scsi error aborted command
    Smartctl: SMART Enable Failed.

    A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

    How can I overcome this problem?

    Read the article

  • How do you monitor SSD wear in Windows when the drives are presented as 'generic' devices?

    - by MikeyB
    Under Linux, we can monitor SSD wear fairly easily with smartmontools whether the drive is presented as a normal block device or a generic device (which happens when the drive has been hardware RAIDed by certain controllers such as the one on the IBM HS22). How can we do the equivalent under Windows? Does anyone actually use smartmontools? Or are there other packages out there? The problem is that SCSI Generic devices just don't show up in Windows. If the drives aren't RAIDed we can see them fine. How I'd do it in Linux: sles11-live:~ # lsscsi -g [1:0:0:0] disk SMART USB-IBM 8989 /dev/sda /dev/sg0 [2:0:0:0] disk ATA MTFDDAK256MAR-1K MA44 - /dev/sg1 [2:0:1:0] disk ATA MTFDDAK256MAR-1K MA44 - /dev/sg2 [2:1:8:0] disk LSILOGIC Logical Volume 3000 /dev/sdb /dev/sg3 sles11-live:~ # smartctl -l ssd /dev/sg1 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32.49-0.3-default] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net Device Statistics (GP Log 0x04) Page Offset Size Value Description 7 ===== = = == Solid State Device Statistics (rev 1) == 7 0x008 1 26~ Percentage Used Endurance Indicator |_ ~ normalized value sles11-live:~ # smartctl -l ssd /dev/sg2 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32.49-0.3-default] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net Device Statistics (GP Log 0x04) Page Offset Size Value Description 7 ===== = = == Solid State Device Statistics (rev 1) == 7 0x008 1 3~ Percentage Used Endurance Indicator |_ ~ normalized value
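    Whichever tool ends up producing the report, pulling the wear figure out of the text is straightforward. The sketch below assumes output shaped like the `smartctl -l ssd` listings quoted above; the function name `percentage_used` is made up for this example, and nothing here addresses how to reach the drives behind the RAID controller on Windows.

```python
import re
from typing import Optional

# The value column may carry a trailing "~", which the listing marks as "normalized value".
USED_RE = re.compile(r"(\d+)~?\s+Percentage Used Endurance Indicator")


def percentage_used(report: str) -> Optional[int]:
    """Return the endurance indicator from a device-statistics report, if present."""
    m = USED_RE.search(report)
    return int(m.group(1)) if m else None


# The two statistics lines quoted in the question, one per drive.
sg1 = "7 0x008 1 26~ Percentage Used Endurance Indicator"
sg2 = "7 0x008 1 3~ Percentage Used Endurance Indicator"

for name, text in (("/dev/sg1", sg1), ("/dev/sg2", sg2)):
    print(name, "percentage used:", percentage_used(text))
```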

    Read the article

  • Home server hard drive: 186k start-stop cycles in 325 days?

    - by j-g-faustus
    I set up a home server about a year ago, using Ubuntu server (10.04 LTS at the moment), four disks in RAID 5 for storage (WD Green 1.5 TB) and a laptop drive for the OS. Today the output of smartctl, a command line utility for checking the SMART attributes of a hard drive, tells me that the primary OS drive has had no less than 186,000 start-stop cycles in 325 days and may be nearing the end of its lifespan. The smartctl output is in "normalized values", in this case a number between 200 and 000, where 200 is "brand new" and 000 means "worn out". My disk gets 001. So I wonder what happened: 186k start/stop cycles in 7820 hours is about one start/stop per 2.5 minutes around the clock. This seems somewhat excessive for a computer that sees actual use once or twice per day. (The RAID disks are normal, averaging to one start/stop per day, as expected.) Does anyone have similar experiences, or pointers to what might be the issue here? Specifically I'd like to know Why the massive start/stop count? Do I have some sort of configuration issue? Could there be a background service that is causing trouble? Could having a laptop disk as the OS drive be part of the problem? Can anyone confirm or deny this? 
    Here is the /etc/hdparm.conf configuration:

    /dev/sda {
        apm = 127
        spindown_time = 120
    }

    and the most relevant parts of smartctl --attributes /dev/sda:

    smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen

    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   001   001   000    Old_age   Always       -       185875
      9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7820
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       109
    193 Load_Cycle_Count        0x0032   118   118   000    Old_age   Always       -       246833
    194 Temperature_Celsius     0x0022   107   098   000    Old_age   Always       -       36

    As I generally prefer my drives to last more than a year, any advice is appreciated.
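    The "one start/stop per 2.5 minutes" figure follows directly from the raw values in the attribute table. A small worked calculation, using only numbers from this post (the helper name `minutes_per_event` is invented for the example):

```python
# Raw values copied from the smartctl output in this post.
power_on_hours = 7820
start_stop_count = 185875
load_cycle_count = 246833


def minutes_per_event(hours: int, events: int) -> float:
    """Average number of powered-on minutes between two events."""
    return hours * 60 / events


days_powered_on = power_on_hours / 24
start_stop_interval = minutes_per_event(power_on_hours, start_stop_count)
load_cycle_interval = minutes_per_event(power_on_hours, load_cycle_count)

print(f"powered on for {days_powered_on:.1f} days")
print(f"one start/stop every {start_stop_interval:.2f} minutes")
print(f"one load cycle every {load_cycle_interval:.2f} minutes")
```

    By the same arithmetic the load cycle counter is advancing even faster than the start/stop counter, at roughly one cycle every two minutes.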

    Read the article

  • System locking up with suspicious messages about hard disk

    - by Chris Conway
    My system has started behaving strangely, intermittently locking up. I see messages like the following in syslog: Nov 18 22:22:00 claypool kernel: [ 3428.078156] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Nov 18 22:22:00 claypool kernel: [ 3428.078163] ata3.00: irq_stat 0x40000000 Nov 18 22:22:00 claypool kernel: [ 3428.078167] sr 2:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00 Nov 18 22:22:00 claypool kernel: [ 3428.078182] ata3.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Nov 18 22:22:00 claypool kernel: [ 3428.078184] res 50/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error) Nov 18 22:22:00 claypool kernel: [ 3428.078188] ata3.00: status: { DRDY } Nov 18 22:22:00 claypool kernel: [ 3428.080887] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Nov 18 22:22:00 claypool kernel: [ 3428.080890] ata3.00: irq_stat 0x40000000 Nov 18 22:22:00 claypool kernel: [ 3428.080893] sr 2:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00 Nov 18 22:22:00 claypool kernel: [ 3428.080905] ata3.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Nov 18 22:22:00 claypool kernel: [ 3428.080906] res 50/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error) Nov 18 22:22:00 claypool kernel: [ 3428.080910] ata3.00: status: { DRDY } And then this: Nov 18 23:13:56 claypool kernel: [ 6544.000798] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Nov 18 23:13:56 claypool kernel: [ 6544.000804] ata1.00: failed command: FLUSH CACHE EXT Nov 18 23:13:56 claypool kernel: [ 6544.000814] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Nov 18 23:13:56 claypool kernel: [ 6544.000815] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) Nov 18 23:13:56 claypool kernel: [ 6544.000819] ata1.00: status: { DRDY } Nov 18 23:13:56 claypool kernel: [ 6544.000825] ata1: hard resetting link Nov 18 23:14:01 claypool kernel: [ 6549.360324] ata1: link is slow to respond, please be patient (ready=0) Nov 18 23:14:06 claypool kernel: [ 6554.008091] 
ata1: COMRESET failed (errno=-16) Nov 18 23:14:06 claypool kernel: [ 6554.008103] ata1: hard resetting link Nov 18 23:14:11 claypool kernel: [ 6559.372246] ata1: link is slow to respond, please be patient (ready=0) Nov 18 23:14:16 claypool kernel: [ 6564.020228] ata1: COMRESET failed (errno=-16) Nov 18 23:14:16 claypool kernel: [ 6564.020235] ata1: hard resetting link Nov 18 23:14:21 claypool kernel: [ 6569.380109] ata1: link is slow to respond, please be patient (ready=0) Nov 18 23:14:31 claypool kernel: [ 6579.460243] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Nov 18 23:14:31 claypool kernel: [ 6579.486595] ata1.00: configured for UDMA/133 Nov 18 23:14:31 claypool kernel: [ 6579.486601] ata1.00: retrying FLUSH 0xea Emask 0x4 Nov 18 23:14:31 claypool kernel: [ 6579.486939] ata1.00: device reported invalid CHS sector 0 Nov 18 23:14:31 claypool kernel: [ 6579.486952] ata1: EH complete Nov 18 23:17:01 claypool CRON[3910]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Nov 18 23:17:01 claypool CRON[3908]: (CRON) error (grandchild #3910 failed with exit status 1) Nov 18 23:17:01 claypool postfix/sendmail[3925]: fatal: open /etc/postfix/main.cf: No such file or directory Nov 18 23:17:01 claypool CRON[3908]: (root) MAIL (mailed 1 byte of output; but got status 0x004b, #012) Nov 18 23:39:01 claypool CRON[4200]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm) There are no messages marked after 23:39. When I next tried to use the machine, it would not return from the screensaver (blank screen), nor switch to another terminal, and I had to hard reboot it. [UPDATE] The output of smartctl is here. I had trouble getting this, because / is being mounted read-only (?!), which prevents most applications from running. 
Also, it may not be related, but I have the following worrying messages in dmesg: [ 10.084596] k8temp 0000:00:18.3: Temperature readouts might be wrong - check erratum #141 [ 10.098477] i2c i2c-0: nForce2 SMBus adapter at 0x600 [ 10.098483] ACPI: resource nForce2_smbus [io 0x0700-0x073f] conflicts with ACPI region SM00 [??? 0x00000700-0x0000073f flags 0x30] [ 10.098486] ACPI: This conflict may cause random problems and system instability [ 10.098487] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver [ 10.098509] i2c i2c-1: nForce2 SMBus adapter at 0x700 [ 10.112570] Linux agpgart interface v0.103 [ 10.155329] atk: Resources not safely usable due to acpi_enforce_resources kernel parameter [ 10.161506] it87: Found IT8712F chip at 0x290, revision 8 [ 10.161517] it87: VID is disabled (pins used for GPIO) [ 10.161527] it87: in3 is VCC (+5V) [ 10.161528] it87: in7 is VCCH (+5V Stand-By) [ 10.161560] ACPI: resource it87 [io 0x0295-0x0296] conflicts with ACPI region ECRE [??? 0x00000290-0x000002af flags 0x45] [ 10.161562] ACPI: This conflict may cause random problems and system instability [ 10.161564] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver [UPDATE 2] I swapped in a new SATA cable, per Phil's suggestion. The current output of smartctl is here, if it helps. [UPDATE 3] I don't think the cable fixed it. 
The system hasn't locked up yet, but my media player crashed a few minutes ago and I have the following in the syslog: Nov 20 16:07:17 claypool kernel: [ 2294.400033] ata1: link is slow to respond, please be patient (ready=0) Nov 20 16:07:47 claypool kernel: [ 2324.084581] ata1: COMRESET failed (errno=-16) Nov 20 16:07:47 claypool kernel: [ 2324.084588] ata1: limiting SATA link speed to 1.5 Gbps Nov 20 16:07:47 claypool kernel: [ 2324.084592] ata1: hard resetting link I get the following response from smartctl: $ sudo smartctl -a /dev/sda [sudo] password for chris: sudo: Can't open /var/lib/sudo/chris/0: Read-only file system smartctl 5.40 2010-03-16 r3077 [i686-pc-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net Device: /0:0:0:0 Version: scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46 >> Terminate command early due to bad response to IEC mode page A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

    Read the article

  • How to detect hard disk failure?

    - by Devator
    One of my servers has a hard disk failure. It's running software RAID. The system locked up, and according to /proc/mdstat (and /var/log/messages) the disk is really down:

    Personalities : [raid1]
    md2 : active raid1 sdb2[1]
          104320 blocks [2/1] [_U]
    md5 : active raid1 sdb5[1]
          2104448 blocks [2/1] [_U]
    md6 : active raid1 sdb6[1]
          830134656 blocks [2/1] [_U]
    md1 : active raid1 sdb1[1]
          143363968 blocks [2/1] [_U]

    and

    Nov 5 22:04:37 m38501 smartd[4467]: Device: /dev/sda, not capable of SMART self-check

    However, when I run smartctl -H /dev/sda, it passes the test. It also passes with smartctl --test=short /dev/sda. So, is smartctl a broken testing tool, or am I doing something completely off?
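    The software RAID status itself is a useful failure signal that does not depend on SMART. As a rough sketch, degraded arrays can be picked out of /proc/mdstat by looking at the `[expected/active]` counters and the `[_U]` flags. The function name `degraded_arrays` is hypothetical, and the sample text is the mdstat content quoted in the question.

```python
import re

# Array name, then the nearest "[expected/active] [flags]" status that follows it.
# Caveat: an array with no status block (for example an inactive one) would
# pick up the status of the next array, so this is only a rough check.
ARRAY_RE = re.compile(r"(md\d+)\s*:.*?\[(\d+)/(\d+)\]\s+\[([U_]+)\]", re.DOTALL)


def degraded_arrays(mdstat_text: str):
    """Return names of arrays with fewer active members than expected."""
    degraded = []
    for name, expected, active, flags in ARRAY_RE.findall(mdstat_text):
        if int(active) < int(expected) or "_" in flags:
            degraded.append(name)
    return sorted(degraded)


sample = (
    "Personalities : [raid1] "
    "md2 : active raid1 sdb2[1] 104320 blocks [2/1] [_U] "
    "md5 : active raid1 sdb5[1] 2104448 blocks [2/1] [_U] "
    "md6 : active raid1 sdb6[1] 830134656 blocks [2/1] [_U] "
    "md1 : active raid1 sdb1[1] 143363968 blocks [2/1] [_U]"
)

print("degraded:", degraded_arrays(sample))
```

    On a live system the text would come from reading /proc/mdstat; here all four mirrors are reported as degraded because only the sdb half of each is present.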

    Read the article

  • Making sense of S.M.A.R.T

    - by James
    First of all, I think everyone knows that hard drives fail a lot more often than the manufacturers would like to admit. Google did a study indicating that certain raw data attributes reported in a hard drive's S.M.A.R.T status can correlate strongly with the future failure of the drive:

    "We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever."

    Seagate seems to be trying to obscure this information about their drives by claiming that only their software can accurately determine the status of their drives, and their software will not tell you the raw data values for the S.M.A.R.T attributes. Western Digital has made no such claim to my knowledge, but their status reporting tool does not appear to report raw data values either. I've been using HDtune and smartctl from smartmontools to gather the raw data values for each attribute. I've found that I am indeed comparing apples to oranges when it comes to certain attributes. For example, most Seagate drives report many millions of read errors, while Western Digital drives show 0 read errors 99% of the time. I've also found that Seagate drives report many millions of seek errors, while Western Digital drives always seem to report 0. Now for my question. How do I normalize this data? Is Seagate producing millions of errors while Western Digital is producing none?
Wikipedia's article on S.M.A.R.T status says that manufacturers have different ways of reporting this data. Here is my hypothesis: I think I found a way to normalize (is that the right term?) the data. Seagate drives have an additional attribute that Western Digital drives do not have (Hardware ECC Recovered). When you subtract the Read error count from the ECC Recovered count, you'll probably end up with 0. This seems to be equivalent to Western Digitals reported "Read Error" count. This means that Western Digital only reports read errors that it cannot correct while Seagate counts up all read errors and tells you how many of those it was able to fix. I had a Seagate drive where the ECC Recovered count was less than the Read error count and I noticed that many of my files were becoming corrupt. This is how I came up with my hypothesis. The millions of seek errors that Seagate produces are still a mystery to me. Please confirm or correct my hypothesis if you have additional information. Here is the smart status of my western digital drive just so you can see what I'm talking about: james@ubuntu:~$ sudo smartctl -a /dev/sda smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF INFORMATION SECTION === Device Model: WDC WD1001FALS-00E3A0 Serial Number: WD-WCATR0258512 Firmware Version: 05.01D05 User Capacity: 1,000,204,886,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Thu Jun 10 19:52:28 2010 PDT SMART support is: Available - device has SMART capability. 
SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 179 175 021 Pre-fail Always - 4033 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 270 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 1468 10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 262 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 46 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 223 194 Temperature_Celsius 0x0022 105 102 000 Old_age Always - 42 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

    Read the article

  • Hard Disk DRDY error: is it a crash

    - by pranjal
    I am using IBM Thinkpad, 1.7GHz, 512 RAM with Linux Mint 9 installed. I have two partitions in addition to root. One of the partitions became read-only yesterday, after which I rebooted my system. It is extremely slow along with DRDY Error : Is my Hard disk crashed ? Error Log while booting. Differences between boot sector and its backup. failed command : READ DMA BMDMA : stat 0X25 ata 1.00 : status : { DRDY ERR } ata 1.00 : status :{ UNC } Buffer I/O error on logical device, logical block 65467 smartctl output for the partition: mint mint # smartctl -a /dev/sda1 smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF INFORMATION SECTION === Device Model: TOSHIBA MK4026GAX RoHS Serial Number: X5LY1623T Firmware Version: PA107E User Capacity: 40,007,761,920 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 6 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Thu Feb 17 06:48:25 2011 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 153) seconds. Offline data collection capabilities: (0x1b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. No Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. 
Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. No General Purpose Logging support. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 30) minutes. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0 2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0 3 Spin_Up_Time 0x0027 100 100 001 Pre-fail Always - 310 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 3968 5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 40 7 Seek_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0 8 Seek_Time_Performance 0x0005 100 100 050 Pre-fail Offline - 0 9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 7257 10 Spin_Retry_Count 0x0033 179 100 030 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3484 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 489 193 Load_Cycle_Count 0x0032 064 064 000 Old_age Always - 367150 194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 36 (Lifetime Min/Max 14/57) 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 33 197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 82 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 1 199 UDMA_CRC_Error_Count 0x0032 200 253 000 Old_age Always - 0 220 Disk_Shift 0x0002 100 100 000 Old_age Always - 101 222 Loaded_Hours 0x0032 085 085 000 Old_age Always - 6146 223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 0 224 Load_Friction 0x0022 100 100 000 Old_age Always - 0 226 Load-in_Time 0x0026 100 100 000 Old_age Always - 227 240 Head_Flying_Hours 0x0001 100 100 001 Pre-fail Offline - 0 SMART Error Log Version: 1 ATA Error Count: 2371 (device log contains only the most recent five errors) CR = Command Register 
    [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 2371 occurred at disk power-on lifetime: 7256 hours (302 days + 8 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 05 1a 1b 00 e0  Error: UNC 5 sectors at LBA = 0x00001b1a = 6938

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      c8 00 05 1a 1b 00 e0 00      00:03:10.061  READ DMA
      f8 00 00 00 00 00 e0 00      00:03:10.061  READ NATIVE MAX ADDRESS
      ec 00 00 00 00 00 a0 02      00:03:10.053  IDENTIFY DEVICE
      ef 03 45 00 00 00 a0 02      00:03:10.053  SET FEATURES [Set transfer mode]
      f8 00 00 00 00 00 e0 00      00:03:10.053  READ NATIVE MAX ADDRESS

    Error 2370 occurred at disk power-on lifetime: 7256 hours (302 days + 8 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 05 1a 1b 00 e0  Error: UNC 5 sectors at LBA = 0x00001b1a = 6938

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      c8 00 05 1a 1b 00 e0 00      00:03:03.328  READ DMA
      f8 00 00 00 00 00 e0 00      00:03:03.327  READ NATIVE MAX ADDRESS
      ec 00 00 00 00 00 a0 02      00:03:03.320  IDENTIFY DEVICE
      ef 03 45 00 00 00 a0 02      00:03:03.319  SET FEATURES [Set transfer mode]
      f8 00 00 00 00 00 e0 00      00:03:03.319  READ NATIVE MAX ADDRESS

    Error 2369 occurred at disk power-on lifetime: 7256 hours (302 days + 8 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 05 1a 1b 00 e0  Error: UNC 5 sectors at LBA = 0x00001b1a = 6938

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      c8 00 05 1a 1b 00 e0 00      00:02:56.582  READ DMA
      f8 00 00 00 00 00 e0 00      00:02:56.582  READ NATIVE MAX ADDRESS
      ec 00 00 00 00 00 a0 02      00:02:56.574  IDENTIFY DEVICE
      ef 03 45 00 00 00 a0 02      00:02:56.574  SET FEATURES [Set transfer mode]
      f8 00 00 00 00 00 e0 00      00:02:56.574  READ NATIVE MAX ADDRESS

    Error 2368 occurred at disk power-on lifetime: 7256 hours (302 days + 8 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 05 1a 1b 00 e0  Error: UNC 5 sectors at LBA = 0x00001b1a = 6938

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      c8 00 05 1a 1b 00 e0 00      00:02:49.809  READ DMA
      f8 00 00 00 00 00 e0 00      00:02:49.809  READ NATIVE MAX ADDRESS
      ec 00 00 00 00 00 a0 02      00:02:49.801  IDENTIFY DEVICE
      ef 03 45 00 00 00 a0 02      00:02:49.801  SET FEATURES [Set transfer mode]
      f8 00 00 00 00 00 e0 00      00:02:49.801  READ NATIVE MAX ADDRESS

    Error 2367 occurred at disk power-on lifetime: 7256 hours (302 days + 8 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 05 1a 1b 00 e0  Error: UNC 5 sectors at LBA = 0x00001b1a = 6938

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      c8 00 05 1a 1b 00 e0 00      00:02:43.056  READ DMA
      f8 00 00 00 00 00 e0 00      00:02:43.056  READ NATIVE MAX ADDRESS
      ec 00 00 00 00 00 a0 02      00:02:43.048  IDENTIFY DEVICE
      ef 03 45 00 00 00 a0 02      00:02:43.048  SET FEATURES [Set transfer mode]
      f8 00 00 00 00 00 e0 00      00:02:43.047  READ NATIVE MAX ADDRESS

    SMART Self-test log structure revision number 1
    No self-tests have been logged. [To run self-tests, use: smartctl -t]

    Device does not support Selective Self Tests/Logging

    Do I need to get a new hard disk for my PC?
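For readers decoding the error entries quoted above: in 28-bit LBA addressing the failing address is assembled from the register dump, with SN holding bits 0-7, CL bits 8-15, CH bits 16-23 and the low nibble of DH bits 24-27. The sketch below is only an illustration; the function name `lba28_from_registers` is my own and is not part of smartctl. It rebuilds the LBA from the register values in the log so it can be compared with the value smartctl printed.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Assemble a 28-bit LBA from ATA task-file register values.

    sn, cl, ch are the Sector Number, Cylinder Low and Cylinder High
    registers; only the low four bits of dh (Device/Head) carry LBA bits.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the error entries in the quoted log.
lba = lba28_from_registers(sn=0x1A, cl=0x1B, ch=0x00, dh=0xE0)
print(f"LBA = 0x{lba:08x} = {lba}")
```

All five logged errors carry the same register values, so they refer to the same address: the READ DMA command keeps failing with UNC (uncorrectable) at LBA 6938.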

    Read the article

  • Strange Recurrent Excessive I/O Wait

    - by Chris
    I know quite well that I/O wait has been discussed multiple times on this site, but all the other topics seem to cover constant I/O latency, while the I/O problem we need to solve on our server occurs at irregular (short) intervals, but is ever-present, with massive spikes of up to 20k ms await and service times of 2 seconds. The disk affected is /dev/sdb (Seagate Barracuda, for details see below).

    A typical iostat -x output would at times look like this, which is an extreme sample but by no means rare:

    iostat (Oct 6, 2013)
        tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await    svctm   %util
       0.00      0.00      0.00      0.00      0.00      0.00     0.00    0.00
       0.00      0.00      0.00      0.00      0.00      0.00     0.00    0.00
      16.00      0.00    156.00      9.75     21.89    288.12    36.00   57.60
       5.50      0.00     44.00      8.00     48.79   2194.18   181.82  100.00
       2.00      0.00     16.00      8.00     46.49   3397.00   500.00  100.00
       4.50      0.00     40.00      8.89     43.73   5581.78   222.22  100.00
      14.50      0.00    148.00     10.21     13.76   5909.24    68.97  100.00
       1.50      0.00     12.00      8.00      8.57   7150.67   666.67  100.00
       0.50      0.00      4.00      8.00      6.31  10168.00  2000.00  100.00
       2.00      0.00     16.00      8.00      5.27  11001.00   500.00  100.00
       0.50      0.00      4.00      8.00      2.96  17080.00  2000.00  100.00
      34.00      0.00   1324.00      9.88      1.32    137.84     4.45   59.60
       0.00      0.00      0.00      0.00      0.00      0.00     0.00    0.00
      22.00     44.00    204.00     11.27      0.01      0.27     0.27    0.60

    Let me provide you with some more information regarding the hardware. It's a Dell 1950 III box with Debian as OS where uname -a reports the following:

    Linux xx 2.6.32-5-amd64 #1 SMP Fri Feb 15 15:39:52 UTC 2013 x86_64 GNU/Linux

    The machine is a dedicated server that hosts an online game without any databases or I/O-heavy applications running. The core application consumes about 0.8 of the 8 GBytes of RAM, and the average CPU load is relatively low. The game itself, however, reacts rather sensitively to I/O latency, and thus our players experience massive in-game lag, which we would like to address as soon as possible.
    iostat:

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               1.77    0.01    1.05    1.59    0.00   95.58

    Device:    tps  Blk_read/s  Blk_wrtn/s   Blk_read    Blk_wrtn
    sdb      13.16       25.42      135.12  504701011  2682640656
    sda       1.52        0.74       20.63   14644533   409684488

    Uptime is:

    19:26:26 up 229 days, 17:26, 4 users, load average: 0.36, 0.37, 0.32

    Harddisk controller:

    01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

    Harddisks:

    Array 1, RAID-1, 2x Seagate Cheetah 15K.5 73 GB SAS
    Array 2, RAID-1, 2x Seagate ST3500620SS Barracuda ES.2 500GB 16MB 7200RPM SAS

    Partition information from df:

    Filesystem  1K-blocks      Used  Available  Use%  Mounted on
    /dev/sdb1   480191156  30715200  425083668    7%  /home
    /dev/sda2     7692908    437436    6864692    6%  /
    /dev/sda5    15377820   1398916   13197748   10%  /usr
    /dev/sda6    39159724  19158340   18012140   52%  /var

    Some more data samples generated with iostat -dx sdb 1 (Oct 11, 2013):

    Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz    await    svctm   %util
    sdb        0.00   15.00  0.00   70.00    0.00   656.00      9.37      4.50     1.83     4.80   33.60
    sdb        0.00    0.00  0.00    2.00    0.00    16.00      8.00     12.00   836.00   500.00  100.00
    sdb        0.00    0.00  0.00    3.00    0.00    32.00     10.67      9.96  1990.67   333.33  100.00
    sdb        0.00    0.00  0.00    4.00    0.00    40.00     10.00      6.96  3075.00   250.00  100.00
    sdb        0.00    0.00  0.00    0.00    0.00     0.00      0.00      4.00     0.00     0.00  100.00
    sdb        0.00    0.00  0.00    2.00    0.00    16.00      8.00      2.62  4648.00   500.00  100.00
    sdb        0.00    0.00  0.00    0.00    0.00     0.00      0.00      2.00     0.00     0.00  100.00
    sdb        0.00    0.00  0.00    1.00    0.00    16.00     16.00      1.69  7024.00  1000.00  100.00
    sdb        0.00   74.00  0.00  124.00    0.00  1584.00     12.77      1.09    67.94     6.94   86.00

    Characteristic charts generated with rrdtool can be found here:

    iostat plot 1, 24 min interval: http://imageshack.us/photo/my-images/600/yqm3.png/
    iostat plot 2, 120 min interval: http://imageshack.us/photo/my-images/407/griw.png/

    As we have a rather large cache of 5.5 GBytes, we thought it might be a good idea to test if the I/O wait spikes would perhaps be caused by cache miss events.
    Therefore, we ran sync and then the following to drop the cache and buffers:

    echo 3 > /proc/sys/vm/drop_caches

    Directly afterwards the I/O wait and service times virtually went through the roof, and everything on the machine felt like slow motion. During the next few hours the latency recovered and everything was as before: small to medium lags at short, unpredictable intervals.

    Now my question is: does anybody have any idea what might cause this annoying behaviour? Is it the first indication of the disk array or the RAID controller dying, or something that can be easily mended by rebooting? (At the moment we're very reluctant to do this, however, because we're afraid that the disks might not come back up again.)

    Any help is greatly appreciated.

    Thanks in advance, Chris.

    Edited to add: we do see one or two processes go to 'D' state in top, one of which seems to be kjournald rather frequently. If I'm not mistaken, however, this does not indicate the processes causing the latency, but rather those affected by it - correct me if I'm wrong. Does the information about uninterruptibly sleeping processes help us in any way to address the problem?
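One way to put the 'D' state observation to use is to log which processes are in uninterruptible sleep at the moment a latency spike occurs, and compare that list across spikes. The following is a minimal sketch, assuming a Linux /proc filesystem; the helper names `parse_stat_line` and `d_state_processes` are my own, and nothing here comes from the original post. It reads the state letter from each /proc/<pid>/stat file, where the state follows the command name in parentheses.

```python
import os


def parse_stat_line(line: str):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    lparen = line.index("(")
    rparen = line.rfind(")")      # comm itself may contain ')' or spaces
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 2]      # single state letter follows ") "
    return pid, comm, state


def d_state_processes(proc_root: str = "/proc"):
    """List (pid, comm) for processes currently in uninterruptible sleep."""
    found = []
    if not os.path.isdir(proc_root):
        return found              # not a Linux-style /proc; nothing to scan
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue              # process exited or entry was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

Run from cron or a loop alongside iostat, this gives a timestamped list of blocked processes. As the post suspects, such a list shows who is waiting on I/O, which is not necessarily who is causing it.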
    @Andy Shinn requested smartctl data, here it is:

    smartctl -a -d megaraid,2 /dev/sdb yields:

    smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
    Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    Device: SEAGATE ST3500620SS Version: MS05
    Serial number:
    Device type: disk
    Transport protocol: SAS
    Local Time is: Mon Oct 14 20:37:13 2013 CEST
    Device supports SMART and is Enabled
    Temperature Warning Disabled or Not Supported
    SMART Health Status: OK

    Current Drive Temperature:     20 C
    Drive Trip Temperature:        68 C
    Elements in grown defect list: 0
    Vendor (Seagate) cache information
      Blocks sent to initiator = 1236631092
      Blocks received from initiator = 1097862364
      Blocks read from cache and sent to initiator = 1383620256
      Number of read and write commands whose size <= segment size = 531295338
      Number of read and write commands whose size > segment size = 51986460
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 36556.93
      number of minutes until next internal SMART test = 32

    Error counter log:
               Errors Corrected by           Total    Correction     Gigabytes    Total
                   ECC          rereads/     errors   algorithm      processed    uncorrected
               fast | delayed   rewrites   corrected  invocations   [10^9 bytes]  errors
    read:    509271032       47         0   509271079   509271079      20981.423           0
    write:           0        0         0           0           0       5022.039           0
    verify: 1870931090      196         0  1870931286  1870931286     100558.708           0

    Non-medium error count: 0

    SMART Self-test log
    Num  Test              Status      segment  LifeTime  LBA_first_err [SK ASC ASQ]
         Description                   number   (hours)
    # 1  Background short  Completed        16     36538              - [-   -    -]
    # 2  Background short  Completed        16     36514              - [-   -    -]
    # 3  Background short  Completed        16     36490              - [-   -    -]
    # 4  Background short  Completed        16     36466              - [-   -    -]
    # 5  Background short  Completed        16     36442              - [-   -    -]
    # 6  Background long   Completed        16     36420              - [-   -    -]
    # 7  Background short  Completed        16     36394              - [-   -    -]
    # 8  Background short  Completed        16     36370              - [-   -    -]
    # 9  Background long   Completed        16     36364              - [-   -    -]
    #10  Background short  Completed        16     36361              - [-   -    -]
    #11  Background long   Completed        16         2              - [-   -    -]
    #12  Background short  Completed        16         0              - [-   -    -]

    Long (extended) Self Test duration: 6798 seconds [113.3 minutes]

    smartctl -a -d megaraid,3 /dev/sdb yields:

    smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
    Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    Device: SEAGATE ST3500620SS Version: MS05
    Serial number:
    Device type: disk
    Transport protocol: SAS
    Local Time is: Mon Oct 14 20:37:26 2013 CEST
    Device supports SMART and is Enabled
    Temperature Warning Disabled or Not Supported
    SMART Health Status: OK

    Current Drive Temperature:     19 C
    Drive Trip Temperature:        68 C
    Elements in grown defect list: 0
    Vendor (Seagate) cache information
      Blocks sent to initiator = 288745640
      Blocks received from initiator = 1097848399
      Blocks read from cache and sent to initiator = 1304149705
      Number of read and write commands whose size <= segment size = 527414694
      Number of read and write commands whose size > segment size = 51986460
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 36596.83
      number of minutes until next internal SMART test = 28

    Error counter log:
               Errors Corrected by           Total    Correction     Gigabytes    Total
                   ECC          rereads/     errors   algorithm      processed    uncorrected
               fast | delayed   rewrites   corrected  invocations   [10^9 bytes]  errors
    read:    610862490       44         0   610862534   610862534      20470.133           0
    write:           0        0         0           0           0       5022.480           0
    verify: 2861227413      203         0  2861227616  2861227616     100872.443           0

    Non-medium error count: 1

    SMART Self-test log
    Num  Test              Status      segment  LifeTime  LBA_first_err [SK ASC ASQ]
         Description                   number   (hours)
    # 1  Background short  Completed        16     36580              - [-   -    -]
    # 2  Background short  Completed        16     36556              - [-   -    -]
    # 3  Background short  Completed        16     36532              - [-   -    -]
    # 4  Background short  Completed        16     36508              - [-   -    -]
    # 5  Background short  Completed        16     36484              - [-   -    -]
    # 6  Background long   Completed        16     36462              - [-   -    -]
    # 7  Background short  Completed        16     36436              - [-   -    -]
    # 8  Background short  Completed        16     36412              - [-   -    -]
    # 9  Background long   Completed        16     36404              - [-   -    -]
    #10  Background short  Completed        16     36401              - [-   -    -]
    #11  Background long   Completed        16         2              - [-   -    -]
    #12  Background short  Completed        16         0              - [-   -    -]

    Long (extended) Self Test duration: 6798 seconds [113.3 minutes]
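When comparing several drives behind the same controller, the "Error counter log" rows in the smartctl output above are easier to compare once parsed into named fields. The sketch below is one possible way to do that, not a smartmontools facility; the field names in `FIELDS` and the function `parse_error_counter_rows` are my own labels for the seven numeric columns, and the sample text is copied from the first drive's output.

```python
# Labels for the seven numeric columns of the SCSI/SAS error counter log,
# in the order smartctl prints them (names chosen for this sketch only).
FIELDS = (
    "ecc_fast",
    "ecc_delayed",
    "rereads_rewrites",
    "total_corrected",
    "algorithm_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)


def parse_error_counter_rows(text: str) -> dict:
    """Extract the read/write/verify rows from smartctl -a output text."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 8 and parts[0] in ("read:", "write:", "verify:"):
            values = [float(p) if "." in p else int(p) for p in parts[1:]]
            rows[parts[0].rstrip(":")] = dict(zip(FIELDS, values))
    return rows


# Rows copied from the megaraid,2 output quoted in the post.
sample = """\
read:    509271032       47         0   509271079   509271079      20981.423           0
write:           0        0         0           0           0       5022.039           0
verify: 1870931090      196         0  1870931286  1870931286     100558.708           0
"""

rows = parse_error_counter_rows(sample)
for op, row in rows.items():
    print(op, "delayed ECC:", row["ecc_delayed"],
          "uncorrected:", row["total_uncorrected"])
```

For both drives quoted above, the "Total uncorrected errors" column is 0 for read, write and verify, so this log on its own does not point at unreadable media. The only difference between the two is the non-medium error count (0 versus 1).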

    Read the article
