[DRBD-user] Disk errors at smartctl -a

chambal 2iow-li6l at dea.spamcon.org
Tue Nov 23 20:55:17 CET 2010


When I do "smartctl -a /dev/sda" on this system, it usually
triggers errors in the system log.

In addition, self-tests never seem to complete, or the log is
misleading.

The platform is CentOS 5.5 running on a Zotac nForce 610i
Mini-ITX board with a Core2 E5400 and an OCZ Vertex2 SSD.

No problems in syslog when smartd was started:

Nov 19 13:12:19 harry smartd[26558]: smartd 5.41 2010-11-15 r3208 [i686-pc-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Nov 19 13:12:19 harry smartd[26558]: Opened configuration file /etc/smartd.conf
Nov 19 13:12:19 harry smartd[26558]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Nov 19 13:12:19 harry smartd[26558]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Nov 19 13:12:19 harry smartd[26558]: Device: /dev/sda [SAT], opened
Nov 19 13:12:19 harry smartd[26558]: Device: /dev/sda [SAT], found in smartd database.
Nov 19 13:12:19 harry smartd[26558]: Device: /dev/sda [SAT], can't monitor Current Pending Sector count - no Attribute 197
Nov 19 13:12:19 harry smartd[26558]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Nov 19 13:12:19 harry smartd[26558]: Monitoring 1 ATA and 0 SCSI devices
Nov 19 13:12:19 harry smartd[26560]: smartd has fork()ed into background mode. New PID=26560.

But when I ran smartctl -a /dev/sda, syslog showed:

Nov 19 13:13:12 harry kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 19 13:13:12 harry kernel: ata1.00: cmd b0/d5:01:06:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Nov 19 13:13:12 harry kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 13:13:12 harry kernel: ata1.00: status: { DRDY }
Nov 19 13:13:12 harry kernel: ata1: hard resetting link
Nov 19 13:13:12 harry kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 19 13:13:12 harry kernel: ata1.00: configured for UDMA/133
Nov 19 13:13:12 harry kernel: sd 0:0:0:0: timing out command, waited 20s
Nov 19 13:13:12 harry kernel: ata1: EH complete
Nov 19 13:13:12 harry kernel: SCSI device sda: 117231408 512-byte hdwr sectors (60022 MB)
Nov 19 13:13:12 harry kernel: sda: Write Protect is off
Nov 19 13:13:12 harry kernel: SCSI device sda: drive cache: write back

I do not see any disk errors like this in any other use of the
system.
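
For reference, the "cmd" line in the kernel log above can be decoded
to see which command timed out. The following is only a rough sketch
in Python, not part of smartmontools or the kernel: the helper names
(parse_taskfile, describe) are made up for illustration, and the
field order is my understanding of how libata prints the taskfile
(command/feature:nsect:lbal:lbam:lbah, then the HOB bytes, then the
device register). Under that assumption, b0/d5 with LBA low 06 is a
SMART READ LOG of log address 06h, the SMART self-test log, which
would match one of the two failed log reads in the smartctl output.

```python
import re

# Order in which libata prints the taskfile registers on a "cmd" line
# (assumption based on the libata error-message format).
FIELDS = ["command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device"]

SMART_COMMAND = 0xB0  # ATA SMART command opcode

# SMART subcommands, selected by the feature register.
SMART_FEATURES = {
    0xD0: "SMART READ DATA",
    0xD4: "SMART EXECUTE OFF-LINE IMMEDIATE",
    0xD5: "SMART READ LOG",
    0xD6: "SMART WRITE LOG",
    0xD8: "SMART ENABLE OPERATIONS",
    0xD9: "SMART DISABLE OPERATIONS",
    0xDA: "SMART RETURN STATUS",
}

# A few well-known SMART log addresses (passed in LBA low).
SMART_LOGS = {
    0x00: "log directory",
    0x01: "summary SMART error log",
    0x06: "SMART self-test log",
    0x09: "selective self-test log",
}


def parse_taskfile(line):
    """Return a dict of register values from a libata 'cmd' line, or None."""
    m = re.search(r"cmd ((?:[0-9a-f]{2}[/:])+[0-9a-f]{2})", line)
    if not m:
        return None
    values = [int(v, 16) for v in re.split(r"[/:]", m.group(1))]
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


def describe(tf):
    """Give a short human-readable description of a parsed taskfile."""
    if tf["command"] != SMART_COMMAND:
        return "non-SMART command 0x%02x" % tf["command"]
    feature = tf["feature"]
    name = SMART_FEATURES.get(feature, "unknown SMART feature 0x%02x" % feature)
    if feature in (0xD5, 0xD6):
        log = SMART_LOGS.get(tf["lbal"], "unknown log")
        return "%s, log 0x%02x (%s)" % (name, tf["lbal"], log)
    return name
```

Feeding it the line "ata1.00: cmd b0/d5:01:06:4f:c2/00:00:00:00:00/00
tag 0 pio 512 in" through parse_taskfile() and then describe() names
the command and the log address being read.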

There was a long pause at the end of the smartctl console output
before it returned to the command prompt.  Here's the output:

[root@harry smartmontools]# smartctl -a /dev/sda
smartctl 5.41 2010-11-15 r3208 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     OCZ-VERTEX2
Serial Number:    OCZ-8F31137N5HO99D5O
Firmware Version: 1.24
User Capacity:    60,022,480,896 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Fri Nov 19 13:16:05 2010 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (   0) seconds.
Offline data collection
capabilities:                    (0x7f) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   099   050    Pre-fail  Always       -       0/17417935
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       180h+48m+28.000s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       28
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   001   129   000    Old_age   Always       -       1 (0 127 0 129)
195 ECC_Uncorr_Error_Count  0x001c   108   099   000    Old_age   Offline      -       0/17417935
196 Reallocated_Event_Count 0x0033   100   100   000    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       384
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       2816
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       2816
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       64

Error SMART Error Log Read failed: Input/output error
Smartctl: SMART Error Log Read Failed
Error SMART Error Self-Test Log Read failed: Input/output error
Smartctl: SMART Self Test Log Read Failed
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I then tried to do self-tests on the drive - first a short test,
then later a long test.

Strangely, when I ran smartctl -a again after that, there were no
syslog errors and no delay!

Yet some days later, running smartctl -a triggers the errors and
the delay again.
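
Since the problem comes and goes, one way to narrow it down might be
to request each log separately and time each request. This is only a
sketch under assumptions: it relies on the existing smartctl options
-l error, -l selftest and -l selective, while the function names
(build_commands, time_command, probe) and the 120-second timeout are
invented for illustration.

```python
import subprocess
import time

# Log types to request one at a time (all valid arguments to smartctl -l).
LOG_TYPES = ["error", "selftest", "selective"]


def build_commands(device):
    """Build one smartctl command line per log type."""
    return [["smartctl", "-l", log, device] for log in LOG_TYPES]


def time_command(cmd, timeout=120):
    """Run cmd and return (exit status, elapsed seconds, combined output)."""
    start = time.monotonic()
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          universal_newlines=True, timeout=timeout)
    return proc.returncode, time.monotonic() - start, proc.stdout


def probe(device, runner=time_command):
    """Time each log read; a slow or failing entry points at the culprit."""
    results = []
    for cmd in build_commands(device):
        code, elapsed, _output = runner(cmd)
        results.append((" ".join(cmd), code, elapsed))
    return results
```

Calling probe("/dev/sda") as root and comparing the elapsed times
against the timestamps in syslog should show whether one particular
log read lines up with the timeout and link reset.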

There is also something odd with the self-test status. Long after
the tests should have completed, the first part of the -a output
suggests that they did complete:

Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.

The extended self-test log, however, shows the following (I used
the extended log because the regular -a output didn't show any
self-test entries):

[root@harry smartmontools]# smartctl -l xselftest /dev/sda
smartctl 5.41 2010-11-15 r3208 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
General Purpose Logging (GPL) feature set supported
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%       181         -
# 2  Short offline       Self-test routine in progress 90%       181         -


Why the bogus in-progress/remaining info?
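
To make the inconsistency concrete, here is a rough Python sketch
that parses the extended self-test log lines shown above and flags
entries that still claim to be in progress even though more time has
passed than the recommended polling time. The function names
(parse_selftest_log, is_stale) are hypothetical, the regular
expression is fitted only to the output format shown in this post,
and the staleness rule (allowing one hour of slack because LifeTime
is reported in whole hours) is my own assumption.

```python
import re

# The two log entries from the smartctl -l xselftest output in this post.
SAMPLE = """\
# 1  Extended offline    Self-test routine in progress 90%       181         -
# 2  Short offline       Self-test routine in progress 90%       181         -
"""

# num, description, status, remaining %, lifetime hours, LBA of first error
ENTRY_RE = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftest_log(text):
    """Return a list of dicts, one per self-test log entry found in text."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            entries.append({
                "num": int(m.group(1)),
                "description": m.group(2),
                "status": m.group(3),
                "remaining": int(m.group(4)),
                "lifetime_hours": int(m.group(5)),
                "lba": m.group(6),
            })
    return entries


def is_stale(entry, power_on_hours, polling_minutes):
    """True if an 'in progress' entry has outlived its polling time.

    LifeTime has whole-hour resolution, so one hour is subtracted from
    the elapsed time to stay on the conservative side.
    """
    if "in progress" not in entry["status"]:
        return False
    elapsed_minutes = (power_on_hours - entry["lifetime_hours"] - 1) * 60
    return elapsed_minutes >= polling_minutes
```

The current power-on hours would come from attribute 9 in the -a
output, and the polling time from the "recommended polling time"
lines (48 minutes for the extended test on this drive).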

If you need me to gather more info, let me know.



