Hello,
I am writing to ask what procedure to follow when a disk reports errors.
I have a home server running Debian 9 with Western Digital Red 3 TB drives. The affected disk has a single primary partition (sdc1), formatted as Btrfs and mounted on /media/backup/.
The smartd daemon regularly sends me emails like these:
Email 1
Device: /dev/sdc [SAT], 1 Offline uncorrectable sectors
Device info:
WDC WD30EFRX-68EUZN0, S/N:WD-WCC4N6AUF094, WWN:5-0014ee-261b9ceef, FW:82.00A82, 3.00 TB
Email 2
The following warning/error was logged by the smartd daemon:
Device: /dev/sdc [SAT], 1 Currently unreadable (pending) sectors
Device info:
WDC WD30EFRX-68EUZN0, S/N:WD-WCC4N6AUF094, WWN:5-0014ee-261b9ceef, FW:82.00A82, 3.00 TB
Email 3
The following warning/error was logged by the smartd daemon:
Device: /dev/sdc [SAT], Self-Test Log error count increased from 5 to 6
Device info:
WDC WD30EFRX-68EUZN0, S/N:WD-WCC4N6AUF094, WWN:5-0014ee-261b9ceef, FW:82.00A82, 3.00 TB
For diagnosis, here is the output of a few commands:
# btrfs scrub status /media/backup/
scrub status for 787da89b-5590-4523-8cb4-af471aa50b27
scrub started at Sun Jun 24 12:45:14 2018 and finished after 01:23:33
total bytes scrubbed: 665.96GiB with 6 errors
error details: read=6
corrected errors: 6, uncorrectable errors: 0, unverified errors: 0
# fdisk -l /dev/sdc
Disk /dev/sdc: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: ABDC0329-2DFA-4AEF-96EA-630AACEE5C7E
Device Start End Sectors Size Type
/dev/sdc1 2048 5860532223 5860530176 2.7T Linux filesystem
# smartctl -a /dev/sdc
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N6AUF094
LU WWN Device Id: 5 0014ee 261b9ceef
Firmware Version: 82.00A82
User Capacity: 3 000 592 982 016 bytes [3,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Jun 27 12:55:43 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 113) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (39840) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 399) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 190
3 Spin_Up_Time 0x0027 186 179 021 Pre-fail Always - 5666
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 22
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 073 073 000 Old_age Always - 20053
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 22
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 4
193 Load_Cycle_Count 0x0032 187 187 000 Old_age Always - 39740
194 Temperature_Celsius 0x0022 118 111 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 1
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 68
SMART Error Log Version: 1
ATA Error Count: 8 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 8 occurred at disk power-on lifetime: 19559 hours (814 days + 23 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 b0 b0 00 e0 Error: UNC at LBA = 0x0000b0b0 = 45232
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
20 00 01 b0 b0 00 e0 08 5d+07:58:21.428 READ SECTOR(S)
Error 7 occurred at disk power-on lifetime: 19559 hours (814 days + 23 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 b0 b0 00 e0 Error: UNC at LBA = 0x0000b0b0 = 45232
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
20 00 01 b0 b0 00 e0 08 5d+07:58:15.003 READ SECTOR(S)
Error 6 occurred at disk power-on lifetime: 19559 hours (814 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 b0 b0 00 e0 Error: UNC at LBA = 0x0000b0b0 = 45232
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
20 00 01 b0 b0 00 e0 08 5d+07:58:09.231 READ SECTOR(S)
Error 5 occurred at disk power-on lifetime: 19559 hours (814 days + 23 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 b0 b0 00 e0 Error: UNC at LBA = 0x0000b0b0 = 45232
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
20 00 01 b0 b0 00 e0 08 5d+07:57:59.410 READ SECTOR(S)
Error 4 occurred at disk power-on lifetime: 19559 hours (814 days + 23 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 b0 b0 00 e0 Error: UNC at LBA = 0x0000b0b0 = 45232
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
20 00 01 b0 b0 00 e0 08 5d+07:57:50.330 READ SECTOR(S)
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 10% 20052 2218848
# 2 Short offline Completed: read failure 60% 20004 14432
# 3 Short offline Completed: read failure 50% 19980 2198496
# 4 Short offline Completed: read failure 50% 19932 2201888
# 5 Short offline Completed: read failure 50% 19884 2208672
# 6 Short offline Completed: read failure 50% 19836 2208672
# 7 Short offline Completed: read failure 50% 19813 2212064
# 8 Extended offline Completed without error 00% 19773 -
# 9 Short offline Completed without error 00% 19765 -
#10 Short offline Completed without error 00% 19717 -
#11 Short offline Completed without error 00% 19669 -
#12 Short offline Completed without error 00% 19645 -
#13 Short offline Completed without error 00% 19597 -
#14 Extended offline Completed without error 00% 19580 -
#15 Extended offline Completed: read failure 90% 19560 45232
#16 Extended offline Aborted by host 90% 19559 -
#17 Extended offline Aborted by host 90% 19559 -
#18 Extended offline Completed: read failure 90% 19558 45232
#19 Short offline Completed: read failure 60% 19549 45232
#20 Short offline Completed without error 00% 19501 -
#21 Short offline Completed without error 00% 19477 -
3 of 10 failed self-tests are outdated by newer successful extended offline self-test # 8
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
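For reference, the LBA_of_first_error values in the self-test log are absolute sector numbers on /dev/sdc, not offsets inside sdc1. Below is a minimal sketch of the conversion, assuming the 512-byte logical sectors and the partition start (2048) and end (5860532223) reported by fdisk above. The function name `partition_offset` is only illustrative. The result is a physical offset inside the partition; as far as I understand, Btrfs would still need to translate it to a logical address before something like `btrfs inspect-internal logical-resolve` could point to a file, and that step is not shown here.

```python
# Sketch: convert an absolute disk LBA (from the smartctl self-test log)
# into a sector index and byte offset inside /dev/sdc1.
# Values below are taken from the fdisk output; adjust for another layout.
SECTOR_SIZE = 512            # logical sector size in bytes
PARTITION_START = 2048       # first sector of /dev/sdc1
PARTITION_END = 5860532223   # last sector of /dev/sdc1


def partition_offset(lba):
    """Return (sector_in_partition, byte_offset_in_partition) for a disk LBA."""
    if not PARTITION_START <= lba <= PARTITION_END:
        raise ValueError(f"LBA {lba} is outside /dev/sdc1")
    sector = lba - PARTITION_START
    return sector, sector * SECTOR_SIZE


if __name__ == "__main__":
    # A few LBAs that appear in the self-test log.
    for lba in (45232, 14432, 2218848):
        sector, offset = partition_offset(lba)
        print(f"LBA {lba}: partition sector {sector}, byte offset {offset}")
```

All failing LBAs in the log are above 2048, so they all fall inside sdc1 rather than in the GPT header area.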
# cat /etc/smartd.conf
DEVICESCAN -a -W 0,38,40 -s (S/../../(1|3|5|7)/12|L/../(01|15|28)/./13) -m c****x@gmail.com -M exec /usr/share/smartmontools/smartd-runner
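To make the `-s` schedule above easier to read: smartd matches the regular expression against a string of the form T/MM/DD/d/HH, where T is the test type (S for short, L for long), d is the day of the week (1 = Monday to 7 = Sunday) and HH is the hour. The following is a rough sketch of that matching in Python, assuming a full-string match; the helper name `due_tests` is hypothetical and this is not how smartd itself is implemented.

```python
# Sketch: check which self-tests the smartd.conf "-s" regex would schedule
# at a given date and hour. Assumes smartd's T/MM/DD/d/HH string format
# and a full-string match of the pattern.
import re
from datetime import datetime

SCHEDULE = r"(S/../../(1|3|5|7)/12|L/../(01|15|28)/./13)"


def due_tests(when):
    """Return the list of test types ('S', 'L') scheduled for this hour."""
    due = []
    for test_type in ("S", "L"):
        stamp = "{}/{}/{}/{}".format(
            test_type,
            when.strftime("%m/%d"),
            when.isoweekday(),      # 1 = Monday ... 7 = Sunday
            when.strftime("%H"),
        )
        if re.fullmatch(SCHEDULE, stamp):
            due.append(test_type)
    return due


if __name__ == "__main__":
    print(due_tests(datetime(2018, 6, 27, 12)))  # a Wednesday at noon
    print(due_tests(datetime(2018, 6, 28, 13)))  # the 28th at 13:00
```

Read this way, the configuration asks for a short test at 12:00 on Mondays, Wednesdays, Fridays and Sundays, and a long test at 13:00 on the 1st, 15th and 28th of each month.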
Here are my questions :-)
1) My disk is not yet 3 years old. Do you think it is dead? (I am not too worried, I have other backups.)
2) I have already run btrfs scrub start /media/backup
Would you recommend running any other commands?
3) Can I find out what the "LBA_of_first_error" sector corresponds to? Is it possible to tell which file it maps to on my Btrfs partition?
4) Why do all these errors appear at once, and why do the long tests detect nothing?
5) Do the bad sectors need to be isolated manually, and if so, how?
Thank you very much for your help. :-)