Hi everyone,
I have a small problem with my HP Microserver Gen8. This is the second time the disk /dev/sda has disappeared completely from the system.
To explain in more detail: the server has 4 disks and 3 software RAID arrays (2 RAID 1 and 1 RAID 5): md0 (RAID 1), md1 (RAID 1) and md2 (RAID 5).
All 4 disks are partitioned the same way, with 4 partitions each: sdx1, sdx2, sdx3, sdx4.
The problem is that at some point /dev/sda no longer shows up: the disk vanishes completely from the system, and I get an email like this:
This is an automatically generated mail message from mdadm running on server
A Fail event had been detected on md device /dev/md/0.
It could be related to component device /dev/sda2.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid5 sda4[0](F) sdd4[4] sdc4[2] sdb4[1]
16106075136 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
bitmap: 7/40 pages [28KB], 65536KB chunk
md1 : active raid1 sda3[0](F) sdd3[3] sdc3[2] sdb3[1]
488150016 blocks super 1.2 [4/3] [_UUU]
bitmap: 3/4 pages [12KB], 65536KB chunk
md0 : active raid1 sda2[4](F) sdd2[3] sdc2[2] sdb2[1]
2927616 blocks super 1.2 [4/3] [_UUU]
unused devices: <none>
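To read that kind of output at a glance, here is a minimal sketch that lists the degraded arrays and the members flagged (F). It assumes the /proc/mdstat layout shown in the mail above; the function names `degraded_arrays` and `failed_members` are made up for illustration.

```shell
#!/bin/sh
# List md arrays with a missing member, reading /proc/mdstat-style text
# on stdin. A "_" in the trailing status field (e.g. [_UUU]) marks a
# missing or failed member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }               # remember the current array
        name != "" && $NF ~ /^\[[U_]+\]$/ {       # status line of that array
            if ($NF ~ /_/) print name, $NF
            name = ""
        }
    '
}

# List member devices flagged as failed, e.g. "sda4[0](F)" -> "sda4".
failed_members() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /\(F\)$/) { sub(/\[.*/, "", $i); print $i }
    }'
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
#   failed_members  < /proc/mdstat
```

On the output quoted above, this should report md2, md1 and md0 as degraded and sda4, sda3 and sda2 as failed.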
And in the logs there are lots of messages like these:
Jul 5 20:45:37 server smartd[894]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 153 to 139
Jul 5 20:45:37 server smartd[894]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 58
Jul 5 20:45:37 server smartd[894]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 39 to 42
Jul 5 20:45:37 server smartd[894]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 29 to 28
Jul 5 20:45:37 server smartd[894]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 146 to 139
Jul 5 21:15:37 server smartd[894]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 139 to 136
Jul 5 21:15:37 server smartd[894]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 78 to 80
Jul 5 21:15:37 server smartd[894]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 28 to 29
Jul 5 21:45:38 server smartd[894]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 80 to 81
Jul 5 22:15:37 server smartd[894]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 136 to 133
Jul 5 22:15:37 server smartd[894]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 81 to 82
Jul 5 22:15:37 server smartd[894]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 29 to 28
Jul 5 22:32:13 server kernel: [ 9064.526675] ata1.00: exception Emask 0x50 SAct 0x3000 SErr 0x4890800 action 0xe frozen
Jul 5 22:32:13 server kernel: [ 9064.526684] ata1.00: irq_stat 0x08400040, interface fatal error, connection status changed
Jul 5 22:32:13 server kernel: [ 9064.526690] ata1: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Jul 5 22:32:13 server kernel: [ 9064.526697] ata1.00: failed command: WRITE FPDMA QUEUED
Jul 5 22:32:13 server kernel: [ 9064.526709] ata1.00: cmd 61/20:60:a0:74:af/03:00:5c:00:00/40 tag 12 ncq 409600 out
Jul 5 22:32:13 server kernel: [ 9064.526709] res 40/00:00:00:74:af/00:00:5c:00:00/40 Emask 0x50 (ATA bus error)
Jul 5 22:32:13 server kernel: [ 9064.526715] ata1.00: status: { DRDY }
Jul 5 22:32:13 server kernel: [ 9064.526721] ata1.00: failed command: WRITE FPDMA QUEUED
Jul 5 22:32:13 server kernel: [ 9064.526732] ata1.00: cmd 61/30:68:00:74:af/00:00:5c:00:00/40 tag 13 ncq 24576 out
Jul 5 22:32:13 server kernel: [ 9064.526732] res 40/00:00:00:74:af/00:00:5c:00:00/40 Emask 0x50 (ATA bus error)
Jul 5 22:32:13 server kernel: [ 9064.526737] ata1.00: status: { DRDY }
Jul 5 22:32:13 server kernel: [ 9064.526746] ata1: hard resetting link
Jul 5 22:32:13 server kernel: [ 9065.250305] ata1: SATA link down (SStatus 0 SControl 300)
Jul 5 22:32:14 server kernel: [ 9065.518799] ata1: hard resetting link
Jul 5 22:32:24 server kernel: [ 9075.540999] ata1: softreset failed (device not ready)
Jul 5 22:32:24 server kernel: [ 9075.541006] ata1: hard resetting link
Jul 5 22:32:24 server kernel: [ 9076.152696] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 5 22:32:24 server kernel: [ 9076.152983] ata1.00: both IDENTIFYs aborted, assuming NODEV
Jul 5 22:32:24 server kernel: [ 9076.152986] ata1.00: revalidation failed (errno=-2)
Jul 5 22:32:29 server kernel: [ 9081.150132] ata1: hard resetting link
Jul 5 22:32:30 server kernel: [ 9081.657867] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 5 22:32:30 server kernel: [ 9081.658154] ata1.00: both IDENTIFYs aborted, assuming NODEV
Jul 5 22:32:30 server kernel: [ 9081.658158] ata1.00: revalidation failed (errno=-2)
Jul 5 22:32:30 server kernel: [ 9081.658161] ata1.00: disabled
Jul 5 22:32:35 server kernel: [ 9086.655302] ata1: hard resetting link
Jul 5 22:32:35 server kernel: [ 9087.155100] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 5 22:32:35 server kernel: [ 9087.155472] ata1.00: both IDENTIFYs aborted, assuming NODEV
Jul 5 22:32:35 server kernel: [ 9087.163085] sd 0:0:0:0: [sda]
Jul 5 22:32:35 server kernel: [ 9087.163090] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 5 22:32:35 server kernel: [ 9087.163093] sd 0:0:0:0: [sda]
Jul 5 22:32:35 server kernel: [ 9087.163096] Sense Key : Aborted Command [current] [descriptor]
Jul 5 22:32:35 server kernel: [ 9087.163100] Descriptor sense data with sense descriptors (in hex):
Jul 5 22:32:35 server kernel: [ 9087.163102] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jul 5 22:32:35 server kernel: [ 9087.163112] 5c af 74 00
Jul 5 22:32:35 server kernel: [ 9087.163118] sd 0:0:0:0: [sda]
Jul 5 22:32:35 server kernel: [ 9087.163120] Add. Sense: No additional sense information
Jul 5 22:32:35 server kernel: [ 9087.163122] sd 0:0:0:0: [sda] CDB:
Jul 5 22:32:35 server kernel: [ 9087.163124] Write(16): 8a 00 00 00 00 00 5c af 74 a0 00 00 03 20 00 00
Jul 5 22:32:35 server kernel: [ 9087.163137] end_request: I/O error, dev sda, sector 1555002528
Jul 5 22:32:35 server kernel: [ 9087.163188] sd 0:0:0:0: rejecting I/O to offline device
Jul 5 22:32:35 server kernel: [ 9087.163192] sd 0:0:0:0: [sda] killing request
Jul 5 22:32:35 server kernel: [ 9087.163201] sd 0:0:0:0: rejecting I/O to offline device
Jul 5 22:32:35 server kernel: [ 9087.163205] md: super_written gets error=-5, uptodate=0
Jul 5 22:32:35 server kernel: [ 9087.163225] md/raid:md2: Disk failure on sda4, disabling device.
Jul 5 22:32:35 server kernel: [ 9087.163225] md/raid:md2: Operation continuing on 3 devices.
Jul 5 22:32:35 server kernel: [ 9087.163236] md: super_written gets error=-5, uptodate=0
Jul 5 22:32:35 server kernel: [ 9087.163249] sd 0:0:0:0: [sda]
Jul 5 22:32:35 server kernel: [ 9087.163252] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 5 22:32:35 server kernel: [ 9087.163255] sd 0:0:0:0: [sda]
Jul 5 22:32:35 server kernel: [ 9087.163258] Sense Key : Aborted Command [current] [descriptor]
Jul 5 22:32:35 server kernel: [ 9087.163262] Descriptor sense data with sense descriptors (in hex):
Jul 5 22:32:35 server kernel: [ 9087.163264] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jul 5 22:32:35 server kernel: [ 9087.163278] 5c af 74 00
Jul 5 22:32:35 server kernel: [ 9087.163285] sd 0:0:0:0: [sda]
Jul 5 22:32:35 server kernel: [ 9087.163287] Add. Sense: No additional sense information
Jul 5 22:32:35 server kernel: [ 9087.163291] sd 0:0:0:0: [sda] CDB:
Jul 5 22:32:35 server kernel: [ 9087.163293] Write(16): 8a 00 00 00 00 00 5c af 74 00 00 00 00 30 00 00
Jul 5 22:32:35 server kernel: [ 9087.163311] end_request: I/O error, dev sda, sector 1555002368
Jul 5 22:32:35 server kernel: [ 9087.163324] ata1: EH complete
Jul 5 22:32:35 server kernel: [ 9087.163343] ata1.00: detaching (SCSI 0:0:0:0)
Jul 5 22:32:35 server kernel: [ 9087.163376] sd 0:0:0:0: [sda] Unhandled error code
Jul 5 22:32:35 server kernel: [ 9087.163378] sd 0:0:0:0: [sda]
Jul 5 22:32:35 server kernel: [ 9087.163381] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 5 22:32:35 server kernel: [ 9087.163383] sd 0:0:0:0: [sda] CDB:
Jul 5 22:32:35 server kernel: [ 9087.163384] Write(16): 8a 00 00 00 00 00 00 68 50 10 00 00 00 03 00 00
Jul 5 22:32:35 server kernel: [ 9087.163396] end_request: I/O error, dev sda, sector 6836240
Jul 5 22:32:35 server kernel: [ 9087.163398] md: super_written gets error=-5, uptodate=0
Jul 5 22:32:35 server kernel: [ 9087.163402] md/raid1:md1: Disk failure on sda3, disabling device.
Jul 5 22:32:35 server kernel: [ 9087.163402] md/raid1:md1: Operation continuing on 3 devices.
Jul 5 22:32:35 server kernel: [ 9087.169258] sd 0:0:0:0: [sda] Stopping disk
Jul 5 22:32:35 server kernel: [ 9087.169297] sd 0:0:0:0: [sda] START_STOP FAILED
Jul 5 22:32:35 server kernel: [ 9087.169300] sd 0:0:0:0: [sda]
Jul 5 22:32:35 server kernel: [ 9087.169303] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 5 22:32:35 server kernel: [ 9087.258058] RAID conf printout:
Jul 5 22:32:35 server kernel: [ 9087.258064] --- level:5 rd:4 wd:3
Jul 5 22:32:35 server kernel: [ 9087.258068] disk 0, o:0, dev:sda4
Jul 5 22:32:35 server kernel: [ 9087.258070] disk 1, o:1, dev:sdb4
Jul 5 22:32:35 server kernel: [ 9087.258072] disk 2, o:1, dev:sdc4
Jul 5 22:32:35 server kernel: [ 9087.258075] disk 3, o:1, dev:sdd4
Jul 5 22:32:36 server kernel: [ 9087.278140] RAID conf printout:
Jul 5 22:32:36 server kernel: [ 9087.278148] --- level:5 rd:4 wd:3
Jul 5 22:32:36 server kernel: [ 9087.278154] disk 1, o:1, dev:sdb4
Jul 5 22:32:36 server kernel: [ 9087.278158] disk 2, o:1, dev:sdc4
Jul 5 22:32:36 server kernel: [ 9087.278161] disk 3, o:1, dev:sdd4
Jul 5 22:32:36 server kernel: [ 9087.296037] RAID1 conf printout:
Jul 5 22:32:36 server kernel: [ 9087.296042] --- wd:3 rd:4
Jul 5 22:32:36 server kernel: [ 9087.296046] disk 0, wo:1, o:0, dev:sda3
Jul 5 22:32:36 server kernel: [ 9087.296049] disk 1, wo:0, o:1, dev:sdb3
Jul 5 22:32:36 server kernel: [ 9087.296051] disk 2, wo:0, o:1, dev:sdc3
Jul 5 22:32:36 server kernel: [ 9087.296053] disk 3, wo:0, o:1, dev:sdd3
Jul 5 22:32:36 server kernel: [ 9087.308096] RAID1 conf printout:
Jul 5 22:32:36 server kernel: [ 9087.308102] --- wd:3 rd:4
Jul 5 22:32:36 server kernel: [ 9087.308106] disk 1, wo:0, o:1, dev:sdb3
Jul 5 22:32:36 server kernel: [ 9087.308108] disk 2, wo:0, o:1, dev:sdc3
Jul 5 22:32:36 server kernel: [ 9087.308110] disk 3, wo:0, o:1, dev:sdd3
Jul 5 22:45:37 server smartd[894]: Device: /dev/sda [SAT], open() failed: No such device
Jul 5 22:45:37 server smartd[894]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Jul 5 22:45:38 server smartd[894]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Jul 5 22:45:38 server smartd[894]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
Jul 5 22:45:38 server smartd[894]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 28 to 29
Jul 5 23:15:37 server smartd[894]: Device: /dev/sda [SAT], open() failed: No such device
Jul 5 23:15:38 server smartd[894]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 73
Jul 6 19:52:09 server smartd[891]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Jul 6 19:52:09 server smartd[891]: Device: /dev/sda [SAT], opened
Jul 6 19:52:09 server smartd[891]: Device: /dev/sda [SAT], WDC WD6002FFWX-68TZ4N0, S/N:K1G0P1VB, WWN:5-000cca-255c04f0e, FW:83.H0A83, 6.00 TB
Jul 6 19:52:09 server smartd[891]: Device: /dev/sda [SAT], not found in smartd database.
Jul 6 19:52:09 server smartd[891]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Jul 6 19:52:09 server acpid: waiting for events: event logging is off
Jul 6 19:52:09 server smartd[891]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD6002FFWX_68TZ4N0-K1G0P1VB.ata.state
Jul 6 19:52:09 server smartd[891]: Device: /dev/sdb, type changed from 'scsi' to 'sat'
Jul 6 19:52:09 server smartd[891]: Device: /dev/sdb [SAT], opened
Jul 6 19:52:09 server smartd[891]: Device: /dev/sdb [SAT], ST6000VN0011-1UL17Z, S/N:Z4D3VL6G, WWN:5-000c50-090a44602, FW:AN02, 6.00 TB
Jul 6 19:52:09 server smartd[891]: Device: /dev/sdb [SAT], not found in smartd database.
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.ST6000VN0011_1UL17Z-Z4D3VL6G.ata.state
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdc, type changed from 'scsi' to 'sat'
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdc [SAT], opened
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdc [SAT], HGST HDN726060ALE614, S/N:NCGMY3XS, WWN:5-000cca-24dc91063, FW:APGNW7JH, 6.00 TB
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdc [SAT], not found in smartd database.
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdc [SAT], is SMART capable. Adding to "monitor" list.
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdc [SAT], state read from /var/lib/smartmontools/smartd.HGST_HDN726060ALE614-NCGMY3XS.ata.state
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdd, type changed from 'scsi' to 'sat'
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdd [SAT], opened
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdd [SAT], TOSHIBA HDWE160, S/N:26KBK20SF56D, WWN:5-000039-6dbc0152a, FW:FS2A, 6.00 TB
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdd [SAT], not found in smartd database.
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdd [SAT], is SMART capable. Adding to "monitor" list.
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdd [SAT], state read from /var/lib/smartmontools/smartd.TOSHIBA_HDWE160-26KBK20SF56D.ata.state
Jul 6 19:52:10 server smartd[891]: Monitoring 4 ATA and 0 SCSI devices
Jul 6 19:52:10 server smartd[891]: Device: /dev/sda [SAT], open device worked again, warning condition reset after 1 email
Jul 6 19:52:10 server smartd[891]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 133 to 157
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 73 to 77
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 64
Jul 6 19:52:10 server smartd[891]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 42 to 36
Jul 6 19:52:11 server smartd[891]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 139 to 157
Jul 6 19:52:11 server smartd[891]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD6002FFWX_68TZ4N0-K1G0P1VB.ata.state
Jul 6 19:52:11 server smartd[891]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.ST6000VN0011_1UL17Z-Z4D3VL6G.ata.state
Jul 6 19:52:11 server smartd[891]: Device: /dev/sdc [SAT], state written to /var/lib/smartmontools/smartd.HGST_HDN726060ALE614-NCGMY3XS.ata.state
Jul 6 19:52:11 server smartd[891]: Device: /dev/sdd [SAT], state written to /var/lib/smartmontools/smartd.TOSHIBA_HDWE160-26KBK20SF56D.ata.state
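Since the log is long, a small filter helps to see how often each ATA port went through a reset cycle. This is only a sketch: the patterns are copied from the kernel messages quoted above (other kernel versions may word them differently), and `ata_event_summary` is a made-up name.

```shell
#!/bin/sh
# Count libata link events per ATA port from syslog-style lines on stdin.
ata_event_summary() {
    awk '
        match($0, /ata[0-9]+(\.[0-9]+)?: /) {
            port = substr($0, RSTART, RLENGTH - 2)    # drop the trailing ": "
            sub(/\..*/, "", port)                     # ata1.00 -> ata1
            msg = substr($0, RSTART + RLENGTH)
            if (msg ~ /^hard resetting link/)  ev = "hard_reset"
            else if (msg ~ /^SATA link down/)  ev = "link_down"
            else if (msg ~ /^SATA link up/)    ev = "link_up"
            else if (msg ~ /^disabled/)        ev = "disabled"
            else next                                 # ignore other messages
            count[port " " ev]++
        }
        END { for (k in count) print k, count[k] }
    ' | sort
}

# Usage (the log path is an assumption, it varies by distribution):
#   ata_event_summary < /var/log/syslog
```

If one port shows many more resets than the others over several weeks of logs, that at least narrows down which bay, cable or disk to look at.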
If I power the server off and back on, /dev/sda is detected again and I can rebuild the arrays without any problem. I ran a short self-test with smartctl on /dev/sda and apparently everything is fine.
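For anyone in the same situation, here is a rough sketch of what a more thorough check and the re-add could look like. It only prints the commands so they can be reviewed first. The partition-to-array mapping is read off the mdstat output above, `recovery_commands` is a made-up helper name, and using an extended self-test plus `--re-add` is a suggestion, not necessarily what was actually run.

```shell
#!/bin/sh
# Print (do not run) a health check and re-add sequence for one disk.
# Assumed mapping, read off the mdstat output above:
#   sdX2 -> md0, sdX3 -> md1, sdX4 -> md2
recovery_commands() {
    disk=$1                                   # e.g. "sda"
    echo "smartctl -t long /dev/$disk"        # start an extended self-test
    echo "smartctl -l selftest /dev/$disk"    # read the self-test log afterwards
    for pair in 2:md0 3:md1 4:md2; do
        part=${pair%%:*}                      # partition number before the ":"
        array=${pair##*:}                     # array name after the ":"
        echo "mdadm --manage /dev/$array --re-add /dev/$disk$part"
    done
}

# Usage: review the output, then run the lines by hand as root.
#   recovery_commands sda
```

If mdadm refuses `--re-add` for an array (likely for one without a write-intent bitmap), `--add` should work instead, at the cost of a full resync.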
Both times /dev/sda disappeared, the server was copying a lot of data (with rsync) from an external hard drive to /dev/md2 (the RAID 5), but that may just be a coincidence. That said, when I run an rsync command the load average can climb to 5 (which is not great, from what I understand).
Running:
nice -n 19 <rsync command> --bwlimit=20000
that is, lowering the process priority and capping the bandwidth at roughly 20 MB/s, I still get load average peaks of 2, but most of the time it sits a little above 1 (it definitely comes from rsync, because when I stop it the load average drops back to 0.2 or 0.3).
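One more knob on top of nice and --bwlimit is the I/O priority. The wrapper below is a sketch under assumptions: `throttled_rsync`, `DRY_RUN` and `BWLIMIT` are names invented here, and the idle I/O class set by `ionice -c 3` is only honoured by some I/O schedulers, so it may have no effect on a given kernel.

```shell
#!/bin/sh
# Run rsync with lowest CPU priority, idle I/O class and a bandwidth cap.
# Set DRY_RUN=1 to print the command instead of running it.
# BWLIMIT is passed to rsync's --bwlimit (KiB/s); it defaults to 20000.
throttled_rsync() {
    ${DRY_RUN:+echo} ionice -c 3 nice -n 19 \
        rsync --bwlimit="${BWLIMIT:-20000}" "$@"
}

# Usage (options and paths are placeholders):
#   DRY_RUN=1 throttled_rsync <rsync options> <source> <destination>
```

Printing the command first with DRY_RUN=1 makes it easy to check what will actually run before starting a long copy.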
I have not found an effective way to reduce the load average while an rsync command is running.
I mention this because I suspect it is the reason /dev/sda disappears: too much load from rsync on the system as a whole (CPU, RAID, etc.). Is that possible?
Beyond that, if you have any idea how I can identify the cause of this disappearance and prevent it in the future, I would be glad to hear it. If a simple rsync is enough to make the system panic and drop a disk, I would seriously doubt the reliability of the RAID 5 once I start running things like a seedbox or virtual machines.
Thanks for taking the time to read this.