Hi,

I am using DRBD on two storage servers (conveniently called storage1 and storage2) running Ubuntu 10.04 (Lucid), kernel 2.6.32-52-server #114-Ubuntu SMP Wed Sep 11 19:06:34 UTC 2013 x86_64 GNU/Linux. The DRBD packages from Ubuntu are being used:

  drbd8-source 2:8.3.7-1ubuntu2.3
  drbd8-utils  2:8.3.7-1ubuntu2.3

A total of 5 DRBD devices are being exported using iSCSI:

  iscsitarget 1.4.19+svn275-ubuntu2

Yesterday a disk crashed on the storage server that was running as the secondary node. I am using protocol C and have all resources configured with on-io-error detach, so I would expect DRBD on the secondary to detach the failing disk and continue, leaving the primary unaffected. What really happened was that the primary storage became unreachable for a while, and all iSCSI initiators dropped their connections.

storage1 was running as the primary node for all resources:

$ cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@storage1, 2013-10-14 18:02:27
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:225119520 nr:29521088 dw:254906008 dr:153349220 al:1245808 bm:5596 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:300036368 nr:235743136 dw:536263072 dr:169442488 al:1137112 bm:6453 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:453141460 nr:560604808 dw:1013564376 dr:1210927948 al:3837751 bm:27775 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:20692932 nr:1109684 dw:21805356 dr:3662516 al:13046 bm:278 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:15837132 nr:27042340 dw:41799904 dr:11196128 al:7989 bm:589 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

storage2 was running as secondary:

$ drbd-overview
  0:r0  Connected Secondary/Primary UpToDate/UpToDate C r----
  1:r1  Connected Secondary/Primary UpToDate/UpToDate C r----
  2:r2  Connected Secondary/Primary UpToDate/UpToDate C r----
  3:r3  Connected Secondary/Primary UpToDate/UpToDate C r----
  4:r4  Connected Secondary/Primary UpToDate/UpToDate C r----

Then, on storage2, a disk got corrupted:

Oct 29 14:07:42 storage2 kernel: [450726.516971] sd 1:0:0:0: [sdb] Unhandled sense code
Oct 29 14:07:42 storage2 kernel: [450726.516976] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 29 14:07:42 storage2 kernel: [450726.516981] sd 1:0:0:0: [sdb] Sense Key : Hardware Error [current]
Oct 29 14:07:42 storage2 kernel: [450726.516985] sd 1:0:0:0: [sdb] Add. Sense: Internal target failure
Oct 29 14:07:42 storage2 kernel: [450726.516991] sd 1:0:0:0: [sdb] CDB: Write(16): 8a 00 00 00 00 01 00 50 13 60 00 00 00 08 00 00
Oct 29 14:07:42 storage2 kernel: [450726.517278] block drbd2: write: error=-5 s=4299164128s
Oct 29 14:07:42 storage2 kernel: [450726.517283] block drbd2: disk( UpToDate -> Failed )
Oct 29 14:07:42 storage2 kernel: [450726.565989] sd 1:0:0:0: [sdb] Unhandled sense code
Oct 29 14:07:42 storage2 kernel: [450726.565992] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 29 14:07:42 storage2 kernel: [450726.565995] sd 1:0:0:0: [sdb] Sense Key : Hardware Error [current]
Oct 29 14:07:42 storage2 kernel: [450726.565998] sd 1:0:0:0: [sdb] Add. Sense: Internal target failure
Oct 29 14:07:42 storage2 kernel: [450726.566002] sd 1:0:0:0: [sdb] CDB: Write(16): 8a 00 00 00 00 01 00 50 13 38 00 00 00 08 00 00
Oct 29 14:07:42 storage2 kernel: [450726.566245] block drbd2: write: error=-5 s=4299164088s

So DRBD does notice the write errors on this node, and the disk state goes from UpToDate to Failed. At the same time, the error logs on the primary node storage1 show:

Oct 29 14:07:42 storage1 kernel: [710356.374508] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.423171] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.448183] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.448677] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.449177] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:47 storage1 kernel: [710361.376116] block drbd2: 2109 messages suppressed in /var/lib/dkms/drbd8/8.3.7/build/drbd/drbd_receiver.c:4249.
Oct 29 14:07:47 storage1 kernel: [710361.376120] block drbd2: Got NegAck packet. Peer is in troubles?
...
Oct 29 14:09:08 storage1 kernel: [710442.699897] iscsi_trgt: cmnd_abort(1167) 6c000010 1 0 42 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.712764] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Unknown Task)
Oct 29 14:09:08 storage1 kernel: [710442.713092] iscsi_trgt: cmnd_abort(1167) 3a000010 1 0 42 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.725665] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Unknown Task)
Oct 29 14:09:08 storage1 kernel: [710442.725874] iscsi_trgt: cmnd_abort(1167) 21000010 1 0 138 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.738107] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Function Complete)
Oct 29 14:09:08 storage1 kernel: [710442.738356] iscsi_trgt: cmnd_abort(1167) 56000010 1 0 138 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.750336] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Function Complete)
Oct 29 14:09:08 storage1 kernel: [710442.750587] iscsi_trgt: cmnd_abort(1167) 24000010 1 0 138 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.762244] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Function Complete)

My question is: why didn't DRBD detach the failed disk on the secondary node and leave the primary unaffected? How could I improve this behaviour?

I have added the relevant configuration files below; if you need more information, I would be happy to provide it.
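One thing I am considering, as a sketch only (I have not verified this against 8.3.7), is setting ko-count in the net section, so that the primary expels a peer that cannot complete a single write request within ko-count * timeout instead of stalling while NegAcks pile up:

```
# Untested sketch for r2.res: with timeout 40 (= 4 s, as already configured)
# and ko-count 4, a peer that cannot finish one write within 16 s would be
# disconnected and marked outdated; ko-count 0 disables this behaviour.
net {
    timeout  40;   # unit = 0.1 seconds, i.e. 4 s
    ko-count  4;
}
```

Whether this would actually have helped here, or whether the secondary should simply have detached on its own given on-io-error detach, is exactly what I am unsure about.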
Cheers,
Sebastiaan

r2.res
----
resource r2 {
  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer      "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    wfc-timeout      180;
    degr-wfc-timeout 120;
  }

  disk {
    on-io-error detach;
  }

  net {
    timeout      40;  # (unit = 0.1 seconds)
    connect-int  10;  # (unit = 1 seconds)
    ping-int      5;  # (unit = 1 seconds)
    ping-timeout  5;
    max-buffers 2048;
    after-sb-0pri discard-younger-primary;
    after-sb-1pri consensus;
    after-sb-2pri disconnect;
    rr-conflict   disconnect;
  }

  syncer {
    rate 60M;
    al-extents 257;
  }

  on storage1 {
    device    /dev/drbd2;
    disk      /dev/mapper/vg02-lun2;
    address   10.0.0.5:7789;
    meta-disk /dev/mapper/vg02-lun2meta [0];
  }

  on storage2 {
    device    /dev/drbd2;
    disk      /dev/mapper/vg02-lun2;
    address   10.0.0.6:7789;
    meta-disk /dev/mapper/vg02-lun2meta [0];
  }
}

global_common.conf
----
global {
  usage-count no;
  # minor-count dialog-refresh disable-ip-verification
}

common {
  protocol C;

  handlers {
    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
    # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
  }

  startup {
    # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb;
  }

  disk {
    # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
    # no-disk-drain no-md-flushes max-bio-bvecs
  }

  net {
    # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
    # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
    # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
  }

  syncer {
    # rate after al-extents use-rle cpu-mask verify-alg csums-alg
    rate 60M;
  }
}

$ pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda3  vg00 lvm2 a-   811.84g    0
  /dev/sda4  vg01 lvm2 a-   845.38g    0
  /dev/sdb1  vg02 lvm2 a-     3.63t    0
  /dev/sdc1  vg03 lvm2 a-     1.09t    0
  /dev/sdd1  vg04 lvm2 a-     1.09t    0

$ lvs
  LV       VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  lun0     vg00 -wi-ao 811.34g
  lun0meta vg00 -wi-ao 512.00m
  lun1     vg01 -wi-ao 844.88g
  lun1meta vg01 -wi-ao 512.00m
  lun2     vg02 -wi-ao   3.63t
  lun2meta vg02 -wi-ao 512.00m
  lun3     vg03 -wi-ao   1.09t
  lun3meta vg03 -wi-ao 512.00m
  lun4     vg04 -wi-ao   1.09t
  lun4meta vg04 -wi-ao 512.00m
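P.S. Independent of how DRBD itself reacts, a trivial watchdog over /proc/drbd would at least have flagged the Failed disk on storage2 immediately. A minimal sketch (the drbd_check name and the sample input are mine; in production it would read /proc/drbd directly):

```shell
# Print every DRBD minor whose disk state (the ds: field) is not
# the healthy UpToDate/UpToDate, e.g. "2: ds:UpToDate/Failed".
drbd_check() {
    awk '/^ *[0-9]+: cs:/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^ds:/ && $i != "ds:UpToDate/UpToDate")
                    print $1, $i
         }' "$@"
}

# Example against the states seen above (real use: drbd_check < /proc/drbd):
printf ' 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----\n 2: cs:Connected ro:Primary/Secondary ds:UpToDate/Failed C r----\n' | drbd_check
# prints: 2: ds:UpToDate/Failed
```

Something like this in a cron job would have raised the alarm at 14:07:42, well before the initiators started aborting tasks.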