Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,
I am using DRBD on two storage servers (conveniently called storage1 and
storage2) running Ubuntu 10.04 (Lucid), kernel 2.6.32-52-server
#114-Ubuntu SMP Wed Sep 11 19:06:34 UTC 2013 x86_64 GNU/Linux, with the
DRBD packages shipped by Ubuntu:
drbd8-source 2:8.3.7-1ubuntu2.3
drbd8-utils 2:8.3.7-1ubuntu2.3
A total of five DRBD devices are exported over iSCSI:
iscsitarget 1.4.19+svn275-ubuntu2
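Each target just exports one of the /dev/drbdX devices directly; the
relevant stanza in /etc/iet/ietd.conf looks roughly like this (the IQN
below is made up for illustration, not our real target name):

Target iqn.2013-10.local.storage:lun2
        # export the DRBD device as a raw block device (no page cache on the target)
        Lun 0 Path=/dev/drbd2,Type=blockio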
Yesterday a disk crashed on the storage server that was running as the
secondary node. I am using protocol C and have all resources configured
with on-io-error detach, so I would expect DRBD on the secondary to
detach from the failed disk and carry on diskless. What actually
happened was that the primary storage became unreachable for a while,
and all iSCSI initiators dropped their connections.
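To illustrate what I mean by that expectation: detaching the backing
device by hand on the secondary should leave the resource running
diskless while the primary keeps serving I/O, along these lines (using
r2, the resource that was hit, as the example):

# on storage2, the secondary
$ drbdadm detach r2     # drop the backing disk, keep the replication link up
$ cat /proc/drbd        # should now show ds:Diskless/UpToDate for this device
$ drbdadm attach r2     # reattach once the disk has been replaced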
Storage1 was running as the primary node for all resources:
$ cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@storage1, 2013-10-14 18:02:27
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:225119520 nr:29521088 dw:254906008 dr:153349220 al:1245808 bm:5596 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:300036368 nr:235743136 dw:536263072 dr:169442488 al:1137112 bm:6453 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:453141460 nr:560604808 dw:1013564376 dr:1210927948 al:3837751 bm:27775 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:20692932 nr:1109684 dw:21805356 dr:3662516 al:13046 bm:278 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:15837132 nr:27042340 dw:41799904 dr:11196128 al:7989 bm:589 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
Storage2 was running as secondary:
$ drbd-overview
0:r0 Connected Secondary/Primary UpToDate/UpToDate C r----
1:r1 Connected Secondary/Primary UpToDate/UpToDate C r----
2:r2 Connected Secondary/Primary UpToDate/UpToDate C r----
3:r3 Connected Secondary/Primary UpToDate/UpToDate C r----
4:r4 Connected Secondary/Primary UpToDate/UpToDate C r----
Then a disk on storage2 failed:
Oct 29 14:07:42 storage2 kernel: [450726.516971] sd 1:0:0:0: [sdb] Unhandled sense code
Oct 29 14:07:42 storage2 kernel: [450726.516976] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 29 14:07:42 storage2 kernel: [450726.516981] sd 1:0:0:0: [sdb] Sense Key : Hardware Error [current]
Oct 29 14:07:42 storage2 kernel: [450726.516985] sd 1:0:0:0: [sdb] Add. Sense: Internal target failure
Oct 29 14:07:42 storage2 kernel: [450726.516991] sd 1:0:0:0: [sdb] CDB: Write(16): 8a 00 00 00 00 01 00 50 13 60 00 00 00 08 00 00
Oct 29 14:07:42 storage2 kernel: [450726.517278] block drbd2: write: error=-5 s=4299164128s
Oct 29 14:07:42 storage2 kernel: [450726.517283] block drbd2: disk( UpToDate -> Failed )
Oct 29 14:07:42 storage2 kernel: [450726.565989] sd 1:0:0:0: [sdb] Unhandled sense code
Oct 29 14:07:42 storage2 kernel: [450726.565992] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 29 14:07:42 storage2 kernel: [450726.565995] sd 1:0:0:0: [sdb] Sense Key : Hardware Error [current]
Oct 29 14:07:42 storage2 kernel: [450726.565998] sd 1:0:0:0: [sdb] Add. Sense: Internal target failure
Oct 29 14:07:42 storage2 kernel: [450726.566002] sd 1:0:0:0: [sdb] CDB: Write(16): 8a 00 00 00 00 01 00 50 13 38 00 00 00 08 00 00
Oct 29 14:07:42 storage2 kernel: [450726.566245] block drbd2: write: error=-5 s=4299164088s
So DRBD does notice the write errors on the secondary and marks the
disk as Failed. At the same time, the logs on the primary node storage1
show:
Oct 29 14:07:42 storage1 kernel: [710356.374508] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.423171] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.448183] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.448677] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.449177] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:47 storage1 kernel: [710361.376116] block drbd2: 2109 messages suppressed in /var/lib/dkms/drbd8/8.3.7/build/drbd/drbd_receiver.c:4249.
Oct 29 14:07:47 storage1 kernel: [710361.376120] block drbd2: Got NegAck packet. Peer is in troubles?
...
Oct 29 14:09:08 storage1 kernel: [710442.699897] iscsi_trgt: cmnd_abort(1167) 6c000010 1 0 42 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.712764] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Unknown Task)
Oct 29 14:09:08 storage1 kernel: [710442.713092] iscsi_trgt: cmnd_abort(1167) 3a000010 1 0 42 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.725665] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Unknown Task)
Oct 29 14:09:08 storage1 kernel: [710442.725874] iscsi_trgt: cmnd_abort(1167) 21000010 1 0 138 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.738107] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Function Complete)
Oct 29 14:09:08 storage1 kernel: [710442.738356] iscsi_trgt: cmnd_abort(1167) 56000010 1 0 138 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.750336] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Function Complete)
Oct 29 14:09:08 storage1 kernel: [710442.750587] iscsi_trgt: cmnd_abort(1167) 24000010 1 0 138 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.762244] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Function Complete)
My question is: why didn't DRBD fully detach the failed disk on the
secondary node, instead of stalling the primary? How can I improve this
behaviour? I have included the relevant configuration files below; if
you need more information, I would be happy to provide it.
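For reference, if the detach had worked the way I expected, I assume
the state on storage2 would have looked something like this (r2 being
the resource on the failed disk):

$ drbdadm cstate r2
Connected
$ drbdadm dstate r2
Diskless/UpToDate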
Cheers,
Sebastiaan
r2.res
----
resource r2 {
    protocol C;

    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer      "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }

    startup {
        wfc-timeout      180;
        degr-wfc-timeout 120;
    }

    disk {
        on-io-error detach;
    }

    net {
        timeout      40;    # (unit = 0.1 seconds)
        connect-int  10;    # (unit = 1 second)
        ping-int      5;    # (unit = 1 second)
        ping-timeout  5;
        max-buffers  2048;
        after-sb-0pri discard-younger-primary;
        after-sb-1pri consensus;
        after-sb-2pri disconnect;
        rr-conflict   disconnect;
    }

    syncer {
        rate 60M;
        al-extents 257;
    }

    on storage1 {
        device    /dev/drbd2;
        disk      /dev/mapper/vg02-lun2;
        address   10.0.0.5:7789;
        meta-disk /dev/mapper/vg02-lun2meta [0];
    }

    on storage2 {
        device    /dev/drbd2;
        disk      /dev/mapper/vg02-lun2;
        address   10.0.0.6:7789;
        meta-disk /dev/mapper/vg02-lun2meta [0];
    }
}
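(If I read the drbd.conf man page correctly, the net timeouts above
work out as follows: timeout 40 means 40 x 0.1 s = 4 s without an ACK
before the connection is declared broken, ping-int 5 sends a keep-alive
ping every 5 s, and ping-timeout 5 gives the peer 5 x 0.1 s = 0.5 s to
answer such a ping.)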
global_common.conf
----
global {
    usage-count no;
    # minor-count dialog-refresh disable-ip-verification
}

common {
    protocol C;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }

    startup {
        # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
    }

    disk {
        # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
        # no-disk-drain no-md-flushes max-bio-bvecs
    }

    net {
        # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
        # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
        # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
    }

    syncer {
        # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        rate 60M;
    }
}
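Since the common section and the per-resource files are merged at parse
time, the effective configuration for a single resource can be
double-checked with:

$ drbdadm dump r2

which should show the on-io-error detach setting from r2.res in effect.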
$ pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda3  vg00 lvm2 a-   811.84g    0
  /dev/sda4  vg01 lvm2 a-   845.38g    0
  /dev/sdb1  vg02 lvm2 a-     3.63t    0
  /dev/sdc1  vg03 lvm2 a-     1.09t    0
  /dev/sdd1  vg04 lvm2 a-     1.09t    0
$ lvs
  LV       VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  lun0     vg00 -wi-ao 811.34g
  lun0meta vg00 -wi-ao 512.00m
  lun1     vg01 -wi-ao 844.88g
  lun1meta vg01 -wi-ao 512.00m
  lun2     vg02 -wi-ao   3.63t
  lun2meta vg02 -wi-ao 512.00m
  lun3     vg03 -wi-ao   1.09t
  lun3meta vg03 -wi-ao 512.00m
  lun4     vg04 -wi-ao   1.09t
  lun4meta vg04 -wi-ao 512.00m