Hi,

I am using DRBD on two storage servers (conveniently called storage1 and storage2) running Ubuntu 10.04 (Lucid), kernel 2.6.32-52-server #114-Ubuntu SMP Wed Sep 11 19:06:34 UTC 2013 x86_64 GNU/Linux. The DRBD packages from Ubuntu are being used:

  drbd8-source 2:8.3.7-1ubuntu2.3
  drbd8-utils  2:8.3.7-1ubuntu2.3

A total of 5 DRBD devices are being exported using iSCSI:

  iscsitarget 1.4.19+svn275-ubuntu2

Yesterday a disk crashed on the storage server that was running as the secondary node. I am using protocol C and have all resources configured with on-io-error detach, so I would expect DRBD on the secondary to detach the failing disk and continue, leaving the primary unaffected. What really happened was that the primary storage became unreachable for a while, and all iSCSI initiators dropped their connections.

storage1 was running as the primary node for all resources:

$ cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@storage1, 2013-10-14 18:02:27
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:225119520 nr:29521088 dw:254906008 dr:153349220 al:1245808 bm:5596 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:300036368 nr:235743136 dw:536263072 dr:169442488 al:1137112 bm:6453 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:453141460 nr:560604808 dw:1013564376 dr:1210927948 al:3837751 bm:27775 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:20692932 nr:1109684 dw:21805356 dr:3662516 al:13046 bm:278 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:15837132 nr:27042340 dw:41799904 dr:11196128 al:7989 bm:589 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

storage2 was running as secondary:

$ drbd-overview
  0:r0  Connected Secondary/Primary UpToDate/UpToDate C r----
  1:r1  Connected Secondary/Primary UpToDate/UpToDate C r----
  2:r2  Connected Secondary/Primary UpToDate/UpToDate C r----
  3:r3  Connected Secondary/Primary UpToDate/UpToDate C r----
  4:r4  Connected Secondary/Primary UpToDate/UpToDate C r----

Then, on storage2, a disk got corrupted:

Oct 29 14:07:42 storage2 kernel: [450726.516971] sd 1:0:0:0: [sdb] Unhandled sense code
Oct 29 14:07:42 storage2 kernel: [450726.516976] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 29 14:07:42 storage2 kernel: [450726.516981] sd 1:0:0:0: [sdb] Sense Key : Hardware Error [current]
Oct 29 14:07:42 storage2 kernel: [450726.516985] sd 1:0:0:0: [sdb] Add. Sense: Internal target failure
Oct 29 14:07:42 storage2 kernel: [450726.516991] sd 1:0:0:0: [sdb] CDB: Write(16): 8a 00 00 00 00 01 00 50 13 60 00 00 00 08 00 00
Oct 29 14:07:42 storage2 kernel: [450726.517278] block drbd2: write: error=-5 s=4299164128s
Oct 29 14:07:42 storage2 kernel: [450726.517283] block drbd2: disk( UpToDate -> Failed )
Oct 29 14:07:42 storage2 kernel: [450726.565989] sd 1:0:0:0: [sdb] Unhandled sense code
Oct 29 14:07:42 storage2 kernel: [450726.565992] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 29 14:07:42 storage2 kernel: [450726.565995] sd 1:0:0:0: [sdb] Sense Key : Hardware Error [current]
Oct 29 14:07:42 storage2 kernel: [450726.565998] sd 1:0:0:0: [sdb] Add. Sense: Internal target failure
Oct 29 14:07:42 storage2 kernel: [450726.566002] sd 1:0:0:0: [sdb] CDB: Write(16): 8a 00 00 00 00 01 00 50 13 38 00 00 00 08 00 00
Oct 29 14:07:42 storage2 kernel: [450726.566245] block drbd2: write: error=-5 s=4299164088s

So DRBD does notice the write errors on this node, and the disk state goes from UpToDate to Failed. At the same time, the error logs on the primary node storage1 show:

Oct 29 14:07:42 storage1 kernel: [710356.374508] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.423171] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.448183] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.448677] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:42 storage1 kernel: [710356.449177] block drbd2: Got NegAck packet. Peer is in troubles?
Oct 29 14:07:47 storage1 kernel: [710361.376116] block drbd2: 2109 messages suppressed in /var/lib/dkms/drbd8/8.3.7/build/drbd/drbd_receiver.c:4249.
Oct 29 14:07:47 storage1 kernel: [710361.376120] block drbd2: Got NegAck packet. Peer is in troubles?
...
Oct 29 14:09:08 storage1 kernel: [710442.699897] iscsi_trgt: cmnd_abort(1167) 6c000010 1 0 42 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.712764] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Unknown Task)
Oct 29 14:09:08 storage1 kernel: [710442.713092] iscsi_trgt: cmnd_abort(1167) 3a000010 1 0 42 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.725665] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Unknown Task)
Oct 29 14:09:08 storage1 kernel: [710442.725874] iscsi_trgt: cmnd_abort(1167) 21000010 1 0 138 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.738107] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Function Complete)
Oct 29 14:09:08 storage1 kernel: [710442.738356] iscsi_trgt: cmnd_abort(1167) 56000010 1 0 138 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.750336] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Function Complete)
Oct 29 14:09:08 storage1 kernel: [710442.750587] iscsi_trgt: cmnd_abort(1167) 24000010 1 0 138 8192 0 0
Oct 29 14:09:08 storage1 kernel: [710442.762244] iscsi_trgt: Abort Task (01) issued on tid:3 lun:2 by sid:19421773464142336 (Function Complete)

My question is: why didn't DRBD detach the failed disk on the secondary node and leave the primary unaffected? How could I improve this behaviour?

I have added the relevant configuration files below; if you need more information, I would be happy to provide it.
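One thing I am considering, as a sketch only (I have not verified this against 8.3.7), is setting ko-count in the net section, so that the primary expels a peer that cannot complete a single write request within ko-count * timeout instead of stalling while NegAcks pile up:

```
# Untested sketch for r2.res: with timeout 40 (= 4 s, as already configured)
# and ko-count 4, a peer that cannot finish one write within 16 s would be
# disconnected and marked outdated; ko-count 0 disables this behaviour.
net {
    timeout  40;   # unit = 0.1 seconds, i.e. 4 s
    ko-count  4;
}
```

Whether this would actually have helped here, or whether the secondary should simply have detached on its own given on-io-error detach, is exactly what I am unsure about.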
Cheers,
Sebastiaan

r2.res
----
resource r2 {
  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer      "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    wfc-timeout      180;
    degr-wfc-timeout 120;
  }

  disk {
    on-io-error detach;
  }

  net {
    timeout      40;  # (unit = 0.1 seconds)
    connect-int  10;  # (unit = 1 seconds)
    ping-int      5;  # (unit = 1 seconds)
    ping-timeout  5;
    max-buffers 2048;
    after-sb-0pri discard-younger-primary;
    after-sb-1pri consensus;
    after-sb-2pri disconnect;
    rr-conflict   disconnect;
  }

  syncer {
    rate 60M;
    al-extents 257;
  }

  on storage1 {
    device    /dev/drbd2;
    disk      /dev/mapper/vg02-lun2;
    address   10.0.0.5:7789;
    meta-disk /dev/mapper/vg02-lun2meta [0];
  }

  on storage2 {
    device    /dev/drbd2;
    disk      /dev/mapper/vg02-lun2;
    address   10.0.0.6:7789;
    meta-disk /dev/mapper/vg02-lun2meta [0];
  }
}

global_common.conf
----
global {
  usage-count no;
  # minor-count dialog-refresh disable-ip-verification
}

common {
  protocol C;

  handlers {
    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
    # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
  }

  startup {
    # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb;
  }

  disk {
    # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
    # no-disk-drain no-md-flushes max-bio-bvecs
  }

  net {
    # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
    # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
    # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
  }

  syncer {
    # rate after al-extents use-rle cpu-mask verify-alg csums-alg
    rate 60M;
  }
}

$ pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda3  vg00 lvm2 a-   811.84g    0
  /dev/sda4  vg01 lvm2 a-   845.38g    0
  /dev/sdb1  vg02 lvm2 a-     3.63t    0
  /dev/sdc1  vg03 lvm2 a-     1.09t    0
  /dev/sdd1  vg04 lvm2 a-     1.09t    0

$ lvs
  LV       VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  lun0     vg00 -wi-ao 811.34g
  lun0meta vg00 -wi-ao 512.00m
  lun1     vg01 -wi-ao 844.88g
  lun1meta vg01 -wi-ao 512.00m
  lun2     vg02 -wi-ao   3.63t
  lun2meta vg02 -wi-ao 512.00m
  lun3     vg03 -wi-ao   1.09t
  lun3meta vg03 -wi-ao 512.00m
  lun4     vg04 -wi-ao   1.09t
  lun4meta vg04 -wi-ao 512.00m
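P.S. Independent of how DRBD itself reacts, a trivial watchdog over /proc/drbd would at least have flagged the Failed disk on storage2 immediately. A minimal sketch (the drbd_check name and the sample input are mine; in production it would read /proc/drbd directly):

```shell
# Print every DRBD minor whose disk state (the ds: field) is not
# the healthy UpToDate/UpToDate, e.g. "2: ds:UpToDate/Failed".
drbd_check() {
    awk '/^ *[0-9]+: cs:/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^ds:/ && $i != "ds:UpToDate/UpToDate")
                    print $1, $i
         }' "$@"
}

# Example against the states seen above (real use: drbd_check < /proc/drbd):
printf ' 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----\n 2: cs:Connected ro:Primary/Secondary ds:UpToDate/Failed C r----\n' | drbd_check
# prints: 2: ds:UpToDate/Failed
```

Something like this in a cron job would have raised the alarm at 14:07:42, well before the initiators started aborting tasks.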