I'd like to preface this by saying that I'm not overly experienced with DRBD; I've been piecing a lot of information together from various documentation sources, including those on the DRBD site.

I have two servers, both with the same hardware configuration, connected to two identical MD1200s. I sync these two servers in primary/primary mode using DRBD with OCFS2. Though they are in primary/primary mode, one of the servers acts as the primary and the other as the secondary. Some tasks that modify the drbd/ocfs2 volume do run on the secondary, but its workload is kept pretty light. Each server has two Ethernet ports: one for its LAN connection, the other for a direct connection between the two servers that is used solely for DRBD and OCFS2 communication.

The primary server, named bellerophon, is a web, database and file storage server, with the database (PostgreSQL) and documents stored on the drbd/ocfs2 volume. The secondary server, named bia, is mainly used as a hot backup.

Up until May 29th, we had very good success with this setup, and have even deployed another server pair with the same configuration, though for a different purpose.
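To confirm that both nodes actually see the dual-primary state, I check the connection state and roles reported by /proc/drbd. Here is a small sketch of that check; the awk parsing and the optional file argument are just my own convenience (the real source is always /proc/drbd, in the DRBD 8.4 format):

```shell
# Print "<connection-state> <roles>" for each device line in /proc/drbd,
# e.g. "Connected Primary/Primary". The optional file argument exists
# only so the parsing can be exercised against a saved copy.
drbd_state() {
    awk '/cs:/ {
        cs = ""; ro = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^cs:/) cs = substr($i, 4)
            if ($i ~ /^ro:/) ro = substr($i, 4)
        }
        print cs, ro
    }' "${1:-/proc/drbd}"
}
```

Run on either node, `drbd_state` should print "Connected Primary/Primary" when everything is healthy.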
On the 29th of May, however, we started seeing timeouts:

Feb 28 15:13:01 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Feb 28 16:19:06 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
May 29 16:54:17 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
May 30 11:22:58 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 2 10:19:16 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 2 11:30:46 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 3 13:37:47 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 3 14:07:31 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 3 14:52:39 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 3 15:19:16 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 3 15:58:51 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 4 11:11:39 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 4 11:29:18 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 4 12:42:40 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 5 07:32:35 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 5 08:01:41 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 5 09:02:12 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting

I resolve these by telling the secondary server to reconnect with discard, so that the primary server syncs everything over to the secondary; then we put everything back into primary/primary mode. Because of these issues, I've moved all of the work that used to happen on the secondary server over to the primary, so that when I do the resync we aren't losing anything, since all the work is happening on the primary anyway.

We haven't changed anything in the configuration of the servers on or before the 29th of May, so I'm at a loss as to why this is happening now. I restarted both servers last night, but as you can see the volume has already gone out of sync three times today.

If it matters, here's the output of /proc/drbd:

version: 8.4.3 (api:1/proto:86-101)
built-in
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:1500684 nr:3532 dw:73274267 dr:187548582 al:29429 bm:594 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

We are using Gentoo as our host OS, with kernel 3.10.25, which (as seen above) ships DRBD 8.4.3. I am planning on moving up to the current stable kernel, 3.12.20, but that is also still at 8.4.3 (though with some code changes between the two, judging from a diff I did between the kernel sources).

Here are the kernel messages I'm seeing in the syslog for the disconnect that happened at 7:32 AM:

Jun 5 07:32:35 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Jun 5 07:32:35 bellerophon kernel: d-con r0: error receiving Data, e: -110 l: 512!
Jun 5 07:32:35 bellerophon kernel: d-con r0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Jun 5 07:32:35 bellerophon kernel: block drbd1: new current UUID C68B8BD454AAF64B:E240A278745017DB:9914908994F205E5:9913908994F205E5
Jun 5 07:32:35 bellerophon kernel: d-con r0: asender terminated
Jun 5 07:32:35 bellerophon kernel: d-con r0: Terminating drbd_a_r0
Jun 5 07:32:35 bellerophon kernel: d-con r0: Connection closed
Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( ProtocolError -> Unconnected )
Jun 5 07:32:35 bellerophon kernel: d-con r0: receiver terminated
Jun 5 07:32:35 bellerophon kernel: d-con r0: Restarting receiver thread
Jun 5 07:32:35 bellerophon kernel: d-con r0: receiver (re)started
Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( Unconnected -> WFConnection )
Jun 5 07:32:35 bellerophon kernel: d-con r0: Handshake successful: Agreed network protocol version 101
Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( WFConnection -> WFReportParams )
Jun 5 07:32:35 bellerophon kernel: d-con r0: Starting asender thread (from drbd_r_r0 [2699])
Jun 5 07:32:35 bellerophon kernel: block drbd1: drbd_sync_handshake:
Jun 5 07:32:35 bellerophon kernel: block drbd1: self C68B8BD454AAF64B:E240A278745017DB:9914908994F205E5:9913908994F205E5 bits:236 flags:0
Jun 5 07:32:35 bellerophon kernel: block drbd1: peer 5AE3A4FE88D7DBA3:E240A278745017DB:9914908994F205E4:9913908994F205E5 bits:1 flags:0
Jun 5 07:32:35 bellerophon kernel: block drbd1: uuid_compare()=100 by rule 90
Jun 5 07:32:35 bellerophon kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Jun 5 07:32:35 bellerophon kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Jun 5 07:32:35 bellerophon kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Jun 5 07:32:35 bellerophon kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Jun 5 07:32:35 bellerophon kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( WFReportParams -> Disconnecting )
Jun 5 07:32:35 bellerophon kernel: d-con r0: error receiving ReportState, e: -5 l: 0!
Jun 5 07:32:35 bellerophon kernel: d-con r0: asender terminated
Jun 5 07:32:35 bellerophon kernel: d-con r0: Terminating drbd_a_r0
Jun 5 07:32:35 bellerophon kernel: d-con r0: Connection closed
Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( Disconnecting -> StandAlone )
Jun 5 07:32:35 bellerophon kernel: d-con r0: receiver terminated
Jun 5 07:32:35 bellerophon kernel: d-con r0: Terminating drbd_r_r0

Here is the configuration file that I'm using:

# cat /etc/drbd.d/global_common.conf
global {
        usage-count yes;
        # minor-count dialog-refresh disable-ip-verification
}
common {
        handlers {
                # These are EXAMPLE handlers only.
                # They may have severe implications,
                # like hard resetting the node under certain circumstances.
                # Be careful when chosing your poison.
                # pri-on-incon-degr "/usr/lib64/drbd/notify-pri-on-incon-degr.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                # pri-lost-after-sb "/usr/lib64/drbd/notify-pri-lost-after-sb.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                # local-io-error "/usr/lib64/drbd/notify-io-error.sh; /usr/lib64/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                # fence-peer "/usr/lib64/drbd/crm-fence-peer.sh";
                # split-brain "/usr/lib64/drbd/notify-split-brain.sh root";
                # out-of-sync "/usr/lib64/drbd/notify-out-of-sync.sh root";
                # before-resync-target "/usr/lib64/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
                # after-resync-target /usr/lib64/drbd/unsnapshot-resync-target-lvm.sh;
        }
        startup {
                # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
        }
        options {
                # cpu-mask on-no-data-accessible
        }
        disk {
                # size max-bio-bvecs on-io-error fencing disk-barrier disk-flushes
                # disk-drain md-flushes resync-rate resync-after al-extents
                # c-plan-ahead c-delay-target c-fill-target c-max-rate
                # c-min-rate disk-timeout
        }
        net {
                # protocol timeout max-epoch-size max-buffers unplug-watermark
                # connect-int ping-int sndbuf-size rcvbuf-size ko-count
                # allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri
                # after-sb-1pri after-sb-2pri always-asbp rr-conflict
                # ping-timeout data-integrity-alg tcp-cork on-congestion
                # congestion-fill congestion-extents csums-alg verify-alg
                # use-rle
        }
}

# cat /etc/drbd.d/dms.res
resource r0 {
        disk {
                al-extents 3389;
                disk-barrier no;
                disk-flushes no;
        }
        startup {
                wfc-timeout 15;
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        net {
                # allow-two-primaries - Generally, DRBD has a primary and a secondary node.
                # In this case, we will allow both nodes to have the filesystem mounted at
                # the same time.
                # Do this only with a clustered filesystem. If you do this
                # with a non-clustered filesystem like ext2/ext3/ext4 or reiserfs, you will
                # have data corruption.
                allow-two-primaries;
                # after-sb-0pri discard-zero-changes - DRBD detected a split-brain scenario,
                # but none of the nodes think they're a primary. DRBD will take the newest
                # modifications and apply them to the node that didn't have any changes.
                after-sb-0pri discard-zero-changes;
                # after-sb-1pri discard-secondary - DRBD detected a split-brain scenario,
                # but one node is the primary and the other is the secondary. In this case,
                # DRBD will decide that the secondary node is the victim and it will sync
                # data from the primary to the secondary automatically.
                after-sb-1pri discard-secondary;
                # after-sb-2pri disconnect - DRBD detected a split-brain scenario, but it
                # can't figure out which node has the right data. It tries to protect the
                # consistency of both nodes by disconnecting the DRBD volume entirely.
                # You'll have to tell DRBD which node has the valid data in order to
                # reconnect the volume.
                after-sb-2pri disconnect;
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 512k;
        }
        on bellerophon {
                device /dev/drbd1;
                disk /dev/sda1;
                address 172.16.0.10:7789;
                meta-disk internal;
        }
        on bia {
                device /dev/drbd1;
                disk /dev/sda1;
                address 172.16.0.11:7789;
                meta-disk internal;
        }
}

If anyone has any information that might assist me in figuring out what the issue is here, I'd really appreciate it.

Adam.

-- 
Adam Randall
"To err is human... to really foul up requires the root password."
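P.S. For anyone finding this in the archives: here is roughly the manual recovery sequence I described above, as a sketch rather than a definitive procedure. The resource name r0 comes from my config; the drbdadm invocation is parameterised only so the sequence can be dry-run, and the flags are those from the DRBD 8.4 documentation as I understand them.

```shell
# Manual split-brain recovery sketch for a dual-primary DRBD 8.4 setup.
# DRBDADM and RES are overridable so the command sequence can be
# dry-run (e.g. DRBDADM=echo).
DRBDADM="${DRBDADM:-drbdadm}"
RES="${RES:-r0}"

recover_victim() {
    # Run on the node whose changes get thrown away (bia, in my case).
    # NOTE: the OCFS2 filesystem on the DRBD device must be unmounted
    # first (mount point omitted here -- it's site-specific).
    $DRBDADM disconnect "$RES"
    $DRBDADM secondary "$RES"
    $DRBDADM connect --discard-my-data "$RES"
}

recover_survivor() {
    # Run on the node whose data survives (bellerophon); only needed if
    # it dropped to StandAlone rather than sitting in WFConnection.
    $DRBDADM connect "$RES"
}

# Once the resync completes, promote the victim again ("drbdadm primary r0")
# and remount to get back to primary/primary.
```

With DRBDADM=echo the functions just print the drbdadm subcommands they would run, which is a cheap way to sanity-check the sequence before touching a live node.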