Hello everyone,

I have set up two servers with two DRBD resources. The servers start fine, the connection is established, and everything works for a while, but at some point (it could be hours, though never more than a day) the DRBD resources fall into StandAlone status. In /var/log/messages I can see the following as the connection gets lost:

Dec 3 13:56:20 host2 kernel: block drbd1: sock was shut down by peer
Dec 3 13:56:20 host2 kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Dec 3 13:56:20 host2 kernel: block drbd1: short read expecting header on sock: r=0
Dec 3 13:56:20 host2 kernel: block drbd1: new current UUID 0DA9D7241DAA80E7:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79
Dec 3 13:56:20 host2 kernel: block drbd1: PingAck did not arrive in time.
Dec 3 13:56:20 host2 kernel: block drbd1: asender terminated
Dec 3 13:56:20 host2 kernel: block drbd1: Terminating drbd1_asender
Dec 3 13:56:20 host2 kernel: block drbd1: Connection closed
Dec 3 13:56:20 host2 kernel: block drbd1: conn( BrokenPipe -> Unconnected )
Dec 3 13:56:20 host2 kernel: block drbd1: receiver terminated
Dec 3 13:56:20 host2 kernel: block drbd1: Restarting drbd1_receiver
Dec 3 13:56:20 host2 kernel: block drbd1: receiver (re)started
Dec 3 13:56:20 host2 kernel: block drbd1: conn( Unconnected -> WFConnection )
Dec 3 13:56:21 host2 kernel: block drbd1: Handshake successful: Agreed network protocol version 97
Dec 3 13:56:21 host2 kernel: block drbd1: conn( WFConnection -> WFReportParams )
Dec 3 13:56:21 host2 kernel: block drbd1: Starting asender thread (from drbd1_receiver [2860])
Dec 3 13:56:21 host2 kernel: block drbd1: data-integrity-alg: <not-used>
Dec 3 13:56:21 host2 kernel: block drbd1: drbd_sync_handshake:
Dec 3 13:56:21 host2 kernel: block drbd1: self 0DA9D7241DAA80E7:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79 bits:0 flags:0
Dec 3 13:56:21 host2 kernel: block drbd1: peer 6FB7C41C2FB85275:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79 bits:0 flags:0
Dec 3 13:56:21 host2 kernel: block drbd1: uuid_compare()=100 by rule 90
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Dec 3 13:56:21 host2 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Dec 3 13:56:21 host2 notify-split-brain.sh[6540]: invoked for vms1
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Dec 3 13:56:21 host2 kernel: block drbd1: conn( WFReportParams -> Disconnecting )
Dec 3 13:56:21 host2 kernel: block drbd1: error receiving ReportState, l: 4!
Dec 3 13:56:21 host2 kernel: block drbd1: asender terminated
Dec 3 13:56:21 host2 kernel: block drbd1: Terminating drbd1_asender
Dec 3 13:56:21 host2 kernel: block drbd1: Connection closed
Dec 3 13:56:21 host2 kernel: block drbd1: conn( Disconnecting -> StandAlone )
Dec 3 13:56:21 host2 kernel: block drbd1: receiver terminated
Dec 3 13:56:21 host2 kernel: block drbd1: Terminating drbd1_receiver

As you can see, this is for one resource. If I do nothing (usually I restart DRBD to recover), the second resource eventually fails too; the order in which the resources fail has been completely random. The connection between the two servers is a direct single cable (straight-through, not a crossover). I have monitored ping between the servers while this happens and I get no lost packets at all. I also have NIS (ypserv) configured, and that connection doesn't get lost either. The connection doesn't re-establish by itself; the only way to get it back has been to restart the drbd service on both servers. Any ideas what might be causing this instability?
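For reference, my understanding from the DRBD 8.3 documentation is that a split brain can also be recovered manually, without restarting the whole drbd service, by discarding the modifications on one node. Something like the following (using vms1 as the example resource; I have not verified that this is the right procedure for my setup):

```shell
# On the node whose changes are to be thrown away (the split-brain "victim"):
drbdadm secondary vms1
drbdadm -- --discard-my-data connect vms1

# On the surviving node (only needed if it has also dropped to StandAlone):
drbdadm connect vms1
```

But even if that works, it only treats the symptom; I would still like to understand why the connection breaks in the first place.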
Here is some general configuration info that might shed a bit of light on the issue:

# rpm -qa | grep drbd
drbd83-utils-8.3.16-1.el6.elrepo.x86_64
kmod-drbd83-8.3.16-3.el6.elrepo.x86_64

# cat /etc/redhat-release
Scientific Linux release 6.7 (Carbon)

# drbdadm dump all
# /etc/drbd.conf
common {
    protocol               C;
    net {
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    syncer {
        rate             33M;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error   "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        split-brain      "/usr/lib/drbd/notify-split-brain.sh root";
        out-of-sync      "/usr/lib/drbd/notify-out-of-sync.sh root";
    }
}

# resource vms1 on host2: not ignored, not stacked
resource vms1 {
    on host1 {
        device           /dev/drbd1 minor 1;
        disk             /dev/sda2;
        address          ipv4 192.168.100.60:7789;
        meta-disk        internal;
    }
    on host2 {
        device           /dev/drbd1 minor 1;
        disk             /dev/sda2;
        address          ipv4 192.168.100.61:7789;
        meta-disk        internal;
    }
    net {
        allow-two-primaries;
    }
    startup {
        become-primary-on both;
    }
}

# resource vms2 on host2: not ignored, not stacked
resource vms2 {
    on host1 {
        device           /dev/drbd2 minor 2;
        disk             /dev/sda3;
        address          ipv4 192.168.100.60:7790;
        meta-disk        internal;
    }
    on host2 {
        device           /dev/drbd2 minor 2;
        disk             /dev/sda3;
        address          ipv4 192.168.100.61:7790;
        meta-disk        internal;
    }
    net {
        allow-two-primaries;
    }
    startup {
        become-primary-on both;
    }
}

Thank you in advance for your help.

Fabrizio Zelaya
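P.S. One thing I have not touched is the network timeout tuning. Since the log shows "PingAck did not arrive in time", I wonder whether these net options are relevant. The values below are the defaults as I understand them from drbd.conf(5), not something I am currently setting:

    net {
        timeout       60;   # 6 seconds, in tenths of a second
        ping-int      10;   # seconds between keep-alive pings
        ping-timeout  5;    # 0.5 seconds, in tenths of a second
        connect-int   10;   # seconds between connection attempts
    }

If the link really drops packets only occasionally, would raising ping-timeout be expected to help, or would it just mask the underlying problem?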