[DRBD-user] DRBD spontaneously loses connection

Sun Dec 6 04:17:22 CET 2015

Try setting ping-timeout to 5
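
A rough sketch of where that setting would go, assuming the 8.3-style drbd.conf layout shown in the quoted `drbdadm dump all` output below. `ping-timeout` is a `net` option and its unit is tenths of a second, so 5 means 0.5 s. The `after-sb-*` lines are copied from the poster's `common` section; only the `ping-timeout` line is new. As far as I know 5 is also the documented default for 8.3, so a larger value may be worth trying if nothing changes.

```
common {
    net {
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
        ping-timeout     5;    # suggested value; unit is tenths of a second
    }
}
```

After editing the file on both nodes, `drbdadm adjust all` should apply the change without restarting the service.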

-- 
Adam Randall
http://www.xaren.net
AIM: blitz574
Twitter: @randalla0622

"To err is human... to really foul up requires the root password."
On Dec 5, 2015 5:23 AM, "Fabrizio Zelaya" <FZelaya at ta-petro.com> wrote:

> Hello Everyone.
>
> I have set up 2 servers with 2 drbd resources. Servers start fine and the
> connection is established and everything works fine for a while, but at
> some point (it could be hours but never more than 1 day) the drbd resources
> fall into a StandAlone status.
>
> In /var/log/messages I can see the following as the connection gets lost:
> Dec  3 13:56:20 host2 kernel: block drbd1: sock was shut down by peer
> Dec  3 13:56:20 host2 kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
> Dec  3 13:56:20 host2 kernel: block drbd1: short read expecting header on sock: r=0
> Dec  3 13:56:20 host2 kernel: block drbd1: new current UUID 0DA9D7241DAA80E7:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79
> Dec  3 13:56:20 host2 kernel: block drbd1: PingAck did not arrive in time.
> Dec  3 13:56:20 host2 kernel: block drbd1: asender terminated
> Dec  3 13:56:20 host2 kernel: block drbd1: Terminating drbd1_asender
> Dec  3 13:56:20 host2 kernel: block drbd1: Connection closed
> Dec  3 13:56:20 host2 kernel: block drbd1: conn( BrokenPipe -> Unconnected )
> Dec  3 13:56:20 host2 kernel: block drbd1: receiver terminated
> Dec  3 13:56:20 host2 kernel: block drbd1: Restarting drbd1_receiver
> Dec  3 13:56:20 host2 kernel: block drbd1: receiver (re)started
> Dec  3 13:56:20 host2 kernel: block drbd1: conn( Unconnected -> WFConnection )
> Dec  3 13:56:21 host2 kernel: block drbd1: Handshake successful: Agreed network protocol version 97
> Dec  3 13:56:21 host2 kernel: block drbd1: conn( WFConnection -> WFReportParams )
> Dec  3 13:56:21 host2 kernel: block drbd1: Starting asender thread (from drbd1_receiver [2860])
> Dec  3 13:56:21 host2 kernel: block drbd1: data-integrity-alg: <not-used>
> Dec  3 13:56:21 host2 kernel: block drbd1: drbd_sync_handshake:
> Dec  3 13:56:21 host2 kernel: block drbd1: self 0DA9D7241DAA80E7:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79 bits:0 flags:0
> Dec  3 13:56:21 host2 kernel: block drbd1: peer 6FB7C41C2FB85275:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79 bits:0 flags:0
> Dec  3 13:56:21 host2 kernel: block drbd1: uuid_compare()=100 by rule 90
> Dec  3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
> Dec  3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
> Dec  3 13:56:21 host2 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
> Dec  3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Dec  3 13:56:21 host2 notify-split-brain.sh[6540]: invoked for vms1
> Dec  3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Dec  3 13:56:21 host2 kernel: block drbd1: conn( WFReportParams -> Disconnecting )
> Dec  3 13:56:21 host2 kernel: block drbd1: error receiving ReportState, l: 4!
> Dec  3 13:56:21 host2 kernel: block drbd1: asender terminated
> Dec  3 13:56:21 host2 kernel: block drbd1: Terminating drbd1_asender
> Dec  3 13:56:21 host2 kernel: block drbd1: Connection closed
> Dec  3 13:56:21 host2 kernel: block drbd1: conn( Disconnecting -> StandAlone )
> Dec  3 13:56:21 host2 kernel: block drbd1: receiver terminated
> Dec  3 13:56:21 host2 kernel: block drbd1: Terminating drbd1_receiver
>
> As you can see, this is for one resource. If I do nothing (usually I
> restart drbd to recover), eventually the second resource fails too. The
> order in which the resources fail has been completely random.
>
> The connection between the 2 servers is directly through a single cable
> (straight, not a crossover)
>
> I have monitored ping between the servers while it happens and I get no
> lost packets at all.
>
> I also have NIS (ypserv) configured and that connection doesn't get lost
> either.
>
> The connection doesn't re-establish by itself; the way to get it back has
> been to restart the drbd service on both servers.
>
> Any ideas what might be causing this instability?
>
> Here is some general configuration info that might shed a bit of light on
> the issue:
>
> # rpm -qa|grep drbd
> drbd83-utils-8.3.16-1.el6.elrepo.x86_64
> kmod-drbd83-8.3.16-3.el6.elrepo.x86_64
>
> # cat /etc/redhat-release
> Scientific Linux release 6.7 (Carbon)
>
>
> # drbdadm dump all
>
> # /etc/drbd.conf
> common {
>     protocol               C;
>     net {
>         after-sb-0pri    discard-zero-changes;
>         after-sb-1pri    discard-secondary;
>         after-sb-2pri    disconnect;
>     }
>     syncer {
>         rate             33M;
>     }
>     handlers {
>         pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         local-io-error   "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>         split-brain      "/usr/lib/drbd/notify-split-brain.sh root";
>         out-of-sync      "/usr/lib/drbd/notify-out-of-sync.sh root";
>     }
> }
>
> # resource vms1 on host2: not ignored, not stacked
> resource vms1 {
>     on host1 {
>         device           /dev/drbd1 minor 1;
>         disk             /dev/sda2;
>         address          ipv4 192.168.100.60:7789;
>         meta-disk        internal;
>     }
>     on host2 {
>         device           /dev/drbd1 minor 1;
>         disk             /dev/sda2;
>         address          ipv4 192.168.100.61:7789;
>         meta-disk        internal;
>     }
>     net {
>         allow-two-primaries;
>     }
>     startup {
>         become-primary-on both;
>     }
> }
>
> # resource vms2 on host2: not ignored, not stacked
> resource vms2 {
>     on host1 {
>         device           /dev/drbd2 minor 2;
>         disk             /dev/sda3;
>         address          ipv4 192.168.100.60:7790;
>         meta-disk        internal;
>     }
>     on host2 {
>         device           /dev/drbd2 minor 2;
>         disk             /dev/sda3;
>         address          ipv4 192.168.100.61:7790;
>         meta-disk        internal;
>     }
>     net {
>         allow-two-primaries;
>     }
>     startup {
>         become-primary-on both;
>     }
> }
>
>
> Thank you in advance for your help
>
> Fabrizio Zelaya
>