Hello everyone,

I have set up two servers with two DRBD resources. The servers start fine, the connection is established, and everything works for a while, but at some point (it could be hours, though never more than a day) the DRBD resources fall into StandAlone status. In /var/log/messages I can see the following as the connection gets lost:

Dec 3 13:56:20 host2 kernel: block drbd1: sock was shut down by peer
Dec 3 13:56:20 host2 kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Dec 3 13:56:20 host2 kernel: block drbd1: short read expecting header on sock: r=0
Dec 3 13:56:20 host2 kernel: block drbd1: new current UUID 0DA9D7241DAA80E7:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79
Dec 3 13:56:20 host2 kernel: block drbd1: PingAck did not arrive in time.
Dec 3 13:56:20 host2 kernel: block drbd1: asender terminated
Dec 3 13:56:20 host2 kernel: block drbd1: Terminating drbd1_asender
Dec 3 13:56:20 host2 kernel: block drbd1: Connection closed
Dec 3 13:56:20 host2 kernel: block drbd1: conn( BrokenPipe -> Unconnected )
Dec 3 13:56:20 host2 kernel: block drbd1: receiver terminated
Dec 3 13:56:20 host2 kernel: block drbd1: Restarting drbd1_receiver
Dec 3 13:56:20 host2 kernel: block drbd1: receiver (re)started
Dec 3 13:56:20 host2 kernel: block drbd1: conn( Unconnected -> WFConnection )
Dec 3 13:56:21 host2 kernel: block drbd1: Handshake successful: Agreed network protocol version 97
Dec 3 13:56:21 host2 kernel: block drbd1: conn( WFConnection -> WFReportParams )
Dec 3 13:56:21 host2 kernel: block drbd1: Starting asender thread (from drbd1_receiver [2860])
Dec 3 13:56:21 host2 kernel: block drbd1: data-integrity-alg: <not-used>
Dec 3 13:56:21 host2 kernel: block drbd1: drbd_sync_handshake:
Dec 3 13:56:21 host2 kernel: block drbd1: self 0DA9D7241DAA80E7:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79 bits:0 flags:0
Dec 3 13:56:21 host2 kernel: block drbd1: peer 6FB7C41C2FB85275:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79 bits:0 flags:0
Dec 3 13:56:21 host2 kernel: block drbd1: uuid_compare()=100 by rule 90
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Dec 3 13:56:21 host2 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Dec 3 13:56:21 host2 notify-split-brain.sh[6540]: invoked for vms1
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Dec 3 13:56:21 host2 kernel: block drbd1: conn( WFReportParams -> Disconnecting )
Dec 3 13:56:21 host2 kernel: block drbd1: error receiving ReportState, l: 4!
Dec 3 13:56:21 host2 kernel: block drbd1: asender terminated
Dec 3 13:56:21 host2 kernel: block drbd1: Terminating drbd1_asender
Dec 3 13:56:21 host2 kernel: block drbd1: Connection closed
Dec 3 13:56:21 host2 kernel: block drbd1: conn( Disconnecting -> StandAlone )
Dec 3 13:56:21 host2 kernel: block drbd1: receiver terminated
Dec 3 13:56:21 host2 kernel: block drbd1: Terminating drbd1_receiver

As you can see, this is for one resource. If I do nothing (usually I restart DRBD to recover), the second resource eventually fails too; the order in which the resources fail has been completely random. The connection between the two servers is a direct single cable (straight-through, not a crossover). I have monitored ping between the servers while this happens and I get no lost packets at all. I also have NIS (ypserv) configured, and that connection doesn't get lost either. The connection doesn't re-establish by itself; the only way to get it back has been to restart the drbd service on both servers. Any ideas what might be causing this instability?
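For reference, my understanding from the DRBD 8.3 documentation is that a split brain can also be recovered manually, without restarting the whole drbd service, by discarding the modifications on one node. Something like the following (using vms1 as the example resource; I have not verified that this is the right procedure for my setup):

```shell
# On the node whose changes are to be thrown away (the split-brain "victim"):
drbdadm secondary vms1
drbdadm -- --discard-my-data connect vms1

# On the surviving node (only needed if it has also dropped to StandAlone):
drbdadm connect vms1
```

But even if that works, it only treats the symptom; I would still like to understand why the connection breaks in the first place.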
Here is some general configuration info that might shed a bit of light on the issue:

# rpm -qa | grep drbd
drbd83-utils-8.3.16-1.el6.elrepo.x86_64
kmod-drbd83-8.3.16-3.el6.elrepo.x86_64

# cat /etc/redhat-release
Scientific Linux release 6.7 (Carbon)

# drbdadm dump all
# /etc/drbd.conf
common {
    protocol               C;
    net {
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    syncer {
        rate             33M;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error   "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        split-brain      "/usr/lib/drbd/notify-split-brain.sh root";
        out-of-sync      "/usr/lib/drbd/notify-out-of-sync.sh root";
    }
}

# resource vms1 on host2: not ignored, not stacked
resource vms1 {
    on host1 {
        device           /dev/drbd1 minor 1;
        disk             /dev/sda2;
        address          ipv4 192.168.100.60:7789;
        meta-disk        internal;
    }
    on host2 {
        device           /dev/drbd1 minor 1;
        disk             /dev/sda2;
        address          ipv4 192.168.100.61:7789;
        meta-disk        internal;
    }
    net {
        allow-two-primaries;
    }
    startup {
        become-primary-on both;
    }
}

# resource vms2 on host2: not ignored, not stacked
resource vms2 {
    on host1 {
        device           /dev/drbd2 minor 2;
        disk             /dev/sda3;
        address          ipv4 192.168.100.60:7790;
        meta-disk        internal;
    }
    on host2 {
        device           /dev/drbd2 minor 2;
        disk             /dev/sda3;
        address          ipv4 192.168.100.61:7790;
        meta-disk        internal;
    }
    net {
        allow-two-primaries;
    }
    startup {
        become-primary-on both;
    }
}

Thank you in advance for your help.

Fabrizio Zelaya
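P.S. One thing I have not touched is the network timeout tuning. Since the log shows "PingAck did not arrive in time", I wonder whether these net options are relevant. The values below are the defaults as I understand them from drbd.conf(5), not something I am currently setting:

    net {
        timeout       60;   # 6 seconds, in tenths of a second
        ping-int      10;   # seconds between keep-alive pings
        ping-timeout  5;    # 0.5 seconds, in tenths of a second
        connect-int   10;   # seconds between connection attempts
    }

If the link really drops packets only occasionally, would raising ping-timeout be expected to help, or would it just mask the underlying problem?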