Hello everyone,
I have set up two servers with two DRBD resources. The servers start fine,
the connection is established, and everything works for a while, but at
some point (it can be hours, but never more than a day) the DRBD
resources fall into StandAlone state.
In /var/log/messages I can see the following as the connection gets lost:
Dec 3 13:56:20 host2 kernel: block drbd1: sock was shut down by peer
Dec 3 13:56:20 host2 kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Dec 3 13:56:20 host2 kernel: block drbd1: short read expecting header on sock: r=0
Dec 3 13:56:20 host2 kernel: block drbd1: new current UUID 0DA9D7241DAA80E7:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79
Dec 3 13:56:20 host2 kernel: block drbd1: PingAck did not arrive in time.
Dec 3 13:56:20 host2 kernel: block drbd1: asender terminated
Dec 3 13:56:20 host2 kernel: block drbd1: Terminating drbd1_asender
Dec 3 13:56:20 host2 kernel: block drbd1: Connection closed
Dec 3 13:56:20 host2 kernel: block drbd1: conn( BrokenPipe -> Unconnected )
Dec 3 13:56:20 host2 kernel: block drbd1: receiver terminated
Dec 3 13:56:20 host2 kernel: block drbd1: Restarting drbd1_receiver
Dec 3 13:56:20 host2 kernel: block drbd1: receiver (re)started
Dec 3 13:56:20 host2 kernel: block drbd1: conn( Unconnected -> WFConnection )
Dec 3 13:56:21 host2 kernel: block drbd1: Handshake successful: Agreed network protocol version 97
Dec 3 13:56:21 host2 kernel: block drbd1: conn( WFConnection -> WFReportParams )
Dec 3 13:56:21 host2 kernel: block drbd1: Starting asender thread (from drbd1_receiver [2860])
Dec 3 13:56:21 host2 kernel: block drbd1: data-integrity-alg: <not-used>
Dec 3 13:56:21 host2 kernel: block drbd1: drbd_sync_handshake:
Dec 3 13:56:21 host2 kernel: block drbd1: self 0DA9D7241DAA80E7:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79 bits:0 flags:0
Dec 3 13:56:21 host2 kernel: block drbd1: peer 6FB7C41C2FB85275:C4DC8617C18594B1:FBC08C5F22389C79:FBBF8C5F22389C79 bits:0 flags:0
Dec 3 13:56:21 host2 kernel: block drbd1: uuid_compare()=100 by rule 90
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Dec 3 13:56:21 host2 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Dec 3 13:56:21 host2 notify-split-brain.sh[6540]: invoked for vms1
Dec 3 13:56:21 host2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Dec 3 13:56:21 host2 kernel: block drbd1: conn( WFReportParams -> Disconnecting )
Dec 3 13:56:21 host2 kernel: block drbd1: error receiving ReportState, l: 4!
Dec 3 13:56:21 host2 kernel: block drbd1: asender terminated
Dec 3 13:56:21 host2 kernel: block drbd1: Terminating drbd1_asender
Dec 3 13:56:21 host2 kernel: block drbd1: Connection closed
Dec 3 13:56:21 host2 kernel: block drbd1: conn( Disconnecting -> StandAlone )
Dec 3 13:56:21 host2 kernel: block drbd1: receiver terminated
Dec 3 13:56:21 host2 kernel: block drbd1: Terminating drbd1_receiver
As you can see, this is for one resource. If I do nothing (usually I
restart DRBD to recover), the second resource eventually fails too. The
order in which the resources fail has been completely random.
The connection between the two servers is a direct, single cable
(straight-through, not crossover).
I have monitored ping between the servers while this happens and see no
lost packets at all.
I also have NIS (ypserv) configured, and that connection doesn't get lost
either.
The connection does not re-establish by itself; the only way to get it
back has been to restart the DRBD service on both servers.
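In case it matters, besides restarting the service I have also tried the
manual split-brain recovery described for 8.3, roughly as below (resource
vms1 and picking host2 as the node whose changes get discarded are just
examples for illustration):

```shell
# On the split-brain victim (the node whose local changes are thrown away):
drbdadm secondary vms1
drbdadm -- --discard-my-data connect vms1

# On the split-brain survivor (the node whose data is kept):
drbdadm connect vms1
```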
Any ideas on what might be causing this instability?
Here is some general configuration info that might shed a bit of light on
the issue:
# rpm -qa|grep drbd
drbd83-utils-8.3.16-1.el6.elrepo.x86_64
kmod-drbd83-8.3.16-3.el6.elrepo.x86_64
# cat /etc/redhat-release
Scientific Linux release 6.7 (Carbon)
# drbdadm dump all
# /etc/drbd.conf
common {
    protocol C;
    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    syncer {
        rate 33M;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    }
}

# resource vms1 on host2: not ignored, not stacked
resource vms1 {
    on host1 {
        device /dev/drbd1 minor 1;
        disk /dev/sda2;
        address ipv4 192.168.100.60:7789;
        meta-disk internal;
    }
    on host2 {
        device /dev/drbd1 minor 1;
        disk /dev/sda2;
        address ipv4 192.168.100.61:7789;
        meta-disk internal;
    }
    net {
        allow-two-primaries;
    }
    startup {
        become-primary-on both;
    }
}

# resource vms2 on host2: not ignored, not stacked
resource vms2 {
    on host1 {
        device /dev/drbd2 minor 2;
        disk /dev/sda3;
        address ipv4 192.168.100.60:7790;
        meta-disk internal;
    }
    on host2 {
        device /dev/drbd2 minor 2;
        disk /dev/sda3;
        address ipv4 192.168.100.61:7790;
        meta-disk internal;
    }
    net {
        allow-two-primaries;
    }
    startup {
        become-primary-on both;
    }
}
Thank you in advance for your help
Fabrizio Zelaya