[DRBD-user] ko-count default inaccurate (8.3.11) and other bugs

Fri May 11 04:15:18 CEST 2012

version: 8.3.11 (api:88/proto:86-96)
May 10 21:11:55 scale-192-168-54-14 kernel: block drbd1:
[drbd1_worker/8936] sock_sendmsg time expired, ko = 4294967295
... (long countdown that will never get anywhere)
This is a "dual primary" setup (underneath GPFS) over a failover-bonded
network interface.
Everything works fine (read/write/reboot/etc) until I attempt a verify.

My configuration has no reference to ko-count, which according to the
documentation means it should default to 0 and be disabled.  Does the
documentation actually intend to say that the default is 2^32?
I'm building/running this all on a clone of RHEL6.2.
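
For what it's worth, the logged value 4294967295 is 2^32 - 1, which is exactly
what an unsigned 32-bit counter holds after being decremented from 0.  The
sketch below only illustrates that arithmetic; the helper names are
hypothetical and are not taken from the DRBD source, and I am assuming (not
asserting) that the driver counts down in this fashion.

```c
#include <stdint.h>

/* Hypothetical helper, NOT from the DRBD source: one step of a countdown
 * held in an unsigned 32-bit integer. Unsigned subtraction wraps modulo
 * 2^32, so stepping from 0 yields 4294967295. */
static uint32_t ko_countdown_step(uint32_t ko)
{
    return ko - 1u;
}

/* Hypothetical helper: decrement the counter in place and report whether
 * it reached zero on this step. Starting from 0 it never does; it wraps
 * to 4294967295 and would need 2^32 - 1 further steps to reach zero. */
static int ko_countdown_expired(uint32_t *ko)
{
    *ko = ko_countdown_step(*ko);
    return *ko == 0u;
}
```

If that assumption holds, a default of 0 would not be "2^32" so much as a
disabled counter that is still being decremented and printed.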

This is occurring during an attempt to 'verify' a dual primary DRBD device.
Originally I received this message on every attempt at verify, but after I
reduced syncer { rate }, this message only crops up after a few iterations.
There is no network/connectivity problem during this time period, yet drbd
commands such as the following hang:

strace -f drbdsetup 1 disconnect --force
...
stat("/proc/drbd", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
open("/var/lock/drbd-147-1", O_RDWR|O_CREAT, 0600) = 3
rt_sigaction(SIGALRM, {0x406b30, [], SA_RESTORER, 0x3935232900}, {SIG_DFL,
[], 0}, 8) = 0
alarm(1)                                = 0
fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
alarm(0)                                = 1
rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x3935232900}, NULL, 8) = 0
socket(PF_NETLINK, SOCK_DGRAM, 11)      = 4
getpid()                                = 13360
bind(4, {sa_family=AF_NETLINK, pid=13360, groups=ffffffff}, 12) = 0
sendto(4,
"9\0\0\0\3\0\0\0\1\0\0\00004\0\0\4\0\0\0\1\0\0\0\1\0\0\00004\0\0"..., 57,
0, NULL, 0) = 57
poll([{fd=4, events=POLLIN}], 1, 120000
<< this is where it hangs and exits after a terminate (ctrl-c) >>
All that's going on in the dmesg output is sock_sendmsg expiration reports.

The documentation here also would be better if *count* and *number* were
consistent (either 'count' or 'number').

> ko-count *number*
> In case the secondary node fails to complete a single write request for
> *count* times the *timeout*, it is expelled from the cluster. (I.e. the
> primary node goes into StandAlone mode.) The default value is 0, which
> disables this feature.
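
To make the quoted rule concrete, here is a minimal sketch of the "count times
the timeout" arithmetic.  The function name is hypothetical, and I am assuming
the timeout is expressed in tenths of a second; check the drbd.conf man page
for your version before relying on that unit.

```c
#include <stdint.h>

/* Hypothetical helper, NOT a DRBD API: how long (in tenths of a second,
 * an assumed unit) a single write may stay incomplete before the peer is
 * expelled. A ko_count of 0 means the feature is disabled, which is
 * reported here as 0. */
static uint64_t ko_expel_deciseconds(uint32_t ko_count, uint32_t timeout_ds)
{
    if (ko_count == 0u)
        return 0u;                         /* disabled per the documentation */
    return (uint64_t)ko_count * timeout_ds; /* widened to avoid overflow */
}
```

With a count of 7 and a timeout of 60 this gives 420, i.e. 42 seconds under
the assumed unit.  With the 4294967295 seen in my log and the same timeout it
gives 257698037700, which is consistent with a countdown that will never get
anywhere.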