[DRBD-user] ko-count default inaccurate (8.3.11) and other bugs

Sat May 12 00:23:08 CEST 2012

On Thu, May 10, 2012 at 7:15 PM, Brian Chrisman <brchrisman at gmail.com> wrote:

> version: 8.3.11 (api:88/proto:86-96)
> May 10 21:11:55 scale-192-168-54-14 kernel: block drbd1:
> [drbd1_worker/8936] sock_sendmsg time expired, ko = 4294967295
> ... (long countdown that will never get anywhere)
> This is a "dual primary" setup (underneath GPFS) over a failover-bonded
> network interface.
> Everything works fine (read/write/reboot/etc) until I attempt a verify.
>
> My configuration has no reference to ko-count, which from the
> documentation suggests it should be 0 and be disabled.  Does the
> documentation actually intend to say that the default is 2^32?
> I'm building/running this all on a clone of RHEL6.2.
>
> This is occurring during an attempt to 'verify' a dual primary DRBD
> device.  Originally I received this message on every attempt at verify, but
> after I reduced syncer { rate }, this message only crops up after a few
> iterations.  There is no network/connectivity problem during this time
> period, yet drbd commands hang such as:
>
>
Apologies for the repost, but I think I've cleared up my out-of-sync messages.
I'm bypassing the initial sync (throwing away the data on the second drive),
and it looks like the preexisting-but-abandoned data is hosing up the verify.
I checked by zeroing all blocks before creating the device in the same
fashion, after which the verify works fine.
I bypass the initial sync because I'm using dual-primary, and to get into
dual-primary the sync must be finished (with 2TB SATA drives, that takes a
*long* time).
I'm trying to think of ways to get around this issue, but it's a tough one.

-brian

> strace -f drbdsetup 1 disconnect --force
> ...
> stat("/proc/drbd", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> open("/var/lock/drbd-147-1", O_RDWR|O_CREAT, 0600) = 3
> rt_sigaction(SIGALRM, {0x406b30, [], SA_RESTORER, 0x3935232900}, {SIG_DFL,
> [], 0}, 8) = 0
> alarm(1)                                = 0
> fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
> alarm(0)                                = 1
> rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x3935232900}, NULL, 8) =
> 0
> socket(PF_NETLINK, SOCK_DGRAM, 11)      = 4
> getpid()                                = 13360
> bind(4, {sa_family=AF_NETLINK, pid=13360, groups=ffffffff}, 12) = 0
> sendto(4,
> "9\0\0\0\3\0\0\0\1\0\0\00004\0\0\4\0\0\0\1\0\0\0\1\0\0\00004\0\0"..., 57,
> 0, NULL, 0) = 57
> poll([{fd=4, events=POLLIN}], 1, 120000
> << this is where it hangs and exits after a terminate (ctrl-c) >>
> All that's going on in the dmesg output is sock_sendmsg expiration reports.
>
>
> The documentation here also would be better if *count* and *number* were
> consistent (either 'count' or 'number').
>
>> ko-count *number*
>> In case the secondary node fails to complete a single write request for
>> *count* times the *timeout*, it is expelled from the cluster. (I.e. the
>> primary node goes into StandAlone mode.) The default value is 0, which
>> disables this feature.
>
>
>
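
For what it's worth, the ko value in the log line above (4294967295) is exactly
2^32 - 1, which is what you get when a 32-bit unsigned counter holding 0 is
decremented once. The sketch below only illustrates that arithmetic, plus the
"count times timeout" rule from the quoted documentation. It assumes, without
confirmation from the DRBD source, that ko is kept as a 32-bit unsigned counter
and decremented before it is checked. It also assumes timeout is given in units
of 0.1 seconds, as I read the drbd.conf man page. The function names and the
example ko-count of 7 are made up for illustration.

```python
UINT32_MASK = 0xFFFFFFFF  # assumption: ko is stored as a 32-bit unsigned value


def decrement_u32(value: int) -> int:
    """Decrement the way a 32-bit unsigned counter would, wrapping below zero."""
    return (value - 1) & UINT32_MASK


def expel_after_seconds(ko_count: int, timeout_tenths: int) -> float:
    """ko-count times timeout, per the quoted documentation.

    Assumption: timeout is expressed in tenths of a second.
    """
    return ko_count * timeout_tenths / 10.0


if __name__ == "__main__":
    # A disabled ko-count of 0, decremented once, matches the logged value.
    print(decrement_u32(0))            # 4294967295
    # Hypothetical setting: ko-count 7 with timeout 60 (6 seconds).
    print(expel_after_seconds(7, 60))  # 42.0
```

If that reading is right, the countdown in dmesg starts from the wrapped value
rather than from a configured default, which would explain why it "will never
get anywhere".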