[DRBD-user] State switches, flaky connection?

Fri Aug 17 13:17:04 CEST 2012

On Fri, Aug 17, 2012 at 01:27:17PM +0300, Jarno Elonen wrote:
> >I'm running DRBD 8.3.13 on Debian Wheezy, Linux 3.2.20 and
> >every now and then my DRBD resources spontaneously switch from
> >cs:Connected to cs:WFConnection or the various syncing states and back
> >(according to "watch cat /proc/drbd").
> >
> >I've sometimes seen "broken pipe" or even "protocol error"(!?) flashing
> >by briefly.
> 
> No luck debugging this so far. I've tried changing network cards,
> switching between bonding modes, reverting back to regular ethX
> (instead of bonding), various MTU and txqueuelen values, using
> resource-only-fencing (corosync) and not. Nothing has helped so far
> - this connection instability just seems to come and go.
> 
> Any better debugging ideas? Or maybe this is not a network issue at all?
> Excerpt from DRBD configuration:

No excerpts please.

> 
>         net {
>                 timeout 20;
>                 max-epoch-size  8192;
>                 max-buffers     128k;
>                 connect-int     2;
>                 ping-int        2;
>                 sndbuf-size     10M;
>                 rcvbuf-size     10M;
>                 ko-count        5;
>                 after-sb-0pri   discard-zero-changes;
>                 after-sb-1pri   discard-secondary;
>                 ping-timeout    2;
>         }
> 
>         syncer {
>                 rate    100M;
>                 al-extents      3389;
>                 csums-alg       crc32c;
>                 verify-alg      crc32c;
>         }
> 
> 
> Here's a syslog snippet demonstrating one whole cycle of this behavior:

cat /proc/drbd please.
I suspect you are not really running the drbd kernel module
version you think you are. Double check that it is in fact 8.3.13;
actually, for that Debian kernel it should even be some commits further.
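
To make that check less error-prone, here is a minimal sketch that pulls the
version of the currently loaded module out of /proc/drbd. It assumes the first
line has the usual "version: X.Y.Z (api:.../proto:...)" shape; the function
name loaded_drbd_version is made up for this example and is not part of any
DRBD tool.

```python
import re
from typing import Optional, Tuple


def loaded_drbd_version(proc_text: str) -> Optional[Tuple[int, ...]]:
    """Return the module version found in /proc/drbd text, or None.

    Assumes a header line of the form "version: 8.3.13 (api:...)".
    """
    match = re.search(r"^version:\s*(\d+(?:\.\d+)+)", proc_text, re.MULTILINE)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


if __name__ == "__main__":
    try:
        with open("/proc/drbd") as handle:
            text = handle.read()
    except OSError:
        # No /proc/drbd usually means the drbd module is not loaded.
        print("/proc/drbd not readable; is the drbd module loaded?")
    else:
        print(loaded_drbd_version(text))
```

Compare the result with what the userland tools and the package manager claim
to have installed; a mismatch means the running module is not the one you
think it is.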

Also, what does your overall IO stack look like:
what is below DRBD, and what is above it?

> kernel: [ 9827.966027] block drbd6: conn( SyncTarget -> Connected )
> disk( Inconsistent -> UpToDate )
> kernel: [ 9828.199039] block drbd6: helper command: /sbin/drbdadm
> after-resync-target minor-6
> crm-unfence-peer.sh[24132]: invoked for drbd-serv-mail
> crm-unfence-peer.sh[24132]: WARNING drbd-fencing could not determine
> the master id of drbd resource drbd-serv-mail
> kernel: [ 9828.238394] block drbd6: helper command: /sbin/drbdadm
> after-resync-target minor-6 exit code 1 (0x100)
> kernel: [ 9828.298906] block drbd6: bitmap WRITE of 83 pages took 15 jiffies
> kernel: [ 9828.503024] block drbd6: 0 KB (0 bits) marked out-of-sync
> by on disk bit-map.
> kernel: [ 9831.788745] block drbd6: magic?? on data m: 0xa0816800 c:
> 5120 l: 0

This would indicate either data corruption in memory or on the wire,
or more likely a certain stacking of devices on top of DRBD,
and missing at least this patch:
 commit 95153072a19dfef10a2cde98c0719cf0f5d72a68
 Author: Lars Ellenberg <lars.ellenberg at linbit.com>
 Date:   Thu Mar 8 16:43:45 2012 +0100

 drbd: fix potential data corruption and protocol error

> kernel: [ 9832.457733] block drbd6: data-integrity-alg: <not-used>

Or you could enable data-integrity checking...
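
For reference, a sketch of what that could look like in the net section of
the resource. The choice of crc32c is only an assumption here, picked because
the configuration above already uses it for csums-alg and verify-alg; any
digest the kernel crypto API provides on both nodes should do.

```
net {
        # Assumption: crc32c, the same algorithm already used for
        # csums-alg and verify-alg in the quoted configuration.
        data-integrity-alg crc32c;
}
```

With this set, the receiver verifies a checksum on every data packet, so
corruption on the wire should show up as an explicit digest mismatch in the
kernel log instead of a confusing protocol error. It does cost some CPU.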

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com