On Fri, Aug 17, 2012 at 01:27:17PM +0300, Jarno Elonen wrote:
> >I'm running DRBD 8.3.13 on Debian Wheezy, Linux 3.2.20 and
> >every now and then my DRBD resources spontaneously switch from
> >cs:Connected to cs:WFConnection or the various syncing states and back
> >(according to "watch cat /proc/drbd").
> >
> >I've sometimes seen "broken pipe" or even "protocol error"(!?) flashing
> >by briefly.
>
> No luck debugging this so far. I've tried changing network cards,
> switching between bonding modes, reverting back to regular ethX
> (instead of bonding), various MTU and txqueuelen values, using
> resource-only-fencing (corosync) and not. Nothing has helped so far
> - this connection instability just seems to come and go.
>
> Any better debugging ideas? Or maybe this is not a network issue at all?

> Excerpt from DRBD configuration:

No excerpts please.

>
> net {
>     timeout 20;
>     max-epoch-size 8192;
>     max-buffers 128k;
>     connect-int 2;
>     ping-int 2;
>     sndbuf-size 10M;
>     rcvbuf-size 10M;
>     ko-count 5;
>     after-sb-0pri discard-zero-changes;
>     after-sb-1pri discard-secondary;
>     ping-timeout 2;
> }
>
> syncer {
>     rate 100M;
>     al-extents 3389;
>     csums-alg crc32c;
>     verify-alg crc32c;
> }
>
>
> Here's a syslog snippet demonstrating one whole cycle of this behavior:

cat /proc/drbd please.

I suspect you do not really use the drbd kernel module version
you think you do. Double check that it is in fact 8.3.13;
actually, for that Debian kernel, it should be some commits further even.

Also, what does your overall IO stack look like:
what is below drbd, what is above it?
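
One way to double check the loaded module version is to read it from
/proc/drbd rather than from the package manager. The following is a minimal
sketch only: it assumes the first line has the usual
"version: X.Y.Z (api:.../proto:...)" shape, and the sample text and the
function names `loaded_drbd_version` and `connection_states` are made up for
illustration; they are not part of DRBD or of this thread.

```python
import re

# Hypothetical sample of /proc/drbd content (an assumption, not from the thread).
# On a real node, read the file instead: open("/proc/drbd").read()
SAMPLE = """\
version: 8.3.11 (api:88/proto:86-96)
 6: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
"""


def loaded_drbd_version(proc_text):
    """Return the version string of the loaded module, or None if not found."""
    m = re.search(r"^version:[ \t]*(\S+)", proc_text, re.MULTILINE)
    return m.group(1) if m else None


def connection_states(proc_text):
    """Map each minor number to its connection state (the cs: field)."""
    pattern = re.compile(r"^[ \t]*(\d+):[ \t]+cs:(\S+)", re.MULTILINE)
    return {int(m.group(1)): m.group(2) for m in pattern.finditer(proc_text)}


if __name__ == "__main__":
    print("loaded module version:", loaded_drbd_version(SAMPLE))
    print("connection states:", connection_states(SAMPLE))
```

If the reported version differs from the one you believe is installed, the
kernel is most likely using the module shipped with the distribution kernel
rather than a separately built one.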

> kernel: [ 9827.966027] block drbd6: conn( SyncTarget -> Connected )
> disk( Inconsistent -> UpToDate )
> kernel: [ 9828.199039] block drbd6: helper command: /sbin/drbdadm
> after-resync-target minor-6
> crm-unfence-peer.sh[24132]: invoked for drbd-serv-mail
> crm-unfence-peer.sh[24132]: WARNING drbd-fencing could not determine
> the master id of drbd resource drbd-serv-mail
> kernel: [ 9828.238394] block drbd6: helper command: /sbin/drbdadm
> after-resync-target minor-6 exit code 1 (0x100)
> kernel: [ 9828.298906] block drbd6: bitmap WRITE of 83 pages took 15 jiffies
> kernel: [ 9828.503024] block drbd6: 0 KB (0 bits) marked out-of-sync
> by on disk bit-map.
> kernel: [ 9831.788745] block drbd6: magic?? on data m: 0xa0816800 c:
> 5120 l: 0

This would indicate either data corruption in memory or on the wire,
or more likely a certain stacking of devices on top of DRBD,
and missing at least this patch:

commit 95153072a19dfef10a2cde98c0719cf0f5d72a68
Author: Lars Ellenberg <lars.ellenberg at linbit.com>
Date:   Thu Mar 8 16:43:45 2012 +0100

    drbd: fix potential data corruption and protocol error

> kernel: [ 9832.457733] block drbd6: data-integrity-alg: <not-used>

Or you could enable data-integrity checking...

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
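
To see how often a resource actually flaps, the state changes can be pulled
out of the kernel log instead of watching /proc/drbd. This is a rough sketch,
assuming log lines shaped like the ones quoted above, with wrapped lines
joined back together; the names `state_transitions` and `conn_changes` are
illustrative only.

```python
import re
from collections import Counter

# "block drbd6: <message>" as it appears in the quoted kernel log lines.
LINE_RE = re.compile(r"block (drbd\d+): (.*)")
# "conn( SyncTarget -> Connected )", "disk( Inconsistent -> UpToDate )", ...
TRANSITION_RE = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


def state_transitions(log_lines):
    """Yield (device, field, old_state, new_state) for each logged state change."""
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a DRBD block-device message
        device, message = m.groups()
        for field, old, new in TRANSITION_RE.findall(message):
            yield device, field, old, new


# Two lines taken from the log quoted above, unwrapped.
sample = [
    "kernel: [ 9827.966027] block drbd6: conn( SyncTarget -> Connected ) "
    "disk( Inconsistent -> UpToDate )",
    "kernel: [ 9828.298906] block drbd6: bitmap WRITE of 83 pages took 15 jiffies",
]

events = list(state_transitions(sample))
# Number of connection-state changes per device, a crude measure of flapping.
conn_changes = Counter(dev for dev, field, _, _ in events if field == "conn")
```

Feeding it a full day of syslog gives a per-device count of connection state
changes, which can then be lined up against the timestamps of any
"magic??" or "protocol error" messages.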