> I'm running DRBD 8.3.13 on Debian Wheezy, Linux 3.2.20 and
> every now and then my DRBD resources spontaneously switch from
> cs:Connected to cs:WFConnection or the various syncing states and back
> (according to "watch cat /proc/drbd").
>
> I've sometimes seen "broken pipe" or even "protocol error"(!?) flashing
> by briefly.
No luck debugging this so far. I've tried changing network cards,
switching between bonding modes, reverting to plain ethX (instead
of bonding), various MTU and txqueuelen values, and running with and
without resource-only fencing (corosync). Nothing has helped so far -
this connection instability just seems to come and go.
Any better debugging ideas? Or maybe this is not a network issue at all?
Excerpt from DRBD configuration:
net {
    timeout 20;
    max-epoch-size 8192;
    max-buffers 128k;
    connect-int 2;
    ping-int 2;
    sndbuf-size 10M;
    rcvbuf-size 10M;
    ko-count 5;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    ping-timeout 2;
}
syncer {
    rate 100M;
    al-extents 3389;
    csums-alg crc32c;
    verify-alg crc32c;
}
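For reference, here is a small Python sketch of what the timing values above work out to in seconds. It assumes the units given in the drbd.conf man page for 8.3 (timeout and ping-timeout in tenths of a second, ping-int and connect-int in seconds, ko-count as a multiple of timeout); the function and key names are made up for illustration.

```python
# Timing values copied from the net section above.
NET = {"timeout": 20, "ping-timeout": 2, "ping-int": 2,
       "connect-int": 2, "ko-count": 5}


def effective_seconds(net):
    """Convert the DRBD net timing options to seconds.

    Assumed units (drbd.conf man page, 8.3): timeout and ping-timeout
    are tenths of a second; ping-int and connect-int are seconds.
    """
    timeout_s = net["timeout"] / 10
    return {
        "timeout_s": timeout_s,
        "ping_timeout_s": net["ping-timeout"] / 10,
        "ping_int_s": float(net["ping-int"]),
        "connect_int_s": float(net["connect-int"]),
        # Assumption: the peer is expelled after ko-count * timeout
        # without completing a write request.
        "ko_window_s": net["ko-count"] * timeout_s,
    }


if __name__ == "__main__":
    for name, value in effective_seconds(NET).items():
        print(f"{name}: {value}")
```

If those units are right, ping-timeout 2 means the peer has 0.2 seconds to answer a keep-alive packet.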
Here's a syslog snippet demonstrating one whole cycle of this behavior:
kernel: [ 9827.966027] block drbd6: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
kernel: [ 9828.199039] block drbd6: helper command: /sbin/drbdadm after-resync-target minor-6
crm-unfence-peer.sh[24132]: invoked for drbd-serv-mail
crm-unfence-peer.sh[24132]: WARNING drbd-fencing could not determine the master id of drbd resource drbd-serv-mail
kernel: [ 9828.238394] block drbd6: helper command: /sbin/drbdadm after-resync-target minor-6 exit code 1 (0x100)
kernel: [ 9828.298906] block drbd6: bitmap WRITE of 83 pages took 15 jiffies
kernel: [ 9828.503024] block drbd6: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: [ 9831.788745] block drbd6: magic?? on data m: 0xa0816800 c: 5120 l: 0
kernel: [ 9831.788790] block drbd6: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
kernel: [ 9831.789573] block drbd6: asender terminated
kernel: [ 9831.789576] block drbd6: Terminating drbd6_asender
kernel: [ 9832.041526] block drbd6: Connection closed
kernel: [ 9832.041531] block drbd6: conn( ProtocolError -> Unconnected )
kernel: [ 9832.041535] block drbd6: receiver terminated
kernel: [ 9832.041537] block drbd6: Restarting drbd6_receiver
kernel: [ 9832.041539] block drbd6: receiver (re)started
kernel: [ 9832.041542] block drbd6: conn( Unconnected -> WFConnection )
kernel: [ 9832.457266] block drbd6: Handshake successful: Agreed network protocol version 96
kernel: [ 9832.457276] block drbd6: conn( WFConnection -> WFReportParams )
kernel: [ 9832.457357] block drbd6: Starting asender thread (from drbd6_receiver [29943])
kernel: [ 9832.457733] block drbd6: data-integrity-alg: <not-used>
kernel: [ 9832.457745] block drbd6: drbd_sync_handshake:
kernel: [ 9832.457748] block drbd6: self E8E3BDC352C4C580:0000000000000000:71C7A5DE96C51226:71C6A5DE96C51227 bits:0 flags:0
kernel: [ 9832.457751] block drbd6: peer E915DF859DCA76C9:E8E3BDC352C4C581:71C7A5DE96C51227:71C6A5DE96C51227 bits:12 flags:0
kernel: [ 9832.457754] block drbd6: uuid_compare()=-1 by rule 50
kernel: [ 9832.457758] block drbd6: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
kernel: [ 9832.883300] block drbd6: conn( WFBitMapT -> WFSyncUUID )
kernel: [ 9832.987097] block drbd6: updated sync uuid E8E4BDC352C4C580:0000000000000000:71C7A5DE96C51226:71C6A5DE96C51227
kernel: [ 9833.141291] block drbd6: helper command: /sbin/drbdadm before-resync-target minor-6
kernel: [ 9833.158129] block drbd6: helper command: /sbin/drbdadm before-resync-target minor-6 exit code 0 (0x0)
kernel: [ 9833.158135] block drbd6: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
kernel: [ 9833.158141] block drbd6: Began resync as SyncTarget (will sync 52 KB [13 bits set]).
kernel: [ 9833.415551] block drbd6: Resync done (total 1 sec; paused 0 sec; 52 K/sec)
kernel: [ 9833.415554] block drbd6: 23 % had equal checksums, eliminated: 12K; transferred 40K total 52K
kernel: [ 9833.415558] block drbd6: updated UUIDs E915DF859DCA76C8:0000000000000000:E8E4BDC352C4C580:E8E3BDC352C4C581
kernel: [ 9833.415563] block drbd6: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
kernel: [ 9833.575311] block drbd6: helper command: /sbin/drbdadm after-resync-target minor-6
crm-unfence-peer.sh[24433]: invoked for drbd-serv-mail
crm-unfence-peer.sh[24433]: WARNING drbd-fencing could not determine the master id of drbd resource drbd-serv-mail
kernel: [ 9833.615746] block drbd6: helper command: /sbin/drbdadm after-resync-target minor-6 exit code 1 (0x100)
kernel: [ 9833.661043] block drbd6: bitmap WRITE of 84 pages took 11 jiffies
kernel: [ 9833.772319] block drbd6: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: [ 9851.333540] block drbd6: magic?? on data m: 0x80816700 c: 19201 l: 0
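
To line these drops up against other events (NIC counters, switch logs and so on), the connection state changes can be pulled out of a log like the one above with a few lines of Python. This is only a rough sketch: it assumes the trimmed format shown here (records starting with "kernel:" or a "name[pid]:" tag, with anything else being a wrapped continuation line), and the function names are made up for illustration.

```python
import re

# A new record starts with "kernel:" or a "name[pid]:" tag; any other line
# is treated as a continuation of the previous record (mail line wrapping).
RECORD_START = re.compile(r"^(kernel:|[\w.-]+\[\d+\]:)")
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")
CONN_CHANGE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def join_wrapped(lines):
    """Rejoin records that were wrapped across several lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def conn_transitions(lines):
    """Return (timestamp, old_state, new_state) for each conn() change."""
    result = []
    for record in join_wrapped(lines):
        ts = TIMESTAMP.search(record)
        change = CONN_CHANGE.search(record)
        if ts and change:
            result.append((float(ts.group(1)), change.group(1), change.group(2)))
    return result


SAMPLE = """\
kernel: [ 9831.788745] block drbd6: magic?? on data m: 0xa0816800 c:
5120 l: 0
kernel: [ 9831.788790] block drbd6: peer( Primary -> Unknown ) conn(
Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
kernel: [ 9832.041531] block drbd6: conn( ProtocolError -> Unconnected )
""".splitlines()

if __name__ == "__main__":
    for ts, old, new in conn_transitions(SAMPLE):
        print(f"{ts:.6f} {old} -> {new}")
```

A real syslog file has a date and hostname in front of each record, so RECORD_START would need adjusting before pointing this at /var/log/syslog.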