[DRBD-user] Digest mismatch resulting in "split brain" after (!) automatic reconnect

Lars Ellenberg lars.ellenberg at linbit.com
Mon Feb 21 10:36:33 CET 2011



On Mon, Feb 21, 2011 at 10:24:13AM +0100, Lars Ellenberg wrote:
> On Mon, Feb 21, 2011 at 10:02:30AM +0100, Raoul Bhatia [IPAX] wrote:
> > hi,
> > 
> > after a couple of days, i can tell that i do not see the described
> > problem with
> > drbd 8.3.7 and kernel 2.6.32-bpo.5-amd64
> > (backports from squeeze to debian lenny)
> > 
> > > root at c02n01 ~ # cat /proc/drbd
> > > version: 8.3.7 (api:88/proto:86-91)
> > > srcversion: EE47D8BF18AC166BE219757
> > 
> > 
> > taking a closer look, i also do not see the original error message
> > anymore: (Digest mismatch, buffer modified by upper layers during write:
> > 0s +4096)
> 
> we changed the log message; more precisely, we added the ability to
> distinguish between detecting a mismatch on the receiving end (already
> possible before) and detecting a mismatch on the sending end as well
> (previously not checked).
> 
> > instead, i now see dmesg like:
> > > [197080.750826] block drbd1: Digest integrity check FAILED.
> > > [197080.750871] block drbd1: error receiving Data, l: 4136!
> > > [197080.750905] block drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) 
> > > [197080.750977] block drbd1: asender terminated
> > 
> > however, the devices correctly get back in sync.
> > 
> > i'll additionally run a manual verify later on and will report back.
> > 
> > lars: were you able to extract the logfiles from my original post?
> 
> The logs of your original post are completely boring.

No, wait.
They are not ;-)

Feb 16 06:25:03 c02n01 kernel: [3687390.120354] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Feb 16 06:25:03 c02n01 kernel: [3687390.120362] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
Feb 16 06:25:03 c02n01 kernel: [3687390.120797] block drbd1: updated sync UUID 3C1DADF6B38C1AD7:E7E50184F3F3AC0B:E7E40184F3F3AC0B:3CFC3B16AAE1131D
Feb 16 06:25:03 c02n01 kernel: [3687390.131787] block drbd1: Retrying drbd_rs_del_all() later. refcnt=1
Feb 16 06:25:04 c02n01 kernel: [3687390.232237] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
Feb 16 06:25:04 c02n01 kernel: [3687390.232314] block drbd1: updated UUIDs 3C1DADF6B38C1AD7:0000000000000000:E7E50184F3F3AC0B:E7E40184F3F3AC0B
Feb 16 06:25:04 c02n01 kernel: [3687390.232434] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Feb 16 06:25:04 c02n01 kernel: [3687390.274089] block drbd1: bitmap WRITE of 762 pages took 10 jiffies
Feb 16 06:25:04 c02n01 kernel: [3687390.274154] block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.

Feb 16 06:25:04 c02n01 kernel: [3687390.947353] block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 1 (0x100)
Feb 16 06:25:04 c02n01 kernel: [3687390.947487] block drbd1: fence-peer helper broken, returned 1

Fix your fence-peer helper: it exited with 1, which DRBD reports as "broken",
and that may be the cause of the trouble there.
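
To make the "broken" verdict concrete, here is a rough sketch of how the two
log lines above can be read. The exit-code meanings are quoted from memory of
the DRBD 8.3 drbd.conf man page, so treat the table as an assumption and check
it against your version. The function names are made up for illustration and
are not part of DRBD. The "(0x100)" in the log looks like the raw wait status,
with the exit code in the upper byte.

```python
# Exit codes a DRBD 8.3 fence-peer handler is expected to return.
# ASSUMPTION: meanings recalled from drbd.conf(5); verify for your version.
FENCE_PEER_EXIT_CODES = {
    3: "peer's disk is Inconsistent",
    4: "peer's disk is Outdated (or was outdated just now)",
    5: "peer could not be reached",
    6: "peer refused to be outdated, it is Primary",
    7: "peer was fenced off the cluster (STONITH)",
}


def decode_wait_status(status):
    """Extract the exit code from a raw wait status such as 0x100."""
    return (status >> 8) & 0xFF


def classify_fence_peer(exit_code):
    """Describe a fence-peer exit code; anything unknown counts as broken."""
    return FENCE_PEER_EXIT_CODES.get(
        exit_code, "fence-peer helper broken, returned %d" % exit_code
    )


if __name__ == "__main__":
    code = decode_wait_status(0x100)
    print(code, "->", classify_fence_peer(code))
```

A handler that fails for its own reasons (missing binary, shell error, and so
on) will typically exit with 1 or another small value outside that table,
which is what the log shows here.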

Feb 16 06:25:04 c02n01 kernel: [3687390.947555] block drbd1: pdsk( UpToDate -> DUnknown )

This should not have happened, either:
We must not change the pdsk state to DUnknown while keeping conn state at Connected.
That's nonsense.
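
If you want to look for the same pattern in your own logs, something along
these lines may help. It is only a sketch: it assumes the transition format
seen in the log lines above ("name( Old -> New )"), tracks the last known
value of each field for a single device, and flags any line after which conn
is Connected while pdsk is DUnknown. The function name and the optional
initial-state parameter are hypothetical.

```python
import re

# Matches fragments like "conn( Connected -> ProtocolError )".
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


def find_bad_states(lines, initial=None):
    """Yield lines after which conn is Connected but pdsk is DUnknown."""
    state = dict(initial or {})
    for line in lines:
        changes = TRANSITION.findall(line)
        if not changes:
            continue  # not a state-change line
        for field, _old, new in changes:
            state[field] = new
        if state.get("conn") == "Connected" and state.get("pdsk") == "DUnknown":
            yield line


if __name__ == "__main__":
    log = [
        "block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )",
        "block drbd1: pdsk( UpToDate -> DUnknown )",
    ]
    for hit in find_bad_states(log):
        print("suspicious:", hit)
```

A transition that drops the connection in the same line, like the
ProtocolError one quoted further up, is not flagged, because conn is no
longer Connected once that line has been applied.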

Feb 16 06:25:04 c02n01 kernel: [3687390.947633] block drbd1: new current UUID 89084B22FE454C03:3C1DADF6B38C1AD7:E7E50184F3F3AC0B:E7E40184F3F3AC0B 

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


