[DRBD-user] Sporadic splitbrain occurs after a week or more

DooMRunneR doomrunner.lists at gmail.com
Tue Jan 11 08:41:02 CET 2011


Hello,

I'm new to DRBD and have been running into split-brain situations roughly once a week.

I have a standard Active/Passive setup, as described in the DRBD
documentation, and this is my config file:

global {
     usage-count no;
}
common {
     syncer { rate 100M; }
     protocol      C;
}
resource mailstore {
     startup {
        wfc-timeout 0;
        degr-wfc-timeout 120;
     }
     disk { on-io-error detach; no-disk-barrier; }
     on primary.axigen.cluster {
        device      /dev/drbd1;
        disk        /dev/axigen/mailstore;
        address     192.168.1.10:7791;
        meta-disk   internal;
     }
     on secondary.axigen.cluster {
        device      /dev/drbd1;
        disk        /dev/axigen/mailstore;
        address     192.168.1.20:7791;
        meta-disk   internal;
     }
}

After some days, my cluster always ends up in the following state:

[root@primary ~]# drbd-overview
   1:mailstore  StandAlone Primary/Unknown UpToDate/DUnknown r---- /var/opt/axigen ext3 99G 298M 98G 1%

[root@secondary ~]# drbd-overview
   1:mailstore  StandAlone Secondary/Unknown Outdated/DUnknown r----
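
For anyone who wants to watch for this state automatically, the two status lines above can be split into named fields. This is only a minimal sketch: the function name `parse_overview` and the field names are made up for illustration, and it assumes the column order shown in the output above (minor:resource, connection state, roles, disk states).

```python
def parse_overview(line):
    """Split one drbd-overview line into named fields (assumed column order)."""
    fields = line.split()
    minor, _, resource = fields[0].partition(":")
    local_role, _, peer_role = fields[2].partition("/")
    local_disk, _, peer_disk = fields[3].partition("/")
    return {
        "minor": int(minor),
        "resource": resource,
        "cstate": fields[1],
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }


# The two lines from the output above, joined onto one line each.
primary = parse_overview(
    "  1:mailstore  StandAlone Primary/Unknown UpToDate/DUnknown r---- "
    "/var/opt/axigen ext3 99G 298M 98G 1%"
)
secondary = parse_overview(
    "  1:mailstore  StandAlone Secondary/Unknown Outdated/DUnknown r----"
)

for name, state in (("primary", primary), ("secondary", secondary)):
    print(name, state["cstate"], state["local_role"], state["local_disk"])
```

Both nodes report StandAlone here, meaning neither side is trying to reconnect any more.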

I get the following messages in dmesg on my primary node:

[root@primary ~]# dmesg | grep block
drbd: registered as block device major 147
block drbd1: Starting worker thread (from cqueue/1 [195])
block drbd1: disk( Diskless -> Attaching )
block drbd1: Found 4 transactions (23 active extents) in activity log.
block drbd1: Method to ensure write ordering: flush
block drbd1: max_segment_size ( = BIO size ) = 32768
block drbd1: drbd_bm_resize called with capacity == 209708728
block drbd1: resync bitmap: bits=26213591 words=409588
block drbd1: size = 100 GB (104854364 KB)
block drbd1: recounting of set bits took additional 2 jiffies
block drbd1: 184 KB (46 bits) marked out-of-sync by on disk bit-map.
block drbd1: disk( Attaching -> UpToDate )
block drbd1: Barriers not supported on meta data device - disabling
block drbd1: conn( StandAlone -> Unconnected )
block drbd1: Starting receiver thread (from drbd1_worker [3504])
block drbd1: receiver (re)started
block drbd1: conn( Unconnected -> WFConnection )
block drbd1: Handshake successful: Agreed network protocol version 94
block drbd1: conn( WFConnection -> WFReportParams )
block drbd1: Starting asender thread (from drbd1_receiver [3517])
block drbd1: data-integrity-alg: <not-used>
block drbd1: drbd_sync_handshake:
block drbd1: self B0A76171352A5A3C:C23C1E60BAF36299:CDF18AFFD009B9F5:7F938649A9B876DD bits:46 flags:0
block drbd1: peer C23C1E60BAF36298:0000000000000000:CDF18AFFD009B9F4:7F938649A9B876DD bits:0 flags:0
block drbd1: uuid_compare()=1 by rule 70
block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
block drbd1: Began resync as SyncSource (will sync 184 KB [46 bits set]).
block drbd1: Resync done (total 1 sec; paused 0 sec; 184 K/sec)
block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
block drbd1: peer( Secondary -> Primary )
block drbd1: peer( Primary -> Secondary )
block drbd1: role( Secondary -> Primary )
block drbd1: sock was shut down by peer
block drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
block drbd1: short read expecting header on sock: r=0
block drbd1: asender terminated
block drbd1: Terminating asender thread
block drbd1: Creating new current UUID
block drbd1: Connection closed
block drbd1: conn( BrokenPipe -> Unconnected )
block drbd1: receiver terminated
block drbd1: Restarting receiver thread
block drbd1: receiver (re)started
block drbd1: conn( Unconnected -> WFConnection )
block drbd1: Handshake successful: Agreed network protocol version 94
block drbd1: conn( WFConnection -> WFReportParams )
block drbd1: Starting asender thread (from drbd1_receiver [3517])
block drbd1: data-integrity-alg: <not-used>
block drbd1: drbd_sync_handshake:
block drbd1: self 5A7A31FAC38B4C31:B0A76171352A5A3D:28B6F22789E39014:C23C1E60BAF36299 bits:0 flags:0
block drbd1: peer B0A76171352A5A3C:0000000000000000:28B6F22789E39014:C23C1E60BAF36299 bits:0 flags:0
block drbd1: uuid_compare()=1 by rule 70
block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
block drbd1: Began resync as SyncSource (will sync 0 KB [0 bits set]).
block drbd1: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
block drbd1: sock was shut down by peer
block drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
block drbd1: short read expecting header on sock: r=0
block drbd1: asender terminated
block drbd1: Terminating asender thread
block drbd1: Creating new current UUID
block drbd1: Connection closed
block drbd1: conn( BrokenPipe -> Unconnected )
block drbd1: receiver terminated
block drbd1: Restarting receiver thread
block drbd1: receiver (re)started
block drbd1: conn( Unconnected -> WFConnection )
block drbd1: State change failed: Need access to UpToDate data
block drbd1:   state = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
block drbd1:  wanted = { cs:WFConnection ro:Primary/Unknown ds:Outdated/DUnknown r--- }
block drbd1: role( Primary -> Secondary )
block drbd1: role( Secondary -> Primary )
block drbd1: Handshake successful: Agreed network protocol version 94
block drbd1: conn( WFConnection -> WFReportParams )
block drbd1: Starting asender thread (from drbd1_receiver [3517])
block drbd1: data-integrity-alg: <not-used>
block drbd1: drbd_sync_handshake:
block drbd1: self F601A659931B404D:5A7A31FAC38B4C31:F52C25D8178F32E6:B0A76171352A5A3D bits:2 flags:0
block drbd1: peer CDEEDF66F24D3ECC:5A7A31FAC38B4C30:F52C25D8178F32E6:B0A76171352A5A3D bits:1132 flags:0
block drbd1: uuid_compare()=100 by rule 90
block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
block drbd1: Split-Brain detected but unresolved, dropping connection!
block drbd1: helper command: /sbin/drbdadm split-brain minor-1
block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
block drbd1: conn( WFReportParams -> Disconnecting )
block drbd1: error receiving ReportState, l: 4!
block drbd1: asender terminated
block drbd1: Terminating asender thread
block drbd1: Connection closed
block drbd1: conn( Disconnecting -> StandAlone )
block drbd1: receiver terminated
block drbd1: Terminating receiver thread
[root@primary ~]#
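
To make the last handshake in the primary's log easier to read, the two UUID sets can be compared slot by slot. The sketch below is only my reading of the output and rests on two assumptions that the log itself does not state: that the four colon-separated fields are the current UUID, the bitmap UUID, and two history UUIDs in that order, and that the lowest bit of each value is a flag and not part of the UUID's identity. The helper names are made up for illustration.

```python
LOW_BIT = 1  # assumption: the lowest bit is a flag, not part of the identity


def parse_uuids(text):
    """Turn 'A:B:C:D' (hex) into a list of four integers."""
    return [int(part, 16) for part in text.split(":")]


def same_generation(a, b):
    """Compare two UUID values while ignoring the lowest bit."""
    return (a & ~LOW_BIT) == (b & ~LOW_BIT)


# Values copied from the final drbd_sync_handshake on the primary.
self_uuids = parse_uuids(
    "F601A659931B404D:5A7A31FAC38B4C31:F52C25D8178F32E6:B0A76171352A5A3D"
)
peer_uuids = parse_uuids(
    "CDEEDF66F24D3ECC:5A7A31FAC38B4C30:F52C25D8178F32E6:B0A76171352A5A3D"
)

CURRENT, BITMAP = 0, 1  # assumed slot order: current, bitmap, history, history

current_differs = not same_generation(self_uuids[CURRENT], peer_uuids[CURRENT])
bitmap_matches = same_generation(self_uuids[BITMAP], peer_uuids[BITMAP])

print("current UUIDs differ:", current_differs)
print("bitmap UUIDs match:  ", bitmap_matches)
```

Under those assumptions, the two nodes have different current UUIDs but the same bitmap UUID, which is the situation the log reports as "uuid_compare()=100 by rule 90" right before the split-brain message.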

And on my secondary:

[root@secondary ~]# dmesg | grep block
drbd: registered as block device major 147
block drbd1: Starting worker thread (from cqueue/1 [195])
block drbd1: disk( Diskless -> Attaching )
block drbd1: Found 4 transactions (46 active extents) in activity log.
block drbd1: Method to ensure write ordering: flush
block drbd1: max_segment_size ( = BIO size ) = 32768
block drbd1: drbd_bm_resize called with capacity == 209708728
block drbd1: resync bitmap: bits=26213591 words=409588
block drbd1: size = 100 GB (104854364 KB)
block drbd1: recounting of set bits took additional 3 jiffies
block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd1: disk( Attaching -> UpToDate )
block drbd1: Barriers not supported on meta data device - disabling
block drbd1: conn( StandAlone -> Unconnected )
block drbd1: Starting receiver thread (from drbd1_worker [3527])
block drbd1: receiver (re)started
block drbd1: conn( Unconnected -> WFConnection )
block drbd1: Handshake successful: Agreed network protocol version 94
block drbd1: conn( WFConnection -> WFReportParams )
block drbd1: Starting asender thread (from drbd1_receiver [3538])
block drbd1: data-integrity-alg: <not-used>
block drbd1: drbd_sync_handshake:
block drbd1: self C23C1E60BAF36298:0000000000000000:CDF18AFFD009B9F4:7F938649A9B876DD bits:0 flags:0
block drbd1: peer B0A76171352A5A3C:C23C1E60BAF36299:CDF18AFFD009B9F5:7F938649A9B876DD bits:46 flags:0
block drbd1: uuid_compare()=-1 by rule 50
block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
block drbd1: conn( WFBitMapT -> WFSyncUUID )
block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
block drbd1: Began resync as SyncTarget (will sync 184 KB [46 bits set]).
block drbd1: Resync done (total 1 sec; paused 0 sec; 184 K/sec)
block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 0 (0x0)
block drbd1: role( Secondary -> Primary )
block drbd1: role( Primary -> Secondary )
block drbd1: peer( Secondary -> Primary )
block drbd1: PingAck did not arrive in time.
block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
block drbd1: asender terminated
block drbd1: Terminating asender thread
block drbd1: short read expecting header on sock: r=-512
block drbd1: Connection closed
block drbd1: conn( NetworkFailure -> Unconnected )
block drbd1: receiver terminated
block drbd1: Restarting receiver thread
block drbd1: receiver (re)started
block drbd1: conn( Unconnected -> WFConnection )
block drbd1: Handshake successful: Agreed network protocol version 94
block drbd1: conn( WFConnection -> WFReportParams )
block drbd1: Starting asender thread (from drbd1_receiver [3538])
block drbd1: data-integrity-alg: <not-used>
block drbd1: drbd_sync_handshake:
block drbd1: self B0A76171352A5A3C:0000000000000000:28B6F22789E39014:C23C1E60BAF36299 bits:0 flags:0
block drbd1: peer 5A7A31FAC38B4C31:B0A76171352A5A3D:28B6F22789E39014:C23C1E60BAF36299 bits:0 flags:0
block drbd1: uuid_compare()=-1 by rule 50
block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
block drbd1: conn( WFBitMapT -> WFSyncUUID )
block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
block drbd1: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 0 (0x0)
block drbd1: Connected in w_make_resync_request
block drbd1: PingAck did not arrive in time.
block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
block drbd1: asender terminated
block drbd1: Terminating asender thread
block drbd1: short read expecting header on sock: r=-512
block drbd1: Connection closed
block drbd1: conn( NetworkFailure -> Unconnected )
block drbd1: receiver terminated
block drbd1: Restarting receiver thread
block drbd1: receiver (re)started
block drbd1: role( Secondary -> Primary )
block drbd1: Creating new current UUID
block drbd1: conn( Unconnected -> WFConnection )
block drbd1: role( Primary -> Secondary )
block drbd1: disk( UpToDate -> Outdated )
block drbd1: Handshake successful: Agreed network protocol version 94
block drbd1: conn( WFConnection -> WFReportParams )
block drbd1: Starting asender thread (from drbd1_receiver [3538])
block drbd1: data-integrity-alg: <not-used>
block drbd1: drbd_sync_handshake:
block drbd1: self CDEEDF66F24D3ECC:5A7A31FAC38B4C30:F52C25D8178F32E6:B0A76171352A5A3D bits:1132 flags:0
block drbd1: peer F601A659931B404D:5A7A31FAC38B4C31:F52C25D8178F32E6:B0A76171352A5A3D bits:2 flags:0
block drbd1: uuid_compare()=100 by rule 90
block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
block drbd1: meta connection shut down by peer.
block drbd1: conn( WFReportParams -> NetworkFailure )
block drbd1: asender terminated
block drbd1: Terminating asender thread
block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
block drbd1: Split-Brain detected but unresolved, dropping connection!
block drbd1: helper command: /sbin/drbdadm split-brain minor-1
block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
block drbd1: conn( NetworkFailure -> Disconnecting )
block drbd1: error receiving ReportState, l: 4!
block drbd1: Connection closed
block drbd1: conn( Disconnecting -> StandAlone )
block drbd1: receiver terminated
block drbd1: Terminating receiver thread
[root@secondary ~]#
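
Since the full logs are long, a small filter can reduce them to role changes, connection changes, new current UUIDs, and the split-brain message. This is a rough sketch, not a DRBD tool: the function name `interesting` and the event labels are made up, and the patterns are only assumed to match the message format shown in the logs above. The sample lines are an excerpt from the secondary's log, where the node briefly becomes Primary and creates a new current UUID while the connection is down, which may be where the two data sets diverge.

```python
import re

# Patterns assumed from the message format in the logs above.
ROLE = re.compile(r"\brole\( (\w+) -> (\w+) \)")
CONN = re.compile(r"\bconn\( (\w+) -> (\w+) \)")


def interesting(lines):
    """Yield (kind, detail) tuples for events that seem relevant here."""
    for line in lines:
        match = ROLE.search(line)
        if match:
            yield ("role", f"{match.group(1)} -> {match.group(2)}")
        match = CONN.search(line)
        if match:
            yield ("conn", f"{match.group(1)} -> {match.group(2)}")
        if "Creating new current UUID" in line:
            yield ("uuid", "new current UUID")
        if "Split-Brain detected" in line:
            yield ("split-brain", "detected")


# Excerpt from the secondary's dmesg output.
sample = """\
block drbd1: conn( NetworkFailure -> Unconnected )
block drbd1: role( Secondary -> Primary )
block drbd1: Creating new current UUID
block drbd1: conn( Unconnected -> WFConnection )
block drbd1: role( Primary -> Secondary )
block drbd1: disk( UpToDate -> Outdated )
block drbd1: Split-Brain detected but unresolved, dropping connection!
""".splitlines()

events = list(interesting(sample))
for kind, detail in events:
    print(f"{kind:12s} {detail}")
```

To run it against a live system, the lines could be taken from `dmesg | grep block` instead of the embedded sample.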



Could anyone please explain to me what is happening here?

Thanks in advance

Anton