[DRBD-user] cs:WFBitMapT/cs:WFBitMapS status remains, no sync

Lars Ellenberg lars.ellenberg at linbit.com
Wed May 6 16:55:10 CEST 2009


On Wed, May 06, 2009 at 11:37:13AM +0200, Ger Apeldoorn wrote:
> Hi,
> 
> Problem:
>     cs:WFBitMapT/cs:WFBitMapS status remains after changes in network. 
> (Not syncing) (More detailed description below)
> 
> Environment:
>     Packages:
>         drbd82-8.2.6-1.el5.centos
>         kmod-drbd82-8.2.6-2  
> 
>     OS:
>         Linux node01 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 
> x86_64 x86_64 x86_64 GNU/Linux
> 
>     /proc/drbd:
>        Primary:
> --------------------%<--------------------
> version: 8.2.6 (api:88/proto:86-88)
> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by 
> buildsvn at c5-x8664-build, 2008-10-03 11:30:17
>  0: cs:WFBitMapS st:Secondary/Secondary ds:UpToDate/Outdated C r---
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:380
> --------------------%<--------------------
>        Secondary:
> --------------------%<--------------------
> version: 8.2.6 (api:88/proto:86-88)
> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by 
> buildsvn at c5-x8664-build, 2008-10-03 11:30:17
>  0: cs:WFBitMapT st:Secondary/Secondary ds:Outdated/UpToDate C r---
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:72

hmmm.

>     tail -n /var/log/messages | grep -i drbd
>        Primary:
> --------------------%<--------------------
> May  6 10:36:22 node01 kernel: drbd0: Split-Brain detected, dropping 
> connection!
> May  6 10:36:22 node01 kernel: drbd0: self 
> A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
> May  6 10:36:22 node01 kernel: drbd0: peer 
> CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
> May  6 10:36:22 node01 kernel: drbd0: helper command: /sbin/drbdadm 
> split-brain
> May  6 10:36:22 node01 kernel: drbd0: conn( WFReportParams -> 
> Disconnecting )
> May  6 10:36:22 node01 kernel: drbd0: error receiving ReportState, l: 4!
> May  6 10:36:22 node01 kernel: drbd0: asender terminated
> May  6 10:36:22 node01 kernel: drbd0: Terminating asender thread
> May  6 10:36:22 node01 kernel: drbd0: tl_clear()
> May  6 10:36:22 node01 kernel: drbd0: Connection closed
> May  6 10:36:22 node01 kernel: drbd0: conn( Disconnecting -> StandAlone )
> May  6 10:36:22 node01 kernel: drbd0: receiver terminated
> May  6 10:36:22 node01 kernel: drbd0: Terminating receiver thread
> May  6 10:40:43 node01 kernel: drbd0: conn( StandAlone -> Unconnected )
> May  6 10:40:43 node01 kernel: drbd0: Starting receiver thread (from 
> drbd0_worker [16858])
> May  6 10:40:43 node01 kernel: drbd0: receiver (re)started
> May  6 10:40:43 node01 kernel: drbd0: conn( Unconnected -> WFConnection )
> May  6 10:40:53 node01 kernel: drbd0: Handshake successful: Agreed 
> network protocol version 88
> May  6 10:40:53 node01 kernel: drbd0: conn( WFConnection -> 
> WFReportParams )
> May  6 10:40:53 node01 kernel: drbd0: Starting asender thread (from 
> drbd0_receiver [17038])
> May  6 10:40:53 node01 kernel: drbd0: data-integrity-alg: <not-used>
> May  6 10:40:53 node01 kernel: drbd0: Split-Brain detected, manually 
> solved. Sync from this node
> May  6 10:40:53 node01 kernel: drbd0: peer( Unknown -> Secondary ) conn( 
> WFReportParams -> WFBitMapS )
> May  6 10:40:53 node01 kernel: drbd0: Writing meta data super block now.
> May  6 10:41:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg 
> time expired, ko = 4294967295

the bitmap exchange blocks here:
this node's send times out, so apparently the other node does not receive.

but:

>        Secondary:
> --------------------%<--------------------
> May  6 10:36:22 node02 kernel: drbd0: Split-Brain detected, dropping 
> connection!
> May  6 10:36:22 node02 kernel: drbd0: self 
> CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
> May  6 10:36:22 node02 kernel: drbd0: peer 
> A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
> May  6 10:36:22 node02 kernel: drbd0: helper command: /sbin/drbdadm 
> split-brain
> May  6 10:36:22 node02 kernel: drbd0: conn( WFReportParams -> 
> Disconnecting )
> May  6 10:36:22 node02 kernel: drbd0: error receiving ReportState, l: 4!
> May  6 10:36:22 node02 kernel: drbd0: asender terminated
> May  6 10:36:22 node02 kernel: drbd0: Terminating asender thread
> May  6 10:36:22 node02 kernel: drbd0: tl_clear()
> May  6 10:36:22 node02 kernel: drbd0: Connection closed
> May  6 10:36:22 node02 kernel: drbd0: conn( Disconnecting -> StandAlone )
> May  6 10:36:22 node02 kernel: drbd0: receiver terminated
> May  6 10:36:22 node02 kernel: drbd0: Terminating receiver thread
> May  6 10:40:53 node02 kernel: drbd0: conn( StandAlone -> Unconnected )
> May  6 10:40:53 node02 kernel: drbd0: Starting receiver thread (from 
> drbd0_worker [13338])
> May  6 10:40:53 node02 kernel: drbd0: receiver (re)started
> May  6 10:40:53 node02 kernel: drbd0: conn( Unconnected -> WFConnection )
> May  6 10:40:53 node02 kernel: drbd0: Handshake successful: Agreed 
> network protocol version 88
> May  6 10:40:53 node02 kernel: drbd0: conn( WFConnection -> 
> WFReportParams )
> May  6 10:40:53 node02 kernel: drbd0: Starting asender thread (from 
> drbd0_receiver [14123])
> May  6 10:40:53 node02 kernel: drbd0: data-integrity-alg: <not-used>
> May  6 10:40:53 node02 kernel: drbd0: Split-Brain detected, manually 
> solved. Sync from peer node
> May  6 10:40:53 node02 kernel: drbd0: peer( Unknown -> Secondary ) conn( 
> WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> May  6 10:40:53 node02 kernel: drbd0: Writing meta data super block now.
> --------------------%<--------------------

it says it is just fine,
and waiting for the bitmap to be sent over.

strange.

>    Configuration:
> --------------------%<--------------------
> # /usr/share/doc/drbd82/drbd.conf
> #
> common {
>         syncer { rate 100M; }
>         protocol C;
> }
> 
> resource drbd0 {
>         on node01 {
>                 device /dev/drbd0;
>                 disk /dev/datavg/drbdlv;
>                 meta-disk internal;
>                 address 192.168.1.1:8888;
>         }
> 
>         on node02 {
>                 device /dev/drbd0;
>                 disk /dev/datavg/drbdlv;
>                 meta-disk internal;
>                 address 192.168.1.2:8888;
>         }
> }
> --------------------%<--------------------
> 
> 
> It was working properly whilst on a test-network. Now that it is 
> connected to the production network, the nodes can see each other, but 
> syncing does not work anymore.
> No firewalls are in the way.
> I use a separate connection (over eth1) for the drbd sync.

could this be some MTU mismatch issue, so that small packets get through,
but larger ones get dropped?
try a flood ping with increasing packet sizes.
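
a minimal sketch of such a size sweep, assuming Linux iputils ping
(-M do sets the "don't fragment" bit, so oversized packets fail instead
of being fragmented silently). the function names payload_for_mtu and
probe_sizes are made up for illustration, and it sends a few normal
pings per size rather than a flood:

```shell
#!/bin/sh
# sketch only: probe which ICMP payload sizes survive the replication
# link when fragmentation is forbidden (Linux iputils ping assumed).

# largest ICMP payload that fits into one IPv4 packet of the given MTU:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# probe_sizes PEER SIZE...
# PEER is the peer's address on the replication link, SIZE an ICMP
# payload size in bytes. prints one "ok" or "FAILED" line per size.
probe_sizes() {
    peer=$1
    shift
    for size in "$@"; do
        if ping -M do -c 3 -W 2 -q -s "$size" "$peer" >/dev/null 2>&1; then
            echo "payload $size: ok"
        else
            echo "payload $size: FAILED"
        fi
    done
}
```

on node01 this could be called as
probe_sizes 192.168.1.2 56 512 1000 1400 "$(payload_for_mtu 1500)"
and the other way round from node02. if the small sizes are ok but the
larger ones fail, that points to an MTU or fragmentation problem on the
path. for an actual flood, run ping -f -s SIZE as root.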

> I try to get rid of the split-brain by issuing: (as can be seen in the logs)
> --------------------%<--------------------
> drbdadm -- --discard-my-data connect drbd0
> --------------------%<--------------------
> 
> How do I get out of this state and get the sync up and running again?


you should double check your network connection.
if the network can be ruled out, and you 
can reproduce it, try to reproduce with drbd 8.3.

if using the latest drbd 8.3 gets rid of the problem,
you may have hit some race condition/corner case
that has been fixed in DRBD in the meantime,
especially if with your current
version it only happens "sometimes".

if that does not help, tracking this down may get tricky.
you may need to check out the LINBIT support offerings.
well, you should do so anyway ;-)


hope that helps,

-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


