[DRBD-user] stuck in WFBitMapS / WFBitMapT

Alex Dean alex at crackpot.org
Tue May 6 02:05:43 CEST 2008


Sorry, forgot to include some important information...

These are Dell PowerEdge 2950s with 16GB RAM and RAID10 on a PERC5 controller,
running RHEL5.

[alexd@dellpe2950-23 ~]$ cat /proc/drbd
version: 8.0.12 (api:86/proto:86)
GIT-hash: 5c9f89594553e32adb87d9638dce591782f947e3 build by alexd@dellpe2950-23, 2008-05-01 09:44:22
  0: cs:WFBitMapT st:Secondary/Primary ds:Inconsistent/UpToDate C r---
     ns:0 nr:0 dw:0 dr:0 al:0 bm:154 lo:0 pe:0 ua:0 ap:0
         resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
         act_log: used:0/577 hits:0 misses:0 starving:0 dirty:0 changed:0
[alexd@dellpe2950-23 ~]$ uname -a
Linux dellpe2950-23 2.6.18-8.1.15.el5 #1 SMP Thu Oct 4 04:06:39 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

[alexd@dellpe2950-22 ~]$ cat /proc/drbd
version: 8.0.12 (api:86/proto:86)
GIT-hash: 5c9f89594553e32adb87d9638dce591782f947e3 build by alexd@dellpe2950-22, 2008-05-01 09:31:32
  0: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Inconsistent C r---
     ns:0 nr:0 dw:4 dr:81 al:1 bm:0 lo:0 pe:0 ua:0 ap:0
         resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
         act_log: used:0/577 hits:0 misses:1 starving:0 dirty:0 changed:1
[alexd@dellpe2950-22 ~]$ uname -a
Linux dellpe2950-22 2.6.18-8.1.15.el5 #1 SMP Thu Oct 4 04:06:39 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
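
To make the two status lines above easier to compare, here is a minimal
Python sketch that pulls the cs/st/ds fields out of a /proc/drbd device line.
It assumes the 8.0.x line layout shown in this message; the function names
(parse_status, waiting_for_bitmap) are made up for illustration and are not
part of DRBD.

```python
import re

# Matches the per-device line of /proc/drbd in the layout shown above, e.g.
# "  0: cs:WFBitMapT st:Secondary/Primary ds:Inconsistent/UpToDate C r---"
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+st:(?P<st>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_status(line):
    """Return the minor number, connection state, roles and disk states,
    or None if the line is not a device status line."""
    m = STATUS_RE.match(line)
    if m is None:
        return None
    role_local, role_peer = m.group("st").split("/")
    disk_local, disk_peer = m.group("ds").split("/")
    return {
        "minor": int(m.group("minor")),
        "cs": m.group("cs"),
        "role": (role_local, role_peer),   # (this node, peer)
        "disk": (disk_local, disk_peer),   # (this node, peer)
    }


def waiting_for_bitmap(status):
    """True while the device is in either bitmap-exchange state."""
    return status["cs"] in ("WFBitMapS", "WFBitMapT")


node23 = parse_status(
    "  0: cs:WFBitMapT st:Secondary/Primary ds:Inconsistent/UpToDate C r---"
)
node22 = parse_status(
    "  0: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Inconsistent C r---"
)
both_stuck = waiting_for_bitmap(node23) and waiting_for_bitmap(node22)
print(node23["cs"], node22["cs"], both_stuck)
```

Polling this every few seconds would show whether the pair ever leaves the
WFBitMapS/WFBitMapT states, which here they do not.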



alex at crackpot.org wrote:
> On a test cluster, I was trying to tune drbd.conf and entered a very large 
> value for sndbuf-size (1024).  I then used dd to create a 1GB test file.  
> After 30 min the command had still not completed, even though the file 
> was already the desired 1GB size and hadn't been updated in 27 min.
> 
> 23 was primary, 22 was secondary.
> 
> The manual says anything larger than 1M may cause problems, and in my 
> case it seems clear this is too large.  The trouble now is I cannot get 
> my cluster usable again.
> 
> I edited drbd.conf on both nodes to restore the previous sndbuf-size 
> value (128), but was unable to make this take effect on the current 
> primary.  (Sorry, I did not note down the exact error.  It was something 
> like 'took more than 5 seconds to complete'.)
> 
> I was unable to shut 23 down cleanly.  'shutdown' noted 'system going 
> down for reboot' in the syslog, and did nothing after that.  Forcibly 
> cycled the power.
> 
> I have rebooted both nodes.  The current primary is 22 (took over when 
> 23 rebooted).  I have been unable to get them to sync now, even after 
> invalidating the entire device on 23.  They are connected, but not 
> getting past the 'waiting for bit map' stage.  Seems the bitmap is 
> messed up in some respect.  I'm really unsure at this point how to 
> resolve this.  Any help is appreciated.
> 
> alex
> 
> May  5 15:25:21 dellpe2950-23 kernel: drbd0: short sent ReportState size=12 sent=0
> May  5 15:25:21 dellpe2950-23 kernel: drbd0: asender terminated
> May  5 15:25:21 dellpe2950-23 kernel: drbd0: Terminating asender thread
> May  5 15:25:21 dellpe2950-23 kernel: drbd0: tl_clear()
> May  5 15:25:21 dellpe2950-23 kernel: drbd0: Connection closed
> May  5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Timeout -> Unconnected )
> May  5 15:25:21 dellpe2950-23 kernel: drbd0: receiver terminated
> May  5 15:25:21 dellpe2950-23 kernel: drbd0: receiver (re)started
> May  5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Unconnected -> WFConnection )
> May  5 15:25:22 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> May  5 15:25:22 dellpe2950-23 kernel: drbd0: conn( WFConnection -> WFReportParams )
> May  5 15:25:22 dellpe2950-23 kernel: drbd0: Starting asender thread (from drbd0_receiver [6259])
> May  5 15:25:28 dellpe2950-23 kernel: drbd0: conn( WFReportParams -> Timeout )
> May  5 15:25:28 dellpe2950-23 kernel: drbd0: short sent ReportSizes size=40 sent=0
> May  5 15:25:34 dellpe2950-23 kernel: drbd0: short sent ReportUUIDs size=56 sent=0
> May  5 15:25:40 dellpe2950-23 kernel: drbd0: short sent ReportState size=12 sent=0
> 
> 
> May  5 15:27:20 dellpe2950-23 kernel: drbd0: State change failed: Can not start resync since it is already active
> May  5 15:27:20 dellpe2950-23 kernel: drbd0:   state = { cs:WFBitMapT st:Secondary/Primary ds:UpToDate/UpToDate r--- }
> May  5 15:27:20 dellpe2950-23 kernel: drbd0:  wanted = { cs:StartingSyncT st:Secondary/Primary ds:Inconsistent/UpToDate r--- }
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: peer( Primary -> Unknown ) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: error receiving ReportBitMap, l: 4088!
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: asender terminated
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating asender thread
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: tl_clear()
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: Connection closed
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: conn( Disconnecting -> StandAlone )
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: receiver terminated
> May  5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating receiver thread
> 
> 
> May  5 15:28:21 dellpe2950-23 kernel: drbd0: conn( StandAlone -> Unconnected )
> May  5 15:28:21 dellpe2950-23 kernel: drbd0: Starting receiver thread (from drbd0_worker [4416])
> May  5 15:28:21 dellpe2950-23 kernel: drbd0: receiver (re)started
> May  5 15:28:21 dellpe2950-23 kernel: drbd0: conn( Unconnected -> WFConnection )
> May  5 15:28:21 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> May  5 15:28:21 dellpe2950-23 kernel: drbd0: conn( WFConnection -> WFReportParams )
> May  5 15:28:21 dellpe2950-23 kernel: drbd0: Starting asender thread (from drbd0_receiver [6301])
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: Split-Brain detected, aborting!
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: self 99D56CF91187B3F4:8C1668A9CCF498F1:150E86C1B532DE51:FBA773E22A805495
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: peer C21D5DCBDE372E53:8C1668A9CCF498F0:150E86C1B532DE50:FBA773E22A805495
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: helper command: /sbin/drbdadm split-brain
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: conn( WFReportParams -> Disconnecting )
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: error receiving ReportState, l: 4!
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: asender terminated
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating asender thread
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: tl_clear()
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: Connection closed
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: conn( Disconnecting -> StandAlone )
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: receiver terminated
> May  5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating receiver thread
> May  5 15:28:57 dellpe2950-23 kernel: drbd0: disk( UpToDate -> Inconsistent )
> May  5 15:28:57 dellpe2950-23 kernel: drbd0: Queueing bitmap io: invalidate forced full sync
> May  5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
> May  5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
> May  5 15:28:57 dellpe2950-23 kernel: drbd0: writing of bitmap took 13 jiffies
> May  5 15:28:57 dellpe2950-23 kernel: drbd0: 259 GB (67774141 bits) marked out-of-sync by on disk bit-map.
> May  5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
> May  5 15:29:07 dellpe2950-23 kernel: drbd0: conn( StandAlone -> Unconnected )
> May  5 15:29:07 dellpe2950-23 kernel: drbd0: Starting receiver thread (from drbd0_worker [4416])
> May  5 15:29:07 dellpe2950-23 kernel: drbd0: receiver (re)started
> May  5 15:29:07 dellpe2950-23 kernel: drbd0: conn( Unconnected -> WFConnection )
> May  5 15:29:07 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> May  5 15:29:07 dellpe2950-23 kernel: drbd0: conn( WFConnection -> WFReportParams )
> May  5 15:29:07 dellpe2950-23 kernel: drbd0: Starting asender thread (from drbd0_receiver [6321])
> May  5 15:29:08 dellpe2950-23 kernel: drbd0: Becoming sync target due to disk states.
> May  5 15:29:08 dellpe2950-23 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> May  5 15:29:08 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
> 
> [root@dellpe2950-23]# cat /etc/drbd.conf
> resource drbd-resource-0 {
>   protocol C;
>   startup {
>     degr-wfc-timeout 5;
>   }
> 
>   net {
>     #on-disconnect reconnect;
>     after-sb-0pri disconnect;
>     after-sb-1pri disconnect;
>     max-buffers 4096;
>     unplug-watermark 128;
>     sndbuf-size 128;
>   }
> 
>   disk {
>     on-io-error detach;
>   }
> 
>   syncer {
>     rate 12M;
>     al-extents 577;
>   }
> 
>   on dellpe2950-22 {
>     device /dev/drbd0;
>     disk   /dev/sda7; # db partition
>     address 10.99.210.33:7789; # Private subnet IP
>     meta-disk internal;
>   }
> 
>   on dellpe2950-23 {
>     device /dev/drbd0;
>     disk   /dev/sda7;   # db partition
>     address 10.99.210.34:7789;  # Private subnet IP
>     meta-disk internal;
>   }
> }
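
The "Split-Brain detected" entry in the log logs a "self" and a "peer" UUID
set. The sketch below compares the two slot by slot. The slot names
(current, bitmap, history1, history2) are my assumption about the order in
which DRBD prints them, and the helper names are invented for this example.

```python
# UUID sets copied from the 15:28:22 log lines.
SELF = "99D56CF91187B3F4:8C1668A9CCF498F1:150E86C1B532DE51:FBA773E22A805495"
PEER = "C21D5DCBDE372E53:8C1668A9CCF498F0:150E86C1B532DE50:FBA773E22A805495"

# Assumed slot order; not confirmed by anything in this thread.
SLOTS = ("current", "bitmap", "history1", "history2")


def parse_uuids(text):
    """Split a colon-separated UUID set into a list of integers."""
    return [int(part, 16) for part in text.split(":")]


def compare(self_text, peer_text):
    """Classify each slot as identical, differing only in the lowest bit,
    or different."""
    rows = []
    pairs = zip(SLOTS, parse_uuids(self_text), parse_uuids(peer_text))
    for name, mine, theirs in pairs:
        if mine == theirs:
            verdict = "identical"
        elif (mine ^ theirs) == 1:
            verdict = "differ only in lowest bit"
        else:
            verdict = "different"
        rows.append((name, verdict))
    return rows


result = compare(SELF, PEER)
for name, verdict in result:
    print(f"{name:9s} {verdict}")
```

Only the first slot differs substantially; the second and third differ in
the lowest bit alone, and the fourth is identical. That would be consistent
with both nodes sharing their older history and each having generated a new
first UUID on its own, but that reading is an interpretation, not something
the log states.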

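As a sanity check on the "259 GB (67774141 bits) marked out-of-sync" line in
the log, here is the arithmetic as a short Python sketch. It assumes each
bitmap bit covers 4 KiB and that "rate 12M" in the syncer section means
12 MiB per second; neither is stated in this thread.

```python
BITS_OUT_OF_SYNC = 67774141        # from the 15:28:57 log line
BYTES_PER_BIT = 4096               # assumption: one bitmap bit = 4 KiB
SYNC_RATE_BYTES = 12 * 1024 ** 2   # assumption: "rate 12M" = 12 MiB/s

out_of_sync_bytes = BITS_OUT_OF_SYNC * BYTES_PER_BIT
out_of_sync_gib = out_of_sync_bytes / 1024 ** 3

# Time for a full resync if the configured rate were sustained throughout.
full_sync_hours = out_of_sync_bytes / SYNC_RATE_BYTES / 3600

print(f"{out_of_sync_gib:.1f} GiB out of sync")
print(f"about {full_sync_hours:.1f} h at the configured rate")
```

Under those assumptions the size rounds to the 259 GB the kernel reported,
and a full resync would take roughly six hours once it actually starts.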
