Sorry, forgot to include some important information...
These are Dell PowerEdge 2950s. 16GB RAM. RAID10 w/ PERC5 controller.
Using RHEL5 OS.
[alexd@dellpe2950-23 ~]$ cat /proc/drbd
version: 8.0.12 (api:86/proto:86)
GIT-hash: 5c9f89594553e32adb87d9638dce591782f947e3 build by alexd@dellpe2950-23, 2008-05-01 09:44:22
0: cs:WFBitMapT st:Secondary/Primary ds:Inconsistent/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:154 lo:0 pe:0 ua:0 ap:0
resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/577 hits:0 misses:0 starving:0 dirty:0 changed:0
[alexd@dellpe2950-23 ~]$ uname -a
Linux dellpe2950-23 2.6.18-8.1.15.el5 #1 SMP Thu Oct 4 04:06:39 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
[alexd@dellpe2950-22 ~]$ cat /proc/drbd
version: 8.0.12 (api:86/proto:86)
GIT-hash: 5c9f89594553e32adb87d9638dce591782f947e3 build by alexd@dellpe2950-22, 2008-05-01 09:31:32
0: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Inconsistent C r---
ns:0 nr:0 dw:4 dr:81 al:1 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/577 hits:0 misses:1 starving:0 dirty:0 changed:1
[alexd@dellpe2950-22 ~]$ uname -a
Linux dellpe2950-22 2.6.18-8.1.15.el5 #1 SMP Thu Oct 4 04:06:39 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
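
For anyone comparing the two nodes at a glance, the status lines above can be split into their fields. This is only a minimal Python sketch, assuming the 8.0-style /proc/drbd layout shown in this message; parse_drbd_status is a hypothetical helper name, not part of DRBD.

```python
import re

# Assumption: the status line carries cs:, st: and ds: fields in this order,
# as in the DRBD 8.0.12 output pasted above.
STATUS_RE = re.compile(r"cs:(?P<cs>\S+)\s+st:(?P<st>\S+)\s+ds:(?P<ds>\S+)")


def parse_drbd_status(line):
    """Hypothetical helper: return connection state, roles and disk states."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    local_role, peer_role = m.group("st").split("/")
    local_disk, peer_disk = m.group("ds").split("/")
    return {
        "cs": m.group("cs"),
        "role": local_role,
        "peer_role": peer_role,
        "disk": local_disk,
        "peer_disk": peer_disk,
    }


# Status lines copied from the two /proc/drbd dumps above.
node23 = parse_drbd_status(
    "0: cs:WFBitMapT st:Secondary/Primary ds:Inconsistent/UpToDate C r---"
)
node22 = parse_drbd_status(
    "0: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Inconsistent C r---"
)

# Both ends sit in the bitmap-exchange states and neither has moved on.
stuck = node23["cs"] == "WFBitMapT" and node22["cs"] == "WFBitMapS"
print(node23)
print(node22)
print("both waiting for bitmap:", stuck)
```

Both nodes agree on who is Primary and on which disk is Inconsistent; the only thing that never changes is the cs: field.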
alex at crackpot.org wrote:
> On a test cluster, I was trying to tune drbd.conf. Entered a very large
> value for sndbuf-size (1024). After 30 min, the command had still not
> completed, though the file being written hadn't been updated in 27 min,
> and was the desired size. (I used dd to create a 1GB file, and the test
> file was 1GB.)
>
> 23 was primary, 22 was secondary.
>
> The manual says anything larger than 1M may cause problems, and in my
> case it seems clear this is too large. The trouble now is I cannot get
> my cluster usable again.
>
> I edited drbd.conf on both nodes to restore the previous sndbuf-size
> value (128). Was unable to make this take effect on the current
> primary. (Very sorry now, did not note down the exact error. Something
> like 'took more than 5 seconds to complete'.)
>
> I was unable to shut 23 down cleanly. 'shutdown' noted 'system going
> down for reboot' in the syslog, and did nothing after that. Forcibly
> cycled the power.
>
> I have rebooted both nodes. The current primary is 22 (took over when
> 23 rebooted). I have been unable to get them to sync now, even after
> invalidating the entire device on 23. They are connected, but not
> getting past the 'waiting for bit map' stage. Seems the bitmap is
> messed up in some respect. I'm really unsure at this point how to
> resolve this. Any help is appreciated.
>
> alex
>
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: short sent ReportState
> size=12 sent=0
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: asender terminated
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: Terminating asender thread
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: tl_clear()
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: Connection closed
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Timeout -> Unconnected )
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: receiver terminated
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: receiver (re)started
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Unconnected ->
> WFConnection )
> May 5 15:25:22 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD
> Network Protocol version 86
> May 5 15:25:22 dellpe2950-23 kernel: drbd0: conn( WFConnection ->
> WFReportParams )
> May 5 15:25:22 dellpe2950-23 kernel: drbd0: Starting asender thread
> (from drbd0_receiver [6259])
> May 5 15:25:28 dellpe2950-23 kernel: drbd0: conn( WFReportParams ->
> Timeout )
> May 5 15:25:28 dellpe2950-23 kernel: drbd0: short sent ReportSizes
> size=40 sent=0
> May 5 15:25:34 dellpe2950-23 kernel: drbd0: short sent ReportUUIDs
> size=56 sent=0
> May 5 15:25:40 dellpe2950-23 kernel: drbd0: short sent ReportState
> size=12 sent=0
>
>
> May 5 15:27:20 dellpe2950-23 kernel: drbd0: State change failed: Can
> not start resync since it is already active
> May 5 15:27:20 dellpe2950-23 kernel: drbd0: state = { cs:WFBitMapT
> st:Secondary/Primary ds:UpToDate/UpToDate r--- }
> May 5 15:27:20 dellpe2950-23 kernel: drbd0: wanted = {
> cs:StartingSyncT st:Secondary/Primary ds:Inconsistent/UpToDate r--- }
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: peer( Primary -> Unknown )
> conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: error receiving
> ReportBitMap, l: 4088!
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: asender terminated
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating asender thread
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: Writing meta data super
> block now.
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: tl_clear()
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: Connection closed
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: conn( Disconnecting ->
> StandAlone )
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: receiver terminated
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating receiver thread
>
>
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( StandAlone ->
> Unconnected )
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: Starting receiver thread
> (from drbd0_worker [4416])
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: receiver (re)started
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( Unconnected ->
> WFConnection )
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD
> Network Protocol version 86
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( WFConnection ->
> WFReportParams )
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: Starting asender thread
> (from drbd0_receiver [6301])
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: Split-Brain detected,
> aborting!
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: self
> 99D56CF91187B3F4:8C1668A9CCF498F1:150E86C1B532DE51:FBA773E22A805495
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: peer
> C21D5DCBDE372E53:8C1668A9CCF498F0:150E86C1B532DE50:FBA773E22A805495
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: helper command:
> /sbin/drbdadm split-brain
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: conn( WFReportParams ->
> Disconnecting )
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: error receiving
> ReportState, l: 4!
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: asender terminated
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating asender thread
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: tl_clear()
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: Connection closed
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: conn( Disconnecting ->
> StandAlone )
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: receiver terminated
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating receiver thread
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: disk( UpToDate ->
> Inconsistent )
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: Queueing bitmap io:
> invalidate forced full sync
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super
> block now.
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super
> block now.
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: writing of bitmap took 13
> jiffies
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: 259 GB (67774141 bits)
> marked out-of-sync by on disk bit-map.
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super
> block now.
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( StandAlone ->
> Unconnected )
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: Starting receiver thread
> (from drbd0_worker [4416])
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: receiver (re)started
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( Unconnected ->
> WFConnection )
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD
> Network Protocol version 86
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( WFConnection ->
> WFReportParams )
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: Starting asender thread
> (from drbd0_receiver [6321])
> May 5 15:29:08 dellpe2950-23 kernel: drbd0: Becoming sync target due to
> disk states.
> May 5 15:29:08 dellpe2950-23 kernel: drbd0: peer( Unknown -> Primary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> May 5 15:29:08 dellpe2950-23 kernel: drbd0: Writing meta data super
> block now.
>
> [root@dellpe2950-23]# cat /etc/drbd.conf
> resource drbd-resource-0 {
>   protocol C;
>   startup {
>     degr-wfc-timeout 5;
>   }
>
>   net {
>     #on-disconnect reconnect;
>     after-sb-0pri disconnect;
>     after-sb-1pri disconnect;
>     max-buffers 4096;
>     unplug-watermark 128;
>     sndbuf-size 128;
>   }
>
>   disk {
>     on-io-error detach;
>   }
>
>   syncer {
>     rate 12M;
>     al-extents 577;
>   }
>
>   on dellpe2950-22 {
>     device /dev/drbd0;
>     disk /dev/sda7; # db partition
>     address 10.99.210.33:7789; # Private subnet IP
>     meta-disk internal;
>   }
>
>   on dellpe2950-23 {
>     device /dev/drbd0;
>     disk /dev/sda7; # db partition
>     address 10.99.210.34:7789; # Private subnet IP
>     meta-disk internal;
>   }
> }
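
The quoted kernel log is hard to follow because the mailer wrapped most of the conn( ... ) lines. A rough Python sketch for pulling the connection-state transitions back out, assuming the "> " quoting and line wrapping seen in this message; conn_transitions is a hypothetical helper name, and the sample is copied from the last reconnect attempt in the log above.

```python
import re

# Matches DRBD's "conn( Old -> New )" state-change notation from the log.
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def conn_transitions(quoted_log):
    """Hypothetical helper: list (old, new) connection states in log order."""
    # Drop the "> " quote prefix and rejoin lines that the mailer wrapped.
    joined = " ".join(re.sub(r"^>\s?", "", ln) for ln in quoted_log.splitlines())
    return CONN_RE.findall(joined)


sample = """\
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( StandAlone ->
> Unconnected )
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( Unconnected ->
> WFConnection )
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( WFConnection ->
> WFReportParams )
> May 5 15:29:08 dellpe2950-23 kernel: drbd0: peer( Unknown -> Primary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
"""

transitions = conn_transitions(sample)
for old, new in transitions:
    print(f"{old} -> {new}")
```

On this excerpt the last transition is into WFBitMapT, which matches the state 23 is still reporting in /proc/drbd.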