Sorry, forgot to include some important information...

These are Dell PowerEdge 2950s. 16GB RAM. RAID10 w/ PERC5 controller.
Using RHEL5 OS.

[alexd@dellpe2950-23 ~]$ cat /proc/drbd
version: 8.0.12 (api:86/proto:86)
GIT-hash: 5c9f89594553e32adb87d9638dce591782f947e3 build by alexd@dellpe2950-23, 2008-05-01 09:44:22
 0: cs:WFBitMapT st:Secondary/Primary ds:Inconsistent/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:154 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/577 hits:0 misses:0 starving:0 dirty:0 changed:0
[alexd@dellpe2950-23 ~]$ uname -a
Linux dellpe2950-23 2.6.18-8.1.15.el5 #1 SMP Thu Oct 4 04:06:39 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

[alexd@dellpe2950-22 ~]$ cat /proc/drbd
version: 8.0.12 (api:86/proto:86)
GIT-hash: 5c9f89594553e32adb87d9638dce591782f947e3 build by alexd@dellpe2950-22, 2008-05-01 09:31:32
 0: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:0 nr:0 dw:4 dr:81 al:1 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/577 hits:0 misses:1 starving:0 dirty:0 changed:1
[alexd@dellpe2950-22 ~]$ uname -a
Linux dellpe2950-22 2.6.18-8.1.15.el5 #1 SMP Thu Oct 4 04:06:39 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

alex at crackpot.org wrote:
> On a test cluster, I was trying to tune drbd.conf. Entered a very large
> value for sndbuf-size (1024). After 30 min, command had still not
> completed, though the file being written hadn't been updated in 27 min,
> and was the desired size. (I used dd to create a 1GB file, and the test
> file was 1GB.)
>
> 23 was primary, 22 was secondary.
>
> The manual says anything larger than 1M may cause problems, and in my
> case it seems clear this is too large. The trouble now is I cannot get
> my cluster usable again.
>
> I edited drbd.conf on both nodes to restore the previous sndbuf-size
> value (128). Was unable to make this take effect on the current
> primary.
> (Very sorry now, did not note down the exact error. Something
> like 'took more than 5 seconds to complete'.)
>
> I was unable to shut 23 down cleanly. 'shutdown' noted 'system going
> down for reboot' in the syslog, and did nothing after that. Forcibly
> cycled the power.
>
> I have rebooted both nodes. The current primary is 22 (took over when
> 23 rebooted). I have been unable to get them to sync now, even after
> invalidating the entire device on 23. They are connected, but not
> getting past the 'waiting for bit map' stage. Seems the bitmap is
> messed up in some respect. I'm really unsure at this point how to
> resolve this. Any help is appreciated.
>
> alex
>
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: short sent ReportState size=12 sent=0
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: asender terminated
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: Terminating asender thread
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: tl_clear()
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: Connection closed
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Timeout -> Unconnected )
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: receiver terminated
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: receiver (re)started
> May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Unconnected -> WFConnection )
> May 5 15:25:22 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> May 5 15:25:22 dellpe2950-23 kernel: drbd0: conn( WFConnection -> WFReportParams )
> May 5 15:25:22 dellpe2950-23 kernel: drbd0: Starting asender thread (from drbd0_receiver [6259])
> May 5 15:25:28 dellpe2950-23 kernel: drbd0: conn( WFReportParams -> Timeout )
> May 5 15:25:28 dellpe2950-23 kernel: drbd0: short sent ReportSizes size=40 sent=0
> May 5 15:25:34 dellpe2950-23 kernel: drbd0: short sent ReportUUIDs size=56 sent=0
> May 5 15:25:40 dellpe2950-23 kernel: drbd0: short sent ReportState size=12 sent=0
>
>
> May 5 15:27:20 dellpe2950-23 kernel: drbd0: State change failed: Can not start resync since it is already active
> May 5 15:27:20 dellpe2950-23 kernel: drbd0: state = { cs:WFBitMapT st:Secondary/Primary ds:UpToDate/UpToDate r--- }
> May 5 15:27:20 dellpe2950-23 kernel: drbd0: wanted = { cs:StartingSyncT st:Secondary/Primary ds:Inconsistent/UpToDate r--- }
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: peer( Primary -> Unknown ) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: error receiving ReportBitMap, l: 4088!
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: asender terminated
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating asender thread
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: tl_clear()
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: Connection closed
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: conn( Disconnecting -> StandAlone )
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: receiver terminated
> May 5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating receiver thread
>
>
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( StandAlone -> Unconnected )
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: Starting receiver thread (from drbd0_worker [4416])
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: receiver (re)started
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( Unconnected -> WFConnection )
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( WFConnection -> WFReportParams )
> May 5 15:28:21 dellpe2950-23 kernel: drbd0: Starting asender thread (from drbd0_receiver [6301])
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: Split-Brain detected, aborting!
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: self 99D56CF91187B3F4:8C1668A9CCF498F1:150E86C1B532DE51:FBA773E22A805495
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: peer C21D5DCBDE372E53:8C1668A9CCF498F0:150E86C1B532DE50:FBA773E22A805495
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: helper command: /sbin/drbdadm split-brain
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: conn( WFReportParams -> Disconnecting )
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: error receiving ReportState, l: 4!
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: asender terminated
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating asender thread
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: tl_clear()
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: Connection closed
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: conn( Disconnecting -> StandAlone )
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: receiver terminated
> May 5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating receiver thread
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: disk( UpToDate -> Inconsistent )
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: Queueing bitmap io: invalidate forced full sync
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: writing of bitmap took 13 jiffies
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: 259 GB (67774141 bits) marked out-of-sync by on disk bit-map.
> May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( StandAlone -> Unconnected )
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: Starting receiver thread (from drbd0_worker [4416])
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: receiver (re)started
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( Unconnected -> WFConnection )
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( WFConnection -> WFReportParams )
> May 5 15:29:07 dellpe2950-23 kernel: drbd0: Starting asender thread (from drbd0_receiver [6321])
> May 5 15:29:08 dellpe2950-23 kernel: drbd0: Becoming sync target due to disk states.
> May 5 15:29:08 dellpe2950-23 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> May 5 15:29:08 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
>
> [root@dellpe2950-23]# cat /etc/drbd.conf
> resource drbd-resource-0 {
>   protocol C;
>   startup {
>     degr-wfc-timeout 5;
>   }
>
>   net {
>     #on-disconnect reconnect;
>     after-sb-0pri disconnect;
>     after-sb-1pri disconnect;
>     max-buffers 4096;
>     unplug-watermark 128;
>     sndbuf-size 128;
>   }
>
>   disk {
>     on-io-error detach;
>   }
>
>   syncer {
>     rate 12M;
>     al-extents 577;
>   }
>
>   on dellpe2950-22 {
>     device /dev/drbd0;
>     disk /dev/sda7; # db partition
>     address 10.99.210.33:7789; # Private subnet IP
>     meta-disk internal;
>   }
>
>   on dellpe2950-23 {
>     device /dev/drbd0;
>     disk /dev/sda7; # db partition
>     address 10.99.210.34:7789; # Private subnet IP
>     meta-disk internal;
>   }
> }
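
For anyone who wants to watch for the stuck state described in this message, here is a minimal Python sketch that pulls the connection state (cs), roles (st) and disk states (ds) out of /proc/drbd-style text and flags a device sitting in WFBitMapS or WFBitMapT. It assumes the 8.0.x line layout shown in the output quoted in this message; other DRBD versions may format the status line differently. The names parse_drbd_status, waiting_for_bitmap and SAMPLE are made up for illustration and are not part of DRBD or its tools.

```python
import re

# Assumed layout of the per-device status line, taken from the 8.0.12
# output in this message, e.g.
#  " 0: cs:WFBitMapT st:Secondary/Primary ds:Inconsistent/UpToDate C r---"
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+st:(?P<st_self>[^/\s]+)/(?P<st_peer>\S+)"
    r"\s+ds:(?P<ds_self>[^/\s]+)/(?P<ds_peer>\S+)"
)


def parse_drbd_status(text):
    """Return one dict per device status line found in the given text."""
    devices = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            device = match.groupdict()
            device["minor"] = int(device["minor"])
            devices.append(device)
    return devices


def waiting_for_bitmap(device):
    """True if the device is in one of the two bitmap-exchange states."""
    return device["cs"] in ("WFBitMapS", "WFBitMapT")


# Hypothetical sample, shortened from the output posted for dellpe2950-23.
SAMPLE = """version: 8.0.12 (api:86/proto:86)
 0: cs:WFBitMapT st:Secondary/Primary ds:Inconsistent/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:154 lo:0 pe:0 ua:0 ap:0
"""

if __name__ == "__main__":
    for dev in parse_drbd_status(SAMPLE):
        print(dev["minor"], dev["cs"], waiting_for_bitmap(dev))
```

On a live node the same function could be fed the contents of /proc/drbd read with open("/proc/drbd").read(). This only reports the state; it does not diagnose why the bitmap exchange is not completing.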