On a test cluster, I was trying to tune drbd.conf. I entered a very large value for sndbuf-size (1024). After 30 minutes the command had still not completed, although the file being written had not been updated in 27 minutes and was already the desired size. (I used dd to create a 1GB file, and the test file was 1GB.) Node 23 was primary, 22 was secondary. The manual says anything larger than 1M may cause problems, and in my case it seems clear this value is too large.

The trouble now is that I cannot get my cluster usable again. I edited drbd.conf on both nodes to restore the previous sndbuf-size value (128), but was unable to make this take effect on the then-current primary. (Very sorry, I did not note down the exact error. It was something like 'took more than 5 seconds to complete'.) I was also unable to shut 23 down cleanly: 'shutdown' logged 'system going down for reboot' in the syslog and did nothing after that, so I forcibly cycled the power.

I have since rebooted both nodes. The current primary is 22 (it took over when 23 rebooted). I have been unable to get them to sync, even after invalidating the entire device on 23. They are connected, but do not get past the 'waiting for bitmap' stage (WFBitMapT in the log below). It seems the bitmap is messed up in some respect.

I'm really unsure at this point how to resolve this. Any help is appreciated.
alex

May 5 15:25:21 dellpe2950-23 kernel: drbd0: short sent ReportState size=12 sent=0
May 5 15:25:21 dellpe2950-23 kernel: drbd0: asender terminated
May 5 15:25:21 dellpe2950-23 kernel: drbd0: Terminating asender thread
May 5 15:25:21 dellpe2950-23 kernel: drbd0: tl_clear()
May 5 15:25:21 dellpe2950-23 kernel: drbd0: Connection closed
May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Timeout -> Unconnected )
May 5 15:25:21 dellpe2950-23 kernel: drbd0: receiver terminated
May 5 15:25:21 dellpe2950-23 kernel: drbd0: receiver (re)started
May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Unconnected -> WFConnection )
May 5 15:25:22 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
May 5 15:25:22 dellpe2950-23 kernel: drbd0: conn( WFConnection -> WFReportParams )
May 5 15:25:22 dellpe2950-23 kernel: drbd0: Starting asender thread (from drbd0_receiver [6259])
May 5 15:25:28 dellpe2950-23 kernel: drbd0: conn( WFReportParams -> Timeout )
May 5 15:25:28 dellpe2950-23 kernel: drbd0: short sent ReportSizes size=40 sent=0
May 5 15:25:34 dellpe2950-23 kernel: drbd0: short sent ReportUUIDs size=56 sent=0
May 5 15:25:40 dellpe2950-23 kernel: drbd0: short sent ReportState size=12 sent=0
May 5 15:27:20 dellpe2950-23 kernel: drbd0: State change failed: Can not start resync since it is already active
May 5 15:27:20 dellpe2950-23 kernel: drbd0: state = { cs:WFBitMapT st:Secondary/Primary ds:UpToDate/UpToDate r--- }
May 5 15:27:20 dellpe2950-23 kernel: drbd0: wanted = { cs:StartingSyncT st:Secondary/Primary ds:Inconsistent/UpToDate r--- }
May 5 15:28:05 dellpe2950-23 kernel: drbd0: peer( Primary -> Unknown ) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown )
May 5 15:28:05 dellpe2950-23 kernel: drbd0: error receiving ReportBitMap, l: 4088!
May 5 15:28:05 dellpe2950-23 kernel: drbd0: asender terminated
May 5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating asender thread
May 5 15:28:05 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
May 5 15:28:05 dellpe2950-23 kernel: drbd0: tl_clear()
May 5 15:28:05 dellpe2950-23 kernel: drbd0: Connection closed
May 5 15:28:05 dellpe2950-23 kernel: drbd0: conn( Disconnecting -> StandAlone )
May 5 15:28:05 dellpe2950-23 kernel: drbd0: receiver terminated
May 5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating receiver thread
May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( StandAlone -> Unconnected )
May 5 15:28:21 dellpe2950-23 kernel: drbd0: Starting receiver thread (from drbd0_worker [4416])
May 5 15:28:21 dellpe2950-23 kernel: drbd0: receiver (re)started
May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( Unconnected -> WFConnection )
May 5 15:28:21 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( WFConnection -> WFReportParams )
May 5 15:28:21 dellpe2950-23 kernel: drbd0: Starting asender thread (from drbd0_receiver [6301])
May 5 15:28:22 dellpe2950-23 kernel: drbd0: Split-Brain detected, aborting!
May 5 15:28:22 dellpe2950-23 kernel: drbd0: self 99D56CF91187B3F4:8C1668A9CCF498F1:150E86C1B532DE51:FBA773E22A805495
May 5 15:28:22 dellpe2950-23 kernel: drbd0: peer C21D5DCBDE372E53:8C1668A9CCF498F0:150E86C1B532DE50:FBA773E22A805495
May 5 15:28:22 dellpe2950-23 kernel: drbd0: helper command: /sbin/drbdadm split-brain
May 5 15:28:22 dellpe2950-23 kernel: drbd0: conn( WFReportParams -> Disconnecting )
May 5 15:28:22 dellpe2950-23 kernel: drbd0: error receiving ReportState, l: 4!
May 5 15:28:22 dellpe2950-23 kernel: drbd0: asender terminated
May 5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating asender thread
May 5 15:28:22 dellpe2950-23 kernel: drbd0: tl_clear()
May 5 15:28:22 dellpe2950-23 kernel: drbd0: Connection closed
May 5 15:28:22 dellpe2950-23 kernel: drbd0: conn( Disconnecting -> StandAlone )
May 5 15:28:22 dellpe2950-23 kernel: drbd0: receiver terminated
May 5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating receiver thread
May 5 15:28:57 dellpe2950-23 kernel: drbd0: disk( UpToDate -> Inconsistent )
May 5 15:28:57 dellpe2950-23 kernel: drbd0: Queueing bitmap io: invalidate forced full sync
May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
May 5 15:28:57 dellpe2950-23 kernel: drbd0: writing of bitmap took 13 jiffies
May 5 15:28:57 dellpe2950-23 kernel: drbd0: 259 GB (67774141 bits) marked out-of-sync by on disk bit-map.
May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( StandAlone -> Unconnected )
May 5 15:29:07 dellpe2950-23 kernel: drbd0: Starting receiver thread (from drbd0_worker [4416])
May 5 15:29:07 dellpe2950-23 kernel: drbd0: receiver (re)started
May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( Unconnected -> WFConnection )
May 5 15:29:07 dellpe2950-23 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( WFConnection -> WFReportParams )
May 5 15:29:07 dellpe2950-23 kernel: drbd0: Starting asender thread (from drbd0_receiver [6321])
May 5 15:29:08 dellpe2950-23 kernel: drbd0: Becoming sync target due to disk states.
May 5 15:29:08 dellpe2950-23 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 5 15:29:08 dellpe2950-23 kernel: drbd0: Writing meta data super block now.
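
The kernel log above is easier to follow if only the connection-state changes are pulled out. What follows is a minimal sketch, not a DRBD tool: it assumes the log lines look exactly like the ones pasted here (a syslog timestamp followed by a `conn( A -> B )` fragment), and the function name `conn_transitions` and the `SAMPLE` list are made up for illustration.

```python
import re

# Matches the "conn( A -> B )" fragment seen in the pasted log lines.
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")
# Matches a leading syslog timestamp such as "May 5 15:25:21".
TIME_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})")


def conn_transitions(lines):
    """Return (timestamp, old_state, new_state) for each conn( ... ) change.

    Hypothetical helper: the format is inferred from the log above only.
    """
    result = []
    for line in lines:
        match = CONN_RE.search(line)
        if match is None:
            continue  # line carries no connection-state change
        stamp = TIME_RE.match(line)
        result.append((stamp.group(1) if stamp else None,
                       match.group(1), match.group(2)))
    return result


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    "May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Timeout -> Unconnected )",
    "May 5 15:25:21 dellpe2950-23 kernel: drbd0: receiver terminated",
    "May 5 15:28:05 dellpe2950-23 kernel: drbd0: peer( Primary -> Unknown ) "
    "conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown )",
]

if __name__ == "__main__":
    for stamp, old, new in conn_transitions(SAMPLE):
        print(stamp, old, "->", new)
```

Run against the full log, this should list each hop between states such as WFConnection, WFReportParams, Timeout, WFBitMapT and StandAlone in order, which makes it easier to see where the connection keeps dropping.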

[root@dellpe2950-23]# cat /etc/drbd.conf
resource drbd-resource-0 {
  protocol C;
  startup {
    degr-wfc-timeout 5;
  }
  net {
    #on-disconnect reconnect;
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    max-buffers 4096;
    unplug-watermark 128;
    sndbuf-size 128;
  }
  disk {
    on-io-error detach;
  }
  syncer {
    rate 12M;
    al-extents 577;
  }
  on dellpe2950-22 {
    device /dev/drbd0;
    disk /dev/sda7; # db partition
    address 10.99.210.33:7789; # Private subnet IP
    meta-disk internal;
  }
  on dellpe2950-23 {
    device /dev/drbd0;
    disk /dev/sda7; # db partition
    address 10.99.210.34:7789; # Private subnet IP
    meta-disk internal;
  }
}