On a test cluster, I was trying to tune drbd.conf and entered a very
large value for sndbuf-size (1024). After 30 minutes the command had
still not completed, even though the file being written had reached
the desired size and had not been updated in 27 minutes. (I used dd to
create a 1GB test file, and the file on disk was 1GB.)
Node 23 was primary, node 22 was secondary.
The manual says anything larger than 1M may cause problems, and in my
case it seems clear this is too large. The trouble now is I cannot
get my cluster usable again.
I edited drbd.conf on both nodes to restore the previous sndbuf-size
value (128), but was unable to make this take effect on the current
primary. (Sorry, I did not note down the exact error. It was
something like 'took more than 5 seconds to complete'.)
I was unable to shut 23 down cleanly. 'shutdown' noted 'system going
down for reboot' in the syslog, and did nothing after that. Forcibly
cycled the power.
I have rebooted both nodes. The current primary is 22 (it took over
when 23 rebooted). I have been unable to get them to sync since, even
after invalidating the entire device on 23. They connect, but do not
get past the 'waiting for bitmap' stage (cs:WFBitMapT in the log
below). It seems the bitmap is messed up in some respect. I'm really
unsure at this point how to resolve this. Any help is appreciated.
alex
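As a rough sanity check on the invalidate, the log below reports "259 GB (67774141 bits) marked out-of-sync". The following Python sketch redoes that arithmetic. It assumes each bitmap bit covers one 4 KiB block, which is an assumption about DRBD's bitmap granularity and is not stated in this post; the function name is made up for illustration.

```python
# Assumption (not from the post): one bitmap bit tracks one 4 KiB block.
BLOCK_BYTES = 4096


def out_of_sync_gib(bits: int, block_bytes: int = BLOCK_BYTES) -> float:
    """Return the amount of data covered by `bits` bitmap bits, in GiB."""
    return bits * block_bytes / 2**30


if __name__ == "__main__":
    bits = 67774141  # taken from the "marked out-of-sync" log line
    print(f"{out_of_sync_gib(bits):.2f} GiB marked out-of-sync")
```

Under that assumption the result rounds to the 259 GB in the log, which would be consistent with the whole device having been marked out-of-sync by the invalidate.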
May 5 15:25:21 dellpe2950-23 kernel: drbd0: short sent ReportState
size=12 sent=0
May 5 15:25:21 dellpe2950-23 kernel: drbd0: asender terminated
May 5 15:25:21 dellpe2950-23 kernel: drbd0: Terminating asender thread
May 5 15:25:21 dellpe2950-23 kernel: drbd0: tl_clear()
May 5 15:25:21 dellpe2950-23 kernel: drbd0: Connection closed
May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Timeout -> Unconnected )
May 5 15:25:21 dellpe2950-23 kernel: drbd0: receiver terminated
May 5 15:25:21 dellpe2950-23 kernel: drbd0: receiver (re)started
May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Unconnected ->
WFConnection )
May 5 15:25:22 dellpe2950-23 kernel: drbd0: Handshake successful:
DRBD Network Protocol version 86
May 5 15:25:22 dellpe2950-23 kernel: drbd0: conn( WFConnection ->
WFReportParams )
May 5 15:25:22 dellpe2950-23 kernel: drbd0: Starting asender thread
(from drbd0_receiver [6259])
May 5 15:25:28 dellpe2950-23 kernel: drbd0: conn( WFReportParams -> Timeout )
May 5 15:25:28 dellpe2950-23 kernel: drbd0: short sent ReportSizes
size=40 sent=0
May 5 15:25:34 dellpe2950-23 kernel: drbd0: short sent ReportUUIDs
size=56 sent=0
May 5 15:25:40 dellpe2950-23 kernel: drbd0: short sent ReportState
size=12 sent=0
May 5 15:27:20 dellpe2950-23 kernel: drbd0: State change failed: Can
not start resync since it is already active
May 5 15:27:20 dellpe2950-23 kernel: drbd0: state = { cs:WFBitMapT
st:Secondary/Primary ds:UpToDate/UpToDate r--- }
May 5 15:27:20 dellpe2950-23 kernel: drbd0: wanted = {
cs:StartingSyncT st:Secondary/Primary ds:Inconsistent/UpToDate r--- }
May 5 15:28:05 dellpe2950-23 kernel: drbd0: peer( Primary -> Unknown
) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown )
May 5 15:28:05 dellpe2950-23 kernel: drbd0: error receiving
ReportBitMap, l: 4088!
May 5 15:28:05 dellpe2950-23 kernel: drbd0: asender terminated
May 5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating asender thread
May 5 15:28:05 dellpe2950-23 kernel: drbd0: Writing meta data super
block now.
May 5 15:28:05 dellpe2950-23 kernel: drbd0: tl_clear()
May 5 15:28:05 dellpe2950-23 kernel: drbd0: Connection closed
May 5 15:28:05 dellpe2950-23 kernel: drbd0: conn( Disconnecting ->
StandAlone )
May 5 15:28:05 dellpe2950-23 kernel: drbd0: receiver terminated
May 5 15:28:05 dellpe2950-23 kernel: drbd0: Terminating receiver thread
May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( StandAlone -> Unconnected )
May 5 15:28:21 dellpe2950-23 kernel: drbd0: Starting receiver thread
(from drbd0_worker [4416])
May 5 15:28:21 dellpe2950-23 kernel: drbd0: receiver (re)started
May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( Unconnected ->
WFConnection )
May 5 15:28:21 dellpe2950-23 kernel: drbd0: Handshake successful:
DRBD Network Protocol version 86
May 5 15:28:21 dellpe2950-23 kernel: drbd0: conn( WFConnection ->
WFReportParams )
May 5 15:28:21 dellpe2950-23 kernel: drbd0: Starting asender thread
(from drbd0_receiver [6301])
May 5 15:28:22 dellpe2950-23 kernel: drbd0: Split-Brain detected, aborting!
May 5 15:28:22 dellpe2950-23 kernel: drbd0: self
99D56CF91187B3F4:8C1668A9CCF498F1:150E86C1B532DE51:FBA773E22A805495
May 5 15:28:22 dellpe2950-23 kernel: drbd0: peer
C21D5DCBDE372E53:8C1668A9CCF498F0:150E86C1B532DE50:FBA773E22A805495
May 5 15:28:22 dellpe2950-23 kernel: drbd0: helper command:
/sbin/drbdadm split-brain
May 5 15:28:22 dellpe2950-23 kernel: drbd0: conn( WFReportParams ->
Disconnecting )
May 5 15:28:22 dellpe2950-23 kernel: drbd0: error receiving
ReportState, l: 4!
May 5 15:28:22 dellpe2950-23 kernel: drbd0: asender terminated
May 5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating asender thread
May 5 15:28:22 dellpe2950-23 kernel: drbd0: tl_clear()
May 5 15:28:22 dellpe2950-23 kernel: drbd0: Connection closed
May 5 15:28:22 dellpe2950-23 kernel: drbd0: conn( Disconnecting ->
StandAlone )
May 5 15:28:22 dellpe2950-23 kernel: drbd0: receiver terminated
May 5 15:28:22 dellpe2950-23 kernel: drbd0: Terminating receiver thread
May 5 15:28:57 dellpe2950-23 kernel: drbd0: disk( UpToDate -> Inconsistent )
May 5 15:28:57 dellpe2950-23 kernel: drbd0: Queueing bitmap io:
invalidate forced full sync
May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super
block now.
May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super
block now.
May 5 15:28:57 dellpe2950-23 kernel: drbd0: writing of bitmap took 13 jiffies
May 5 15:28:57 dellpe2950-23 kernel: drbd0: 259 GB (67774141 bits)
marked out-of-sync by on disk bit-map.
May 5 15:28:57 dellpe2950-23 kernel: drbd0: Writing meta data super
block now.
May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( StandAlone -> Unconnected )
May 5 15:29:07 dellpe2950-23 kernel: drbd0: Starting receiver thread
(from drbd0_worker [4416])
May 5 15:29:07 dellpe2950-23 kernel: drbd0: receiver (re)started
May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( Unconnected ->
WFConnection )
May 5 15:29:07 dellpe2950-23 kernel: drbd0: Handshake successful:
DRBD Network Protocol version 86
May 5 15:29:07 dellpe2950-23 kernel: drbd0: conn( WFConnection ->
WFReportParams )
May 5 15:29:07 dellpe2950-23 kernel: drbd0: Starting asender thread
(from drbd0_receiver [6321])
May 5 15:29:08 dellpe2950-23 kernel: drbd0: Becoming sync target due
to disk states.
May 5 15:29:08 dellpe2950-23 kernel: drbd0: peer( Unknown -> Primary
) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 5 15:29:08 dellpe2950-23 kernel: drbd0: Writing meta data super
block now.
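The connection state changes are easier to follow when pulled out of the log on their own. Below is a minimal Python sketch that does this; the function name and the sample text are illustrative only, and it assumes the log was line-wrapped by the mail client as above, so it collapses whitespace before matching.

```python
import re

# Matches entries such as "conn( WFConnection -> WFReportParams )".
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def conn_transitions(log_text: str) -> list[tuple[str, str]]:
    """Return (old_state, new_state) pairs in the order they appear."""
    # Collapse all whitespace so that entries wrapped across two lines
    # ("conn( Unconnected ->" / "WFConnection )") still match.
    flat = " ".join(log_text.split())
    return CONN_RE.findall(flat)


# A few lines copied from the log above, including wrapped entries.
sample = """\
May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Timeout -> Unconnected )
May 5 15:25:21 dellpe2950-23 kernel: drbd0: conn( Unconnected ->
WFConnection )
May 5 15:25:22 dellpe2950-23 kernel: drbd0: conn( WFConnection ->
WFReportParams )
May 5 15:25:28 dellpe2950-23 kernel: drbd0: conn( WFReportParams -> Timeout )
"""

if __name__ == "__main__":
    for old, new in conn_transitions(sample):
        print(f"{old} -> {new}")
```

Run against the sample, this shows the loop Timeout, Unconnected, WFConnection, WFReportParams and back to Timeout, which is the cycle visible at the start of the log.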
[root@dellpe2950-23]# cat /etc/drbd.conf
resource drbd-resource-0 {
  protocol C;
  startup {
    degr-wfc-timeout 5;
  }
  net {
    #on-disconnect reconnect;
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    max-buffers 4096;
    unplug-watermark 128;
    sndbuf-size 128;
  }
  disk {
    on-io-error detach;
  }
  syncer {
    rate 12M;
    al-extents 577;
  }
  on dellpe2950-22 {
    device /dev/drbd0;
    disk /dev/sda7; # db partition
    address 10.99.210.33:7789; # Private subnet IP
    meta-disk internal;
  }
  on dellpe2950-23 {
    device /dev/drbd0;
    disk /dev/sda7; # db partition
    address 10.99.210.34:7789; # Private subnet IP
    meta-disk internal;
  }
}