Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I'm continuing testing an active/active nfs cluster using a pair of drbd
replicated filesystems. Currently, to test that it is working, the nodes
are set to reboot on even/odd hours to ensure that failover works as
expected without any trouble. Which, for the most part, it does so long as
they are allowed to reach uptodate state between the cycles.
However, I've just run into this and I'm not sure what it means or how to
resolve it. Is this is a software issue or the result of a configuration
error?
Meanwhile, the r1 filesystem is currently mounted on node a and it appears
to be intact; nfs clients have been happily writting away without any errors.
2.6.18-8.1.14.el5 x86_64, centos
on node a:
# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56
0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
ns:34415876 nr:25906480 dw:60322492 dr:2187965 al:8966 bm:888 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:271442 misses:308 starving:0 dirty:0 changed:308
act_log: used:0/521 hits:8595037 misses:8978 starving:0 dirty:12 changed:8966
1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
ns:36930244 nr:24805160 dw:63101540 dr:1859585 al:10341 bm:739 lo:0 pe:0 ua:0 ap:0
[==>.................] sync'ed: 14.5% (1427132/1664416)K
stalled
resync: used:0/31 hits:309394 misses:341 starving:0 dirty:0 changed:341
act_log: used:0/521 hits:9563754 misses:10354 starving:0 dirty:13 changed:10341
on node b:
# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56
0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
ns:0 nr:39210316 dw:39210252 dr:0 al:0 bm:153 lo:17 pe:0 ua:16 ap:0
resync: used:0/31 hits:122901 misses:153 starving:0 dirty:0 changed:153
act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0
1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
ns:0 nr:42409152 dw:42409152 dr:0 al:0 bm:115 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:104757 misses:115 starving:0 dirty:0 changed:115
act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0
node a reboots:
Oct 11 12:38:21 a kernel: drbd1: disk( Diskless -> Attaching )
Oct 11 12:38:21 a kernel: drbd1: Found 10 transactions (592 active extents) in activity log.
Oct 11 12:38:21 a kernel: drbd1: max_segment_size ( = BIO size ) = 32768
Oct 11 12:38:21 a kernel: drbd1: drbd_bm_resize called with capacity == 4294836144
Oct 11 12:38:21 a kernel: drbd1: resync bitmap: bits=536854518 words=8388352
Oct 11 12:38:21 a kernel: drbd1: size = 2047 GB (2147418072 KB)
Oct 11 12:38:22 a kernel: drbd1: reading of bitmap took 295 jiffies
Oct 11 12:38:22 a kernel: drbd1: recounting of set bits took additional 116 jiffies
Oct 11 12:38:22 a kernel: drbd1: 0 KB marked out-of-sync by on disk bit-map.
Oct 11 12:38:22 a kernel: drbd1: disk( Attaching -> Outdated )
Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now.
Oct 11 12:38:22 a kernel: drbd1: conn( StandAlone -> Unconnected )
Oct 11 12:38:22 a kernel: drbd1: receiver (re)started
Oct 11 12:38:22 a kernel: drbd1: conn( Unconnected -> WFConnection )
Oct 11 12:38:22 a kernel: drbd1: conn( WFConnection -> WFReportParams )
Oct 11 12:38:22 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86
Oct 11 12:38:22 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
Oct 11 12:38:22 a kernel: drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now.
Oct 11 12:38:23 a kernel: drbd1: conn( WFBitMapT -> WFSyncUUID )
Oct 11 12:38:23 a kernel: drbd1: aftr_isp( 0 -> 1 )
Oct 11 12:38:23 a kernel: drbd1: peer_isp( 0 -> 1 )
Oct 11 12:38:23 a kernel: drbd1: conn( WFSyncUUID -> PausedSyncT ) disk( Outdated -> Inconsistent )
Oct 11 12:38:23 a kernel: drbd1: Began resync as PausedSyncT (will sync 3267100 KB [816775 bits set]).
Oct 11 12:38:23 a kernel: drbd1: Writing meta data super block now.
Oct 11 12:39:27 a kernel: drbd1: aftr_isp( 1 -> 0 )
Oct 11 12:39:27 a kernel: drbd1: conn( PausedSyncT -> SyncTarget ) peer_isp( 1 -> 0 )
Oct 11 12:39:27 a kernel: drbd1: Syncer continues.
Oct 11 12:40:45 a kernel: drbd1: Resync done (total 141 sec; paused 63 sec; 41884 K/sec)
Oct 11 12:40:45 a kernel: drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Oct 11 12:40:45 a kernel: drbd1: Writing meta data super block now.
note b goes down for reboot:
Oct 11 13:05:24 a kernel: drbd1: peer( Primary -> Secondary )
Oct 11 13:05:24 a kernel: drbd1: role( Secondary -> Primary )
Oct 11 13:05:24 a kernel: drbd1: Writing meta data super block now.
Oct 11 13:05:36 a kernel: drbd1: State change failed: Refusing to be Primary while peer is not outdated
Oct 11 13:05:36 a kernel: drbd1: state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- }
Oct 11 13:05:36 a kernel: drbd1: wanted = { cs:TearDown st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Oct 11 13:05:36 a kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )
Oct 11 13:05:36 a kernel: drbd1: Creating new current UUID
Oct 11 13:05:36 a kernel: drbd1: Writing meta data super block now.
Oct 11 13:05:36 a kernel: drbd1: asender terminated
Oct 11 13:05:36 a kernel: drbd1: tl_clear()
Oct 11 13:05:36 a kernel: drbd1: Connection closed
Oct 11 13:05:36 a kernel: drbd1: conn( TearDown -> Unconnected )
Oct 11 13:05:36 a kernel: drbd1: receiver terminated
Oct 11 13:05:36 a kernel: drbd1: receiver (re)started
Oct 11 13:05:36 a kernel: drbd1: conn( Unconnected -> WFConnection )
Oct 11 13:05:45 a Filesystem[12372]: [12422]: INFO: Running start for /dev/drbd1 on /mnt/b.vnfs
Oct 11 13:05:45 a kernel: EXT3 FS on drbd1, internal journal
node b is back:
Oct 11 13:08:03 a kernel: drbd1: conn( WFConnection -> WFReportParams )
Oct 11 13:08:03 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86
Oct 11 13:08:03 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
Oct 11 13:08:03 a kernel: drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now.
Oct 11 13:08:04 a kernel: drbd1: aftr_isp( 0 -> 1 )
Oct 11 13:08:04 a kernel: drbd1: conn( WFBitMapS -> PausedSyncS ) pdsk( Outdated -> Inconsistent )
Oct 11 13:08:04 a kernel: drbd1: Began resync as PausedSyncS (will sync 1664416 KB [416104 bits set]).
Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now.
Oct 11 13:08:04 a kernel: drbd1: pdsk( Inconsistent -> Outdated ) peer_isp( 0 -> 1 )
Oct 11 13:09:08 a kernel: drbd1: aftr_isp( 1 -> 0 )
Oct 11 13:09:08 a kernel: drbd1: conn( PausedSyncS -> SyncSource ) pdsk( Outdated -> Inconsistent ) peer_isp( 1 -> 0 )
Oct 11 13:09:08 a kernel: drbd1: Syncer continues.
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=772139 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771447 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771446 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=768609 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:11 a kernel: drbd1: logic bug? rs_left=762068 > rs_total=416104 (rs_failed 0)
# cat /etc/drbd.conf
global {
usage-count yes;
}
common {
syncer { rate 500M; }
}
resource r0 {
protocol C;
disk {
on-io-error detach;
fencing resource-and-stonith;
}
handlers {
outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
}
syncer {
rate 100M;
al-extents 521;
}
net {
cram-hmac-alg "sha1";
shared-secret "FrooFrooFunFactory";
#http://lists.linbit.com/pipermail/drbd-user/2007-March/006651.html
sndbuf-size 512k;
max-buffers 8192;
unplug-watermark 8192;
max-epoch-size 8192;
}
on a.store. {
device /dev/drbd0;
disk /dev/sda;
address 192.168.1.1:7788;
meta-disk internal;
}
on b.store. {
device /dev/drbd0;
disk /dev/sda;
address 192.168.1.2:7788;
meta-disk internal;
}
}
resource r1 {
protocol C;
disk {
on-io-error detach;
fencing resource-and-stonith;
}
handlers {
outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
}
net {
cram-hmac-alg "sha1";
shared-secret "FrooFrooFunFactory";
sndbuf-size 512k;
max-buffers 8192;
unplug-watermark 8192;
max-epoch-size 8192;
after-sb-0pri discard-younger-primary;
after-sb-1pri consensus;
after-sb-2pri disconnect;
}
syncer {
rate 100M;
al-extents 521;
after "r0";
}
on a.store. {
device /dev/drbd1;
disk /dev/sdb;
address 192.168.1.5:7788;
meta-disk internal;
}
on b.store. {
device /dev/drbd1;
disk /dev/sdb;
address 192.168.1.6:7788;
meta-disk internal;
}
}
--
Kelsey Cummings - kgc at corp.sonic.net sonic.net, inc.
System Architect 2260 Apollo Way
707.522.1000 Santa Rosa, CA 95407