Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I'm continuing testing an active/active nfs cluster using a pair of drbd replicated filesystems. Currently, to test that it is working, the nodes are set to reboot on even/odd hours to ensure that failover works as expected without any trouble. Which, for the most part, it does so long as they are allowed to reach uptodate state between the cycles. However, I've just run into this and I'm not sure what it means or how to resolve it. Is this is a software issue or the result of a configuration error? Meanwhile, the r1 filesystem is currently mounted on node a and it appears to be intact; nfs clients have been happily writting away without any errors. 2.6.18-8.1.14.el5 x86_64, centos on node a: # cat /proc/drbd version: 8.0.6 (api:86/proto:86) SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r--- ns:34415876 nr:25906480 dw:60322492 dr:2187965 al:8966 bm:888 lo:0 pe:0 ua:0 ap:0 resync: used:0/31 hits:271442 misses:308 starving:0 dirty:0 changed:308 act_log: used:0/521 hits:8595037 misses:8978 starving:0 dirty:12 changed:8966 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r--- ns:36930244 nr:24805160 dw:63101540 dr:1859585 al:10341 bm:739 lo:0 pe:0 ua:0 ap:0 [==>.................] sync'ed: 14.5% (1427132/1664416)K stalled resync: used:0/31 hits:309394 misses:341 starving:0 dirty:0 changed:341 act_log: used:0/521 hits:9563754 misses:10354 starving:0 dirty:13 changed:10341 on node b: # cat /proc/drbd version: 8.0.6 (api:86/proto:86) SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r--- ns:0 nr:39210316 dw:39210252 dr:0 al:0 bm:153 lo:17 pe:0 ua:16 ap:0 resync: used:0/31 hits:122901 misses:153 starving:0 dirty:0 changed:153 act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r--- ns:0 nr:42409152 dw:42409152 dr:0 al:0 bm:115 lo:0 pe:0 ua:0 ap:0 resync: used:0/31 hits:104757 misses:115 starving:0 dirty:0 changed:115 act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0 node a reboots: Oct 11 12:38:21 a kernel: drbd1: disk( Diskless -> Attaching ) Oct 11 12:38:21 a kernel: drbd1: Found 10 transactions (592 active extents) in activity log. Oct 11 12:38:21 a kernel: drbd1: max_segment_size ( = BIO size ) = 32768 Oct 11 12:38:21 a kernel: drbd1: drbd_bm_resize called with capacity == 4294836144 Oct 11 12:38:21 a kernel: drbd1: resync bitmap: bits=536854518 words=8388352 Oct 11 12:38:21 a kernel: drbd1: size = 2047 GB (2147418072 KB) Oct 11 12:38:22 a kernel: drbd1: reading of bitmap took 295 jiffies Oct 11 12:38:22 a kernel: drbd1: recounting of set bits took additional 116 jiffies Oct 11 12:38:22 a kernel: drbd1: 0 KB marked out-of-sync by on disk bit-map. Oct 11 12:38:22 a kernel: drbd1: disk( Attaching -> Outdated ) Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now. Oct 11 12:38:22 a kernel: drbd1: conn( StandAlone -> Unconnected ) Oct 11 12:38:22 a kernel: drbd1: receiver (re)started Oct 11 12:38:22 a kernel: drbd1: conn( Unconnected -> WFConnection ) Oct 11 12:38:22 a kernel: drbd1: conn( WFConnection -> WFReportParams ) Oct 11 12:38:22 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86 Oct 11 12:38:22 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC Oct 11 12:38:22 a kernel: drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now. Oct 11 12:38:23 a kernel: drbd1: conn( WFBitMapT -> WFSyncUUID ) Oct 11 12:38:23 a kernel: drbd1: aftr_isp( 0 -> 1 ) Oct 11 12:38:23 a kernel: drbd1: peer_isp( 0 -> 1 ) Oct 11 12:38:23 a kernel: drbd1: conn( WFSyncUUID -> PausedSyncT ) disk( Outdated -> Inconsistent ) Oct 11 12:38:23 a kernel: drbd1: Began resync as PausedSyncT (will sync 3267100 KB [816775 bits set]). Oct 11 12:38:23 a kernel: drbd1: Writing meta data super block now. Oct 11 12:39:27 a kernel: drbd1: aftr_isp( 1 -> 0 ) Oct 11 12:39:27 a kernel: drbd1: conn( PausedSyncT -> SyncTarget ) peer_isp( 1 -> 0 ) Oct 11 12:39:27 a kernel: drbd1: Syncer continues. Oct 11 12:40:45 a kernel: drbd1: Resync done (total 141 sec; paused 63 sec; 41884 K/sec) Oct 11 12:40:45 a kernel: drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) Oct 11 12:40:45 a kernel: drbd1: Writing meta data super block now. note b goes down for reboot: Oct 11 13:05:24 a kernel: drbd1: peer( Primary -> Secondary ) Oct 11 13:05:24 a kernel: drbd1: role( Secondary -> Primary ) Oct 11 13:05:24 a kernel: drbd1: Writing meta data super block now. Oct 11 13:05:36 a kernel: drbd1: State change failed: Refusing to be Primary while peer is not outdated Oct 11 13:05:36 a kernel: drbd1: state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- } Oct 11 13:05:36 a kernel: drbd1: wanted = { cs:TearDown st:Primary/Unknown ds:UpToDate/DUnknown s--- } Oct 11 13:05:36 a kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated ) Oct 11 13:05:36 a kernel: drbd1: Creating new current UUID Oct 11 13:05:36 a kernel: drbd1: Writing meta data super block now. Oct 11 13:05:36 a kernel: drbd1: asender terminated Oct 11 13:05:36 a kernel: drbd1: tl_clear() Oct 11 13:05:36 a kernel: drbd1: Connection closed Oct 11 13:05:36 a kernel: drbd1: conn( TearDown -> Unconnected ) Oct 11 13:05:36 a kernel: drbd1: receiver terminated Oct 11 13:05:36 a kernel: drbd1: receiver (re)started Oct 11 13:05:36 a kernel: drbd1: conn( Unconnected -> WFConnection ) Oct 11 13:05:45 a Filesystem[12372]: [12422]: INFO: Running start for /dev/drbd1 on /mnt/b.vnfs Oct 11 13:05:45 a kernel: EXT3 FS on drbd1, internal journal node b is back: Oct 11 13:08:03 a kernel: drbd1: conn( WFConnection -> WFReportParams ) Oct 11 13:08:03 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86 Oct 11 13:08:03 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC Oct 11 13:08:03 a kernel: drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now. Oct 11 13:08:04 a kernel: drbd1: aftr_isp( 0 -> 1 ) Oct 11 13:08:04 a kernel: drbd1: conn( WFBitMapS -> PausedSyncS ) pdsk( Outdated -> Inconsistent ) Oct 11 13:08:04 a kernel: drbd1: Began resync as PausedSyncS (will sync 1664416 KB [416104 bits set]). Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now. Oct 11 13:08:04 a kernel: drbd1: pdsk( Inconsistent -> Outdated ) peer_isp( 0 -> 1 ) Oct 11 13:09:08 a kernel: drbd1: aftr_isp( 1 -> 0 ) Oct 11 13:09:08 a kernel: drbd1: conn( PausedSyncS -> SyncSource ) pdsk( Outdated -> Inconsistent ) peer_isp( 1 -> 0 ) Oct 11 13:09:08 a kernel: drbd1: Syncer continues. Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=772139 > rs_total=416104 (rs_failed 0) Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0) Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0) Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771447 > rs_total=416104 (rs_failed 0) Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771446 > rs_total=416104 (rs_failed 0) Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=768609 > rs_total=416104 (rs_failed 0) Oct 11 13:09:11 a kernel: drbd1: logic bug? rs_left=762068 > rs_total=416104 (rs_failed 0) # cat /etc/drbd.conf global { usage-count yes; } common { syncer { rate 500M; } } resource r0 { protocol C; disk { on-io-error detach; fencing resource-and-stonith; } handlers { outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater"; pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f"; pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f"; local-io-error "echo o > /proc/sysrq-trigger ; halt -f"; } syncer { rate 100M; al-extents 521; } net { cram-hmac-alg "sha1"; shared-secret "FrooFrooFunFactory"; #http://lists.linbit.com/pipermail/drbd-user/2007-March/006651.html sndbuf-size 512k; max-buffers 8192; unplug-watermark 8192; max-epoch-size 8192; } on a.store. { device /dev/drbd0; disk /dev/sda; address 192.168.1.1:7788; meta-disk internal; } on b.store. { device /dev/drbd0; disk /dev/sda; address 192.168.1.2:7788; meta-disk internal; } } resource r1 { protocol C; disk { on-io-error detach; fencing resource-and-stonith; } handlers { outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater"; pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f"; pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f"; local-io-error "echo o > /proc/sysrq-trigger ; halt -f"; } net { cram-hmac-alg "sha1"; shared-secret "FrooFrooFunFactory"; sndbuf-size 512k; max-buffers 8192; unplug-watermark 8192; max-epoch-size 8192; after-sb-0pri discard-younger-primary; after-sb-1pri consensus; after-sb-2pri disconnect; } syncer { rate 100M; al-extents 521; after "r0"; } on a.store. { device /dev/drbd1; disk /dev/sdb; address 192.168.1.5:7788; meta-disk internal; } on b.store. { device /dev/drbd1; disk /dev/sdb; address 192.168.1.6:7788; meta-disk internal; } } -- Kelsey Cummings - kgc at corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407