Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Oct 11, 2007 at 01:55:22PM -0700, Kelsey Cummings wrote: > I'm continuing testing an active/active nfs cluster using a pair of drbd > replicated filesystems. Currently, to test that it is working, the nodes > are set to reboot on even/odd hours to ensure that failover works as > expected without any trouble. Which, for the most part, it does so long as > they are allowed to reach uptodate state between the cycles. > > However, I've just run into this and I'm not sure what it means or how to > resolve it. Is this is a software issue or the result of a configuration > error? > > Meanwhile, the r1 filesystem is currently mounted on node a and it appears > to be intact; nfs clients have been happily writting away without any errors. > > 2.6.18-8.1.14.el5 x86_64, centos > > on node a: > # cat /proc/drbd > version: 8.0.6 (api:86/proto:86) > SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56 > 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r--- > ns:34415876 nr:25906480 dw:60322492 dr:2187965 al:8966 bm:888 lo:0 pe:0 ua:0 ap:0 > resync: used:0/31 hits:271442 misses:308 starving:0 dirty:0 changed:308 > act_log: used:0/521 hits:8595037 misses:8978 starving:0 dirty:12 changed:8966 > 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r--- > ns:36930244 nr:24805160 dw:63101540 dr:1859585 al:10341 bm:739 lo:0 pe:0 ua:0 ap:0 > [==>.................] sync'ed: 14.5% (1427132/1664416)K > stalled > resync: used:0/31 hits:309394 misses:341 starving:0 dirty:0 changed:341 > act_log: used:0/521 hits:9563754 misses:10354 starving:0 dirty:13 changed:10341 > > on node b: > # cat /proc/drbd > version: 8.0.6 (api:86/proto:86) > SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56 > 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r--- > ns:0 nr:39210316 dw:39210252 dr:0 al:0 bm:153 lo:17 pe:0 ua:16 ap:0 > resync: used:0/31 hits:122901 misses:153 starving:0 dirty:0 changed:153 > act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0 > 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r--- > ns:0 nr:42409152 dw:42409152 dr:0 al:0 bm:115 lo:0 pe:0 ua:0 ap:0 > resync: used:0/31 hits:104757 misses:115 starving:0 dirty:0 changed:115 > act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0 > > > > node a reboots: > Oct 11 12:38:21 a kernel: drbd1: disk( Diskless -> Attaching ) > Oct 11 12:38:21 a kernel: drbd1: Found 10 transactions (592 active extents) in activity log. > Oct 11 12:38:21 a kernel: drbd1: max_segment_size ( = BIO size ) = 32768 > Oct 11 12:38:21 a kernel: drbd1: drbd_bm_resize called with capacity == 4294836144 > Oct 11 12:38:21 a kernel: drbd1: resync bitmap: bits=536854518 words=8388352 > Oct 11 12:38:21 a kernel: drbd1: size = 2047 GB (2147418072 KB) > Oct 11 12:38:22 a kernel: drbd1: reading of bitmap took 295 jiffies > Oct 11 12:38:22 a kernel: drbd1: recounting of set bits took additional 116 jiffies > Oct 11 12:38:22 a kernel: drbd1: 0 KB marked out-of-sync by on disk bit-map. > Oct 11 12:38:22 a kernel: drbd1: disk( Attaching -> Outdated ) > Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now. > Oct 11 12:38:22 a kernel: drbd1: conn( StandAlone -> Unconnected ) > Oct 11 12:38:22 a kernel: drbd1: receiver (re)started > Oct 11 12:38:22 a kernel: drbd1: conn( Unconnected -> WFConnection ) > Oct 11 12:38:22 a kernel: drbd1: conn( WFConnection -> WFReportParams ) > Oct 11 12:38:22 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86 > Oct 11 12:38:22 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC > Oct 11 12:38:22 a kernel: drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) > Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now. > Oct 11 12:38:23 a kernel: drbd1: conn( WFBitMapT -> WFSyncUUID ) > Oct 11 12:38:23 a kernel: drbd1: aftr_isp( 0 -> 1 ) > Oct 11 12:38:23 a kernel: drbd1: peer_isp( 0 -> 1 ) > Oct 11 12:38:23 a kernel: drbd1: conn( WFSyncUUID -> PausedSyncT ) disk( Outdated -> Inconsistent ) > Oct 11 12:38:23 a kernel: drbd1: Began resync as PausedSyncT (will sync 3267100 KB [816775 bits set]). > Oct 11 12:38:23 a kernel: drbd1: Writing meta data super block now. > Oct 11 12:39:27 a kernel: drbd1: aftr_isp( 1 -> 0 ) > Oct 11 12:39:27 a kernel: drbd1: conn( PausedSyncT -> SyncTarget ) peer_isp( 1 -> 0 ) > Oct 11 12:39:27 a kernel: drbd1: Syncer continues. > Oct 11 12:40:45 a kernel: drbd1: Resync done (total 141 sec; paused 63 sec; 41884 K/sec) > Oct 11 12:40:45 a kernel: drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) > Oct 11 12:40:45 a kernel: drbd1: Writing meta data super block now. > > note b goes down for reboot: > Oct 11 13:05:24 a kernel: drbd1: peer( Primary -> Secondary ) > Oct 11 13:05:24 a kernel: drbd1: role( Secondary -> Primary ) > Oct 11 13:05:24 a kernel: drbd1: Writing meta data super block now. > Oct 11 13:05:36 a kernel: drbd1: State change failed: Refusing to be Primary while peer is not outdated > Oct 11 13:05:36 a kernel: drbd1: state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- } > Oct 11 13:05:36 a kernel: drbd1: wanted = { cs:TearDown st:Primary/Unknown ds:UpToDate/DUnknown s--- } > Oct 11 13:05:36 a kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated ) > Oct 11 13:05:36 a kernel: drbd1: Creating new current UUID > Oct 11 13:05:36 a kernel: drbd1: Writing meta data super block now. > Oct 11 13:05:36 a kernel: drbd1: asender terminated > Oct 11 13:05:36 a kernel: drbd1: tl_clear() > Oct 11 13:05:36 a kernel: drbd1: Connection closed > Oct 11 13:05:36 a kernel: drbd1: conn( TearDown -> Unconnected ) > Oct 11 13:05:36 a kernel: drbd1: receiver terminated > Oct 11 13:05:36 a kernel: drbd1: receiver (re)started > Oct 11 13:05:36 a kernel: drbd1: conn( Unconnected -> WFConnection ) > Oct 11 13:05:45 a Filesystem[12372]: [12422]: INFO: Running start for /dev/drbd1 on /mnt/b.vnfs > Oct 11 13:05:45 a kernel: EXT3 FS on drbd1, internal journal > > node b is back: > Oct 11 13:08:03 a kernel: drbd1: conn( WFConnection -> WFReportParams ) > Oct 11 13:08:03 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86 > Oct 11 13:08:03 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC > Oct 11 13:08:03 a kernel: drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) > Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now. > Oct 11 13:08:04 a kernel: drbd1: aftr_isp( 0 -> 1 ) > Oct 11 13:08:04 a kernel: drbd1: conn( WFBitMapS -> PausedSyncS ) pdsk( Outdated -> Inconsistent ) > Oct 11 13:08:04 a kernel: drbd1: Began resync as PausedSyncS (will sync 1664416 KB [416104 bits set]). > Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now. > Oct 11 13:08:04 a kernel: drbd1: pdsk( Inconsistent -> Outdated ) peer_isp( 0 -> 1 ) > Oct 11 13:09:08 a kernel: drbd1: aftr_isp( 1 -> 0 ) > Oct 11 13:09:08 a kernel: drbd1: conn( PausedSyncS -> SyncSource ) pdsk( Outdated -> Inconsistent ) peer_isp( 1 -> 0 ) > Oct 11 13:09:08 a kernel: drbd1: Syncer continues. > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=772139 > rs_total=416104 (rs_failed 0) > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0) > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0) > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771447 > rs_total=416104 (rs_failed 0) > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771446 > rs_total=416104 (rs_failed 0) > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=768609 > rs_total=416104 (rs_failed 0) > Oct 11 13:09:11 a kernel: drbd1: logic bug? rs_left=762068 > rs_total=416104 (rs_failed 0) now, that is interessting... probably indeed a "logic bug", in that we don't get the accounting right. I believe thought that drbd does the "right thing" still, and only gets the counter wrong for some reason. we'll have a look. > # cat /etc/drbd.conf > global { > usage-count yes; > } > > common { > syncer { rate 500M; } > } > > resource r0 { > protocol C; > > disk { > on-io-error detach; > fencing resource-and-stonith; > } > > handlers { > outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater"; > pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f"; > pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f"; > local-io-error "echo o > /proc/sysrq-trigger ; halt -f"; > } > > syncer { > rate 100M; > al-extents 521; > } > > net { > cram-hmac-alg "sha1"; > shared-secret "FrooFrooFunFactory"; > #http://lists.linbit.com/pipermail/drbd-user/2007-March/006651.html > sndbuf-size 512k; > max-buffers 8192; > unplug-watermark 8192; > max-epoch-size 8192; > } > > on a.store. { > device /dev/drbd0; > disk /dev/sda; > address 192.168.1.1:7788; > meta-disk internal; > } > > on b.store. { > device /dev/drbd0; > disk /dev/sda; > address 192.168.1.2:7788; > meta-disk internal; > } > } > > resource r1 { > protocol C; > > disk { > on-io-error detach; > fencing resource-and-stonith; > } > > handlers { > outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater"; > pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f"; > pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f"; > local-io-error "echo o > /proc/sysrq-trigger ; halt -f"; > } > > net { > cram-hmac-alg "sha1"; > shared-secret "FrooFrooFunFactory"; > sndbuf-size 512k; > max-buffers 8192; > unplug-watermark 8192; > max-epoch-size 8192; > after-sb-0pri discard-younger-primary; > after-sb-1pri consensus; > after-sb-2pri disconnect; > > } > > syncer { > rate 100M; > al-extents 521; > after "r0"; > } > > on a.store. { > device /dev/drbd1; > disk /dev/sdb; > address 192.168.1.5:7788; > meta-disk internal; > } > > on b.store. { > device /dev/drbd1; > disk /dev/sdb; > address 192.168.1.6:7788; > meta-disk internal; > } > } > > > -- > Kelsey Cummings - kgc at corp.sonic.net sonic.net, inc. > System Architect 2260 Apollo Way > 707.522.1000 Santa Rosa, CA 95407 > _______________________________________________ > drbd-user mailing list > drbd-user at lists.linbit.com > http://lists.linbit.com/mailman/listinfo/drbd-user > -- : commercial DRBD/HA support and consulting: sales at linbit.com : : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client.