Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Oct 11, 2007 at 01:55:22PM -0700, Kelsey Cummings wrote:
> I'm continuing testing an active/active nfs cluster using a pair of drbd
> replicated filesystems. Currently, to test that it is working, the nodes
> are set to reboot on even/odd hours to ensure that failover works as
> expected without any trouble. Which, for the most part, it does so long as
> they are allowed to reach uptodate state between the cycles.
>
> However, I've just run into this and I'm not sure what it means or how to
> resolve it. Is this is a software issue or the result of a configuration
> error?
>
> Meanwhile, the r1 filesystem is currently mounted on node a and it appears
> to be intact; nfs clients have been happily writting away without any errors.
>
> 2.6.18-8.1.14.el5 x86_64, centos
>
> on node a:
> # cat /proc/drbd
> version: 8.0.6 (api:86/proto:86)
> SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56
> 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
> ns:34415876 nr:25906480 dw:60322492 dr:2187965 al:8966 bm:888 lo:0 pe:0 ua:0 ap:0
> resync: used:0/31 hits:271442 misses:308 starving:0 dirty:0 changed:308
> act_log: used:0/521 hits:8595037 misses:8978 starving:0 dirty:12 changed:8966
> 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
> ns:36930244 nr:24805160 dw:63101540 dr:1859585 al:10341 bm:739 lo:0 pe:0 ua:0 ap:0
> [==>.................] sync'ed: 14.5% (1427132/1664416)K
> stalled
> resync: used:0/31 hits:309394 misses:341 starving:0 dirty:0 changed:341
> act_log: used:0/521 hits:9563754 misses:10354 starving:0 dirty:13 changed:10341
>
> on node b:
> # cat /proc/drbd
> version: 8.0.6 (api:86/proto:86)
> SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56
> 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
> ns:0 nr:39210316 dw:39210252 dr:0 al:0 bm:153 lo:17 pe:0 ua:16 ap:0
> resync: used:0/31 hits:122901 misses:153 starving:0 dirty:0 changed:153
> act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0
> 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
> ns:0 nr:42409152 dw:42409152 dr:0 al:0 bm:115 lo:0 pe:0 ua:0 ap:0
> resync: used:0/31 hits:104757 misses:115 starving:0 dirty:0 changed:115
> act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0
>
>
>
> node a reboots:
> Oct 11 12:38:21 a kernel: drbd1: disk( Diskless -> Attaching )
> Oct 11 12:38:21 a kernel: drbd1: Found 10 transactions (592 active extents) in activity log.
> Oct 11 12:38:21 a kernel: drbd1: max_segment_size ( = BIO size ) = 32768
> Oct 11 12:38:21 a kernel: drbd1: drbd_bm_resize called with capacity == 4294836144
> Oct 11 12:38:21 a kernel: drbd1: resync bitmap: bits=536854518 words=8388352
> Oct 11 12:38:21 a kernel: drbd1: size = 2047 GB (2147418072 KB)
> Oct 11 12:38:22 a kernel: drbd1: reading of bitmap took 295 jiffies
> Oct 11 12:38:22 a kernel: drbd1: recounting of set bits took additional 116 jiffies
> Oct 11 12:38:22 a kernel: drbd1: 0 KB marked out-of-sync by on disk bit-map.
> Oct 11 12:38:22 a kernel: drbd1: disk( Attaching -> Outdated )
> Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now.
> Oct 11 12:38:22 a kernel: drbd1: conn( StandAlone -> Unconnected )
> Oct 11 12:38:22 a kernel: drbd1: receiver (re)started
> Oct 11 12:38:22 a kernel: drbd1: conn( Unconnected -> WFConnection )
> Oct 11 12:38:22 a kernel: drbd1: conn( WFConnection -> WFReportParams )
> Oct 11 12:38:22 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86
> Oct 11 12:38:22 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
> Oct 11 12:38:22 a kernel: drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now.
> Oct 11 12:38:23 a kernel: drbd1: conn( WFBitMapT -> WFSyncUUID )
> Oct 11 12:38:23 a kernel: drbd1: aftr_isp( 0 -> 1 )
> Oct 11 12:38:23 a kernel: drbd1: peer_isp( 0 -> 1 )
> Oct 11 12:38:23 a kernel: drbd1: conn( WFSyncUUID -> PausedSyncT ) disk( Outdated -> Inconsistent )
> Oct 11 12:38:23 a kernel: drbd1: Began resync as PausedSyncT (will sync 3267100 KB [816775 bits set]).
> Oct 11 12:38:23 a kernel: drbd1: Writing meta data super block now.
> Oct 11 12:39:27 a kernel: drbd1: aftr_isp( 1 -> 0 )
> Oct 11 12:39:27 a kernel: drbd1: conn( PausedSyncT -> SyncTarget ) peer_isp( 1 -> 0 )
> Oct 11 12:39:27 a kernel: drbd1: Syncer continues.
> Oct 11 12:40:45 a kernel: drbd1: Resync done (total 141 sec; paused 63 sec; 41884 K/sec)
> Oct 11 12:40:45 a kernel: drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
> Oct 11 12:40:45 a kernel: drbd1: Writing meta data super block now.
>
> note b goes down for reboot:
> Oct 11 13:05:24 a kernel: drbd1: peer( Primary -> Secondary )
> Oct 11 13:05:24 a kernel: drbd1: role( Secondary -> Primary )
> Oct 11 13:05:24 a kernel: drbd1: Writing meta data super block now.
> Oct 11 13:05:36 a kernel: drbd1: State change failed: Refusing to be Primary while peer is not outdated
> Oct 11 13:05:36 a kernel: drbd1: state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- }
> Oct 11 13:05:36 a kernel: drbd1: wanted = { cs:TearDown st:Primary/Unknown ds:UpToDate/DUnknown s--- }
> Oct 11 13:05:36 a kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )
> Oct 11 13:05:36 a kernel: drbd1: Creating new current UUID
> Oct 11 13:05:36 a kernel: drbd1: Writing meta data super block now.
> Oct 11 13:05:36 a kernel: drbd1: asender terminated
> Oct 11 13:05:36 a kernel: drbd1: tl_clear()
> Oct 11 13:05:36 a kernel: drbd1: Connection closed
> Oct 11 13:05:36 a kernel: drbd1: conn( TearDown -> Unconnected )
> Oct 11 13:05:36 a kernel: drbd1: receiver terminated
> Oct 11 13:05:36 a kernel: drbd1: receiver (re)started
> Oct 11 13:05:36 a kernel: drbd1: conn( Unconnected -> WFConnection )
> Oct 11 13:05:45 a Filesystem[12372]: [12422]: INFO: Running start for /dev/drbd1 on /mnt/b.vnfs
> Oct 11 13:05:45 a kernel: EXT3 FS on drbd1, internal journal
>
> node b is back:
> Oct 11 13:08:03 a kernel: drbd1: conn( WFConnection -> WFReportParams )
> Oct 11 13:08:03 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86
> Oct 11 13:08:03 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
> Oct 11 13:08:03 a kernel: drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
> Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now.
> Oct 11 13:08:04 a kernel: drbd1: aftr_isp( 0 -> 1 )
> Oct 11 13:08:04 a kernel: drbd1: conn( WFBitMapS -> PausedSyncS ) pdsk( Outdated -> Inconsistent )
> Oct 11 13:08:04 a kernel: drbd1: Began resync as PausedSyncS (will sync 1664416 KB [416104 bits set]).
> Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now.
> Oct 11 13:08:04 a kernel: drbd1: pdsk( Inconsistent -> Outdated ) peer_isp( 0 -> 1 )
> Oct 11 13:09:08 a kernel: drbd1: aftr_isp( 1 -> 0 )
> Oct 11 13:09:08 a kernel: drbd1: conn( PausedSyncS -> SyncSource ) pdsk( Outdated -> Inconsistent ) peer_isp( 1 -> 0 )
> Oct 11 13:09:08 a kernel: drbd1: Syncer continues.
> Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=772139 > rs_total=416104 (rs_failed 0)
> Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0)
> Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0)
> Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771447 > rs_total=416104 (rs_failed 0)
> Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771446 > rs_total=416104 (rs_failed 0)
> Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=768609 > rs_total=416104 (rs_failed 0)
> Oct 11 13:09:11 a kernel: drbd1: logic bug? rs_left=762068 > rs_total=416104 (rs_failed 0)
now, that is interessting...
probably indeed a "logic bug", in that we don't get the accounting
right. I believe thought that drbd does the "right thing" still,
and only gets the counter wrong for some reason. we'll have a look.
> # cat /etc/drbd.conf
> global {
> usage-count yes;
> }
>
> common {
> syncer { rate 500M; }
> }
>
> resource r0 {
> protocol C;
>
> disk {
> on-io-error detach;
> fencing resource-and-stonith;
> }
>
> handlers {
> outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
> pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
> pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
> local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
> }
>
> syncer {
> rate 100M;
> al-extents 521;
> }
>
> net {
> cram-hmac-alg "sha1";
> shared-secret "FrooFrooFunFactory";
> #http://lists.linbit.com/pipermail/drbd-user/2007-March/006651.html
> sndbuf-size 512k;
> max-buffers 8192;
> unplug-watermark 8192;
> max-epoch-size 8192;
> }
>
> on a.store. {
> device /dev/drbd0;
> disk /dev/sda;
> address 192.168.1.1:7788;
> meta-disk internal;
> }
>
> on b.store. {
> device /dev/drbd0;
> disk /dev/sda;
> address 192.168.1.2:7788;
> meta-disk internal;
> }
> }
>
> resource r1 {
> protocol C;
>
> disk {
> on-io-error detach;
> fencing resource-and-stonith;
> }
>
> handlers {
> outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
> pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
> pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
> local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
> }
>
> net {
> cram-hmac-alg "sha1";
> shared-secret "FrooFrooFunFactory";
> sndbuf-size 512k;
> max-buffers 8192;
> unplug-watermark 8192;
> max-epoch-size 8192;
> after-sb-0pri discard-younger-primary;
> after-sb-1pri consensus;
> after-sb-2pri disconnect;
>
> }
>
> syncer {
> rate 100M;
> al-extents 521;
> after "r0";
> }
>
> on a.store. {
> device /dev/drbd1;
> disk /dev/sdb;
> address 192.168.1.5:7788;
> meta-disk internal;
> }
>
> on b.store. {
> device /dev/drbd1;
> disk /dev/sdb;
> address 192.168.1.6:7788;
> meta-disk internal;
> }
> }
>
>
> --
> Kelsey Cummings - kgc at corp.sonic.net sonic.net, inc.
> System Architect 2260 Apollo Way
> 707.522.1000 Santa Rosa, CA 95407
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
--
: commercial DRBD/HA support and consulting: sales at linbit.com :
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :
__
please use the "List-Reply" function of your email client.