[DRBD-user] drbd logic bug?

Kelsey Cummings kgc at corp.sonic.net
Thu Oct 11 22:55:22 CEST 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


I'm continuing testing an active/active nfs cluster using a pair of drbd
replicated filesystems.  Currently, to test that it is working, the nodes
are set to reboot on even/odd hours to ensure that failover works as
expected without any trouble.  Which, for the most part, it does so long as
they are allowed to reach uptodate state between the cycles.

However, I've just run into this and I'm not sure what it means or how to
resolve it.  Is this is a software issue or the result of a configuration
error?

Meanwhile, the r1 filesystem is currently mounted on node a and it appears
to be intact; nfs clients have been happily writting away without any errors.

2.6.18-8.1.14.el5 x86_64, centos

on node a:
# cat /proc/drbd 
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:34415876 nr:25906480 dw:60322492 dr:2187965 al:8966 bm:888 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:271442 misses:308 starving:0 dirty:0 changed:308
        act_log: used:0/521 hits:8595037 misses:8978 starving:0 dirty:12 changed:8966
 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:36930244 nr:24805160 dw:63101540 dr:1859585 al:10341 bm:739 lo:0 pe:0 ua:0 ap:0
        [==>.................] sync'ed: 14.5% (1427132/1664416)K
        stalled
        resync: used:0/31 hits:309394 misses:341 starving:0 dirty:0 changed:341
        act_log: used:0/521 hits:9563754 misses:10354 starving:0 dirty:13 changed:10341

on node b:
# cat /proc/drbd 
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:39210316 dw:39210252 dr:0 al:0 bm:153 lo:17 pe:0 ua:16 ap:0
        resync: used:0/31 hits:122901 misses:153 starving:0 dirty:0 changed:153
        act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0
 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:42409152 dw:42409152 dr:0 al:0 bm:115 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:104757 misses:115 starving:0 dirty:0 changed:115
        act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0



node a reboots:
Oct 11 12:38:21 a kernel: drbd1: disk( Diskless -> Attaching ) 
Oct 11 12:38:21 a kernel: drbd1: Found 10 transactions (592 active extents) in activity log.
Oct 11 12:38:21 a kernel: drbd1: max_segment_size ( = BIO size ) = 32768
Oct 11 12:38:21 a kernel: drbd1: drbd_bm_resize called with capacity == 4294836144
Oct 11 12:38:21 a kernel: drbd1: resync bitmap: bits=536854518 words=8388352
Oct 11 12:38:21 a kernel: drbd1: size = 2047 GB (2147418072 KB)
Oct 11 12:38:22 a kernel: drbd1: reading of bitmap took 295 jiffies
Oct 11 12:38:22 a kernel: drbd1: recounting of set bits took additional 116 jiffies
Oct 11 12:38:22 a kernel: drbd1: 0 KB marked out-of-sync by on disk bit-map.
Oct 11 12:38:22 a kernel: drbd1: disk( Attaching -> Outdated ) 
Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now.
Oct 11 12:38:22 a kernel: drbd1: conn( StandAlone -> Unconnected ) 
Oct 11 12:38:22 a kernel: drbd1: receiver (re)started
Oct 11 12:38:22 a kernel: drbd1: conn( Unconnected -> WFConnection ) 
Oct 11 12:38:22 a kernel: drbd1: conn( WFConnection -> WFReportParams ) 
Oct 11 12:38:22 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86
Oct 11 12:38:22 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
Oct 11 12:38:22 a kernel: drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Oct 11 12:38:22 a kernel: drbd1: Writing meta data super block now.
Oct 11 12:38:23 a kernel: drbd1: conn( WFBitMapT -> WFSyncUUID ) 
Oct 11 12:38:23 a kernel: drbd1: aftr_isp( 0 -> 1 ) 
Oct 11 12:38:23 a kernel: drbd1: peer_isp( 0 -> 1 ) 
Oct 11 12:38:23 a kernel: drbd1: conn( WFSyncUUID -> PausedSyncT ) disk( Outdated -> Inconsistent ) 
Oct 11 12:38:23 a kernel: drbd1: Began resync as PausedSyncT (will sync 3267100 KB [816775 bits set]).
Oct 11 12:38:23 a kernel: drbd1: Writing meta data super block now.
Oct 11 12:39:27 a kernel: drbd1: aftr_isp( 1 -> 0 ) 
Oct 11 12:39:27 a kernel: drbd1: conn( PausedSyncT -> SyncTarget ) peer_isp( 1 -> 0 ) 
Oct 11 12:39:27 a kernel: drbd1: Syncer continues.
Oct 11 12:40:45 a kernel: drbd1: Resync done (total 141 sec; paused 63 sec; 41884 K/sec)
Oct 11 12:40:45 a kernel: drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
Oct 11 12:40:45 a kernel: drbd1: Writing meta data super block now.

note b goes down for reboot:
Oct 11 13:05:24 a kernel: drbd1: peer( Primary -> Secondary ) 
Oct 11 13:05:24 a kernel: drbd1: role( Secondary -> Primary ) 
Oct 11 13:05:24 a kernel: drbd1: Writing meta data super block now.
Oct 11 13:05:36 a kernel: drbd1: State change failed: Refusing to be Primary while peer is not outdated
Oct 11 13:05:36 a kernel: drbd1:   state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- }
Oct 11 13:05:36 a kernel: drbd1:  wanted = { cs:TearDown st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Oct 11 13:05:36 a kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated ) 
Oct 11 13:05:36 a kernel: drbd1: Creating new current UUID
Oct 11 13:05:36 a kernel: drbd1: Writing meta data super block now.
Oct 11 13:05:36 a kernel: drbd1: asender terminated
Oct 11 13:05:36 a kernel: drbd1: tl_clear()
Oct 11 13:05:36 a kernel: drbd1: Connection closed
Oct 11 13:05:36 a kernel: drbd1: conn( TearDown -> Unconnected ) 
Oct 11 13:05:36 a kernel: drbd1: receiver terminated
Oct 11 13:05:36 a kernel: drbd1: receiver (re)started
Oct 11 13:05:36 a kernel: drbd1: conn( Unconnected -> WFConnection ) 
Oct 11 13:05:45 a Filesystem[12372]: [12422]: INFO: Running start for /dev/drbd1 on /mnt/b.vnfs
Oct 11 13:05:45 a kernel: EXT3 FS on drbd1, internal journal

node b is back:
Oct 11 13:08:03 a kernel: drbd1: conn( WFConnection -> WFReportParams ) 
Oct 11 13:08:03 a kernel: drbd1: Handshake successful: DRBD Network Protocol version 86
Oct 11 13:08:03 a kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
Oct 11 13:08:03 a kernel: drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) 
Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now.
Oct 11 13:08:04 a kernel: drbd1: aftr_isp( 0 -> 1 ) 
Oct 11 13:08:04 a kernel: drbd1: conn( WFBitMapS -> PausedSyncS ) pdsk( Outdated -> Inconsistent ) 
Oct 11 13:08:04 a kernel: drbd1: Began resync as PausedSyncS (will sync 1664416 KB [416104 bits set]).
Oct 11 13:08:04 a kernel: drbd1: Writing meta data super block now.
Oct 11 13:08:04 a kernel: drbd1: pdsk( Inconsistent -> Outdated ) peer_isp( 0 -> 1 ) 
Oct 11 13:09:08 a kernel: drbd1: aftr_isp( 1 -> 0 ) 
Oct 11 13:09:08 a kernel: drbd1: conn( PausedSyncS -> SyncSource ) pdsk( Outdated -> Inconsistent ) peer_isp( 1 -> 0 ) 
Oct 11 13:09:08 a kernel: drbd1: Syncer continues.
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=772139 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771447 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771446 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=768609 > rs_total=416104 (rs_failed 0)
Oct 11 13:09:11 a kernel: drbd1: logic bug? rs_left=762068 > rs_total=416104 (rs_failed 0)

# cat /etc/drbd.conf
global {
        usage-count yes;
}

common {
  syncer { rate 500M; }
}

resource r0 {
  protocol C;

  disk {
      on-io-error       detach;
      fencing           resource-and-stonith;
  }

  handlers {
      outdate-peer              "/usr/lib64/heartbeat/drbd-peer-outdater";
      pri-on-incon-degr         "echo o > /proc/sysrq-trigger ; halt -f";
      pri-lost-after-sb         "echo o > /proc/sysrq-trigger ; halt -f";
      local-io-error            "echo o > /proc/sysrq-trigger ; halt -f";
  }

  syncer {
     rate 100M;
     al-extents 521;
  }

  net {
    cram-hmac-alg "sha1";
    shared-secret "FrooFrooFunFactory";
    #http://lists.linbit.com/pipermail/drbd-user/2007-March/006651.html
    sndbuf-size             512k;
    max-buffers             8192;
    unplug-watermark        8192;
    max-epoch-size          8192;
  }

  on a.store. {
    device     /dev/drbd0;
    disk       /dev/sda;
    address    192.168.1.1:7788;
    meta-disk  internal;
  }

  on b.store. {
    device    /dev/drbd0;
    disk      /dev/sda;
    address   192.168.1.2:7788;
    meta-disk internal;
  }
}

resource r1 {
  protocol C;

  disk {
      on-io-error       detach;
      fencing           resource-and-stonith;
  }

  handlers {
      outdate-peer              "/usr/lib64/heartbeat/drbd-peer-outdater";
      pri-on-incon-degr         "echo o > /proc/sysrq-trigger ; halt -f";
      pri-lost-after-sb         "echo o > /proc/sysrq-trigger ; halt -f";
      local-io-error            "echo o > /proc/sysrq-trigger ; halt -f";
  }

  net {
    cram-hmac-alg "sha1";
    shared-secret "FrooFrooFunFactory";
    sndbuf-size             512k;
    max-buffers             8192;
    unplug-watermark        8192;
    max-epoch-size          8192;
    after-sb-0pri discard-younger-primary;
    after-sb-1pri consensus;
    after-sb-2pri disconnect;

  }

  syncer {
     rate 100M;
     al-extents 521;
     after "r0";
  }

  on a.store. {
    device     /dev/drbd1;
    disk       /dev/sdb;
    address    192.168.1.5:7788;
    meta-disk  internal;
  }

  on b.store. {
    device    /dev/drbd1;
    disk      /dev/sdb;
    address   192.168.1.6:7788;
    meta-disk internal;
  }
}


-- 
Kelsey Cummings - kgc at corp.sonic.net      sonic.net, inc.
System Architect                          2260 Apollo Way
707.522.1000                              Santa Rosa, CA 95407



More information about the drbd-user mailing list