[DRBD-user] RHEL cluster, DRBD, and the same problem after fencing

Theophanis Kontogiannis theophanis_kontogiannis at yahoo.com
Sun Oct 11 11:23:15 CEST 2009



Hello all

I am running RHEL 5.3 (CentOS) with drbd 8.3.1 on the 2.6.18-164.el5xen kernel.

I have set up a two-node cluster that runs, among other services, BOINC on
both nodes.

For some reason BOINC always causes both nodes to crash or be fenced
after a few hours of operation.

However, the real issue is that ALWAYS after the reboot, drbd (which
provides the cluster storage) does not recover.

This is the log from node 1:
...............
Oct 11 11:59:42 localhost kernel: drbd2: conn( WFBitMapT ->
WFSyncUUID ) 
Oct 11 11:59:42 localhost clurgmgrd: [4307]: <info>
Executing /etc/init.d/drbd status 
Oct 11 11:59:42 localhost kernel: drbd2: helper command: /sbin/drbdadm
before-resync-target minor-2
Oct 11 11:59:42 localhost kernel: drbd2: helper command: /sbin/drbdadm
before-resync-target minor-2 exit code 0 (0x0)
Oct 11 11:59:42 localhost kernel: drbd2: conn( WFSyncUUID ->
SyncTarget ) disk( Outdated -> Inconsistent ) 
Oct 11 11:59:42 localhost kernel: drbd2: Began resync as SyncTarget
(will sync 2031616 KB [507904 bits set]).
Oct 11 11:59:43 localhost kernel: drbd1: peer( Secondary -> Primary ) 
Oct 11 11:59:43 localhost kernel: drbd2: peer( Secondary -> Primary ) 
Oct 11 11:59:43 localhost kernel: drbd0: role( Secondary -> Primary ) 
Oct 11 11:59:43 localhost kernel: drbd1: role( Secondary -> Primary ) 
Oct 11 11:59:44 localhost kernel: drbd2: role( Secondary -> Primary ) 
Oct 11 11:59:44 localhost kernel: drbd2: Resync done (total 45 sec;
paused 0 sec; 45144 K/sec)
Oct 11 11:59:44 localhost kernel: drbd2: conn( SyncTarget -> Connected )
disk( Inconsistent -> UpToDate ) 
Oct 11 11:59:45 localhost kernel: drbd2: helper command: /sbin/drbdadm
after-resync-target minor-2
Oct 11 11:59:46 localhost kernel: drbd2: helper command: /sbin/drbdadm
after-resync-target minor-2 exit code 0 (0x0)
Oct 11 12:00:08 localhost kernel: drbd0: Resync done (total 84 sec;
paused 0 sec; 12824 K/sec)
Oct 11 12:00:08 localhost kernel: drbd0: conn( SyncTarget -> Connected )
disk( Inconsistent -> UpToDate ) 
Oct 11 12:00:08 localhost kernel: drbd0: helper command: /sbin/drbdadm
after-resync-target minor-0
Oct 11 12:00:08 localhost kernel: drbd0: helper command: /sbin/drbdadm
after-resync-target minor-0 exit code 0 (0x0)
Oct 11 12:00:13 localhost clurgmgrd: [4307]: <info>
Executing /etc/init.d/drbd status 
Oct 11 12:00:43 localhost clurgmgrd: [4307]: <info>
Executing /etc/init.d/drbd status 
Oct 11 12:01:13 localhost clurgmgrd: [4307]: <info>
Executing /etc/init.d/drbd status 
Oct 11 12:01:33 localhost kernel: drbd0: peer( Primary -> Secondary ) 
Oct 11 12:01:33 localhost kernel: drbd0: State change failed: Refusing
to be Primary while peer is not outdated
Oct 11 12:01:33 localhost kernel: drbd0:   state = { cs:Connected
ro:Primary/Secondary ds:UpToDate/UpToDate r--- }
Oct 11 12:01:33 localhost kernel: drbd0:  wanted = { cs:TearDown
ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
Oct 11 12:01:33 localhost kernel: drbd0: peer( Secondary -> Unknown )
conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated ) 
Oct 11 12:01:33 localhost kernel: drbd0: Creating new current UUID
Oct 11 12:01:33 localhost kernel: drbd0: meta connection shut down by
peer.
Oct 11 12:01:33 localhost kernel: drbd0: asender terminated
Oct 11 12:01:33 localhost kernel: drbd0: Terminating asender thread
Oct 11 12:01:34 localhost kernel: drbd0: Connection closed
Oct 11 12:01:34 localhost kernel: drbd0: conn( TearDown ->
Unconnected ) 
Oct 11 12:01:34 localhost kernel: drbd0: receiver terminated
Oct 11 12:01:34 localhost kernel: drbd0: Restarting receiver thread
Oct 11 12:01:34 localhost kernel: drbd0: receiver (re)started
Oct 11 12:01:34 localhost kernel: drbd0: conn( Unconnected ->
WFConnection ) 
Oct 11 12:01:34 localhost kernel: drbd1: peer( Primary -> Secondary ) 
Oct 11 12:01:34 localhost kernel: drbd1: State change failed: Refusing
to be Primary without at least one UpToDate disk
Oct 11 12:01:34 localhost kernel: drbd1:   state = { cs:SyncTarget
ro:Primary/Secondary ds:Inconsistent/UpToDate r--- }
Oct 11 12:01:34 localhost kernel: drbd1:  wanted = { cs:TearDown
ro:Primary/Unknown ds:Inconsistent/DUnknown s--- }
Oct 11 12:01:34 localhost kernel: drbd1: State change failed: Refusing
to be Primary without at least one UpToDate disk
Oct 11 12:01:34 localhost kernel: drbd1:   state = { cs:SyncTarget
ro:Primary/Secondary ds:Inconsistent/UpToDate r--- }
Oct 11 12:01:34 localhost kernel: drbd1:  wanted = { cs:TearDown
ro:Primary/Unknown ds:Inconsistent/Outdated r--- }
Oct 11 12:01:34 localhost kernel: drbd2: peer( Primary -> Secondary ) 
Oct 11 12:01:34 localhost kernel: drbd2: State change failed: Refusing
to be Primary while peer is not outdated
Oct 11 12:01:34 localhost kernel: drbd2:   state = { cs:Connected
ro:Primary/Secondary ds:UpToDate/UpToDate r--- }
Oct 11 12:01:34 localhost kernel: drbd2:  wanted = { cs:TearDown
ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
Oct 11 12:01:34 localhost kernel: drbd2: peer( Secondary -> Unknown )
conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated ) 
Oct 11 12:01:34 localhost kernel: drbd2: Creating new current UUID
Oct 11 12:01:34 localhost kernel: drbd2: meta connection shut down by
peer.
Oct 11 12:01:34 localhost kernel: drbd2: asender terminated
Oct 11 12:01:34 localhost kernel: drbd2: Terminating asender thread
Oct 11 12:01:34 localhost kernel: drbd2: Connection closed
Oct 11 12:01:34 localhost kernel: drbd2: conn( TearDown ->
Unconnected ) 
Oct 11 12:01:34 localhost kernel: drbd2: receiver terminated
Oct 11 12:01:34 localhost kernel: drbd2: Restarting receiver thread
Oct 11 12:01:34 localhost kernel: drbd2: receiver (re)started
Oct 11 12:01:34 localhost kernel: drbd2: conn( Unconnected ->
WFConnection ) 
Oct 11 12:01:38 localhost kernel: drbd1: State change failed: Refusing
to be Primary without at least one UpToDate disk
Oct 11 12:01:38 localhost kernel: drbd1:   state = { cs:SyncTarget
ro:Primary/Secondary ds:Inconsistent/UpToDate r--- }
Oct 11 12:01:38 localhost kernel: drbd1:  wanted = { cs:TearDown
ro:Primary/Unknown ds:Inconsistent/DUnknown s--- }
Oct 11 12:01:38 localhost kernel: drbd1: State change failed: Refusing
to be Primary without at least one UpToDate disk
Oct 11 12:01:38 localhost kernel: drbd1:   state = { cs:SyncTarget
ro:Primary/Secondary ds:Inconsistent/UpToDate r--- }
Oct 11 12:01:38 localhost kernel: drbd1:  wanted = { cs:TearDown
ro:Primary/Unknown ds:Inconsistent/Outdated r--- }
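
As a reading aid for logs like the one above, here is a minimal Python sketch that pulls the "field( Old -> New )" transitions out of a single kernel log line. It assumes the line has first been unwrapped (the mail client broke the long lines here), and the function and pattern names are made up for illustration; nothing in it is part of DRBD itself.

```python
import re

# Matches one "name( Old -> New )" group in a DRBD kernel log line.
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")
# Matches the "drbdN:" device prefix.
DEVICE = re.compile(r"\b(drbd\d+):")


def parse_transitions(line):
    """Return (device, {field: (old, new)}) for one unwrapped log line."""
    dev = DEVICE.search(line)
    fields = {name: (old, new) for name, old, new in TRANSITION.findall(line)}
    return (dev.group(1) if dev else None), fields


# One line from the node 1 log, joined back into a single line.
sample = ("Oct 11 12:01:33 localhost kernel: drbd0: peer( Secondary -> Unknown ) "
          "conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )")
device, changes = parse_transitions(sample)
print(device, changes)
```

Running this over both nodes' logs and grouping by device would make it easier to line up what each side saw at 12:01:33 and 12:01:34.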


And this is the log from node 2:
.........
drbd1: conn( WFBitMapS -> SyncSource ) 
drbd1: Began resync as SyncSource (will sync 300165324 KB [75041331 bits
set]).
drbd2: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent ) 
drbd2: Began resync as SyncSource (will sync 2031616 KB [507904 bits
set]).
drbd1: role( Secondary -> Primary ) 
drbd2: role( Secondary -> Primary ) 
drbd0: peer( Secondary -> Primary ) 
drbd1: peer( Secondary -> Primary ) 
drbd2: peer( Secondary -> Primary ) 
virbr0: no IPv6 routers present
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
drbd2: Resync done (total 45 sec; paused 0 sec; 45144 K/sec)
drbd2: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
eth1: too many iterations (16) in nv_nic_irq.
drbd0: Resync done (total 84 sec; paused 0 sec; 12824 K/sec)
drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
dlm: Using TCP for communications
dlm: got connection from 2
drbd0: role( Primary -> Secondary ) 
drbd0: Requested state change failed by peer: Refusing to be Primary
while peer is not outdated (-7)
drbd0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting )
disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown ) 
drbd0: short read expecting header on sock: r=-512
drbd0: asender terminated
drbd0: Terminating asender thread
drbd0: Connection closed
drbd0: conn( Disconnecting -> StandAlone ) 
drbd0: receiver terminated
drbd0: Terminating receiver thread
drbd0: disk( Outdated -> Diskless ) 
drbd0: drbd_bm_resize called with capacity == 0
drbd0: worker terminated
drbd0: Terminating worker thread
drbd1: role( Primary -> Secondary ) 
drbd1: Requested state change failed by peer: Refusing to be Primary
without at least one UpToDate disk (-2)
drbd1: Requested state change failed by peer: Refusing to be Primary
without at least one UpToDate disk (-2)
drbd1: State change failed: Refusing to be inconsistent on both nodes
drbd1:   state = { cs:SyncSource ro:Secondary/Primary
ds:UpToDate/Inconsistent r--- }
drbd1:  wanted = { cs:SyncSource ro:Secondary/Primary
ds:Diskless/Inconsistent r--- }
drbd2: role( Primary -> Secondary ) 
drbd2: Requested state change failed by peer: Refusing to be Primary
while peer is not outdated (-7)
drbd2: peer( Primary -> Unknown ) conn( Connected -> Disconnecting )
disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown ) 
drbd2: short read expecting header on sock: r=-512
drbd2: asender terminated
drbd2: Terminating asender thread
drbd2: Connection closed
drbd2: conn( Disconnecting -> StandAlone ) 
drbd2: receiver terminated
drbd2: Terminating receiver thread
drbd2: disk( Outdated -> Diskless ) 
drbd2: drbd_bm_resize called with capacity == 0
drbd2: worker terminated
drbd2: Terminating worker thread
drbd1: Requested state change failed by peer: Refusing to be Primary
without at least one UpToDate disk (-2)
drbd1: Requested state change failed by peer: Refusing to be Primary
without at least one UpToDate disk (-2)
drbd1: State change failed: Refusing to be inconsistent on both nodes
drbd1:   state = { cs:SyncSource ro:Secondary/Primary
ds:UpToDate/Inconsistent r--- }
drbd1:  wanted = { cs:SyncSource ro:Secondary/Primary
ds:Diskless/Inconsistent r--- }

This is my config:

global {
    # minor-count 64;

    # dialog-refresh 5; # 5 seconds

    # disable-ip-verification;

    usage-count yes;
}


common {

  protocol C;

  syncer {

    rate 100M;

    al-extents 257;
  }

  
 handlers {
 
    pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";

    pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";

    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";

    outdate-peer "/sbin/obliterate";

    pri-lost "echo pri-lost. Have a look at the log files. | mail -s
'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";

    split-brain "echo split-brain. drbdadm -- --discard-my-data connect
$DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";

  }

  startup {

     wfc-timeout  100;

    degr-wfc-timeout 60;    # 1 minute.

    become-primary-on both;

  }

  disk {

    fencing resource-and-stonith;

  }

  net {
    
     sndbuf-size 512k;

     timeout       60;    #  6 seconds  (unit = 0.1 seconds)
     connect-int   10;    # 10 seconds  (unit = 1 second)
     ping-int      10;    # 10 seconds  (unit = 1 second)
     ping-timeout  50;    #  5 seconds  (unit = 0.1 seconds)

     max-buffers     2048;

     max-epoch-size  2048;

     ko-count 10;

    allow-two-primaries;

      cram-hmac-alg "sha1";
      shared-secret "*****";

    after-sb-0pri discard-least-changes;

    after-sb-1pri violently-as0p;

    after-sb-2pri violently-as0p;

    rr-conflict call-pri-lost;



    data-integrity-alg "crc32c";

  }


}


resource r0 {

        device          /dev/drbd0;
        disk            /dev/hda4;
        meta-disk       internal;

 on tweety-1 { address   10.254.254.253:7788; }

 on tweety-2 { address   10.254.254.254:7788; }

}

resource r1 {

        device        /dev/drbd1;
        disk          /dev/hdb4;
        meta-disk     internal;

  on tweety-1 { address  10.254.254.253:7789; }

  on tweety-2 { address  10.254.254.254:7789; }
}

resource r2 {

	device		/dev/drbd2;
	disk		/dev/sda1;
	meta-disk	internal;

  on tweety-1 { address  10.254.254.253:7790; }

  on tweety-2 { address  10.254.254.254:7790; }
}
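
The net options in this config use two different units: in DRBD 8.3, timeout and ping-timeout are given in tenths of a second, while connect-int and ping-int are given in seconds. The following Python sketch converts the configured values to seconds so they can be compared directly. The dictionary layout and function name are made up for illustration, and the ko-count product reflects my understanding that a peer is given up on after ko-count times timeout without completing a write request.

```python
# How many configured units make up one second, per net option (DRBD 8.3).
UNITS_PER_SECOND = {
    "timeout": 10,       # tenths of a second
    "connect-int": 1,    # seconds
    "ping-int": 1,       # seconds
    "ping-timeout": 10,  # tenths of a second
}

# Values copied from the net section of the config in this post.
configured = {"timeout": 60, "connect-int": 10, "ping-int": 10, "ping-timeout": 50}
KO_COUNT = 10


def to_seconds(option, value):
    """Convert a configured net option value to seconds."""
    return value / UNITS_PER_SECOND[option]


seconds = {opt: to_seconds(opt, val) for opt, val in configured.items()}
for opt, secs in seconds.items():
    print(f"{opt:13s} {secs:g} s")

# Assumption: the peer is dropped after ko-count * timeout without progress.
ko_window = KO_COUNT * seconds["timeout"]
print(f"ko-count window: {ko_window:g} s")
```

With these values, timeout works out to 6 seconds and ping-timeout to 5 seconds, since 50 tenths of a second is 5 seconds.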


This is the current status of drbd (after the crash logged above):

version: 8.3.1 (api:88/proto:86-89)
GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by
root@tweety-1, 2009-09-21 17:23:59
 0: cs:Unconfigured
 1: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r---
    ns:64823832 nr:0 dw:0 dr:64831176 al:0 bm:4089 lo:35 pe:83 ua:256
ap:0 ep:1 wo:b oos:235344204
	[===>................] sync'ed: 21.6% (229828/293128)M
	finish: 14:51:27 speed: 4,064 (45,264) K/sec
 2: cs:Unconfigured
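
The progress figures for device 1 can be cross-checked from numbers already in this post. The Python sketch below takes the resync size from the node 2 log ("will sync 300165324 KB") and the oos counter from the status output, on the assumption that this resync covers the whole device, so the resync size doubles as the total. The function name is made up for illustration.

```python
# Figures copied from the post; both are in KB.
TOTAL_KB = 300165324   # node 2 log: "will sync 300165324 KB"
OOS_KB = 235344204     # /proc/drbd: "oos:235344204"


def sync_progress(total_kb, oos_kb):
    """Return (percent synced, whole MiB still out of sync)."""
    percent = 100.0 * (total_kb - oos_kb) / total_kb
    remaining_mib = oos_kb // 1024
    return percent, remaining_mib


percent, remaining_mib = sync_progress(TOTAL_KB, OOS_KB)
print(f"synced: {percent:.1f}%  remaining: {remaining_mib} M")
```

This gives 21.6% synced with 229828 M remaining, which agrees with the sync'ed line in the status output, so the resync on device 1 is progressing rather than stuck.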

Can someone give me some food for thought on what I have done wrong?

Thank you all for your time,

Theophanis Kontogiannis
