[DRBD-user] Primary/Primary not activated after reboot of both nodes

Sat Aug 2 14:21:12 CEST 2008

Hello,

I have installed DRBD in Primary/Primary mode on both nodes, with LVM2
(clvm) on top of it and GFS2 on an LV in the clustered VG.

If I reboot both nodes, Primary/Primary mode is not activated:

version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by phil at mescal, 2008-02-12 11:56:43
 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:195 lo:0 pe:0 ua:0 ap:0
	resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
	act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
 1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
	resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
	act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
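
For checking this state from a script, here is a minimal sketch that parses the status lines above. It assumes the 8.0-style /proc/drbd layout shown here ("N: cs:... st:local/peer ds:local/peer"); the function names parse_proc_drbd and not_dual_primary are only illustrative and not part of any DRBD tool.

```python
import re

# Matches 8.0-style status lines such as
# " 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---"
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+st:(?P<local>\w+)/(?P<peer>\w+)"
    r"\s+ds:(?P<ldisk>\w+)/(?P<pdisk>\w+)"
)


def parse_proc_drbd(text):
    """Return {minor: {cs, local, peer, ldisk, pdisk}} for each status line."""
    devices = {}
    for line in text.splitlines():
        m = STATUS_RE.match(line)
        if m:
            fields = m.groupdict()
            minor = int(fields.pop("minor"))
            devices[minor] = fields
    return devices


def not_dual_primary(devices):
    """Minors that are not Connected with both sides Primary."""
    return sorted(
        minor
        for minor, d in devices.items()
        if not (d["cs"] == "Connected"
                and d["local"] == "Primary"
                and d["peer"] == "Primary")
    )


# Status text copied from the output above (counter lines shortened).
SAMPLE = """\
version: 8.0.11 (api:86/proto:86)
 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:195 lo:0 pe:0 ua:0 ap:0
 1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
"""

if __name__ == "__main__":
    print(not_dual_primary(parse_proc_drbd(SAMPLE)))
```

On a live node the text would come from reading /proc/drbd instead of SAMPLE.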

drbd.conf:

global {
    usage-count no;
}
common {
  syncer { rate 10M; }
}
resource gfs-drbd0 {
  protocol C;
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root";
    split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
  }
  startup {
    degr-wfc-timeout 120;    # 2 minutes.
    become-primary-on both;
  }
  disk {
    on-io-error   detach;
  }
  net {
    ko-count 6;
    allow-two-primaries;
    cram-hmac-alg "md5";
    shared-secret "xyt";
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  syncer {
    rate 30M;
    al-extents 257;
  }
  on parastore01 {
    device     /dev/drbd0;
    disk       /dev/sda4;
    address    192.168.2.15:7788;
    flexible-meta-disk  internal;
  }
  on parastore02 {
    device    /dev/drbd0;
    disk      /dev/sda4;
    address   192.168.2.14:7788;
    flexible-meta-disk internal;
  }
}

# (second resource for drbd1 omitted: same setup as drbd0, but with sdb4
# and other TCP ports for communication)

Syslog tells me that there are network problems, but I don't see any network
trouble, neither on the switch nor on the NICs:

Jul 30 23:06:45 parastore01 kernel: [  347.645268] drbd0: disk( Diskless -> Attaching )
Jul 30 23:06:45 parastore01 kernel: [  347.645278] drbd0: Starting worker thread (from cqueue/0 [3941])
Jul 30 23:06:45 parastore01 kernel: [  347.677976] drbd0: Found 6 transactions (295 active extents) in activity log.
Jul 30 23:06:45 parastore01 kernel: [  347.677985] drbd0: max_segment_size ( = BIO size ) = 32768
Jul 30 23:06:45 parastore01 kernel: [  347.677991] drbd0: drbd_bm_resize called with capacity == 95551624
Jul 30 23:06:45 parastore01 kernel: [  347.680187] drbd0: resync bitmap: bits=11943953 words=373250
Jul 30 23:06:45 parastore01 kernel: [  347.680197] drbd0: size = 45 GB (47775812 KB)
Jul 30 23:06:45 parastore01 kernel: [  347.735129] drbd0: reading of bitmap took 5 jiffies
Jul 30 23:06:45 parastore01 kernel: [  347.737749] drbd0: recounting of set bits took additional 1 jiffies
Jul 30 23:06:45 parastore01 kernel: [  347.737754] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jul 30 23:06:45 parastore01 kernel: [  347.737885] drbd0: Marked additional 984 MB as out-of-sync based on AL.
Jul 30 23:06:48 parastore01 kernel: [  350.775353] drbd0: disk( Attaching -> UpToDate )
Jul 30 23:06:48 parastore01 kernel: [  350.775421] drbd0: Writing meta data super block now.
...
Jul 30 23:06:48 parastore01 kernel: [  351.048102] drbd0: conn( StandAlone -> Unconnected )
Jul 30 23:06:48 parastore01 kernel: [  351.048266] drbd0: Starting receiver thread (from drbd0_worker [5116])
Jul 30 23:06:48 parastore01 kernel: [  351.049392] drbd0: receiver (re)started
Jul 30 23:06:48 parastore01 kernel: [  351.049404] drbd0: conn( Unconnected -> WFConnection )
...
Jul 30 23:08:51 parastore01 kernel: [  474.018937] drbd0: Handshake successful: DRBD Network Protocol version 86
Jul 30 23:08:51 parastore01 kernel: [  474.019664] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Jul 30 23:08:51 parastore01 kernel: [  474.019676] drbd0: conn( WFConnection -> WFReportParams )
Jul 30 23:08:51 parastore01 kernel: [  474.019683] drbd0: Starting asender thread (from drbd0_receiver [5135])
...
Jul 30 23:08:51 parastore01 kernel: [  474.091144] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jul 30 23:08:51 parastore01 kernel: [  474.091159] drbd0: Writing meta data super block now.
...
Jul 30 23:09:10 parastore01 kernel: [  492.620587] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 5
Jul 30 23:09:16 parastore01 kernel: [  498.617884] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 4
Jul 30 23:09:22 parastore01 kernel: [  504.615194] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 3
Jul 30 23:09:28 parastore01 kernel: [  510.612504] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 2
Jul 30 23:09:34 parastore01 kernel: [  516.609814] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 1
Jul 30 23:09:40 parastore01 kernel: [  522.607129] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
Jul 30 23:09:40 parastore01 kernel: [  522.607144] drbd0: short sent ReportBitMap size=4096 sent=3216
Jul 30 23:09:40 parastore01 kernel: [  522.607241] drbd0: error receiving ReportBitMap, l: 0!
Jul 30 23:09:40 parastore01 kernel: [  522.607741] drbd0: role( Secondary -> Primary )
Jul 30 23:09:40 parastore01 kernel: [  522.607754] drbd0: Creating new current UUID
Jul 30 23:09:40 parastore01 kernel: [  522.607772] drbd0: Writing meta data super block now.
Jul 30 23:09:40 parastore01 kernel: [  522.607877] drbd0: asender terminated
Jul 30 23:09:40 parastore01 kernel: [  522.607882] drbd0: Terminating asender thread
Jul 30 23:09:40 parastore01 kernel: [  522.608699] drbd0: tl_clear()
Jul 30 23:09:40 parastore01 kernel: [  522.608704] drbd0: Connection closed
Jul 30 23:09:40 parastore01 kernel: [  522.608711] drbd0: conn( Timeout -> Unconnected )
Jul 30 23:09:40 parastore01 kernel: [  522.608715] drbd0: receiver terminated
Jul 30 23:09:40 parastore01 kernel: [  522.608717] drbd0: receiver (re)started
Jul 30 23:09:40 parastore01 kernel: [  522.608721] drbd0: conn( Unconnected -> WFConnection )
Jul 30 23:09:40 parastore01 kernel: [  522.907015] drbd0: Handshake successful: DRBD Network Protocol version 86
Jul 30 23:09:40 parastore01 kernel: [  522.907790] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Jul 30 23:09:40 parastore01 kernel: [  522.907803] drbd0: conn( WFConnection -> WFReportParams )
Jul 30 23:09:40 parastore01 kernel: [  522.907808] drbd0: Starting asender thread (from drbd0_receiver [5135])
Jul 30 23:09:40 parastore01 kernel: [  522.948382] drbd0: meta connection shut down by peer.
Jul 30 23:09:40 parastore01 kernel: [  522.948454] drbd0: conn( WFReportParams -> NetworkFailure )
Jul 30 23:09:40 parastore01 kernel: [  522.948464] drbd0: asender terminated
Jul 30 23:09:40 parastore01 kernel: [  522.948467] drbd0: Terminating asender thread
Jul 30 23:09:40 parastore01 kernel: [  522.949311] drbd0: tl_clear()
Jul 30 23:09:40 parastore01 kernel: [  522.949317] drbd0: Connection closed
Jul 30 23:09:40 parastore01 kernel: [  522.949324] drbd0: conn( NetworkFailure -> Unconnected )
Jul 30 23:09:40 parastore01 kernel: [  522.949328] drbd0: receiver terminated
Jul 30 23:09:40 parastore01 kernel: [  522.949331] drbd0: receiver (re)started
Jul 30 23:09:40 parastore01 kernel: [  522.949334] drbd0: conn( Unconnected -> WFConnection )
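
To see how the "ko" countdown in the log relates to the configured ko-count of 6, the following sketch extracts the countdown values and the spacing between them. It assumes the bracketed kernel timestamps are seconds since boot; ko_countdown and intervals are illustrative names. That the spacing reflects the net "timeout" setting (not set in the config above, so presumably at its default) is an assumption, not something the log itself states.

```python
import re

# Picks the kernel timestamp and the remaining ko value out of lines like
# "... kernel: [  492.620587] drbd0: [...] sock_sendmsg time expired, ko = 5"
KO_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+drbd\d+:.*"
    r"sock_sendmsg time expired, ko = (?P<ko>\d+)"
)


def ko_countdown(log_text):
    """Return a list of (timestamp_seconds, ko_value) in log order."""
    events = []
    for line in log_text.splitlines():
        m = KO_RE.search(line)
        if m:
            events.append((float(m.group("ts")), int(m.group("ko"))))
    return events


def intervals(events):
    """Seconds between consecutive countdown messages."""
    return [round(b[0] - a[0], 3) for a, b in zip(events, events[1:])]


# The five countdown lines from the log above.
LOG = """\
Jul 30 23:09:10 parastore01 kernel: [  492.620587] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 5
Jul 30 23:09:16 parastore01 kernel: [  498.617884] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 4
Jul 30 23:09:22 parastore01 kernel: [  504.615194] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 3
Jul 30 23:09:28 parastore01 kernel: [  510.612504] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 2
Jul 30 23:09:34 parastore01 kernel: [  516.609814] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 1
"""

if __name__ == "__main__":
    events = ko_countdown(LOG)
    print([ko for _, ko in events])
    print(intervals(events))
```

The messages arrive roughly six seconds apart, and the connection goes to Timeout about six seconds after "ko = 1", so the peer did not accept the bitmap data for the whole countdown.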

When I reboot the node that was booted second during the first boot, the
device drbd0 goes to Primary/Primary and starts syncing.
Could I tweak the timeout values to avoid this behaviour?

Another strange thing: I have a second device (drbd1) which is
Secondary/Secondary after the first boot, and Secondary/Primary after I
reboot the second node:

version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by phil at mescal, 2008-02-12 11:56:43
 0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---
    ns:1007616 nr:0 dw:0 dr:1007616 al:0 bm:390 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:62781 misses:195 starving:0 dirty:0 changed:195
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0

How can I avoid these problems?

Thanks, Joysn