Hello,

I have installed DRBD in Primary/Primary mode on both nodes and use LVM2 (clvm), with GFS2 on top of the clustered VG and an LV. If I reboot the nodes, Primary/Primary mode is not activated:

version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by phil at mescal, 2008-02-12 11:56:43
 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:195 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
 1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0

drbd.conf:

global {
    usage-count no;
}
common {
    syncer { rate 10M; }
}
resource gfs-drbd0 {
    protocol C;
    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root";
        split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
    }
    startup {
        degr-wfc-timeout 120; # 2 minutes.
        become-primary-on both;
    }
    disk {
        on-io-error detach;
    }
    net {
        ko-count 6;
        allow-two-primaries;
        cram-hmac-alg "md5";
        shared-secret "xyt";
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
    }
    syncer {
        rate 30M;
        al-extents 257;
    }
    on parastore01 {
        device /dev/drbd0;
        disk /dev/sda4;
        address 192.168.2.15:7788;
        flexible-meta-disk internal;
    }
    on parastore02 {
        device /dev/drbd0;
        disk /dev/sda4;
        address 192.168.2.14:7788;
        flexible-meta-disk internal;
    }
}
// drbd1 with same setup as drbd0, but with sdb4 and other
// tcp ports for communication
}

Syslog tells me that there are network problems, but I don't see any network trouble, neither on the switch nor on the NICs:

Jul 30 23:06:45 parastore01 kernel: [ 347.645268] drbd0: disk( Diskless -> Attaching )
Jul 30 23:06:45 parastore01 kernel: [ 347.645278] drbd0: Starting worker thread (from cqueue/0 [3941])
Jul 30 23:06:45 parastore01 kernel: [ 347.677976] drbd0: Found 6 transactions (295 active extents) in activity log.
Jul 30 23:06:45 parastore01 kernel: [ 347.677985] drbd0: max_segment_size ( = BIO size ) = 32768
Jul 30 23:06:45 parastore01 kernel: [ 347.677991] drbd0: drbd_bm_resize called with capacity == 95551624
Jul 30 23:06:45 parastore01 kernel: [ 347.680187] drbd0: resync bitmap: bits=11943953 words=373250
Jul 30 23:06:45 parastore01 kernel: [ 347.680197] drbd0: size = 45 GB (47775812 KB)
Jul 30 23:06:45 parastore01 kernel: [ 347.735129] drbd0: reading of bitmap took 5 jiffies
Jul 30 23:06:45 parastore01 kernel: [ 347.737749] drbd0: recounting of set bits took additional 1 jiffies
Jul 30 23:06:45 parastore01 kernel: [ 347.737754] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jul 30 23:06:45 parastore01 kernel: [ 347.737885] drbd0: Marked additional 984 MB as out-of-sync based on AL.
Jul 30 23:06:48 parastore01 kernel: [ 350.775353] drbd0: disk( Attaching -> UpToDate )
Jul 30 23:06:48 parastore01 kernel: [ 350.775421] drbd0: Writing meta data super block now.
...
Jul 30 23:06:48 parastore01 kernel: [ 351.048102] drbd0: conn( StandAlone -> Unconnected )
Jul 30 23:06:48 parastore01 kernel: [ 351.048266] drbd0: Starting receiver thread (from drbd0_worker [5116])
Jul 30 23:06:48 parastore01 kernel: [ 351.049392] drbd0: receiver (re)started
Jul 30 23:06:48 parastore01 kernel: [ 351.049404] drbd0: conn( Unconnected -> WFConnection )
...
Jul 30 23:08:51 parastore01 kernel: [ 474.018937] drbd0: Handshake successful: DRBD Network Protocol version 86
Jul 30 23:08:51 parastore01 kernel: [ 474.019664] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Jul 30 23:08:51 parastore01 kernel: [ 474.019676] drbd0: conn( WFConnection -> WFReportParams )
Jul 30 23:08:51 parastore01 kernel: [ 474.019683] drbd0: Starting asender thread (from drbd0_receiver [5135])
...
Jul 30 23:08:51 parastore01 kernel: [ 474.091144] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jul 30 23:08:51 parastore01 kernel: [ 474.091159] drbd0: Writing meta data super block now.
...
Jul 30 23:09:10 parastore01 kernel: [ 492.620587] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 5
Jul 30 23:09:16 parastore01 kernel: [ 498.617884] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 4
Jul 30 23:09:22 parastore01 kernel: [ 504.615194] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 3
Jul 30 23:09:28 parastore01 kernel: [ 510.612504] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 2
Jul 30 23:09:34 parastore01 kernel: [ 516.609814] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 1
Jul 30 23:09:40 parastore01 kernel: [ 522.607129] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
Jul 30 23:09:40 parastore01 kernel: [ 522.607144] drbd0: short sent ReportBitMap size=4096 sent=3216
Jul 30 23:09:40 parastore01 kernel: [ 522.607241] drbd0: error receiving ReportBitMap, l: 0!
Jul 30 23:09:40 parastore01 kernel: [ 522.607741] drbd0: role( Secondary -> Primary )
Jul 30 23:09:40 parastore01 kernel: [ 522.607754] drbd0: Creating new current UUID
Jul 30 23:09:40 parastore01 kernel: [ 522.607772] drbd0: Writing meta data super block now.
Jul 30 23:09:40 parastore01 kernel: [ 522.607877] drbd0: asender terminated
Jul 30 23:09:40 parastore01 kernel: [ 522.607882] drbd0: Terminating asender thread
Jul 30 23:09:40 parastore01 kernel: [ 522.608699] drbd0: tl_clear()
Jul 30 23:09:40 parastore01 kernel: [ 522.608704] drbd0: Connection closed
Jul 30 23:09:40 parastore01 kernel: [ 522.608711] drbd0: conn( Timeout -> Unconnected )
Jul 30 23:09:40 parastore01 kernel: [ 522.608715] drbd0: receiver terminated
Jul 30 23:09:40 parastore01 kernel: [ 522.608717] drbd0: receiver (re)started
Jul 30 23:09:40 parastore01 kernel: [ 522.608721] drbd0: conn( Unconnected -> WFConnection )
Jul 30 23:09:40 parastore01 kernel: [ 522.907015] drbd0: Handshake successful: DRBD Network Protocol version 86
Jul 30 23:09:40 parastore01 kernel: [ 522.907790] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Jul 30 23:09:40 parastore01 kernel: [ 522.907803] drbd0: conn( WFConnection -> WFReportParams )
Jul 30 23:09:40 parastore01 kernel: [ 522.907808] drbd0: Starting asender thread (from drbd0_receiver [5135])
Jul 30 23:09:40 parastore01 kernel: [ 522.948382] drbd0: meta connection shut down by peer.
Jul 30 23:09:40 parastore01 kernel: [ 522.948454] drbd0: conn( WFReportParams -> NetworkFailure )
Jul 30 23:09:40 parastore01 kernel: [ 522.948464] drbd0: asender terminated
Jul 30 23:09:40 parastore01 kernel: [ 522.948467] drbd0: Terminating asender thread
Jul 30 23:09:40 parastore01 kernel: [ 522.949311] drbd0: tl_clear()
Jul 30 23:09:40 parastore01 kernel: [ 522.949317] drbd0: Connection closed
Jul 30 23:09:40 parastore01 kernel: [ 522.949324] drbd0: conn( NetworkFailure -> Unconnected )
Jul 30 23:09:40 parastore01 kernel: [ 522.949328] drbd0: receiver terminated
Jul 30 23:09:40 parastore01 kernel: [ 522.949331] drbd0: receiver (re)started
Jul 30 23:09:40 parastore01 kernel: [ 522.949334] drbd0: conn( Unconnected -> WFConnection )

When I reboot the node that came up second during the first boot, the device drbd0 goes to Primary/Primary and starts syncing. Could I tweak the timeout values to avoid this behaviour?

Another strange thing: I have another device (drbd1) which is Secondary/Secondary after the first boot, and Secondary/Primary after I reboot the second node:

version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by phil at mescal, 2008-02-12 11:56:43
 0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---
    ns:1007616 nr:0 dw:0 dr:1007616 al:0 bm:390 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:62781 misses:195 starving:0 dirty:0 changed:195
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0

How can I avoid these problems?

Thanks,
Joysn
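
For the timeout question, it may help to look at how far apart the "sock_sendmsg time expired" messages are. The following is only a rough sketch in Python, not anything from DRBD itself: the log text is copied from the syslog excerpt above, and the helper names (stamps, intervals) are made up for illustration. Gaps of about 6 seconds would be consistent with the "timeout" net option being left at its default (60, in tenths of a second, as far as I know); that is an assumption, since the posted drbd.conf does not set it.

```python
import re

# Lines copied from the syslog excerpt in this post; the number in
# brackets is the kernel timestamp in seconds since boot.
LOG = """\
[ 492.620587] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 5
[ 498.617884] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 4
[ 504.615194] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 3
[ 510.612504] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 2
[ 516.609814] drbd0: [drbd0_receiver/5135] sock_sendmsg time expired, ko = 1
[ 522.607129] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
"""

# Matches "[ 492.620587]" but not "[drbd0_receiver/5135]".
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def stamps(text):
    """Return all kernel timestamps found in the text, as floats."""
    return [float(m.group(1)) for m in STAMP.finditer(text)]


def intervals(values):
    """Return the gaps between consecutive timestamps, rounded to ms."""
    return [round(b - a, 3) for a, b in zip(values, values[1:])]


times = stamps(LOG)
gaps = intervals(times)
print("gaps between events (s):", gaps)
print("first expiry to Timeout (s):", round(times[-1] - times[0], 3))
```

If the gaps really track the send timeout, then the ko-count value from the config would bound how many such intervals DRBD waits before it declares the peer dead and drops to Timeout, so both settings are candidates to look at together rather than in isolation.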