Hello everybody,<br><br>we have a problem in a cluster and we haven't been able to find the cause. The cluster is set up with heartbeat 3.0.2 and drbd 8.3.7 on Debian Squeeze and kernel 2.6.32. This works fine when it's up and running, the problem comes when restarting the secondary node. The situation goes like this:<br>
<br>1. Primary and secondary servers working correctly.<br>2. Reboot secondary server. /proc/drbd in the primary looks like this when the secondary goes down:<br>version: 8.3.7 (api:88/proto:86-91)<br>GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@srv1, 2012-11-19 14:18:13<br>
0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----<br> ns:104865056 nr:0 dw:4542118 dr:314589852 al:29 bm:19201 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:716<br><br>3. When the secondary comes back up the connection between the two nodes is closed, leaving the following states:<br>
Primary:<br>version: 8.3.7 (api:88/proto:86-91)<br>GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@srv1, 2012-11-19 14:18:13<br> 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----<br> ns:104865056 nr:0 dw:4542118 dr:314589852 al:29 bm:19201 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:716<br>
<br> Secondary:<br>version: 8.3.7 (api:88/proto:86-91)<br>GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@srv2, 2012-11-19 12:36:39<br> 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----<br>
ns:0 nr:0 dw:0 dr:0 al:0 bm:12 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:118784<br> <br>This can be seen in the primary's syslog: <br>Dec 18 15:08:49 srv1 kernel: [2413255.584781] block drbd0: Handshake successful: Agreed network protocol version 91<br>
Dec 18 15:08:49 srv1 kernel: [2413255.584791] block drbd0: conn( WFConnection -> WFReportParams )<br>Dec 18 15:08:49 srv1 kernel: [2413255.584896] block drbd0: Starting asender thread (from drbd0_receiver [12413])<br>Dec 18 15:08:49 srv1 kernel: [2413255.585397] block drbd0: data-integrity-alg: <not-used><br>
Dec 18 15:08:49 srv1 kernel: [2413255.585415] block drbd0: drbd_sync_handshake:<br>Dec 18 15:08:49 srv1 kernel: [2413255.585419] block drbd0: self 3D01B3FD7AA76821:E8819499AD4A72E7:74DBC3543CD6C995:26BE9CEE2E53B0D7 bits:179 flags:0<br>
Dec 18 15:08:49 srv1 kernel: [2413255.585422] block drbd0: peer 3D01B3FD7AA76820:E8819499AD4A72E7:74DBC3543CD6C995:26BE9CEE2E53B0D7 bits:29696 flags:2<br>Dec 18 15:08:49 srv1 kernel: [2413255.585426] block drbd0: uuid_compare()=-1 by rule 40<br>
Dec 18 15:08:49 srv1 kernel: [2413255.585428] block drbd0: I shall become SyncTarget, but I am primary!<br>Dec 18 15:08:49 srv1 kernel: [2413255.604751] block drbd0: conn( WFReportParams -> Disconnecting )<br>Dec 18 15:08:49 srv1 kernel: [2413255.604773] block drbd0: error receiving ReportState, l: 4!<br>
Dec 18 15:08:49 srv1 kernel: [2413255.624811] block drbd0: asender terminated<br>Dec 18 15:08:49 srv1 kernel: [2413255.624820] block drbd0: Terminating asender thread<br>Dec 18 15:08:49 srv1 kernel: [2413255.624955] block drbd0: Connection closed<br>
Dec 18 15:08:49 srv1 kernel: [2413255.624962] block drbd0: conn( Disconnecting -> StandAlone )<br>Dec 18 15:08:49 srv1 kernel: [2413255.625010] block drbd0: receiver terminated<br>Dec 18 15:08:49 srv1 kernel: [2413255.625012] block drbd0: Terminating receiver thread<br>
<br>And this in the secondary's syslog (its clock is a few seconds behind the primary, the two blocks happen at the same time from the handshake on):<br>Dec 18 15:08:32 srv2 kernel: [ 262.821407] drbd: initialized. Version: 8.3.7 (api:88/proto:86-91)<br>
Dec 18 15:08:32 srv2 kernel: [ 262.821410] drbd: GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@srv2, 2012-11-19 12:36:39<br>Dec 18 15:08:32 srv2 kernel: [ 262.821413] drbd: registered as block device major 147<br>
Dec 18 15:08:32 srv2 kernel: [ 262.821415] drbd: minor_table @ 0xffff88042c4fc000<br>Dec 18 15:08:32 srv2 kernel: [ 262.825336] block drbd0: Starting worker thread (from cqueue [2192])<br>Dec 18 15:08:32 srv2 kernel: [ 262.825501] block drbd0: disk( Diskless -> Attaching )<br>
Dec 18 15:08:32 srv2 kernel: [ 262.829125] block drbd0: Found 4 transactions (52 active extents) in activity log.<br>Dec 18 15:08:32 srv2 kernel: [ 262.829128] block drbd0: Method to ensure write ordering: barrier<br>Dec 18 15:08:32 srv2 kernel: [ 262.829132] block drbd0: max_segment_size ( = BIO size ) = 32768<br>
Dec 18 15:08:32 srv2 kernel: [ 262.829135] block drbd0: drbd_bm_resize called with capacity == 209706680<br>Dec 18 15:08:32 srv2 kernel: [ 262.830332] block drbd0: resync bitmap: bits=26213335 words=409584<br>Dec 18 15:08:32 srv2 kernel: [ 262.830335] block drbd0: size = 100 GB (104853340 KB)<br>
Dec 18 15:08:32 srv2 kernel: [ 262.931529] block drbd0: recounting of set bits took additional 0 jiffies<br>Dec 18 15:08:32 srv2 kernel: [ 262.931534] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.<br>
Dec 18 15:08:32 srv2 kernel: [ 262.931547] block drbd0: Marked additional 116 MB as out-of-sync based on AL.<br>Dec 18 15:08:32 srv2 kernel: [ 262.932626] block drbd0: disk( Attaching -> UpToDate )<br>Dec 18 15:08:32 srv2 kernel: [ 262.935598] block drbd0: conn( StandAlone -> Unconnected )<br>
Dec 18 15:08:32 srv2 kernel: [ 262.935631] block drbd0: Starting receiver thread (from drbd0_worker [2194])<br>Dec 18 15:08:32 srv2 kernel: [ 262.935705] block drbd0: receiver (re)started<br>Dec 18 15:08:32 srv2 kernel: [ 262.935711] block drbd0: conn( Unconnected -> WFConnection )<br>
Dec 18 15:08:32 srv2 kernel: [ 263.044870] block drbd0: Handshake successful: Agreed network protocol version 91<br>Dec 18 15:08:32 srv2 kernel: [ 263.044880] block drbd0: conn( WFConnection -> WFReportParams )<br>Dec 18 15:08:32 srv2 kernel: [ 263.044956] block drbd0: Starting asender thread (from drbd0_receiver [2204])<br>
Dec 18 15:08:32 srv2 kernel: [ 263.045139] block drbd0: data-integrity-alg: <not-used><br>Dec 18 15:08:32 srv2 kernel: [ 263.045152] block drbd0: drbd_sync_handshake:<br>Dec 18 15:08:32 srv2 kernel: [ 263.045156] block drbd0: self 3D01B3FD7AA76820:E8819499AD4A72E7:74DBC3543CD6C995:26BE9CEE2E53B0D7 bits:29696 flags:0<br>
Dec 18 15:08:32 srv2 kernel: [ 263.045159] block drbd0: peer 3D01B3FD7AA76821:E8819499AD4A72E7:74DBC3543CD6C995:26BE9CEE2E53B0D7 bits:179 flags:0<br>Dec 18 15:08:32 srv2 kernel: [ 263.045162] block drbd0: uuid_compare()=1 by rule 40<br>
Dec 18 15:08:32 srv2 kernel: [ 263.045167] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )<br>Dec 18 15:08:32 srv2 kernel: [ 263.084931] block drbd0: meta connection shut down by peer.<br>
Dec 18 15:08:32 srv2 kernel: [ 263.084966] block drbd0: sock was shut down by peer<br>Dec 18 15:08:32 srv2 kernel: [ 263.084971] block drbd0: peer( Primary -> Unknown ) conn( WFBitMapS -> BrokenPipe ) pdsk( UpToDate -> DUnknown )<br>
Dec 18 15:08:32 srv2 kernel: [ 263.084975] block drbd0: short read expecting header on sock: r=0<br>Dec 18 15:08:32 srv2 kernel: [ 263.084986] block drbd0: sock_sendmsg returned -32<br>Dec 18 15:08:32 srv2 kernel: [ 263.084989] block drbd0: short sent ReportBitMap size=4096 sent=2220<br>
Dec 18 15:08:32 srv2 kernel: [ 263.104570] block drbd0: asender terminated<br>Dec 18 15:08:32 srv2 kernel: [ 263.104576] block drbd0: Terminating asender thread<br>Dec 18 15:08:32 srv2 kernel: [ 263.104693] block drbd0: Connection closed<br>
Dec 18 15:08:32 srv2 kernel: [ 263.104698] block drbd0: conn( BrokenPipe -> Unconnected )<br>Dec 18 15:08:32 srv2 kernel: [ 263.104703] block drbd0: receiver terminated<br>Dec 18 15:08:32 srv2 kernel: [ 263.104706] block drbd0: Restarting receiver thread<br>
Dec 18 15:08:32 srv2 kernel: [ 263.104709] block drbd0: receiver (re)started<br>Dec 18 15:08:32 srv2 kernel: [ 263.104714] block drbd0: conn( Unconnected -> WFConnection )<br> <br>4. We have only been able to restore de cluster by doing:<br>
secondary: drbdadm disconnect r0<br>secondary: drbdadm invalidate r0<br>secondary: drbdadm connect r0<br>primary: drbdadm connect r0<br><br>After which the primary goes into SyncSource, the secondary into SyncTarget, and the deviced is sync'd.<br>
<br>Any idea on what could be causing it or how to fix it?<br><br>Thanks!<br><br><br>Some configuration files:<br><br>global_common.conf:<br>global {<br> usage-count yes;<br>}<br>common {<br> protocol C;<br>
handlers {<br> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";<br> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";<br>
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";<br> }<br> syncer {<br> rate 900M;<br>
}<br>}<br><br>mysql.res:<br>resource r0 {<br> on srv1 {<br> device /dev/drbd0;<br> disk /dev/sdc1;<br> address <a href="http://192.168.0.1:7789">192.168.0.1:7789</a>;<br> meta-disk internal;<br>
}<br> on srv2 {<br> device /dev/drbd0;<br> disk /dev/sdc1;<br> address <a href="http://192.168.0.2:7789">192.168.0.2:7789</a>;<br> meta-disk internal;<br> }<br>}<br><br><a href="http://ha.cf">ha.cf</a>:<br>
logfacility daemon<br>keepalive 1<br>deadtime 10<br>warntime 5<br>use_logd yes<br>bcast eth2<br>auto_failback off<br>node evl1900669 evl1900670<br><br>haresources:<br>srv1 \<br> IPaddr2::<a href="http://172.16.218.244/28/eth3:VI1/172.16.218.255">172.16.218.244/28/eth3:VI1/172.16.218.255</a> \<br>
IPaddr2::<a href="http://81.93.218.244/28/eth0:VI0/81.93.218.255">81.93.218.244/28/eth0:VI0/81.93.218.255</a> \<br> drbddisk::r0 \<br> Filesystem::/dev/drbd0::/data::xfs::noatime,nodiratime,nobarrier,logbufs=8,logbsize=32k \<br>
mysql-ha<br><br><br><br>