[DRBD-user] Cluster breaks when restarting secondary

Alba Ferrer alba at caliu.cat
Wed Dec 19 09:22:03 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello everybody,

we have a problem in a cluster and we haven't been able to find the cause.
The cluster is set up with heartbeat 3.0.2 and drbd 8.3.7 on Debian Squeeze
and kernel 2.6.32. This works fine when it's up and running, the problem
comes when restarting the secondary node. The situation goes like this:

1. Primary and secondary servers working correctly.
2. Reboot secondary server. /proc/drbd in the primary looks like this when
the secondary goes down:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at srv1,
2012-11-19 14:18:13
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:104865056 nr:0 dw:4542118 dr:314589852 al:29 bm:19201 lo:0 pe:0 ua:0
ap:0 ep:1 wo:b oos:716

3. When the secondary comes back up the connection between the two nodes is
closed, leaving the following states:
    Primary:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at srv1,
2012-11-19 14:18:13
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
    ns:104865056 nr:0 dw:4542118 dr:314589852 al:29 bm:19201 lo:0 pe:0 ua:0
ap:0 ep:1 wo:b oos:716

    Secondary:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at srv2,
2012-11-19 12:36:39
 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:12 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:118784

This can be seen in the primary's syslog:
Dec 18 15:08:49 srv1 kernel: [2413255.584781] block drbd0: Handshake
successful: Agreed network protocol version 91
Dec 18 15:08:49 srv1 kernel: [2413255.584791] block drbd0: conn(
WFConnection -> WFReportParams )
Dec 18 15:08:49 srv1 kernel: [2413255.584896] block drbd0: Starting asender
thread (from drbd0_receiver [12413])
Dec 18 15:08:49 srv1 kernel: [2413255.585397] block drbd0:
data-integrity-alg: <not-used>
Dec 18 15:08:49 srv1 kernel: [2413255.585415] block drbd0:
drbd_sync_handshake:
Dec 18 15:08:49 srv1 kernel: [2413255.585419] block drbd0: self
3D01B3FD7AA76821:E8819499AD4A72E7:74DBC3543CD6C995:26BE9CEE2E53B0D7
bits:179 flags:0
Dec 18 15:08:49 srv1 kernel: [2413255.585422] block drbd0: peer
3D01B3FD7AA76820:E8819499AD4A72E7:74DBC3543CD6C995:26BE9CEE2E53B0D7
bits:29696 flags:2
Dec 18 15:08:49 srv1 kernel: [2413255.585426] block drbd0:
uuid_compare()=-1 by rule 40
Dec 18 15:08:49 srv1 kernel: [2413255.585428] block drbd0: I shall become
SyncTarget, but I am primary!
Dec 18 15:08:49 srv1 kernel: [2413255.604751] block drbd0: conn(
WFReportParams -> Disconnecting )
Dec 18 15:08:49 srv1 kernel: [2413255.604773] block drbd0: error receiving
ReportState, l: 4!
Dec 18 15:08:49 srv1 kernel: [2413255.624811] block drbd0: asender
terminated
Dec 18 15:08:49 srv1 kernel: [2413255.624820] block drbd0: Terminating
asender thread
Dec 18 15:08:49 srv1 kernel: [2413255.624955] block drbd0: Connection closed
Dec 18 15:08:49 srv1 kernel: [2413255.624962] block drbd0: conn(
Disconnecting -> StandAlone )
Dec 18 15:08:49 srv1 kernel: [2413255.625010] block drbd0: receiver
terminated
Dec 18 15:08:49 srv1 kernel: [2413255.625012] block drbd0: Terminating
receiver thread

And this in the secondary's syslog (its clock is a few seconds behind the
primary, the two blocks happen at the same time from the handshake on):
Dec 18 15:08:32 srv2 kernel: [  262.821407] drbd: initialized. Version:
8.3.7 (api:88/proto:86-91)
Dec 18 15:08:32 srv2 kernel: [  262.821410] drbd: GIT-hash:
ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at srv2, 2012-11-19
12:36:39
Dec 18 15:08:32 srv2 kernel: [  262.821413] drbd: registered as block
device major 147
Dec 18 15:08:32 srv2 kernel: [  262.821415] drbd: minor_table @
0xffff88042c4fc000
Dec 18 15:08:32 srv2 kernel: [  262.825336] block drbd0: Starting worker
thread (from cqueue [2192])
Dec 18 15:08:32 srv2 kernel: [  262.825501] block drbd0: disk( Diskless ->
Attaching )
Dec 18 15:08:32 srv2 kernel: [  262.829125] block drbd0: Found 4
transactions (52 active extents) in activity log.
Dec 18 15:08:32 srv2 kernel: [  262.829128] block drbd0: Method to ensure
write ordering: barrier
Dec 18 15:08:32 srv2 kernel: [  262.829132] block drbd0: max_segment_size (
= BIO size ) = 32768
Dec 18 15:08:32 srv2 kernel: [  262.829135] block drbd0: drbd_bm_resize
called with capacity == 209706680
Dec 18 15:08:32 srv2 kernel: [  262.830332] block drbd0: resync bitmap:
bits=26213335 words=409584
Dec 18 15:08:32 srv2 kernel: [  262.830335] block drbd0: size = 100 GB
(104853340 KB)
Dec 18 15:08:32 srv2 kernel: [  262.931529] block drbd0: recounting of set
bits took additional 0 jiffies
Dec 18 15:08:32 srv2 kernel: [  262.931534] block drbd0: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
Dec 18 15:08:32 srv2 kernel: [  262.931547] block drbd0: Marked additional
116 MB as out-of-sync based on AL.
Dec 18 15:08:32 srv2 kernel: [  262.932626] block drbd0: disk( Attaching ->
UpToDate )
Dec 18 15:08:32 srv2 kernel: [  262.935598] block drbd0: conn( StandAlone
-> Unconnected )
Dec 18 15:08:32 srv2 kernel: [  262.935631] block drbd0: Starting receiver
thread (from drbd0_worker [2194])
Dec 18 15:08:32 srv2 kernel: [  262.935705] block drbd0: receiver
(re)started
Dec 18 15:08:32 srv2 kernel: [  262.935711] block drbd0: conn( Unconnected
-> WFConnection )
Dec 18 15:08:32 srv2 kernel: [  263.044870] block drbd0: Handshake
successful: Agreed network protocol version 91
Dec 18 15:08:32 srv2 kernel: [  263.044880] block drbd0: conn( WFConnection
-> WFReportParams )
Dec 18 15:08:32 srv2 kernel: [  263.044956] block drbd0: Starting asender
thread (from drbd0_receiver [2204])
Dec 18 15:08:32 srv2 kernel: [  263.045139] block drbd0:
data-integrity-alg: <not-used>
Dec 18 15:08:32 srv2 kernel: [  263.045152] block drbd0:
drbd_sync_handshake:
Dec 18 15:08:32 srv2 kernel: [  263.045156] block drbd0: self
3D01B3FD7AA76820:E8819499AD4A72E7:74DBC3543CD6C995:26BE9CEE2E53B0D7
bits:29696 flags:0
Dec 18 15:08:32 srv2 kernel: [  263.045159] block drbd0: peer
3D01B3FD7AA76821:E8819499AD4A72E7:74DBC3543CD6C995:26BE9CEE2E53B0D7
bits:179 flags:0
Dec 18 15:08:32 srv2 kernel: [  263.045162] block drbd0: uuid_compare()=1
by rule 40
Dec 18 15:08:32 srv2 kernel: [  263.045167] block drbd0: peer( Unknown ->
Primary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Dec 18 15:08:32 srv2 kernel: [  263.084931] block drbd0: meta connection
shut down by peer.
Dec 18 15:08:32 srv2 kernel: [  263.084966] block drbd0: sock was shut down
by peer
Dec 18 15:08:32 srv2 kernel: [  263.084971] block drbd0: peer( Primary ->
Unknown ) conn( WFBitMapS -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Dec 18 15:08:32 srv2 kernel: [  263.084975] block drbd0: short read
expecting header on sock: r=0
Dec 18 15:08:32 srv2 kernel: [  263.084986] block drbd0: sock_sendmsg
returned -32
Dec 18 15:08:32 srv2 kernel: [  263.084989] block drbd0: short sent
ReportBitMap size=4096 sent=2220
Dec 18 15:08:32 srv2 kernel: [  263.104570] block drbd0: asender terminated
Dec 18 15:08:32 srv2 kernel: [  263.104576] block drbd0: Terminating
asender thread
Dec 18 15:08:32 srv2 kernel: [  263.104693] block drbd0: Connection closed
Dec 18 15:08:32 srv2 kernel: [  263.104698] block drbd0: conn( BrokenPipe
-> Unconnected )
Dec 18 15:08:32 srv2 kernel: [  263.104703] block drbd0: receiver terminated
Dec 18 15:08:32 srv2 kernel: [  263.104706] block drbd0: Restarting
receiver thread
Dec 18 15:08:32 srv2 kernel: [  263.104709] block drbd0: receiver
(re)started
Dec 18 15:08:32 srv2 kernel: [  263.104714] block drbd0: conn( Unconnected
-> WFConnection )

4. We have only been able to restore de cluster by doing:
secondary: drbdadm disconnect r0
secondary: drbdadm invalidate r0
secondary: drbdadm connect r0
primary: drbdadm connect r0

After which the primary goes into SyncSource, the secondary into
SyncTarget, and the deviced is sync'd.

Any idea on what could be causing it or how to fix it?

Thanks!


Some configuration files:

global_common.conf:
global {
        usage-count yes;
}
common {
        protocol C;
        handlers {
                pri-on-incon-degr
"/usr/lib/drbd/notify-pri-on-incon-degr.sh;
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
reboot -f";
                pri-lost-after-sb
"/usr/lib/drbd/notify-pri-lost-after-sb.sh;
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh;
/usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ;
halt -f";
        }
        syncer {
                rate 900M;
        }
}

mysql.res:
resource r0 {
  on srv1 {
        device /dev/drbd0;
        disk /dev/sdc1;
        address 192.168.0.1:7789;
        meta-disk internal;
  }
  on srv2 {
        device /dev/drbd0;
        disk /dev/sdc1;
        address 192.168.0.2:7789;
        meta-disk internal;
  }
}

ha.cf:
logfacility daemon
keepalive 1
deadtime 10
warntime 5
use_logd yes
bcast eth2
auto_failback off
node evl1900669 evl1900670

haresources:
srv1 \
        IPaddr2::172.16.218.244/28/eth3:VI1/172.16.218.255 \
        IPaddr2::81.93.218.244/28/eth0:VI0/81.93.218.255 \
        drbddisk::r0 \

Filesystem::/dev/drbd0::/data::xfs::noatime,nodiratime,nobarrier,logbufs=8,logbsize=32k
\
        mysql-ha
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20121219/0cfe41f4/attachment.htm>


More information about the drbd-user mailing list