<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000099">

<tt>Hi all<br>

<br>

I hope someone will be able to help with a rather stubbornly persistent

and yet random problem which has been bugging us for over two months. </tt><tt>Every

night (backups time) six replica devices </tt><tt>in all </tt><tt>are

</tt><tt>asked to disconnect </tt><tt>with:<br>

<br>

&nbsp;drbdadm disconnect replica_resource<br>

<br>

</tt><tt>On just one server</tt><tt> at times </tt><tt>this</tt><tt> </tt><tt>causes</tt><tt>

</tt><tt>process</tt><tt></tt><br>

<tt><br>

&nbsp;drbdsetup /dev/drbd1 disconnect<br>

<br>

to hang in such a way that it cannot be killed. But the device is

disconnected and</tt><tt> replication</tt><tt> is stopped.</tt><tt>

This may not happen for six nights and then it will hang just a few

hours after the system was rebooted.<br>

<br>

Misbehaving server is one from two pairs of servers which run the same

version of DRBD (drbd-0.7.24-1) and kernel (Redhat EL4 2.6.9-55.ELsmp),

although they are of different architectures (i368 and x86_64). Other

difference between the two pairs is that one uses hardware RAID-ed

devices and the one in question uses native metadisk (/dev/md?)

devices. Another piece of "exotic" configuration is that IP addresses

DRBD devices are bound to are on OpenVPN's tun0 interfaces.<br>

<br>

I have been at pains to find out what is the difference between the two

servers in this pair, which causes this problem. I reconciled DRBD

config files and OS RPMs</tt><tt> and still did not get rid of this

problem</tt><tt>.<br>

<br>

Being guided by

<a class="moz-txt-link-freetext" href="http://lists.linbit.com/pipermail/drbd-user/2006-April/004832.html">http://lists.linbit.com/pipermail/drbd-user/2006-April/004832.html</a> I

gathered following evidence:<br>

<br>

# cat /proc/drbd<br>

<br>

version: 0.7.24 (api:79/proto:74)<br>

SVN Revision: 2875 build by buildsystem@linbit, 2007-06-04 14:03:37<br>

&nbsp;0: cs:Connected st:Primary/Secondary ld:Consistent<br>

&nbsp;&nbsp;&nbsp; ns:32298620 nr:0 dw:31831224 dr:76573961 al:1370841 bm:10302 lo:3

pe:4 ua:0 ap:2<br>

&nbsp;1: cs:BrokenPipe st:Secondary/Unknown ld:Consistent<br>

&nbsp;&nbsp;&nbsp; ns:0 nr:5038052 dw:35515816 dr:127195916 al:651 bm:5783 lo:3 pe:0

ua:3 ap:0<br>

<br>

# ps -ef|grep drbd|grep -v grep<br>

<br>

root&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2057&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp; 0 Aug19 ?&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 00:01:10 [drbd1_receiver]<br>

root&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4495&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp; 0 Aug15 ?&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 00:02:05 [drbd0_receiver]<br>

root&nbsp;&nbsp;&nbsp;&nbsp; 18176&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp; 0 00:46 ?&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 00:00:15 [drbd0_asender]<br>

root&nbsp;&nbsp;&nbsp;&nbsp; 20219&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp; 0 01:18 ?&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 00:00:00 drbdsetup /dev/drbd1

disconnect<br>

root&nbsp;&nbsp;&nbsp;&nbsp; 26539&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp; 0 00:35 ?&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 00:00:07 [drbd0_worker]<br>

</tt><tt><br>

# ps axo pid,wchan=WIDE-WCHAN-COLUMN,cmd -o comm | grep -w 2057 | grep

-v grep<br>

<br>

&nbsp;2057 drbd_wait_ee&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [drbd1_receiver] drbd1_receiver<br>

<br>

</tt><tt># ps axo pid,wchan=WIDE-WCHAN-COLUMN,cmd -o comm | grep 20219</tt><tt>

| grep -v grep</tt><br>

<tt><br>

20219 drbd_thread_stop&nbsp; drbdsetup /dev/d drbdsetup<br>

<br>

</tt><tt># cat /proc/20219/status<br>

<br>

Name:&nbsp;&nbsp; drbdsetup<br>

State:&nbsp; D (disk sleep)<br>

SleepAVG:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 68%<br>

Tgid:&nbsp;&nbsp; 20219<br>

Pid:&nbsp;&nbsp;&nbsp; 20219<br>

PPid:&nbsp;&nbsp; 1<br>

TracerPid:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0<br>

Uid:&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0<br>

Gid:&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0<br>

FDSize: 32<br>

Groups: 0 1 2 3 4 6 10<br>

VmSize:&nbsp;&nbsp;&nbsp;&nbsp; 1788 kB<br>

VmLck:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 kB<br>

VmRSS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 196 kB<br>

VmData:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 12 kB<br>

VmStk:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 464 kB<br>

VmExe:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 19 kB<br>

VmLib:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1265 kB<br>

StaBrk: 0804e000 kB<br>

Brk:&nbsp;&nbsp;&nbsp; 09542000 kB<br>

StaStk: bff8cf60 kB<br>

ExecLim:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ffffffff<br>

Threads:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br>

SigPnd: 0000000000000000<br>

ShdPnd: 0000000000000000<br>

SigBlk: 0000000000000000<br>

SigIgn: 0000000000000000<br>

SigCgt: 0000000000000000<br>

CapInh: 0000000000000000<br>

CapPrm: 00000000fffffeff<br>

CapEff: 00000000fffffeff<br>

<br>

# cat /proc/20219/wchan<br>

<br>

_drbd_thread_stop<br>

<br>

# var/log/messages<br>

<br>

Aug 20 00:35:11 butterfly1 kernel: drbd0: sock was shut down by peer<br>

Aug 20 00:35:11 butterfly1 kernel: drbd0: drbd0_receiver [4495]: cstate

Connected --&gt; BrokenPipe<br>

Aug 20 00:35:11 butterfly1 kernel: drbd0: short read expecting header

on sock: r=0<br>

Aug 20 00:35:11 butterfly1 kernel: drbd0: worker terminated<br>

Aug 20 00:35:11 butterfly1 kernel: drbd0: meta connection shut down by

peer.<br>

Aug 20 00:35:11 butterfly1 kernel: drbd0: asender terminated<br>

Aug 20 00:35:11 butterfly1 kernel: drbd0: drbd0_receiver [4495]: cstate

BrokenPipe --&gt; Unconnected<br>

Aug 20 00:35:11 butterfly1 kernel: drbd0: Connection lost.<br>

Aug 20 00:35:11 butterfly1 kernel: drbd0: drbd0_receiver [4495]: cstate

Unconnected --&gt; WFConnection<br>

Aug 20 00:46:43 butterfly1 kernel: drbd0: drbd0_receiver [4495]: cstate

WFConnection --&gt; WFReportParams<br>

Aug 20 00:46:43 butterfly1 kernel: drbd0: Handshake successful: DRBD

Network Protocol version 74<br>

Aug 20 00:46:43 butterfly1 kernel: drbd0: Connection established.<br>

Aug 20 00:46:43 butterfly1 kernel: drbd0: I am(P):

1:00000006:0000000b:00000267:00000031:10<br>

Aug 20 00:46:43 butterfly1 kernel: drbd0: Peer(S):

1:00000006:0000000b:00000266:00000032:00<br>

Aug 20 00:46:43 butterfly1 kernel: drbd0: drbd0_receiver [4495]: cstate

WFReportParams --&gt; WFBitMapS<br>

Aug 20 00:46:43 butterfly1 kernel: drbd0: Primary/Unknown --&gt;

Primary/Secondary<br>

Aug 20 00:46:44 butterfly1 kernel: drbd0: drbd0_receiver [4495]: cstate

WFBitMapS --&gt; SyncSource<br>

Aug 20 00:46:44 butterfly1 kernel: drbd0: Resync started as SyncSource

(need to sync 15528 KB [3882 bits set]).<br>

Aug 20 00:46:54 butterfly1 kernel: drbd0: Resync done (total 10 sec;

paused 0 sec; 1552 K/sec)<br>

Aug 20 00:46:54 butterfly1 kernel: drbd0: drbd0_worker [26539]: cstate

SyncSource --&gt; Connected<br>

<br>

Aug 20 01:18:30 butterfly1 kernel: drbd1: drbdsetup [20219]: cstate

Connected --&gt; Unconnected<br>

Aug 20 01:18:30 butterfly1 kernel: drbd1: drbd1_receiver [2057]: cstate

Unconnected --&gt; BrokenPipe<br>

Aug 20 01:18:30 butterfly1 kernel: drbd1: short read expecting header

on sock: r=-512<br>

Aug 20 01:18:30 butterfly1 kernel: drbd1: worker terminated<br>

Aug 20 01:18:30 butterfly1 kernel: drbd1: asender terminated<br>

<br>

#<br>

<br>

<br>

First block in the above is for the replica device on the remote server

being disconnected, backed up and reconnected. The second block is ALL

there is in messages (and dmesg) for this stuck "disconnect" command.<br>

<br>

There is one other curious thing I noticed. When DRBD is restarted

(reboot) and both devices are reconnected, the device which was </tt><tt>cs:BrokenPipe

will receive a fair bit of re-sync data, as one would expect it after

not being connected for a few busy hours. But the odd thing is that

when the primary device is reconnected, a similarly large amount of

data is re-synced, as if this device was disconnected just as long as

the replica, despite being off for just 2-3 minutes it takes to reboot.

Example:<br>

<br>

</tt>

<pre class="wiki"> 0: cs:WFConnection st:Primary/Unknown ld:Consistent

    ns:0 nr:0 dw:3600 dr:104253 al:157 bm:284 lo:0 pe:0 ua:0 ap:0

 1: cs:SyncTarget st:Secondary/Primary ld:Inconsistent

    ns:0 nr:41584 dw:41580 dr:0 al:0 bm:27 lo:266 pe:348 ua:266 ap:0

        [=&gt;..................] sync'ed:  5.1% (848796/889276)K

 0: cs:SyncSource st:Primary/Secondary ld:Consistent

    ns:6380 nr:0 dw:11088 dr:211713 al:492 bm:616 lo:1 pe:57 ua:457 ap:0

        [&gt;...................] sync'ed:  1.6% (517976/524116)K

        finish: 0:01:20 speed: 6,140 (6,140) K/sec

 1: cs:Connected st:Secondary/Primary ld:Consistent

    ns:0 nr:898404 dw:898404 dr:0 al:0 bm:2570 lo:6 pe:0 ua:6 ap:0

        finish: 0:01:02 speed: 13,492 (13,492) K/sec</pre>

<br>

<tt>I've been monitoring both DRBD devices' stats as well as the

underlying disk devices' stats when the replica device is disconnected

and stuck. During that time the primary DRBD device (drbd0), its remote

replica (drbd1) and both metadisk devices (md?) show data being

sent/received via network and written to metadisks. Why then so many

hundreds of MBs need to be re-synced?<br>

<br>

</tt><tt>Is there any other information which may help in getting to

the bottom of this? Questions, suggestions, answers anyone?<br>

<br>

Thanks in advance<br>

<br>

Maciek Olczak<br>

</tt>

</body>

</html>