[DRBD-user] disconnecting hangs after ko-count failure

Tue Jan 22 17:33:05 CET 2008

On Sat, Jan 19, 2008 at 03:12:23PM +0100, Walter Haidinger wrote:
> Hi!
> 
> I've got a reproducible problem with drbd-8.0.8. Well, the problem
> surfaced after migrating to v8, it never showed with drbd v7.

and you are sure that nothing else but the drbd version changed?
same kernel, same wlan drivers, those metal-wire-fruit baskets
in the middle of the room did not start to dance, and your neighbor
still has the same old microwave oven?

> Both peers run openSUSE 10.3 and a self-compiled drbd 8.0.8. They're
> connected over an 11b WLAN-Link with a throughput of about 550 kB/s.
> Exchanged data is tunneled using OpenVPN. The link is pretty stable, I
> had a ping running for over a week and did not lose a single packet
> with an average rtt of 2.4 ms.

well.
what about a flood ping with big packets?
# ping -w 20 -f -s 4100 peer-node
or saturating your link using dd and netcat...
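
a rough sketch of the dd/netcat variant, in case it helps. the port
number and the 100 MB size are assumptions of mine, pick whatever suits
your setup. netcat flag syntax differs between variants, so both
listener forms are shown as comments only.

```shell
# PORT and the amount of data are assumptions, not taken from your setup.
PORT=5001

# Emit N megabytes of zeros on stdout (1024-byte blocks, portable dd syntax).
gen_load() {
    dd if=/dev/zero bs=1024 count=$(( $1 * 1024 )) 2>/dev/null
}

# On the receiving node, listen and discard everything:
#   nc -l -p "$PORT" > /dev/null     # traditional netcat
#   nc -l "$PORT" > /dev/null        # OpenBSD netcat
# On the sending node, push 100 MB through the tunnel:
#   gen_load 100 | nc peer-node "$PORT"
```

run it in both directions, and watch whether the link stalls or the
VPN drops packets while it is saturated.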

> Now, ever since going to v8, the peer is considered dead due to the
> ko-count (raised to 25 already). I've yet to figure out why this
> happens but there is a more annoying issue:
> 
> Any subsequent attempt to shut down drbd on either node makes drbd
> hang in the disconnecting state, i.e. 'drbdadm disconnect res0' will
> show: Child process does not terminate!  Exiting.

please do
# ps -eo pid,state,wchan:30,cmd | grep -e D -e drbd
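
the grep above matches any line containing a capital D, so it can be
noisy. a slightly tighter filter, as a sketch only: the helper name
filter_stuck is made up here, and it assumes the column order
pid,state,wchan,cmd from the ps call above.

```shell
# Keep the header line, every process in uninterruptible sleep (state D),
# and every line that mentions drbd. Reads ps output on stdin.
filter_stuck() {
    awk 'NR == 1 || $2 ~ /^D/ || /drbd/'
}

# Usage, with the same ps columns as above:
#   ps -eo pid,state,wchan:30,cmd | filter_stuck
```

the wchan column of the drbd threads is the interesting part: it shows
where in the kernel they are waiting.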

> /proc/drbd on both nodes then:
>  0: cs:Disconnecting st:Secondary/Unknown ds:UpToDate/Inconsistent C r---
> 
> The logs show:
>  Becoming sync source due to disk states.
>  peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
>  [drbd0_receiver/13018] sock_sendmsg time expired, ko = 24
>  [drbd0_receiver/13018] sock_sendmsg time expired, ko = 23
>  [drbd0_receiver/13018] sock_sendmsg time expired, ko = 22
>  [drbd0_receiver/13018] sock_sendmsg time expired, ko = 21
>  [drbd0_receiver/13018] sock_sendmsg time expired, ko = 20
>  [drbd0_receiver/13018] sock_sendmsg time expired, ko = 19
>  PingAck did not arrive in time.
>  peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure )
>  asender terminated
>  short sent ReportBitMap size=4096 sent=2012
>  Writing meta data super block now.
>  md_sync_timer expired! Worker calls drbd_md_sync().
>  role( Primary -> Secondary )
>  conn( NetworkFailure -> Disconnecting )
> 
> The other node logs:
>  Becoming sync target due to disk states.
>  peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
>  Writing meta data super block now.
>  sock_recvmsg returned -104
>  peer( Primary -> Unknown ) conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>  asender terminated
>  error receiving ReportBitMap, l: 4088!
>  tl_clear()
>  Connection closed
>  Writing meta data super block now.
>  conn( NetworkFailure -> Unconnected )
> 
> Is there any way to force a disconnect? So far only rebooting both
> nodes solves this as drbd will reconnect then. Well, until the next
> network failure. 
> 
> How can I further diagnose this? Do you need more information, like
> the drbd.conf setup? 

 * saturate your network using other means,
   and see if it does similar things.
 * use tcpdump/wireshark; once you see the first "ko-count" message,
   have a look at the capture and take a guess at what is going on.
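
a possible way to set up the capture, as a sketch. the interface name,
port, capture file and log path below are all assumptions (tun0 for the
OpenVPN device, 7788 for the port of the resource in drbd.conf);
first_ko is just a made-up helper name.

```shell
# IFACE, PORT and OUT are assumptions; adjust them to the actual setup.
IFACE=tun0               # OpenVPN tunnel device (assumed name)
PORT=7788                # port used by the resource in drbd.conf (assumed)
OUT=/tmp/drbd-ko.pcap    # capture file (hypothetical path)

# Print the first ko-count countdown line from a kernel log read on stdin.
first_ko() {
    grep -m 1 'sock_sendmsg time expired'
}

# Start the capture before the nodes connect (needs root):
#   tcpdump -i "$IFACE" -s 0 -w "$OUT" tcp port "$PORT"
# Afterwards, find when the countdown started (log path assumed):
#   first_ko < /var/log/messages
```

the timestamp of that first log line tells you where to start looking
in the capture file.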

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please use the "List-Reply" function of your email client.