[DRBD-user] disconnecting hangs after ko-count failure

hexie hexie at tx.com.cn
Mon Jan 21 03:07:53 CET 2008


drbd.conf content:
   net {
        after-sb-0pri discard-older-primary;
        after-sb-1pri call-pri-lost-after-sb;
        after-sb-2pri call-pri-lost-after-sb;
   }

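For context, the after-sb policies above live in the same net section as the ko-count and timeout knobs that drive the disconnect described below. A sketch of what such a section might look like in full — the values here are illustrative defaults plus the ko-count mentioned in the report, not the poster's actual settings:

```
net {
    after-sb-0pri   discard-older-primary;
    after-sb-1pri   call-pri-lost-after-sb;
    after-sb-2pri   call-pri-lost-after-sb;
    ko-count        25;   # declare the peer dead after 25 expired timeouts
    timeout         60;   # request timeout, in tenths of a second (= 6 s)
    ping-int        10;   # seconds between keep-alive pings
    connect-int     10;   # seconds between reconnect attempts
}
```

On a slow, tunneled 802.11b link, the interaction of timeout and ko-count is a plausible place to look: every time a send blocks for the full timeout, the ko counter is decremented, as the "ko = 24 ... 19" log lines show.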
  ----- Original Message ----- 
  From: Walter Haidinger 
  To: drbd-user at lists.linbit.com 
  Sent: Saturday, January 19, 2008 10:12 PM
  Subject: [DRBD-user] disconnecting hangs after ko-count failure


  I've got a reproducible problem with drbd-8.0.8. The problem only surfaced after migrating to v8; it never showed with drbd v7.

  Both peers run openSUSE 10.3 and a self-compiled drbd 8.0.8. They're connected over an 802.11b WLAN link with a throughput of about 550 kB/s; the traffic is tunneled through OpenVPN. The link is quite stable: I had a ping running for over a week and did not lose a single packet, with an average RTT of 2.4 ms.

  Now, ever since going to v8, the peer keeps being declared dead by the ko-count mechanism (already raised to 25). I've yet to figure out why that happens, but there is a more annoying issue:

  Any subsequent attempt to shut down drbd on either node makes it hang in the Disconnecting state, i.e. 'drbdadm disconnect res0' reports:
  Child process does not terminate!

  /proc/drbd on both nodes then:
   0: cs:Disconnecting st:Secondary/Unknown ds:UpToDate/Inconsistent C r---

  The logs show:
   Becoming sync source due to disk states.
   peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
   [drbd0_receiver/13018] sock_sendmsg time expired, ko = 24
   [drbd0_receiver/13018] sock_sendmsg time expired, ko = 23
   [drbd0_receiver/13018] sock_sendmsg time expired, ko = 22
   [drbd0_receiver/13018] sock_sendmsg time expired, ko = 21
   [drbd0_receiver/13018] sock_sendmsg time expired, ko = 20
   [drbd0_receiver/13018] sock_sendmsg time expired, ko = 19
   PingAck did not arrive in time.
   peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure )
   asender terminated
   short sent ReportBitMap size=4096 sent=2012
   Writing meta data super block now.
   md_sync_timer expired! Worker calls drbd_md_sync().
   role( Primary -> Secondary )
   conn( NetworkFailure -> Disconnecting )

  The other node logs:
   Becoming sync target due to disk states.
   peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
   Writing meta data super block now.
   sock_recvmsg returned -104
   peer( Primary -> Unknown ) conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
   asender terminated
   error receiving ReportBitMap, l: 4088!
   Connection closed
   Writing meta data super block now.
   conn( NetworkFailure -> Unconnected )

  Is there any way to force a disconnect? So far only rebooting both nodes solves this as drbd will reconnect then. Well, until the next network failure. 
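  A few commands that may be worth trying before resorting to a reboot — whether they get past the stuck Disconnecting state in this case is untested, and the resource name res0 / device /dev/drbd0 are taken from the output above:

```
# Show current connection and disk state on each node
cat /proc/drbd

# Take the whole resource down (disconnect + detach) instead of
# a plain disconnect; may hang too if the state machine is stuck
drbdadm down res0

# Lower-level equivalent, addressing the device directly
drbdsetup /dev/drbd0 down
```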

  How can I further diagnose this? Do you need more information, like the drbd.conf setup? 

  Any hints are appreciated!

  Thanks, Walter
