On Sat, Jan 19, 2008 at 03:12:23PM +0100, Walter Haidinger wrote:
> Hi!
>
> I've got a reproducible problem with drbd-8.0.8. Well, the problem
> surfaced after migrating to v8, it never showed with drbd v7.

and you are sure that nothing else but the drbd version changed?
same kernel, same wlan drivers,
those metal-wire-fruit baskets in the middle of the room
did not start to dance, and your neighbor still has the
same old microwave oven?

> Both peers run openSUSE 10.3 and a self-compiled drbd 8.0.8. They're
> connected over an 11b WLAN-Link with a throughput of about 550 kB/s.
> Exchanged data is tunneled using OpenVPN. The link is pretty stable, I
> had a ping running for over a week and did not lose a single packet
> with an average rtt of 2.4 ms.

well. what about a flood ping with big packets?
 # ping -w 20 -f -s 4100 peer-node
or saturating your link using dd and netcat...

> Now, ever since going to v8, the peer is considered dead due to the
> ko-count (raised to 25 already). I've yet to figure out why this
> happens but there is a more annoying issue:
>
> Any subsequent attempt to shut down drbd on either node makes drbd
> hang in the disconnecting state, i.e. 'drbdadm disconnect res0' will
> show: Child process does not terminate! Exiting.

please do
 # ps -eo pid,state,wchan:30,cmd | grep -e D -e drbd

> /proc/drbd on both nodes then:
>  0: cs:Disconnecting st:Secondary/Unknown ds:UpToDate/Inconsistent C r---
>
> The logs show:
> Becoming sync source due to disk states.
> peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
> [drbd0_receiver/13018] sock_sendmsg time expired, ko = 24
> [drbd0_receiver/13018] sock_sendmsg time expired, ko = 23
> [drbd0_receiver/13018] sock_sendmsg time expired, ko = 22
> [drbd0_receiver/13018] sock_sendmsg time expired, ko = 21
> [drbd0_receiver/13018] sock_sendmsg time expired, ko = 20
> [drbd0_receiver/13018] sock_sendmsg time expired, ko = 19
> PingAck did not arrive in time.
> peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure )
> asender terminated
> short sent ReportBitMap size=4096 sent=2012
> Writing meta data super block now.
> md_sync_timer expired! Worker calls drbd_md_sync().
> role( Primary -> Secondary )
> conn( NetworkFailure -> Disconnecting )
>
> The other node logs:
> Becoming sync target due to disk states.
> peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Writing meta data super block now.
> sock_recvmsg returned -104
> peer( Primary -> Unknown ) conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> asender terminated
> error receiving ReportBitMap, l: 4088!
> tl_clear()
> Connection closed
> Writing meta data super block now.
> conn( NetworkFailure -> Unconnected )
>
> Is there any way to force a disconnect? So far only rebooting both
> nodes solves this as drbd will reconnect then. Well, until the next
> network failure.
>
> How can I further diagnose this? Do you need more information, like
> the drbd.conf setup?

 * saturate your network using other means,
   and see if it does similar things.
 * use tcpdump/wireshark,
   once you see the first "ko-count" message,
   have a look and have a guess.

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe    Fax +43-1-8178292-82  :
__
please use the "List-Reply" function of your email client.
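
To act on the first "ko-count" message you first have to spot it in the kernel log. A minimal sketch of how that could be done, assuming the messages look exactly like the "sock_sendmsg time expired, ko = N" lines quoted above; the helper names ko_values and first_ko_line are made up for illustration and are not part of drbd:

```shell
# Hypothetical helper: read log text on stdin and print the ko counter
# values, one per line, in the order they were logged.
ko_values() {
    sed -n 's/.*sock_sendmsg time expired, ko = \([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: print only the first log line that carries the
# ko countdown, i.e. the moment a capture becomes interesting.
first_ko_line() {
    awk '/sock_sendmsg time expired, ko = / { print; exit }'
}
```

For example, `dmesg | first_ko_line` would show when the countdown started, and `dmesg | ko_values | tail -n 1` the lowest value reached so far. Where the messages actually end up depends on the syslog setup of the node.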