Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I've moved on to testing 24 drbds on top of LVM as Lars suggested, rather than 24 LVs on top of drbd, but I did run across a fairly reproducible problem along the way. I'm not completely sure it's drbd, though.

CentOS 5 x86_64, 2.6.18-8.1.10.el5 kernel, drbd 8.0.4 installed with yum from the repositories.

- 4 disks in an md raid0 -> 1 drbd in pri/pri -> 24 LVs, each ext3 formatted.
- Two nodes connected back to back with bonded gigE, bond mode 0.
- Mount LVs 1-12 on box A, LVs 13-24 on box B.
- Do 'for i in 02 03 04 05 06 07 08 09 10 11 12; do ( rsync -a /mnt/lv01/foo /mnt/lv$i/ & ); done' and the equivalent for 13-24 on the other box at the same time. The end result is that each box has 11 readers hitting one spot and 11 writers hitting different spots. The 'foo' being copied is ~2.2GB.
- Soon (sometimes within seconds) my disk activity lights stop blinking. The network traffic does not. 'dstat' tells me I have identical amounts of network traffic recv/sent.
- This will time out eventually. If I set ko-count > 0, it'll time out then. If I set it to 0 or just leave the default, it'll time out before ko-count hits 0.

dmesg sez:

| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 7
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 6
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 5
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 4
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 3
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 2
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 1
| drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
| drbd0: Creating new current UUID
| drbd0: short read receiving data: read 2564 expected 4096
| drbd0: error receiving Data, l: 4120!
| drbd0: asender terminated
| drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
| drbd0: Writing meta data super block now.
| drbd0: tl_clear()
| drbd0: Connection closed
| drbd0: conn( NetworkFailure -> Unconnected )
| drbd0: receiver terminated
| drbd0: receiver (re)started
| drbd0: conn( Unconnected -> WFConnection )
| drbd0: conn( WFConnection -> WFReportParams )
| drbd0: Handshake successful: DRBD Network Protocol version 86
| drbd0: meta connection shut down by peer.
| drbd0: asender terminated
| drbd0: Split-Brain detected, dropping connection!
| drbd0: self 113D16A8CB3D8F2F:4B601577E9A3629F:7611D86007610640:AFAC4A775AFB24A8
| drbd0: peer 66EDD42691B2439B:4B601577E9A3629E:7611D86007610640:AFAC4A775AFB24A8
| drbd0: conn( WFReportParams -> Disconnecting )
| drbd0: error receiving ReportState, l: 4!
| drbd0: tl_clear()
| drbd0: Connection closed
| drbd0: conn( Disconnecting -> StandAlone )
| drbd0: receiver terminated

Both nodes do this. Drbd disconnects and my after-sb policies tell the two primaries not to talk to each other again. Each is pri/StandAlone and the rsync torture continues. Yes, my disks are no longer consistent; this is testing, so that's ok for now. I get all that.

But what causes this lockup in the first place? While "locked", the drbd workers are taking up CPU; the disks aren't busy but the network is. Lots of CPU time is spent in "wait". I can interact with the system, the disks, whatever.

If I drbdadm down the nodes, make one primary and up it again, life goes on after the sync. I can do this consistently. Every time, even.

If I switch the bonding to 'mode=2 xmit_hash_policy=layer3+4' I don't see this problem (well, never in 7+ tries).

Any ideas? Or what other info may I provide?
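For reference, here is roughly how the stack is put together. This is a sketch, not my exact commands; device names, LV sizes, VG/resource names and mount points are illustrative:

  # four disks striped into one md raid0
  mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sd[bcde]1

  # drbd0 sits on top of md0; LVM goes on top of drbd0 (with the drbd
  # resource already up and Primary on this box), ext3 on each LV
  pvcreate /dev/drbd0
  vgcreate vg00 /dev/drbd0
  for i in $(seq -w 1 24); do
      lvcreate -L 20G -n lv$i vg00
      mkfs.ext3 /dev/vg00/lv$i
      mkdir -p /mnt/lv$i
  done
  # box A mounts lv01-lv12, box B mounts lv13-lv24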
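The drbd side is a stock 8.0-style resource, roughly the following. Hostnames, addresses and the exact after-sb-0pri/1pri choices here are illustrative; the relevant bit for the behaviour described is after-sb-2pri disconnect:

  resource r0 {
    protocol C;
    net {
      allow-two-primaries;
      ko-count 7;                              # tried both > 0 and the default
      after-sb-0pri discard-younger-primary;   # illustrative
      after-sb-1pri consensus;                 # illustrative
      after-sb-2pri disconnect;                # leaves both sides pri/StandAlone
    }
    on boxa {
      device    /dev/drbd0;
      disk      /dev/md0;
      address   192.168.10.1:7788;
      meta-disk internal;
    }
    on boxb {
      device    /dev/drbd0;
      disk      /dev/md0;
      address   192.168.10.2:7788;
      meta-disk internal;
    }
  }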
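The bonding is set up with the usual modprobe.conf lines on CentOS 5; the only thing that changes between the failing and the working runs is the mode line (the miimon value is just the usual example, nothing significant):

  # /etc/modprobe.conf
  alias bond0 bonding
  # failing case: round-robin
  options bond0 mode=0 miimon=100
  # workaround that has survived 7+ tries: balance-xor, hashing on layer3+4
  # options bond0 mode=2 xmit_hash_policy=layer3+4 miimon=100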
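And the recovery dance mentioned above, roughly, with the resource name r0 assumed. The commented-out discard-my-data connect is the usual extra step if drbd still refuses to reconnect because of the split brain:

  # on both boxes
  drbdadm down r0
  # on the box whose data is being kept
  drbdadm up r0
  drbdadm primary r0
  # on the other box
  drbdadm up r0
  # if it still refuses to connect, discard that side's changes:
  # drbdadm -- --discard-my-data connect r0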