Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I've moved on to testing 24 drbds on top of LVM as Lars suggested, rather than 24 LVs on top of drbd, but I did run across a fairly reproducible problem along the way. I'm not completely sure it's drbd, though.

CentOS 5 x86_64, 2.6.18-8.1.10.el5 kernel, drbd 8.0.4 installed with yum from the repositories.

- 4 disks in an md raid0 -> 1 drbd in pri/pri -> 24 LVs, each ext3 formatted.
- Two nodes connected back to back with bonded gigE, bond mode 0.
- Mount LVs 1-12 on box A, LVs 13-24 on box B.
- Do 'for i in 02 03 04 05 06 07 08 09 10 11 12; do ( rsync -a /mnt/lv01/foo /mnt/lv$i/ & ); done' and the equivalent for 13-24 on the other box at the same time. The end result is that each box has 11 readers hitting one spot and 11 writers hitting different spots. The 'foo' being copied is ~2.2GB.
- Soon (sometimes within seconds) my disk activity lights stop blinking. The network traffic does not. 'dstat' tells me I have identical amounts of network traffic recv/sent.
- This will time out eventually. If I set ko-count > 0, it'll time out then. If I set it to 0 or just leave the default, it'll time out before ko-count hits 0.

dmesg sez:

| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 7
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 6
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 5
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 4
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 3
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 2
| drbd0: [drbd0_worker/3707] sock_sendmsg time expired, ko = 1
| drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
| drbd0: Creating new current UUID
| drbd0: short read receiving data: read 2564 expected 4096
| drbd0: error receiving Data, l: 4120!
| drbd0: asender terminated
| drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
| drbd0: Writing meta data super block now.
| drbd0: tl_clear()
| drbd0: Connection closed
| drbd0: conn( NetworkFailure -> Unconnected )
| drbd0: receiver terminated
| drbd0: receiver (re)started
| drbd0: conn( Unconnected -> WFConnection )
| drbd0: conn( WFConnection -> WFReportParams )
| drbd0: Handshake successful: DRBD Network Protocol version 86
| drbd0: meta connection shut down by peer.
| drbd0: asender terminated
| drbd0: Split-Brain detected, dropping connection!
| drbd0: self 113D16A8CB3D8F2F:4B601577E9A3629F:7611D86007610640:AFAC4A775AFB24A8
| drbd0: peer 66EDD42691B2439B:4B601577E9A3629E:7611D86007610640:AFAC4A775AFB24A8
| drbd0: conn( WFReportParams -> Disconnecting )
| drbd0: error receiving ReportState, l: 4!
| drbd0: tl_clear()
| drbd0: Connection closed
| drbd0: conn( Disconnecting -> StandAlone )
| drbd0: receiver terminated

Both nodes do this. Drbd disconnects and my after-sb policies tell the two primaries not to talk to each other again. Each is pri/StandAlone and the rsync torture continues. Yes, my disks are no longer consistent; this is testing, so that's ok for now. I get all that.

But what causes this lockup in the first place? While "locked", the drbd workers are taking up CPU; the disks aren't busy but the network is. Lots of CPU time is spent in "wait". I can interact with the system, the disks, whatever.

If I drbdadm down the nodes, make one primary and up it again, life goes on after the sync. I can do this consistently. Every time, even.

If I switch the bonding to 'mode=2 xmit_hash_policy=layer3+4' I don't see this problem (well, never in 7+ tries).

Any ideas? Or what other info may I provide?
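For reference, here is roughly how the stack is put together. This is a sketch, not my exact commands; device names, LV sizes, VG/resource names and mount points are illustrative:

  # four disks striped into one md raid0
  mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sd[bcde]1

  # drbd0 sits on top of md0; LVM goes on top of drbd0 (with the drbd
  # resource already up and Primary on this box), ext3 on each LV
  pvcreate /dev/drbd0
  vgcreate vg00 /dev/drbd0
  for i in $(seq -w 1 24); do
      lvcreate -L 20G -n lv$i vg00
      mkfs.ext3 /dev/vg00/lv$i
      mkdir -p /mnt/lv$i
  done
  # box A mounts lv01-lv12, box B mounts lv13-lv24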
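The drbd side is a stock 8.0-style resource, roughly the following. Hostnames, addresses and the exact after-sb-0pri/1pri choices here are illustrative; the relevant bit for the behaviour described is after-sb-2pri disconnect:

  resource r0 {
    protocol C;
    net {
      allow-two-primaries;
      ko-count 7;                              # tried both > 0 and the default
      after-sb-0pri discard-younger-primary;   # illustrative
      after-sb-1pri consensus;                 # illustrative
      after-sb-2pri disconnect;                # leaves both sides pri/StandAlone
    }
    on boxa {
      device    /dev/drbd0;
      disk      /dev/md0;
      address   192.168.10.1:7788;
      meta-disk internal;
    }
    on boxb {
      device    /dev/drbd0;
      disk      /dev/md0;
      address   192.168.10.2:7788;
      meta-disk internal;
    }
  }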
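The bonding is set up with the usual modprobe.conf lines on CentOS 5; the only thing that changes between the failing and the working runs is the mode line (the miimon value is just the usual example, nothing significant):

  # /etc/modprobe.conf
  alias bond0 bonding
  # failing case: round-robin
  options bond0 mode=0 miimon=100
  # workaround that has survived 7+ tries: balance-xor, hashing on layer3+4
  # options bond0 mode=2 xmit_hash_policy=layer3+4 miimon=100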
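And the recovery dance mentioned above, roughly, with the resource name r0 assumed. The commented-out discard-my-data connect is the usual extra step if drbd still refuses to reconnect because of the split brain:

  # on both boxes
  drbdadm down r0
  # on the box whose data is being kept
  drbdadm up r0
  drbdadm primary r0
  # on the other box
  drbdadm up r0
  # if it still refuses to connect, discard that side's changes:
  # drbdadm -- --discard-my-data connect r0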