[DRBD-user] Master - master and split brain if network is saturated

Steve Kieu msh.computing at gmail.com
Mon Jan 16 01:36:38 CET 2012


Hello everyone,

I am still testing DRBD + OCFS2 with both nodes as masters, and I have
noticed that if the network interface is saturated, DRBD drops the
connection.

Nodes A and B both run CentOS 6 with drbd-8.3.12 built from vanilla source
and kernel 2.6.37.6 with the vserver patch vs2.3.0.37-rc5.
It has happened before. I do not know the cause, but I suspect
network-related problems. Today I was just running rsync on a large
directory from A to another host (not B), but over the same interface that
DRBD is using (that is, I do not have a dedicated interface for DRBD). The
rsync took a long time, and I found this in dmesg:

[ 1844.822895] block drbd0: Resync done (total 1 sec; paused 0 sec; 8 K/sec)
[ 1844.822904] block drbd0: updated UUIDs
56D7AD5A0D1052FC:0000000000000000:34133820BE9332EE:34123820BE9332EF
[ 1844.822910] block drbd0: conn( SyncTarget -> Connected ) disk(
Inconsistent -> UpToDate )
[ 1844.823072] block drbd0: helper command: /sbin/drbdadm
after-resync-target minor-0
[ 1844.825128] block drbd0: helper command: /sbin/drbdadm
after-resync-target minor-0 exit code 0 (0x0)
[ 1844.825584] block drbd0: bitmap WRITE of 22 pages took 0 jiffies
[ 1844.825592] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk
bit-map.
[ 1893.341247] block drbd0: peer( Secondary -> Primary )
[ 1899.914431] block drbd0: role( Secondary -> Primary )
[ 2639.422664] Loading kernel module for a network device with
CAP_SYS_MODULE (deprecated).  Use CAP_NET_ADMIN and alias netdev-dummy0
instead
[ 2648.421314] dummy0: no IPv6 routers present
[72370.860513] block drbd0: role( Primary -> Secondary )
[72370.860600] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
[72370.860611] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk
bit-map.
[72393.260995] block drbd0: Considerable difference in lower level device
sizes: 37747512s vs. 6291192s
[72491.794195] block drbd0: role( Secondary -> Primary )
[72566.421735] block drbd0: Considerable difference in lower level device
sizes: 37747512s vs. 6291192s
[72664.997280] block drbd0: role( Primary -> Secondary )
[72664.997365] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
[72664.997375] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk
bit-map.
[72708.334128] block drbd0: Considerable difference in lower level device
sizes: 37747512s vs. 6291192s
[72708.334134] block drbd0: drbd_bm_resize called with capacity == 37747512
[72708.334238] block drbd0: resync bitmap: bits=4718439 words=73726
pages=144
[72708.334242] block drbd0: size = 18 GB (18873756 KB)
[72708.334247] block drbd0: Writing the whole bitmap, size changed and md
moved
[72708.335349] block drbd0: bitmap WRITE of 120 pages took 1 jiffies
[72708.335357] block drbd0: 15 GB (3932040 bits) marked out-of-sync by on
disk bit-map.
[72708.335564] block drbd0: Resync of new storage after online grow
[72708.335573] block drbd0: conn( Connected -> WFSyncUUID ) disk( UpToDate
-> Outdated )
[72708.339931] block drbd0: updated sync uuid
067858B97D1E599C:0000000000000000:34133820BE9332EE:34123820BE9332EF
[72708.340130] block drbd0: helper command: /sbin/drbdadm
before-resync-target minor-0
[72708.342435] block drbd0: helper command: /sbin/drbdadm
before-resync-target minor-0 exit code 0 (0x0)
[72708.342443] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated
-> Inconsistent )
[72708.342454] block drbd0: Began resync as SyncTarget (will sync 15728160
KB [3932040 bits set]).
[72742.631740] block drbd0: role( Secondary -> Primary )
[75037.775246] block drbd0: Resync done (total 2335 sec; paused 0 sec; 6732
K/sec)
[75037.775254] block drbd0: updated UUIDs
56D7AD5A0D1052FD:0000000000000000:067858B97D1E599D:34133820BE9332EF
[75037.775262] block drbd0: conn( SyncTarget -> Connected ) disk(
Inconsistent -> UpToDate )
[75037.775455] block drbd0: helper command: /sbin/drbdadm
after-resync-target minor-0
[75037.777818] block drbd0: helper command: /sbin/drbdadm
after-resync-target minor-0 exit code 0 (0x0)
[75037.777837] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
[75037.777846] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk
bit-map.
[759301.427734] block drbd0: sock was shut down by peer
[759301.427743] block drbd0: peer( Primary -> Unknown ) conn( Connected ->
BrokenPipe ) pdsk( UpToDate -> DUnknown )
[759301.427751] block drbd0: short read expecting header on sock: r=0
[759301.427793] block drbd0: meta connection shut down by peer.
[759301.427812] block drbd0: asender terminated
[759301.427814] block drbd0: Terminating asender thread
[759301.427848] block drbd0: new current UUID
4E3A8EF514190575:56D7AD5A0D1052FD:067858B97D1E599D:34133820BE9332EF
[759301.428512] block drbd0: Connection closed
[759301.428518] block drbd0: conn( BrokenPipe -> Unconnected )
[759301.428523] block drbd0: receiver terminated
[759301.428526] block drbd0: Restarting receiver thread
[759301.428528] block drbd0: receiver (re)started
[759301.428532] block drbd0: conn( Unconnected -> WFConnection )
[759302.119164] block drbd0: Handshake successful: Agreed network protocol
version 96
[759302.119192] block drbd0: conn( WFConnection -> WFReportParams )
[759302.119342] block drbd0: Starting asender thread (from drbd0_receiver
[2247])
[759302.119563] block drbd0: data-integrity-alg: <not-used>
[759302.119671] block drbd0: drbd_sync_handshake:
[759302.119676] block drbd0: self
4E3A8EF514190575:56D7AD5A0D1052FD:067858B97D1E599D:34133820BE9332EF bits:0
flags:0
[759302.119681] block drbd0: peer
B8D842CF6950FAB3:56D7AD5A0D1052FD:067858B97D1E599C:34133820BE9332EF bits:1
flags:0
[759302.119685] block drbd0: uuid_compare()=100 by rule 90
[759302.119689] block drbd0: helper command: /sbin/drbdadm
initial-split-brain minor-0
[759302.121991] block drbd0: helper command: /sbin/drbdadm
initial-split-brain minor-0 exit code 0 (0x0)
[759302.121995] block drbd0: Split-Brain detected but unresolved, dropping
connection!
[759302.122016] block drbd0: helper command: /sbin/drbdadm split-brain
minor-0
[759302.124199] block drbd0: helper command: /sbin/drbdadm split-brain
minor-0 exit code 0 (0x0)
[759302.124205] block drbd0: conn( WFReportParams -> Disconnecting )
[759302.124213] block drbd0: error receiving ReportState, l: 4!
[759302.124242] block drbd0: asender terminated
[759302.124247] block drbd0: Terminating asender thread
[759302.124288] block drbd0: Connection closed
[759302.124294] block drbd0: conn( Disconnecting -> StandAlone )
[759302.124401] block drbd0: receiver terminated
[759302.124405] block drbd0: Terminating receiver thread
[842477.436163] block drbd0: role( Primary -> Secondary )
[842477.436245] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
[842477.436256] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk
bit-map.
[842477.436817] block drbd0: disk( UpToDate -> Failed )
[842477.436906] block drbd0: disk( Failed -> Diskless )
[842477.437531] block drbd0: drbd_bm_resize called with capacity == 0
[842477.437591] block drbd0: worker terminated
[842477.437594] block drbd0: Terminating worker thread
[842477.590493] drbd: module cleanup done.
[842488.657890] drbd: initialized. Version: 8.3.12 (api:88/proto:86-96)
[842488.657893] drbd: GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f
build by root at cosmos, 2011-12-25 16:56:17
[842488.657896] drbd: registered as block device major 147
[842488.657898] drbd: minor_table @ 0xffff88041c35e100
[842488.680349] block drbd0: Starting worker thread (from kworker/u:3 [345])
[842488.680729] block drbd0: disk( Diskless -> Attaching )
[842488.694650] block drbd0: No usable activity log found.
[842488.694655] block drbd0: Method to ensure write ordering: barrier
[842488.694658] block drbd0: max BIO size = 131072
[842488.694664] block drbd0: drbd_bm_resize called with capacity == 37747512
[842488.694806] block drbd0: resync bitmap: bits=4718439 words=73726
pages=144
[842488.694809] block drbd0: size = 18 GB (18873756 KB)
[842488.703982] block drbd0: bitmap READ of 144 pages took 1 jiffies
[842488.704114] block drbd0: recounting of set bits took additional 0
jiffies
[842488.704117] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk
bit-map.
[842488.704123] block drbd0: disk( Attaching -> UpToDate )
[842488.704127] block drbd0: attached to UUIDs
4E3A8EF514190575:56D7AD5A0D1052FD:067858B97D1E599D:34133820BE9332EF
[842488.707024] block drbd0: conn( StandAlone -> Unconnected )
[842488.707043] block drbd0: Starting receiver thread (from drbd0_worker
[20773])
[842488.707244] block drbd0: receiver (re)started
[842488.707253] block drbd0: conn( Unconnected -> WFConnection )
[842489.201687] block drbd0: Handshake successful: Agreed network protocol
version 96
[842489.201713] block drbd0: conn( WFConnection -> WFReportParams )
[842489.201905] block drbd0: Starting asender thread (from drbd0_receiver
[20782])
[842489.202075] block drbd0: data-integrity-alg: <not-used>
[842489.202090] block drbd0: drbd_sync_handshake:
[842489.202094] block drbd0: self
4E3A8EF514190574:56D7AD5A0D1052FD:067858B97D1E599D:34133820BE9332EF bits:0
flags:0
[842489.202099] block drbd0: peer
B8D842CF6950FAB2:56D7AD5A0D1052FD:067858B97D1E599C:34133820BE9332EF
bits:36737 flags:0
[842489.202102] block drbd0: uuid_compare()=100 by rule 90
[842489.202107] block drbd0: helper command: /sbin/drbdadm
initial-split-brain minor-0
[842489.204369] block drbd0: helper command: /sbin/drbdadm
initial-split-brain minor-0 exit code 0 (0x0)
[842489.204374] block drbd0: Split-Brain detected, 0 primaries,
automatically solved. Sync from peer node
[842489.204381] block drbd0: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown
-> UpToDate )
[842489.297122] block drbd0: conn( WFBitMapT -> WFSyncUUID )
[842489.307270] block drbd0: updated sync uuid
56D8AD5A0D1052FC:0000000000000000:067858B97D1E599D:34133820BE9332EF
[842489.307479] block drbd0: helper command: /sbin/drbdadm
before-resync-target minor-0
[842489.309577] block drbd0: helper command: /sbin/drbdadm
before-resync-target minor-0 exit code 0 (0x0)
[842489.309585] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk(
Outdated -> Inconsistent )
[842489.309595] block drbd0: Began resync as SyncTarget (will sync 146948
KB [36737 bits set]).
[842508.372386] block drbd0: Resync done (total 19 sec; paused 0 sec; 7732
K/sec)
[842508.372393] block drbd0: updated UUIDs
B8D842CF6950FAB2:0000000000000000:56D8AD5A0D1052FC:56D7AD5A0D1052FD
[842508.372398] block drbd0: conn( SyncTarget -> Connected ) disk(
Inconsistent -> UpToDate )
[842508.372485] block drbd0: helper command: /sbin/drbdadm
after-resync-target minor-0
[842508.374683] block drbd0: helper command: /sbin/drbdadm
after-resync-target minor-0 exit code 0 (0x0)
[842508.375756] block drbd0: bitmap WRITE of 131 pages took 0 jiffies
[842508.375765] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk
bit-map.
[842525.184360] block drbd0: peer( Secondary -> Primary )
[842565.780288] block drbd0: role( Secondary -> Primary )
[1871271.922857] block drbd0: sock was shut down by peer
[1871271.922866] block drbd0: peer( Primary -> Unknown ) conn( Connected ->
BrokenPipe ) pdsk( UpToDate -> DUnknown )
[1871271.922873] block drbd0: short read expecting header on sock: r=0
[1871271.922900] block drbd0: new current UUID
2EDBDEEE9284642B:B8D842CF6950FAB3:56D8AD5A0D1052FC:56D7AD5A0D1052FD
[1871271.922908] block drbd0: meta connection shut down by peer.
[1871271.922929] block drbd0: asender terminated
[1871271.922932] block drbd0: Terminating asender thread
[1871271.923606] block drbd0: Connection closed
[1871271.923612] block drbd0: conn( BrokenPipe -> Unconnected )
[1871271.923618] block drbd0: receiver terminated
[1871271.923620] block drbd0: Restarting receiver thread
[1871271.923623] block drbd0: receiver (re)started
[1871271.923627] block drbd0: conn( Unconnected -> WFConnection )
[1871272.616257] block drbd0: Handshake successful: Agreed network protocol
version 96
[1871272.616282] block drbd0: conn( WFConnection -> WFReportParams )
[1871272.616433] block drbd0: Starting asender thread (from drbd0_receiver
[20782])
[1871272.616598] block drbd0: data-integrity-alg: <not-used>
[1871272.616829] block drbd0: drbd_sync_handshake:
[1871272.616835] block drbd0: self
2EDBDEEE9284642B:B8D842CF6950FAB3:56D8AD5A0D1052FC:56D7AD5A0D1052FD bits:0
flags:0
[1871272.616839] block drbd0: peer
C301948203A19D2F:B8D842CF6950FAB3:56D8AD5A0D1052FD:56D7AD5A0D1052FD
bits:408 flags:0
[1871272.616843] block drbd0: uuid_compare()=100 by rule 90
[1871272.616848] block drbd0: helper command: /sbin/drbdadm
initial-split-brain minor-0
[1871272.619242] block drbd0: helper command: /sbin/drbdadm
initial-split-brain minor-0 exit code 0 (0x0)
[1871272.619247] block drbd0: Split-Brain detected but unresolved, dropping
connection!
[1871272.619268] block drbd0: helper command: /sbin/drbdadm split-brain
minor-0
[1871272.621532] block drbd0: helper command: /sbin/drbdadm split-brain
minor-0 exit code 0 (0x0)
[1871272.621538] block drbd0: conn( WFReportParams -> Disconnecting )
[1871272.621547] block drbd0: error receiving ReportState, l: 4!
[1871272.621627] block drbd0: asender terminated
[1871272.621634] block drbd0: Terminating asender thread
[1871272.621817] block drbd0: Connection closed
[1871272.621825] block drbd0: conn( Disconnecting -> StandAlone )
[1871272.622039] block drbd0: receiver terminated
[1871272.622043] block drbd0: Terminating receiver thread
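
To make the split-brain events easier to spot in a long dmesg dump like the one above, here is a minimal sketch in Python. It assumes each kernel message is on a single line (the mail client wrapped the lines above), and the function name `split_brain_events` is only illustrative. It treats any "Split-Brain detected" message that does not contain the word "unresolved" as automatically solved, which matches the three messages in this log but is an assumption beyond them.

```python
import re

# Matches lines such as:
# [759302.121995] block drbd0: Split-Brain detected but unresolved, dropping connection!
EVENT_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+block (?P<dev>drbd\d+): "
    r"Split-Brain detected(?P<rest>.*)$"
)

def split_brain_events(lines):
    """Return (timestamp, device, auto_resolved) for each split-brain message."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            # Assumption: a message without "unresolved" was solved automatically.
            resolved = "unresolved" not in m.group("rest")
            events.append((float(m.group("ts")), m.group("dev"), resolved))
    return events

# Unwrapped copies of messages from the log above.
SAMPLE_LOG = [
    "[759302.121995] block drbd0: Split-Brain detected but unresolved, dropping connection!",
    "[759302.124294] block drbd0: conn( Disconnecting -> StandAlone )",
    "[842489.204374] block drbd0: Split-Brain detected, 0 primaries, automatically solved. Sync from peer node",
    "[1871272.619247] block drbd0: Split-Brain detected but unresolved, dropping connection!",
]

if __name__ == "__main__":
    for ts, dev, resolved in split_brain_events(SAMPLE_LOG):
        print(ts, dev, "auto-resolved" if resolved else "unresolved")
```

On the sample lines this reports three events: two unresolved (both while this node was Primary) and one solved automatically after the module reload, when both nodes were Secondary.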


And now I have:

cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by root at cosmos,
2011-12-25 16:56:17
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
    ns:0 nr:107240176 dw:107240176 dr:664 al:0 bm:24 lo:0 pe:0 ua:0 ap:0
ep:1 wo:b oos:0
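
For monitoring, the status line above can be checked by a script so that a StandAlone state is noticed early. This is a minimal sketch in Python, not an official tool: the regular expression only covers the `cs:`, `ro:` and `ds:` fields as they appear in the 8.3.12 output above, and the names `parse_status` and `needs_attention` are hypothetical.

```python
import re

# Status line copied from the /proc/drbd output above.
SAMPLE = " 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----"

STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+ro:(?P<ro_local>[^/\s]+)/(?P<ro_peer>\S+)"
    r"\s+ds:(?P<ds_local>[^/\s]+)/(?P<ds_peer>\S+)"
)

def parse_status(line):
    """Return a dict of connection state, roles and disk states, or None."""
    m = STATUS_RE.match(line)
    return m.groupdict() if m else None

def needs_attention(status):
    # Assumption for this sketch: StandAlone means the node has stopped
    # trying to reconnect, so someone has to intervene.
    return status["cs"] == "StandAlone"

if __name__ == "__main__":
    status = parse_status(SAMPLE)
    print(status)
    print("needs attention:", needs_attention(status))
```

In practice the lines would come from reading `/proc/drbd` instead of `SAMPLE`; lines that are not device status lines (the version and GIT-hash lines, the counters) simply return `None`.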


I do not think this is expected. How can I prevent it?

Thanks in advance


-- 
Steve Kieu