[DRBD-user] io error when mounting drbd device

Lawrence Strydom qholloi at gmail.com
Fri Feb 17 12:03:23 CET 2012



Thanks for the replies Felix and David,

OK, losing data on one node is not an issue for me at this point, but I
cannot afford a repeat. I am very glad this happened now, before going live.
I shut down ocfs2 and o2cb on the secondary node and am re-syncing now.
What could have caused this? Both machines were untouched for a week, with
no traffic other than developers testing the site.

I am busy setting up Nagios monitoring as well and will re-read the fencing
docs to make sure all is good.
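
For the monitoring side, here is a minimal sketch of a Nagios-style check
that parses /proc/drbd text in the format quoted further down in this
thread. The function name check_drbd and the mapping of states to
severities are assumptions for illustration, not something DRBD or Nagios
ships. Only the 0/1/2/3 exit code convention comes from the Nagios plugin
guidelines.

```python
import re

# Per-device status line of /proc/drbd (DRBD 8.3 layout), for example:
#  0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
STATUS_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+ro:(\S+)\s+ds:(\S+)")

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: an active resync is worth a warning, not a page.
RESYNC_STATES = {"SyncSource", "SyncTarget"}


def check_drbd(proc_drbd_text):
    """Return (exit_code, message) for the given /proc/drbd contents."""
    worst = OK
    reports = []
    for line in proc_drbd_text.splitlines():
        m = STATUS_RE.match(line)
        if not m:
            continue  # version line, GIT-hash line, counter lines
        minor, cs, ro, ds = m.groups()
        if cs == "Connected" and ds == "UpToDate/UpToDate":
            level = OK
        elif cs in RESYNC_STATES:
            level = WARNING
        else:
            # StandAlone, WFConnection, DUnknown peer disk, and so on.
            level = CRITICAL
        worst = max(worst, level)
        reports.append(f"drbd{minor} cs:{cs} ro:{ro} ds:{ds}")
    if not reports:
        return UNKNOWN, "UNKNOWN: no DRBD devices found"
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[worst]
    return worst, f"{label}: " + "; ".join(reports)
```

A wrapper script would read /proc/drbd, print the message, and exit with
the returned code. With a check like this, the StandAlone state would have
been flagged as CRITICAL on the day the connection dropped.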

My OS is Ubuntu 10.04 with kernel version 2.6.32 on both nodes.

This is the kernel log from node2:
===========================================
Feb 17 10:47:54 web02 kernel: [   12.894830] OCFS2 Node Manager 1.5.0
Feb 17 10:47:54 web02 kernel: [   12.899444] OCFS2 DLM 1.5.0
Feb 17 10:47:54 web02 kernel: [   12.901012] ocfs2: Registered cluster interface o2cb
Feb 17 10:47:54 web02 kernel: [   12.910541] OCFS2 DLMFS 1.5.0
Feb 17 10:47:54 web02 kernel: [   12.910820] OCFS2 User DLM kernel interface loaded
Feb 17 10:47:54 web02 kernel: [   13.013907] padlock: VIA PadLock not detected.
Feb 17 10:47:54 web02 kernel: [   13.016874] alg: No test for __cbc-aes-aesni (cryptd(__driver-cbc-aes-aesni))
Feb 17 10:47:54 web02 kernel: [   13.019825] padlock: VIA PadLock Hash Engine not detected.
Feb 17 10:47:54 web02 kernel: [   13.234666] drbd: initialized. Version: 8.3.7 (api:88/proto:86-91)
Feb 17 10:47:54 web02 kernel: [   13.234669] drbd: GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at web02, 2012-01-10 09:55:21
Feb 17 10:47:54 web02 kernel: [   13.234672] drbd: registered as block device major 147
Feb 17 10:47:54 web02 kernel: [   13.234674] drbd: minor_table @ 0xffff88061239ea00
Feb 17 10:47:54 web02 kernel: [   13.238802] block drbd0: Starting worker thread (from cqueue [1482])
Feb 17 10:47:54 web02 kernel: [   13.238983] block drbd0: disk( Diskless -> Attaching )
Feb 17 10:47:54 web02 kernel: [   13.258795] block drbd0: Found 4 transactions (14 active extents) in activity log.
Feb 17 10:47:54 web02 kernel: [   13.258799] block drbd0: Method to ensure write ordering: barrier
Feb 17 10:47:54 web02 kernel: [   13.258803] block drbd0: Backing device's merge_bvec_fn() = ffffffff81439d10
Feb 17 10:47:54 web02 kernel: [   13.258806] block drbd0: max_segment_size ( = BIO size ) = 4096
Feb 17 10:47:54 web02 kernel: [   13.258808] block drbd0: Adjusting my ra_pages to backing device's (32 -> 96)
Feb 17 10:47:54 web02 kernel: [   13.258812] block drbd0: drbd_bm_resize called with capacity == 2726214328
Feb 17 10:47:54 web02 kernel: [   13.268969] block drbd0: resync bitmap: bits=340776791 words=5324638
Feb 17 10:47:54 web02 kernel: [   13.268976] block drbd0: size = 1300 GB (1363107164 KB)
Feb 17 10:47:54 web02 kernel: [   13.531587] block drbd0: recounting of set bits took additional 5 jiffies
Feb 17 10:47:54 web02 kernel: [   13.531592] block drbd0: 56 GB (14607631 bits) marked out-of-sync by on disk bit-map.
Feb 17 10:47:54 web02 kernel: [   13.531600] block drbd0: disk( Attaching -> UpToDate )
Feb 17 10:47:54 web02 kernel: [   13.535865] block drbd0: conn( StandAlone -> Unconnected )
Feb 17 10:47:54 web02 kernel: [   13.535889] block drbd0: Starting receiver thread (from drbd0_worker [1484])
Feb 17 10:47:54 web02 kernel: [   13.535998] block drbd0: receiver (re)started
Feb 17 10:47:54 web02 kernel: [   13.536006] block drbd0: conn( Unconnected -> WFConnection )
Feb 17 10:47:54 web02 kernel: [   13.716586] Adding 31248376k swap on /dev/mapper/cryptswap1.  Priority:-1 extents:1 across:31248376k
Feb 17 10:47:57 web02 kernel: [   15.806435] bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
Feb 17 10:47:57 web02 kernel: [   15.808235] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Feb 17 10:47:57 web02 kernel: [   16.001305] bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Feb 17 10:47:57 web02 kernel: [   16.003044] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Feb 17 10:48:07 web02 kernel: [   25.861028] eth1: no IPv6 routers present
Feb 17 10:48:07 web02 kernel: [   26.410340] eth0: no IPv6 routers present
Feb 17 12:42:40 web02 kernel: [ 6890.123541] ocfs2: Unregistered cluster interface o2cb
===========================================================================================================

On node1 there are no kernel log entries for today; the last DRBD-related
entry was 2 days ago:

===================================================================================================

Feb 15 13:41:37 web01 kernel: [   24.916579] block drbd0: Handshake successful: Agreed network protocol version 91
Feb 15 13:41:37 web01 kernel: [   24.916588] block drbd0: conn( WFConnection -> WFReportParams )
Feb 15 13:41:37 web01 kernel: [   24.916619] block drbd0: Starting asender thread (from drbd0_receiver [1271])
Feb 15 13:41:37 web01 kernel: [   24.917056] block drbd0: data-integrity-alg: <not-used>
Feb 15 13:41:37 web01 kernel: [   24.917073] block drbd0: drbd_sync_handshake:
Feb 15 13:41:37 web01 kernel: [   24.917078] block drbd0: self 37C841BC2AA49AC4:4579E80074D400D3:C117CFF0A5777F0F:0000000000000004 bits:1407177 flags:0
Feb 15 13:41:37 web01 kernel: [   24.917082] block drbd0: peer D3CCDACF6FD7FDB8:4579E80074D400D3:C117CFF0A5777F0F:0000000000000004 bits:14607528 flags:0
Feb 15 13:41:37 web01 kernel: [   24.917086] block drbd0: uuid_compare()=100 by rule 90
Feb 15 13:41:37 web01 kernel: [   24.917089] block drbd0: Split-Brain detected, dropping connection!
Feb 15 13:41:37 web01 kernel: [   24.917463] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Feb 15 13:41:37 web01 kernel: [   24.919876] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Feb 15 13:41:37 web01 kernel: [   24.919882] block drbd0: conn( WFReportParams -> Disconnecting )
Feb 15 13:41:37 web01 kernel: [   24.919892] block drbd0: error receiving ReportState, l: 4!
Feb 15 13:41:37 web01 kernel: [   24.920241] block drbd0: meta connection shut down by peer.
Feb 15 13:41:37 web01 kernel: [   24.920547] block drbd0: asender terminated
Feb 15 13:41:37 web01 kernel: [   24.920553] block drbd0: Terminating asender thread
Feb 15 13:41:37 web01 kernel: [   24.920628] block drbd0: Connection closed
Feb 15 13:41:37 web01 kernel: [   24.920636] block drbd0: conn( Disconnecting -> StandAlone )
Feb 15 13:41:37 web01 kernel: [   24.920655] block drbd0: receiver terminated
Feb 15 13:41:37 web01 kernel: [   24.920661] block drbd0: Terminating receiver thread
Feb 15 13:41:37 web01 kernel: [   24.923261] block drbd0: role( Secondary -> Primary )
Feb 15 13:41:43 web01 kernel: [   31.213404] OCFS2 1.5.0
Feb 15 13:41:43 web01 kernel: [   31.226477] ocfs2_dlm: Nodes in domain ("BFE731CEF8404A02AB70F568D4BC6E03"): 1
=============================================================================================================================================
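
To make the drbd_sync_handshake lines in the node1 log easier to read,
here is a rough sketch that pulls the "self" and "peer" UUID sets out of
the log and compares them. Several things are assumptions on my side: the
field order current:bitmap:history:history, the reading of rule 90 as
"current UUIDs differ while bitmap UUIDs match" (simplified, flag bits are
ignored), and all function names. The 4 KiB per bitmap bit figure is
inferred from the node2 log, where bits=340776791 covers 1363107164 KB.

```python
import re

# Matches e.g. "block drbd0: self AAAA:BBBB:CCCC:DDDD bits:1407177"
HANDSHAKE_RE = re.compile(
    r"block drbd\d+: (self|peer) "
    r"([0-9A-F]{16}):([0-9A-F]{16}):[0-9A-F]{16}:[0-9A-F]{16} "
    r"bits:(\d+)"
)

KIB_PER_BIT = 4  # inferred: 1363107164 KB / 340776791 bits


def parse_handshake(log_text):
    """Extract current UUID, bitmap UUID and out-of-sync bits per side."""
    # Collapse all whitespace so entries wrapped over several lines match too.
    flat = " ".join(log_text.split())
    sides = {}
    for who, current, bitmap, bits in HANDSHAKE_RE.findall(flat):
        sides[who] = {"current": current, "bitmap": bitmap, "bits": int(bits)}
    return sides


def diverged(sides):
    """True if both nodes wrote independently since a common ancestor.

    Simplified reading of the split-brain condition: different current
    UUIDs, identical non-zero bitmap UUIDs.
    """
    s, p = sides["self"], sides["peer"]
    return (
        s["current"] != p["current"]
        and s["bitmap"] == p["bitmap"]
        and int(s["bitmap"], 16) != 0
    )


def out_of_sync_kib(bits):
    """Convert a count of dirty bitmap bits to KiB."""
    return bits * KIB_PER_BIT
```

Fed the node1 entries above, parse_handshake() reports different current
UUIDs with the same bitmap UUID on both sides, which is consistent with
both nodes having accepted writes while disconnected.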


Thanks

Lawrence






On 17 February 2012 12:38, David Coulson <david at davidcoulson.net> wrote:

>
>
> On 2/17/12 4:19 AM, Lawrence Strydom wrote:
>
> Hi List,
>
> I used DRBD in dual primary mode with ocfs2 for my load balancing web
> server cluster. I didn't encounter any errors during setup and when I put
> the web site on the DRBD device on the primary node, it replicated without
> any errors. It has been running fine during the week of testing but this
> morning when we updated code located on the DRBD device we noticed it was
> not replicating to the secondary node.
> The DRBD device was mounted on both nodes, but /proc/drbd showed this:
>
> version: 8.3.7 (api:88/proto:86-91)
> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at web01.junkmail.co.za, 2012-01-10 09:54:40
>  0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
>     ns:0 nr:0 dw:5960937 dr:5047235 al:1490 bm:1363 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:8840028
>
> So you have a split brain, I think. You didn't post the drbd output from
> the other node, so that's just an educated guess.
>
> Shut down ocfs2/o2cb on one node, and follow this:
>
> http://www.drbd.org/users-guide/s-resolve-split-brain.html
>
> Then validate that both are Primary/UpToDate and restart your filesystem
> clustering.
>
> You will need to post all the drbd logs from both boxes to understand what
> the root cause is. You are running an oldish version of drbd, and you
> didn't indicate what your OS/kernel was.
>
> David
>