Thanks for the replies, Felix and David.<br><br>OK, losing data on the one node is not an issue for me at this point, but I cannot afford a repeat. I am very glad this happened now, before going live.<br>I shut down ocfs2 and o2cb on the secondary node and am busy re-syncing now. What could have caused this? Both machines were untouched for a week, with no traffic other than developers testing the site.<br>
<br>I am also busy setting up Nagios monitoring and will re-read the fencing docs to make sure everything is configured correctly.<br><br>My OS is Ubuntu 10.04 with kernel version 2.6.32 on both nodes.<br><br>This is the kernel log from node2:<br>
===========================================<br>Feb 17 10:47:54 web02 kernel: [ 12.894830] OCFS2 Node Manager 1.5.0<br>Feb 17 10:47:54 web02 kernel: [ 12.899444] OCFS2 DLM 1.5.0<br>Feb 17 10:47:54 web02 kernel: [ 12.901012] ocfs2: Registered cluster interface o2cb<br>
Feb 17 10:47:54 web02 kernel: [ 12.910541] OCFS2 DLMFS 1.5.0<br>Feb 17 10:47:54 web02 kernel: [ 12.910820] OCFS2 User DLM kernel interface loaded<br>Feb 17 10:47:54 web02 kernel: [ 13.013907] padlock: VIA PadLock not detected.<br>
Feb 17 10:47:54 web02 kernel: [ 13.016874] alg: No test for __cbc-aes-aesni (cryptd(__driver-cbc-aes-aesni))<br>Feb 17 10:47:54 web02 kernel: [ 13.019825] padlock: VIA PadLock Hash Engine not detected.<br>Feb 17 10:47:54 web02 kernel: [ 13.234666] drbd: initialized. Version: 8.3.7 (api:88/proto:86-91)<br>
Feb 17 10:47:54 web02 kernel: [ 13.234669] drbd: GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@web02, 2012-01-10 09:55:21<br>Feb 17 10:47:54 web02 kernel: [ 13.234672] drbd: registered as block device major 147<br>
Feb 17 10:47:54 web02 kernel: [ 13.234674] drbd: minor_table @ 0xffff88061239ea00<br>Feb 17 10:47:54 web02 kernel: [ 13.238802] block drbd0: Starting worker thread (from cqueue [1482])<br>Feb 17 10:47:54 web02 kernel: [ 13.238983] block drbd0: disk( Diskless -> Attaching ) <br>
Feb 17 10:47:54 web02 kernel: [ 13.258795] block drbd0: Found 4 transactions (14 active extents) in activity log.<br>Feb 17 10:47:54 web02 kernel: [ 13.258799] block drbd0: Method to ensure write ordering: barrier<br>
Feb 17 10:47:54 web02 kernel: [ 13.258803] block drbd0: Backing device's merge_bvec_fn() = ffffffff81439d10<br>Feb 17 10:47:54 web02 kernel: [ 13.258806] block drbd0: max_segment_size ( = BIO size ) = 4096<br>Feb 17 10:47:54 web02 kernel: [ 13.258808] block drbd0: Adjusting my ra_pages to backing device's (32 -> 96)<br>
Feb 17 10:47:54 web02 kernel: [ 13.258812] block drbd0: drbd_bm_resize called with capacity == 2726214328<br>Feb 17 10:47:54 web02 kernel: [ 13.268969] block drbd0: resync bitmap: bits=340776791 words=5324638<br>Feb 17 10:47:54 web02 kernel: [ 13.268976] block drbd0: size = 1300 GB (1363107164 KB)<br>
Feb 17 10:47:54 web02 kernel: [ 13.531587] block drbd0: recounting of set bits took additional 5 jiffies<br>Feb 17 10:47:54 web02 kernel: [ 13.531592] block drbd0: 56 GB (14607631 bits) marked out-of-sync by on disk bit-map.<br>
Feb 17 10:47:54 web02 kernel: [ 13.531600] block drbd0: disk( Attaching -> UpToDate ) <br>Feb 17 10:47:54 web02 kernel: [ 13.535865] block drbd0: conn( StandAlone -> Unconnected ) <br>Feb 17 10:47:54 web02 kernel: [ 13.535889] block drbd0: Starting receiver thread (from drbd0_worker [1484])<br>
Feb 17 10:47:54 web02 kernel: [ 13.535998] block drbd0: receiver (re)started<br>Feb 17 10:47:54 web02 kernel: [ 13.536006] block drbd0: conn( Unconnected -> WFConnection ) <br>Feb 17 10:47:54 web02 kernel: [ 13.716586] Adding 31248376k swap on /dev/mapper/cryptswap1. Priority:-1 extents:1 across:31248376k <br>
Feb 17 10:47:57 web02 kernel: [ 15.806435] bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex<br>Feb 17 10:47:57 web02 kernel: [ 15.808235] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready<br>Feb 17 10:47:57 web02 kernel: [ 16.001305] bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON<br>
Feb 17 10:47:57 web02 kernel: [ 16.003044] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready<br>Feb 17 10:48:07 web02 kernel: [ 25.861028] eth1: no IPv6 routers present<br>Feb 17 10:48:07 web02 kernel: [ 26.410340] eth0: no IPv6 routers present<br>
Feb 17 12:42:40 web02 kernel: [ 6890.123541] ocfs2: Unregistered cluster interface o2cb<br>===========================================================================================================<br><br>On node1 there are no kernel log entries for today; the last DRBD-related entry was 2 days ago:<br>
<br>===================================================================================================<br><br>Feb 15 13:41:37 web01 kernel: [ 24.916579] block drbd0: Handshake successful: Agreed network protocol version 91<br>
Feb 15 13:41:37 web01 kernel: [ 24.916588] block drbd0: conn( WFConnection -> WFReportParams ) <br>Feb 15 13:41:37 web01 kernel: [ 24.916619] block drbd0: Starting asender thread (from drbd0_receiver [1271])<br>Feb 15 13:41:37 web01 kernel: [ 24.917056] block drbd0: data-integrity-alg: <not-used><br>
Feb 15 13:41:37 web01 kernel: [ 24.917073] block drbd0: drbd_sync_handshake:<br>Feb 15 13:41:37 web01 kernel: [ 24.917078] block drbd0: self 37C841BC2AA49AC4:4579E80074D400D3:C117CFF0A5777F0F:0000000000000004 bits:1407177 flags:0<br>
Feb 15 13:41:37 web01 kernel: [ 24.917082] block drbd0: peer D3CCDACF6FD7FDB8:4579E80074D400D3:C117CFF0A5777F0F:0000000000000004 bits:14607528 flags:0<br>Feb 15 13:41:37 web01 kernel: [ 24.917086] block drbd0: uuid_compare()=100 by rule 90<br>
Feb 15 13:41:37 web01 kernel: [ 24.917089] block drbd0: Split-Brain detected, dropping connection!<br>Feb 15 13:41:37 web01 kernel: [ 24.917463] block drbd0: helper command: /sbin/drbdadm split-brain minor-0<br>Feb 15 13:41:37 web01 kernel: [ 24.919876] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)<br>
Feb 15 13:41:37 web01 kernel: [ 24.919882] block drbd0: conn( WFReportParams -> Disconnecting ) <br>Feb 15 13:41:37 web01 kernel: [ 24.919892] block drbd0: error receiving ReportState, l: 4!<br>Feb 15 13:41:37 web01 kernel: [ 24.920241] block drbd0: meta connection shut down by peer.<br>
Feb 15 13:41:37 web01 kernel: [ 24.920547] block drbd0: asender terminated<br>Feb 15 13:41:37 web01 kernel: [ 24.920553] block drbd0: Terminating asender thread<br>Feb 15 13:41:37 web01 kernel: [ 24.920628] block drbd0: Connection closed<br>
Feb 15 13:41:37 web01 kernel: [ 24.920636] block drbd0: conn( Disconnecting -> StandAlone ) <br>Feb 15 13:41:37 web01 kernel: [ 24.920655] block drbd0: receiver terminated<br>Feb 15 13:41:37 web01 kernel: [ 24.920661] block drbd0: Terminating receiver thread<br>
Feb 15 13:41:37 web01 kernel: [ 24.923261] block drbd0: role( Secondary -> Primary ) <br>Feb 15 13:41:43 web01 kernel: [ 31.213404] OCFS2 1.5.0<br>Feb 15 13:41:43 web01 kernel: [ 31.226477] ocfs2_dlm: Nodes in domain ("BFE731CEF8404A02AB70F568D4BC6E03"): 1 <br>
=============================================================================================================================================<br><br><br>Thanks<br><br>Lawrence<br><br><div class="gmail_quote">
On 17 February 2012 12:38, David Coulson <span dir="ltr"><<a href="mailto:david@davidcoulson.net">david@davidcoulson.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><div class="im">
<br>
<br>
On 2/17/12 4:19 AM, Lawrence Strydom wrote:
<blockquote type="cite">Hi List,<br>
<br>
I used DRBD in dual primary mode with ocfs2 for my load balancing
web server cluster. I didn't encounter any errors during setup and
when I put the web site on the DRBD device on the primary node, it
replicated without any errors. It has been running fine during the
week of testing but this morning when we updated code located on
the DRBD device we noticed it was not replicating to the secondary
node. <br>
The DRBD device was mounted on both nodes, but /proc/drbd showed
this:<br>
<br>
<b>version: 8.3.7 (api:88/proto:86-91)<br>
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by <a href="mailto:root@web01.junkmail.co.za" target="_blank">root@web01.junkmail.co.za</a>,
2012-01-10 09:54:40<br>
0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown
r----<br>
ns:0 nr:0 dw:5960937 dr:5047235 al:1490 bm:1363 lo:0 pe:0
ua:0 ap:0 ep:1 wo:b oos:8840028</b></blockquote></div>
So you have a split brain, I think. You didn't post the drbd output
from the other node, so that's just an educated guess.<br>
<br>
Shut down ocfs2/o2cb on one node, and follow this:<br>
<br>
<a href="http://www.drbd.org/users-guide/s-resolve-split-brain.html" target="_blank">http://www.drbd.org/users-guide/s-resolve-split-brain.html</a><br>
<br>
Then validate that both nodes are Primary/UpToDate and restart your filesystem
clustering.<br>
<br>
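The validation step above could be scripted, for example as a Nagios-style check. This is only a minimal sketch, assuming the DRBD 8.3 /proc/drbd layout shown earlier in this thread; the function name check_drbd and the choice to treat anything other than Connected, Primary/Primary, UpToDate/UpToDate as critical are illustrative assumptions for a dual-primary setup, not part of DRBD or Nagios.<br>

```python
import re

# Matches the per-device status line of /proc/drbd (DRBD 8.3 layout), e.g.
# " 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----"
STATUS_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+ro:(\S+)\s+ds:(\S+)", re.MULTILINE)


def check_drbd(proc_text):
    """Return (exit_code, message) using Nagios plugin codes.

    0 = OK, 2 = CRITICAL, 3 = UNKNOWN. The health criteria below are an
    assumption suited to a dual-primary cluster.
    """
    devices = STATUS_RE.findall(proc_text)
    if not devices:
        return 3, "UNKNOWN: no DRBD device lines found"

    problems = []
    for minor, cs, ro, ds in devices:
        healthy = (
            cs == "Connected"
            and ro == "Primary/Primary"
            and ds == "UpToDate/UpToDate"
        )
        if not healthy:
            problems.append("drbd%s cs:%s ro:%s ds:%s" % (minor, cs, ro, ds))

    if problems:
        return 2, "CRITICAL: " + "; ".join(problems)
    return 0, "OK: %d device(s) connected, dual primary, up to date" % len(devices)
```

To use it as a check, read /proc/drbd on each node, pass the text to check_drbd, print the message, and exit with the returned code. Fed the status quoted above (cs:StandAlone ro:Primary/Unknown), it reports CRITICAL, which is the condition that went unnoticed here.<br>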
You will need to post all the drbd logs from both boxes to
understand what the root cause is. You are running an oldish version of
drbd, plus you didn't indicate what your OS/kernel was.<span class="HOEnZb"><font color="#888888"><br>
<br>
David<br>
</font></span></div>
</blockquote></div><br>