Thanks for the replies, Felix and David.

OK, losing data on the one node is not an issue for me at this point, but I cannot afford a repeat. I am very glad this happened now, before going live. I shut down ocfs2 and o2cb on the secondary node and am busy re-syncing now.

What could have caused this? The machines were both untouched for a week, with no traffic other than developers testing the site. I am busy setting up Nagios monitoring as well and will re-read the fencing docs to make sure all is good.

My OS is Ubuntu 10.04, kernel version 2.6.32, on both nodes.

This is the kernel log from node2:
===========================================
Feb 17 10:47:54 web02 kernel: [ 12.894830] OCFS2 Node Manager 1.5.0
Feb 17 10:47:54 web02 kernel: [ 12.899444] OCFS2 DLM 1.5.0
Feb 17 10:47:54 web02 kernel: [ 12.901012] ocfs2: Registered cluster interface o2cb
Feb 17 10:47:54 web02 kernel: [ 12.910541] OCFS2 DLMFS 1.5.0
Feb 17 10:47:54 web02 kernel: [ 12.910820] OCFS2 User DLM kernel interface loaded
Feb 17 10:47:54 web02 kernel: [ 13.013907] padlock: VIA PadLock not detected.
Feb 17 10:47:54 web02 kernel: [ 13.016874] alg: No test for __cbc-aes-aesni (cryptd(__driver-cbc-aes-aesni))
Feb 17 10:47:54 web02 kernel: [ 13.019825] padlock: VIA PadLock Hash Engine not detected.
Feb 17 10:47:54 web02 kernel: [ 13.234666] drbd: initialized. Version: 8.3.7 (api:88/proto:86-91)
Feb 17 10:47:54 web02 kernel: [ 13.234669] drbd: GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at web02, 2012-01-10 09:55:21
Feb 17 10:47:54 web02 kernel: [ 13.234672] drbd: registered as block device major 147
Feb 17 10:47:54 web02 kernel: [ 13.234674] drbd: minor_table @ 0xffff88061239ea00
Feb 17 10:47:54 web02 kernel: [ 13.238802] block drbd0: Starting worker thread (from cqueue [1482])
Feb 17 10:47:54 web02 kernel: [ 13.238983] block drbd0: disk( Diskless -> Attaching )
Feb 17 10:47:54 web02 kernel: [ 13.258795] block drbd0: Found 4 transactions (14 active extents) in activity log.
Feb 17 10:47:54 web02 kernel: [ 13.258799] block drbd0: Method to ensure write ordering: barrier
Feb 17 10:47:54 web02 kernel: [ 13.258803] block drbd0: Backing device's merge_bvec_fn() = ffffffff81439d10
Feb 17 10:47:54 web02 kernel: [ 13.258806] block drbd0: max_segment_size ( = BIO size ) = 4096
Feb 17 10:47:54 web02 kernel: [ 13.258808] block drbd0: Adjusting my ra_pages to backing device's (32 -> 96)
Feb 17 10:47:54 web02 kernel: [ 13.258812] block drbd0: drbd_bm_resize called with capacity == 2726214328
Feb 17 10:47:54 web02 kernel: [ 13.268969] block drbd0: resync bitmap: bits=340776791 words=5324638
Feb 17 10:47:54 web02 kernel: [ 13.268976] block drbd0: size = 1300 GB (1363107164 KB)
Feb 17 10:47:54 web02 kernel: [ 13.531587] block drbd0: recounting of set bits took additional 5 jiffies
Feb 17 10:47:54 web02 kernel: [ 13.531592] block drbd0: 56 GB (14607631 bits) marked out-of-sync by on disk bit-map.
Feb 17 10:47:54 web02 kernel: [ 13.531600] block drbd0: disk( Attaching -> UpToDate )
Feb 17 10:47:54 web02 kernel: [ 13.535865] block drbd0: conn( StandAlone -> Unconnected )
Feb 17 10:47:54 web02 kernel: [ 13.535889] block drbd0: Starting receiver thread (from drbd0_worker [1484])
Feb 17 10:47:54 web02 kernel: [ 13.535998] block drbd0: receiver (re)started
Feb 17 10:47:54 web02 kernel: [ 13.536006] block drbd0: conn( Unconnected -> WFConnection )
Feb 17 10:47:54 web02 kernel: [ 13.716586] Adding 31248376k swap on /dev/mapper/cryptswap1. Priority:-1 extents:1 across:31248376k
Feb 17 10:47:57 web02 kernel: [ 15.806435] bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
Feb 17 10:47:57 web02 kernel: [ 15.808235] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Feb 17 10:47:57 web02 kernel: [ 16.001305] bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Feb 17 10:47:57 web02 kernel: [ 16.003044] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Feb 17 10:48:07 web02 kernel: [ 25.861028] eth1: no IPv6 routers present
Feb 17 10:48:07 web02 kernel: [ 26.410340] eth0: no IPv6 routers present
Feb 17 12:42:40 web02 kernel: [ 6890.123541] ocfs2: Unregistered cluster interface o2cb
===========================================

On node1 there are no kernel log entries for today; the last DRBD-related entry was two days ago:
===========================================
Feb 15 13:41:37 web01 kernel: [ 24.916579] block drbd0: Handshake successful: Agreed network protocol version 91
Feb 15 13:41:37 web01 kernel: [ 24.916588] block drbd0: conn( WFConnection -> WFReportParams )
Feb 15 13:41:37 web01 kernel: [ 24.916619] block drbd0: Starting asender thread (from drbd0_receiver [1271])
Feb 15 13:41:37 web01 kernel: [ 24.917056] block drbd0: data-integrity-alg: <not-used>
Feb 15 13:41:37 web01 kernel: [ 24.917073] block drbd0: drbd_sync_handshake:
Feb 15 13:41:37 web01 kernel: [ 24.917078] block drbd0: self 37C841BC2AA49AC4:4579E80074D400D3:C117CFF0A5777F0F:0000000000000004 bits:1407177 flags:0
Feb 15 13:41:37 web01 kernel: [ 24.917082] block drbd0: peer D3CCDACF6FD7FDB8:4579E80074D400D3:C117CFF0A5777F0F:0000000000000004 bits:14607528 flags:0
Feb 15 13:41:37 web01 kernel: [ 24.917086] block drbd0: uuid_compare()=100 by rule 90
Feb 15 13:41:37 web01 kernel: [ 24.917089] block drbd0: Split-Brain detected, dropping connection!
Feb 15 13:41:37 web01 kernel: [ 24.917463] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Feb 15 13:41:37 web01 kernel: [ 24.919876] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Feb 15 13:41:37 web01 kernel: [ 24.919882] block drbd0: conn( WFReportParams -> Disconnecting )
Feb 15 13:41:37 web01 kernel: [ 24.919892] block drbd0: error receiving ReportState, l: 4!
Feb 15 13:41:37 web01 kernel: [ 24.920241] block drbd0: meta connection shut down by peer.
Feb 15 13:41:37 web01 kernel: [ 24.920547] block drbd0: asender terminated
Feb 15 13:41:37 web01 kernel: [ 24.920553] block drbd0: Terminating asender thread
Feb 15 13:41:37 web01 kernel: [ 24.920628] block drbd0: Connection closed
Feb 15 13:41:37 web01 kernel: [ 24.920636] block drbd0: conn( Disconnecting -> StandAlone )
Feb 15 13:41:37 web01 kernel: [ 24.920655] block drbd0: receiver terminated
Feb 15 13:41:37 web01 kernel: [ 24.920661] block drbd0: Terminating receiver thread
Feb 15 13:41:37 web01 kernel: [ 24.923261] block drbd0: role( Secondary -> Primary )
Feb 15 13:41:43 web01 kernel: [ 31.213404] OCFS2 1.5.0
Feb 15 13:41:43 web01 kernel: [ 31.226477] ocfs2_dlm: Nodes in domain ("BFE731CEF8404A02AB70F568D4BC6E03"): 1
===========================================

Thanks,
Lawrence

On 17 February 2012 12:38, David Coulson <david at davidcoulson.net> wrote:

>
> On 2/17/12 4:19 AM, Lawrence Strydom wrote:
>
> Hi List,
>
> I used DRBD in dual primary mode with ocfs2 for my load balancing web
> server cluster. I didn't encounter any errors during setup and when I put
> the web site on the DRBD device on the primary node, it replicated without
> any errors. It has been running fine during the week of testing but this
> morning when we updated code located on the DRBD device we noticed it was
> not replicating to the secondary node.
> the DRBD device was mounted on both nodes but /proc/drbd output this:
>
> version: 8.3.7 (api:88/proto:86-91)
> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by
> root at web01.junkmail.co.za, 2012-01-10 09:54:40
>  0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----
>     ns:0 nr:0 dw:5960937 dr:5047235 al:1490 bm:1363 lo:0 pe:0 ua:0 ap:0
>     ep:1 wo:b oos:8840028
>
> So you have a split brain, I think - you didn't post the drbd output from
> the other node, so that's just an educated guess.
>
> Shut down ocfs2/o2cb on one node, and follow this:
>
> http://www.drbd.org/users-guide/s-resolve-split-brain.html
>
> Then validate both are primary/uptodate and restart your filesystem
> clustering.
>
> You will need to post all the drbd logs from both boxes to understand what
> the root cause is. You are running an oldish version of drbd, plus you didn't
> indicate what your os/kernel was.
>
> David
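
Since Nagios monitoring is being set up anyway, the /proc/drbd state quoted above (cs:StandAlone, ro:Primary/Unknown, ds:UpToDate/DUnknown) is the sort of thing a check could have flagged two days earlier. Below is a minimal Python sketch of such a check, not a tested plugin. It assumes the DRBD 8.3 status line layout shown in this thread; the function names and the severity policy (SyncSource/SyncTarget as WARNING, any other non-Connected state as CRITICAL) are my own assumptions and should be adjusted to taste. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import re

# Matches the per-device status line of /proc/drbd as printed by DRBD 8.3, e.g.
#  0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_proc_drbd(text):
    """Return one dict (minor, cs, ro, ds) per device status line in the text."""
    devices = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            devices.append(match.groupdict())
    return devices


def check_device(dev):
    """Map a device state to a Nagios-style (exit_code, message) pair.

    Policy (an assumption, not taken from the DRBD docs):
      - Connected with both disks UpToDate  -> OK (0)
      - resync in progress or a disk not UpToDate -> WARNING (1)
      - any other connection state, e.g. StandAlone -> CRITICAL (2)
    """
    summary = "drbd{minor}: cs:{cs} ro:{ro} ds:{ds}".format(**dev)
    if dev["cs"] in ("SyncSource", "SyncTarget"):
        return 1, "WARNING " + summary
    if dev["cs"] != "Connected":
        return 2, "CRITICAL " + summary
    if dev["ds"] != "UpToDate/UpToDate":
        return 1, "WARNING " + summary
    return 0, "OK " + summary


# The status text posted earlier in this thread, used here as sample input.
SAMPLE = """version: 8.3.7 (api:88/proto:86-91)
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----
    ns:0 nr:0 dw:5960937 dr:5047235 al:1490 bm:1363 lo:0 pe:0 ua:0 ap:0
    ep:1 wo:b oos:8840028
"""
```

On a real node one would read the text with open("/proc/drbd").read(), run check_device() on each parsed device, print the messages, and exit with the highest code returned. For the sample above the device is reported as CRITICAL because the connection state is StandAlone.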