<div dir="ltr">I'd like to preface this by saying that I'm not overly experienced with DRBD, and I've been piecing a lot of information together from various documentation sources, including those on the DRBD site.<div>
<br></div><div>I have two servers with the same hardware configuration, connected to two identical MD1200s. I sync these two servers in primary/primary mode using DRBD with OCFS2. Although DRBD runs in primary/primary mode, in practice one server acts as the primary and the other as the secondary. Some tasks that modify the drbd/ocfs2 volume run on the secondary, but that workload is kept light. Each server has two Ethernet ports: one for its LAN connection, and one for a direct link to the other server that is used solely for DRBD and OCFS2 communication.</div>
<div><br></div><div>The primary server, named bellerophon, is a web, database and file storage server, with the database (postgresql) and documents stored on the drbd/ocfs2 volume. The secondary server, named bia, is mainly used as a hot backup.</div>
<div><br></div><div>Until May 29th we had very good success with this setup, and we even deployed another server pair with the same configuration, though for a different purpose. On the 29th of May, however, we started seeing timeouts:</div>
<div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><font face="courier new, monospace">Feb 28 15:13:01 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br>
</font><font face="courier new, monospace">Feb 28 16:19:06 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br></font><font face="courier new, monospace">May 29 16:54:17 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br>
</font><font face="courier new, monospace">May 30 11:22:58 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br></font><font face="courier new, monospace">Jun 2 10:19:16 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br>
</font><font face="courier new, monospace">Jun 2 11:30:46 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br></font><font face="courier new, monospace">Jun 3 13:37:47 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br>
</font><font face="courier new, monospace">Jun 3 14:07:31 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br></font><font face="courier new, monospace">Jun 3 14:52:39 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br>
</font><font face="courier new, monospace">Jun 3 15:19:16 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br></font><font face="courier new, monospace">Jun 3 15:58:51 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br>
</font><font face="courier new, monospace">Jun 4 11:11:39 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br></font><font face="courier new, monospace">Jun 4 11:29:18 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br>
</font><font face="courier new, monospace">Jun 4 12:42:40 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br>
</font><font face="courier new, monospace">Jun 5 08:01:41 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br></font><font face="courier new, monospace">Jun 5 09:02:12 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting</font></blockquote>
</div><div><br></div><div>I resolve these by telling the secondary server to reconnect and discard its own data, so that the primary server syncs everything over to the secondary; then we put everything back into primary/primary mode. Because of these issues, I've moved all the work that used to run on the secondary server over to the primary, so that nothing is lost when I resync.</div>
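<div><br></div><div>For reference, a recovery of this kind would look roughly like the sketch below in DRBD 8.4 command syntax. It assumes that bia is the node whose changes are discarded, that both nodes have dropped to StandAlone, and that the OCFS2 volume is unmounted on bia before it is demoted. The DRBDADM and RES variables and the function names are purely illustrative; by default the script only prints the commands instead of running them.</div>
<div><pre>
```shell
#!/bin/sh
# Sketch of a manual split-brain recovery for resource r0 (DRBD 8.4 syntax).
# DRBDADM defaults to a dry run that only prints each command; set
# DRBDADM=drbdadm to really execute them. DRBDADM, RES and the function
# names below are illustrative helpers, not part of DRBD.
DRBDADM="${DRBDADM:-echo drbdadm}"
RES="${RES:-r0}"

# Run on the node whose changes are thrown away (assumed here: bia).
# Assumes the OCFS2 volume is already unmounted on that node and that
# the resource is StandAlone.
discard_local_changes() {
    $DRBDADM secondary "$RES" || return 1
    $DRBDADM connect --discard-my-data "$RES" || return 1
}

# Run on the node whose data is kept (assumed here: bellerophon),
# if it is StandAlone as well.
reconnect_survivor() {
    $DRBDADM connect "$RES"
}

# Run on the discarded node once the resync has finished, to return
# to primary/primary mode.
promote_again() {
    $DRBDADM primary "$RES"
}
```
</pre></div>
<div>Calling discard_local_changes on bia, then reconnect_survivor on bellerophon, and finally promote_again on bia prints the commands in the order they would be run; setting DRBDADM=drbdadm executes them for real.</div>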
<div><br></div><div>We didn't change anything in the configuration of the servers on or before the 29th of May, so I'm at a loss as to why this is happening now. I restarted both servers last night, but as you can see, the volume has already gone out of sync three times today.</div>
<div><div><br></div><div>If it matters, here's the output of /proc/drbd:</div><div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<font face="courier new, monospace">version: 8.4.3 (api:1/proto:86-101)<br></font><font face="courier new, monospace">built-in</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----<br>
</font><font face="courier new, monospace"> ns:1500684 nr:3532 dw:73274267 dr:187548582 al:29429 bm:594 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0</font></blockquote></div><div><br></div><div>We are using Gentoo as our host OS, with kernel 3.10.25, which (as seen above) has DRBD 8.4.3. I am planning on moving up to the current stable kernel, which is 3.12.20, but that is also still at 8.4.3 (though with some code changes between them, judging from the diff I did between the two kernel sources).</div>
<div><br></div><div>Here are the kernel messages that I'm seeing in the syslog for the disconnect that happened at 7:32 AM.</div><div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: error receiving Data, e: -110 l: 512!<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: new current UUID C68B8BD454AAF64B:E240A278745017DB:9914908994F205E5:9913908994F205E5<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: asender terminated<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: Terminating drbd_a_r0<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: Connection closed<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( ProtocolError -> Unconnected )<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: receiver terminated<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: Restarting receiver thread<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: receiver (re)started<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( Unconnected -> WFConnection )<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: Handshake successful: Agreed network protocol version 101<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( WFConnection -> WFReportParams )<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: Starting asender thread (from drbd_r_r0 [2699])<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: drbd_sync_handshake:<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: self C68B8BD454AAF64B:E240A278745017DB:9914908994F205E5:9913908994F205E5 bits:236 flags:0<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: peer 5AE3A4FE88D7DBA3:E240A278745017DB:9914908994F205E4:9913908994F205E5 bits:1 flags:0<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: uuid_compare()=100 by rule 90<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( WFReportParams -> Disconnecting )<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: error receiving ReportState, e: -5 l: 0!<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: asender terminated<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: Terminating drbd_a_r0<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: Connection closed<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: conn( Disconnecting -> StandAlone )<br>
</font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: receiver terminated<br></font><font face="courier new, monospace">Jun 5 07:32:35 bellerophon kernel: d-con r0: Terminating drbd_r_r0</font></blockquote>
</div><div><br></div><div>Here is the configuration file that I'm using:</div><div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<font face="courier new, monospace"># cat /etc/drbd.d/global_common.conf<br></font><font face="courier new, monospace">global {<br></font><font face="courier new, monospace"> usage-count yes;<br></font><font face="courier new, monospace"> # minor-count dialog-refresh disable-ip-verification<br>
</font><font face="courier new, monospace">}</font><font face="courier new, monospace"><br></font><font face="courier new, monospace">common {<br></font><font face="courier new, monospace"> handlers {<br></font><font face="courier new, monospace"> # These are EXAMPLE handlers only.<br>
</font><font face="courier new, monospace"> # They may have severe implications,<br></font><font face="courier new, monospace"> # like hard resetting the node under certain circumstances.<br>
</font><font face="courier new, monospace"> # Be careful when chosing your poison.</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> # pri-on-incon-degr "/usr/lib64/drbd/notify-pri-on-incon-degr.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";<br>
</font><font face="courier new, monospace"> # pri-lost-after-sb "/usr/lib64/drbd/notify-pri-lost-after-sb.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";<br>
</font><font face="courier new, monospace"> # local-io-error "/usr/lib64/drbd/notify-io-error.sh; /usr/lib64/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";<br></font><font face="courier new, monospace"> # fence-peer "/usr/lib64/drbd/crm-fence-peer.sh";<br>
</font><font face="courier new, monospace"> # split-brain "/usr/lib64/drbd/notify-split-brain.sh root";<br></font><font face="courier new, monospace"> # out-of-sync "/usr/lib64/drbd/notify-out-of-sync.sh root";<br>
</font><font face="courier new, monospace"> # before-resync-target "/usr/lib64/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";<br></font><font face="courier new, monospace"> # after-resync-target /usr/lib64/drbd/unsnapshot-resync-target-lvm.sh;<br>
</font><font face="courier new, monospace"> }</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> startup {<br></font><font face="courier new, monospace"> # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb<br>
</font><font face="courier new, monospace"> }</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> options {<br></font><font face="courier new, monospace"> # cpu-mask on-no-data-accessible<br>
</font><font face="courier new, monospace"> }</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> disk {<br></font><font face="courier new, monospace"> # size max-bio-bvecs on-io-error fencing disk-barrier disk-flushes<br>
</font><font face="courier new, monospace"> # disk-drain md-flushes resync-rate resync-after al-extents<br></font><font face="courier new, monospace"> # c-plan-ahead c-delay-target c-fill-target c-max-rate<br>
</font><font face="courier new, monospace"> # c-min-rate disk-timeout<br></font><font face="courier new, monospace"> }</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> net {<br>
</font><font face="courier new, monospace"> # protocol timeout max-epoch-size max-buffers unplug-watermark<br></font><font face="courier new, monospace"> # connect-int ping-int sndbuf-size rcvbuf-size ko-count<br>
</font><font face="courier new, monospace"> # allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri<br></font><font face="courier new, monospace"> # after-sb-1pri after-sb-2pri always-asbp rr-conflict<br>
</font><font face="courier new, monospace"> # ping-timeout data-integrity-alg tcp-cork on-congestion<br></font><font face="courier new, monospace"> # congestion-fill congestion-extents csums-alg verify-alg<br>
</font><font face="courier new, monospace"> # use-rle<br></font><font face="courier new, monospace"> }<br></font><font face="courier new, monospace">}</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"># cat /etc/drbd.d/dms.res<br>
</font><font face="courier new, monospace">resource r0 {<br></font><font face="courier new, monospace"> disk {<br></font><font face="courier new, monospace"> al-extents 3389;<br></font><font face="courier new, monospace"> disk-barrier no;<br>
</font><font face="courier new, monospace"> disk-flushes no;<br></font><font face="courier new, monospace"> }</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> startup {<br>
</font><font face="courier new, monospace"> wfc-timeout 15;<br></font><font face="courier new, monospace"> degr-wfc-timeout 60;<br></font><font face="courier new, monospace"> become-primary-on both;<br>
</font><font face="courier new, monospace"> }</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> net {<br></font><font face="courier new, monospace"># allow-two-primaries - Generally, DRBD has a primary and a secondary node.<br>
</font><font face="courier new, monospace"># In this case, we will allow both nodes to have the filesystem mounted at<br></font><font face="courier new, monospace"># the same time. Do this only with a clustered filesystem. If you do this<br>
</font><font face="courier new, monospace"># with a non-clustered filesystem like ext2/ext3/ext4 or reiserfs, you will<br></font><font face="courier new, monospace"># have data corruption.<br></font><font face="courier new, monospace"> allow-two-primaries;</font><font face="courier new, monospace"><br>
</font><font face="courier new, monospace"># after-sb-0pri discard-zero-changes - DRBD detected a split-brain scenario,<br></font><font face="courier new, monospace"># but none of the nodes think they're a primary. DRBD will take the newest<br>
</font><font face="courier new, monospace"># modifications and apply them to the node that didn't have any changes.<br></font><font face="courier new, monospace"> after-sb-0pri discard-zero-changes;</font><font face="courier new, monospace"><br>
</font><font face="courier new, monospace"># after-sb-1pri discard-secondary - DRBD detected a split-brain scenario,<br></font><font face="courier new, monospace"># but one node is the primary and the other is the secondary. In this case,<br>
</font><font face="courier new, monospace"># DRBD will decide that the secondary node is the victim and it will sync data<br></font><font face="courier new, monospace"># from the primary to the secondary automatically.<br>
</font><font face="courier new, monospace"> after-sb-1pri discard-secondary;</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"># after-sb-2pri disconnect - DRBD detected a split-brain scenario, but it can't<br>
</font><font face="courier new, monospace"># figure out which node has the right data. It tries to protect the consistency<br></font><font face="courier new, monospace"># of both nodes by disconnecting the DRBD volume entirely. You'll have to tell<br>
</font><font face="courier new, monospace"># DRBD which node has the valid data in order to reconnect the volume.<br></font><font face="courier new, monospace"> after-sb-2pri disconnect;</font><font face="courier new, monospace"><br>
</font><font face="courier new, monospace"> max-buffers 8000;<br></font><font face="courier new, monospace"> max-epoch-size 8000;<br></font><font face="courier new, monospace"> sndbuf-size 512k;<br>
</font><font face="courier new, monospace"> }</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> on bellerophon {<br></font><font face="courier new, monospace"> device /dev/drbd1;<br>
</font><font face="courier new, monospace"> disk /dev/sda1;<br></font><font face="courier new, monospace"> address 172.16.0.10:7789;<br></font><font face="courier new, monospace"> meta-disk internal;<br>
</font><font face="courier new, monospace"> }</font><font face="courier new, monospace"><br></font><font face="courier new, monospace"> on bia {<br></font><font face="courier new, monospace"> device /dev/drbd1;<br>
</font><font face="courier new, monospace"> disk /dev/sda1;<br></font><font face="courier new, monospace"> address 172.16.0.11:7789;<br></font><font face="courier new, monospace"> meta-disk internal;<br>
</font><font face="courier new, monospace"> }<br></font><font face="courier new, monospace">}</font></blockquote></div><div><br></div><div>If anyone has any information that might assist me in figuring out what the issue is here, I'd really appreciate it.</div>
<div><br></div><div>Adam.</div><div><br></div>-- <br>Adam Randall<br>"To err is human... to really foul up requires the root password."
</div></div>