[DRBD-user] communication constantly terminated, always re-syncing

Cory Coager ccoager at davisvision.com
Thu Feb 24 18:03:53 CET 2011


I have a two-node cluster using Pacemaker and DRBD. The machines'
hardware and software are identical: HP ProLiant DL585 G2, 4x dual-core
CPUs (8 cores total), 96 GB RAM, 8x 146 GB 15k rpm drives in RAID 10,
Ubuntu 10.04.1, kernel 2.6.32-21-server x86_64, drbd 8.3.7-1ubuntu2.1.
There are three partitions: root, swap, and a DRBD partition.

There are two NICs, one for the server and the second dedicated to
pacemaker/drbd. The pacemaker/drbd NICs used to be on separate switches.
The problem has been happening ever since the initial setup. We tried
taking the NICs off the switches, and they are now connected by a
crossover cable directly between the two servers. This didn't resolve
the issue.

Pacemaker sees both nodes and DRBD is syncing data. However, the slave
is constantly getting communication terminated errors and its disk
becomes inconsistent. This results in constant resyncs, some of them
very large:

Feb 21 20:30:25 node1 kernel: [279371.926494] block drbd0: Resync done (total 1 sec; paused 0 sec; 380 K/sec)
Feb 21 20:30:26 node1 kernel: [279372.748328] block drbd0: Resync done (total 1 sec; paused 0 sec; 216 K/sec)
Feb 21 20:30:30 node1 kernel: [279376.838319] block drbd0: Resync done (total 1 sec; paused 0 sec; 56 K/sec)
Feb 21 21:54:38 node1 kernel: [284425.156138] block drbd0: Resync done (total 409 sec; paused 0 sec; 114488 K/sec)
Feb 21 22:00:11 node1 kernel: [284758.103399] block drbd0: Resync done (total 1 sec; paused 0 sec; 336 K/sec)
Feb 21 22:00:31 node1 kernel: [284778.286563] block drbd0: Resync done (total 1 sec; paused 0 sec; 176 K/sec)
Feb 21 22:01:01 node1 kernel: [284807.602122] block drbd0: Resync done (total 1 sec; paused 0 sec; 80 K/sec)
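
For anyone wanting to quantify how often and how large these resyncs
are, here is a minimal sketch that tallies "Resync done" entries from a
syslog excerpt. The names parse_resyncs and SAMPLE are made up for
illustration and are not part of any DRBD tooling. The size estimate
(total seconds times K/sec) is only a rough figure, since the log
reports whole seconds.

```python
import re

# Matches a DRBD "Resync done" kernel message. \s+ between tokens lets the
# pattern tolerate an entry that a mail client wrapped onto a second line.
RESYNC_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:[^\n]*"
    r"Resync done\s+\(total\s+(?P<total>\d+)\s+sec;\s+"
    r"paused\s+(?P<paused>\d+)\s+sec;\s+(?P<rate>\d+)\s+K/sec\)",
    re.MULTILINE,
)


def parse_resyncs(log_text):
    """Return one dict per 'Resync done' entry found in log_text."""
    events = []
    for m in RESYNC_RE.finditer(log_text):
        total = int(m.group("total"))
        rate = int(m.group("rate"))
        events.append({
            "stamp": m.group("stamp"),
            "total_sec": total,
            "rate_k_per_sec": rate,
            # Rough estimate only: duration is rounded to whole seconds.
            "approx_k": total * rate,
        })
    return events


# Two entries copied from the log above (illustrative input).
SAMPLE = """\
Feb 21 20:30:25 node1 kernel: [279371.926494] block drbd0: Resync done
(total 1 sec; paused 0 sec; 380 K/sec)
Feb 21 21:54:38 node1 kernel: [284425.156138] block drbd0: Resync done
(total 409 sec; paused 0 sec; 114488 K/sec)
"""

for event in parse_resyncs(SAMPLE):
    print(event["stamp"], event["total_sec"], "sec,",
          "about", event["approx_k"], "K")
```

Feeding it the whole kern.log instead of SAMPLE would show whether the
large resyncs cluster around particular times of day.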


Here is the full sequence of log messages for one disconnect/resync
cycle, from start to finish. This happens every minute or so:
Feb 24 11:24:41 node1 cib: [30431]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-39.raw
Feb 24 11:24:41 node1 kernel: [505827.918922] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Feb 24 11:24:41 node1 kernel: [505827.918927] block drbd0: fence-peer helper returned 4 (peer was fenced)
Feb 24 11:24:41 node1 kernel: [505827.918943] block drbd0: pdsk( DUnknown -> Outdated )
Feb 24 11:24:41 node1 kernel: [505827.919111] block drbd0: conn( BrokenPipe -> Unconnected )
Feb 24 11:24:41 node1 kernel: [505827.919115] block drbd0: receiver terminated
Feb 24 11:24:41 node1 kernel: [505827.919117] block drbd0: Restarting receiver thread
Feb 24 11:24:41 node1 kernel: [505827.919120] block drbd0: receiver (re)started
Feb 24 11:24:41 node1 kernel: [505827.919125] block drbd0: conn( Unconnected -> WFConnection )
Feb 24 11:24:41 node1 cib: [30431]: info: write_cib_contents: Wrote version 0.30349.0 of the CIB to disk (digest: 0d2aff989f5f66a397a487a4ff2d53a5)
Feb 24 11:24:41 node1 cib: [30431]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.rgAHMv (digest: /var/lib/heartbeat/crm/cib.Ec7gkf)
Feb 24 11:24:41 node1 kernel: [505828.010679] block drbd0: Handshake successful: Agreed network protocol version 91
Feb 24 11:24:41 node1 kernel: [505828.010683] block drbd0: conn( WFConnection -> WFReportParams )
Feb 24 11:24:41 node1 kernel: [505828.010692] block drbd0: Starting asender thread (from drbd0_receiver [2140])
Feb 24 11:24:41 node1 kernel: [505828.010790] block drbd0: data-integrity-alg: crc32c
Feb 24 11:24:41 node1 kernel: [505828.010895] block drbd0: drbd_sync_handshake:
Feb 24 11:24:41 node1 kernel: [505828.010898] block drbd0: self E3254BAE022AD895:4D7F17D05BAFCC7F:384798F69E852173:3D632ECBB4F41D4D bits:24 flags:0
Feb 24 11:24:41 node1 kernel: [505828.010900] block drbd0: peer 4D7F17D05BAFCC7E:0000000000000000:384798F69E852172:3D632ECBB4F41D4D bits:0 flags:0
Feb 24 11:24:41 node1 kernel: [505828.010903] block drbd0: uuid_compare()=1 by rule 70
Feb 24 11:24:41 node1 kernel: [505828.010907] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> UpToDate )
Feb 24 11:24:41 node1 kernel: [505828.307264] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Feb 24 11:24:41 node1 kernel: [505828.307281] block drbd0: Began resync as SyncSource (will sync 96 KB [24 bits set]).
Feb 24 11:24:41 node1 kernel: [505828.361005] block drbd0: Resync done (total 1 sec; paused 0 sec; 96 K/sec)
Feb 24 11:24:41 node1 kernel: [505828.361014] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Feb 24 11:24:42 node1 cib: [30433]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-40.raw
Feb 24 11:24:42 node1 cib: [30433]: info: write_cib_contents: Wrote version 0.30350.0 of the CIB to disk (digest: 21cd5c7cecba96478c4ad28cebfe4209)
Feb 24 11:24:42 node1 cib: [30433]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.qA3DP2 (digest: /var/lib/heartbeat/crm/cib.GVirfO)
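
To make the cycle in that log easier to follow, here is a small sketch
that pulls the DRBD state transitions (the "conn( A -> B )" style
fragments) out of a log excerpt in order. The function name transitions
and the SAMPLE text are made up for illustration; the field names conn,
peer and pdsk are the ones that appear in the log above.

```python
import re


def transitions(log_text, field="conn"):
    """Return (old, new) state pairs for one field, in log order.

    \\s* around the tokens tolerates entries wrapped across lines.
    """
    pattern = re.compile(
        r"\b" + re.escape(field) + r"\(\s*(\w+)\s*->\s*(\w+)\s*\)"
    )
    return pattern.findall(log_text)


# Three entries copied from the log above (illustrative input).
SAMPLE = """\
Feb 24 11:24:41 node1 kernel: [505827.919111] block drbd0: conn(
BrokenPipe -> Unconnected )
Feb 24 11:24:41 node1 kernel: [505827.919125] block drbd0: conn(
Unconnected -> WFConnection )
Feb 24 11:24:41 node1 kernel: [505828.010907] block drbd0: peer( Unknown
-> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated ->
UpToDate )
"""

for old, new in transitions(SAMPLE, "conn"):
    print(old, "->", new)
```

Run against the full excerpt, the conn field would read BrokenPipe,
Unconnected, WFConnection, WFReportParams, WFBitMapS, SyncSource,
Connected, which is one complete drop-and-resync cycle.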


Here is the drbd config:
common {
   protocol C;
}
resource postgres {
   startup {
     wfc-timeout 0;
     degr-wfc-timeout 120;
   }
   handlers {
     fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
     after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
     #split-brain "/usr/local/scripts/splitbrain.sh";
   }
   disk {
     fencing resource-only;
     on-io-error detach;
   }
   net {
     data-integrity-alg crc32c;
     after-sb-0pri discard-younger-primary;
     after-sb-1pri discard-secondary;
     after-sb-2pri disconnect;
   }
   syncer {
     rate 125M;
     verify-alg sha1;
   }
   on node2 {
     device    /dev/drbd0;
     disk      /dev/cciss/c0d0p3;
     address   172.29.10.101:7789;
     meta-disk internal;
   }
   on node1 {
     device    /dev/drbd0;
     disk      /dev/cciss/c0d0p3;
     address   172.29.10.100:7789;
     meta-disk internal;
   }
}


I am looking to make this setup more stable.  Let me know what other 
information you need from me.


