I have a two-node cluster using pacemaker and drbd. The machines' hardware and software are identical: HP ProLiant DL585 G2, 4x dual-core CPUs (8 cores total), 96 GB RAM, 8x 146 GB 15k rpm drives in RAID 10, Ubuntu 10.04.1, 2.6.32-21-server x86_64, drbd 8.3.7-1ubuntu2.1. There are three partitions: root, swap and a drbd partition. There are two NICs, one for the server and the second dedicated to pacemaker/drbd.

The pacemaker/drbd NICs used to be on separate switches. The problem has been present since the initial setup. We tried moving the NICs off the switches, and they are now connected by a crossover cable between the two servers. This didn't resolve the issue.

Pacemaker sees both nodes and drbd is syncing data. However, the slave keeps getting communication terminated errors and its disk becomes inconsistent. This results in constant resyncs, some of them very large:

Feb 21 20:30:25 node1 kernel: [279371.926494] block drbd0: Resync done (total 1 sec; paused 0 sec; 380 K/sec)
Feb 21 20:30:26 node1 kernel: [279372.748328] block drbd0: Resync done (total 1 sec; paused 0 sec; 216 K/sec)
Feb 21 20:30:30 node1 kernel: [279376.838319] block drbd0: Resync done (total 1 sec; paused 0 sec; 56 K/sec)
Feb 21 21:54:38 node1 kernel: [284425.156138] block drbd0: Resync done (total 409 sec; paused 0 sec; 114488 K/sec)
Feb 21 22:00:11 node1 kernel: [284758.103399] block drbd0: Resync done (total 1 sec; paused 0 sec; 336 K/sec)
Feb 21 22:00:31 node1 kernel: [284778.286563] block drbd0: Resync done (total 1 sec; paused 0 sec; 176 K/sec)
Feb 21 22:01:01 node1 kernel: [284807.602122] block drbd0: Resync done (total 1 sec; paused 0 sec; 80 K/sec)

Here is the full sequence of log messages from start to finish; it repeats every minute or so:

Feb 24 11:24:41 node1 cib: [30431]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-39.raw
Feb 24 11:24:41 node1 kernel: [505827.918922] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Feb 24 11:24:41 node1 kernel: [505827.918927] block drbd0: fence-peer helper returned 4 (peer was fenced)
Feb 24 11:24:41 node1 kernel: [505827.918943] block drbd0: pdsk( DUnknown -> Outdated )
Feb 24 11:24:41 node1 kernel: [505827.919111] block drbd0: conn( BrokenPipe -> Unconnected )
Feb 24 11:24:41 node1 kernel: [505827.919115] block drbd0: receiver terminated
Feb 24 11:24:41 node1 kernel: [505827.919117] block drbd0: Restarting receiver thread
Feb 24 11:24:41 node1 kernel: [505827.919120] block drbd0: receiver (re)started
Feb 24 11:24:41 node1 kernel: [505827.919125] block drbd0: conn( Unconnected -> WFConnection )
Feb 24 11:24:41 node1 cib: [30431]: info: write_cib_contents: Wrote version 0.30349.0 of the CIB to disk (digest: 0d2aff989f5f66a397a487a4ff2d53a5)
Feb 24 11:24:41 node1 cib: [30431]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.rgAHMv (digest: /var/lib/heartbeat/crm/cib.Ec7gkf)
Feb 24 11:24:41 node1 kernel: [505828.010679] block drbd0: Handshake successful: Agreed network protocol version 91
Feb 24 11:24:41 node1 kernel: [505828.010683] block drbd0: conn( WFConnection -> WFReportParams )
Feb 24 11:24:41 node1 kernel: [505828.010692] block drbd0: Starting asender thread (from drbd0_receiver [2140])
Feb 24 11:24:41 node1 kernel: [505828.010790] block drbd0: data-integrity-alg: crc32c
Feb 24 11:24:41 node1 kernel: [505828.010895] block drbd0: drbd_sync_handshake:
Feb 24 11:24:41 node1 kernel: [505828.010898] block drbd0: self E3254BAE022AD895:4D7F17D05BAFCC7F:384798F69E852173:3D632ECBB4F41D4D bits:24 flags:0
Feb 24 11:24:41 node1 kernel: [505828.010900] block drbd0: peer 4D7F17D05BAFCC7E:0000000000000000:384798F69E852172:3D632ECBB4F41D4D bits:0 flags:0
Feb 24 11:24:41 node1 kernel: [505828.010903] block drbd0: uuid_compare()=1 by rule 70
Feb 24 11:24:41 node1 kernel: [505828.010907] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> UpToDate )
Feb 24 11:24:41 node1 kernel: [505828.307264] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Feb 24 11:24:41 node1 kernel: [505828.307281] block drbd0: Began resync as SyncSource (will sync 96 KB [24 bits set]).
Feb 24 11:24:41 node1 kernel: [505828.361005] block drbd0: Resync done (total 1 sec; paused 0 sec; 96 K/sec)
Feb 24 11:24:41 node1 kernel: [505828.361014] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Feb 24 11:24:42 node1 cib: [30433]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-40.raw
Feb 24 11:24:42 node1 cib: [30433]: info: write_cib_contents: Wrote version 0.30350.0 of the CIB to disk (digest: 21cd5c7cecba96478c4ad28cebfe4209)
Feb 24 11:24:42 node1 cib: [30433]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.qA3DP2 (digest: /var/lib/heartbeat/crm/cib.GVirfO)

Here is the drbd config:

common {
    protocol C;
}

resource postgres {
    startup {
        wfc-timeout 0;
        degr-wfc-timeout 120;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        #split-brain "/usr/local/scripts/splitbrain.sh";
    }
    disk {
        fencing resource-only;
        on-io-error detach;
    }
    net {
        data-integrity-alg crc32c;
        after-sb-0pri discard-younger-primary;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    syncer {
        rate 125M;
        verify-alg sha1;
    }
    on node2 {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 172.29.10.101:7789;
        meta-disk internal;
    }
    on node1 {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 172.29.10.100:7789;
        meta-disk internal;
    }
}

I am looking to make this setup more stable. Let me know what other information you need from me.
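For anyone trying to quantify how often the resyncs in the kernel log above happen and how large they are, a small Python sketch along these lines may help. It is only an illustration, not part of DRBD or pacemaker: the function names (parse_resyncs, summarize) and the 60-second threshold are made up here, the regular expression is fitted to the "Resync done" lines exactly as they appear in this post, and the data volume is a rough estimate (reported rate times reported duration, treating K as KiB, which is an assumption).

```python
import re

# Fitted to the DRBD 8.3 kernel messages quoted in this post, e.g.:
#   Feb 21 21:54:38 node1 kernel: [...] block drbd0: Resync done
#   (total 409 sec; paused 0 sec; 114488 K/sec)
RESYNC_RE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:.*?"
    r"block (?P<dev>drbd\d+): Resync done "
    r"\(total (?P<total>\d+) sec; paused (?P<paused>\d+) sec; (?P<rate>\d+) K/sec\)"
)


def parse_resyncs(lines):
    """Return one dict per 'Resync done' line; other lines are ignored."""
    events = []
    for line in lines:
        match = RESYNC_RE.search(line)
        if match is None:
            continue
        seconds = int(match["total"])
        rate_k = int(match["rate"])
        events.append({
            "stamp": match["stamp"],
            "host": match["host"],
            "device": match["dev"],
            "seconds": seconds,
            "rate_k": rate_k,
            # Rough estimate only: DRBD reports whole seconds, and K is
            # assumed to mean KiB here.
            "approx_mib": seconds * rate_k / 1024,
        })
    return events


def summarize(events, large_seconds=60):
    """Count resyncs and pick out the long ones (threshold is arbitrary)."""
    return {
        "count": len(events),
        "large": [e for e in events if e["seconds"] >= large_seconds],
        "approx_total_mib": sum(e["approx_mib"] for e in events),
    }


# A few lines copied from the post, plus one unrelated line that is skipped.
SAMPLE = [
    "Feb 21 20:30:25 node1 kernel: [279371.926494] block drbd0: Resync done (total 1 sec; paused 0 sec; 380 K/sec)",
    "Feb 21 20:30:30 node1 kernel: [279376.838319] block drbd0: Resync done (total 1 sec; paused 0 sec; 56 K/sec)",
    "Feb 21 21:54:38 node1 kernel: [284425.156138] block drbd0: Resync done (total 409 sec; paused 0 sec; 114488 K/sec)",
    "Feb 21 22:00:11 node1 kernel: [284758.103399] block drbd0: Resync done (total 1 sec; paused 0 sec; 336 K/sec)",
    "Feb 24 11:24:41 node1 kernel: [505827.919115] block drbd0: receiver terminated",
]

if __name__ == "__main__":
    summary = summarize(parse_resyncs(SAMPLE))
    print("resyncs:", summary["count"])
    for event in summary["large"]:
        print("large resync at", event["stamp"],
              "approx", round(event["approx_mib"]), "MiB")
```

To run it against a real log, the SAMPLE list could be replaced with the lines of an open file, for example parse_resyncs(open("/var/log/kern.log")), where the path is an assumption about where the kernel messages end up on a given system. By this estimate the 409-second resync moved about 44.7 GiB, while the one-second resyncs moved well under a megabyte each.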