I have a two-node cluster using Pacemaker and DRBD. The machines'
hardware and software are identical: HP ProLiant DL585 G2, 4x dual-core
CPUs (8 cores total), 96 GB RAM, 8x 146 GB 15k RPM drives in RAID 10,
Ubuntu 10.04.1, kernel 2.6.32-21-server x86_64, drbd 8.3.7-1ubuntu2.1.
There are three partitions: root, swap, and a DRBD partition.
There are two NICs: one for the server and a second dedicated to
Pacemaker/DRBD. The Pacemaker/DRBD NICs used to be on separate switches.
The problem has been happening since the initial setup. We tried moving
the NICs off the switches, and they are now connected by a crossover
cable directly between the two servers. This did not resolve the issue.
Pacemaker sees both nodes and DRBD is syncing data. However, the slave
is constantly getting communication terminated errors and its disk
becomes inconsistent. This results in constant resyncs, some of them
very large, as the log shows:
Feb 21 20:30:25 node1 kernel: [279371.926494] block drbd0: Resync done (total 1 sec; paused 0 sec; 380 K/sec)
Feb 21 20:30:26 node1 kernel: [279372.748328] block drbd0: Resync done (total 1 sec; paused 0 sec; 216 K/sec)
Feb 21 20:30:30 node1 kernel: [279376.838319] block drbd0: Resync done (total 1 sec; paused 0 sec; 56 K/sec)
Feb 21 21:54:38 node1 kernel: [284425.156138] block drbd0: Resync done (total 409 sec; paused 0 sec; 114488 K/sec)
Feb 21 22:00:11 node1 kernel: [284758.103399] block drbd0: Resync done (total 1 sec; paused 0 sec; 336 K/sec)
Feb 21 22:00:31 node1 kernel: [284778.286563] block drbd0: Resync done (total 1 sec; paused 0 sec; 176 K/sec)
Feb 21 22:01:01 node1 kernel: [284807.602122] block drbd0: Resync done (total 1 sec; paused 0 sec; 80 K/sec)
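To put numbers on how often the resyncs happen and how large they are, a
rough sketch like the following could tally the "Resync done" entries. The
names RESYNC_RE, parse_resyncs and SAMPLE are my own and not part of any
DRBD tooling, and the size estimate (duration times average rate) is only
an approximation. The pattern assumes the syslog format shown above and
tolerates the line wrap that mail clients add.

```python
import re

# Matches one "Resync done" entry; \s+ tolerates a line wrap between
# "Resync done" and the parenthesised statistics.
RESYNC_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*?"
    r"block\s+(?P<dev>drbd\d+):\s+Resync\s+done\s+"
    r"\(total\s+(?P<total>\d+)\s+sec;\s+paused\s+(?P<paused>\d+)\s+sec;\s+"
    r"(?P<rate>\d+)\s+K/sec\)"
)


def parse_resyncs(log_text):
    """Return one dict per 'Resync done' entry found in log_text."""
    entries = []
    for m in RESYNC_RE.finditer(log_text):
        total = int(m.group("total"))
        rate = int(m.group("rate"))
        entries.append({
            "stamp": m.group("stamp"),
            "device": m.group("dev"),
            "total_sec": total,
            "rate_kps": rate,
            # Rough estimate only: duration multiplied by average rate.
            "approx_kb": total * rate,
        })
    return entries


# Two entries copied from the log above, still wrapped as mailed.
SAMPLE = """\
Feb 21 20:30:25 node1 kernel: [279371.926494] block drbd0: Resync done
(total 1 sec; paused 0 sec; 380 K/sec)
Feb 21 21:54:38 node1 kernel: [284425.156138] block drbd0: Resync done
(total 409 sec; paused 0 sec; 114488 K/sec)
"""

if __name__ == "__main__":
    for entry in parse_resyncs(SAMPLE):
        print(entry["stamp"], entry["device"],
              entry["total_sec"], "sec,", entry["approx_kb"], "KB (approx)")
```

Running it over the whole syslog instead of SAMPLE would show whether the
large resyncs cluster around particular times.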
Here is the full sequence of log messages from start to finish. This
happens every minute or so:
Feb 24 11:24:41 node1 cib: [30431]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-39.raw
Feb 24 11:24:41 node1 kernel: [505827.918922] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Feb 24 11:24:41 node1 kernel: [505827.918927] block drbd0: fence-peer helper returned 4 (peer was fenced)
Feb 24 11:24:41 node1 kernel: [505827.918943] block drbd0: pdsk( DUnknown -> Outdated )
Feb 24 11:24:41 node1 kernel: [505827.919111] block drbd0: conn( BrokenPipe -> Unconnected )
Feb 24 11:24:41 node1 kernel: [505827.919115] block drbd0: receiver terminated
Feb 24 11:24:41 node1 kernel: [505827.919117] block drbd0: Restarting receiver thread
Feb 24 11:24:41 node1 kernel: [505827.919120] block drbd0: receiver (re)started
Feb 24 11:24:41 node1 kernel: [505827.919125] block drbd0: conn( Unconnected -> WFConnection )
Feb 24 11:24:41 node1 cib: [30431]: info: write_cib_contents: Wrote version 0.30349.0 of the CIB to disk (digest: 0d2aff989f5f66a397a487a4ff2d53a5)
Feb 24 11:24:41 node1 cib: [30431]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.rgAHMv (digest: /var/lib/heartbeat/crm/cib.Ec7gkf)
Feb 24 11:24:41 node1 kernel: [505828.010679] block drbd0: Handshake successful: Agreed network protocol version 91
Feb 24 11:24:41 node1 kernel: [505828.010683] block drbd0: conn( WFConnection -> WFReportParams )
Feb 24 11:24:41 node1 kernel: [505828.010692] block drbd0: Starting asender thread (from drbd0_receiver [2140])
Feb 24 11:24:41 node1 kernel: [505828.010790] block drbd0: data-integrity-alg: crc32c
Feb 24 11:24:41 node1 kernel: [505828.010895] block drbd0: drbd_sync_handshake:
Feb 24 11:24:41 node1 kernel: [505828.010898] block drbd0: self E3254BAE022AD895:4D7F17D05BAFCC7F:384798F69E852173:3D632ECBB4F41D4D bits:24 flags:0
Feb 24 11:24:41 node1 kernel: [505828.010900] block drbd0: peer 4D7F17D05BAFCC7E:0000000000000000:384798F69E852172:3D632ECBB4F41D4D bits:0 flags:0
Feb 24 11:24:41 node1 kernel: [505828.010903] block drbd0: uuid_compare()=1 by rule 70
Feb 24 11:24:41 node1 kernel: [505828.010907] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> UpToDate )
Feb 24 11:24:41 node1 kernel: [505828.307264] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Feb 24 11:24:41 node1 kernel: [505828.307281] block drbd0: Began resync as SyncSource (will sync 96 KB [24 bits set]).
Feb 24 11:24:41 node1 kernel: [505828.361005] block drbd0: Resync done (total 1 sec; paused 0 sec; 96 K/sec)
Feb 24 11:24:41 node1 kernel: [505828.361014] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Feb 24 11:24:42 node1 cib: [30433]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-40.raw
Feb 24 11:24:42 node1 cib: [30433]: info: write_cib_contents: Wrote version 0.30350.0 of the CIB to disk (digest: 21cd5c7cecba96478c4ad28cebfe4209)
Feb 24 11:24:42 node1 cib: [30433]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.qA3DP2 (digest: /var/lib/heartbeat/crm/cib.GVirfO)
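The state changes are easier to follow when pulled out of the surrounding
noise. As a minimal sketch (TRANSITION_RE, transitions and SAMPLE are names
I made up for illustration, and the pattern assumes the "conn( A -> B )"
format seen in the messages above), something like this could list the
connection or peer-disk transitions in order:

```python
import re

# Matches DRBD state-change fragments such as "conn( A -> B )". The \s
# classes also match newlines, so fragments wrapped by a mail client
# are still recognised.
TRANSITION_RE = re.compile(r"\b(peer|conn|pdsk)\(\s*(\w+)\s+->\s+(\w+)\s*\)")


def transitions(log_text, field="conn"):
    """Return (old, new) pairs for one state field, in log order."""
    return [
        (old, new)
        for name, old, new in TRANSITION_RE.findall(log_text)
        if name == field
    ]


# Excerpt of the messages above, still wrapped as mailed.
SAMPLE = """\
Feb 24 11:24:41 node1 kernel: [505827.918943] block drbd0: pdsk(
DUnknown -> Outdated )
Feb 24 11:24:41 node1 kernel: [505827.919111] block drbd0: conn(
BrokenPipe -> Unconnected )
Feb 24 11:24:41 node1 kernel: [505827.919125] block drbd0: conn(
Unconnected -> WFConnection )
Feb 24 11:24:41 node1 kernel: [505828.010907] block drbd0: peer( Unknown
-> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated ->
UpToDate )
"""

if __name__ == "__main__":
    for old, new in transitions(SAMPLE, "conn"):
        print(old, "->", new)
```

Passing "pdsk" or "peer" as the second argument lists the peer-disk or
peer-role changes instead.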
Here is the drbd config:
common {
    protocol C;
}
resource postgres {
    startup {
        wfc-timeout 0;
        degr-wfc-timeout 120;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        #split-brain "/usr/local/scripts/splitbrain.sh";
    }
    disk {
        fencing resource-only;
        on-io-error detach;
    }
    net {
        data-integrity-alg crc32c;
        after-sb-0pri discard-younger-primary;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    syncer {
        rate 125M;
        verify-alg sha1;
    }
    on node2 {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 172.29.10.101:7789;
        meta-disk internal;
    }
    on node1 {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 172.29.10.100:7789;
        meta-disk internal;
    }
}
I am looking to make this setup more stable. Let me know what other
information you need from me.