Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Dear DRBD-list, we've used DRBD+Heartbeat for a couple of years in a large LAMP based CMS system with 300+ gigs of MySQL data. We use it to store MySQL data on two physical 64 bit RHEL5 5.7 servers. It has served us very well. There has been no crashes until now, but being a HA setup, it has allowed for sw / hw maintenance with very little associated downtime. Lately, however, we've been subject to huge number of 'Digest mismatch, buffer modified by upper layers ' events and associated resyncs.This happens many times an hour and makes online verification impossible. Replacing the network interfaces have not helped. Very lately, we've had two nasty kernel panics. Syslog data for the latest is included below. We know that digest verification of all transactions is not the recommended practice in a production environment, but in case that the problems should be related to hardware data corruption we want to play it safe. Loosing the DRBD data would cause hours of downtime. Is there any good advice for solving the problems, are we doing something wrong? - Michael -----------------------------------short-info--------------------------------- [root at db1 ~]# rpm -q heartbeat heartbeat-2.1.3-1 [root at db1 ~]# rpm -qa | grep drbd drbd-km-2.6.18_274.el5-8.3.11-1 drbd-utils-8.3.11-1 drbd-heartbeat-8.3.11-1 drbd-udev-8.3.11-1 Hardware: 2x Dell R710 with 96G RAM and internal disks.3 [root at db2 ~]# ethtool --driver eth0 driver: bnx2 version: 2.0.21 firmware-version: 5.0.9 bc 5.0.6 NCSI 2.0.3 bus-info: 0000:01:00.0 See below for detailed info. -----------------------------------------detailed-info-------------------------- db1: /etc/drbd.conf: global { usage-count no; } common { #syncer { rate 30M; verify-alg sha1; } syncer { rate 20M; verify-alg sha1; } net { cram-hmac-alg sha1; shared-secret "###"; data-integrity-alg sha1; } protocol C; handlers { outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater.sh -t 60 -p db1 -r sql"; } disk { no-disk-barrier; fencing resource-only; } } resource sql { on db1 { device minor 1; disk /dev/plads/WebSqlRaw; address x.y.z.4:7789; meta-disk internal; } on db2 { device minor 1; disk /dev/plads/WebSqlRaw; address x.y.z.5:7789; meta-disk internal; } } //////////////////////// [root at db2 ~]# cat /usr/lib64/heartbeat/drbd-peer-outdater.sh #!/bin/sh # FE to drbd-peer-outdater to cope with bug in HeartBeat 2.1.3 /usr/lib64/heartbeat/drbd-peer-outdater "$@" ex=$? if [ $ex = 20 ]; then ex=5 fi exit $ex //////////////////////// [root at db2 ~]# tail -3 /etc/ha.d/ha.cf respawn hacluster /usr/lib64/heartbeat/dopd apiauth dopd gid=haclient uid=hacluster //////////////////////// [root at db2 ~]# cat /etc/ha.d/haresources db1 drbddisk::sql Filesystem::/dev/drbd1::/var/lib/mysql::ext3::noatime x.y.z.150 mysqld memcached //////////////////////// [root at db1 ~]# rpm -q heartbeat heartbeat-2.1.3-1 [root at db1 ~]# rpm -qa | grep drbd drbd-km-2.6.18_274.el5-8.3.11-1 drbd-utils-8.3.11-1 drbd-heartbeat-8.3.11-1 drbd-udev-8.3.11-1 [root at db1 ~]# uname -a ; cat /etc/redhat-release Linux db1 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 5.7 (Tikanga) Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Digest mismatch, buffer modified by upper layers during write: 84581616s +4096 Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Digest integrity check FAILED: 84581616s +4096 Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: error receiving Data, l: 4140! Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: asender terminated Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Terminating asender thread Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Connection closed Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: conn( ProtocolError -> Unconnected ) Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: receiver terminated Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Restarting receiver thread Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: receiver (re)started Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: sock was shut down by peer Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: short read expecting header on sock: r=0 Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: meta connection shut down by peer. Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: asender terminated Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Terminating asender thread Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: new current UUID 211E1556F8380CC3:CE3F2D613EF98849:58345E642CB68BA5:58335E642CB68BA5 Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Connection closed Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: conn( BrokenPipe -> Unconnected ) Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: receiver terminated Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Restarting receiver thread Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: receiver (re)started Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: conn( Unconnected -> WFConnection ) Aug 27 20:10:03 x.y.z.4 drbd-peer-outdater: [4120]: debug: drbd peer: db2 Aug 27 20:10:03 x.y.z.4 drbd-peer-outdater: [4120]: debug: drbd resource: sql Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Connecting channel Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Client outdater (0x11617150) connected Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: invoked: outdater Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Processing msg from outdater Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Got message from (drbd-peer-outdater). (peer: db2, res :sql) Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Starting node walk Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: conn( Unconnected -> WFConnection ) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Handshake successful: Agreed network protocol version 96 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: conn( WFConnection -> WFReportParams ) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Starting asender thread (from drbd1_receiver [29862]) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: data-integrity-alg: sha1 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Handshake successful: Agreed network protocol version 96 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: conn( WFConnection -> WFReportParams ) Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Starting asender thread (from drbd1_receiver [13639]) Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: data-integrity-alg: sha1 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: drbd_sync_handshake: Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: self 211E1556F8380CC3:CE3F2D613EF98849:58345E642CB68BA5:58335E642CB68BA5 bits:40 flags:0 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: peer CE3F2D613EF98848:0000000000000000:58345E642CB68BA4:58335E642CB68BA5 bits:0 flags:0 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: uuid_compare()=1 by rule 70 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent ) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: drbd_sync_handshake: ug 27 20:10:04 x.y.z.5 kernel:: block drbd1: self CE3F2D613EF98848:0000000000000000:58345E642CB68BA4:58335E642CB68BA5 bits:0 flags:0 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: peer 211E1556F8380CC3:CE3F2D613EF98849:58345E642CB68BA5:58335E642CB68BA5 bits:40 flags:0 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: uuid_compare()=-1 by rule 50 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate ) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: conn( WFBitMapT -> WFSyncUUID ) Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0) Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Began resync as SyncSource (will sync 160 KB [40 bits set]). Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: updated sync UUID 211E1556F8380CC3:CE402D613EF98849:CE3F2D613EF98849:58345E642CB68BA5 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: updated sync uuid CE402D613EF98848:0000000000000000:58345E642CB68BA4:58335E642CB68BA5 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent ) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Began resync as SyncTarget (will sync 160 KB [40 bits set]). Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: BAD! sector=837397720s enr=25555 rs_left=-1 rs_failed=0 count=1 Aug 27 20:10:04 x.y.z.4 syslog-ng[4325]: Error processing log message: <4>1 2011-08-27T20:10:04+02:00 db1 kernel: - - [meta sequenceId="6082"] Aug 27 20:10:04 x.y.z.4 kernel:: Call Trace: Aug 27 20:10:04 x.y.z.4 kernel:: <IRQ> [<ffffffff885e4e69>] :drbd:drbd_try_clear_on_disk_bm+0xbb/0x27b Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885e6ecf>] :drbd:__drbd_set_in_sync+0x27b/0x2aa Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885e1b80>] :drbd:_req_may_be_done+0x311/0x52b Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885e2040>] :drbd:__req_mod+0x293/0xccc Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885d5476>] :drbd:drbd_endio_pri+0xbe/0xee Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff881495af>] :dm_mod:clone_endio+0xc1/0xd0 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8002cf01>] __end_that_request_first+0x23c/0x4b6 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8807a202>] :scsi_mod:scsi_end_request+0x27/0xcd Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8807a3f6>] :scsi_mod:scsi_io_completion+0x14e/0x324 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff880a7802>] :sd_mod:sd_rw_intr+0x252/0x28c Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8807a68d>] :scsi_mod:scsi_device_unbusy+0x67/0x81 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80037ba5>] blk_done_softirq+0x5f/0x6d Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80012562>] __do_softirq+0x89/0x133 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8006d636>] do_softirq+0x2c/0x7d Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8006d4c6>] do_IRQ+0xee/0xf7 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005d615>] ret_from_intr+0x0/0xa Aug 27 20:10:04 x.y.z.4 kernel:: <EOI> [<ffffffff800968ea>] ksoftirqd+0x0/0xbf Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8014fc8d>] cfq_queue_empty+0x0/0x16 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80154f55>] kobject_release+0x0/0x9 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8006d685>] do_softirq+0x7b/0x7d Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80096949>] ksoftirqd+0x5f/0xbf Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80032722>] kthread+0xfe/0x132 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005dfb1>] child_rip+0xa/0x11 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80032624>] kthread+0x0/0x132 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005dfa7>] child_rip+0x0/0x11 Aug 27 20:10:04 x.y.z.4 syslog-ng[4325]: Error processing log message: <4>1 2011-08-27T20:10:04+02:00 db1 kernel: - - [meta sequenceId="6110"]