[DRBD-user] RedHat5.7 "buffer modified by upper layers" + kernel panic [LONG]

Michael Glad glad at science.au.dk
Sat Aug 27 21:48:03 CEST 2011


Dear DRBD-list, we've used DRBD+Heartbeat for a couple of years in a 
large LAMP based CMS system with 300+ gigs of MySQL data.
We use it to store MySQL data on two physical 64 bit RHEL5 5.7 servers. 
It has served us very well. There has been no crashes
until now, but being a HA setup, it has allowed for sw / hw maintenance 
with very little associated downtime.

Lately, however, we've been subject to huge number of 'Digest mismatch, 
buffer modified by upper layers '
events and associated resyncs.This happens many times an hour and makes 
online verification
impossible. Replacing the network interfaces have not helped.

Very lately, we've had two nasty kernel panics. Syslog data for the 
latest is included below.

We know that digest verification of all transactions is not the 
recommended practice in a production environment,
but in case that the problems should be related to hardware data 
corruption we want to play it safe.
Loosing the DRBD data would cause hours of downtime.

Is there any good advice for solving the problems, are we doing 
something wrong?

   - Michael

-----------------------------------short-info---------------------------------

[root at db1 ~]# rpm -q heartbeat
heartbeat-2.1.3-1

[root at db1 ~]# rpm -qa | grep drbd
drbd-km-2.6.18_274.el5-8.3.11-1
drbd-utils-8.3.11-1
drbd-heartbeat-8.3.11-1
drbd-udev-8.3.11-1

Hardware: 2x Dell R710 with 96G RAM and internal disks.3
[root at db2 ~]# ethtool  --driver eth0
driver: bnx2
version: 2.0.21
firmware-version: 5.0.9 bc 5.0.6 NCSI 2.0.3
bus-info: 0000:01:00.0

See below for detailed info.

-----------------------------------------detailed-info--------------------------

db1: /etc/drbd.conf:

global {
         usage-count no;
}
common {
         #syncer { rate 30M; verify-alg sha1; }
         syncer { rate 20M; verify-alg sha1; }
         net {
                 cram-hmac-alg sha1;
                 shared-secret "###";
                 data-integrity-alg sha1;
         }
         protocol C;
         handlers {
                 outdate-peer 
"/usr/lib64/heartbeat/drbd-peer-outdater.sh -t 60 -p db1 -r sql";
         }
         disk {
                 no-disk-barrier;
                 fencing resource-only;
         }
}
resource sql {
         on db1 {
                 device    minor 1;
                 disk      /dev/plads/WebSqlRaw;
                 address   x.y.z.4:7789;
                 meta-disk internal;
         }
         on db2 {
                 device    minor 1;
                 disk      /dev/plads/WebSqlRaw;
                 address   x.y.z.5:7789;
                 meta-disk internal;
         }
}
////////////////////////
[root at db2 ~]# cat /usr/lib64/heartbeat/drbd-peer-outdater.sh
#!/bin/sh

# FE to drbd-peer-outdater to cope with bug in HeartBeat 2.1.3

/usr/lib64/heartbeat/drbd-peer-outdater "$@"
ex=$?
if [ $ex = 20 ]; then
ex=5
fi
exit $ex
////////////////////////
[root at db2 ~]# tail -3 /etc/ha.d/ha.cf
respawn hacluster /usr/lib64/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster

////////////////////////
[root at db2 ~]# cat /etc/ha.d/haresources
db1 drbddisk::sql Filesystem::/dev/drbd1::/var/lib/mysql::ext3::noatime 
x.y.z.150  mysqld memcached
////////////////////////

[root at db1 ~]# rpm -q heartbeat
heartbeat-2.1.3-1

[root at db1 ~]# rpm -qa | grep drbd
drbd-km-2.6.18_274.el5-8.3.11-1
drbd-utils-8.3.11-1
drbd-heartbeat-8.3.11-1
drbd-udev-8.3.11-1

[root at db1 ~]# uname -a ; cat /etc/redhat-release
Linux db1 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 
x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 5.7 (Tikanga)



Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Digest mismatch, buffer 
modified by upper layers during write: 84581616s +4096
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Digest integrity check 
FAILED: 84581616s +4096
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: error receiving Data, l: 4140!
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: peer( Primary -> Unknown ) 
conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: asender terminated
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Terminating asender thread
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Connection closed
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: conn( ProtocolError -> 
Unconnected )
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: receiver terminated
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Restarting receiver thread
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: receiver (re)started
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: sock was shut down by peer
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: peer( Secondary -> Unknown 
) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: short read expecting 
header on sock: r=0
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: meta connection shut down 
by peer.
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: asender terminated
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Terminating asender thread
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: new current UUID 
211E1556F8380CC3:CE3F2D613EF98849:58345E642CB68BA5:58335E642CB68BA5
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Connection closed
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: conn( BrokenPipe -> 
Unconnected )
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: receiver terminated
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Restarting receiver thread
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: receiver (re)started
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: helper command: 
/sbin/drbdadm fence-peer minor-1
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: conn( Unconnected -> 
WFConnection )
Aug 27 20:10:03 x.y.z.4 drbd-peer-outdater: [4120]: debug: drbd peer: db2
Aug 27 20:10:03 x.y.z.4 drbd-peer-outdater: [4120]: debug: drbd 
resource: sql
Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: 
Connecting channel
Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: 
Client outdater (0x11617150) connected
Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: 
invoked: outdater
Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: 
Processing msg from outdater
Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Got 
message from (drbd-peer-outdater). (peer: db2, res :sql)
Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: 
Starting node walk
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: conn( Unconnected -> 
WFConnection )
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Handshake successful: 
Agreed network protocol version 96
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Peer authenticated using 
20 bytes of 'sha1' HMAC
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: conn( WFConnection -> 
WFReportParams )
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Starting asender thread 
(from drbd1_receiver [29862])
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: data-integrity-alg: sha1
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Handshake successful: 
Agreed network protocol version 96
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Peer authenticated using 
20 bytes of 'sha1' HMAC
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: conn( WFConnection -> 
WFReportParams )
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Starting asender thread 
(from drbd1_receiver [13639])
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: data-integrity-alg: sha1
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: drbd_sync_handshake:
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: self 
211E1556F8380CC3:CE3F2D613EF98849:58345E642CB68BA5:58335E642CB68BA5 
bits:40 flags:0
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: peer 
CE3F2D613EF98848:0000000000000000:58345E642CB68BA4:58335E642CB68BA5 
bits:0 flags:0
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: uuid_compare()=1 by rule 70
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: peer( Unknown -> Secondary 
) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: drbd_sync_handshake:
ug 27 20:10:04 x.y.z.5 kernel:: block drbd1: self 
CE3F2D613EF98848:0000000000000000:58345E642CB68BA4:58335E642CB68BA5 
bits:0 flags:0
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: peer 
211E1556F8380CC3:CE3F2D613EF98849:58345E642CB68BA5:58335E642CB68BA5 
bits:40 flags:0
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: uuid_compare()=-1 by rule 50
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: peer( Unknown -> Primary ) 
conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( 
DUnknown -> UpToDate )
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: conn( WFBitMapT -> 
WFSyncUUID )
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: helper command: 
/sbin/drbdadm before-resync-source minor-1
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: helper command: 
/sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: conn( WFBitMapS -> 
SyncSource ) pdsk( Consistent -> Inconsistent )
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Began resync as SyncSource 
(will sync 160 KB [40 bits set]).
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: updated sync UUID 
211E1556F8380CC3:CE402D613EF98849:CE3F2D613EF98849:58345E642CB68BA5
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: updated sync uuid 
CE402D613EF98848:0000000000000000:58345E642CB68BA4:58335E642CB68BA5
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: helper command: 
/sbin/drbdadm before-resync-target minor-1
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: helper command: 
/sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: conn( WFSyncUUID -> 
SyncTarget ) disk( Outdated -> Inconsistent )
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Began resync as SyncTarget 
(will sync 160 KB [40 bits set]).
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: BAD! sector=837397720s 
enr=25555 rs_left=-1 rs_failed=0 count=1
Aug 27 20:10:04 x.y.z.4 syslog-ng[4325]: Error processing log message: 
<4>1 2011-08-27T20:10:04+02:00 db1 kernel: - - [meta sequenceId="6082"]
Aug 27 20:10:04 x.y.z.4 kernel:: Call Trace:
Aug 27 20:10:04 x.y.z.4 kernel:: <IRQ>  [<ffffffff885e4e69>] 
:drbd:drbd_try_clear_on_disk_bm+0xbb/0x27b
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885e6ecf>] 
:drbd:__drbd_set_in_sync+0x27b/0x2aa
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885e1b80>] 
:drbd:_req_may_be_done+0x311/0x52b
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885e2040>] 
:drbd:__req_mod+0x293/0xccc
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885d5476>] 
:drbd:drbd_endio_pri+0xbe/0xee
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff881495af>] 
:dm_mod:clone_endio+0xc1/0xd0
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8002cf01>] 
__end_that_request_first+0x23c/0x4b6
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8807a202>] 
:scsi_mod:scsi_end_request+0x27/0xcd
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8807a3f6>] 
:scsi_mod:scsi_io_completion+0x14e/0x324
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff880a7802>] 
:sd_mod:sd_rw_intr+0x252/0x28c
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8807a68d>] 
:scsi_mod:scsi_device_unbusy+0x67/0x81
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80037ba5>] 
blk_done_softirq+0x5f/0x6d
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80012562>] 
__do_softirq+0x89/0x133
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8006d636>] do_softirq+0x2c/0x7d
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8006d4c6>] do_IRQ+0xee/0xf7
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005d615>] ret_from_intr+0x0/0xa
Aug 27 20:10:04 x.y.z.4 kernel:: <EOI>  [<ffffffff800968ea>] 
ksoftirqd+0x0/0xbf
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8014fc8d>] 
cfq_queue_empty+0x0/0x16
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80154f55>] 
kobject_release+0x0/0x9
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8006d685>] do_softirq+0x7b/0x7d
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80096949>] ksoftirqd+0x5f/0xbf
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80032722>] kthread+0xfe/0x132
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80032624>] kthread+0x0/0x132
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Aug 27 20:10:04 x.y.z.4 syslog-ng[4325]: Error processing log message: 
<4>1 2011-08-27T20:10:04+02:00 db1 kernel: - - [meta sequenceId="6110"]




More information about the drbd-user mailing list