[DRBD-user] DRBD 8.4.4, kernel 3.10 major problem under high parallel I/O load

Tue Nov 5 09:23:09 CET 2013

Hello,

Debian wheezy, hand rolled 3.10.17 kernel and DRBD 8.4.4. 
Standard linbit DRBD resource, dopd outdater.

Note that this hardware has run under 3.2 and 8.3.13 flawlessly for 1.5
years.

Note also that this cluster is one of 3 nearly identical ones , this one
has Adaptec 51645 controllers, the others Areca 1882 (I'm much happier
with Areca performance and otherwise). 
The other 2 clusters are running 3.4.x kernels and DRBD 8.4.3 w/o any
problems.

These clusters are mailbox servers, so they after running bonnie++ to get
some performance data the next step in my burn-in process is to unleash a
torrent of small mails via postal. 
This is when the below fireworks happens, easily reproducible within one
minute of the post run.

Here's the gritty bits:
---
 [  700.691264] block drbd0: local WRITE IO error sector 5941500912+24 on sda
*** Really? Nothing in the logs indicating a problem, nothing in the controller logs
 [  700.691273] block drbd0: disk( UpToDate -> Failed ) 
 [  700.691277] block drbd0: Local IO failed in __req_mod. Detaching...
 [  700.693270] block drbd0: bitmap WRITE of 0 pages took 1 jiffies
 [  700.693276] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
 [  700.693286] block drbd0: disk( Failed -> Diskless ) 
 [  721.747108] drbd mb01: sock was shut down by peer
*** Why did the peer drop the ball? See below.
 [  721.747134] drbd mb01: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
 [  721.747137] drbd mb01: short read (expected size 16)
 [  721.747152] block drbd0: IO ERROR: neither local nor remote data, sector 5939137608+8
 [  721.747251] Buffer I/O error on device drbd0, logical block 742392201
 [  721.747322] block drbd0: helper command: /sbin/drbdadm pri-on-incon-degr minor-0
 [  721.747336] lost page write due to I/O error on drbd0
 [  721.747352] block drbd0: IO ERROR: neither local nor remote data, sector 5939137616+8
 [  721.747362] drbd mb01: meta connection shut down by peer.
 [  721.747513] Buffer I/O error on device drbd0, logical block 742392202
 [  721.747522] drbd mb01: asender terminated
 [  721.747532] drbd mb01: Terminating drbd_a_mb01
 [  721.747598] lost page write due to I/O error on drbd0
 [  721.747612] block drbd0: IO ERROR: neither local nor remote data, sector 5939137624+8
 [  721.747697] Buffer I/O error on device drbd0, logical block 742392203
 [  721.747776] lost page write due to I/O error on drbd0
*** a few dozen more of these
 [  721.749045] block drbd0: Should have called drbd_al_complete_io(, 5941500912, 12288), but my Disk seems to have failed :(
 [  721.749573] drbd mb01: Connection closed
 [  721.749608] drbd0: WRITE SAME failed. Manually zeroing.
 [  721.749747] EXT4-fs (drbd0): delayed block allocation failed for inode 92810956 at logical offset 0 with max blocks 1 with error -5
 [  721.749898] EXT4-fs (drbd0): This should not happen!! Data will be lost
*** Indeed, data was lost, requiring a fsck later.
 [  721.750093] drbd mb01: conn( BrokenPipe -> Unconnected ) 
 [  721.750106] drbd mb01: receiver terminated
 [  721.750111] drbd mb01: Restarting receiver thread
 [  721.750115] drbd mb01: receiver (re)started
 [  721.750172] drbd mb01: conn( Unconnected -> WFConnection ) 
 [  721.750317] drbd mb01: Not fencing peer, I'm not even Consistent myself.
 [  721.750605] Aborting journal on device drbd0-8.
 [  721.750717] EXT4-fs error (device drbd0) in __ext4_new_inode:750: Readonly filesystem
 [  721.750832] EXT4-fs error (device drbd0) in __ext4_new_inode:750: Readonly filesystem
*** a few dozen more of those
 [  721.756323] EXT4-fs (drbd0): Remounting filesystem read-only
 [  721.756434] EXT4-fs (drbd0): previous I/O error to superblock detected
 [  721.757152] EXT4-fs (drbd0): I/O error while writing superblock
 [  722.247321] drbd mb01: Handshake successful: Agreed network protocol version 101
 [  722.247328] drbd mb01: Agreed to support TRIM on protocol level
 [  722.247458] drbd mb01: conn( WFConnection -> WFReportParams ) 
 [  722.247465] drbd mb01: Starting asender thread (from drbd_r_mb01 [4405])
 [  722.329390] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate ) 
---

Again, when the postal run starts, the machine gets hammered. But that
shouldn't be a problem (other than slowing down of course) and wasn't in
the past. 
So firstly I think that initial I/O error is a fluke, not sure which layer
messed up there.
Secondly the other side drop the link right when it was needed most.
This is the pertinent log entry from the other node for that time:
---
Nov  5 15:38:40 mbx02 kernel: [513179.376781] drbd mb01: Wrong magic value 0x00000000 in protocol version 101
Nov  5 15:38:40 mbx02 kernel: [513179.376903] drbd mb01: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) 
---

I noticed that (as mentioned in some earlier post) during mkfs.ext4 the
drbd_a_<resource> process on the primary node goes to 100% when writing
the inode tables. If the watchdog (/proc/sys/kernel/nmi_watchdog) is
enabled (which it is by default), this will trigger the watchdog process
of the respective CPU to go berserk, I've seen 300% CPU usage there. OTOH,
I didn't notice any ill effects. 
Well, it does this with the watchdog disabled:
---
Nov  5 16:30:39 mbx01 kernel: [ 3842.517383] INFO: rcu_sched self-detected stall on CPU { 1}  (t=15001 jiffies g=93507 c=93506 q=26655)
Nov  5 16:30:39 mbx01 kernel: [ 3842.517500] sending NMI to all CPUs:
---
Culprit being the drbd_a_<resource> process...
But just to make sure I disabled it when running into the problem above,
but it made no difference at all.

So the obvious question who's to blame, 3.10 or DRBD 8.4.4 or that
combination?

If there's anything further to test and try, let me know, these machines
don't have to go back into production for another 2-3 weeks.

Next thing I will test is 8.4.3 with that kernel, then the same version of
3.4 and 8.4.3 that works flawlessly on the other clusters.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/