[DRBD-user] kernel crash when secondary disappears. centos 5.3 kernel-xen issue?

Lars Ellenberg lars.ellenberg at linbit.com
Fri Apr 17 01:02:15 CEST 2009



On Thu, Apr 16, 2009 at 01:38:10PM -0700, Tom Brown wrote:
> 
> Question. Are the centos 5.3 kernel-xen kernels known to conflict
> with DRBD? I've clearly "got a problem". A bad enough one that I
> really have no choice but to quit replicating via drbd for the
> short term :(
> 
> I've just had my second "cascading crash". The old one occurred when I
> turned off checksums on a secondary... the recent one I haven't had a
> chance to look at, there is an oops logged which I will need to
> look into.
> 
> The boxes which are crashing are dual socket dual-core opterons,
> running the current centOS 5.3 kernel-xen. They've been in
> production for well over a year, and been rock solid...
> unfortunately I can't say that for the recent xen/kernel/drbd
> combination.
> 
> all three of these nodes are running drbd 8.3.1 compiled locally
> with "make rpm".
> 
>     # uname -a
>     Linux gt5.baremetal.com 2.6.18-128.1.6.el5xen #1 SMP Wed Apr 1
>     10:38:05 EDT 2009 i686 athlon i386 GNU/Linux
> 
>     # rpm -qi kernel-xen
>     Name        : kernel-xen                   Relocations: (not
>     relocatable)
>     Version     : 2.6.18                            Vendor: CentOS
>     Release     : 128.1.6.el5                   Build Date: Wed 01
>     Apr 2009 08:51:53 AM PDT
>     Install Date: Thu 02 Apr 2009 09:33:35 PM PDT      Build Host:
>     builder10.centos.
> 
> Unfortunately, the clocks are not synchronized on these two
> machines.
> 
> The machines:
> 
> gt5. dual socket, dual core opteron (4 cores total), 16 gig ram.
>    primary for bk4-sys and bk4-dat resources
>    secondary for pd4-sys and pd4-dat resources
> 
> gt6. same hardware as gt5.
>    primary for pd4-sys and pd4-dat
>    primary for nineteen-sys and nineteen-dat
> 
> sas1. single socket, dual core xeon, 4 gig ram.
>    secondary for nineteen-sys and nineteen-dat
>    This box DOESN'T CRASH. (implicating kernel-xen?)
> 
>     [root at sas1 ~]# uname -a
>     Linux sas1.baremetal.com 2.6.18-128.1.6.el5PAE #1 SMP Wed Apr 1
>     10:02:22 EDT 2009 i686 i686 i386 GNU/Linux
> 
>     [root at sas1 ~]# rpm -qi kernel-PAE-2.6.18-128.1.6.el5
>     Name        : kernel-PAE                   Relocations: (not
>     Version     : 2.6.18                            Vendor: CentOS
>     Release     : 128.1.6.el5                   Build Date: Wed 01
>     Apr 2009 08:51:53 AM PDT
>     Install Date: Thu 02 Apr 2009 09:45:42 PM PDT      Build Host:
>     builder10.centos.org
> 
> 
> Most recent crash. Appears to have started on gt6, but by the time
> I got paged both gt5 and gt6 had crashed and rebooted.
> 
> Apr 16 12:05:57 gt6 kernel: BUG: unable to handle kernel paging request at virtual address d6ef9000
> Apr 16 12:05:57 gt6 kernel:  printing eip:
> Apr 16 12:05:57 gt6 kernel: ee20d19a
> Apr 16 12:05:57 gt6 kernel: 10414000 -> *pde = 00000002:700e1001
> Apr 16 12:05:57 gt6 kernel: 0fc1b000 -> *pme = 00000000:3c0ae067
> Apr 16 12:05:57 gt6 kernel: 000ae000 -> *pte = 00000000:00000000
> Apr 16 12:05:57 gt6 kernel: Oops: 0000 [#1]
> (this would be a crash...)

Simon Graham explained an interesting effect when using scatter gather
and DRBD and xen...
you could try and tell drbd to no longer use zero copy send using sendpage,
but always do an actual data copy to the socket buffer, which should
avoid the described problem.  easiest way to do so: use DRBD protocol A,
and see if these crashes still occur.
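
For reference, a minimal sketch of that change in drbd.conf, using the
bk4-sys resource named in the report. Only the protocol line matters
here; the rest of the resource definition is assumed to stay exactly as
it is, and the same edit would apply to each of the other resources.

```
resource bk4-sys {
    # was "protocol C;" (or B) before the test; A is asynchronous
    protocol A;

    # ... existing disk/net/syncer sections and both "on <host>" blocks
    #     left unchanged (not shown in the original report) ...
}
```

Note the trade-off: with protocol A a write is reported complete once it
is on the local disk and handed to the local TCP send buffer, so the
secondary can lag behind by whatever is still in flight when a node
dies. Applying the change would typically mean running
"drbdadm adjust bk4-sys" on both nodes; that step is an assumption
about this setup, not something confirmed in the thread.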

up to now, you are the only one reporting this kind of behaviour,
and I don't think you are the only one using DRBD + Xen.
so this may also be something specific to your setup.

but as long as I don't have any backtraces, I can only speculate.
we would have to be able to reproduce this to actually debug it.

> Apr 16 12:06:38 gt6 syslogd 1.4.1: restart.
> 
> Apr 16 12:06:15 gt5 kernel: drbd1: PingAck did not arrive in time.
> Apr 16 12:06:15 gt5 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 16 12:06:15 gt5 kernel: drbd1: asender terminated
> Apr 16 12:06:15 gt5 kernel: drbd1: Terminating asender thread
> Apr 16 12:06:15 gt5 kernel: drbd1: short read expecting header on sock: r=-512
> Apr 16 12:06:15 gt5 kernel: drbd1: Creating new current UUID
> Apr 16 12:06:15 gt5 kernel: drbd1: Connection closed
> Apr 16 12:06:15 gt5 kernel: drbd1: conn( NetworkFailure -> Unconnected )
> Apr 16 12:06:15 gt5 kernel: drbd1: receiver terminated
> Apr 16 12:06:15 gt5 kernel: drbd1: Restarting receiver thread
> Apr 16 12:06:15 gt5 kernel: drbd1: receiver (re)started
> Apr 16 12:06:15 gt5 kernel: drbd1: conn( Unconnected -> WFConnection )
> (this would be a crash...)
> Apr 16 12:07:36 gt5 syslogd 1.4.1: restart.
> 
> that said, it may have been caused/triggered/complicated by my
> running a verify on gt6.... I did this because I've been seeing
> integrity failures, and I believe they are false positives,
> probably buffers being changed "in flight". Anyhow, I'd turned
> off the integrity checks last night and wanted to prove to myself
> that nothing was being corrupted. But that's a thread from
> another post I haven't made yet (I was collecting data for it).
> 
> Apr 16 11:29:42 sas1 kernel: drbd5: conn( Connected -> VerifyT )
> Apr 16 12:05:03 sas1 kernel: drbd4: PingAck did not arrive in time.
> Apr 16 12:05:03 sas1 kernel: drbd4: peer( Primary -> Unknown ) conn( Connected -
> Apr 16 12:05:03 sas1 kernel: drbd4: asender terminated
> (same for drbd5...)
> 
> 
> 
> -----------------------
> 
> This other (older) crash is "more interesting". I turned off
> checksums on sas1, and that apparently caused the network card to
> vanish for about 20 seconds... which apparently pissed off (and
> crashed) gt6, which then cascaded to crashing gt5.
> 
> Note that the network card took closer to 20 seconds than the 3
> seconds shown in the log to change configs.
> 
> Apr 11 23:26:32 sas1 kernel: eth0: TSO is Disabled
> Apr 11 23:26:35 sas1 kernel: eth0: Link is Up 1000 Mbps Full Duplex, Flow Contro
> Apr 11 23:26:41 sas1 kernel: drbd4: PingAck did not arrive in time.
> Apr 11 23:26:41 sas1 kernel: drbd4: peer( Primary -> Unknown ) conn( Connected -
> Apr 11 23:26:41 sas1 kernel: drbd4: asender terminated
> Apr 11 23:26:41 sas1 kernel: drbd4: Terminating asender thread
> Apr 11 23:26:41 sas1 kernel: drbd4: short read expecting header on sock: r=-512
> Apr 11 23:26:41 sas1 kernel: drbd4: Connection closed
> Apr 11 23:26:41 sas1 kernel: drbd4: conn( NetworkFailure -> Unconnected )
> Apr 11 23:26:41 sas1 kernel: drbd4: receiver terminated
> Apr 11 23:26:41 sas1 kernel: drbd4: Restarting receiver thread
> Apr 11 23:26:41 sas1 kernel: drbd4: receiver (re)started
> Apr 11 23:26:41 sas1 kernel: drbd4: conn( Unconnected -> WFConnection )
> Apr 11 23:26:42 sas1 kernel: drbd5: PingAck did not arrive in time.
> Apr 11 23:26:42 sas1 kernel: drbd5: peer( Primary -> Unknown ) conn( Connected -
> Apr 11 23:26:42 sas1 kernel: drbd5: asender terminated
> Apr 11 23:26:42 sas1 kernel: drbd5: Terminating asender thread
> Apr 11 23:26:42 sas1 kernel: drbd5: short read expecting header on sock: r=-512
> Apr 11 23:26:42 sas1 kernel: drbd5: Connection closed
> Apr 11 23:26:42 sas1 kernel: drbd5: conn( NetworkFailure -> Unconnected )
> Apr 11 23:26:42 sas1 kernel: drbd5: receiver terminated
> Apr 11 23:26:42 sas1 kernel: drbd5: Restarting receiver thread
> Apr 11 23:26:42 sas1 kernel: drbd5: receiver (re)started
> Apr 11 23:26:42 sas1 kernel: drbd5: conn( Unconnected -> WFConnection )
> (no reboot... it just did the expected, waiting for the primary
> to re-appear).
> Apr 11 23:33:22 sas1 kernel: drbd4: Handshake successful: Agreed network protoco
> 
> Apr 11 23:28:22 gt6 kernel: drbd4: PingAck did not arrive in time.
> Apr 11 23:28:22 gt6 kernel: drbd4: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 11 23:28:22 gt6 kernel: drbd4: asender terminated
> Apr 11 23:28:22 gt6 kernel: drbd4: Terminating asender thread
> Apr 11 23:28:22 gt6 kernel: drbd4: short read expecting header on sock: r=-512
> Apr 11 23:28:22 gt6 kernel: drbd4: Creating new current UUID
> Apr 11 23:28:22 gt6 kernel: drbd4: Writing meta data super block now.
> Apr 11 23:28:22 gt6 kernel: drbd4: tl_clear()
> Apr 11 23:28:22 gt6 kernel: drbd4: Connection closed
> Apr 11 23:28:22 gt6 kernel: drbd4: conn( NetworkFailure -> Unconnected )
> Apr 11 23:28:22 gt6 kernel: drbd4: receiver terminated
> Apr 11 23:28:22 gt6 kernel: drbd4: receiver (re)started
> Apr 11 23:28:22 gt6 kernel: drbd4: conn( Unconnected -> WFConnection )
> Apr 11 23:28:22 gt6 kernel: drbd5: PingAck did not arrive in time.
> Apr 11 23:28:22 gt6 kernel: drbd5: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 11 23:28:22 gt6 kernel: drbd5: asender terminated
> Apr 11 23:28:22 gt6 kernel: drbd5: Terminating asender thread
> Apr 11 23:28:22 gt6 kernel: drbd5: short read expecting header on sock: r=-512
> Apr 11 23:28:22 gt6 kernel: drbd5: Creating new current UUID
> Apr 11 23:28:22 gt6 kernel: drbd5: Writing meta data super block now.
> Apr 11 23:28:22 gt6 kernel: drbd5: tl_clear()
> Apr 11 23:28:22 gt6 kernel: drbd5: Connection closed
> Apr 11 23:28:22 gt6 kernel: drbd5: conn( NetworkFailure -> Unconnected )
> Apr 11 23:28:22 gt6 kernel: drbd5: receiver terminated
> Apr 11 23:28:22 gt6 kernel: drbd5: receiver (re)started
> Apr 11 23:28:22 gt6 kernel: drbd5: conn( Unconnected -> WFConnection )
> (this would be a crash...)
> Apr 11 23:29:17 gt6 syslogd 1.4.1: restart.
> 
> 
> Apr 11 23:27:42 gt5 kernel: peth0: received packet with  own address as source a
> Apr 11 23:27:42 gt5 kernel: peth0: received packet with  own address as source a
> Apr 11 23:27:49 gt5 kernel: drbd1: PingAck did not arrive in time.
> Apr 11 23:27:49 gt5 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected
> Apr 11 23:27:49 gt5 kernel: drbd1: asender terminated
> Apr 11 23:27:49 gt5 kernel: drbd1: Terminating asender thread
> Apr 11 23:27:49 gt5 kernel: drbd1: short read expecting header on sock: r=-512
> Apr 11 23:27:49 gt5 kernel: drbd1: Creating new current UUID
> Apr 11 23:27:49 gt5 kernel: drbd1: Connection closed
> Apr 11 23:27:49 gt5 kernel: drbd1: conn( NetworkFailure -> Unconnected )
> Apr 11 23:27:49 gt5 kernel: drbd1: receiver terminated
> Apr 11 23:27:49 gt5 kernel: drbd1: Restarting receiver thread
> Apr 11 23:27:49 gt5 kernel: drbd1: receiver (re)started
> Apr 11 23:27:49 gt5 kernel: drbd1: conn( Unconnected -> WFConnection )
> (this would be a crash...)
> Apr 11 23:29:29 gt5 syslogd 1.4.1: restart.
> 
> 
> 
> ----------------------------------------------------------------------
> tbrown at BareMetal.com   | Always bear in mind that your own resolution to
> http://BareMetal.com/  | success is more important than any other one
> web hosting since '95  | thing. - Abraham Lincoln
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
> 

-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed