[DRBD-user] kernel crash when secondary disappears. centos 5.3 kernel-xen issue?

Tom Brown wc-linbit.com at vmail.baremetal.com
Thu Apr 16 22:38:10 CEST 2009



Question: are the CentOS 5.3 kernel-xen kernels known to conflict
with DRBD? I've clearly "got a problem", one bad enough that I
really have no choice but to quit replicating via drbd for the
short term :(

I've just had my second "cascading crash". The older one occurred when I
turned off checksums on a secondary. I haven't had a chance to look at
the recent one yet; there is an oops logged which I will need to look
into.

The boxes which are crashing are dual socket dual-core opterons,
running the current CentOS 5.3 kernel-xen. They've been in
production for well over a year, and been rock solid...
unfortunately I can't say that for the recent xen/kernel/drbd
combination.

All three of these nodes are running drbd 8.3.1 compiled locally
with "make rpm".

    # uname -a
    Linux gt5.baremetal.com 2.6.18-128.1.6.el5xen #1 SMP Wed Apr 1 10:38:05 EDT 2009 i686 athlon i386 GNU/Linux

    # rpm -qi kernel-xen
    Name        : kernel-xen                   Relocations: (not relocatable)
    Version     : 2.6.18                            Vendor: CentOS
    Release     : 128.1.6.el5                   Build Date: Wed 01 Apr 2009 08:51:53 AM PDT
    Install Date: Thu 02 Apr 2009 09:33:35 PM PDT      Build Host: builder10.centos.

Unfortunately, the clocks are not synchronized on these two
machines.

The machines:

gt5. dual socket, dual core opteron (4 cores total), 16 gig ram.
   primary for bk4-sys and bk4-dat resources
   secondary for pd4-sys and pd4-dat resources

gt6. same hardware as gt5.
   primary for pd4-sys and pd4-dat
   primary for nineteen-sys and nineteen-dat

sas1. single socket, dual core xeon, 4 gig ram.
   secondary for nineteen-sys and nineteen-dat
   This box DOESN'T CRASH. (implicating kernel-xen?)

    [root@sas1 ~]# uname -a
    Linux sas1.baremetal.com 2.6.18-128.1.6.el5PAE #1 SMP Wed Apr 1 10:02:22 EDT 2009 i686 i686 i386 GNU/Linux

    [root@sas1 ~]# rpm -qi kernel-PAE-2.6.18-128.1.6.el5
    Name        : kernel-PAE                   Relocations: (not
    Version     : 2.6.18                            Vendor: CentOS
    Release     : 128.1.6.el5                   Build Date: Wed 01 Apr 2009 08:51:53 AM PDT
    Install Date: Thu 02 Apr 2009 09:45:42 PM PDT      Build Host: builder10.centos.org


Most recent crash. It appears to have started on gt6, but by the time
I got paged both gt5 and gt6 had crashed and rebooted.

Apr 16 12:05:57 gt6 kernel: BUG: unable to handle kernel paging request at virtual address d6ef9000
Apr 16 12:05:57 gt6 kernel:  printing eip:
Apr 16 12:05:57 gt6 kernel: ee20d19a
Apr 16 12:05:57 gt6 kernel: 10414000 -> *pde = 00000002:700e1001
Apr 16 12:05:57 gt6 kernel: 0fc1b000 -> *pme = 00000000:3c0ae067
Apr 16 12:05:57 gt6 kernel: 000ae000 -> *pte = 00000000:00000000
Apr 16 12:05:57 gt6 kernel: Oops: 0000 [#1]
(this would be a crash...)
Apr 16 12:06:38 gt6 syslogd 1.4.1: restart.

Apr 16 12:06:15 gt5 kernel: drbd1: PingAck did not arrive in time.
Apr 16 12:06:15 gt5 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 16 12:06:15 gt5 kernel: drbd1: asender terminated
Apr 16 12:06:15 gt5 kernel: drbd1: Terminating asender thread
Apr 16 12:06:15 gt5 kernel: drbd1: short read expecting header on sock: r=-512
Apr 16 12:06:15 gt5 kernel: drbd1: Creating new current UUID
Apr 16 12:06:15 gt5 kernel: drbd1: Connection closed
Apr 16 12:06:15 gt5 kernel: drbd1: conn( NetworkFailure -> Unconnected )
Apr 16 12:06:15 gt5 kernel: drbd1: receiver terminated
Apr 16 12:06:15 gt5 kernel: drbd1: Restarting receiver thread
Apr 16 12:06:15 gt5 kernel: drbd1: receiver (re)started
Apr 16 12:06:15 gt5 kernel: drbd1: conn( Unconnected -> WFConnection )
(this would be a crash...)
Apr 16 12:07:36 gt5 syslogd 1.4.1: restart.

That said, it may have been caused, triggered, or complicated by my
running a verify on gt6. I did this because I've been seeing
integrity failures, and I believe they are false positives,
probably buffers being changed "in flight". Anyhow, I'd turned
off the integrity checks last night and wanted to prove to myself
that nothing was being corrupted. But that's a topic for
another post I haven't made yet (I was collecting data for it).

Apr 16 11:29:42 sas1 kernel: drbd5: conn( Connected -> VerifyT )
Apr 16 12:05:03 sas1 kernel: drbd4: PingAck did not arrive in time.
Apr 16 12:05:03 sas1 kernel: drbd4: peer( Primary -> Unknown ) conn( Connected -
Apr 16 12:05:03 sas1 kernel: drbd4: asender terminated
(same for drbd5...)
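
For anyone wanting to line these state changes up across hosts, here is
a minimal sketch (not part of my actual tooling) that pulls the
"name( old -> new )" groups out of drbd kernel log lines like the ones
above. The function name parse_state_change and the two regular
expressions are illustrative assumptions based only on the line format
visible in these excerpts; groups that were cut off by the terminal
width (such as "conn( Connected -") are simply skipped.

```python
import re

# One "name( old -> new )" group, e.g. "conn( Connected -> NetworkFailure )".
# ASSUMPTION: the format is inferred from the log excerpts in this post only.
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")

# Syslog prefix such as "Apr 16 12:06:15 gt5 kernel: drbd1: ".
PREFIX = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<dev>drbd\d+):\s+(?P<msg>.*)$"
)


def parse_state_change(line):
    """Return (stamp, host, dev, {field: (old, new)}), or None if the line
    is not a drbd kernel message containing at least one complete transition."""
    m = PREFIX.match(line.strip())
    if m is None:
        return None
    changes = {
        name: (old, new)
        for name, old, new in TRANSITION.findall(m.group("msg"))
    }
    if not changes:
        return None
    return m.group("stamp"), m.group("host"), m.group("dev"), changes


sample = (
    "Apr 16 12:06:15 gt5 kernel: drbd1: peer( Secondary -> Unknown ) "
    "conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )"
)
print(parse_state_change(sample))
```

Lines such as "asender terminated" return None, so the function can be
run over a whole /var/log/messages excerpt and filtered on the result.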



-----------------------

This other (older) crash is "more interesting". I turned off
checksums on sas1, and that apparently caused the network card to
vanish for about 20 seconds... which apparently pissed off (and
crashed) gt6, which then cascaded to crashing gt5.

Note that the network card took closer to 20 seconds than the 3
seconds shown in the log to change configs.

Apr 11 23:26:32 sas1 kernel: eth0: TSO is Disabled
Apr 11 23:26:35 sas1 kernel: eth0: Link is Up 1000 Mbps Full Duplex, Flow Contro
Apr 11 23:26:41 sas1 kernel: drbd4: PingAck did not arrive in time.
Apr 11 23:26:41 sas1 kernel: drbd4: peer( Primary -> Unknown ) conn( Connected -
Apr 11 23:26:41 sas1 kernel: drbd4: asender terminated
Apr 11 23:26:41 sas1 kernel: drbd4: Terminating asender thread
Apr 11 23:26:41 sas1 kernel: drbd4: short read expecting header on sock: r=-512
Apr 11 23:26:41 sas1 kernel: drbd4: Connection closed
Apr 11 23:26:41 sas1 kernel: drbd4: conn( NetworkFailure -> Unconnected )
Apr 11 23:26:41 sas1 kernel: drbd4: receiver terminated
Apr 11 23:26:41 sas1 kernel: drbd4: Restarting receiver thread
Apr 11 23:26:41 sas1 kernel: drbd4: receiver (re)started
Apr 11 23:26:41 sas1 kernel: drbd4: conn( Unconnected -> WFConnection )
Apr 11 23:26:42 sas1 kernel: drbd5: PingAck did not arrive in time.
Apr 11 23:26:42 sas1 kernel: drbd5: peer( Primary -> Unknown ) conn( Connected -
Apr 11 23:26:42 sas1 kernel: drbd5: asender terminated
Apr 11 23:26:42 sas1 kernel: drbd5: Terminating asender thread
Apr 11 23:26:42 sas1 kernel: drbd5: short read expecting header on sock: r=-512
Apr 11 23:26:42 sas1 kernel: drbd5: Connection closed
Apr 11 23:26:42 sas1 kernel: drbd5: conn( NetworkFailure -> Unconnected )
Apr 11 23:26:42 sas1 kernel: drbd5: receiver terminated
Apr 11 23:26:42 sas1 kernel: drbd5: Restarting receiver thread
Apr 11 23:26:42 sas1 kernel: drbd5: receiver (re)started
Apr 11 23:26:42 sas1 kernel: drbd5: conn( Unconnected -> WFConnection )
(no reboot... it just did the expected, waiting for the primary
to re-appear).
Apr 11 23:33:22 sas1 kernel: drbd4: Handshake successful: Agreed network protoco

Apr 11 23:28:22 gt6 kernel: drbd4: PingAck did not arrive in time.
Apr 11 23:28:22 gt6 kernel: drbd4: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 11 23:28:22 gt6 kernel: drbd4: asender terminated
Apr 11 23:28:22 gt6 kernel: drbd4: Terminating asender thread
Apr 11 23:28:22 gt6 kernel: drbd4: short read expecting header on sock: r=-512
Apr 11 23:28:22 gt6 kernel: drbd4: Creating new current UUID
Apr 11 23:28:22 gt6 kernel: drbd4: Writing meta data super block now.
Apr 11 23:28:22 gt6 kernel: drbd4: tl_clear()
Apr 11 23:28:22 gt6 kernel: drbd4: Connection closed
Apr 11 23:28:22 gt6 kernel: drbd4: conn( NetworkFailure -> Unconnected )
Apr 11 23:28:22 gt6 kernel: drbd4: receiver terminated
Apr 11 23:28:22 gt6 kernel: drbd4: receiver (re)started
Apr 11 23:28:22 gt6 kernel: drbd4: conn( Unconnected -> WFConnection )
Apr 11 23:28:22 gt6 kernel: drbd5: PingAck did not arrive in time.
Apr 11 23:28:22 gt6 kernel: drbd5: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 11 23:28:22 gt6 kernel: drbd5: asender terminated
Apr 11 23:28:22 gt6 kernel: drbd5: Terminating asender thread
Apr 11 23:28:22 gt6 kernel: drbd5: short read expecting header on sock: r=-512
Apr 11 23:28:22 gt6 kernel: drbd5: Creating new current UUID
Apr 11 23:28:22 gt6 kernel: drbd5: Writing meta data super block now.
Apr 11 23:28:22 gt6 kernel: drbd5: tl_clear()
Apr 11 23:28:22 gt6 kernel: drbd5: Connection closed
Apr 11 23:28:22 gt6 kernel: drbd5: conn( NetworkFailure -> Unconnected )
Apr 11 23:28:22 gt6 kernel: drbd5: receiver terminated
Apr 11 23:28:22 gt6 kernel: drbd5: receiver (re)started
Apr 11 23:28:22 gt6 kernel: drbd5: conn( Unconnected -> WFConnection )
(this would be a crash...)
Apr 11 23:29:17 gt6 syslogd 1.4.1: restart.


Apr 11 23:27:42 gt5 kernel: peth0: received packet with  own address as source a
Apr 11 23:27:42 gt5 kernel: peth0: received packet with  own address as source a
Apr 11 23:27:49 gt5 kernel: drbd1: PingAck did not arrive in time.
Apr 11 23:27:49 gt5 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected
Apr 11 23:27:49 gt5 kernel: drbd1: asender terminated
Apr 11 23:27:49 gt5 kernel: drbd1: Terminating asender thread
Apr 11 23:27:49 gt5 kernel: drbd1: short read expecting header on sock: r=-512
Apr 11 23:27:49 gt5 kernel: drbd1: Creating new current UUID
Apr 11 23:27:49 gt5 kernel: drbd1: Connection closed
Apr 11 23:27:49 gt5 kernel: drbd1: conn( NetworkFailure -> Unconnected )
Apr 11 23:27:49 gt5 kernel: drbd1: receiver terminated
Apr 11 23:27:49 gt5 kernel: drbd1: Restarting receiver thread
Apr 11 23:27:49 gt5 kernel: drbd1: receiver (re)started
Apr 11 23:27:49 gt5 kernel: drbd1: conn( Unconnected -> WFConnection )
(this would be a crash...)
Apr 11 23:29:29 gt5 syslogd 1.4.1: restart.
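
Since the clocks are not synchronized, the three logs above can't be
read as one timeline without a per-host correction. A minimal sketch of
how one might merge them, assuming you can estimate each host's offset
some other way. The names parse_line and merge_timelines, the YEAR
constant, and the offset value in the usage lines are all hypothetical;
I have not measured the real skew between these boxes.

```python
from datetime import datetime, timedelta

# ASSUMPTION: syslog lines carry no year; 2009 is taken from the date of this post.
YEAR = 2009


def parse_line(line):
    """Split a syslog line into (timestamp, host, rest of line).

    Month names are parsed with %b, which expects English abbreviations
    under the default C/English locale.
    """
    month, day, clock, host, rest = line.split(None, 4)
    stamp = datetime.strptime(f"{YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return stamp, host, rest


def merge_timelines(lines, offsets):
    """Sort lines from several hosts by corrected timestamp.

    offsets maps host -> timedelta to ADD to that host's timestamps;
    hosts missing from the mapping are left uncorrected.
    """
    events = []
    for line in lines:
        stamp, host, rest = parse_line(line)
        events.append((stamp + offsets.get(host, timedelta(0)), host, rest))
    # sorted() is stable, so lines with equal corrected times keep input order.
    return sorted(events, key=lambda event: event[0])


lines = [
    "Apr 11 23:28:22 gt6 kernel: drbd4: PingAck did not arrive in time.",
    "Apr 11 23:26:41 sas1 kernel: drbd4: PingAck did not arrive in time.",
    "Apr 11 23:27:49 gt5 kernel: drbd1: PingAck did not arrive in time.",
]
# HYPOTHETICAL offset, for illustration only: pretend gt6's clock runs 60 s fast.
for stamp, host, rest in merge_timelines(lines, {"gt6": timedelta(seconds=-60)}):
    print(stamp.isoformat(sep=" "), host, rest)
```

With real offsets (for example from comparing `date` output on each box
at the same moment), the merged order would show which node actually
noticed the outage first.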



----------------------------------------------------------------------
tbrown at BareMetal.com   | Always bear in mind that your own resolution to
http://BareMetal.com/  | success is more important than any other one
web hosting since '95  | thing. - Abraham Lincoln


