Question: are the CentOS 5.3 kernel-xen kernels known to conflict with DRBD?

I've clearly "got a problem", and a bad enough one that I really have no choice but to quit replicating via DRBD for the short term :(

I've just had my second "cascading crash". The older one occurred when I turned off checksums on a secondary. The recent one I haven't had a chance to look at yet; there is an oops logged which I will need to look into.

The boxes which are crashing are dual-socket, dual-core Opterons running the current CentOS 5.3 kernel-xen. They've been in production for well over a year and have been rock solid. Unfortunately I can't say that for the recent xen/kernel/drbd combination.

All three of these nodes are running drbd 8.3.1, compiled locally with "make rpm".

# uname -a
Linux gt5.baremetal.com 2.6.18-128.1.6.el5xen #1 SMP Wed Apr 1 10:38:05 EDT 2009 i686 athlon i386 GNU/Linux

# rpm -qi kernel-xen
Name        : kernel-xen      Relocations: (not relocatable)
Version     : 2.6.18          Vendor: CentOS
Release     : 128.1.6.el5     Build Date: Wed 01 Apr 2009 08:51:53 AM PDT
Install Date: Thu 02 Apr 2009 09:33:35 PM PDT      Build Host: builder10.centos.

Unfortunately, the clocks are not synchronized on these two machines.

The machines:

gt5. Dual socket, dual core Opteron (4 cores total), 16 gig RAM.
  primary for bk4-sys and bk4-dat resources
  secondary for pd4-sys and pd4-dat resources

gt6. Same hardware as gt5.
  primary for pd4-sys and pd4-dat
  primary for nineteen-sys and nineteen-dat

sas1. Single socket, dual core Xeon, 4 gig RAM.
  secondary for nineteen-sys and nineteen-dat
  This box DOESN'T CRASH. (implicating kernel-xen?)

[root at sas1 ~]# uname -a
Linux sas1.baremetal.com 2.6.18-128.1.6.el5PAE #1 SMP Wed Apr 1 10:02:22 EDT 2009 i686 i686 i386 GNU/Linux

[root at sas1 ~]# rpm -qi kernel-PAE-2.6.18-128.1.6.el5
Name        : kernel-PAE      Relocations: (not
Version     : 2.6.18          Vendor: CentOS
Release     : 128.1.6.el5     Build Date: Wed 01 Apr 2009 08:51:53 AM PDT
Install Date: Thu 02 Apr 2009 09:45:42 PM PDT      Build Host: builder10.centos.org

Most recent crash. It appears to have started on gt6, but by the time I got paged both gt5 and gt6 had crashed and rebooted.

Apr 16 12:05:57 gt6 kernel: BUG: unable to handle kernel paging request at virtual address d6ef9000
Apr 16 12:05:57 gt6 kernel: printing eip:
Apr 16 12:05:57 gt6 kernel: ee20d19a
Apr 16 12:05:57 gt6 kernel: 10414000 -> *pde = 00000002:700e1001
Apr 16 12:05:57 gt6 kernel: 0fc1b000 -> *pme = 00000000:3c0ae067
Apr 16 12:05:57 gt6 kernel: 000ae000 -> *pte = 00000000:00000000
Apr 16 12:05:57 gt6 kernel: Oops: 0000 [#1]
(this would be a crash...)
Apr 16 12:06:38 gt6 syslogd 1.4.1: restart.

Apr 16 12:06:15 gt5 kernel: drbd1: PingAck did not arrive in time.
Apr 16 12:06:15 gt5 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 16 12:06:15 gt5 kernel: drbd1: asender terminated
Apr 16 12:06:15 gt5 kernel: drbd1: Terminating asender thread
Apr 16 12:06:15 gt5 kernel: drbd1: short read expecting header on sock: r=-512
Apr 16 12:06:15 gt5 kernel: drbd1: Creating new current UUID
Apr 16 12:06:15 gt5 kernel: drbd1: Connection closed
Apr 16 12:06:15 gt5 kernel: drbd1: conn( NetworkFailure -> Unconnected )
Apr 16 12:06:15 gt5 kernel: drbd1: receiver terminated
Apr 16 12:06:15 gt5 kernel: drbd1: Restarting receiver thread
Apr 16 12:06:15 gt5 kernel: drbd1: receiver (re)started
Apr 16 12:06:15 gt5 kernel: drbd1: conn( Unconnected -> WFConnection )
(this would be a crash...)
Apr 16 12:07:36 gt5 syslogd 1.4.1: restart.

That said, it may have been caused/triggered/complicated by my running a verify on gt6....
I did this because I've been seeing integrity failures, and I believe they are false positives, probably buffers being changed "in flight". Anyhow, I'd turned off the integrity checks last night and wanted to prove to myself that nothing was being corrupted. But that's a thread for another post I haven't made yet (I was collecting data for it).

Apr 16 11:29:42 sas1 kernel: drbd5: conn( Connected -> VerifyT )
Apr 16 12:05:03 sas1 kernel: drbd4: PingAck did not arrive in time.
Apr 16 12:05:03 sas1 kernel: drbd4: peer( Primary -> Unknown ) conn( Connected -
Apr 16 12:05:03 sas1 kernel: drbd4: asender terminated
(same for drbd5...)

-----------------------

This other (older) crash is "more interesting". I turned off checksums on sas1, and that apparently caused the network card to vanish for about 20 seconds... which apparently pissed off (and crashed) gt6, which then cascaded to crashing gt5.

Note that the network card took closer to 20 seconds to change configs than the 3 seconds shown in the log.

Apr 11 23:26:32 sas1 kernel: eth0: TSO is Disabled
Apr 11 23:26:35 sas1 kernel: eth0: Link is Up 1000 Mbps Full Duplex, Flow Contro
Apr 11 23:26:41 sas1 kernel: drbd4: PingAck did not arrive in time.
Apr 11 23:26:41 sas1 kernel: drbd4: peer( Primary -> Unknown ) conn( Connected -
Apr 11 23:26:41 sas1 kernel: drbd4: asender terminated
Apr 11 23:26:41 sas1 kernel: drbd4: Terminating asender thread
Apr 11 23:26:41 sas1 kernel: drbd4: short read expecting header on sock: r=-512
Apr 11 23:26:41 sas1 kernel: drbd4: Connection closed
Apr 11 23:26:41 sas1 kernel: drbd4: conn( NetworkFailure -> Unconnected )
Apr 11 23:26:41 sas1 kernel: drbd4: receiver terminated
Apr 11 23:26:41 sas1 kernel: drbd4: Restarting receiver thread
Apr 11 23:26:41 sas1 kernel: drbd4: receiver (re)started
Apr 11 23:26:41 sas1 kernel: drbd4: conn( Unconnected -> WFConnection )
Apr 11 23:26:42 sas1 kernel: drbd5: PingAck did not arrive in time.
Apr 11 23:26:42 sas1 kernel: drbd5: peer( Primary -> Unknown ) conn( Connected -
Apr 11 23:26:42 sas1 kernel: drbd5: asender terminated
Apr 11 23:26:42 sas1 kernel: drbd5: Terminating asender thread
Apr 11 23:26:42 sas1 kernel: drbd5: short read expecting header on sock: r=-512
Apr 11 23:26:42 sas1 kernel: drbd5: Connection closed
Apr 11 23:26:42 sas1 kernel: drbd5: conn( NetworkFailure -> Unconnected )
Apr 11 23:26:42 sas1 kernel: drbd5: receiver terminated
Apr 11 23:26:42 sas1 kernel: drbd5: Restarting receiver thread
Apr 11 23:26:42 sas1 kernel: drbd5: receiver (re)started
Apr 11 23:26:42 sas1 kernel: drbd5: conn( Unconnected -> WFConnection )
(no reboot... it just did the expected, waiting for the primary to re-appear).
Apr 11 23:33:22 sas1 kernel: drbd4: Handshake successful: Agreed network protoco

Apr 11 23:28:22 gt6 kernel: drbd4: PingAck did not arrive in time.
Apr 11 23:28:22 gt6 kernel: drbd4: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 11 23:28:22 gt6 kernel: drbd4: asender terminated
Apr 11 23:28:22 gt6 kernel: drbd4: Terminating asender thread
Apr 11 23:28:22 gt6 kernel: drbd4: short read expecting header on sock: r=-512
Apr 11 23:28:22 gt6 kernel: drbd4: Creating new current UUID
Apr 11 23:28:22 gt6 kernel: drbd4: Writing meta data super block now.
Apr 11 23:28:22 gt6 kernel: drbd4: tl_clear()
Apr 11 23:28:22 gt6 kernel: drbd4: Connection closed
Apr 11 23:28:22 gt6 kernel: drbd4: conn( NetworkFailure -> Unconnected )
Apr 11 23:28:22 gt6 kernel: drbd4: receiver terminated
Apr 11 23:28:22 gt6 kernel: drbd4: receiver (re)started
Apr 11 23:28:22 gt6 kernel: drbd4: conn( Unconnected -> WFConnection )
Apr 11 23:28:22 gt6 kernel: drbd5: PingAck did not arrive in time.
Apr 11 23:28:22 gt6 kernel: drbd5: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 11 23:28:22 gt6 kernel: drbd5: asender terminated
Apr 11 23:28:22 gt6 kernel: drbd5: Terminating asender thread
Apr 11 23:28:22 gt6 kernel: drbd5: short read expecting header on sock: r=-512
Apr 11 23:28:22 gt6 kernel: drbd5: Creating new current UUID
Apr 11 23:28:22 gt6 kernel: drbd5: Writing meta data super block now.
Apr 11 23:28:22 gt6 kernel: drbd5: tl_clear()
Apr 11 23:28:22 gt6 kernel: drbd5: Connection closed
Apr 11 23:28:22 gt6 kernel: drbd5: conn( NetworkFailure -> Unconnected )
Apr 11 23:28:22 gt6 kernel: drbd5: receiver terminated
Apr 11 23:28:22 gt6 kernel: drbd5: receiver (re)started
Apr 11 23:28:22 gt6 kernel: drbd5: conn( Unconnected -> WFConnection )
(this would be a crash...)
Apr 11 23:29:17 gt6 syslogd 1.4.1: restart.

Apr 11 23:27:42 gt5 kernel: peth0: received packet with own address as source a
Apr 11 23:27:42 gt5 kernel: peth0: received packet with own address as source a
Apr 11 23:27:49 gt5 kernel: drbd1: PingAck did not arrive in time.
Apr 11 23:27:49 gt5 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected
Apr 11 23:27:49 gt5 kernel: drbd1: asender terminated
Apr 11 23:27:49 gt5 kernel: drbd1: Terminating asender thread
Apr 11 23:27:49 gt5 kernel: drbd1: short read expecting header on sock: r=-512
Apr 11 23:27:49 gt5 kernel: drbd1: Creating new current UUID
Apr 11 23:27:49 gt5 kernel: drbd1: Connection closed
Apr 11 23:27:49 gt5 kernel: drbd1: conn( NetworkFailure -> Unconnected )
Apr 11 23:27:49 gt5 kernel: drbd1: receiver terminated
Apr 11 23:27:49 gt5 kernel: drbd1: Restarting receiver thread
Apr 11 23:27:49 gt5 kernel: drbd1: receiver (re)started
Apr 11 23:27:49 gt5 kernel: drbd1: conn( Unconnected -> WFConnection )
(this would be a crash...)
Apr 11 23:29:29 gt5 syslogd 1.4.1: restart.

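
For anyone who wants to pull the state changes out of logs like the ones above, here is a minimal sketch in Python. It assumes the syslog layout shown in this post ("Mon DD HH:MM:SS host kernel: drbdN: message"). The names LINE_RE, TRANSITION_RE and parse_transitions are made up for illustration and are not part of DRBD or any DRBD tool. Groups cut off by the terminal width, such as "conn( Connected -", are skipped rather than guessed at. Since the clocks on these hosts are not synchronized, the timestamps are only comparable within one host.

```python
import re

# Assumed layout: "Apr 16 12:06:15 gt5 kernel: drbd1: <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+"
    r"(?P<host>\S+)\s+kernel:\s+(?P<dev>drbd\d+):\s+(?P<msg>.*)$"
)

# One complete "field( old -> new )" group. A group truncated before
# the "->" or before the closing paren does not match and is ignored.
TRANSITION_RE = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def parse_transitions(lines):
    """Return (stamp, host, dev, field, old, new) for each complete transition."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a drbd kernel line (e.g. a syslogd restart)
        for field, old, new in TRANSITION_RE.findall(m.group("msg")):
            found.append(
                (m.group("stamp"), m.group("host"), m.group("dev"), field, old, new)
            )
    return found


# Sample lines copied from the logs in this post.
sample = [
    "Apr 16 12:06:15 gt5 kernel: drbd1: PingAck did not arrive in time.",
    "Apr 16 12:06:15 gt5 kernel: drbd1: peer( Secondary -> Unknown ) "
    "conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )",
    "Apr 16 12:05:03 sas1 kernel: drbd4: peer( Primary -> Unknown ) conn( Connected -",
    "Apr 16 12:06:38 gt6 syslogd 1.4.1: restart.",
]

for row in parse_transitions(sample):
    print(row)
```

Grouping the resulting tuples by host and device gives a per-resource timeline, which makes it easier to see which node lost its peer first within its own clock.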
----------------------------------------------------------------------
tbrown at BareMetal.com | Always bear in mind that your own resolution to
http://BareMetal.com/   | success is more important than any other one
web hosting since '95   | thing. - Abraham Lincoln