[DRBD-user] restart of both servers after network failure ??? (large)

Mon May 11 11:45:38 CEST 2009

On Sun, May 10, 2009 at 05:41:51PM -0400, Victor Hugo dos Santos wrote:
> On Sat, May 9, 2009 at 6:20 AM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
> > On Fri, May 08, 2009 at 05:38:06PM -0400, Victor Hugo dos Santos wrote:
> 
> [...]
> 
> > please read all posts with the subject
> > "kernel crash when secondary disappears. centos 5.3 kernel-xen issue"
> > (late april this year) for possible causes and workarounds.
> 
> OK, I read all the posts (I only found 3 or 4?), in particular the
> mails from Simon Graham on this topic.

"Kernel panic in skb_copy_bits", when using DRBD on XEN kernel with
zero copy and scatter gather enabled (three messages in thread):
http://thread.gmane.org/gmane.linux.network.drbd/17295/focus=17297

"kernel crash when secondary disappears. centos 5.3 kernel-xen issue?"
(6 messages in thread)
http://thread.gmane.org/gmane.linux.network.drbd/17537

> as far as I understand, there are two options:
> 
> 1 - disable rx/tx checksum and sg for this interface.
> what consequences or problems should I expect?

none?
possibly slightly higher cpu overhead.
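
for reference, a minimal sketch of what "disable rx/tx checksum and sg"
could look like with ethtool. the interface name eth1 is only a placeholder
assumption (use whichever NIC carries your DRBD replication link), and the
function name and the DRY_RUN switch are made up for this illustration.
by default it only prints the command instead of running it:

```shell
#!/bin/sh
# Sketch: turn off RX/TX checksum offload and scatter-gather on the
# interface that carries DRBD replication traffic.
# ASSUMPTION: "eth1" is a placeholder default, not taken from the thread.
# ASSUMPTION: DRY_RUN and disable_offloads are illustrative names only.

disable_offloads() {
    iface="${1:-eth1}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only show what would be run.
        echo "ethtool -K $iface rx off tx off sg off"
    else
        # Needs root; changes take effect immediately.
        ethtool -K "$iface" rx off tx off sg off
    fi
}
```

to apply it for real (as root): DRY_RUN=0 disable_offloads eth1, then check
the result with "ethtool -k eth1". settings made this way do not survive a
reboot unless you also put them into your network configuration.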

> 2 - change DRBD protocol from C to A.
> but in this case (IMHO the easier one) I could lose data.
> in that scenario, I think I should schedule a verification job
> at short intervals, or not?

the reason for this change is that protocol A must not (for other reasons)
use zero copy IO, aka sendpage. so one of the ingredients necessary to
trigger the problem is missing.

drbd 8.3.2 will have a module parameter to disable zero copy (sendpage).

> two other questions:
> 1 - how can I test whether these changes work?
> is there a way to do a system crash test?

hard reboot? (echo b > /proc/sysrq-trigger)
less brutal: unplug cable?
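
a small sketch of the hard-reboot test, wrapped so that it cannot fire by
accident. the function name and the DRY_RUN switch are assumptions for
illustration only; the sysrq write itself is the one given above. run it on
the node you want to "crash" while the other node is connected and there is
write load on the DRBD device:

```shell
#!/bin/sh
# Sketch: simulate a node crash via the SysRq "b" trigger.
# "b" reboots immediately, without syncing or unmounting disks,
# so only use it on a test machine.
# ASSUMPTION: DRY_RUN and crash_test are illustrative names only.

crash_test() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only show what would be run.
        echo "echo b > /proc/sysrq-trigger"
    else
        # Needs root; the machine resets right here.
        echo b > /proc/sysrq-trigger
    fi
}
```

with DRY_RUN=0 crash_test (as root) the node goes down at once; afterwards
watch whether the surviving node stays up.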

> 2 - which of the options above is more recommended?

I'd try with all those offloading settings disabled first.
If that does not help yet, I'd apply the following mini patch to 8.3.1,
which simply makes the fallback method (plain send, no sendpage) the
default; keep using protocol B or C, then wait for 8.3.2 (and use the
module parameter there).

diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index 6edcb11..dce18ee 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -2122,7 +2122,7 @@ int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
 	 * doh. it triggered. so XFS _IS_ really kaputt ...
 	 * oh well...
 	 */
-	if ((page_count(page) < 1) || PageSlab(page)) {
+	if (1 || (page_count(page) < 1) || PageSlab(page)) {
 		/* e.g. XFS meta- & log-data is in slab pages, which have a
 		 * page_count of 0 and/or have PageSlab() set...
 		 */


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
