[DRBD-user] restart of both servers after network failure ??? (large)

Fri May 8 23:38:06 CEST 2009

On Fri, Feb 20, 2009 at 2:37 PM, Victor Hugo dos Santos
<listas.vhs at gmail.com> wrote:
> Hello,
>
> I have a problem with drbd-0.7.25 and drbd-0.8.2.6. My situation is:
>
> two Supermicro servers in company A, connected with a crossover cable and
> running CentOS 5.2 (all updates installed)
> two PowerEdge servers in company B, connected by network fiber between
> separate sites, with Citrix XenServer 4 installed.
>
> The problem is that from time to time both servers restart without
> apparent reason. The logs only show messages about a network failure,
> and after that the server restarts.
> In company A this problem occurred 2 or 3 times, and the last
> incident was 4 months ago.
> I had forgotten about this problem, because I thought it could be caused
> by the electrical power line in that company.
> But now, in company B, I have the same problem for the first time (after
> several months of working fine), and these servers are connected to a UPS line.
>
> Both groups of servers run a virtualization server, but from
> different vendors and with different configurations.
> Memory, disks and network work fine on all four servers, and the DRBD
> resource contains only data from the VMs, no files/data from the host server.
>
> I don't understand why the servers restart when they receive an error
> from the network !!???
> In case of a problem, I would expect the VMs to restart, but
> not the complete server.
>
> Below are the logs and config file of the two servers in company B...

Hello,

Once more I have the same problem, but at another company, with different
hardware and a different configuration !! :D

My problem: one node restarts (there are no errors in its log files), and
a couple of seconds later the other node restarts too. On the second node
I have these lines:

================
May  8 16:31:46 blueback kernel: drbd1: PingAck did not arrive in time.
May  8 16:31:46 blueback kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May  8 16:31:46 blueback kernel: drbd1: asender terminated
May  8 16:31:46 blueback kernel: drbd1: Terminating asender thread
May  8 16:31:46 blueback kernel: drbd1: short read expecting header on sock: r=-512
May  8 16:31:46 blueback kernel: drbd1: Creating new current UUID
May  8 16:31:46 blueback kernel: drbd0: PingAck did not arrive in time.
May  8 16:31:46 blueback kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May  8 16:31:46 blueback kernel: drbd0: asender terminated
May  8 16:31:46 blueback kernel: drbd0: Terminating asender thread
May  8 16:31:46 blueback kernel: drbd0: short read expecting header on sock: r=-512
May  8 16:38:19 blueback kernel: drbd: initialised. Version: 8.2.7 (api:88/proto:86-88)
May  8 16:38:19 blueback kernel: drbd: GIT-hash: 61b7f4c2fc34fe3d2acf7be6bcc1fc2684708a7d build by root at localhost.localdomain, 2009-02-19 13:47:49
================
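
To make the state transitions in the excerpt easier to compare between the two nodes, here is a minimal Python sketch that pulls them out of syslog lines. It assumes each kernel message sits on a single line (wrapped lines from the mail have to be joined first) and that the lines look like the ones shown above; the names parse_state_change, PREFIX and TRANSITION are made up for this illustration and are not part of DRBD.

```python
import re

# One DRBD state transition, e.g. "conn( Connected -> NetworkFailure )".
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")
# Syslog prefix as seen in the excerpt: "May  8 16:31:46 blueback kernel: drbd1: ..."
PREFIX = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: (drbd\d+): (.*)$")


def parse_state_change(line):
    """Return time, host, device and transitions for a state-change line, else None."""
    match = PREFIX.match(line)
    if match is None:
        return None
    timestamp, host, device, message = match.groups()
    # Map each field (peer, conn, pdsk, ...) to its (old, new) pair.
    changes = {field: (old, new) for field, old, new in TRANSITION.findall(message)}
    if not changes:
        return None
    return {"time": timestamp, "host": host, "device": device, "changes": changes}


# Sample line taken from the log excerpt, joined onto one line.
sample = (
    "May  8 16:31:46 blueback kernel: drbd1: peer( Secondary -> Unknown ) "
    "conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )"
)
event = parse_state_change(sample)
print(event["time"], event["device"], event["changes"])
```

Lines that carry no transition, such as the "PingAck did not arrive in time." message, return None, so the function can be applied to a whole log file and the None results filtered out.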

I don't understand the origin of the problem.
This is my current configuration:

- two PowerEdge R900 servers, with 64 GB RAM and 8 CPUs
- two DRBD resources (drbd0 and drbd1)
- one gigabit NIC dedicated to DRBD

These are the software versions:

XenServer 5.0 (update 2) with kernel 2.6.18-92.1.10.el5.xs5.0.0.404.646xen
DRBD Version: 8.2.7 (api:88)

and my drbd.conf is:
================
# drbdadm dump
# /etc/drbd.conf
resource disco1 {
    protocol               B;
    on sockeye.multiexportfoods.com {
        device           /dev/drbd0;
        disk             /dev/sdb1;
        address          ipv4 10.0.1.70:7788;
        meta-disk        internal;
    }
    on blueback.multiexportfoods.com {
        device           /dev/drbd0;
        disk             /dev/sdb1;
        address          ipv4 10.0.1.60:7788;
        meta-disk        internal;
    }
    disk {
        on-io-error      detach;
        max-bio-bvecs      1;
    }
    syncer {
        rate             50M;
        al-extents       257;
    }
    startup {
        wfc-timeout       60;
        degr-wfc-timeout  10;
    }
}

resource disco2 {
    protocol               B;
    on sockeye.multiexportfoods.com {
        device           /dev/drbd1;
        disk             /dev/sdb2;
        address          ipv4 10.0.1.70:7789;
        meta-disk        internal;
    }
    on blueback.multiexportfoods.com {
        device           /dev/drbd1;
        disk             /dev/sdb2;
        address          ipv4 10.0.1.60:7789;
        meta-disk        internal;
    }
    disk {
        on-io-error      detach;
        max-bio-bvecs      1;
    }
    syncer {
        rate              1G;
        al-extents       257;
    }
    startup {
        wfc-timeout       60;
        degr-wfc-timeout  10;
    }
}
================
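
As a rough sanity check on the two syncer rates above, the following Python sketch converts them to bytes per second and compares them with the raw capacity of one gigabit link. It assumes that DRBD 8.x reads "rate" in bytes per second with binary suffixes (K, M, G), and it ignores protocol overhead. The helper name and constants are invented for this example, and the result only shows how the configured values relate to the link; it does not say anything about the cause of the restarts.

```python
# Assumption: suffixes on the syncer "rate" are binary multiples of bytes/second.
SUFFIX = {"K": 2**10, "M": 2**20, "G": 2**30}

# Raw capacity of a 1 Gbit/s link in bytes per second, before any overhead.
LINK_BYTES_PER_SECOND = 1_000_000_000 // 8


def rate_to_bytes_per_second(rate):
    """Convert a value such as '50M' or '1G' to bytes per second (suffix required)."""
    number, suffix = rate[:-1], rate[-1].upper()
    return int(number) * SUFFIX[suffix]


# Rates copied from the two resources in the configuration dump.
rates = {"disco1": "50M", "disco2": "1G"}
for name, rate in rates.items():
    wanted = rate_to_bytes_per_second(rate)
    verdict = "above link capacity" if wanted > LINK_BYTES_PER_SECOND else "within link capacity"
    print(name, rate, wanted, verdict)
```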

Any ideas?? Please !! :-)

Thanks.

-- 
Victor Hugo dos Santos
Linux Counter #224399