[DRBD-user] Changing server's IP's

Mon Mar 12 19:05:15 CET 2012

Marcelo,

I ran into trouble on Ubuntu Server 10.04 with the e1000e driver
resetting the adapter.  I updated the driver to 1.6.3 at first, and
things have been fairly smooth since then.

However, I just looked through the logs and saw this:

[4542575.651842] e1000e 0000:01:00.0: eth1: Detected Hardware Unit Hang:
[4542575.651845]   TDH                  <29>
[4542575.651847]   TDT                  <2a>
[4542575.651848]   next_to_use          <2a>
[4542575.651850]   next_to_clean        <28>
[4542575.651851] buffer_info[next_to_clean]:
[4542575.651852]   time_stamp           <11b2455d4>
[4542575.651854]   next_to_watch        <29>
[4542575.651855]   jiffies              <11b245642>
[4542575.651858]   next_to_watch.status <0>
[4542575.651860] MAC Status             <80383>
[4542575.651862] PHY Status             <792d>
[4542575.651864] PHY 1000BASE-T Status  <3800>
[4542575.651866] PHY Extended Status    <3000>
[4542575.651869] PCI Status             <10>
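
For anyone who wants to pull numbers out of a dump like the one above,
here is a rough Python sketch. It assumes every register line ends in a
bare hex value inside angle brackets, and that time_stamp holds the
jiffies value from when the stuck buffer was queued (my reading of the
driver, not something the log itself states). The names parse_hang_dump
and FIELD_RE are made up for this example. TDH and TDT are the transmit
descriptor head and tail indices.

```python
import re

# A register line looks like "[timestamp]   name   <hex>".
# The value inside <> is hexadecimal without a 0x prefix.
FIELD_RE = re.compile(r"\]\s+(.+?)\s+<([0-9a-fA-F]+)>\s*$")


def parse_hang_dump(text):
    """Return {field name: integer value} for every '<hex>' line."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


# A few lines copied from the dump above.
dump = """\
[4542575.651845]   TDH                  <29>
[4542575.651847]   TDT                  <2a>
[4542575.651850]   next_to_clean        <28>
[4542575.651852]   time_stamp           <11b2455d4>
[4542575.651855]   jiffies              <11b245642>
"""

fields = parse_hang_dump(dump)

# Jiffies elapsed between queuing the buffer and the hang report
# (assumption: time_stamp is a jiffies value, see the note above).
stalled_for = fields["jiffies"] - fields["time_stamp"]
print(stalled_for)
```

Turning the jiffies difference into seconds depends on the kernel's HZ
setting, which I have not checked here.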

Now, this was half of a bonded pair, so services remained up, but I
was kind of bummed to see that it still happened.

With that being said, the adapter _recovered_, which is something that
it did not do with the in-tree driver.  I'd have to unplug the cable,
plug it in again and reconfigure the interface to get things working
again in that situation.

I have the following adapters in this server pair:

00:19.0 Ethernet controller: Intel Corporation 82578DC Gigabit Network
Connection (rev 06)
01:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
Ethernet Controller (rev 06)
01:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
Ethernet Controller (rev 06)

...and I'm running a vanilla 3.0.4 kernel with the SCST 2.2 zero-copy
patches and DRBD 8.3.12.

Looking through older logs, it appears as if there have been a handful
of hangs, always with eth1.
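
If you want to check your own logs the same way, a small Python sketch
along these lines should do. It assumes the hang message has the same
format as the line shown above; count_hangs and HANG_RE are names
invented for this example.

```python
import re
from collections import Counter

# Matches e.g. "e1000e 0000:01:00.0: eth1: Detected Hardware Unit Hang:"
# and captures the interface name (eth1).
HANG_RE = re.compile(r"e1000e \S+ (\w+): Detected Hardware Unit Hang")


def count_hangs(lines):
    """Count 'Detected Hardware Unit Hang' messages per interface."""
    counts = Counter()
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines taken from the log excerpt above.
sample = [
    "[4542575.651842] e1000e 0000:01:00.0: eth1: Detected Hardware Unit Hang:",
    "[4542575.651845]   TDH                  <29>",
]
print(count_hangs(sample))
```

count_hangs accepts any iterable of lines, so an open log file can be
passed in directly.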

So, while I would like to be more helpful, it appears that I'm in the
same boat as you are.  :)  The only thing saving me right now is the
fact that the connection between the servers is bonded.

-M

On Mon, Mar 12, 2012 at 12:06 PM, Marcelo Pereira <marcelops at gmail.com> wrote:
> Hi Mark,
>
> Please, tell me more about this.
>
> I have been struggling with my e1000e NIC, and the cluster isn't syncing at
> all. It gets to 2.8% (out of 16 TB) and the connection simply drops.
> That is why I want to use another NIC, one that uses a chipset other than
> this e1000e.
>
> I have already updated the driver to 1.9.5, and it's exactly what you said:
> "random nic resets". I suspected (1) the cables, (2) the switch ports
> and finally (3) the switch itself. I have replaced all of this hardware and
> still can't sync.
>
> Tell me more about this issue with the Intel e1000e, please!
>
> Thanks,
> Marcelo
>
> On Mon, Mar 12, 2012 at 10:41 AM, Mark Deneen <mdeneen at gmail.com> wrote:
>>
>> Also, please check your e1000e driver version and update to the latest
>> stable version (out of tree).  With the in-tree revision, we suffered
>> from random nic resets under certain conditions, which really gave
>> DRBD headaches.
>>
>> -M
>>
>> On Fri, Mar 9, 2012 at 11:37 AM, Marcelo Pereira <marcelops at gmail.com>
>> wrote:
>> > Hello guys,
>> >
>> > I have been facing a problem with a server that can't sync at all due to
>> > network issues. The NICs (two Intel e1000e) are showing several errors
>> > and are being dropped, sometimes on the primary node, sometimes on the
>> > secondary. It will not sync.
>> >
>> > I have an unused NIC on both servers, and I would like to use them
>> > instead. What should I do to move the configuration from one set of IPs
>> > to another?
>> >
>> > Here is my current /etc/drbd.conf:
>> >
>> > global { usage-count no; }
>> > resource pdc0 {
>> >   protocol C;
>> >   startup { wfc-timeout 0; degr-wfc-timeout 120; }
>> >   disk { on-io-error detach; } # ACK!
>> >   net { cram-hmac-alg "sha1"; shared-secret "xxxxxxx"; }
>> >   syncer { rate 10M; verify-alg sha1; }
>> >   on node0 {
>> >     device /dev/drbd0;
>> >     disk /dev/sdb;
>> >     address 192.168.69.1:7788;
>> >     meta-disk internal;
>> >   }
>> >   on node1 {
>> >     device /dev/drbd0;
>> >     disk /dev/sdc;
>> >     address 192.168.69.2:7788;
>> >     meta-disk internal;
>> >   }
>> > }
>> >
>> > And this is what it should look like afterwards:
>> >
>> > global { usage-count no; }
>> > resource pdc0 {
>> >   protocol C;
>> >   startup { wfc-timeout 0; degr-wfc-timeout 120; }
>> >   disk { on-io-error detach; } # ACK!
>> >   net { cram-hmac-alg "sha1"; shared-secret "xxxxxxx"; }
>> >   syncer { rate 10M; verify-alg sha1; }
>> >   on node0 {
>> >     device /dev/drbd0;
>> >     disk /dev/sdb;
>> >     address 10.0.0.1:7788;
>> >     meta-disk internal;
>> >   }
>> >   on node1 {
>> >     device /dev/drbd0;
>> >     disk /dev/sdc;
>> >     address 10.0.0.2:7788;
>> >     meta-disk internal;
>> >   }
>> > }
>> >
>> > Is it enough to just change /etc/drbd.conf on both servers? What is the
>> > exact procedure? I'm using DRBD v8.2.6.
>> >
>> > Thanks,
>> > Marcelo
>> >
>
>
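
The config edit asked about in the original question (192.168.69.x to
10.0.0.x, same port) can be sketched in Python as below. This only
rewrites the text of the "address" lines. It is not the DRBD procedure
for applying the change to a running resource, which this sketch does
not cover. remap_addresses and ADDRESS_RE are names invented for this
example.

```python
import re

# Matches "address <ipv4>:<port>;" and keeps everything except the IP.
ADDRESS_RE = re.compile(r"(\baddress\s+)(\d+\.\d+\.\d+\.\d+)(:\d+\s*;)")


def remap_addresses(conf_text, ip_map):
    """Return conf_text with each 'address' IP replaced per ip_map.

    IPs that are not keys of ip_map are left untouched.
    """
    def swap(match):
        old_ip = match.group(2)
        return match.group(1) + ip_map.get(old_ip, old_ip) + match.group(3)

    return ADDRESS_RE.sub(swap, conf_text)


# Trimmed-down version of the "on" sections from the quoted config.
old_conf = """\
  on node0 {
    address 192.168.69.1:7788;
  }
  on node1 {
    address 192.168.69.2:7788;
  }
"""

new_conf = remap_addresses(
    old_conf,
    {"192.168.69.1": "10.0.0.1", "192.168.69.2": "10.0.0.2"},
)
print(new_conf)
```

Both nodes need the same resulting file, since each one reads the
address of its peer from it.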