Marcelo,
I had trouble on Ubuntu Server 10.04 with the e1000e driver resetting
the adapter. I first updated the driver to 1.6.3, and things have been
fairly smooth since then.
However, I just looked through the logs and saw this:
[4542575.651842] e1000e 0000:01:00.0: eth1: Detected Hardware Unit Hang:
[4542575.651845] TDH <29>
[4542575.651847] TDT <2a>
[4542575.651848] next_to_use <2a>
[4542575.651850] next_to_clean <28>
[4542575.651851] buffer_info[next_to_clean]:
[4542575.651852] time_stamp <11b2455d4>
[4542575.651854] next_to_watch <29>
[4542575.651855] jiffies <11b245642>
[4542575.651858] next_to_watch.status <0>
[4542575.651860] MAC Status <80383>
[4542575.651862] PHY Status <792d>
[4542575.651864] PHY 1000BASE-T Status <3800>
[4542575.651866] PHY Extended Status <3000>
[4542575.651869] PCI Status <10>
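For anyone reading a dump like this, here is a minimal Python sketch that pulls out a few of the fields above and does the arithmetic. It assumes the bracketed values are hexadecimal (they look it, e.g. <2a>) and it ignores ring wrap-around; the function and variable names are made up for illustration and are not part of the driver.

```python
import re

# A subset of the lines from the dump above, copied as-is.
HANG_DUMP = """\
[4542575.651845] TDH <29>
[4542575.651847] TDT <2a>
[4542575.651848] next_to_use <2a>
[4542575.651850] next_to_clean <28>
[4542575.651852] time_stamp <11b2455d4>
[4542575.651854] next_to_watch <29>
[4542575.651855] jiffies <11b245642>
"""

# Matches "] NAME <HEX>" for single-word field names.
FIELD = re.compile(r"\]\s+([A-Za-z_]+)\s+<([0-9a-fA-F]+)>")


def parse_hang_dump(text):
    """Return {field name: integer value}, reading values as hex (assumption)."""
    return {name: int(value, 16) for name, value in FIELD.findall(text)}


fields = parse_hang_dump(HANG_DUMP)

# Descriptors between the hardware head (TDH) and the software tail (TDT).
# Simple subtraction only; a real ring can wrap around.
pending = fields["TDT"] - fields["TDH"]

# How long the oldest unfinished buffer had been waiting, in jiffies.
stalled_jiffies = fields["jiffies"] - fields["time_stamp"]

print("pending descriptors:", pending)
print("stalled for", stalled_jiffies, "jiffies")
```

Converting the jiffies figure to seconds depends on the kernel's HZ setting, which is not stated here, so the sketch stops at jiffies.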
Now, this was half of a bonded pair, so services remained up, but I
was kind of bummed to see that it still happened.
With that being said, the adapter _recovered_, which is something that
it did not do with the in-tree driver. I'd have to unplug the cable,
plug it in again and reconfigure the interface to get things working
again in that situation.
I have the following adapters in this server pair:
00:19.0 Ethernet controller: Intel Corporation 82578DC Gigabit Network
Connection (rev 06)
01:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
Ethernet Controller (rev 06)
01:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
Ethernet Controller (rev 06)
...and I'm running kernel 3.0.4 vanilla, with scst 2.2 zero-copy
patches and drbd 8.3.12
Looking through older logs, it appears as if there have been a handful
of hangs, always with eth1.
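A quick way to do that kind of log check is sketched below in Python. It counts "Detected Hardware Unit Hang" lines per interface; the only sample input is the single line quoted above, and the helper name is hypothetical.

```python
import re
from collections import Counter

# Captures the interface name from an e1000e hang message.
HANG = re.compile(r"e1000e \S+: (\w+): Detected Hardware Unit Hang")


def count_hangs(lines):
    """Count hang messages per interface in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HANG.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The one hang line quoted earlier in this message.
sample = [
    "[4542575.651842] e1000e 0000:01:00.0: eth1: Detected Hardware Unit Hang:",
]
hangs = count_hangs(sample)
print(dict(hangs))
```

To run it over real logs, pass an open file object for whichever kernel log your system keeps; the location varies by distribution.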
So, while I would like to be more helpful, it appears as if I'm in a
similar boat as you are. :) The only thing that is saving me right
now is the fact that the connection between servers is bonded.
-M
On Mon, Mar 12, 2012 at 12:06 PM, Marcelo Pereira <marcelops at gmail.com> wrote:
> Hi Mark,
>
> Please, tell me more about this.
>
> I have been struggling with my e1000e NIC, and the cluster isn't syncing at
> all. It gets to 2.8% (out of 16 TB) and the connection simply drops.
> That is why I want to use another NIC, which uses a chipset other than this
> e1000e.
>
> I have already updated the driver to 1.9.5, and it's exactly what you said:
> "random nic resets". I suspected (1) the cables, (2) the switch ports
> and finally (3) the switch itself. I have replaced all of this hardware, and
> it still can't sync.
>
> Tell me more about this issue with the Intel e1000e, please!
>
> Thanks,
> Marcelo
>
> On Mon, Mar 12, 2012 at 10:41 AM, Mark Deneen <mdeneen at gmail.com> wrote:
>>
>> Also, please check your e1000e driver version and update to the latest
>> stable version (out of tree). With the in-tree revision, we suffered
>> from random nic resets under certain conditions, which really gave
>> DRBD headaches.
>>
>> -M
>>
>> On Fri, Mar 9, 2012 at 11:37 AM, Marcelo Pereira <marcelops at gmail.com>
>> wrote:
>> > Hello guys,
>> >
>> > I have been facing a problem with a server that can't sync at all due to
>> > network issues. The NIC (two Intel e1000e) is showing several errors and
>> > is being dropped, sometimes on the primary node, sometimes on the
>> > secondary. It will not sync.
>> >
>> > I have an unused NIC on both servers, and I would like to use them
>> > instead. What should I do to move the configuration from one set of IPs
>> > to another?
>> >
>> > So, that is my current /etc/drbd.conf:
>> >
>> > global { usage-count no; }
>> > resource pdc0 {
>> >   protocol C;
>> >   startup { wfc-timeout 0; degr-wfc-timeout 120; }
>> >   disk { on-io-error detach; } # ACK!
>> >   net { cram-hmac-alg "sha1"; shared-secret "xxxxxxx"; }
>> >   syncer { rate 10M; verify-alg sha1; }
>> >   on node0 {
>> >     device /dev/drbd0;
>> >     disk /dev/sdb;
>> >     address 192.168.69.1:7788;
>> >     meta-disk internal;
>> >   }
>> >   on node1 {
>> >     device /dev/drbd0;
>> >     disk /dev/sdc;
>> >     address 192.168.69.2:7788;
>> >     meta-disk internal;
>> >   }
>> > }
>> >
>> > And that is what it should be after:
>> >
>> > global { usage-count no; }
>> > resource pdc0 {
>> >   protocol C;
>> >   startup { wfc-timeout 0; degr-wfc-timeout 120; }
>> >   disk { on-io-error detach; } # ACK!
>> >   net { cram-hmac-alg "sha1"; shared-secret "xxxxxxx"; }
>> >   syncer { rate 10M; verify-alg sha1; }
>> >   on node0 {
>> >     device /dev/drbd0;
>> >     disk /dev/sdb;
>> >     address 10.0.0.1:7788;
>> >     meta-disk internal;
>> >   }
>> >   on node1 {
>> >     device /dev/drbd0;
>> >     disk /dev/sdc;
>> >     address 10.0.0.2:7788;
>> >     meta-disk internal;
>> >   }
>> > }
>> >
>> > Is it enough to just change the /etc/drbd.conf on both servers? What is
>> > the exact procedure? I'm using DRBD v8.2.6.
>> >
>> > Thanks,
>> > Marcelo
>> >