We have a problem with DRBD randomly disconnecting a volume after a few
days of synchronised operation. We run 5 other volumes on the same server;
however, the traffic on those volumes is significantly lower.
We got the following errors on the servers:
=== Server 1 Syslog ===
Feb 11 22:00:34 mailstore1 kernel: drbd2: [pdflush/170] sock_sendmsg time expired, ko = 3
Feb 11 22:00:39 mailstore1 kernel: drbd2: [pdflush/170] sock_sendmsg time expired, ko = 2
Feb 11 22:00:44 mailstore1 kernel: drbd2: [pdflush/170] sock_sendmsg time expired, ko = 1
Feb 11 22:00:49 mailstore1 kernel: drbd2: /var/tmp/bach-build/BUILD/drbd-0.7.21/drbd/drbd_main.c:1095: Connected flags=0x120a
Feb 11 22:00:49 mailstore1 kernel: drbd2: pdflush [170]: cstate Connected --> NetworkFailure
=== Server 1 dmesg ===
drbd2: drbd2_receiver [23259]: cstate NetworkFailure --> BrokenPipe
drbd2: short read expecting header on sock: r=-512
drbd2: worker terminated
drbd2: asender terminated
drbd2: drbd2_receiver [23259]: cstate BrokenPipe --> Unconnected
drbd2: Connection lost.
drbd2: drbd2_receiver [23259]: cstate Unconnected --> StandAlone
=== Server 2 Syslog ===
Feb 11 22:00:49 mailstore2 kernel: drbd2: meta connection shut down by peer.
Feb 11 22:00:59 mailstore2 kernel: drbd2: drbd2_asender [2672]: cstate Connected --> NetworkFailure
Feb 11 22:00:59 mailstore2 kernel: drbd2: asender terminated
Feb 11 22:00:59 mailstore2 kernel: drbd2: short sent BarrierAck size=16 sent=-1001
Feb 11 22:00:59 mailstore2 kernel: drbd2: error receiving Barrier, l: 8!
Feb 11 22:01:00 mailstore2 kernel: drbd2: worker terminated
Feb 11 22:01:00 mailstore2 kernel: drbd2: unacked_cnt = 59
Feb 11 22:01:00 mailstore2 kernel: drbd2: drbd2_receiver [2570]: cstate NetworkFailure --> Unconnected
Feb 11 22:01:00 mailstore2 kernel: drbd2: Connection lost.
=== Server 2 dmesg ===
drbd2: meta connection shut down by peer.
drbd2: drbd2_asender [2672]: cstate Connected --> NetworkFailure
drbd2: asender terminated
drbd2: short sent BarrierAck size=16 sent=-1001
drbd2: error receiving Barrier, l: 8!
drbd2: worker terminated
drbd2: unacked_cnt = 59
drbd2: drbd2_receiver [2570]: cstate NetworkFailure --> Unconnected
drbd2: Connection lost.
drbd2: drbd2_receiver [2570]: cstate Unconnected --> StandAlone
drbd2: receiver terminated
The servers are connected via gigabit Ethernet to a dedicated VLAN
on a Cisco 2960G switch.
We noticed a number of errors on the drbd interface:
Server 1
RX packets:938278799 errors:6 dropped:22308 overruns:0 frame:3
TX packets:999591802 errors:0 dropped:0 overruns:0 carrier:0
Server 2
RX packets:545215102 errors:2 dropped:7795 overruns:0 frame:1
TX packets:419240487 errors:0 dropped:0 overruns:0 carrier:0
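
To put those counters in proportion, here is a rough sketch that turns the
RX lines above into rates per million received packets. The helper names
(parse_counters, rate_per_million) are made up for illustration and are not
part of any DRBD or system tool; the input strings are copied from the
interface statistics above. Whether rates of this size are enough to
explain the disconnects is not something this calculation can answer.

```python
import re

def parse_counters(line):
    """Turn 'RX packets:1 errors:2 ...' into {'packets': 1, 'errors': 2, ...}."""
    return {key: int(value) for key, value in re.findall(r"(\w+):(\d+)", line)}

def rate_per_million(line, field):
    """Occurrences of `field` per one million packets on this counter line."""
    counters = parse_counters(line)
    return counters[field] * 1_000_000 / counters["packets"]

# RX counter lines as reported on the DRBD interfaces of both servers.
server1_rx = "RX packets:938278799 errors:6 dropped:22308 overruns:0 frame:3"
server2_rx = "RX packets:545215102 errors:2 dropped:7795 overruns:0 frame:1"

for name, line in (("Server 1", server1_rx), ("Server 2", server2_rx)):
    dropped = rate_per_million(line, "dropped")
    errors = rate_per_million(line, "errors")
    print(f"{name}: {dropped:.1f} dropped, {errors:.4f} errors per million RX packets")
```

Both servers come out at roughly 14 to 24 dropped packets per million
received, with only a handful of outright errors.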
Distribution: Fedora 4
Linux Version: 2.6.17-1.2142_FC4smp
DRBD Version: 0.7.21 (api:79/proto:74)