Note: "permalinks" may not be as permanent as we would like,
direct links to old messages may well be a few messages off.
> BTW, if only 1 out of the 8 loses connection sporadically, and it's
> always the same one, you could be looking at a hardware problem. You
> may want to look into that, perhaps replace the disk and see if it
> still happens. If not then you can certainly rule out hardware
> failure.

All of my 8 DRBD device pairs live on one RAID5 device, which is in
good health. Besides, I would not associate a sudden loss of network
connectivity with the underlying FS (and I have no other error
messages in the syslog on either node when DRBD loses its connection).

Any other suggestions, please?

Many thanks,

Holger

> On 12/5/06, Holgilein <lists at loomsday.co.nz> wrote:
>> > Just wondering, is drbd set up to reconnect upon a connection failure
>> > or to go stand alone?
>>
>> The devices are configured to go stand-alone in case of a disconnect.
>>
>> When I initially set up DRBD on those machines, I wanted to avoid
>> resyncing with swapped Primary <=> Secondary roles, i.e. destroying
>> the most recent side of the DRBD device pair in such a connection
>> loss/resync scenario. So I decided to let the device go stand-alone
>> in case of a disconnect, and resync the devices manually.
>>
>> In hindsight, I am not so sure any more that this was the best
>> decision - if DRBD made such a mistake during a resync, it would
>> most certainly do exactly the same thing when I reconnect the
>> devices by hand.
>>
>> > Perhaps a network failure is occurring and your
>> > nodes are getting split-brain, thus no failover is being done
>> > (just a thought here)
>>
>> Currently, I have 8 DRBD device pairs on this server. Only one loses
>> its connection sporadically, and Heartbeat does not kick in (which
>> is GOOD, to say the least).
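For context, the "go stand alone" behaviour described above is configured in DRBD 0.7 through the `on-disconnect` option in the `net` section of drbd.conf. A minimal sketch of the relevant fragment, assuming a resource named r0 and illustrative addresses (the resource name and values are not from the thread):

```
resource r0 {
  net {
    # What to do when the connection to the peer is lost:
    #   stand_alone - stay disconnected until resynced by hand
    #                 (the behaviour the poster describes)
    #   reconnect   - keep retrying to reach the peer
    on-disconnect stand_alone;
  }
}
```

Switching the value to `reconnect` would make DRBD retry the peer automatically after a network failure, which is the option the poster is weighing up later in the thread.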
>>
>> Nevertheless, I once hit a combination of errors:
>>
>> 1) Loss of connection, leading to "StandAlone" vs "Unknown" state
>> 2) A server problem that caused a reboot
>> 3) After which the failover node came up on the "Unknown" (aka old)
>>    side of the device pair
>> 4) After manually resyncing, I had finally finished off the good
>>    side, as the bad side was now considered the most recent one and
>>    became SyncSource
>> 5) Having daily disk backups based on rsync is wise.
>>
>> I was lucky enough to detect this right after the reboot; otherwise
>> I would have been in real trouble, due to the data loss since the
>> devices had lost their connection.
>>
>> Anyway, nowadays I am using SEC (http://simple-evcorr.sourceforge.net)
>> to monitor the syslog and send me an email if any DRBD-related
>> errors pop up. It's been a life-saver...
>>
>> To recap, the question remains: shall I simply use the reconnect
>> option for these devices and relax, or should I be concerned about
>> the periodic loss of connection?
>>
>> > On 12/4/06, Holgilein <lists at loomsday.co.nz> wrote:
>> >> Hi there,
>> >>
>> >> I am running drbd-0.7.22 on kernel 2.6.17.11 (including Vserver
>> >> patch set v2.0.2-rc31), and I am using eth1 (tg3 driver on a
>> >> Broadcom BCM5780 Gigabit adapter) on a 2-node cluster to
>> >> synchronise the DRBD devices.
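The SEC-based syslog monitoring mentioned above can be done with a single correlation rule. A minimal sketch of such a rule file; the pattern, description, and mail command are assumptions (adjust the regexp to your syslog format and the recipient to your site):

```
# sec-drbd.conf - mail any kernel message from a drbd device.
# Run with e.g.: sec --conf=sec-drbd.conf --input=/var/log/syslog
type=Single
ptype=RegExp
pattern=kernel: (drbd\d+: .*)
desc=DRBD event: $1
action=pipe '$0' mail -s 'DRBD alert' root@localhost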
>> >>
>> >> This network interface is also used to pull backups across the
>> >> two nodes, and when the link is under heavy load I have observed
>> >> that sometimes (every 5-6 days) DRBD loses its inter-node
>> >> connection:
>> >>
>> >> Dec 5 05:40:45 wgr-host1 kernel: drbd0: [reiserfs/3/1830] sock_sendmsg time expired, ko = 3
>> >> Dec 5 05:40:48 wgr-host1 kernel: drbd0: [reiserfs/3/1830] sock_sendmsg time expired, ko = 2
>> >> Dec 5 05:40:51 wgr-host1 kernel: drbd0: [reiserfs/3/1830] sock_sendmsg time expired, ko = 1
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: reiserfs/3 [1830]: cstate Connected --> NetworkFailure
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: drbd0_receiver [10068]: cstate NetworkFailure --> BrokenPipe
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: short read expecting header on sock: r=-512
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: asender terminated
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: worker terminated
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: drbd0_receiver [10068]: cstate BrokenPipe --> Unconnected
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: Connection lost.
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: drbd0_receiver [10068]: cstate Unconnected --> StandAlone
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: receiver terminated
>> >>
>> >> Is this a known behaviour, and is there anything I can do to
>> >> remedy it?
>> >>
>> >> Many thanks,
>> >>
>> >> Holger
>> >> _______________________________________________
>> >> drbd-user mailing list
>> >> drbd-user at lists.linbit.com
>> >> http://lists.linbit.com/mailman/listinfo/drbd-user
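The "ko = 3 ... 2 ... 1" countdown in the log above is DRBD counting down its knock-out mechanism: when a send blocks for longer than the configured `timeout`, DRBD decrements the ko counter, and when it reaches zero the peer is considered dead and the connection is dropped. On a link that is saturated by backup traffic, raising these values in the `net` section may avoid the disconnects. A sketch of the relevant DRBD 0.7 options; the values shown are illustrative assumptions, not recommendations from the thread:

```
resource r0 {
  net {
    timeout   60;   # unit is 0.1 s, i.e. 6 seconds per send attempt
    ko-count  4;    # drop the peer after ko-count consecutive timeouts
                    # with no progress; raising this tolerates longer
                    # stalls on a congested link
    ping-int  10;   # seconds between keep-alive pings
  }
}
```

Alternatively, moving the backup traffic to a separate interface would keep the replication link from being starved in the first place.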