Note: "permalinks" may not be as permanent as we would like,
direct links to old messages may well be a few messages off.
> BTW, if only 1 out of the 8 loses connection sporadically, and it's
> always the same one, you could be looking at a hardware problem. You
> may want to look into that, perhaps replace the disk and see if it
> still happens. If not then you can certainly rule out hardware
> failure.

All of my 8 DRBD device pairs live on one RAID5 device, which is in
good health. Besides, I would not associate a sudden loss of network
connectivity with the underlying FS (and I have no other error
messages in the syslog on either node when DRBD loses its connection).

Any other suggestions, please?

Many thanks,

Holger

> On 12/5/06, Holgilein <lists at loomsday.co.nz> wrote:
>> > Just wondering, is drbd set up to reconnect upon a connection failure
>> > or to go stand alone?
>>
>> The devices are configured to go stand-alone in case of a disconnect.
>>
>> When I initially set up DRBD on those machines, I wanted to avoid
>> resyncing with swapped Primary <=> Secondary roles, i.e. destroying
>> the most recent side of the DRBD device pair in such a connection
>> loss/resync scenario. So I decided to let the device go stand-alone
>> in case of a disconnect, and resync the devices manually.
>>
>> In hindsight, I am not so sure any more that this was the best
>> decision - if DRBD made such a mistake during a resync, it would
>> most certainly do exactly the same thing when I reconnect the
>> devices by hand.
>>
>> > Perhaps a network failure is occurring and your
>> > nodes are getting split-brain, thus no failover is being done
>> > (just a thought here)
>>
>> Currently, I have 8 DRBD device pairs on this server. Only one loses
>> its connection sporadically, and Heartbeat does not kick in (which
>> is GOOD, to say the least).
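For context, the "go stand alone" behaviour described above is configured in DRBD 0.7 through the `on-disconnect` option in the `net` section of drbd.conf. A minimal sketch of the relevant fragment, assuming a resource named r0 and illustrative addresses (the resource name and values are not from the thread):

```
resource r0 {
  net {
    # What to do when the connection to the peer is lost:
    #   stand_alone - stay disconnected until resynced by hand
    #                 (the behaviour the poster describes)
    #   reconnect   - keep retrying to reach the peer
    on-disconnect stand_alone;
  }
}
```

Switching the value to `reconnect` would make DRBD retry the peer automatically after a network failure, which is the option the poster is weighing up later in the thread.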
>>
>> Nevertheless, I once hit a combination of errors:
>>
>> 1) Loss of connection, leading to "StandAlone" vs "Unknown" state
>> 2) A server problem that caused a reboot
>> 3) After which the failover node came up on the "Unknown" (aka old)
>>    side of the device pair
>> 4) After manually resyncing, I had finally finished off the good
>>    side, as the bad side was now considered the most recent one and
>>    became SyncSource
>> 5) Having daily disk backups based on rsync is wise.
>>
>> I was lucky enough to detect this right after the reboot; otherwise
>> I would have been in real trouble, due to the data loss since the
>> devices had lost their connection.
>>
>> Anyway, nowadays I am using SEC (http://simple-evcorr.sourceforge.net)
>> to monitor the syslog and send me an email if any DRBD-related
>> errors pop up. It's been a life-saver...
>>
>> To recap, the question remains: shall I simply use the reconnect
>> option for these devices and relax, or should I be concerned about
>> the periodic loss of connection?
>>
>> > On 12/4/06, Holgilein <lists at loomsday.co.nz> wrote:
>> >> Hi there,
>> >>
>> >> I am running drbd-0.7.22 on kernel 2.6.17.11 (including Vserver
>> >> patch set v2.0.2-rc31), and I am using eth1 (tg3 driver on a
>> >> Broadcom BCM5780 Gigabit adapter) on a 2-node cluster to
>> >> synchronise the DRBD devices.
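The SEC-based syslog monitoring mentioned above can be done with a single correlation rule. A minimal sketch of such a rule file; the pattern, description, and mail command are assumptions (adjust the regexp to your syslog format and the recipient to your site):

```
# sec-drbd.conf - mail any kernel message from a drbd device.
# Run with e.g.: sec --conf=sec-drbd.conf --input=/var/log/syslog
type=Single
ptype=RegExp
pattern=kernel: (drbd\d+: .*)
desc=DRBD event: $1
action=pipe '$0' mail -s 'DRBD alert' root@localhost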
>> >>
>> >> This network interface is also used to pull backups across the
>> >> two nodes, and when the link is under heavy load I have observed
>> >> that sometimes (every 5-6 days) DRBD loses its inter-node
>> >> connection:
>> >>
>> >> Dec 5 05:40:45 wgr-host1 kernel: drbd0: [reiserfs/3/1830] sock_sendmsg time expired, ko = 3
>> >> Dec 5 05:40:48 wgr-host1 kernel: drbd0: [reiserfs/3/1830] sock_sendmsg time expired, ko = 2
>> >> Dec 5 05:40:51 wgr-host1 kernel: drbd0: [reiserfs/3/1830] sock_sendmsg time expired, ko = 1
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: reiserfs/3 [1830]: cstate Connected --> NetworkFailure
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: drbd0_receiver [10068]: cstate NetworkFailure --> BrokenPipe
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: short read expecting header on sock: r=-512
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: asender terminated
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: worker terminated
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: drbd0_receiver [10068]: cstate BrokenPipe --> Unconnected
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: Connection lost.
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: drbd0_receiver [10068]: cstate Unconnected --> StandAlone
>> >> Dec 5 05:40:54 wgr-host1 kernel: drbd0: receiver terminated
>> >>
>> >> Is this a known behaviour, and is there anything I can do to
>> >> remedy it?
>> >>
>> >> Many thanks,
>> >>
>> >> Holger
>> >> _______________________________________________
>> >> drbd-user mailing list
>> >> drbd-user at lists.linbit.com
>> >> http://lists.linbit.com/mailman/listinfo/drbd-user
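The "ko = 3 ... 2 ... 1" countdown in the log above is DRBD counting down its knock-out mechanism: when a send blocks for longer than the configured `timeout`, DRBD decrements the ko counter, and when it reaches zero the peer is considered dead and the connection is dropped. On a link that is saturated by backup traffic, raising these values in the `net` section may avoid the disconnects. A sketch of the relevant DRBD 0.7 options; the values shown are illustrative assumptions, not recommendations from the thread:

```
resource r0 {
  net {
    timeout   60;   # unit is 0.1 s, i.e. 6 seconds per send attempt
    ko-count  4;    # drop the peer after ko-count consecutive timeouts
                    # with no progress; raising this tolerates longer
                    # stalls on a congested link
    ping-int  10;   # seconds between keep-alive pings
  }
}
```

Alternatively, moving the backup traffic to a separate interface would keep the replication link from being starved in the first place.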