It will be a little while before I can get the log information together.
We will be restarting the clusters this weekend. After that the data
should be synced. When that's done we'll start watching for the failure.

Gary

> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com
> [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Philipp Reisner
> Sent: Wednesday, January 03, 2007 3:03 AM
> To: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] DRBD Xen problems
>
> On Wednesday, 3 January 2007 04:44, Gary W. Smith wrote:
> > Hello,
> >
> > I'm running drbd 0.7.22 under rPath in a Xen DomU. At first everything
> > was working fine. Node 1 was actually under a non-Xen RHEL4 instance
> > under VMWare and node 2 was under rPath Xen DomU. Everything worked
> > fine there. We migrated Node 1 to match the other node (OS and config).
> > Everything continued to work fine and about a week later we started
> > seeing things like this on the console:
> >
> > Dec 29 19:23:30 pxtoaksql04a kernel: drbd0: drbd0_receiver [731]: cstate Unconnected --> WFConnection
> > Dec 29 19:23:38 pxtoaksql04a kernel: drbd0: drbd0_receiver [731]: cstate WFConnection --> WFReportParams
> > Dec 29 19:23:38 pxtoaksql04a kernel: drbd0: sock was shut down by peer
> > Dec 29 19:23:38 pxtoaksql04a kernel: drbd0: drbd0_receiver [731]: cstate WFReportParams --> BrokenPipe
> > Dec 29 19:23:38 pxtoaksql04a kernel: drbd0: short read expecting header on sock: r=0
> > Dec 29 19:23:38 pxtoaksql04a kernel: drbd0: Network error during initial handshake. I'll try again.
> > Dec 29 19:23:44 pxtoaksql04a kernel: drbd0: worker terminated
> > Dec 29 19:23:44 pxtoaksql04a kernel: drbd0: drbd0_receiver [731]: cstate BrokenPipe --> Unconnected
> > Dec 29 19:23:44 pxtoaksql04a kernel: drbd0: Connection lost.
> > Dec 29 19:23:44 pxtoaksql04a kernel: drbd0: drbd0_receiver [731]: cstate Unconnected --> WFConnection
> > Dec 29 19:23:51 pxtoaksql04a kernel: drbd0: drbd0_receiver [731]: cstate WFConnection --> WFReportParams
> > Dec 29 19:23:51 pxtoaksql04a kernel: drbd0: sock was shut down by peer
> > Dec 29 19:23:51 pxtoaksql04a kernel: drbd0: drbd0_receiver [731]: cstate WFReportParams --> BrokenPipe
> > Dec 29 19:23:51 pxtoaksql04a kernel: drbd0: short read expecting header on sock: r=0
> > Dec 29 19:23:51 pxtoaksql04a kernel: drbd0: Network error during initial handshake. I'll try again.
> > Dec 29 19:23:57 pxtoaksql04a kernel: drbd0: worker terminated
> > Dec 29 19:23:57 pxtoaksql04a kernel: drbd0: drbd0_receiver [731]: cstate BrokenPipe --> Unconnected
> > Dec 29 19:23:57 pxtoaksql04a kernel: drbd0: Connection lost.
> > ...
> >
> > Restarting DRBD on the affected node does nothing. Restarting this
> > machine seems to fix the problem but then HA takes over the node and
> > shuts down the active primary (gracefully even though auto failback is
> > disabled -- this is a HA problem though).
> >
> > Any ideas on why this might be happening? Would heavy IO load cause
> > this?
> >
>
> To understand what is going on here we need the kernel-log of both
> machines of such an event!
> PS: Please make sure that the clocks of those machines are in sync
> (use NTP!)
>
> -Phil
>
> --
> : Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
> : LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
> : Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
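
Once kernel logs from both nodes are available and the clocks are in sync, it may help to read them as one timeline. The following is only a rough sketch of how that could be done, not part of DRBD or of anything posted in this thread. It assumes the classic syslog line format seen in the excerpt above; the function names `parse_drbd_lines` and `merge_logs` are made up for illustration, and the year has to be supplied by hand because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt in this thread:
#   "Dec 29 19:23:30 pxtoaksql04a kernel: drbd0: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: (?P<msg>drbd\d+: .*)$"
)


def parse_drbd_lines(lines, year):
    """Return (timestamp, host, message) for each DRBD kernel line.

    Lines that do not look like DRBD kernel messages are skipped.
    `year` is an assumption supplied by the caller.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        events.append((ts, m["host"], m["msg"]))
    return events


def merge_logs(log_a, log_b, year):
    """Interleave the DRBD lines of two nodes by timestamp.

    The sort is stable, so lines sharing a second keep their
    original relative order (node A first, then node B).
    """
    events = parse_drbd_lines(log_a, year) + parse_drbd_lines(log_b, year)
    return sorted(events, key=lambda e: e[0])
```

Calling `merge_logs` with the lines of each node's log file and the year gives a single list that can be printed to see, for instance, what the peer logged around the moment the local node reports "sock was shut down by peer". The result is only meaningful if both clocks really are synchronised, which is why NTP matters here.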