[DRBD-user] downgrading to drbd 0.7.11 worked (Was: drbd 0.7.13 slow resync and panic with RedHat kernel 2.4.21-32.0.1.ELsmp)

Sun Oct 23 00:31:13 CEST 2005

[ comment by Lars Ellenberg:                                        ]
[ Repost on request of Diego, so it appears in the archives.        ]
[ This post seems to have been lost somehow. The problem described  ]
[ has been solved since, with release of drbd 0.7.14...             ]

> On Mon, Oct 3 at 16:04, Phil wrote:
> > On Fri, 2005-09-30 at 18:58, Diego Liziero wrote:
> > > Today we tried to switch to drbd 0.7.11 and everything worked fine.
> > >
> > > We used the same kernel that repeatedly froze with drbd 0.7.13,
> > > but with spinlock debugging and nmi_watchdog enabled.
> > >
> > > In our environment, drbd 0.7.13 has some instability issues
> > > that have disappeared (so far) with the earlier 0.7.11
> > > version, on the same kernel with spinlock debugging and
> > > nmi_watchdog enabled.
> > >
> > > Sorry that we can't give more information,
> > > but our cluster is in production and we couldn't retry
> > > drbd 0.7.13 today to reproduce the kernel panic with the
> > > spinlock debug option switched on.
> > >
> > > Regards,
> > > Diego.
> >
> > Hi Diego,
> 
> Hello Phil, sorry for not quoting the previous mailing-list
> thread discussion.
> 
> > What do you mean by freeze:
> >  * Does it respond to key-strokes ?
> >  * Does it respond to pings ?
> >  * Does it respond to the "Num-Lock" key by toggling the "Num-Lock" LED ?
> >  * Is the screen blank, or has it the same content as before the freeze ?
> 
> As previously reported to Lars, the system is a multiprocessor
> (4 Xeon) SMP cluster, in production for little more than a year.
> We have been using drbd 0.6.12 without any problem up to the point
> where we decided to upgrade to the 0.7.x branch.
> 
> The lockup we had with 0.7.13 happened repeatedly during the first
> resync of the last and biggest partition, while the node was already
> primary and in use. At that time we didn't have nmi_watchdog and
> spinlock debugging enabled, sorry about that.
> 
> The keyboard was not responding (neither the caps-lock nor the
> num-lock LED changed state), ping was working (but I know that
> it sometimes works even after a kernel panic), and the screen was
> blank (maybe the console screen blanking?), apart from one time
> when the end of an Oops message was readable
> (I remember something about tasker, irq and smp),
> but I couldn't read more because the keyboard
> shift-pageup wasn't working.
> 
> > [..]
> > You enabled nmi_watchdog and spinlock debugging for use with 0.7.11 ?
> 
> Yes, that's what Lars told me to do. We didn't try 0.7.13 again
> because the cluster is in production and we couldn't
> test the version that caused the freeze a second time.
> 
> Each 0.7.13 freeze occurred after a noticeable slowdown of the
> resync process (to about 10% of the normal speed).
> 
> Actually, even with 0.7.11 we initially had a slowdown during
> the resync of the biggest partition (the same one that never
> got past 5% of the resync process before freezing
> with drbd 0.7.13), with a high load average (over 150),
> but without any lockup.
> Then we decided to invalidate the secondary box, because
> after 0.7.13 we had gone back to 0.6.12 before the new
> upgrade to 0.7.11, and we thought that the metadata might be wrong.
> After this forced resync (which we started before the troublesome
> partition had completed its resync), no slowdown occurred and the load
> stayed below 5.
> 
> Regards,
> Diego.