[DRBD-user] Initial Sync - Fast then really slow

Sun Apr 11 20:19:53 CEST 2004

/ 2004-04-11 15:38:05 +0000
\ Jeff Goris:
> > > That is correct. The machine is only doing DRBD - both
> > > machines are freshly installed with just DRBD and heartbeat
> > > setup. The problem occurs whether or not /dev/nb0 is mounted
> > > or unmounted.  The only resources Heartbeat manages in the
> > > cluster are one DRBD device and one virtual IP address.
> > 
> > please reproduce it without heartbeat...  "by hand"
> > Thanks.
> > 
> 
> I managed to reproduce it again. /dev/nb0 was unmounted and heartbeat and DRBD 
> were stopped on both hosts. Then checked that all RAID devices were healthy 
> and that the time on both hosts was correct and synchronised. Started DRBD 
> (command 'service drbd start') on the host that was last primary. Started drbd 
> on the secondary and monitored the hosts during the syncall of /dev/nb0. When 
> finished, I set the clock on the secondary forward 8 hours with the 'date' 
> command. Finally, I started a resync on the primary with the 
> command 'drbdsetup /dev/nb0 replicate'. I monitored both RAID and DRBD during 
> the resync until the secondary host locked up. I did not see the sync rate 
> drop prior to the lockup.
> 
> I suspect now that the slow sync rate was due to the software RAID 1 also 
> syncing as you "guessed" as the last two times I reproduced this lock up I did 
> not see the sync rate drop. However, I am pretty sure that the locking up on 
> the secondary occurs when its system clock is drifting from a time in the 
> future back to the correct time whilst DRBD is resyncing. I can't see what 
> else could be causing the host to lock up whilst DRBD is resyncing. I've tried 
> to stop everything running other than DRBD and NTPD.
> 
> If you are very sure that DRBD should not be failing under these conditions, 
> then I think I will need to try a fresh "minimal" install of RedHat without 
> software RAID and without channel bonding and try introducing components and 
> see if I can ascertain which component is causing the problem.

Ok.

The only places where DRBD has a notion of time are some local
timers, connection timeouts and so on. It just defers certain
actions by some configured amount of time (usually seconds) relative
to "now". E.g. I issue a DrbdPing, and then set a timer to notify
me after, say, 6 seconds... if the peer answers in time, the timer
action is discarded again.
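
To illustrate the pattern (this is NOT DRBD code, just a minimal
userspace sketch in Python): a timer is armed relative to "now" when
the ping goes out, and cancelled if the answer arrives in time. The
names PingWatchdog, send_ping and ack_received are made up for this
example, and the 50 ms timeout is only there to keep the demo short.

```python
import threading
import time


class PingWatchdog:
    """Hypothetical illustration of 'arm a timer, discard it on reply'."""

    def __init__(self, timeout_s, on_timeout):
        self.timeout_s = timeout_s      # delay relative to "now"
        self.on_timeout = on_timeout    # called only if no answer arrives
        self._timer = None

    def send_ping(self):
        # Arm the timer at the moment the ping is issued.
        self._timer = threading.Timer(self.timeout_s, self.on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def ack_received(self):
        # Peer answered in time: discard the pending timer action.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None


def demo():
    answered, unanswered = [], []

    # Case 1: the peer answers before the timeout, nothing fires.
    wd = PingWatchdog(0.05, lambda: answered.append("timeout"))
    wd.send_ping()
    wd.ack_received()
    time.sleep(0.2)

    # Case 2: no answer, so the timer action runs once.
    wd = PingWatchdog(0.05, lambda: unanswered.append("timeout"))
    wd.send_ping()
    time.sleep(0.2)

    return answered, unanswered


if __name__ == "__main__":
    print(demo())
```

Nothing in this pattern looks at the peer's clock; the only input is a
delay measured locally from the moment the timer was armed.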

A smoothly adjusting time via NTPD by sub-second amounts...
How should that be noticed at all, and by whom?  Not by DRBD.

If I really try very hard to imagine what could go wrong,
I might be able to be not too sure about what exactly might happen
when you have a more or less *large* timeskew whilst DRBD is running.
Though I can not think of a reason for it to lock up the box.
Worst case and still extremely unlikely, it might drop the
connection, and immediately reconnect.

If you just have a time difference between the two nodes,
DRBD is completely innocent.
DRBD does not care about, and has no knowledge of, the peer's time.
TCP has timestamps, but who cares.
heartbeat is known to "dislike" time differences between its
nodes, and rightfully so, but I doubt it would lock up the box
because of that...

So I just can say ' ??  :-/ '

BTW, did you use a serial console?
NMI watchdog?  enabled sysrq?
any reaction?  any message?

	Lars Ellenberg