Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Lars Ellenberg wrote:
> On Tue, Apr 24, 2007 at 02:14:58PM +0200, Lukasz Engel wrote:
>>> On Tue, Apr 24, 2007 at 12:10:37PM +0200, Lukasz Engel wrote:
>>>
>>>> I have 2 machines running drbd 0.7.23 (self-compiled) with 5
>>>> configured drbdX resources (and heartbeat running on top);
>>>> drbd uses a direct cross-over cable for synchronization.
>>>> Kernel 2.6.19.2 (vendor kernel - trustix 3), UP.
>>>>
>>>> Today I disconnected and reconnected the direct cable, and after
>>>> that 2 of the 5 drbds failed to reconnect:
>>>> drbd0, 2, 4 successfully connected
>>>> drbd1 on the secondary blocked in the NetworkFailure state
>>>> (WFConnection on the primary)
>>>> drbd3 kept retrying to reconnect but could not succeed (it always
>>>> went to BrokenPipe after WFReportParams)
>>>
>>> this should not happen.
>>> it is known to happen sometimes anyway.
>>> it is some sort of race condition.
>>>
>>> the scheme to avoid it is heavily dependent on timeouts.
>>
>> Any chance of a fix?
>> (If it helps, I should be able to disconnect my drbd link from time
>> to time to run some tests...)
>
> I remember similar symptoms from a long time ago,
> when we spent a long time debugging this.
> We thought we had fixed it.
> You are seeing the same symptoms again.
> It may be a different problem; it may be that our "fix" back then
> only made it less likely to occur.
>
> Since I cannot reproduce it, I cannot debug it.
> If you can track down _why_ it happens, great.
> I'm happy to fix it then.

Any hints on how to debug the problem? This is my production
environment, but I think I can add some debug output (printk's) to the
drbd code (the good question is: where?) - if the problem has appeared
more than once, it is highly probable it will appear again (I may
"help" by playing with the drbd eth cable...).

[I am resending from the correct (subscribed) address; another copy is
probably already waiting for the moderator...]

-- 
Lukasz Engel
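
[Editor's note: as a rough illustration of the kind of printk
instrumentation discussed above, the sketch below logs every
connection-state transition of a device with a timestamp. All names
here (set_connection_state, struct drbd_dev, cstate_name, the state
enum) are hypothetical placeholders for illustration only, not the
actual drbd 0.7 symbols; the real hook would go wherever drbd changes
its connection state and wherever the receiver/sender threads bail out
with an error.]

/* Illustrative sketch only - names are placeholders, not real drbd
 * symbols. Shows the kind of printk instrumentation one might add to
 * trace connection-state changes during a failed reconnect. */
#include <linux/kernel.h>
#include <linux/jiffies.h>

enum cstate {
        WF_CONNECTION,
        WF_REPORT_PARAMS,
        CONNECTED,
        NETWORK_FAILURE,
        BROKEN_PIPE,
};

struct drbd_dev {
        int minor;              /* which drbdX device */
        enum cstate cstate;     /* current connection state */
};

static const char *cstate_name(enum cstate cs)
{
        static const char * const names[] = {
                "WFConnection", "WFReportParams", "Connected",
                "NetworkFailure", "BrokenPipe",
        };
        return names[cs];
}

/* Log every connection-state transition with the current jiffies value,
 * so the event ordering on primary and secondary can be compared after
 * a failed reconnect. */
static void set_connection_state(struct drbd_dev *mdev, enum cstate ns)
{
        printk(KERN_INFO "drbd%d: cstate %s -> %s (jiffies=%lu)\n",
               mdev->minor, cstate_name(mdev->cstate), cstate_name(ns),
               jiffies);
        mdev->cstate = ns;
}

[With something like this compiled into the module on both nodes, the
dmesg output of primary and secondary can be lined up by timestamp to
see which side gets stuck in NetworkFailure and which transition never
happens after the cable is reconnected.]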