Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Lars Ellenberg wrote:
> On Tue, Apr 24, 2007 at 02:14:58PM +0200, Lukasz Engel wrote:
>>> On Tue, Apr 24, 2007 at 12:10:37PM +0200, Lukasz Engel wrote:
>>>
>>>> I have 2 machines running drbd 0.7.23 (self-compiled) with 5
>>>> configured drbdX resources (and heartbeat running on top);
>>>> drbd uses a direct cross-over cable for synchronization.
>>>> Kernel 2.6.19.2 (vendor kernel - trustix 3), UP.
>>>>
>>>> Today I disconnected and reconnected the direct cable, and after
>>>> that 2 of the 5 drbds failed to reconnect:
>>>> drbd0, 2, 4 successfully connected
>>>> drbd1 on the secondary blocked in the NetworkFailure state
>>>> (WFConnection on the primary)
>>>> drbd3 kept retrying to reconnect but could not succeed (it always
>>>> went to BrokenPipe after WFReportParams)
>>>
>>> this should not happen.
>>> it is known to happen sometimes anyway.
>>> it is some sort of race condition.
>>>
>>> the scheme to avoid it is heavily dependent on timeouts.
>>
>> Any chance of a fix?
>> (If it helps, I should be able to disconnect my drbd link from time
>> to time to run some tests...)
>
> I remember similar symptoms from a long time ago,
> when we spent a long time debugging this.
> We thought we had fixed it.
> You are seeing the same symptoms again.
> It may be a different problem; it may be that our "fix" back then
> only made it less likely to occur.
>
> Since I cannot reproduce it, I cannot debug it.
> If you can track down _why_ it happens, great.
> I'm happy to fix it then.

Any hints on how to debug the problem? This is my production
environment, but I think I can add some debug output (printk's) to the
drbd code (the good question is: where?) - if the problem has appeared
more than once, it is highly probable it will appear again (I may
"help" by playing with the drbd eth cable...).

[I am resending from the correct (subscribed) address; another copy is
probably already waiting for the moderator...]

-- 
Lukasz Engel
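
[Editor's note: as a rough illustration of the kind of printk
instrumentation discussed above, the sketch below logs every
connection-state transition of a device with a timestamp. All names
here (set_connection_state, struct drbd_dev, cstate_name, the state
enum) are hypothetical placeholders for illustration only, not the
actual drbd 0.7 symbols; the real hook would go wherever drbd changes
its connection state and wherever the receiver/sender threads bail out
with an error.]

/* Illustrative sketch only - names are placeholders, not real drbd
 * symbols. Shows the kind of printk instrumentation one might add to
 * trace connection-state changes during a failed reconnect. */
#include <linux/kernel.h>
#include <linux/jiffies.h>

enum cstate {
        WF_CONNECTION,
        WF_REPORT_PARAMS,
        CONNECTED,
        NETWORK_FAILURE,
        BROKEN_PIPE,
};

struct drbd_dev {
        int minor;              /* which drbdX device */
        enum cstate cstate;     /* current connection state */
};

static const char *cstate_name(enum cstate cs)
{
        static const char * const names[] = {
                "WFConnection", "WFReportParams", "Connected",
                "NetworkFailure", "BrokenPipe",
        };
        return names[cs];
}

/* Log every connection-state transition with the current jiffies value,
 * so the event ordering on primary and secondary can be compared after
 * a failed reconnect. */
static void set_connection_state(struct drbd_dev *mdev, enum cstate ns)
{
        printk(KERN_INFO "drbd%d: cstate %s -> %s (jiffies=%lu)\n",
               mdev->minor, cstate_name(mdev->cstate), cstate_name(ns),
               jiffies);
        mdev->cstate = ns;
}

[With something like this compiled into the module on both nodes, the
dmesg output of primary and secondary can be lined up by timestamp to
see which side gets stuck in NetworkFailure and which transition never
happens after the cable is reconnected.]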