[DRBD-user] DRBD full sync is stalled

Jeroen Groenewegen van der Weyden groen692 at grosc.com
Fri Sep 25 16:41:15 CEST 2009


>Serial console?
>Netconsole?
>Logs?

Which logs are you interested in? This is the first time I am seriously troubleshooting a DRBD problem.
/var/log/messages simply stops receiving messages at the time of the freeze (see the snippet below). Is there a debug level I can increase for DRBD?


>Network stress tests not using DRBD?
>General stress tests?
>Memtest?

The problem happens on the "production LAN" as well as on a 4-port "1Gig staging switch". iperf shows normal values in all cases.
The problem happens on Fujitsu Siemens RX200/RX300 servers; 6 Fujitsu Siemens servers in total are affected. Other servers I have installed do not have this problem. The Fujitsu Siemens servers have onboard Broadcom interfaces ("NIC: NetXtreme II BCM5708 Gigabit Ethernet").


---------- /var/log/messages on the target machine --------------
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: PingAck did not arrive in time.
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: asender terminated
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: Terminating asender thread
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: short read expecting header on sock: r=-512
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: Connection closed
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: conn( NetworkFailure -> Unconnected )
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: receiver terminated
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: Restarting receiver thread
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: receiver (re)started
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: conn( Unconnected -> WFConnection )
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: PingAck did not arrive in time.
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: asender terminated
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: Terminating asender thread
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: short read expecting header on sock: r=-512
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: Connection closed
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: receiver terminated
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: Restarting receiver thread
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: receiver (re)started
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: conn( Unconnected -> WFConnection )
---------- here it is frozen -------------------------------
---------- /var/log/messages on the target machine --------------
Here the log stops until the boot messages of the reboot show up.
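
To make the sequence above easier to compare across the affected servers, here is a minimal sketch that pulls the state transitions (peer/conn/pdsk) out of such a log excerpt. The helper names (unwrap, transitions) and the regular expressions are my own assumptions, derived only from the lines shown above; they are not part of DRBD and may need adjusting for other syslog formats. It also re-joins lines that were wrapped by the mail client.

```python
import re

# One DRBD state-change field, e.g. "conn( Connected -> NetworkFailure )".
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")
# Syslog prefix plus DRBD device, e.g. "Sep 25 11:33:13 host kernel: block drbd2: ...".
# Assumption: this layout matches the excerpt above; other syslog setups may differ.
PREFIX = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+block\s+(drbd\d+):\s*(.*)$"
)


def unwrap(lines):
    """Join mail-wrapped continuation lines onto the log entry they belong to."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if PREFIX.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def transitions(text):
    """Return (timestamp, device, field, old_state, new_state) tuples in log order."""
    result = []
    for entry in unwrap(text.splitlines()):
        match = PREFIX.match(entry)
        if not match:
            continue
        stamp, device, message = match.groups()
        for field, old, new in TRANSITION.findall(message):
            result.append((stamp, device, field, old, new))
    return result


# A few lines copied from the excerpt above, including the mail line wrapping.
SAMPLE = """\
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: PingAck did not
arrive in time.
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: peer( Secondary ->
Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: conn( Unconnected ->
WFConnection )
"""

if __name__ == "__main__":
    for row in transitions(SAMPLE):
        print("%s %s %s: %s -> %s" % row)
```

Feeding it the full /var/log/messages text instead of SAMPLE should give a per-device timeline of the connection state changes leading up to the freeze.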

Kind regards,

jeroen.

Lars Ellenberg wrote:
> On Fri, Sep 25, 2009 at 01:10:24PM +0200, Jeroen Groenewegen van der Weyden wrote:
>   
>> Anybody?
>>
>> The same seems to happen with 8.3.3RC2, although the failure is either a
>> frozen system or the system disconnecting all network interfaces.
>> Anybody?
>>
>> Kind regards,
>>
>> jeroen
>>
>> Jeroen Groenewegen van der Weyden wrote:
>>     
>>> Hello,
>>>
>>> I have a problem: during a full sync with DRBD, the target machine
>>> freezes. The scenario is simple: whenever a full sync is started,
>>> manually or automatically, the sync stalls after some time. A few
>>> moments after the sync reaches the stalled state, the target machine
>>> freezes entirely.
>>>
>>> OpenSuse 11.1
>>> kernel 2.6.27.21-0.1-xen #
>>> drbd 8.3.1
>>>
>>> NIC: NetXtreme II BCM5708 Gigabit Ethernet
>>>
>>> On the Source Machine:
>>> cat /proc/drbd
>>> version: 8.3.1 (api:88/proto:86-89)
>>> GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by root at DefaultNode, 2009-04-27 11:34:17
>>> 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
>>>    ns:324524 nr:0 dw:110988 dr:689400 al:263 bm:242 lo:0 pe:2131 ua:978 ap:36 ep:1 wo:b oos:1635880
>>>        [==>.................] sync'ed: 16.4% (1635880/1951768)K
>>>        stalled
>>>
>>> How to find out what is happening here?
>>>       
>
> Serial console?
> Netconsole?
> Logs?
>
> Network stress tests not using DRBD?
> General stress tests?
> Memtest?
>
>   
>>> (and prevent it in the future.)
>>>       
