[Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected

Mariusz Mazur mmazur at kernel.pl
Thu Jul 3 15:07:18 CEST 2014


My setup is two nodes running DRBD in dual-primary ("double master") mode, 
corosync, pacemaker, clvmd, and Xen 4.4.0.

Here's what happens when I run "reboot -f" on one of the nodes and the 
surviving node is running kernel 3.12.23 or earlier (oldest tested was 
3.6.something):

 [10512.040601] d-con vpbx_dev3: PingAck did not arrive in time.
 [10512.136930] d-con vpbx_dev3: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
 [10512.346951] d-con vpbx_dev3: asender terminated
 [10512.443875] d-con vpbx_dev3: Terminating drbd_a_vpbx_dev
 [10512.540918] d-con vpbx_dev3: Connection closed
 [10512.636365] d-con vpbx_dev3: conn( NetworkFailure -> Unconnected ) 
 [10512.733924] d-con vpbx_dev3: receiver terminated
 [10512.829033] d-con vpbx_dev3: helper command: /sbin/drbdadm fence-peer vpbx_dev3
 [10512.924943] d-con vpbx_dev3: Restarting receiver thread
 [10513.022739] d-con vpbx_dev3: receiver (re)started
 [10513.116863] d-con vpbx_dev3: conn( Unconnected -> WFConnection ) 
 [10518.874822] dlm: closing connection to node 2
 [10519.468190] d-con vpbx_dev3: helper command: /sbin/drbdadm fence-peer vpbx_dev3 exit code 5 (0x500)
 [10519.604612] d-con vpbx_dev3: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
 [10519.700071] d-con vpbx_dev3: pdsk( DUnknown -> Outdated )                                                                                                                                                     
 [10519.810474] block drbd0: new current UUID 0E52D87F8EBDA1BB:A20A6CC6D7B5E6ED:2E667D8F22B5DA49:2E657D8F22B5DA49
 [10519.943527] d-con vpbx_dev3: susp( 1 -> 0 )                                                                                                                                                                   
 [10522.114574] dlm: clvmd: dlm_recover 3                                                                                                                                                                         
 [10522.114619] dlm: clvmd: remove member 2                                                                                                                                                                       
 [10522.114623] dlm: clvmd: dlm_recover_members 1 nodes                                                                                                                                                           
 [10522.114626] dlm: clvmd: generation 17 slots 1 1:1                                                                                                                                                             
 [10522.114627] dlm: clvmd: dlm_recover_directory                                                                                                                                                                 
 [10522.114629] dlm: clvmd: dlm_recover_directory 0 in 0 new                                                                                                                                                      
 [10522.114631] dlm: clvmd: dlm_recover_directory 0 out 0 messages                                                                                                                                                
 [10522.114633] dlm: clvmd: dlm_recover_masters                                                                                                                                                                   
 [10522.114634] dlm: clvmd: dlm_recover_masters 0 of 0                                                                                                                                                            
 [10522.114636] dlm: clvmd: dlm_recover_locks 0 out                                                                                                                                                               
 [10522.114637] dlm: clvmd: dlm_recover_locks 0 in                                                                                                                                                                
 [10522.114662] dlm: clvmd: dlm_recover 3 generation 17 done: 0 ms                     

Everything seems fine: the fence-peer helper runs and returns 5, I/O resumes 
(susp( 1 -> 0 )), and DRBD remains accessible.

And here's what happens if the surviving node is running 3.13.6 (or 3.14.8 or 
3.15.3):

 [  382.002770] d-con vpbx_dev3: PingAck did not arrive in time.                                                                                                                                                  
 [  382.079354] d-con vpbx_dev3: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
 [  382.245092] d-con vpbx_dev3: asender terminated                                                                                                                                                               
 [  382.322803] d-con vpbx_dev3: Terminating drbd_a_vpbx_dev                                                                                                                                                      
 [  382.400862] d-con vpbx_dev3: Connection closed                                                                                                                                
 [  382.484773] d-con vpbx_dev3: out of mem, failed to invoke fence-peer helper
 [  382.562487] d-con vpbx_dev3: conn( NetworkFailure -> Unconnected )                                                                                                            
 [  382.640106] d-con vpbx_dev3: receiver terminated                                                                                                                              
 [  382.717000] d-con vpbx_dev3: Restarting receiver thread                                                                                                                       
 [  382.793857] d-con vpbx_dev3: receiver (re)started                                                                                                                             
 [  382.869857] d-con vpbx_dev3: conn( Unconnected -> WFConnection )                                                                                                              
 [  384.326309] dlm: closing connection to node 1                                                                                                                                 
 [  387.172256] dlm: clvmd: dlm_recover 3                                                                                                                                         
 [  387.172303] dlm: clvmd: remove member 1                                                                                                                                       
 [  387.172306] dlm: clvmd: dlm_recover_members 1 nodes                                                                                                                           
 [  387.172309] dlm: clvmd: generation 19 slots 1 2:2                                                                                                                             
 [  387.172311] dlm: clvmd: dlm_recover_directory                                                                                                                                 
 [  387.172312] dlm: clvmd: dlm_recover_directory 0 in 0 new                                                                                                                    
 [  387.172314] dlm: clvmd: dlm_recover_directory 0 out 0 messages
 [  387.172316] dlm: clvmd: dlm_recover_masters
 [  387.172318] dlm: clvmd: dlm_recover_masters 0 of 0
 [  387.172320] dlm: clvmd: dlm_recover_locks 0 out
 [  387.172321] dlm: clvmd: dlm_recover_locks 0 in
 [  387.172345] dlm: clvmd: dlm_recover 3 generation 19 done: 0 ms
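
The difference between the two logs comes down to whether the fence-peer 
helper ran and whether susp ever went back to 0. A minimal sketch of how that 
comparison could be automated, assuming the message wording is exactly as in 
the logs quoted above (the function name summarize_fencing and the returned 
keys are made up for illustration and are not part of DRBD):

```python
import re

# Message patterns copied from the kernel logs quoted in this report.
# Other DRBD versions may word these messages differently (assumption).
HELPER_RETURNED = re.compile(r"fence-peer helper returned (\d+)")
HELPER_NOT_INVOKED = re.compile(r"out of mem, failed to invoke fence-peer helper")
SUSP_TRANSITION = re.compile(r"susp\( (\d) -> (\d) \)")


def summarize_fencing(log_text):
    """Summarize fence-peer activity in a chunk of DRBD kernel log.

    Hypothetical helper: returns a dict with the helper's return code
    (or None), whether invoking the helper failed, and whether the last
    susp transition left I/O suspended (None if no transition was seen).
    """
    # Mail clients wrap long log lines, so collapse all whitespace
    # (including newlines) to single spaces before matching.
    flat = " ".join(log_text.split())

    returned = HELPER_RETURNED.search(flat)
    transitions = SUSP_TRANSITION.findall(flat)

    return {
        "helper_exit": int(returned.group(1)) if returned else None,
        "invoke_failed": HELPER_NOT_INVOKED.search(flat) is not None,
        "suspended": (transitions[-1][1] == "1") if transitions else None,
    }


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    print(summarize_fencing(sys.stdin.read()))
```

Fed the 3.12 log, this should report a helper return code of 5 and I/O no 
longer suspended; fed the 3.13+ log, it should report that invoking the helper 
failed and that I/O is still suspended.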

The fence-peer helper is never invoked and susp never returns to 0, with the 
result being:
[root@dev3n2 ~]# cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC 
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:0 dr:264 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

And I need to reboot the node before drbd becomes operational again.
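
For anyone wanting to check for this state from a script, here is a rough 
sketch that parses the device line of /proc/drbd shown above. It assumes the 
8.4 line layout seen in this report, and it assumes (from my reading of the 
DRBD 8.4 User's Guide) that the first of the six I/O flags is "r" for running 
and "s" for suspended I/O. The name parse_device_line and the dict keys are 
made up for illustration.

```python
import re

# Device line layout as seen in this report (DRBD 8.4.3), for example:
#  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
DEVICE_LINE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
    r"\s+(?P<proto>[ABC])\s+(?P<flags>\S{6})\s*$"
)


def parse_device_line(line):
    """Parse one /proc/drbd device line; return a dict, or None if no match.

    Hypothetical helper. "io_suspended" is derived from the first I/O flag,
    assuming "s" there means suspended I/O and "r" means running.
    """
    match = DEVICE_LINE.match(line)
    if match is None:
        return None
    info = match.groupdict()
    info["minor"] = int(info["minor"])
    info["io_suspended"] = info["flags"][0] == "s"
    return info


if __name__ == "__main__":
    # Print the parsed state of every device line in /proc/drbd.
    with open("/proc/drbd") as proc_file:
        for raw_line in proc_file:
            parsed = parse_device_line(raw_line)
            if parsed is not None:
                print(parsed)
```

For the line quoted above, this should yield cs "WFConnection", ro 
"Primary/Unknown", and io_suspended set to True.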

I won't have time to properly bisect this for a few weeks, though if somebody 
has a guess at what's wrong I can test a patch or provide more info.

--mmazur

