[DRBD-user] Pacemaker DRBD dual-Primary setup, node shutdown before DRBD syncing completes.

Digimer lists at alteeve.ca
Sun Apr 2 07:50:40 CEST 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On 01/04/17 03:45 PM, Raman Gupta wrote:
> Hi,
> 
> Problem: 
> -----------------
> In a Pacemaker GFS2 DRBD dual-Primary setup, one node (server4) was
> accidentally shut down before the initial sync between the two nodes
> had completed, i.e. server4 crashed while the initial DRBD sync from
> server4 --> server7 was still running. server7 was left in an
> Inconsistent state.
> 
> On the surviving node (server7) I could see these errors in /var/log/messages: 
> 
> Apr  2 00:41:04 server7 kernel: block drbd0: State change failed: Need
> access to UpToDate data
> Apr  2 00:41:04 server7 kernel: block drbd0:   state = { cs:SyncTarget
> ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
> Apr  2 00:41:04 server7 kernel: block drbd0:  wanted = { cs:TearDown
> ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
> Apr  2 00:41:04 server7 kernel: drbd vDrbd: State change failed: Need
> access to UpToDate data
> Apr  2 00:41:04 server7 kernel: drbd vDrbd:  mask = 0x1e1f0 val = 0xa070
> Apr  2 00:41:04 server7 kernel: drbd vDrbd:  old_conn:WFReportParams
> wanted_conn:TearDown
> Apr  2 00:41:05 server7 kernel: block drbd0: State change failed: Need
> access to UpToDate data
> Apr  2 00:41:05 server7 kernel: block drbd0:   state = { cs:SyncTarget
> ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
> Apr  2 00:41:05 server7 kernel: block drbd0:  wanted = { cs:TearDown
> ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
> Apr  2 00:41:05 server7 kernel: drbd vDrbd: State change failed: Need
> access to UpToDate data
> Apr  2 00:41:05 server7 kernel: drbd vDrbd:  mask = 0x1f0 val = 0x70
> Apr  2 00:41:05 server7 kernel: drbd vDrbd:  old_conn:WFReportParams
> wanted_conn:TearDown
> Apr  2 00:41:05 server7 kernel: block drbd0: State change failed: Need
> access to UpToDate data
> Apr  2 00:41:05 server7 kernel: block drbd0:   state = { cs:SyncTarget
> ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
> Apr  2 00:41:05 server7 kernel: block drbd0:  wanted = { cs:TearDown
> ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
> Apr  2 00:41:05 server7 kernel: drbd vDrbd: State change failed: Need
> access to UpToDate data
> Apr  2 00:41:05 server7 kernel: drbd vDrbd:  mask = 0x1e1f0 val = 0xa070
> Apr  2 00:41:05 server7 kernel: drbd vDrbd:  old_conn:WFReportParams
> wanted_conn:TearDown
> Apr  2 00:41:06 server7 kernel: block drbd0: State change failed: Need
> access to UpToDate data
> Apr  2 00:41:06 server7 kernel: block drbd0:   state = { cs:SyncTarget
> ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
> Apr  2 00:41:06 server7 kernel: block drbd0:  wanted = { cs:TearDown
> ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
> Apr  2 00:41:06 server7 kernel: drbd vDrbd: State change failed: Need
> access to UpToDate data
> Apr  2 00:41:06 server7 kernel: drbd vDrbd:  mask = 0x1f0 val = 0x70
> Apr  2 00:41:06 server7 kernel: drbd vDrbd:  old_conn:WFReportParams
> wanted_conn:TearDown
> Apr  2 00:41:06 server7 kernel: block drbd0: State change failed: Need
> access to UpToDate data
> Apr  2 00:41:06 server7 kernel: block drbd0:   state = { cs:SyncTarget
> ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
> Apr  2 00:41:06 server7 kernel: block drbd0:  wanted = { cs:TearDown
> ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
> Apr  2 00:41:06 server7 kernel: drbd vDrbd: State change failed: Need
> access to UpToDate data
> 
> 
> Apr  2 00:41:22 server7 kernel: drbd vDrbd: PingAck did not arrive in time.

Network break (though it sounds like you already knew why the node
disappeared).
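
If flaky links are ever a concern, the relevant knobs live in the
resource's net section. This is only a sketch for reference (the values
shown are the 8.4 defaults, and the resource name vDrbd is taken from
your logs), not a recommendation:

  resource vDrbd {
    net {
      ping-int     10;  # seconds between keep-alive pings
      ping-timeout  5;  # tenths of a second to wait for a PingAck
      timeout      60;  # tenths of a second to wait for normal replies
    }
  }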

> Apr  2 00:41:22 server7 kernel: drbd vDrbd: peer( Secondary -> Unknown )
> conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp(
> 0 -> 1 )

The peer was UpToDate when it died, by the looks of it. Because the
local store had not finished syncing, DRBD marked the peer's disk
DUnknown and suspended I/O to prevent problems.
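
You can see what DRBD thinks of both sides at any point with the usual
commands (resource name vDrbd taken from your logs):

  drbdadm cstate vDrbd   # connection state, e.g. WFConnection
  drbdadm dstate vDrbd   # local/peer disk state, e.g. Inconsistent/DUnknown
  drbdadm role vDrbd     # local/peer role, e.g. Primary/Unknown
  cat /proc/drbd         # full status, including the oos (out-of-sync) count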

> Apr  2 00:41:22 server7 kernel: block drbd0: helper command:
> /sbin/drbdadm pri-on-incon-degr minor-0
> Apr  2 00:41:22 server7 kernel: block drbd0: helper command:
> /sbin/drbdadm pri-on-incon-degr minor-0 exit code 0 (0x0)
> Apr  2 00:41:22 server7 kernel: drbd vDrbd: ack_receiver terminated
> Apr  2 00:41:22 server7 kernel: drbd vDrbd: Terminating drbd_a_vDrbd
> Apr  2 00:41:22 server7 kernel: drbd vDrbd: Connection closed
> Apr  2 00:41:22 server7 kernel: drbd vDrbd: conn( NetworkFailure ->
> Unconnected )
> Apr  2 00:41:22 server7 kernel: drbd vDrbd: receiver terminated
> Apr  2 00:41:22 server7 kernel: drbd vDrbd: Restarting receiver thread
> Apr  2 00:41:22 server7 kernel: drbd vDrbd: receiver (re)started
> Apr  2 00:41:22 server7 kernel: drbd vDrbd: conn( Unconnected ->
> WFConnection )
> *Apr  2 00:41:22 server7 kernel: drbd vDrbd: Not fencing peer, I'm not
> even Consistent myself.*

This is so that, if the peer were still alive, this node would be the
one to get shot (as the peer was UpToDate). Were you sync'ing the two
resources in both directions at the same time?
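
For reference, the fencing wiring that makes that "Not fencing peer"
decision meaningful in a Pacemaker cluster usually looks something like
this. A sketch only, using the stock crm-fence-peer handlers shipped
with drbd-utils; I haven't seen your attached config, so adjust to what
you actually have:

  resource vDrbd {
    disk {
      fencing resource-and-stonith;
    }
    handlers {
      fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
  }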

> Apr  2 00:41:22 server7 kernel: drbd vDrbd: susp( 1 -> 0 )
> Apr  2 00:41:22 server7 kernel: block drbd0: IO ERROR: neither local nor
> remote data, sector 0+0
> Apr  2 00:41:22 server7 kernel: block drbd0: IO ERROR: neither local nor
> remote data, sector 344936+8
> Apr  2 00:41:22 server7 kernel: GFS2: fsid=vCluster:vGFS2.1: Error -5
> writing to log
> Apr  2 00:41:22 server7 kernel: block drbd0: IO ERROR: neither local nor
> remote data, sector 344944+24
> Apr  2 00:41:22 server7 kernel: GFS2: fsid=vCluster:vGFS2.1: Error -5
> writing to log
> Apr  2 00:41:22 server7 kernel: block drbd0: IO ERROR: neither local nor
> remote data, sector 0+0
> Apr  2 00:41:22 server7 kernel: block drbd0: IO ERROR: neither local nor
> remote data, sector 344968+8
> Apr  2 00:41:22 server7 kernel: GFS2: fsid=vCluster:vGFS2.1: Error -5
> writing to log
> Apr  2 00:41:22 server7 kernel: Buffer I/O error on dev dm-0, logical
> block 66218, lost async page write
> Apr  2 00:41:22 server7 kernel: GFS2: fsid=vCluster:vGFS2.1: Error -5
> writing to log
> Apr  2 00:41:22 server7 kernel: GFS2: fsid=vCluster:vGFS2.1: Error -5
> writing to log

I'm guessing this is GFS2 reacting to the loss of its backing storage.
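
Worth checking, once the storage is back, whether GFS2 withdrew from
the filesystem when those I/O errors hit; if it did, the mount on
server7 will need an unmount/remount (or a reboot) before it is usable
again. A quick look:

  dmesg | grep -i withdraw
  mount | grep gfs2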

> DRBD state on surviving node server7
> ---------------------------------------------------------------
> version: 8.4.9-1 (api:1/proto:86-101)
> GIT-hash: 9976da086367a2476503ef7f6b13d4567327a280 build by
> akemi at Build64R7, 2016-12-04 01:08:48
>  0: cs:WFConnection ro:Primary/Unknown ds:Inconsistent/DUnknown C r-----
>     ns:3414 nr:1438774 dw:1441849 dr:72701144 al:25 bm:0 lo:0 pe:0 ua:0
> ap:0 ep:1 wo:f oos:29623116
> 
> 
> Question:
> ------------------
> Are these errors serious in nature? 
> When the crashed node comes up again and rejoins the cluster, will it
> cause any problems? 
> How can this be avoided if a node crashes before the sync completes?

Obviously, neither I nor anyone else here can say for sure what will
happen or what state your systems are in, as we don't have a complete
understanding of your setup. Second, I am not an expert on DRBD or
GFS2, though I use both a fair bit.

My educated guess is that you're OK. Start the other node and it should
reconnect and go back to sync'ing as it was before.
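
Since Pacemaker is managing DRBD (your drbd_data_clone), that should
just be a matter of starting the cluster on the failed node and
watching the resync pick up where it left off. Roughly, with the node
name taken from your pcs output:

  pcs cluster start server4ha   # starts corosync/pacemaker on server4ha
  watch cat /proc/drbd          # on server7: cs: should go WFConnection -> SyncTarget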

As for how to avoid this happening before the initial sync completes,
well, that depends on what caused the peer to lose connection. Until
the local storage is UpToDate, you're in a degraded state and can't
keep running if the UpToDate node goes away.
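
As a rough yardstick, the oos counter in your /proc/drbd paste is
29623116; if I'm reading it right that is in KiB, so roughly 28 GiB
still has to resync before server7 reaches UpToDate.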

> Env:
> ---------
> CentOS 7.3
> DRBD 8.4 
> gfs2-utils-3.1.9-3.el7.x86_64
> Pacemaker 1.1.15-11.el7_3.4
> 
> 
> Pacemaker:
> ---------------------
> [root at server7 ~]# pcs status
> Cluster name: vCluster
> Stack: corosync
> Current DC: server7ha (version 1.1.15-11.el7_3.4-e174ec8) - partition
> with quorum
> Last updated: Sun Apr  2 01:01:43 2017          Last change: Sun Apr  2
> 00:28:39 2017 by root via cibadmin on server4ha
> 
> 2 nodes and 9 resources configured
> 
> Online: [ server7ha ]
> OFFLINE: [ server4ha ]
> 
> Full list of resources:
> 
>  vCluster-VirtualIP-10.168.10.199       (ocf::heartbeat:IPaddr2):      
> Started server7ha
>  vCluster-Stonith-server7ha     (stonith:fence_ipmilan):        Stopped
>  vCluster-Stonith-server4ha     (stonith:fence_ipmilan):        Started
> server7ha
>  Clone Set: dlm-clone [dlm]
>      Started: [ server7ha ]
>      Stopped: [ server4ha ]
>  Clone Set: clvmd-clone [clvmd]
>      Started: [ server7ha ]
>      Stopped: [ server4ha ]
>  Master/Slave Set: drbd_data_clone [drbd_data]
>      Masters: [ server7ha ]
>      Stopped: [ server4ha ]
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> [root at server7 ~]# 
> 
> 
> Attaching DRBD config files.
> 
> 
> --Raman


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


