[DRBD-user] Is it normal that we can't directly remove the secondary node when fencing is set?

刘丹 mzlld1988 at 163.com
Sat Sep 10 08:46:01 CEST 2016

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi everyone,
I have a question about removing a secondary node in DRBD 9.
When fencing is set, is it normal that we cannot remove a secondary node in DRBD 9, while the same operation succeeds with DRBD 8.4.6?

The DRBD kernel source is the newest version (9.0.4-1); the DRBD utils version is 8.9.6.
Description:
    Three nodes, one of them is primary, the disk state is UpToDate, and fencing is set.
    I get the error 'State change failed: (-7) State change was refused by peer node' when executing 'drbdadm down <res-name>' on any of the secondary nodes.

Analysis:
    When the down command is executed on one of the secondary nodes,
    the secondary node calls change_cluster_wide_state() in drbd_state.c:
    change_cluster_wide_state()
    {
        ...
        if (have_peers) {
                /* ① Wait for the peer nodes to reply; the thread sleeps
                 *   until the peers have answered or the two-phase-commit
                 *   timeout expires. */
                if (wait_event_timeout(resource->state_wait,
                                       cluster_wide_reply_ready(resource),
                                       twopc_timeout(resource))) {
                        /* ② Collect the reply. */
                        rv = get_cluster_wide_reply(resource);
                } else {
                        ...
                }
        ...
    }
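
    As an illustration of how I understand steps ① and ② conceptually, here is a minimal
    user-space sketch of the reply handling: the node that initiates the state change
    collects the peers' answers, and a single refusal maps to SS_CW_FAILED_BY_PEER.
    The enum values and the helper collect_cluster_wide_reply() are simplified stand-ins
    of my own, not the real kernel definitions.

    #include <stdio.h>

    /* Simplified stand-ins for the reply packets and status codes;
     * the real definitions live in the DRBD kernel sources. */
    enum reply  { TWOPC_YES, TWOPC_NO };
    enum status { SS_SUCCESS, SS_CW_FAILED_BY_PEER };

    /* One "no" from any peer refuses the whole cluster-wide change. */
    static enum status collect_cluster_wide_reply(const enum reply *replies, int n_peers)
    {
        for (int i = 0; i < n_peers; i++)
            if (replies[i] == TWOPC_NO)
                return SS_CW_FAILED_BY_PEER;
        return SS_SUCCESS;
    }

    int main(void)
    {
        /* The scenario analysed below: the primary answers "no" because of
         * the fencing check, the third node answers "yes". */
        enum reply replies[] = { TWOPC_NO, TWOPC_YES };

        if (collect_cluster_wide_reply(replies, 2) == SS_CW_FAILED_BY_PEER)
            printf("State change failed: State change was refused by peer node\n");
        return 0;
    }

    In the real code the replies arrive asynchronously, which is why process ① sleeps
    in wait_event_timeout() before the aggregation in process ② can run.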

    Process ①
        The primary node executes the following call chain:
            ..->try_state_change->is_valid_soft_transition->__is_valid_soft_transition

            Finally, __is_valid_soft_transition() returns the error code SS_PRIMARY_NOP:


            if (peer_device->connection->fencing_policy >= FP_RESOURCE &&
                !(role[OLD] == R_PRIMARY && repl_state[OLD] < L_ESTABLISHED && !(peer_disk_state[OLD] <= D_OUTDATED)) &&
                 (role[NEW] == R_PRIMARY && repl_state[NEW] < L_ESTABLISHED && !(peer_disk_state[NEW] <= D_OUTDATED)))

                   return SS_PRIMARY_NOP;


            The primary node replies with the packet P_TWOPC_NO; the secondary node receives the reply and sets the connection status to TWOPC_NO.
            At this point, process ① finishes (a minimal evaluation of the check above is sketched below).
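
        To make the negations in the check easier to follow, here is a small stand-alone
        sketch that plugs in the states of this scenario as seen by the primary
        (old state: link established, peer disk UpToDate; new state: link going away
        because the secondary goes down, peer disk still not outdated). The enums are
        simplified stand-ins that only preserve the orderings the check relies on, not
        the real DRBD 9 definitions.

        #include <stdio.h>
        #include <stdbool.h>

        /* Simplified stand-ins: only the orderings the check relies on are preserved
         * (D_OUTDATED < D_UNKNOWN < D_UP_TO_DATE, L_OFF < L_ESTABLISHED).
         * The real enums in the DRBD 9 sources contain more states. */
        enum fencing_policy { FP_DONT_CARE, FP_RESOURCE, FP_STONITH };
        enum role           { R_SECONDARY, R_PRIMARY };
        enum repl_state     { L_OFF, L_ESTABLISHED };
        enum disk_state     { D_OUTDATED, D_UNKNOWN, D_UP_TO_DATE };

        enum { OLD, NEW };

        struct peer_view {               /* the primary's view of one peer device */
            enum role       role[2];
            enum repl_state repl_state[2];
            enum disk_state peer_disk_state[2];
        };

        /* The fencing check quoted above, transcribed for one peer device;
         * returns true when the transition must be refused (SS_PRIMARY_NOP). */
        static bool refuses_transition(enum fencing_policy fencing, const struct peer_view *v)
        {
            return fencing >= FP_RESOURCE &&
                !(v->role[OLD] == R_PRIMARY && v->repl_state[OLD] < L_ESTABLISHED &&
                  !(v->peer_disk_state[OLD] <= D_OUTDATED)) &&
                 (v->role[NEW] == R_PRIMARY && v->repl_state[NEW] < L_ESTABLISHED &&
                  !(v->peer_disk_state[NEW] <= D_OUTDATED));
        }

        int main(void)
        {
            /* Scenario from the report: the connection to the secondary that runs
             * "drbdadm down" goes away, but that peer's disk is still UpToDate,
             * i.e. not known to be outdated. */
            struct peer_view v = {
                .role            = { R_PRIMARY,     R_PRIMARY },
                .repl_state      = { L_ESTABLISHED, L_OFF },
                .peer_disk_state = { D_UP_TO_DATE,  D_UP_TO_DATE },
            };

            printf("refused? %s\n", refuses_transition(FP_RESOURCE, &v) ? "yes" : "no");
            return 0;
        }

        With these inputs the whole condition holds: the old-state term is satisfied
        because the link was still established before the change, and the new-state term
        is satisfied because afterwards the primary would be cut off from a peer whose
        disk is not known to be outdated. Hence SS_PRIMARY_NOP.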


    Process ②
           rv is set to SS_CW_FAILED_BY_PEER, which the user sees as the 'State change was refused by peer node' error quoted above.
        
    ==== DRBD 8.4.6 ====
        One node is primary, the other is secondary.
        When 'drbdadm down <res-name>' is executed on the secondary node, the first attempt, which would change the peer disk to D_UNKNOWN, is refused and the same error message is recorded in the log file.
        The second attempt changes the peer disk to D_OUTDATED instead, and the command succeeds.
        
        The following code reports the error:
        is_valid_state()
        {
            ...
            if (fp >= FP_RESOURCE &&
                ns.role == R_PRIMARY && ns.conn < C_CONNECTED &&
                ns.pdsk >= D_UNKNOWN) {    /* the decisive check on the peer disk state */
                    rv = SS_PRIMARY_NOP;
            }
            ...
        }
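
        The contrast with 9.0.4 becomes visible when this check is evaluated for the two
        attempts described above. A minimal sketch, again with simplified stand-in enums
        (only the ordering D_OUTDATED < D_UNKNOWN matters here):

        #include <stdio.h>
        #include <stdbool.h>

        /* Simplified stand-ins; only the ordering D_OUTDATED < D_UNKNOWN is relevant. */
        enum fencing_policy { FP_DONT_CARE, FP_RESOURCE, FP_STONITH };
        enum role           { R_SECONDARY, R_PRIMARY };
        enum conn_state     { C_STANDALONE, C_CONNECTED };
        enum disk_state     { D_OUTDATED, D_UNKNOWN, D_UP_TO_DATE };

        /* The 8.4.6 check quoted above, applied to the new state "ns"
         * the primary would end up in after the peer goes down. */
        static bool refuses_state(enum fencing_policy fp, enum role role,
                                  enum conn_state conn, enum disk_state pdsk)
        {
            return fp >= FP_RESOURCE &&
                   role == R_PRIMARY && conn < C_CONNECTED && pdsk >= D_UNKNOWN;
        }

        int main(void)
        {
            /* First attempt: the peer disk would become DUnknown -> refused. */
            printf("pdsk = D_UNKNOWN : refused? %s\n",
                   refuses_state(FP_RESOURCE, R_PRIMARY, C_STANDALONE, D_UNKNOWN) ? "yes" : "no");

            /* Second attempt: the peer disk is marked Outdated instead -> allowed,
             * which is why the down command succeeds on 8.4.6. */
            printf("pdsk = D_OUTDATED: refused? %s\n",
                   refuses_state(FP_RESOURCE, R_PRIMARY, C_STANDALONE, D_OUTDATED) ? "yes" : "no");
            return 0;
        }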
       
        After executing 'drbdadm down <res-name>' on the secondary node, the status of the primary node is:
        [root@drbd846 drbd-8.4.6]# cat /proc/drbd
        version: 8.4.6 (api:1/proto:86-101)
        GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by root@drbd846.node1, 2016-09-08 08:51:45
         0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/Outdated   r-----
            ns:1048508 nr:0 dw:0 dr:1049236 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

        The peer disk state is Outdated, not DUnknown, so the pdsk >= D_UNKNOWN check no longer applies.