Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 27 Sep 2016 11:51 pm, "刘丹" <mzlld1988 at 163.com> wrote:
>
> Failed to send to one or more email servers, so sending again.
>
> At 2016-09-27 15:47:37, "Nick Wang" <nwang at suse.com> wrote:
>>>> On 2016-9-26 at 19:17, in message
>>>> <CACp6BS7W6PyW=453WkrRFGSZ+f0mqHqL2m9FjSXFicOCQ+wiwA at mail.gmail.com>,
>>>> Igor Cicimov <igorc at encompasscorporation.com> wrote:
>> On 26 Sep 2016 7:26 pm, "mzlld1988" <mzlld1988 at 163.com> wrote:
>> >
>> > I applied the attached patch file to scripts/drbd.ocf; pacemaker can
>> > then start drbd successfully, but only on two nodes. The third node's
>> > drbd is down. Is that right?
>>
>> Well, you didn't say you have 3 nodes. Usually you use pacemaker with
>> 2 nodes and drbd.
>
>The patch is supposed to help in 3-node (or more) scenarios, as long as
>there is only one Primary.
>Is the 3-node DRBD cluster working without pacemaker? And how did you
>configure it in pacemaker?
>
> According to http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf, I
> executed the following commands to configure drbd in pacemaker.
>
> [root@pcmk-1 ~]# pcs cluster cib drbd_cfg
> [root@pcmk-1 ~]# pcs -f drbd_cfg resource create WebData ocf:linbit:drbd \
>     drbd_resource=wwwdata op monitor interval=60s
> [root@pcmk-1 ~]# pcs -f drbd_cfg resource master WebDataClone WebData \
>     master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
>     notify=true

You need clone-max=3 for 3 nodes.

> [root@pcmk-1 ~]# pcs cluster cib-push drbd_cfg
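For three nodes the master/slave set needs three clone instances, as noted
above. A minimal sketch of the corrected step, assuming the same resource
names as in the quoted commands (only clone-max changes; master-max stays
at 1 so there is still a single Primary):

    [root@pcmk-1 ~]# pcs -f drbd_cfg resource master WebDataClone WebData \
        master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 \
        notify=true
    [root@pcmk-1 ~]# pcs cluster cib-push drbd_cfg

On an already-running cluster, updating the meta attribute in place should
achieve the same result (exact syntax varies between pcs versions):

    [root@pcmk-1 ~]# pcs resource meta WebDataClone clone-max=3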
>> > And another question: can pacemaker successfully stop the slave
>> > node? My result is that pacemaker can't stop the slave node.
>> >
>Yes, need to check the log to see which resource prevents pacemaker
>from stopping.
>
> Pacemaker can't stop the slave node's drbd. I think the reason may be
> the same as in my previous email (see attached file), but no one
> replied to that email.
>
> [root@drbd ~]# pcs status
> Cluster name: mycluster
> Stack: corosync
> Current DC: drbd.node103 (version 1.1.15-e174ec8) - partition with quorum
> Last updated: Mon Sep 26 04:36:50 2016  Last change: Mon Sep 26 04:36:49 2016 by root via cibadmin on drbd.node101
>
> 3 nodes and 2 resources configured
>
> Online: [ drbd.node101 drbd.node102 drbd.node103 ]
>
> Full list of resources:
>
>  Master/Slave Set: WebDataClone [WebData]
>      Masters: [ drbd.node102 ]
>      Slaves: [ drbd.node101 ]
>
> Daemon Status:
>   corosync: active/corosync.service is not a native service, redirecting to /sbin/chkconfig.
>   Executing /sbin/chkconfig corosync --level=5
>   enabled
>   pacemaker: active/pacemaker.service is not a native service, redirecting to /sbin/chkconfig.
>   Executing /sbin/chkconfig pacemaker --level=5
>   enabled
>   pcsd: active/enabled
>
> -------------------------------------------------------------
> Failed to execute 'pcs cluster stop drbd.node101'
>
> = Error message on drbd.node101 (secondary node) =
> Sep 26 04:39:26 drbd lrmd[3521]: notice: WebData_stop_0:4726:stderr [ Command 'drbdsetup down r0' terminated with exit code 11 ]
> Sep 26 04:39:26 drbd lrmd[3521]: notice: WebData_stop_0:4726:stderr [ r0: State change failed: (-10) State change was refused by peer node ]
> Sep 26 04:39:26 drbd lrmd[3521]: notice: WebData_stop_0:4726:stderr [ additional info from kernel: ]
> Sep 26 04:39:26 drbd lrmd[3521]: notice: WebData_stop_0:4726:stderr [ failed to disconnect ]
> Sep 26 04:39:26 drbd lrmd[3521]: notice: WebData_stop_0:4726:stderr [ Command 'drbdsetup down r0' terminated with exit code 11 ]
> Sep 26 04:39:26 drbd crmd[3524]: error: Result of stop operation for WebData on drbd.node101: Timed Out | call=12 key=WebData_stop_0 timeout=100000ms
> Sep 26 04:39:26 drbd crmd[3524]: notice: drbd.node101-WebData_stop_0:12 [ r0: State change failed: (-10) State change was refused by peer node\nadditional info from kernel:\nfailed to disconnect\nCommand 'drbdsetup down r0' terminated with exit code 11\nr0: State change failed: (-10) State change was refused by peer node\nadditional info from kernel:\nfailed to disconnect\nCommand 'drbdsetup down r0' terminated with exit code 11\nr0: State change failed: (-10) State change was refused by peer node\nadditional info from kernel:\nfailed t
>
> = Error message on drbd.node102 (primary node) =
> Sep 26 04:39:25 drbd kernel: drbd r0 drbd.node101: Preparing remote state change 3578772780 (primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFFB)
> Sep 26 04:39:25 drbd kernel: drbd r0: State change failed: Refusing to be Primary while peer is not outdated
> Sep 26 04:39:25 drbd kernel: drbd r0: Failed: susp-io( no -> fencing )
> Sep 26 04:39:25 drbd kernel: drbd r0 drbd.node101: Failed: conn( Connected -> TearDown ) peer( Secondary -> Unknown )
> Sep 26 04:39:25 drbd kernel: drbd r0/0 drbd1 drbd.node101: Failed: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
> Sep 26 04:39:25 drbd kernel: drbd r0 drbd.node101: Aborting remote state change 3578772780
>
>> > I'm looking forward to your answers. Thanks.
>> >
>> Yes, it works with 2 nodes drbd9 configured in the standard way, not
>> via drbdmanage. Haven't tried any other layout.
>>
>
>Best regards,
>Nick

> ---------- Forwarded message ----------
> From: "刘丹" <mzlld1988 at 163.com>
> To: drbd-user <drbd-user at lists.linbit.com>
> Cc:
> Date: Sat, 10 Sep 2016 14:46:01 +0800 (CST)
> Subject: Is it normal that we can't directly remove the secondary node when fencing is set?
>
> Hi everyone,
> I have a question about removing the secondary node of DRBD9.
> When fencing is set, is it normal that we can't remove the secondary
> node with DRBD9, while the same operation succeeds with DRBD 8.4.6?
> The DRBD kernel module is the newest version (9.0.4-1); the version of
> drbd-utils is 8.9.6.
>
> Description:
> 3 nodes; one of the nodes is Primary, with disk state UpToDate. Fencing
> is set.
> I get the error message 'State change failed: (-7) State change was
> refused by peer node' when executing 'drbdadm down <res-name>' on any
> of the secondary nodes.
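The "fencing is set" premise matters for everything below, so for context,
here is a minimal sketch of what such a configuration typically looks like.
This is an assumption about the setup, not the poster's actual resource
file: the resource name r0 matches the logs above, in DRBD 9 the fencing
keyword is a per-connection (net section) option (in 8.4 it lives in the
disk section), and the handler paths are the crm-fence-peer scripts shipped
with drbd-utils:

    resource r0 {
        net {
            # this is what makes fencing_policy >= FP_RESOURCE
            # in the kernel checks quoted below
            fencing resource-only;    # or resource-and-stonith
        }
        handlers {
            # asks the CRM to outdate the peer when it becomes unreachable
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # lifts the constraint once the peer has resynced
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        ...
    }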
>
> Analysis:
> When the down command is executed on one of the secondary nodes, that
> node runs change_cluster_wide_state() in drbd_state.c:
>
> change_cluster_wide_state()
> {
>     ...
>     if (have_peers) {
>         if (wait_event_timeout(resource->state_wait,
>                                cluster_wide_reply_ready(resource),
>                                twopc_timeout(resource))) {
>             /* ① wait for the peer node to reply; the thread sleeps
>                until the peer node answers */
>             rv = get_cluster_wide_reply(resource);  /* ② get the reply */
>         } else {
>             ...
>         }
>     }
>     ...
> }
>
> Process ①
> The primary node runs the following call chain:
>     ... -> try_state_change -> is_valid_soft_transition
>         -> __is_valid_soft_transition
> and __is_valid_soft_transition finally returns the error code
> SS_PRIMARY_NOP:
>
>     if (peer_device->connection->fencing_policy >= FP_RESOURCE &&
>         !(role[OLD] == R_PRIMARY && repl_state[OLD] < L_ESTABLISHED &&
>           !(peer_disk_state[OLD] <= D_OUTDATED)) &&
>          (role[NEW] == R_PRIMARY && repl_state[NEW] < L_ESTABLISHED &&
>           !(peer_disk_state[NEW] <= D_OUTDATED)))
>             return SS_PRIMARY_NOP;
>
> The primary node then replies with the drbd_packet P_TWOPC_NO, and the
> secondary node records the reply by setting the connection status to
> TWOPC_NO. At this point process ① finishes.
>
> Process ②
> rv is set to SS_CW_FAILED_BY_PEER.
>
> ==== 8.4.6 version ====
> One node is Primary, the other Secondary.
> When 'drbdadm down <res-name>' is executed on the secondary node, the
> first attempt logs the same error while trying to change the peer disk
> to D_UNKNOWN, but the command succeeds on the second attempt, which
> changes the peer disk to D_OUTDATED instead.
>
> The following code reports the error:
>
> is_valid_state()
> {
>     ...
>     if (fp >= FP_RESOURCE &&
>         ns.role == R_PRIMARY && ns.conn < C_CONNECTED &&
>         ns.pdsk >= D_UNKNOWN /* ① */) {
>             rv = SS_PRIMARY_NOP;
>     }
>     ...
> }
>
> After executing 'drbdadm down <res-name>' on the secondary node, the
> status of the primary node is:
>
> [root@drbd846 drbd-8.4.6]# cat /proc/drbd
> version: 8.4.6 (api:1/proto:86-101)
> GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by root@drbd846.node1, 2016-09-08 08:51:45
>  0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/Outdated r-----
>     ns:1048508 nr:0 dw:0 dr:1049236 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> The peer disk state is Outdated, not DUnknown.
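Putting the 9.0.4 and 8.4.6 checks quoted above side by side, both enforce
the same rule; what differs is what the refusing node believes about the
peer's disk when the check runs. A self-contained paraphrase of the
NEW-state condition (stand-in enums with the same relative ordering as in
the DRBD headers, not the literal kernel code):

    #include <stdio.h>

    /* Stand-in enums: only the values the check needs, in the same
     * relative order as in the DRBD headers. D_UNKNOWN sorting above
     * D_OUTDATED is what makes the comparison below work. */
    enum fencing_policy  { FP_DONT_CARE, FP_RESOURCE, FP_STONITH };
    enum drbd_role       { R_UNKNOWN, R_PRIMARY, R_SECONDARY };
    enum drbd_repl_state { L_OFF, L_ESTABLISHED };
    enum drbd_disk_state { D_INCONSISTENT, D_OUTDATED, D_UNKNOWN,
                           D_CONSISTENT, D_UP_TO_DATE };

    /* Paraphrase of both SS_PRIMARY_NOP checks: with a fencing policy of
     * resource-only or stricter, a Primary refuses any state change that
     * leaves it Primary and disconnected from a peer whose disk it cannot
     * prove to be Outdated (or worse). Note that 8.4's
     * "ns.pdsk >= D_UNKNOWN" and 9's "!(pdsk <= D_OUTDATED)" select the
     * same set of disk states. */
    static int refuses_state_change(enum fencing_policy fp,
                                    enum drbd_role role,
                                    enum drbd_repl_state repl,
                                    enum drbd_disk_state pdsk)
    {
        return fp >= FP_RESOURCE &&
               role == R_PRIMARY &&
               repl < L_ESTABLISHED &&   /* peer would be disconnected */
               !(pdsk <= D_OUTDATED);    /* peer disk not provably outdated */
    }

    int main(void)
    {
        /* 9.0.4 trace above: the down would leave pdsk = DUnknown -> refused */
        printf("pdsk=DUnknown: refused=%d\n",
               refuses_state_change(FP_RESOURCE, R_PRIMARY, L_OFF, D_UNKNOWN));
        /* 8.4.6 second attempt: pdsk proposed as Outdated -> allowed */
        printf("pdsk=Outdated: refused=%d\n",
               refuses_state_change(FP_RESOURCE, R_PRIMARY, L_OFF, D_OUTDATED));
        return 0;
    }

Read this way, the 8.4.6 behaviour above is consistent: the first
'drbdadm down' is refused while the peer disk would become D_UNKNOWN, but
the retry proposes D_OUTDATED, for which the check is false, so the down
succeeds and the Primary ends up at ds:UpToDate/Outdated. In 9.0.4 the
refused two-phase commit aborts the whole remote state change atomically,
and (at least in the trace above) no second attempt proposing an Outdated
peer disk follows, so every 'drbdadm down' fails the same way.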