[DRBD-user] drbdmanage restart failed

Thu Aug 31 09:50:07 CEST 2017

On Wed, Aug 30, 2017 at 01:22:43PM +0800, 杨成伟 wrote:
> Hi All,
> 
> I'm following drbd 9.x user guide and has a 3 node(2 control, 1
> satellite) setup.
> 
> However, I'm not sure what I've did on one node and that node can not
> restart by `drbdmanage restart`.
> 
> It errors like below
> 
> 
> ```
> Attempting to startup the server through D-Bus activation...
> ERROR:dbus.proxies:Introspect error on :1.96:/interface: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
> D-Bus connection successful, server is running and reachable
> ```
> 
> 
> and from the dbus log(journalctl -u dbus -f), I found that drbdmanage failed like below.
> 
> 
> ```
> Aug 30 13:10:42 mesos-test-dev002-whdx.qiyi.virtual dbus[815]: [system] Activating service name='org.drbd.drbdmanaged' (using servicehelper)
> Aug 30 13:10:42 mesos-test-dev002-whdx.qiyi.virtual dbus-daemon[815]: dbus[815]: [system] Activating service name='org.drbd.drbdmanaged' (using servicehelper)
> Aug 30 13:10:42 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DRBDmanage server, version 0.99.9 -- initializing on node 'mesos-test-dev002-whdx.qiyi.virtual'
> Aug 30 13:10:42 mesos-test-dev002-whdx.qiyi.virtual dbus[815]: [system] Successfully activated service 'org.drbd.drbdmanaged'
> Aug 30 13:10:42 mesos-test-dev002-whdx.qiyi.virtual dbus-daemon[815]: dbus[815]: [system] Successfully activated service 'org.drbd.drbdmanaged'
> Aug 30 13:10:42 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: Running external command: drbdadm -vvv adjust .drbdctrl
> Aug 30 13:10:42 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: External command 'drbdadm': Exit code 0
> Aug 30 13:10:42 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DRBDManage starting as potential leader node
> Aug 30 13:10:59 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: ERROR      drbdsetup events tracing has failed, restarting
> Aug 30 13:10:59 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       drbdsetup events tracing reestablished
> Aug 30 13:11:17 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       Reading initial DRBD control volume status
> Aug 30 13:11:17 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       Finished reading initial DRBD control volume status
> Aug 30 13:11:17 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       Leader election by wait for connections
> Aug 30 13:11:17 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: Running external command: drbdsetup wait-connect-resource --wait-after-sb=yes --wfc-timeout=2 .drbd
> ctrl
> Aug 30 13:11:17 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: External command 'drbdsetup': Exit code 0
> Aug 30 13:11:17 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: Running external command: drbdsetup primary .drbdctrl
> Aug 30 13:11:18 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: ERROR      DrbdAdm: External command 'drbdsetup': Exit code 11
> Aug 30 13:11:18 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       Leader election: lost election
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: ERROR      Signal creation failed, unhandled exception encountered
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: ERROR      Signal creation failed, unhandled exception encountered
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: ERROR      Signal creation failed, unhandled exception encountered
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       starting resource 'r0'
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: Running external command: drbdadm --config-to-test /var/lib/drbd.d/drbdmanage_r0.res.tmp --config-t
> o-exclude /var/lib/drbd.d/drbdmanage_r0.res sh-nop
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: External command 'drbdadm': Exit code 0
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: Running external command: drbdadm -c /var/lib/drbd.d/drbdmanage_r0.res.tmp -d up r0
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: External command 'drbdadm': Exit code 0
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: Running external command: drbdadm -vvv adjust -c /var/lib/drbd.d/drbdmanage_r0.res r0
> Aug 30 13:11:24 mesos-test-dev002-whdx.qiyi.virtual drbdmanaged[28329]: INFO       DrbdAdm: External command 'drbdadm': Exit code 0
> ```
> 
> 
> It seems `drbdmanage restart` fails because `drbdsetup primary .drbdctl` fails,
> so I just run this command on the node.
> 
> 
> ```
> [root at mesos-test-dev002-whdx ~]# drbdsetup primary .drbdctrl
> .drbdctrl: State change failed: (-1) Multiple primaries not allowed by config
> ```
> 
> 
> Apparently, it fails and said *Multiple primaries not allowed by config*,
> I am not sure why it will run this command because drbdmanage role says this node is not the leader

That is okay. On startup it sees that it has a control volume and tries
to become primary (it does not check if there is already another
primary, there is no reason), it obviously fails, which is the indicator
that it will be in the role of a potential leader.

> 
> ```
> [root at mesos-test-dev002-whdx ~]# drbdmanage role
> Operation completed successfully
> potential leader (CL: mesos-test-dev001-whdx.qiyi.virtual)
> ```

Looks good. Especially on startup/restart drbdmanage can take too long
until it reacts to commands and then you see these dbus timeouts. When
drbdmanage reached some initial state where it can react to dbus
commands, it would send "hey, I'm currently not ready"-commands to the
CLI and the CLI would then retry and you would not see dbus-timeouts.

Getting DM really quick in an initial state where it can answer
dbus-commands is not that trivial, so just ignore dbus-timeouts on
startup. If the system is not ultra-busy starting hundreds of resources,
this is usually one such dbus-timeout.

Regards, rck