[DRBD-user] Switchover Problems with DRBD

Mon Mar 10 18:45:48 CET 2008

I already posted this mail to the list last week, but the mailing list software refused to forward mails larger than 40 KB (mine was larger). So I am posting it again without the log file, which was the largest attachment.

If it is needed, please let me know.

On the Linux-HA list I got the reply below, so maybe one of the DRBD specialists here can help me get the config working again:

> -----Original Message-----
> From: linux-ha-bounces at lists.linux-ha.org
> [mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Chun Tian (binghe)
> Sent: Monday, 10 March 2008 14:57
> To: General Linux-HA mailing list
> Subject: Re: AW: AW: [Linux-HA] Switchover problem with DRBD
> 
> Hi, Florian
> 
> I compared it with my HA config and can almost say that your Heartbeat
> configuration should work, but something is wrong with DRBD. See this:
> 
> crmd[17381]: 2008/03/05_11:44:34 ERROR: process_lrm_event: LRM operation DRBD_AFD:1_promote_0 (17) Timed Out (timeout=20000ms)
> drbd[18348]:	2008/03/05_11:44:34 DEBUG: r0 notify: post for stop - counts: active 0 - starting 1 - stopping 1
> drbd[18348]:	2008/03/05_11:44:34 DEBUG: r0: Calling drbdadm -c /etc/drbd.conf state r0
> drbd[18348]:	2008/03/05_11:44:44 DEBUG: r0: Exit code 0
> drbd[18348]:	2008/03/05_11:44:44 DEBUG: r0: Command output: Child process does not terminate! Exiting. No response from the DRBD driver! Is the module loaded? Unknown/TOO_LARGE
> drbd[18348]:	2008/03/05_11:44:44 DEBUG: r0: Calling drbdadm -c /etc/drbd.conf cstate r0
> lrmd[17378]: 2008/03/05_11:44:54 WARN: DRBD_AFD:1:notify process (PID 18348) timed out (try 1).  Killing with signal SIGTERM (15).
> lrmd[17378]: 2008/03/05_11:44:54 WARN: operation notify[18] on 
> ocf::drbd::DRBD_AFD:1 for client 17381, its parameters: 
> CRM_meta_role=[Master] CRM_meta_notify_stop_resource=[DRBD_AFD:0 ] 
> CRM_meta_notify_operation=[stop] 
> CRM_meta_notify_start_resource=[DRBD_AFD:1 ] 
> CRM_meta_notify_stop_uname=[noderz ] 
> CRM_meta_notify_promote_resource=[DRBD_AFD:1 ] drbd_resource=[r0] 
> CRM_meta_notify_master_uname=[noderz ] 
> CRM_meta_notify_demote_uname=[noderz ] CRM_meta_master_max=[1] 
> CRM_meta_notify_master_resource=[DRBD_AFD:0 ] CRM_meta_timeout=[20000]
> CRM_meta_s: pid [18348] timed out
> 
> Something goes wrong when HA runs the drbdadm command: it hangs.
> Judging from your drbd.conf, I think you may be using DRBD 8.x rather
> than 7.x, am I right? I must say that for your case the more stable
> DRBD 7.x is enough: you never want a two-primary DRBD setup.
> 
> Regards,
> 
> Chun Tian (binghe)
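
To test the suspicion that drbdadm itself hangs, independent of heartbeat, something like the following could be run by hand on the node that failed to promote. It is only a minimal sketch: the helper names drbd_module_loaded and run_with_timeout and the 10-second limit are my own choices, and it assumes that /proc/drbd exists whenever the DRBD module is loaded. The two drbdadm calls are the ones shown in the log above.

```python
import os
import subprocess


def drbd_module_loaded(proc_path="/proc/drbd"):
    """True if the DRBD status file exists, which implies the module is loaded."""
    return os.path.exists(proc_path)


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, output).

    Returns (None, reason) if the command cannot be started or does not
    finish within timeout_s seconds.
    """
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, "timed out after %s s" % timeout_s
    except OSError as exc:  # e.g. drbdadm is not installed on this machine
        return None, str(exc)
    return done.returncode, done.stdout


if __name__ == "__main__":
    print("DRBD module loaded:", drbd_module_loaded())
    # The same two calls the OCF agent logged; r0 is the resource from the log.
    for sub in ("state", "cstate"):
        cmd = ["drbdadm", "-c", "/etc/drbd.conf", sub, "r0"]
        print(" ".join(cmd), "->", run_with_timeout(cmd, 10))
```

If a call comes back as (None, "timed out ...") while the module is reported as loaded, the hang is in drbdadm or the driver and not in heartbeat. Note that a process stuck in uninterruptible kernel wait cannot be killed, so in that case even this wrapper may block.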

--------------------------------------------------------------------------------------------------
Hi everybody,

While testing my 2-node cluster I got strange behaviour when stopping heartbeat on my primary node. I don't know whether it is caused by heartbeat, DRBD, or both, so I am posting this to both lists.

Starting with this:

============
Last updated: Wed Mar  5 15:01:10 2008
Current DC: noderz (91d062c3-ad0a-4c24-b759-acada7f19101)
2 Nodes configured.
3 Resources configured.
============

Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): online
Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): online

Master/Slave Set: DRBD
    DRBD_AFD:0  (heartbeat::ocf:drbd):  Master noderz
    DRBD_AFD:1  (heartbeat::ocf:drbd):  Started nodekrz
Resource Group: Group1
    Filesystem  (heartbeat::ocf:Filesystem):    Started noderz
    AFD (lsb:afdha):    Started noderz
Cluster_IP      (heartbeat::ocf:IPaddr):        Started noderz

I ran /etc/init.d/heartbeat stop on the primary node (noderz) and expected this:

============
Last updated: Wed Mar  5 15:01:10 2008
Current DC: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d) 
2 Nodes configured.
3 Resources configured.
============

Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): OFFLINE
Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): online

Master/Slave Set: DRBD
    DRBD_AFD:0  (heartbeat::ocf:drbd):  Stopped
    DRBD_AFD:1  (heartbeat::ocf:drbd):  Master nodekrz
Resource Group: Group1
    Filesystem  (heartbeat::ocf:Filesystem):    Started nodekrz
    AFD (lsb:afdha):    Started nodekrz
Cluster_IP      (heartbeat::ocf:IPaddr):        Started nodekrz

But I got this:
============
Last updated: Wed Mar  5 14:52:06 2008
Current DC: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d)
2 Nodes configured.
3 Resources configured.
============

Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): OFFLINE
Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): online

Master/Slave Set: DRBD
    DRBD_AFD:0  (heartbeat::ocf:drbd):  Stopped
    DRBD_AFD:1  (heartbeat::ocf:drbd):  Started nodekrz

Failed actions:
    DRBD_AFD:1_promote_0 (node=nodekrz, call=17, rc=-2): Timed Out
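
For anyone who wants to watch for this condition automatically, here is a minimal sketch that pulls the operation, node, call id and return code out of such a "Failed actions" entry. The regular expression is only my assumption, derived from the single line shown above and not from any documented output format, and parse_failed_actions is a name I made up.

```python
import re

# Pattern guessed from one sample line; other crm_mon versions may differ.
FAILED_RE = re.compile(
    r"^\s*(?P<op>\S+)\s+\(node=(?P<node>[^,]+),\s*call=(?P<call>\d+),"
    r"\s*rc=(?P<rc>-?\d+)\):\s*(?P<status>.+?)\s*$"
)


def parse_failed_actions(text):
    """Return one dict per line that looks like a failed-action entry."""
    results = []
    for line in text.splitlines():
        match = FAILED_RE.match(line)
        if match:
            results.append({
                "op": match.group("op"),
                "node": match.group("node"),
                "call": int(match.group("call")),
                "rc": int(match.group("rc")),
                "status": match.group("status"),
            })
    return results


if __name__ == "__main__":
    sample = (
        "Failed actions:\n"
        "    DRBD_AFD:1_promote_0 (node=nodekrz, call=17, rc=-2): Timed Out\n"
    )
    for entry in parse_failed_actions(sample):
        print(entry)
```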

I have attached the /var/log/ha-debug of the node, the output of cibadmin -Q, my ha.cf and my drbd.conf (in case they are needed).

It would be nice if someone could give me a hint as to why the switchover fails.

Thanks a lot for any help.
Florian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cib.xml
Type: text/xml
Size: 19937 bytes
Desc: cib.xml
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20080310/7df112ac/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ha.cf
Type: application/octet-stream
Size: 423 bytes
Desc: ha.cf
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20080310/7df112ac/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd.conf
Type: application/octet-stream
Size: 831 bytes
Desc: drbd.conf
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20080310/7df112ac/attachment-0001.obj>