Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello, I am having a weird problem with my RA in my virtual machines. As far as I can tell, it is related to timeouts when syncing. I have a working cluster where one node is inconsistent and syncing from the other node. When I reboot the SyncSource machine (yes, I am trying to break things and see how it all reacts :P) the other node notices and refuses to promote itself because it there is no consistent disk. So far, so good. What happens next is that the SyncSource machine comes back up and the sync restarts. Then the SyncSource machine tries to promote itself to master but this times out. It then tries to demote itself, but this times out too. After all this, the drbd is stopped and then started, repeating the process. (see attached log) If I reboot the SyncTarget node, the SyncSource node comes up safely and when the SyncTarget node comes up it smoothly joins the merry cluster and the problem is gone. Rebooting the SyncSource node again recreates the problem. Does this have any cause except load? Where does the 20000ms timeout come from? It is not configured by me and a quick search for "20" in "cibadmin --query" doesn't return any relevant results. Can I modify it and should I? Are there other solutions for this problem? Regards, Johan -------------- next part -------------- 12:44:01 SYSLOG info crmd [18177]: info: do_lrm_rsc_op: Performing key=16:68:0:e6df7f7f-a533-4afe-af87-a630b2ea89d3 op=drbd0:1_promote_0 ) 12:44:01 SYSLOG info lrmd [18174]: info: rsc:drbd0:1: promote 12:44:03 SYSLOG warning heartbeat [17330]: WARN: Late heartbeat: Node slave.domain.be: interval 5250 ms 12:44:09 SYSLOG info kernel block drbd0: role( Secondary -> Primary ) 12:44:15 SYSLOG warning kernel block drbd0: md_sync_timer expired! Worker calls drbd_md_sync(). 12:44:21 SYSLOG warning lrmd [18174]: WARN: drbd0:1:promote process (PID 26476) timed out (try 1). Killing with signal SIGTERM (15). 12:44:21 SYSLOG warning lrmd [18174]: WARN: operation promote[26] on ocf::drbd::drbd0:1 for client 18177, its parameters: CRM_meta_clone=[1] CRM_meta_notify_slave_resource=[drbd0:0 drbd0:1 ] CRM_meta_notify_active_resource=[ 12:44:21 SYSLOG err crmd [18177]: ERROR: process_lrm_event: LRM operation drbd0:1_promote_0 (26) Timed Out (timeout=20000ms) 12:44:22 SYSLOG info crmd [18177]: info: do_lrm_rsc_op: Performing key=95:68:0:e6df7f7f-a533-4afe-af87-a630b2ea89d3 op=drbd0:1_notify_0 ) 12:44:22 SYSLOG info lrmd [18174]: info: rsc:drbd0:1: notify 12:44:22 SYSLOG warning heartbeat [17330]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 60 ms (> 50 ms) (GSource: 0x810e658) 12:44:25 SYSLOG warning heartbeat [17330]: WARN: Late heartbeat: Node slave.domain.be: interval 5360 ms 12:44:26 SYSLOG info attrd [18176]: info: attrd_ha_callback: flush message from slave.domain.be 12:44:30 SYSLOG warning heartbeat [17330]: WARN: Late heartbeat: Node slave.domain.be: interval 5320 ms 12:44:40 SYSLOG info crm_attribute [6931]: info: Invoked: crm_attribute -N master.domain.be -n master-drbd0:1 -l reboot -v 100 12:44:40 SYSLOG info lrmd [18174]: info: RA output: (drbd0:1:notify:stdout) 12:44:40 SYSLOG info crmd [18177]: info: process_lrm_event: LRM operation drbd0:1_notify_0 (call=27, rc=0, cib-update=34, confirmed=true) complete ok 12:44:45 SYSLOG info crmd [18177]: info: do_lrm_rsc_op: Performing key=96:69:0:e6df7f7f-a533-4afe-af87-a630b2ea89d3 op=drbd0:1_notify_0 ) 12:44:45 SYSLOG info lrmd [18174]: info: rsc:drbd0:1: notify 12:44:46 SYSLOG info crmd [18177]: info: process_lrm_event: LRM operation drbd0:1_notify_0 (call=28, rc=0, cib-update=35, confirmed=true) complete ok 12:44:48 SYSLOG info crmd [18177]: info: do_lrm_rsc_op: Performing key=15:69:0:e6df7f7f-a533-4afe-af87-a630b2ea89d3 op=drbd0:1_demote_0 ) 12:44:48 SYSLOG info lrmd [18174]: info: rsc:drbd0:1: demote 12:44:48 SYSLOG info kernel block drbd0: role( Primary -> Secondary ) 12:44:53 SYSLOG warning kernel block drbd0: md_sync_timer expired! Worker calls drbd_md_sync(). 12:44:53 SYSLOG warning heartbeat [17330]: WARN: Late heartbeat: Node slave.domain.be: interval 6060 ms 12:45:08 SYSLOG warning lrmd [18174]: WARN: drbd0:1:demote process (PID 9159) timed out (try 1). Killing with signal SIGTERM (15). 12:45:08 SYSLOG warning lrmd [18174]: WARN: operation demote[29] on ocf::drbd::drbd0:1 for client 18177, its parameters: CRM_meta_clone=[1] CRM_meta_notify_slave_resource=[drbd0:0 ] CRM_meta_notify_active_resource=[ ] CRM_met 12:45:08 SYSLOG err crmd [18177]: ERROR: process_lrm_event: LRM operation drbd0:1_demote_0 (29) Timed Out (timeout=20000ms)