[DRBD-user] Removing DRBD Kernel Module Blocks

Andrew Martin amartin at xes-inc.com
Fri Jan 27 23:19:54 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi Felix, 


>> Jan 26 15:44:14 node1 kernel: [177694.517283] block drbd0: Requested 
>> state change failed by peer : Refusing to be Primary while peer is not 
>> outdated (-7) 

> This is odd. I don't think DRBD should attempt to become primary when 
> you issue a stop command. 
I agree. I don't understand why this happens when I attempt to stop it and remove the module. Does anyone know what this error means and why it would occur when attempting to stop DRBD? 


> Shouldn't pacemaker be stopping and starting this service for you? 
It is; however, I discovered that init entries for DRBD still existed in /etc/rc*. I removed those, so now Pacemaker is the only way to start/stop DRBD. 
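For reference, this amounted to removing the init script links so the drbd init script no longer runs at boot. The exact command depends on the distribution, so treat this as a sketch of the idea rather than exactly what I typed:

    # Debian/Ubuntu-style systems:
    update-rc.d -f drbd remove
    # RHEL/CentOS-style systems:
    chkconfig drbd off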


> I'm not sure it's normal for DRBD to outdate its disk on disconnection, but it does seem to make sense. 
This is probably because I have configured resource-level fencing using dopd. I believe the peer would outdate/fence this node once the connection is lost. 
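For context, the dopd setup in drbd.conf follows the usual pattern shown below. The resource name "mount2" and the timeout are placeholders rather than copied from my real config, and the outdater path may be under /usr/lib64/heartbeat on some systems:

    resource mount2 {
      disk {
        fencing resource-only;    # resource-level fencing only
      }
      handlers {
        fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
      }
    }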


After removing the DRBD entries from /etc/rc* and rebooting the nodes a couple of times, I am now able to stop Heartbeat, which stops DRBD successfully. Now, however, one of my DRBD resources (ms_drbd_mount2) will not promote to Master: 

Online: [ node1 ] 
OFFLINE: [ node2 ] 


 Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
     Masters: [ node1 ]
     Stopped: [ p_drbd_mount1:1 ]
 Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
     Slaves: [ node1 ]
     Stopped: [ p_drbd_mount2:1 ]
 Resource Group: g_apache
     p_fs_varwww    (ocf::heartbeat:Filesystem):    Started node1
     p_apache       (ocf::heartbeat:apache):        Started node1
 Resource Group: g_mount1
     p_fs_mount1    (ocf::heartbeat:Filesystem):    Started node1
     p_ip_nfs       (ocf::heartbeat:IPaddr2):       Started node1


Any attempt to control it via crm resource [promote|start|stop|cleanup] does nothing. I am able to manually set the DRBD resource to Primary. I took node2 offline in the hope that it would promote with just one node active, but it still remains a Slave. I do see some error messages in the log about resources being forced away after failures: 
pengine: [30681]: WARN: common_apply_stickiness: Forcing ms_drbd_crm away from node after 1000000 failures (max=1000000) 


However, shouldn't it have migrated already when that node went offline? How can I see what is preventing the DRBD resource from being promoted? The syslog contains: 

crmd: [30323]: info: te_rsc_command: Initiating action 43: monitor p_drbd_mount2:0_monitor_30000 on node1 (local) 
crmd: [30323]: info: do_lrm_rsc_op: Performing key=43:111:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e op=p_drbd_mount2:0_monitor_30000 ) 
lrmd: [30320]: info: rsc:p_drbd_mount2:0 monitor[192] (pid 14960) 
lrmd: [30320]: info: operation monitor[192] on p_drbd_mount2:0 for client 30323: pid 14960 exited with return code 0 
crmd: [30323]: info: process_lrm_event: LRM operation p_drbd_mount2:0_monitor_30000 (call=192, rc=0, cib-update=619, confirmed=false) ok 
crmd: [30323]: info: match_graph_event: Action p_drbd_mount2:0_monitor_30000 (43) confirmed on node1 (rc=0) 
pengine: [30681]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active in master mode on node1 
pengine: [30681]: notice: unpack_rsc_op: Operation p_drbd_mount2:0_last_failure_0 found resource p_drbd_mount2:0 active on node1 
pengine: [30681]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999998 more times on node2 before being forced off 
pengine: [30681]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999998 more times on node2 before being forced off 
pengine: [30681]: WARN: common_apply_stickiness: Forcing ms_drbd_mount2 away from node2 after 1000000 failures (max=1000000) 
pengine: [30681]: WARN: common_apply_stickiness: Forcing ms_drbd_mount2 away from node2 after 1000000 failures (max=1000000) 
pengine: [30681]: notice: LogActions: Leave p_drbd_mount1:0 (Master node1) 
pengine: [30681]: notice: LogActions: Leave p_drbd_mount1:1 (Stopped) 
pengine: [30681]: notice: LogActions: Leave p_drbd_mount2:0 (Slave node1) 
pengine: [30681]: notice: LogActions: Leave p_drbd_mount2:1 (Stopped) 
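In case it helps, the commands I have been trying look roughly like the following. The crm shell syntax may vary slightly between versions, and "mount2" stands in for the underlying DRBD resource name, which I have left out here:

    crm resource cleanup ms_drbd_mount2
    crm resource promote ms_drbd_mount2
    crm_mon -1                                        # status stays the same
    crm resource failcount p_drbd_mount2 show node2   # inspect the failcount from the pengine warnings
    drbdadm primary mount2                            # manual promotion does work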


I've attached my configuration (as output by crm configure show). 


Thanks, 


Andrew 


----- Original Message -----

From: "Felix Frank" < ff at mpexnet.de > 
To: "Andrew Martin" < amartin at xes-inc.com > 
Sent: Friday, January 27, 2012 2:52:05 AM 
Subject: Re: [DRBD-user] Removing DRBD Kernel Module Blocks 

Hi, 

On 01/26/2012 11:18 PM, Andrew Martin wrote: 
> I am using DRBD with pacemaker+heartbeat for a HA cluster. There are no 

fair choice. 

> mounted filesystems at this time. Below is a copy of the kernel log 

So the DRBDs are idle and managed by pacemaker, correct? 

> after I attempted to stop the drbd service: 

Shouldn't pacemaker be stopping and starting this service for you? 

> Jan 26 15:44:14 node1 kernel: [177694.517283] block drbd0: Requested 
> state change failed by peer : Refusing to be Primary while peer is not 
> outdated (-7) 

This is odd. I don't think DRBD should attempt to become primary when 
you issue a stop command. 

> Jan 26 15:44:14 node1 kernel: [177694.873466] block drbd0: peer( Primary 
> -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> 
> Outdated ) pdsk( UpToDate -> DUnknown ) 

I'm not sure it's normal for DRBD to outdate its disk on disconnection, 
but it does seem to make sense. 

> Jan 26 15:44:14 node1 kernel: [177695.209668] block drbd0: disk( 
> Outdated -> Diskless ) 

This looks funny as well. But may just be correct. 

Do you stop pacemaker before stopping DRBD? 
What happens if you disable pacemaker, run drbdadm down all, and then stop DRBD? 
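In other words, something along these lines, adjusting the init script / service names to your distribution:

    /etc/init.d/heartbeat stop   # stop the cluster stack so it no longer manages DRBD
    drbdadm down all             # take all DRBD resources down by hand
    /etc/init.d/drbd stop        # then stop the drbd service
    rmmod drbd                   # and only then try to unload the module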

Regards, 
Felix 

-------------- next part --------------
Attachment: pacemaker-config.txt
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120127/86783549/attachment.txt>

