[DRBD-user] Removing DRBD Kernel Module Blocks

Andrew Martin amartin at xes-inc.com
Mon Jan 30 23:27:16 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi Felix, 


Thanks for the suggestion. I have removed all of the target-* parameters; however, the behavior is still present when failing over the DRBD device. 
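
For the record, I cleared them roughly as follows (a sketch only; the exact crm_resource invocation may differ slightly depending on the Pacemaker version): 

crm_resource --resource ms_drbd_mount1 --meta --delete-parameter target-role 
crm_resource --resource ms_drbd_mount2 --meta --delete-parameter target-role 
# ...and likewise for each other resource that had a target-* meta attribute set 
# (alternatively, "crm configure edit" and remove the target-role entries by hand) 

I did notice the following lines in the log when attempting to fail over: 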

Jan 30 15:11:15 node2 lrmd: [1440]: info: RA output: (p_drbd_mount2:1:notify:stdout) drbdsetup 1 syncer --set-defaults --create-device --rate=1000M 
Jan 30 15:11:15 node2 crmd: [1443]: info: send_direct_ack: ACK'ing resource op p_drbd_mount2:1_notify_0 from 120:428:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e: lrm_invoke-lrmd-1327957875-58 
Jan 30 15:11:15 node2 crmd: [1443]: info: process_lrm_event: LRM operation p_drbd_mount2:1_notify_0 (call=79, rc=0, cib-update=0, confirmed=true) ok 
Jan 30 15:11:15 node2 kernel: [ 1075.519415] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent ) 
Jan 30 15:11:15 node2 kernel: [ 1075.519441] block drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]). 
Jan 30 15:11:15 node2 kernel: [ 1075.792404] block drbd1: conn( WFBitMapT -> WFSyncUUID ) 
Jan 30 15:11:15 node2 kernel: [ 1075.806756] block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 
Jan 30 15:11:15 node2 kernel: [ 1075.809927] block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0) 
Jan 30 15:11:15 node2 kernel: [ 1075.809934] block drbd1: conn( WFSyncUUID -> SyncTarget ) 
Jan 30 15:11:15 node2 kernel: [ 1075.809942] block drbd1: Began resync as SyncTarget (will sync 838514728 KB [209628682 bits set]). 
Jan 30 15:11:16 node2 kernel: [ 1076.060191] block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec) 
Jan 30 15:11:16 node2 kernel: [ 1076.060200] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
Jan 30 15:11:16 node2 crmd: [1443]: info: do_lrm_rsc_op: Performing key=135:429:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e op=p_drbd_mount2:1_notify_0 ) 
Jan 30 15:11:16 node2 lrmd: [1440]: info: rsc:p_drbd_mount2:1 notify[81] (pid 6052) 
Jan 30 15:11:16 node2 lrmd: [1440]: info: operation notify[81] on p_drbd_mount2:1 for client 1443: pid 6052 exited with return code 0 
Jan 30 15:11:16 node2 crmd: [1443]: info: send_direct_ack: ACK'ing resource op p_drbd_mount2:1_notify_0 from 135:429:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e: lrm_invoke-lrmd-1327957876-59 
Jan 30 15:11:16 node2 crmd: [1443]: info: process_lrm_event: LRM operation p_drbd_mount2:1_notify_0 (call=81, rc=0, cib-update=0, confirmed=true) ok 
Jan 30 15:11:24 node2 attrd: [1442]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mount2:1 (10) 
Jan 30 15:11:24 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 217: master-p_drbd_mount2:1=10 
Jan 30 15:11:38 node2 kernel: [ 1097.979353] block drbd0: role( Secondary -> Primary ) 


Jan 30 15:11:55 node2 lrmd: [1440]: info: RA output: (p_drbd_mount1:0:monitor:stderr) lock on /var/lock/drbd-147-0 currently held by pid:6164 


Jan 30 15:11:55 node2 mount2_attribute: [6214]: info: Invoked: mount2_attribute -N node2 -n master-p_drbd_mount1:0 -l reboot -D 
Jan 30 15:11:55 node2 attrd: [1442]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mount1:0 (<null>) 
Jan 30 15:11:55 node2 attrd: [1442]: notice: attrd_perform_update: Sent delete 221: node=039e53da-dce8-4fd7-84bc-7261682529e8, attr=master-p_drbd_mount1:0, id=<n/a>, set=(null), section=status 
Jan 30 15:11:55 node2 crmd: [1443]: info: process_lrm_event: LRM operation p_drbd_mount1:0_monitor_30000 (call=74, rc=7, cib-update=114, confirmed=false) not running 
Jan 30 15:11:55 node2 attrd: [1442]: notice: attrd_perform_update: Sent delete 223: node=039e53da-dce8-4fd7-84bc-7261682529e8, attr=master-p_drbd_mount1:0, id=<n/a>, set=(null), section=status 
Jan 30 15:11:55 node2 kernel: [ 1115.843923] block drbd1: role( Secondary -> Primary ) 
Jan 30 15:11:55 node2 crmd: [1443]: info: process_lrm_event: LRM operation p_drbd_mount2:1_monitor_30000 (call=24, rc=8, cib-update=115, confirmed=false) master 
Jan 30 15:11:56 node2 attrd: [1442]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_drbd_mount1:0 (1) 
Jan 30 15:11:56 node2 crmd: [1443]: info: do_lrm_rsc_op: Performing key=108:434:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e op=p_drbd_mount1:0_notify_0 ) 
Jan 30 15:11:56 node2 lrmd: [1440]: info: rsc:p_drbd_mount1:0 notify[82] (pid 6228) 
Jan 30 15:11:56 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 226: fail-count-p_drbd_mount1:0=1 
Jan 30 15:11:56 node2 attrd: [1442]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-p_drbd_mount1:0 (1327957916) 
Jan 30 15:11:56 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 228: last-failure-p_drbd_mount1:0=1327957916 
Jan 30 15:11:56 node2 lrmd: [1440]: info: operation notify[82] on p_drbd_mount1:0 for client 1443: pid 6228 exited with return code 0 
Jan 30 15:11:56 node2 crmd: [1443]: info: send_direct_ack: ACK'ing resource op p_drbd_mount1:0_notify_0 from 108:434:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e: lrm_invoke-lrmd-1327957916-60 
Jan 30 15:11:56 node2 crmd: [1443]: info: process_lrm_event: LRM operation p_drbd_mount1:0_notify_0 (call=82, rc=0, cib-update=0, confirmed=true) ok 
Jan 30 15:11:57 node2 attrd: [1442]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_drbd_mount2:1 (1) 
Jan 30 15:11:57 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 231: fail-count-p_drbd_mount2:1=1 
Jan 30 15:11:57 node2 attrd: [1442]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-p_drbd_mount2:1 (1327957916) 
Jan 30 15:11:57 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 234: last-failure-p_drbd_mount2:1=1327957916 


I was unable to trace the 6164 process as it had already terminated by the time I looked for it. The node simply stays as slave, even though it is the only online node. Attempting to clean up the ms_drbd_mount1 primitive does not resolve the situation; however, I could set the DRBD device as primary using drbdadm primary mount1. After doing so I could then run cleanup on ms_drbd_mount1, and Pacemaker resumed bringing up the rest of the resources successfully. 
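
In other words, the manual sequence that gets things moving again on node2 is essentially: 

drbdadm primary mount1                # promote the DRBD resource by hand 
crm resource cleanup ms_drbd_mount1   # clear the failed ops so Pacemaker re-probes and continues 

But of course the goal is for Pacemaker to do that promotion itself. 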


I also noticed this in the log: 

Jan 30 15:11:59 node2 kernel: [ 1119.882808] block drbd0: role( Primary -> Secondary ) 
Jan 30 15:11:59 node2 lrmd: [1440]: info: RA output: (p_drbd_mount1:0:stop:stdout) 
Jan 30 15:11:59 node2 kernel: [ 1119.960543] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
Jan 30 15:11:59 node2 kernel: [ 1119.960589] block drbd0: short read expecting header on sock: r=-512 
Jan 30 15:11:59 node2 kernel: [ 1119.968719] block drbd0: asender terminated 
Jan 30 15:11:59 node2 kernel: [ 1119.968768] block drbd0: Terminating asender thread 
Jan 30 15:12:00 node2 kernel: [ 1119.969386] block drbd0: Connection closed 
Jan 30 15:12:00 node2 kernel: [ 1119.969421] block drbd0: conn( Disconnecting -> StandAlone ) 
Jan 30 15:12:00 node2 kernel: [ 1119.969467] block drbd0: receiver terminated 
Jan 30 15:12:00 node2 kernel: [ 1119.969470] block drbd0: Terminating receiver thread 
Jan 30 15:12:00 node2 kernel: [ 1119.969634] block drbd0: disk( UpToDate -> Diskless ) 
Jan 30 15:12:00 node2 kernel: [ 1119.970702] block drbd0: drbd_bm_resize called with capacity == 0 
Jan 30 15:12:00 node2 kernel: [ 1119.971970] block drbd0: worker terminated 
Jan 30 15:12:00 node2 kernel: [ 1119.971973] block drbd0: Terminating worker thread 
Jan 30 15:12:00 node2 lrmd: [1440]: info: RA output: (p_drbd_mount1:0:stop:stdout) 
Jan 30 15:12:01 node2 mount2_attribute: [6433]: info: Invoked: mount2_attribute -N node2 -n master-p_drbd_mount1:0 -l reboot -D 


I'm not sure whether the "short read expecting header on sock" error is related to why the resource can't be promoted, but it doesn't seem to stop the drbdadm utility from promoting it manually. Is there some Pacemaker utility that gives more verbose information on why a resource fails to start? 
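
For example, would something along these lines show why the promote is never attempted? I'm guessing at the exact syntax here, so treat it as a sketch: 

crm_simulate -sL    # show allocation scores against the live CIB 
# or, on older Pacemaker builds: 
ptest -sL 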


Also, what does the number after the resource name indicate, e.g. the :1 in p_drbd_mount2:1 or p_drbd_mount2:1_stop_0 below: 

Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
p_drbd_mount2:1 (ocf::linbit:drbd): Master node2 (unmanaged) FAILED 
Slaves: [ node1 ] 

Failed actions: 
p_drbd_mount2:1_stop_0 (node=node2, call=94, rc=-2, status=Timed Out): unknown exec error 

Thanks, 


Andrew 
----- Original Message -----

From: "Felix Frank" < ff at mpexnet.de > 
To: "Andrew Martin" < amartin at xes-inc.com > 
Cc: "drbd-user" < drbd-user at lists.linbit.com > 
Sent: Monday, January 30, 2012 3:07:08 AM 
Subject: Re: [DRBD-user] Removing DRBD Kernel Module Blocks 

Hi, 

On 01/29/2012 03:03 PM, Lars Ellenberg wrote: 
>>> I don't think DRBD should attempt to become primary when 
>>> you issue a stop command. 
> And it does not. It is the *peer* that is "refusing" here, 
> hinting to "self" that I should Outdate myself when voluntarily 
> disconnecting from a peer in Primary role. 

D'oh! 

On 01/27/2012 11:19 PM, Andrew Martin wrote: 
> However, shouldn't it have migrated already when that node went offline? 
> How can I tell what is preventing the DRBD resource from being promoted? 
... 
> I've attached my configuration (as output by crm configure show). 

Hum, difficult. I can't really tell the problem from what you posted. 

From experience: Your config contains lots of "target-role" etc. 
parameters (probably from issuing crm resource start / promote ...). 
I have known some of these to prevent clean failovers under specific 
circumstances. 

You may want to try getting rid of all those target-* parameters as a 
first step. 

HTH, 
Felix 

