[DRBD-user] DRBD9 automatic unfence question

Miro Igov miro.igov at gmail.com
Wed Jul 7 16:37:20 CEST 2021


Hello, 

I have 2 node cluster (pacemaker) with master/slave DRBD resource.

The resource is configured with handlers:

                fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";

                unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";

 

If I shutdown the master node  it gets fenced by handler and promotes the
live node, see in log:

 

Jul  3 07:41:45 val-test1-copy kernel: [  308.983600] drbd odoo
val-test2-copy: peer( Primary -> Secondary )

Jul  3 07:41:46 val-test1-copy kernel: [  309.113294] drbd odoo
val-test2-copy: Preparing remote state change 4162941012

Jul  3 07:41:46 val-test1-copy kernel: [  309.113464] drbd odoo
val-test2-copy: Committing remote state change 4162941012 (primary_nodes=0)

Jul  3 07:41:46 val-test1-copy kernel: [  309.113467] drbd odoo
val-test2-copy: conn( Connected -> TearDown ) peer( Secondary ->Unknown )

Jul  3 07:41:46 val-test1-copy kernel: [  309.113468] drbd odoo/0 drbd0
val-test2-copy: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )

Jul  3 07:41:46 val-test1-copy kernel: [  309.113479] drbd odoo
val-test2-copy: ack_receiver terminated

Jul  3 07:41:46 val-test1-copy kernel: [  309.113480] drbd odoo
val-test2-copy: Terminating ack_recv thread

Jul  3 07:41:46 val-test1-copy kernel: [  309.134608] drbd odoo
val-test2-copy: Restarting sender thread

Jul  3 07:41:46 val-test1-copy kernel: [  309.145903] drbd odoo
val-test2-copy: Connection closed

Jul  3 07:41:46 val-test1-copy kernel: [  309.145912] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm disconnected

Jul  3 07:41:46 val-test1-copy kernel: [  309.146759] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm disconnected exit code 0

Jul  3 07:41:46 val-test1-copy kernel: [  309.146765] drbd odoo
val-test2-copy: conn( TearDown -> Unconnected )

Jul  3 07:41:46 val-test1-copy kernel: [  309.146770] drbd odoo
val-test2-copy: Restarting receiver thread

Jul  3 07:41:46 val-test1-copy kernel: [  309.146771] drbd odoo
val-test2-copy: conn( Unconnected -> Connecting )

Jul  3 07:41:47 val-test1-copy kernel: [  310.357823] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm fence-peer

Jul  3 07:41:27 val-test1-copy qemu-ga: message repeated 13 times: [ info:
guest-ping called]

Jul  3 07:41:47 val-test1-copy crm-fence-peer.9.sh[2907]:
DRBD_BACKING_DEV_0=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connecting
DRBD_LL_DISK=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=192.168.2.111 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=val-test1-copy
DRBD_NODE_ID_1=val-test2-copy DRBD_PEER_ADDRESS=192.168.2.112
DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=odoo DRBD_VOLUME=0
UP_TO_DATE_NODES=0x00000001 /usr/lib/drbd/crm-fence-peer.9.sh

Jul  3 07:41:47 val-test1-copy crm-fence-peer.9.sh[2907]: /

Jul  3 07:41:47 val-test1-copy crm-fence-peer.9.sh[2907]: INFO peers are
reachable, my disk is UpToDate: placed constraint
'drbd-fence-by-handler-odoo-ms_drbd_odoo'

Jul  3 07:41:47 val-test1-copy kernel: [  310.639973] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm fence-peer exit code 4 (0x400)

Jul  3 07:41:47 val-test1-copy kernel: [  310.639977] drbd odoo
val-test2-copy: fence-peer helper returned 4 (peer was fenced)

Jul  3 07:41:47 val-test1-copy kernel: [  310.639980] drbd odoo/0 drbd0
val-test2-copy: pdsk( DUnknown -> Outdated )

Jul  3 07:41:47 val-test1-copy kernel: [  310.639991] drbd odoo: Preparing
cluster-wide state change 1652257927 (0->-1 3/1)

Jul  3 07:41:47 val-test1-copy kernel: [  310.639992] drbd odoo: Committing
cluster-wide state change 1652257927 (0ms)

Jul  3 07:41:47 val-test1-copy kernel: [  310.639993] drbd odoo: role(
Secondary -> Primary )

Jul  3 07:41:49 val-test1-copy kernel: [  313.035751] drbd odoo/0 drbd0: new
current UUID: E67F75EE576FC547 weak: FFFFFFFFFFFFFFFE

Jul  3 07:41:50 val-test1-copy kernel: [  313.058994] EXT4-fs (drbd0):
mounted filesystem with ordered data mode. Opts: (null)

 

So far perfect.

Later i start the node and start pacemaker. The node joins as slave and
starts replication, then removes the fence constraint from pacemaker (also
perfect): 

 

Jul  3 07:51:13 val-test1-copy kernel: [  876.831326] drbd odoo
val-test2-copy: Handshake to peer 1 successful: Agreed network protocol
version 118

Jul  3 07:51:13 val-test1-copy kernel: [  876.831327] drbd odoo
val-test2-copy: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.

Jul  3 07:51:13 val-test1-copy kernel: [  876.831494] drbd odoo
val-test2-copy: Peer authenticated using 20 bytes HMAC

Jul  3 07:51:13 val-test1-copy kernel: [  876.831502] drbd odoo
val-test2-copy: Starting ack_recv thread (from drbd_r_odoo [2018])

Jul  3 07:51:13 val-test1-copy kernel: [  876.872236] drbd odoo: Preparing
cluster-wide state change 2689784605 (0->1 499/145)

Jul  3 07:51:13 val-test1-copy kernel: [  876.888237] drbd odoo/0 drbd0
val-test2-copy: drbd_sync_handshake:

Jul  3 07:51:13 val-test1-copy kernel: [  876.888238] drbd odoo/0 drbd0
val-test2-copy: self
E67F75EE576FC547:FA440AF0812F8348:2F6D8A307AFA432E:492AA1512F283D34 bits:44
flags:120

Jul  3 07:51:13 val-test1-copy kernel: [  876.888241] drbd odoo/0 drbd0
val-test2-copy: peer
FA440AF0812F8348:0000000000000000:492AA1512F283D34:03FA3B8A22BE10CC bits:0
flags:1020

Jul  3 07:51:13 val-test1-copy kernel: [  876.888242] drbd odoo/0 drbd0
val-test2-copy: uuid_compare()=source-use-bitmap by rule70

Jul  3 07:51:15 val-test1-copy kernel: [  878.341061] drbd odoo: State
change 2689784605: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC

Jul  3 07:51:15 val-test1-copy kernel: [  878.341063] drbd odoo: Committing
cluster-wide state change 2689784605 (1468ms)

Jul  3 07:51:15 val-test1-copy kernel: [  878.341077] drbd odoo
val-test2-copy: conn( Connecting -> Connected ) peer( Unknown ->Secondary )

Jul  3 07:51:15 val-test1-copy kernel: [  878.341078] drbd odoo/0 drbd0
val-test2-copy: pdsk( Outdated -> Consistent ) repl( Off-> WFBitMapS )

Jul  3 07:51:15 val-test1-copy kernel: [  878.341331] drbd odoo/0 drbd0
val-test2-copy: pdsk( Consistent -> Outdated )

Jul  3 07:51:15 val-test1-copy kernel: [  878.348293] drbd odoo/0 drbd0
val-test2-copy: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 47(1),
total 47; compression: 100.0%

Jul  3 07:51:15 val-test1-copy kernel: [  878.367674] drbd odoo/0 drbd0
val-test2-copy: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE
47(1), total 47; compression: 100.0%

Jul  3 07:51:15 val-test1-copy kernel: [  878.367686] drbd odoo/0 drbd0
val-test2-copy: helper command: /sbin/drbdadm before-resync-source

Jul  3 07:51:15 val-test1-copy kernel: [  878.368492] drbd odoo/0 drbd0
val-test2-copy: helper command: /sbin/drbdadm before-resync-source exit code
0

Jul  3 07:51:15 val-test1-copy kernel: [  878.368499] drbd odoo/0 drbd0
val-test2-copy: pdsk( Outdated -> Inconsistent ) repl( WFBitMapS ->
SyncSource )

Jul  3 07:51:15 val-test1-copy kernel: [  878.368523] drbd odoo/0 drbd0
val-test2-copy: Began resync as SyncSource (will sync 176 KB [44 bits set]).

Jul  3 07:51:15 val-test1-copy kernel: [  878.497928] drbd odoo/0 drbd0
val-test2-copy: updated UUIDs
E67F75EE576FC547:0000000000000000:2F6D8A307AFA432E:492AA1512F283D34

Jul  3 07:51:15 val-test1-copy kernel: [  878.515002] drbd odoo/0 drbd0
val-test2-copy: Resync done (total 1 sec; paused 0 sec; 176 K/sec)

Jul  3 07:51:15 val-test1-copy kernel: [  878.515006] drbd odoo/0 drbd0
val-test2-copy: pdsk( Inconsistent -> UpToDate ) repl( SyncSource ->
Established )

Jul  3 07:51:15 val-test1-copy kernel: [  878.533682] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm unfence-peer

Jul  3 07:51:15 val-test1-copy crm-unfence-peer.9.sh[11907]:
DRBD_BACKING_DEV_0=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connected
DRBD_LL_DISK=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=192.168.2.111 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=val-test1-copy
DRBD_NODE_ID_1=val-test2-copy DRBD_PEER_ADDRESS=192.168.2.112
DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=odoo DRBD_VOLUME=0
UP_TO_DATE_NODES=0x00000003 /usr/lib/drbd/crm-unfence-peer.9.sh

Jul  3 07:51:15 val-test1-copy crm-unfence-peer.9.sh[11907]: INFO Removed
constraint 'drbd-fence-by-handler-odoo-ms_drbd_odoo'

Jul  3 07:51:15 val-test1-copy kernel: [  878.582275] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm unfence-peer exit code 0

 

 

Now the problem - i got message "PingAck did not arrive in time" in kernel i
guess a network issue. The slave node was fenced as it should be.

Log from master node: 

 

Jul  3 22:05:35 val-test1-copy kernel: [50518.748265] drbd odoo
val-test2-copy: PingAck did not arrive in time.

Jul  3 22:05:35 val-test1-copy kernel: [50518.748319] drbd odoo
val-test2-copy: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown
)

Jul  3 22:05:35 val-test1-copy kernel: [50518.748320] drbd odoo/0 drbd0:
disk( UpToDate -> Consistent )

Jul  3 22:05:35 val-test1-copy kernel: [50518.748322] drbd odoo/0 drbd0
val-test2-copy: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )

Jul  3 22:05:35 val-test1-copy kernel: [50518.748339] drbd odoo
val-test2-copy: ack_receiver terminated

Jul  3 22:05:35 val-test1-copy kernel: [50518.748342] drbd odoo
val-test2-copy: Terminating ack_recv thread

Jul  3 22:05:35 val-test1-copy kernel: [50518.785064] drbd odoo
val-test2-copy: Restarting sender thread

Jul  3 22:05:36 val-test1-copy kernel: [50519.533938] drbd odoo: Preparing
cluster-wide state change 1241314843 (0->-1 0/0)

Jul  3 22:05:36 val-test1-copy kernel: [50519.533961] drbd odoo: Committing
cluster-wide state change 1241314843 (0ms)

Jul  3 22:05:36 val-test1-copy kernel: [50519.533964] drbd odoo/0 drbd0:
disk( Consistent -> UpToDate )

Jul  3 22:05:36 val-test1-copy kernel: [50519.611815] drbd odoo
val-test2-copy: Connection closed

Jul  3 22:05:36 val-test1-copy kernel: [50519.611823] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm disconnected

Jul  3 22:05:36 val-test1-copy kernel: [50519.612622] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm disconnected exit code 0

Jul  3 22:05:36 val-test1-copy kernel: [50519.612627] drbd odoo
val-test2-copy: conn( NetworkFailure -> Unconnected )

Jul  3 22:05:36 val-test1-copy kernel: [50519.612631] drbd odoo
val-test2-copy: Restarting receiver thread

Jul  3 22:05:36 val-test1-copy kernel: [50519.612633] drbd odoo
val-test2-copy: conn( Unconnected -> Connecting )

Jul  3 22:05:36 val-test1-copy kernel: [50520.169410] drbd odoo
val-test2-copy: Handshake to peer 1 successful: Agreed network protocol
version 118

Jul  3 22:05:36 val-test1-copy kernel: [50520.169411] drbd odoo
val-test2-copy: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.

Jul  3 22:05:36 val-test1-copy kernel: [50520.169582] drbd odoo
val-test2-copy: Peer authenticated using 20 bytes HMAC

Jul  3 22:05:36 val-test1-copy kernel: [50520.169588] drbd odoo
val-test2-copy: Starting ack_recv thread (from drbd_r_odoo [2037])

Jul  3 22:05:36 val-test1-copy kernel: [50520.229035] drbd odoo: Preparing
cluster-wide state change 3809390519 (0->1 499/146)

Jul  3 22:05:36 val-test1-copy kernel: [50520.245020] drbd odoo/0 drbd0
val-test2-copy: drbd_sync_handshake:

Jul  3 22:05:36 val-test1-copy kernel: [50520.245021] drbd odoo/0 drbd0
val-test2-copy: self
FAD559D0085B1D10:0000000000000000:2F6D8A307AFA432E:492AA1512F283D34 bits:0
flags:120

Jul  3 22:05:36 val-test1-copy kernel: [50520.245022] drbd odoo/0 drbd0
val-test2-copy: peer
FAD559D0085B1D11:0000000000000000:492AA1512F283D34:03FA3B8A22BE10CC bits:0
flags:1120

Jul  3 22:05:36 val-test1-copy kernel: [50520.245024] drbd odoo/0 drbd0
val-test2-copy: uuid_compare()=no-sync by rule 38

Jul  3 22:05:36 val-test1-copy kernel: [50520.248276] drbd odoo: State
change 3809390519: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC

Jul  3 22:05:36 val-test1-copy kernel: [50520.248278] drbd odoo: Committing
cluster-wide state change 3809390519 (16ms)

Jul  3 22:05:36 val-test1-copy kernel: [50520.248290] drbd odoo
val-test2-copy: conn( Connecting -> Connected ) peer( Unknown ->Primary )

Jul  3 22:05:36 val-test1-copy kernel: [50520.248291] drbd odoo/0 drbd0
val-test2-copy: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )

 

Log from slave node (fence-peer handler executed and constraint set in
pacemaker):

 

Jul  3 22:05:35 val-test2-copy kernel: [50697.130833] drbd odoo
val-test1-copy: sock was shut down by peer

Jul  3 22:05:35 val-test2-copy kernel: [50697.130846] drbd odoo
val-test1-copy: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )

Jul  3 22:05:35 val-test2-copy kernel: [50697.130848] drbd odoo/0 drbd0
val-test1-copy: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )

Jul  3 22:05:35 val-test2-copy kernel: [50697.130874] drbd odoo
val-test1-copy: ack_receiver terminated

Jul  3 22:05:35 val-test2-copy kernel: [50697.130877] drbd odoo
val-test1-copy: Terminating ack_recv thread

Jul  3 22:05:35 val-test2-copy kernel: [50697.156086] drbd odoo
val-test1-copy: Restarting sender thread

Jul  3 22:05:36 val-test2-copy kernel: [50697.867155] drbd odoo
val-test1-copy: Connection closed

Jul  3 22:05:36 val-test2-copy kernel: [50697.867218] drbd odoo
val-test1-copy: helper command: /sbin/drbdadm disconnected

Jul  3 22:05:36 val-test2-copy kernel: [50697.867221] drbd odoo
val-test1-copy: helper command: /sbin/drbdadm fence-peer

Jul  3 22:05:36 val-test2-copy kernel: [50697.867998] drbd odoo
val-test1-copy: helper command: /sbin/drbdadm disconnected exit code 0

Jul  3 22:05:36 val-test2-copy kernel: [50697.868005] drbd odoo
val-test1-copy: conn( BrokenPipe -> Unconnected )

Jul  3 22:05:36 val-test2-copy kernel: [50697.868010] drbd odoo
val-test1-copy: Restarting receiver thread

Jul  3 22:05:36 val-test2-copy kernel: [50697.868012] drbd odoo
val-test1-copy: conn( Unconnected -> Connecting )

Jul  3 22:05:36 val-test2-copy crm-fence-peer.9.sh[14362]:
DRBD_BACKING_DEV_0=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=BrokenPipe
DRBD_LL_DISK=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=192.168.2.112 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=1 DRBD_NODE_ID_0=val-test1-copy
DRBD_NODE_ID_1=val-test2-copy DRBD_PEER_ADDRESS=192.168.2.111
DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=0 DRBD_RESOURCE=odoo DRBD_VOLUME=0
UP_TO_DATE_NODES =0x00000002 /usr/lib/drbd/crm-fence-peer.9.sh

Jul  3 22:05:36 val-test2-copy crm-fence-peer.9.sh[14362]: /

Jul  3 22:05:36 val-test2-copy crm-fence-peer.9.sh[14362]: INFO peers are
reachable, my disk is UpToDate: placed constraint
'drbd-fence-by-handler-odoo-ms_drbd_odoo'

Jul  3 22:05:36 val-test2-copy kernel: [50698.484443] drbd odoo
val-test1-copy: helper command: /sbin/drbdadm fence-peer exit code 4 (0x400)

Jul  3 22:05:36 val-test2-copy kernel: [50698.484448] drbd odoo
val-test1-copy: fence-peer helper returned 4 (peer was fenced)

Jul  3 22:05:36 val-test2-copy kernel: [50698.484450] drbd odoo/0 drbd0
val-test1-copy: pdsk( DUnknown -> Outdated )

Jul  3 22:05:36 val-test2-copy kernel: [50698.539142] drbd odoo
val-test1-copy: Handshake to peer 0 successful: Agreed network protocol
version 118

Jul  3 22:05:36 val-test2-copy kernel: [50698.539144] drbd odoo
val-test1-copy: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.

Jul  3 22:05:36 val-test2-copy kernel: [50698.539297] drbd odoo
val-test1-copy: Peer authenticated using 20 bytes HMAC

Jul  3 22:05:36 val-test2-copy kernel: [50698.539305] drbd odoo
val-test1-copy: Starting ack_recv thread (from drbd_r_odoo [1980])

Jul  3 22:05:36 val-test2-copy kernel: [50698.598830] drbd odoo
val-test1-copy: Preparing remote state change 3809390519

Jul  3 22:05:36 val-test2-copy kernel: [50698.617380] drbd odoo/0 drbd0
val-test1-copy: drbd_sync_handshake:

Jul  3 22:05:36 val-test2-copy kernel: [50698.617383] drbd odoo/0 drbd0
val-test1-copy: self
FAD559D0085B1D11:0000000000000000:492AA1512F283D34:03FA3B8A22BE10CC bits:0
flags:120

Jul  3 22:05:36 val-test2-copy kernel: [50698.617384] drbd odoo/0 drbd0
val-test1-copy: peer
FAD559D0085B1D10:0000000000000000:2F6D8A307AFA432E:492AA1512F283D34 bits:0
flags:1120

Jul  3 22:05:36 val-test2-copy kernel: [50698.617386] drbd odoo/0 drbd0
val-test1-copy: uuid_compare()=no-sync by rule 38

Jul  3 22:05:36 val-test2-copy kernel: [50698.618018] drbd odoo
val-test1-copy: Committing remote state change 3809390519 (primary_nodes=2)

Jul  3 22:05:36 val-test2-copy kernel: [50698.618023] drbd odoo
val-test1-copy: conn( Connecting -> Connected ) peer( Unknown ->Secondary )

Jul  3 22:05:36 val-test2-copy kernel: [50698.618024] drbd odoo/0 drbd0
val-test1-copy: pdsk( Outdated -> UpToDate ) repl( Off -> Established )

 

 

My question is: Why it did not trigger unfence handler after connection was
recovered and replication established?

The fence constraint stays in pacemaker but 2 nodes are up to date.

 

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20210707/92359df4/attachment-0001.html>


More information about the drbd-user mailing list