[DRBD-user] DRBD9 automatic unfence question
Miro Igov
miro.igov at gmail.com
Wed Jul 7 16:37:20 CEST 2021
Hello,
I have 2 node cluster (pacemaker) with master/slave DRBD resource.
The resource is configured with handlers:
fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
If I shutdown the master node it gets fenced by handler and promotes the
live node, see in log:
Jul 3 07:41:45 val-test1-copy kernel: [ 308.983600] drbd odoo
val-test2-copy: peer( Primary -> Secondary )
Jul 3 07:41:46 val-test1-copy kernel: [ 309.113294] drbd odoo
val-test2-copy: Preparing remote state change 4162941012
Jul 3 07:41:46 val-test1-copy kernel: [ 309.113464] drbd odoo
val-test2-copy: Committing remote state change 4162941012 (primary_nodes=0)
Jul 3 07:41:46 val-test1-copy kernel: [ 309.113467] drbd odoo
val-test2-copy: conn( Connected -> TearDown ) peer( Secondary ->Unknown )
Jul 3 07:41:46 val-test1-copy kernel: [ 309.113468] drbd odoo/0 drbd0
val-test2-copy: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Jul 3 07:41:46 val-test1-copy kernel: [ 309.113479] drbd odoo
val-test2-copy: ack_receiver terminated
Jul 3 07:41:46 val-test1-copy kernel: [ 309.113480] drbd odoo
val-test2-copy: Terminating ack_recv thread
Jul 3 07:41:46 val-test1-copy kernel: [ 309.134608] drbd odoo
val-test2-copy: Restarting sender thread
Jul 3 07:41:46 val-test1-copy kernel: [ 309.145903] drbd odoo
val-test2-copy: Connection closed
Jul 3 07:41:46 val-test1-copy kernel: [ 309.145912] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm disconnected
Jul 3 07:41:46 val-test1-copy kernel: [ 309.146759] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm disconnected exit code 0
Jul 3 07:41:46 val-test1-copy kernel: [ 309.146765] drbd odoo
val-test2-copy: conn( TearDown -> Unconnected )
Jul 3 07:41:46 val-test1-copy kernel: [ 309.146770] drbd odoo
val-test2-copy: Restarting receiver thread
Jul 3 07:41:46 val-test1-copy kernel: [ 309.146771] drbd odoo
val-test2-copy: conn( Unconnected -> Connecting )
Jul 3 07:41:47 val-test1-copy kernel: [ 310.357823] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm fence-peer
Jul 3 07:41:27 val-test1-copy qemu-ga: message repeated 13 times: [ info:
guest-ping called]
Jul 3 07:41:47 val-test1-copy crm-fence-peer.9.sh[2907]:
DRBD_BACKING_DEV_0=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connecting
DRBD_LL_DISK=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=192.168.2.111 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=val-test1-copy
DRBD_NODE_ID_1=val-test2-copy DRBD_PEER_ADDRESS=192.168.2.112
DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=odoo DRBD_VOLUME=0
UP_TO_DATE_NODES=0x00000001 /usr/lib/drbd/crm-fence-peer.9.sh
Jul 3 07:41:47 val-test1-copy crm-fence-peer.9.sh[2907]: /
Jul 3 07:41:47 val-test1-copy crm-fence-peer.9.sh[2907]: INFO peers are
reachable, my disk is UpToDate: placed constraint
'drbd-fence-by-handler-odoo-ms_drbd_odoo'
Jul 3 07:41:47 val-test1-copy kernel: [ 310.639973] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm fence-peer exit code 4 (0x400)
Jul 3 07:41:47 val-test1-copy kernel: [ 310.639977] drbd odoo
val-test2-copy: fence-peer helper returned 4 (peer was fenced)
Jul 3 07:41:47 val-test1-copy kernel: [ 310.639980] drbd odoo/0 drbd0
val-test2-copy: pdsk( DUnknown -> Outdated )
Jul 3 07:41:47 val-test1-copy kernel: [ 310.639991] drbd odoo: Preparing
cluster-wide state change 1652257927 (0->-1 3/1)
Jul 3 07:41:47 val-test1-copy kernel: [ 310.639992] drbd odoo: Committing
cluster-wide state change 1652257927 (0ms)
Jul 3 07:41:47 val-test1-copy kernel: [ 310.639993] drbd odoo: role(
Secondary -> Primary )
Jul 3 07:41:49 val-test1-copy kernel: [ 313.035751] drbd odoo/0 drbd0: new
current UUID: E67F75EE576FC547 weak: FFFFFFFFFFFFFFFE
Jul 3 07:41:50 val-test1-copy kernel: [ 313.058994] EXT4-fs (drbd0):
mounted filesystem with ordered data mode. Opts: (null)
So far perfect.
Later i start the node and start pacemaker. The node joins as slave and
starts replication, then removes the fence constraint from pacemaker (also
perfect):
Jul 3 07:51:13 val-test1-copy kernel: [ 876.831326] drbd odoo
val-test2-copy: Handshake to peer 1 successful: Agreed network protocol
version 118
Jul 3 07:51:13 val-test1-copy kernel: [ 876.831327] drbd odoo
val-test2-copy: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 3 07:51:13 val-test1-copy kernel: [ 876.831494] drbd odoo
val-test2-copy: Peer authenticated using 20 bytes HMAC
Jul 3 07:51:13 val-test1-copy kernel: [ 876.831502] drbd odoo
val-test2-copy: Starting ack_recv thread (from drbd_r_odoo [2018])
Jul 3 07:51:13 val-test1-copy kernel: [ 876.872236] drbd odoo: Preparing
cluster-wide state change 2689784605 (0->1 499/145)
Jul 3 07:51:13 val-test1-copy kernel: [ 876.888237] drbd odoo/0 drbd0
val-test2-copy: drbd_sync_handshake:
Jul 3 07:51:13 val-test1-copy kernel: [ 876.888238] drbd odoo/0 drbd0
val-test2-copy: self
E67F75EE576FC547:FA440AF0812F8348:2F6D8A307AFA432E:492AA1512F283D34 bits:44
flags:120
Jul 3 07:51:13 val-test1-copy kernel: [ 876.888241] drbd odoo/0 drbd0
val-test2-copy: peer
FA440AF0812F8348:0000000000000000:492AA1512F283D34:03FA3B8A22BE10CC bits:0
flags:1020
Jul 3 07:51:13 val-test1-copy kernel: [ 876.888242] drbd odoo/0 drbd0
val-test2-copy: uuid_compare()=source-use-bitmap by rule70
Jul 3 07:51:15 val-test1-copy kernel: [ 878.341061] drbd odoo: State
change 2689784605: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Jul 3 07:51:15 val-test1-copy kernel: [ 878.341063] drbd odoo: Committing
cluster-wide state change 2689784605 (1468ms)
Jul 3 07:51:15 val-test1-copy kernel: [ 878.341077] drbd odoo
val-test2-copy: conn( Connecting -> Connected ) peer( Unknown ->Secondary )
Jul 3 07:51:15 val-test1-copy kernel: [ 878.341078] drbd odoo/0 drbd0
val-test2-copy: pdsk( Outdated -> Consistent ) repl( Off-> WFBitMapS )
Jul 3 07:51:15 val-test1-copy kernel: [ 878.341331] drbd odoo/0 drbd0
val-test2-copy: pdsk( Consistent -> Outdated )
Jul 3 07:51:15 val-test1-copy kernel: [ 878.348293] drbd odoo/0 drbd0
val-test2-copy: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 47(1),
total 47; compression: 100.0%
Jul 3 07:51:15 val-test1-copy kernel: [ 878.367674] drbd odoo/0 drbd0
val-test2-copy: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE
47(1), total 47; compression: 100.0%
Jul 3 07:51:15 val-test1-copy kernel: [ 878.367686] drbd odoo/0 drbd0
val-test2-copy: helper command: /sbin/drbdadm before-resync-source
Jul 3 07:51:15 val-test1-copy kernel: [ 878.368492] drbd odoo/0 drbd0
val-test2-copy: helper command: /sbin/drbdadm before-resync-source exit code
0
Jul 3 07:51:15 val-test1-copy kernel: [ 878.368499] drbd odoo/0 drbd0
val-test2-copy: pdsk( Outdated -> Inconsistent ) repl( WFBitMapS ->
SyncSource )
Jul 3 07:51:15 val-test1-copy kernel: [ 878.368523] drbd odoo/0 drbd0
val-test2-copy: Began resync as SyncSource (will sync 176 KB [44 bits set]).
Jul 3 07:51:15 val-test1-copy kernel: [ 878.497928] drbd odoo/0 drbd0
val-test2-copy: updated UUIDs
E67F75EE576FC547:0000000000000000:2F6D8A307AFA432E:492AA1512F283D34
Jul 3 07:51:15 val-test1-copy kernel: [ 878.515002] drbd odoo/0 drbd0
val-test2-copy: Resync done (total 1 sec; paused 0 sec; 176 K/sec)
Jul 3 07:51:15 val-test1-copy kernel: [ 878.515006] drbd odoo/0 drbd0
val-test2-copy: pdsk( Inconsistent -> UpToDate ) repl( SyncSource ->
Established )
Jul 3 07:51:15 val-test1-copy kernel: [ 878.533682] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm unfence-peer
Jul 3 07:51:15 val-test1-copy crm-unfence-peer.9.sh[11907]:
DRBD_BACKING_DEV_0=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connected
DRBD_LL_DISK=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=192.168.2.111 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=val-test1-copy
DRBD_NODE_ID_1=val-test2-copy DRBD_PEER_ADDRESS=192.168.2.112
DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=odoo DRBD_VOLUME=0
UP_TO_DATE_NODES=0x00000003 /usr/lib/drbd/crm-unfence-peer.9.sh
Jul 3 07:51:15 val-test1-copy crm-unfence-peer.9.sh[11907]: INFO Removed
constraint 'drbd-fence-by-handler-odoo-ms_drbd_odoo'
Jul 3 07:51:15 val-test1-copy kernel: [ 878.582275] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm unfence-peer exit code 0
Now the problem - i got message "PingAck did not arrive in time" in kernel i
guess a network issue. The slave node was fenced as it should be.
Log from master node:
Jul 3 22:05:35 val-test1-copy kernel: [50518.748265] drbd odoo
val-test2-copy: PingAck did not arrive in time.
Jul 3 22:05:35 val-test1-copy kernel: [50518.748319] drbd odoo
val-test2-copy: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown
)
Jul 3 22:05:35 val-test1-copy kernel: [50518.748320] drbd odoo/0 drbd0:
disk( UpToDate -> Consistent )
Jul 3 22:05:35 val-test1-copy kernel: [50518.748322] drbd odoo/0 drbd0
val-test2-copy: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Jul 3 22:05:35 val-test1-copy kernel: [50518.748339] drbd odoo
val-test2-copy: ack_receiver terminated
Jul 3 22:05:35 val-test1-copy kernel: [50518.748342] drbd odoo
val-test2-copy: Terminating ack_recv thread
Jul 3 22:05:35 val-test1-copy kernel: [50518.785064] drbd odoo
val-test2-copy: Restarting sender thread
Jul 3 22:05:36 val-test1-copy kernel: [50519.533938] drbd odoo: Preparing
cluster-wide state change 1241314843 (0->-1 0/0)
Jul 3 22:05:36 val-test1-copy kernel: [50519.533961] drbd odoo: Committing
cluster-wide state change 1241314843 (0ms)
Jul 3 22:05:36 val-test1-copy kernel: [50519.533964] drbd odoo/0 drbd0:
disk( Consistent -> UpToDate )
Jul 3 22:05:36 val-test1-copy kernel: [50519.611815] drbd odoo
val-test2-copy: Connection closed
Jul 3 22:05:36 val-test1-copy kernel: [50519.611823] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm disconnected
Jul 3 22:05:36 val-test1-copy kernel: [50519.612622] drbd odoo
val-test2-copy: helper command: /sbin/drbdadm disconnected exit code 0
Jul 3 22:05:36 val-test1-copy kernel: [50519.612627] drbd odoo
val-test2-copy: conn( NetworkFailure -> Unconnected )
Jul 3 22:05:36 val-test1-copy kernel: [50519.612631] drbd odoo
val-test2-copy: Restarting receiver thread
Jul 3 22:05:36 val-test1-copy kernel: [50519.612633] drbd odoo
val-test2-copy: conn( Unconnected -> Connecting )
Jul 3 22:05:36 val-test1-copy kernel: [50520.169410] drbd odoo
val-test2-copy: Handshake to peer 1 successful: Agreed network protocol
version 118
Jul 3 22:05:36 val-test1-copy kernel: [50520.169411] drbd odoo
val-test2-copy: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 3 22:05:36 val-test1-copy kernel: [50520.169582] drbd odoo
val-test2-copy: Peer authenticated using 20 bytes HMAC
Jul 3 22:05:36 val-test1-copy kernel: [50520.169588] drbd odoo
val-test2-copy: Starting ack_recv thread (from drbd_r_odoo [2037])
Jul 3 22:05:36 val-test1-copy kernel: [50520.229035] drbd odoo: Preparing
cluster-wide state change 3809390519 (0->1 499/146)
Jul 3 22:05:36 val-test1-copy kernel: [50520.245020] drbd odoo/0 drbd0
val-test2-copy: drbd_sync_handshake:
Jul 3 22:05:36 val-test1-copy kernel: [50520.245021] drbd odoo/0 drbd0
val-test2-copy: self
FAD559D0085B1D10:0000000000000000:2F6D8A307AFA432E:492AA1512F283D34 bits:0
flags:120
Jul 3 22:05:36 val-test1-copy kernel: [50520.245022] drbd odoo/0 drbd0
val-test2-copy: peer
FAD559D0085B1D11:0000000000000000:492AA1512F283D34:03FA3B8A22BE10CC bits:0
flags:1120
Jul 3 22:05:36 val-test1-copy kernel: [50520.245024] drbd odoo/0 drbd0
val-test2-copy: uuid_compare()=no-sync by rule 38
Jul 3 22:05:36 val-test1-copy kernel: [50520.248276] drbd odoo: State
change 3809390519: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
Jul 3 22:05:36 val-test1-copy kernel: [50520.248278] drbd odoo: Committing
cluster-wide state change 3809390519 (16ms)
Jul 3 22:05:36 val-test1-copy kernel: [50520.248290] drbd odoo
val-test2-copy: conn( Connecting -> Connected ) peer( Unknown ->Primary )
Jul 3 22:05:36 val-test1-copy kernel: [50520.248291] drbd odoo/0 drbd0
val-test2-copy: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
Log from slave node (fence-peer handler executed and constraint set in
pacemaker):
Jul 3 22:05:35 val-test2-copy kernel: [50697.130833] drbd odoo
val-test1-copy: sock was shut down by peer
Jul 3 22:05:35 val-test2-copy kernel: [50697.130846] drbd odoo
val-test1-copy: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )
Jul 3 22:05:35 val-test2-copy kernel: [50697.130848] drbd odoo/0 drbd0
val-test1-copy: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Jul 3 22:05:35 val-test2-copy kernel: [50697.130874] drbd odoo
val-test1-copy: ack_receiver terminated
Jul 3 22:05:35 val-test2-copy kernel: [50697.130877] drbd odoo
val-test1-copy: Terminating ack_recv thread
Jul 3 22:05:35 val-test2-copy kernel: [50697.156086] drbd odoo
val-test1-copy: Restarting sender thread
Jul 3 22:05:36 val-test2-copy kernel: [50697.867155] drbd odoo
val-test1-copy: Connection closed
Jul 3 22:05:36 val-test2-copy kernel: [50697.867218] drbd odoo
val-test1-copy: helper command: /sbin/drbdadm disconnected
Jul 3 22:05:36 val-test2-copy kernel: [50697.867221] drbd odoo
val-test1-copy: helper command: /sbin/drbdadm fence-peer
Jul 3 22:05:36 val-test2-copy kernel: [50697.867998] drbd odoo
val-test1-copy: helper command: /sbin/drbdadm disconnected exit code 0
Jul 3 22:05:36 val-test2-copy kernel: [50697.868005] drbd odoo
val-test1-copy: conn( BrokenPipe -> Unconnected )
Jul 3 22:05:36 val-test2-copy kernel: [50697.868010] drbd odoo
val-test1-copy: Restarting receiver thread
Jul 3 22:05:36 val-test2-copy kernel: [50697.868012] drbd odoo
val-test1-copy: conn( Unconnected -> Connecting )
Jul 3 22:05:36 val-test2-copy crm-fence-peer.9.sh[14362]:
DRBD_BACKING_DEV_0=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=BrokenPipe
DRBD_LL_DISK=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1
DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=192.168.2.112 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=1 DRBD_NODE_ID_0=val-test1-copy
DRBD_NODE_ID_1=val-test2-copy DRBD_PEER_ADDRESS=192.168.2.111
DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=0 DRBD_RESOURCE=odoo DRBD_VOLUME=0
UP_TO_DATE_NODES =0x00000002 /usr/lib/drbd/crm-fence-peer.9.sh
Jul 3 22:05:36 val-test2-copy crm-fence-peer.9.sh[14362]: /
Jul 3 22:05:36 val-test2-copy crm-fence-peer.9.sh[14362]: INFO peers are
reachable, my disk is UpToDate: placed constraint
'drbd-fence-by-handler-odoo-ms_drbd_odoo'
Jul 3 22:05:36 val-test2-copy kernel: [50698.484443] drbd odoo
val-test1-copy: helper command: /sbin/drbdadm fence-peer exit code 4 (0x400)
Jul 3 22:05:36 val-test2-copy kernel: [50698.484448] drbd odoo
val-test1-copy: fence-peer helper returned 4 (peer was fenced)
Jul 3 22:05:36 val-test2-copy kernel: [50698.484450] drbd odoo/0 drbd0
val-test1-copy: pdsk( DUnknown -> Outdated )
Jul 3 22:05:36 val-test2-copy kernel: [50698.539142] drbd odoo
val-test1-copy: Handshake to peer 0 successful: Agreed network protocol
version 118
Jul 3 22:05:36 val-test2-copy kernel: [50698.539144] drbd odoo
val-test1-copy: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 3 22:05:36 val-test2-copy kernel: [50698.539297] drbd odoo
val-test1-copy: Peer authenticated using 20 bytes HMAC
Jul 3 22:05:36 val-test2-copy kernel: [50698.539305] drbd odoo
val-test1-copy: Starting ack_recv thread (from drbd_r_odoo [1980])
Jul 3 22:05:36 val-test2-copy kernel: [50698.598830] drbd odoo
val-test1-copy: Preparing remote state change 3809390519
Jul 3 22:05:36 val-test2-copy kernel: [50698.617380] drbd odoo/0 drbd0
val-test1-copy: drbd_sync_handshake:
Jul 3 22:05:36 val-test2-copy kernel: [50698.617383] drbd odoo/0 drbd0
val-test1-copy: self
FAD559D0085B1D11:0000000000000000:492AA1512F283D34:03FA3B8A22BE10CC bits:0
flags:120
Jul 3 22:05:36 val-test2-copy kernel: [50698.617384] drbd odoo/0 drbd0
val-test1-copy: peer
FAD559D0085B1D10:0000000000000000:2F6D8A307AFA432E:492AA1512F283D34 bits:0
flags:1120
Jul 3 22:05:36 val-test2-copy kernel: [50698.617386] drbd odoo/0 drbd0
val-test1-copy: uuid_compare()=no-sync by rule 38
Jul 3 22:05:36 val-test2-copy kernel: [50698.618018] drbd odoo
val-test1-copy: Committing remote state change 3809390519 (primary_nodes=2)
Jul 3 22:05:36 val-test2-copy kernel: [50698.618023] drbd odoo
val-test1-copy: conn( Connecting -> Connected ) peer( Unknown ->Secondary )
Jul 3 22:05:36 val-test2-copy kernel: [50698.618024] drbd odoo/0 drbd0
val-test1-copy: pdsk( Outdated -> UpToDate ) repl( Off -> Established )
My question is: Why it did not trigger unfence handler after connection was
recovered and replication established?
The fence constraint stays in pacemaker but 2 nodes are up to date.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20210707/92359df4/attachment-0001.html>
More information about the drbd-user
mailing list