[DRBD-user] Auto-promote hangs when 3rd node is gracefully taken offline

Lars Ellenberg lars.ellenberg at linbit.com
Tue Jul 30 10:47:16 CEST 2019


On Sat, Jul 27, 2019 at 02:43:31PM -0400, Digimer wrote:
> Hi all,
> 
>   I hit an odd issue. I have a 3-node DRBD 9 setup (see below).
> Basically, two nodes are protocol C and in pacemaker, the third node is
> protocol A and is outside pacemaker (it's a DR and will periodically
> connect -> sync -> disconnect, mostly living outside the cluster).
> 
>   I down'ed the "DR" resource a day or two ago. Today, I stopped a
> server on node 2 and then tried to boot it on node 1. It hung, because
> DRBD wouldn't promote to primary. I tried manually promoting it, and the
> 'drbdadm up <res>' never returned.

the "drbdadm primary" command, I guess?

That would be the corresponding log line, probably:

Jul 27 18:16:39 el8-a01n01.digimer.ca kernel: drbd test_server:
Auto-promote failed: Timeout in operation

>   Eventually, I brought the DR node back up, and after a delay with a
> flurry of log entries, it connected and resync'ed. After that,
> auto-promotion worked again and I could boot the server.

So we should figure out why the auto-promote ran into a timeout
(and why some command "never" returned for you).

What's the latency to your DR? Between the other nodes?
What's the typical run time of your fence handler,
and does it try to "talk to DRBD" while it is running?
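
If in doubt, both are easy to sanity-check by hand. A sketch; the
hostnames are taken from your logs, and the handler path is whatever
your fence-peer option actually points at on your boxes:

    # round-trip latency from node 1 to its peers
    ping -c 20 el8-a01n02.digimer.ca
    ping -c 20 el8-a01dr01.digimer.ca

    # wall-clock run time of the fence handler; note that run by hand
    # it lacks the DRBD_* environment variables DRBD would normally set
    time /usr/lib/drbd/crm-fence-peer.9.sh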

Then, why DRBD is later "busy looping" through
"preparing/aborting/..." cluster-wide state transitions;
that seems ... a waste of effort ;-)

>   Attached are the full logs from the three nodes, going back to the 24th
> (three days ago). I'm not including them inline as it adds up to 4500 lines.

compressed attachments are fine.

>   I assume I am right to assume that gracefully down'ing a resource
> should not cause problems, as it worked OK for a little bit, and failed
> after it was left alone for a day or so.


Even ungracefully disconnecting some DR node
should not affect main cluster operation.
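
For clarity, "gracefully" here meaning the usual commands, which
negotiate the disconnect with the peers (resource name assumed):

    drbdadm disconnect test_server   # drop the replication link only
    drbdadm down test_server         # disconnect, detach, and remove
                                     # the resource from the kernel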

Just saying: I'm not a fan of auto-promote, btw,
but it sure makes "live migration" easier.
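
If you would rather not rely on it, it can be switched off per
resource; a minimal sketch, resource name assumed:

    resource test_server {
        options {
            auto-promote no;
        }
        ...
    }

With that, promotion has to be explicit ("drbdadm primary
test_server"), which in a pacemaker setup the resource agent does for
you anyway.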

Some other log lines I want to highlight, even though they are from
several days before the "interesting" auto-promote timeout failure:

> Jul 24 21:47:20 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 6
> Jul 24 21:47:27 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 5
> Jul 24 21:47:33 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 4
> Jul 24 21:47:39 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 3
> Jul 24 21:47:45 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 2
> Jul 24 21:47:51 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 1

Uh? Seriously congested on the network, several times?
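
For reference, that countdown is driven by the timeout/ko-count pair
in the net section: each "sending time expired" line is one timeout
interval in which the peer did not take data off the socket, and at
ko = 0 the peer would be declared dead. A sketch with the defaults
(assumed, not taken from your config):

    resource test_server {
        net {
            timeout  60;   # in tenths of a second: one tick every 6 s
            ko-count  7;   # give up on the peer after 7 ticks in a row
        }
    }

That the countdown keeps starting over means sending did eventually
succeed each round, resetting the counter.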

The pattern repeats 18 times; finally:

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: sock was shut down by peer
> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server: susp-io( no -> fencing)

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: fence-peer helper broken, returned 9

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server: susp-io( fencing -> no)

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server/1 drbd1 el8-a01n02.digimer.ca: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server/0 drbd0 el8-a01n02.digimer.ca: Resync done (total 1 sec; paused 0 sec; 2052 K/sec)

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: helper command: /sbin/drbdadm unfence-peer
> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: helper command: /sbin/drbdadm unfence-peer exit code 0 (0x0)

> Jul 26 19:44:47 el8-a01n01.digimer.ca kernel: drbd test_server: role( Primary -> Secondary )
> Jul 26 19:45:03 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: peer( Secondary -> Primary )

> Jul 27 17:52:41 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: peer( Primary -> Secondary )


Here, auto-promote fails with a timeout after four tries,
each of which hit the timeout of 30 seconds:

> Jul 27 18:14:48 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing cluster-wide state change 3188437928 (0->-1 3/1)
> Jul 27 18:15:03 el8-a01n01.digimer.ca kernel: drbd test_server: Aborting cluster-wide state change 3188437928 (30341ms) rv = -23
> Jul 27 18:15:34 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing cluster-wide state change 216414519 (0->-1 3/1)
> Jul 27 18:15:34 el8-a01n01.digimer.ca kernel: drbd test_server: Aborting cluster-wide state change 216414519 (30344ms) rv = -23
> Jul 27 18:16:04 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing cluster-wide state change 3835310858 (0->-1 3/1)
> Jul 27 18:16:04 el8-a01n01.digimer.ca kernel: drbd test_server: Aborting cluster-wide state change 3835310858 (30032ms) rv = -23
> Jul 27 18:16:36 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing cluster-wide state change 3764308321 (0->-1 3/1)
> Jul 27 18:16:36 el8-a01n01.digimer.ca kernel: drbd test_server: Aborting cluster-wide state change 3764308321 (30464ms) rv = -23
> Jul 27 18:16:39 el8-a01n01.digimer.ca kernel: drbd test_server: Auto-promote failed: Timeout in operation
> Jul 27 18:16:39 el8-a01n01.digimer.ca libvirtd[2155]: 2019-07-27 18:16:39.328+0000: 2155: error : qemuProcessReportLogError:1923 : internal error: qemu unexpectedly closed the monitor: 2019-07-27T18:16:39.287405Z qemu-kvm: -drive file=/dev/drbd0,format=raw,if=none,id=drive-virtio-disk0,cache=directsync,aio=native: The device is not writable: Permission denied
> Jul 27 18:16:39 el8-a01n01.digimer.ca libvirtd[2155]: 2019-07-27 18:16:39.331+0000: 2156: error : qemuProcessReportLogError:1923 : internal error: process exited while connecting to monitor: 2019-07-27T18:16:39.287405Z qemu-kvm: -drive file=/dev/drbd0,format=raw,if=none,id=drive-virtio-disk0,cache=directsync,aio=native: The device is not writable: Permission denied


... again:

> Jul 27 18:21:26 el8-a01n01.digimer.ca kernel: drbd test_server: Auto-promote failed: Timeout in operation
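
(The 30 seconds per attempt is the two-phase-commit timeout; if you
ever need to, it should be tunable as twopc-timeout in the resource
options section. Sketch, showing what I believe is the default:

    resource test_server {
        options {
            twopc-timeout 300;   # tenths of a second; each "Preparing"
                                 # is aborted after ~30 s without all
                                 # reachable peers having answered
        }
    }

Not that tuning it would fix anything here; the question is why the
prepare never completed.)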



> After up'ing test_server on DR:
> 
> Jul 27 18:21:57 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing cluster-wide state change 282086595 (0->-1 3/1)
> Jul 27 18:21:57 el8-a01n01.digimer.ca kernel: drbd test_server: Aborting cluster-wide state change 282086595 (30016ms) rv = -23
> Jul 27 18:22:27 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing cluster-wide state change 181803038 (0->-1 3/1)
> Jul 27 18:22:27 el8-a01n01.digimer.ca kernel: drbd test_server: Aborting cluster-wide state change 181803038 (30176ms) rv = -23
> Jul 27 18:23:25 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01dr01.digimer.ca: Handshake to peer 2 successful: Agreed network protocol version 115
> Jul 27 18:23:25 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01dr01.digimer.ca: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
> Jul 27 18:23:25 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01dr01.digimer.ca: Starting ack_recv thread (from drbd_r_test_ser [19286])
> Jul 27 18:23:26 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing cluster-wide state change 4276833572 (0->2 499/146)
> Jul 27 18:23:26 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: Rejecting concurrent remote state change 812175563 because of state change 4276833572
> Jul 27 18:23:26 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: Ignoring P_TWOPC_ABORT packet 812175563.

1105 lines of that, rejecting and ignoring...

> Jul 27 18:23:56 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: Ignoring P_TWOPC_ABORT packet 273065352.
> Jul 27 18:23:56 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: Rejecting concurrent remote state change 2452056374 because of state change 4276833572

Then finally, after the timeout:

> Jul 27 18:23:56 el8-a01n01.digimer.ca kernel: drbd test_server: Aborting cluster-wide state change 4276833572 (30455ms) rv = -23

> Jul 27 18:23:56 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: Ignoring P_TWOPC_ABORT packet 2452056374.
> Jul 27 18:23:56 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: Preparing remote state change 3936225582 
> Jul 27 18:23:56 el8-a01n01.digimer.ca kernel: drbd test_server el8-a01n02.digimer.ca: Committing remote state change 3936225582 (primary_nodes=0)

Yay. This time, a state change went through :-)

> Jul 27 18:24:15 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing cluster-wide state change 3514756670 (0->2 499/146)
> Jul 27 18:24:15 el8-a01n01.digimer.ca kernel: drbd test_server: State change 3514756670: primary_nodes=0, weak_nodes=0
> Jul 27 18:24:15 el8-a01n01.digimer.ca kernel: drbd test_server: Committing cluster-wide state change 3514756670 (18914ms)

And another one went through as well... but it took 19 seconds!

> Wait for it to finish sync'ing
> 
> After resync of DR, node 1 auto-promoted successfully.


> Jul 27 18:29:39 el8-a01dr01.digimer.ca kernel: drbd test_server/0 drbd0: No resync, but 16 bits in bitmap!

Hm. I'm not too happy about this message in this context. But unrelated.


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

