[DRBD-user] "drbdadm down" takes a long time to complete

Janusz Jaskiewicz janusz.jaskiewicz at gmail.com
Mon Dec 27 22:22:41 CET 2021


Hello.

I have a 3 node drbd cluster, where 2 nodes (prod01, prod02) are managed by
pacemaker and the third node (dr01) runs raw drbd.

The replication works well, but sometimes I have an intermittent problem
with bringing the drbd down.
It happens regardless if the node is managed by a pacemaker or not.
The problem is that the drbdadm down <resource> command takes a very long
time to complete (a couple of minutes).


I tried to debug this problem on dr01 and I found out that it only happens
when drbd first attempts to disconnect prod02 (secondary) node and then
prod01 (primary).
In that case the connection transitions from Connected to NetworkFailure
after a while.
When the order is opposite: prod01 followed by prod02, the drbdadm down
command returns immediately.
In that case I can see the connection goes from Connected -> Disconnecting
and the resource is brought down immediately.

I'm pretty much stuck here and I would appreciate any help pointing out how
to resolve, or further debug this problem.

DRBD_KERNEL_VERSION=9.1.4
DRBDADM_VERSION=9.17.0

Please find below my drbd config and two snippets of journalctl logs (one
where the problem occurs and the other one where it does not).


Thanks in advance for your help.

Regards,
Janusz.

============================
DRBD config:

resource drbd-persistence {

  device    /dev/drbd1;
  disk      /dev/mapper/drbdpool-drbdata;
  meta-disk internal;

  on prod01.ny5.dc.mycompany.com {
    node-id     0;
  }
  on prod02.ny5.dc.mycompany.com {
    node-id     1;
  }
  on dr01.ch4.dc.mycompany.com {
    node-id     2;
  }

  connection {
    host prod01.ny5.dc.mycompany.com address <obfuscated>:7789;
    host prod02.ny5.dc.mycompany.com address <obfuscated>:7789;
    net {
      protocol C;
    }
  }
  connection {
    host prod01.ny5.dc.mycompany.com address <obfuscated>:7789;
    host dr01.ch4.dc.mycompany.com address <obfuscated>:7789;
    net {
      protocol A;
      sndbuf-size 4M;
    }
  }
  connection {
    host prod02.ny5.dc.mycompany.com address <obfuscated>:7789;
    host dr01.ch4.dc.mycompany.com address <obfuscated>:7789;
    net {
      protocol A;
      sndbuf-size 4M;
    }
  }
}

==============================

journalctl log when problem occurs:

Dec 23 10:30:31 dr01.ch4.dc.mycompany.com sudo[64021]:      user :
TTY=pts/0 ; PWD=/home/user ; USER=root ; COMMAND=/sbin/drbdadm down
drbd-persistence
Dec 23 10:30:31 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 1840270664 (2->0 496/16)
Dec 23 10:31:01 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 239ms
Dec 23 10:31:01 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 2640669369 (2->0 496/16)
Dec 23 10:31:32 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 175ms
Dec 23 10:31:32 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 47999794 (2->0 496/16)
Dec 23 10:32:02 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 1110ms
Dec 23 10:32:03 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 1719956804 (2->0 496/16)
Dec 23 10:32:33 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 405ms
Dec 23 10:32:34 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 4005278613 (2->0 496/16)
Dec 23 10:33:04 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 3378ms
Dec 23 10:33:07 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 1216738202 (2->0 496/16)
Dec 23 10:33:38 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 5569ms
Dec 23 10:33:43 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 3440526386 (2->0 496/16)
Dec 23 10:34:14 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 6349ms
Dec 23 10:34:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 3514914987 (2->0 496/16)
Dec 23 10:34:51 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 4259ms
Dec 23 10:34:55 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 1695181261 (2->0 496/16)
Dec 23 10:35:25 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 6294ms
Dec 23 10:35:32 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 1507920047 (2->0 496/16)
Dec 23 10:36:02 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 767ms
Dec 23 10:36:03 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 438170821 (2->0 496/16)
Dec 23 10:36:33 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 3327ms
Dec 23 10:36:37 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 864136504 (2->0 496/16)
Dec 23 10:37:07 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 5029ms
Dec 23 10:37:12 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 1182907387 (2->0 496/16)
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: meta connection shut down by peer.
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: conn( Connected -> NetworkFailure ) peer(
Secondary -> Unknown )
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1 prod02.ny5.dc.mycompany.com: pdsk( UpToDate -> DUnknown ) repl(
Established -> Off )
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: ack_receiver terminated
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Terminating ack_recv thread
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Retrying cluster-wide state change after 921ms
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Terminating sender thread
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Starting sender thread (from drbd_r_service-
[63376])
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Connection closed
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
exit code 0
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: conn( NetworkFailure -> Unconnected )
Dec 23 10:37:20 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Restarting receiver thread
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 510832083 (2->0 496/16)
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
State change 510832083: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Cluster is now split
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Committing cluster-wide state change 510832083 (16ms)
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: conn( Connected -> Disconnecting ) peer(
Primary -> Unknown )
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1: disk( UpToDate -> Outdated )
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1 prod01.ny5.dc.mycompany.com: pdsk( UpToDate -> DUnknown ) repl(
Established -> Off )
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: ack_receiver terminated
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Terminating ack_recv thread
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: conn( Unconnected -> Connecting )
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Terminating sender thread
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Starting sender thread (from drbd_r_service-
[63375])
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Connection closed
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
exit code 0
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: conn( Disconnecting -> StandAlone )
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Terminating receiver thread
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Terminating sender thread
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: conn( Connecting -> Disconnecting )
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Terminating sender thread
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Starting sender thread (from drbd_r_service-
[63376])
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Connection closed
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
exit code 0
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: conn( Disconnecting -> StandAlone )
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Terminating receiver thread
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Terminating sender thread
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1: disk( Outdated -> Detaching )
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1: disk( Detaching -> Diskless )
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1: drbd_bm_resize called with capacity == 0
Dec 23 10:37:21 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Terminating worker thread

================================
journalctl logs when problem doesn't occur:

Dec 23 10:40:36 dr01.ch4.dc.mycompany.com sudo[64116]:      user :
TTY=pts/0 ; PWD=/home/user ; USER=root ; COMMAND=/sbin/drbdadm down
drbd-persistence
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 1232950875 (2->0 496/16)
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
State change 1232950875: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Committing cluster-wide state change 1232950875 (15ms)
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: conn( Connected -> Disconnecting ) peer(
Primary -> Unknown )
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1: disk( UpToDate -> Outdated )
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1 prod01.ny5.dc.mycompany.com: pdsk( UpToDate -> DUnknown ) repl(
Established -> Off )
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: ack_receiver terminated
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Terminating ack_recv thread
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Terminating sender thread
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Starting sender thread (from drbd_r_service-
[64099])
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Connection closed
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
exit code 0
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: conn( Disconnecting -> StandAlone )
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Terminating receiver thread
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod01.ny5.dc.mycompany.com: Terminating sender thread
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Preparing cluster-wide state change 3474969236 (2->1 496/16)
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
State change 3474969236: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Cluster is now split
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Committing cluster-wide state change 3474969236 (15ms)
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: conn( Connected -> Disconnecting ) peer(
Secondary -> Unknown )
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1 prod02.ny5.dc.mycompany.com: pdsk( UpToDate -> DUnknown ) repl(
Established -> Off )
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: ack_receiver terminated
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Terminating ack_recv thread
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Terminating sender thread
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Starting sender thread (from drbd_r_service-
[64101])
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Connection closed
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: helper command: /sbin/drbdadm disconnected
exit code 0
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: conn( Disconnecting -> StandAlone )
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Terminating receiver thread
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence
prod02.ny5.dc.mycompany.com: Terminating sender thread
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1: disk( Outdated -> Detaching )
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1: disk( Detaching -> Diskless )
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence/0
drbd1: drbd_bm_resize called with capacity == 0
Dec 23 10:40:36 dr01.ch4.dc.mycompany.com kernel: drbd drbd-persistence:
Terminating worker thread
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20211227/cd821496/attachment-0001.htm>


More information about the drbd-user mailing list