[PATCH] drbd: Fix kernel hangs issue

zhengbing.huang zhengbing.huang at easystack.cn
Mon May 12 07:44:49 CEST 2025


Kernel hangs:
[433091.237135] BUG: scheduling while atomic: swapper/27/0/0x00000100
[433091.392756] CPU: 27 PID: 0 Comm: swapper/27 Kdump: loaded Tainted: G
OE    --------- -  - 4.18.0-372.19.1.es8_10.aarch64 #1
[433091.406962] Hardware name: WUZHOU S627K2/BC82AMDQ, BIOS 6.70
04/03/2024
[433091.414645] Call trace:
[433091.418157]  dump_backtrace+0x0/0x158
[433091.422867]  show_stack+0x24/0x30
[433091.427215]  dump_stack+0x5c/0x74
[433091.431582]  __schedule_bug+0x74/0x88
[433091.436260]  __schedule+0x794/0x860
[433091.440760]  schedule+0x48/0xc8
[433091.444888]  schedule_timeout+0x1ac/0x2c0
[433091.449903]  __dtr_disconnect_path+0x28c/0x520 [drbd_transport_rdma]
[433091.457229]  dtr_disconnect_path.part.3+0x20/0x78 [drbd_transport_rdma]
[433091.464879]  dtr_remove_path+0x24/0x30 [drbd_transport_rdma]
[433091.471553]  drbd_destroy_path+0x34/0x60 [drbd]
[433091.477144]  drbd_reclaim_path+0x44/0x50 [drbd]
[433091.482683]  rcu_do_batch+0x178/0x438
[433091.487319]  rcu_core+0x1d4/0x2f8
[433091.491635]  rcu_core_si+0x14/0x20
[433091.496021]  __do_softirq+0x118/0x320
[433091.500656]  irq_exit_rcu+0x10c/0x120
[433091.505243]  irq_exit+0x14/0x20
[433091.509293]  __handle_domain_irq+0x70/0xc0
[433091.514266]  gic_handle_irq+0xd4/0x178
[433091.518891]  el1_irq+0xb8/0x140
[433091.522877]  arch_cpu_idle+0x20/0x28
[433091.527295]  default_idle_call+0x54/0x158
[433091.532154]  do_idle+0x208/0x278
[433091.536230]  cpu_startup_entry+0x28/0x30
[433091.540977]  secondary_start_kernel+0x150/0x160

(gdb) l *__dtr_disconnect_path+0x28c
0xa3c is in __dtr_disconnect_path
(/usr/src/debug/kmod-drbd-kmodtool-9.2.9-10.es8_19.k372.mlnx.23.04.aarch64/obj/default-OFED/drbd/drbd_transport_rdma.c:2772).
2771            case PCS_FINISHING:
2772                    t = wait_event_timeout(path->cs.wq,
2773				atomic_read(&path->cs.passive_state) == PCS_INACTIVE,
2774                                           HZ * 60);

So, the problem is that schedule is called in the softirq.

We should remove ops.remove_path() from call_rcu() to avoid this problem.

Signed-off-by: zhengbing.huang <zhengbing.huang at easystack.cn>
---
 drbd/drbd_main.c | 5 +++--
 drbd/drbd_nl.c   | 2 ++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index f058736a1..d3912e019 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -3078,6 +3078,9 @@ static void drbd_remove_all_paths(struct drbd_connection *connection)
 	smp_wmb();
 
 	list_for_each_entry_safe(path, tmp, &transport->paths, list) {
+
+		transport->class->ops.remove_path(path);
+
 		/* Exclusive with reading state, in particular remember_state_change() */
 		write_lock_irq(&resource->state_rwlock);
 		list_del_rcu(&path->list);
@@ -3963,8 +3966,6 @@ void drbd_destroy_path(struct kref *kref)
 	struct drbd_connection *connection =
 		container_of(path->transport, struct drbd_connection, transport);
 
-	connection->transport.class->ops.remove_path(path);
-
 	kref_debug_put(&connection->kref_debug, 17);
 	kref_put(&connection->kref, drbd_destroy_connection);
 	kfree(path);
diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
index 3a7991e74..bb5d531bf 100644
--- a/drbd/drbd_nl.c
+++ b/drbd/drbd_nl.c
@@ -4810,6 +4810,8 @@ adm_del_path(struct drbd_config_context *adm_ctx,  struct genl_info *info)
 		/* Ensure flag visible before list manipulation. */
 		smp_wmb();
 
+		transport->class->ops.remove_path(path);
+
 		/* Exclusive with reading state, in particular remember_state_change() */
 		write_lock_irq(&resource->state_rwlock);
 		list_del_rcu(&path->list);
-- 
2.43.0



More information about the drbd-dev mailing list