[DRBD-user] drbd 0.7.20 lockup

Bradley Baetz bradley.baetz at optusnet.com.au
Fri Jul 14 08:24:56 CEST 2006

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

I've run into a problem with DRBD where it gets stuck.

We have two machines, doing the DRBD replication via a crossover cable on eth1.
Both machines started off running drbd 0.7.20, heartbeat 1.2.3, and the RHEL4
kernel 2.6.9-34.0.1.ELsmp. The kernel is patched to disable NAPI on the e1000
driver.

Both are single CPU boxes, but with hyperthreading enabled. DRBD has
ext3 running on top of it, being used for a postgres database.

Plan was to upgrade it to 2.6.9-34.0.2.ELsmp. I rebuilt DRBD for the new
kernel, and rebooted the standby box. That came up fine:

Jul 14 10:52:13 dbtools02 kernel: drbd0: drbd0_receiver [1864]: cstate WFConnection --> WFReportParams
Jul 14 10:52:13 dbtools02 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jul 14 10:52:13 dbtools02 kernel: drbd0: Connection established.
Jul 14 10:52:13 dbtools02 kernel: drbd0: I am(S): 1:00000004:00000001:0000001c:00000002:00
Jul 14 10:52:13 dbtools02 kernel: drbd0: Peer(P): 1:00000004:00000001:0000001d:00000002:10
Jul 14 10:52:13 dbtools02 kernel: drbd0: drbd0_receiver [1864]: cstate WFReportParams --> WFBitMapT
Jul 14 10:52:13 dbtools02 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Jul 14 10:52:13 dbtools02 rc: Starting drbd:  succeeded
Jul 14 10:52:13 dbtools02 kernel: drbd0: drbd0_receiver [1864]: cstate WFBitMapT --> SyncTarget
Jul 14 10:52:13 dbtools02 kernel: drbd0: Resync started as SyncTarget (need to sync 316 KB [79 bits set]).
Jul 14 10:52:13 dbtools02 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 316 K/sec)
Jul 14 10:52:13 dbtools02 kernel: drbd0: drbd0_worker [1851]: cstate SyncTarget --> Connected

I then went to do hb_standby on the current primary. This started to
shutdown postgresql, but then stopped - postgres and kjournald were stuck
in D state. I tried to do a reboot, but that just caused reboot and
shutdown processes to end up in D state too. sysreq-T (via /proc/sysrq-trigger)
has:

Jul 14 10:59:01 dbtools01 kernel: kjournald     D F7E17980  1812  9519      1          9704  9185 (L-TLB)
Jul 14 10:59:01 dbtools01 kernel: ee3b6d90 00000046 00000038 f7e17980 c1bb7080 c1bb8930 00000019 c1816de0
Jul 14 10:59:01 dbtools01 kernel:        c1bb8930 00000000 c180f740 c180ede0 00000000 0000091c c0ff668b 000f22e3
Jul 14 10:59:01 dbtools01 kernel:        c1bb8930 f5a6cbb0 f5a6cd1c 00000002 c1bb7080 c180ede0 c0442080 f724a86c
Jul 14 10:59:01 dbtools01 kernel: Call Trace:
Jul 14 10:59:01 dbtools01 kernel:  [<c02d0d52>] io_schedule+0x26/0x30
Jul 14 10:59:01 dbtools01 kernel:  [<c015b4dc>] __wait_on_buffer+0x6c/0x83
Jul 14 10:59:01 dbtools01 kernel:  [<c015b384>] bh_wake_function+0x0/0x29
Jul 14 10:59:01 dbtools01 kernel:  [<c015b384>] bh_wake_function+0x0/0x29
Jul 14 10:59:01 dbtools01 kernel:  [<f88706a0>] journal_commit_transaction+0x485/0xfc1 [jbd]
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<f8872e8d>] kjournald+0xc7/0x219 [jbd]
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c011d549>] schedule_tail+0x31/0xa7
Jul 14 10:59:01 dbtools01 kernel:  [<f8872dc0>] commit_timeout+0x0/0x5 [jbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8872dc6>] kjournald+0x0/0x219 [jbd]
Jul 14 10:59:01 dbtools01 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Jul 14 10:59:01 dbtools01 kernel: drbd0_receive S F13C1D80  2476  9185      1          9519 26155 (L-TLB)
Jul 14 10:59:01 dbtools01 kernel: eaef0d8c 00000046 f13c1d80 f13c1d80 f7e110b0 c027a3e5 cf112800 00000000
Jul 14 10:59:01 dbtools01 kernel:        f7e14090 f7e14090 f7e14090 c1816de0 00000001 000004b3 2153565c 000f22dc
Jul 14 10:59:01 dbtools01 kernel:        f7e110b0 f61e8630 f61e879c 00000286 00000001 f59e6500 7fffffff 00000000
Jul 14 10:59:01 dbtools01 kernel: Call Trace:
Jul 14 10:59:01 dbtools01 kernel:  [<c027a3e5>] pskb_expand_head+0xce/0x114
Jul 14 10:59:01 dbtools01 kernel:  [<c02d0ddf>] schedule_timeout+0x53/0xee
Jul 14 10:59:01 dbtools01 kernel:  [<c02a23e5>] tcp_clean_rtx_queue+0x1e2/0x35f
Jul 14 10:59:01 dbtools01 kernel:  [<c027995b>] release_sock+0xf/0x4f
Jul 14 10:59:01 dbtools01 kernel:  [<c0279490>] sk_wait_data+0x88/0xbd
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c02acb8a>] tcp_v4_do_rcv+0x1b/0xe1
Jul 14 10:59:01 dbtools01 kernel:  [<c029df09>] tcp_recvmsg+0x379/0x681
Jul 14 10:59:01 dbtools01 kernel:  [<c0279a88>] sock_common_recvmsg+0x30/0x46
Jul 14 10:59:01 dbtools01 kernel:  [<c0276754>] sock_recvmsg+0xef/0x10c
Jul 14 10:59:01 dbtools01 kernel:  [<c01436d0>] buffered_rmqueue+0x17d/0x1a5
Jul 14 10:59:01 dbtools01 kernel:  [<f885e89c>] __split_bio+0xfd/0x103 [dm_mod]
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c022431c>] generic_make_request+0x18e/0x19e
Jul 14 10:59:01 dbtools01 kernel:  [<f8aed2af>] drbd_recv+0x8f/0x14f [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8aed888>] drbd_recv_header+0x12/0xc7 [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8aeebe5>] receive_DataRequest+0x0/0x6a0 [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8af0843>] drbdd+0x14/0x10d [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8af124d>] drbdd_init+0x92/0x15d [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8af5f5b>] drbd_thread_setup+0x9e/0xcf [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8af5ebd>] drbd_thread_setup+0x0/0xcf [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Jul 14 10:59:01 dbtools01 kernel: drbd0_worker  S F4338FC0  3080 23649      1         24405 21557 (L-TLB)
Jul 14 10:59:01 dbtools01 kernel: f4338f7c 00000046 00000008 f4338fc0 00000000 00000000 00000000 f4338f28
Jul 14 10:59:01 dbtools01 kernel:        00000001 00000000 00000000 c1816de0 00000001 000046ab e49bcc57 000f2307
Jul 14 10:59:01 dbtools01 kernel:        f7e110b0 f622c6b0 f622c81c 00000000 00000001 c1bb71f4 c1bb71ec 00000246
Jul 14 10:59:01 dbtools01 kernel: Call Trace:
Jul 14 10:59:01 dbtools01 kernel:  [<c02cfcc7>] __down_interruptible+0xbc/0x115
Jul 14 10:59:01 dbtools01 kernel:  [<c011e71b>] default_wake_function+0x0/0xc
Jul 14 10:59:01 dbtools01 kernel:  [<c02cfd33>] __down_failed_interruptible+0x7/0xc
Jul 14 10:59:01 dbtools01 kernel:  [<f8aec455>] .text.lock.drbd_worker+0x37/0x3e [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8af5f5b>] drbd_thread_setup+0x9e/0xcf [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8af5ebd>] drbd_thread_setup+0x0/0xcf [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Jul 14 10:59:01 dbtools01 kernel: drbd0_asender S C02A6134  2760 24405      1          9720 23649 (L-TLB)
Jul 14 10:59:01 dbtools01 kernel: f50bad94 00000046 0000ffff c02a6134 f7e110b0 0000ffff c02a6134 00002710
Jul 14 10:59:01 dbtools01 kernel:        f620f8e0 f59e6b98 f620f8c0 c1816de0 00000001 000007ae 0435cc3b 000f2339
Jul 14 10:59:01 dbtools01 kernel:        f7e110b0 f61a4b30 f61a4c9c 00000000 00000001 fe9bda24 fe9bda24 00000000
Jul 14 10:59:01 dbtools01 kernel: Call Trace:
Jul 14 10:59:01 dbtools01 kernel:  [<c02a6134>] tcp_cwnd_restart+0x14/0xb7
Jul 14 10:59:01 dbtools01 kernel:  [<c02a6134>] tcp_cwnd_restart+0x14/0xb7
Jul 14 10:59:01 dbtools01 kernel:  [<c02d0e5f>] schedule_timeout+0xd3/0xee
Jul 14 10:59:01 dbtools01 kernel:  [<c012a46e>] process_timeout+0x0/0x5
Jul 14 10:59:01 dbtools01 kernel:  [<c027995b>] release_sock+0xf/0x4f
Jul 14 10:59:01 dbtools01 kernel:  [<c0279490>] sk_wait_data+0x88/0xbd
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c02d0ca2>] __cond_resched+0x14/0x39
Jul 14 10:59:01 dbtools01 kernel:  [<c029df09>] tcp_recvmsg+0x379/0x681
Jul 14 10:59:01 dbtools01 kernel:  [<c0279a88>] sock_common_recvmsg+0x30/0x46
Jul 14 10:59:01 dbtools01 kernel:  [<c0276754>] sock_recvmsg+0xef/0x10c
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c011e7a1>] __wake_up+0x29/0x3c
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<f8af63a9>] _drbd_send_cmd+0xae/0x103 [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8aed215>] drbd_recv_short+0x8f/0x9a [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8af1a68>] drbd_asender+0xf6/0x3b2 [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8af5f5b>] drbd_thread_setup+0x9e/0xcf [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f8af5ebd>] drbd_thread_setup+0x0/0xcf [drbd]
Jul 14 10:59:01 dbtools01 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb

with postgres doing:

Jul 14 10:59:01 dbtools01 kernel: postmaster    S 000000D0  1992  9704      1 24412    9179  9519 (NOTLB)
Jul 14 10:59:01 dbtools01 kernel: ea8cbec4 00000082 c15c1400 000000d0 00000000 c0328300 ea8cbf58 000000d0
Jul 14 10:59:01 dbtools01 kernel:        00000000 c01437cd 00000000 c1816de0 00000001 0000383f 2638a722 000f232f
Jul 14 10:59:01 dbtools01 kernel:        f7e110b0 c1bb8eb0 c1bb901c 00000000 00000001 fe9bf77b fe9bf77b 00000000
Jul 14 10:59:01 dbtools01 kernel: Call Trace:
Jul 14 10:59:01 dbtools01 kernel:  [<c01437cd>] __alloc_pages+0xd5/0x2f7
Jul 14 10:59:01 dbtools01 kernel:  [<c02d0e5f>] schedule_timeout+0xd3/0xee
Jul 14 10:59:01 dbtools01 kernel:  [<c012a46e>] process_timeout+0x0/0x5
Jul 14 10:59:01 dbtools01 kernel:  [<c016abd8>] do_select+0x291/0x2c6
Jul 14 10:59:01 dbtools01 kernel:  [<c016a80d>] __pollwait+0x0/0x95
Jul 14 10:59:01 dbtools01 kernel:  [<c016af00>] sys_select+0x2e0/0x43a
Jul 14 10:59:01 dbtools01 kernel:  [<c012caf4>] sigprocmask+0xb0/0xca
Jul 14 10:59:01 dbtools01 kernel:  [<c02d268f>] syscall_call+0x7/0xb
Jul 14 10:59:01 dbtools01 kernel: postmaster    D 00000001  2612  9720      1         24430 24405 (NOTLB)
Jul 14 10:59:01 dbtools01 kernel: f67aede4 00000082 ee2203f0 00000001 f67aedb8 00000246 00000000 c1bbbc00
Jul 14 10:59:01 dbtools01 kernel:        c1bbbc00 f007520c ef5e03c8 c180ede0 00000000 00006d5d 7544c5c6 000f22f7
Jul 14 10:59:01 dbtools01 kernel:        c0320a80 f62882b0 f628841c f7e17980 00000002 c180ede0 c0442080 f025a660
Jul 14 10:59:01 dbtools01 kernel: Call Trace:
Jul 14 10:59:01 dbtools01 kernel:  [<c02d0d52>] io_schedule+0x26/0x30
Jul 14 10:59:01 dbtools01 kernel:  [<c015b446>] __lock_buffer+0x6c/0x88
Jul 14 10:59:01 dbtools01 kernel:  [<c015b384>] bh_wake_function+0x0/0x29
Jul 14 10:59:01 dbtools01 kernel:  [<c015b384>] bh_wake_function+0x0/0x29
Jul 14 10:59:01 dbtools01 kernel:  [<f886fd20>] journal_invalidatepage+0x86/0x101 [jbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f88f0486>] ext3_invalidatepage+0x0/0x2d [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<c01487e6>] do_invalidatepage+0x16/0x18
Jul 14 10:59:01 dbtools01 kernel:  [<c0148800>] truncate_complete_page+0x18/0x3a
Jul 14 10:59:01 dbtools01 kernel:  [<c014890b>] truncate_inode_pages+0x7a/0x22d
Jul 14 10:59:01 dbtools01 kernel:  [<f88f56da>] ext3_rename+0x467/0x473 [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<c016e4c0>] dput+0x34/0x1a7
Jul 14 10:59:01 dbtools01 kernel:  [<c01689c2>] vfs_rename_other+0x11b/0x125
Jul 14 10:59:01 dbtools01 kernel:  [<c0170b9a>] generic_delete_inode+0x50/0x104
Jul 14 10:59:01 dbtools01 kernel:  [<c0170dc8>] iput+0x5f/0x61
Jul 14 10:59:01 dbtools01 kernel:  [<c016e607>] dput+0x17b/0x1a7
Jul 14 10:59:01 dbtools01 kernel:  [<c0168e5b>] sys_rename+0x157/0x1e0
Jul 14 10:59:01 dbtools01 kernel:  [<c014fd45>] unmap_vma_list+0xe/0x17
Jul 14 10:59:01 dbtools01 kernel:  [<c015007f>] do_munmap+0x129/0x137
Jul 14 10:59:01 dbtools01 kernel:  [<c02d268f>] syscall_call+0x7/0xb
Jul 14 10:59:01 dbtools01 kernel:  [<c02d007b>] schedule+0x2f7/0x8d3
Jul 14 10:59:01 dbtools01 kernel: postmaster    D 00000000  2664 24412   9704         24942       (NOTLB)
Jul 14 10:59:01 dbtools01 kernel: f4fa6cb4 00000082 f5a87680 00000000 fffcf858 fffcf858 b7f0b000 c014d82e
Jul 14 10:59:01 dbtools01 kernel:        f56d4df8 00000001 b7f0b000 c180ede0 00000000 002e629c f6037c51 000f22e6
Jul 14 10:59:01 dbtools01 kernel:        c0320a80 f5985130 f598529c f7e17980 00000002 c180ede0 c0441e80 ef5e02cc
Jul 14 10:59:01 dbtools01 kernel: Call Trace:
Jul 14 10:59:01 dbtools01 kernel:  [<c014d82e>] do_no_page+0x6f/0x2f9
Jul 14 10:59:01 dbtools01 kernel:  [<c02d0d52>] io_schedule+0x26/0x30
Jul 14 10:59:01 dbtools01 kernel:  [<c015b446>] __lock_buffer+0x6c/0x88
Jul 14 10:59:01 dbtools01 kernel:  [<c015b384>] bh_wake_function+0x0/0x29
Jul 14 10:59:01 dbtools01 kernel:  [<c015b384>] bh_wake_function+0x0/0x29
Jul 14 10:59:01 dbtools01 kernel:  [<f886e837>] do_get_write_access+0x72/0x532 [jbd]
Jul 14 10:59:01 dbtools01 kernel:  [<c02d0ca2>] __cond_resched+0x14/0x39
Jul 14 10:59:01 dbtools01 kernel:  [<c015c77d>] __getblk+0x2b/0x49
Jul 14 10:59:01 dbtools01 kernel:  [<f88f13e2>] ext3_get_inode_loc+0x4f/0x226 [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<f886ed13>] journal_get_write_access+0x1c/0x2c [jbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f88f1ef2>] ext3_reserve_inode_write+0x34/0x81 [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<f88f1f6a>] ext3_mark_inode_dirty+0x2b/0x41 [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<f886e47c>] journal_start+0x78/0x9e [jbd]
Jul 14 10:59:01 dbtools01 kernel:  [<f88f1fd3>] ext3_dirty_inode+0x53/0x66 [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<c0176380>] __mark_inode_dirty+0x28/0x176
Jul 14 10:59:01 dbtools01 kernel:  [<c0126674>] current_fs_time+0x44/0x4c
Jul 14 10:59:01 dbtools01 kernel:  [<c0170e61>] update_atime+0x6e/0x93
Jul 14 10:59:01 dbtools01 kernel:  [<c014066a>] do_generic_mapping_read+0x43d/0x445
Jul 14 10:59:01 dbtools01 kernel:  [<c01408da>] __generic_file_aio_read+0x19f/0x1bd
Jul 14 10:59:01 dbtools01 kernel:  [<c0140672>] file_read_actor+0x0/0xc9
Jul 14 10:59:01 dbtools01 kernel:  [<c0140938>] generic_file_aio_read+0x40/0x47
Jul 14 10:59:01 dbtools01 kernel:  [<c015a335>] do_sync_read+0x97/0xc9
Jul 14 10:59:01 dbtools01 kernel:  [<c014e935>] __vma_link+0x59/0x66
Jul 14 10:59:01 dbtools01 kernel:  [<c014e986>] vma_link+0x44/0xbc
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c015a41d>] vfs_read+0xb6/0xe2
Jul 14 10:59:01 dbtools01 kernel:  [<c015a630>] sys_read+0x3c/0x62
Jul 14 10:59:01 dbtools01 kernel:  [<c02d268f>] syscall_call+0x7/0xb
Jul 14 10:59:01 dbtools01 kernel: postmaster    D F88F1C03  3088 24942   9704               24412 (NOTLB)
Jul 14 10:59:01 dbtools01 kernel: f156ed30 00000082 0000003b f88f1c03 f886ed1c f61a4030 00000019 c1816de0
Jul 14 10:59:01 dbtools01 kernel:        f61a4030 00000000 c180f740 c180ede0 00000000 00011d5f 42090eb4 000f2313
Jul 14 10:59:01 dbtools01 kernel:        f61a4030 f5a6c0b0 f5a6c21c 00000002 c1bb7080 c180ede0 c180b880 f6353ea8
Jul 14 10:59:01 dbtools01 kernel: Call Trace:
Jul 14 10:59:01 dbtools01 kernel:  [<f88f1c03>] ext3_do_update_inode+0x2f7/0x31e [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<f886ed1c>] journal_get_write_access+0x25/0x2c [jbd]
Jul 14 10:59:01 dbtools01 kernel:  [<c02d0d52>] io_schedule+0x26/0x30
Jul 14 10:59:01 dbtools01 kernel:  [<c013ff69>] __lock_page+0xac/0xbf
Jul 14 10:59:01 dbtools01 kernel:  [<c013fcf9>] page_wake_function+0x0/0x36
Jul 14 10:59:01 dbtools01 kernel:  [<c013fcf9>] page_wake_function+0x0/0x36
Jul 14 10:59:01 dbtools01 kernel:  [<c0140065>] find_lock_page+0x6d/0x9d
Jul 14 10:59:01 dbtools01 kernel:  [<c01417de>] generic_file_buffered_write+0xe5/0x47c
Jul 14 10:59:01 dbtools01 kernel:  [<f88f1c03>] ext3_do_update_inode+0x2f7/0x31e [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<f88f576a>] __ext3_journal_stop+0x19/0x34 [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<c017643a>] __mark_inode_dirty+0xe2/0x176
Jul 14 10:59:01 dbtools01 kernel:  [<c0126674>] current_fs_time+0x44/0x4c
Jul 14 10:59:01 dbtools01 kernel:  [<c0141efe>] __generic_file_aio_write_nolock+0x389/0x3b7
Jul 14 10:59:01 dbtools01 kernel:  [<c0141f65>] generic_file_aio_write_nolock+0x39/0x7f
Jul 14 10:59:01 dbtools01 kernel:  [<c014214f>] generic_file_aio_write+0x72/0xc6
Jul 14 10:59:01 dbtools01 kernel:  [<f88edd7e>] ext3_file_write+0x19/0x8b [ext3]
Jul 14 10:59:01 dbtools01 kernel:  [<c015a4e0>] do_sync_write+0x97/0xc9
Jul 14 10:59:01 dbtools01 kernel:  [<c016a808>] poll_freewait+0x33/0x38
Jul 14 10:59:01 dbtools01 kernel:  [<c0164adc>] pipe_readv+0x292/0x29e
Jul 14 10:59:01 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 10:59:01 dbtools01 kernel:  [<c015a5c8>] vfs_write+0xb6/0xe2
Jul 14 10:59:01 dbtools01 kernel:  [<c015a692>] sys_write+0x3c/0x62
Jul 14 10:59:01 dbtools01 kernel:  [<c02d268f>] syscall_call+0x7/0xb

There weren't any messages from DRBD on either box, and it was in
Secondary/Primary state on the secondary.

I then did a reboot -f. When the primary came back up, it started to
resync (heartbeat wasn't running on the secondary, so there was no
takeover attempt while it was rebooting):

Jul 14 11:04:33 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFConnection --> WFReportParams
Jul 14 11:04:33 dbtools01 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jul 14 11:04:33 dbtools01 kernel: drbd0: Connection established.
Jul 14 11:04:33 dbtools01 kernel: drbd0: I am(S): 1:00000004:00000001:0000001d:00000002:10
Jul 14 11:04:33 dbtools01 kernel: drbd0: Peer(S): 1:00000004:00000001:0000001d:00000002:01
Jul 14 11:04:33 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFReportParams --> WFBitMapS
Jul 14 11:04:33 dbtools01 drbd: WARN: stdin/stdout is not a TTY; using /dev/console
Jul 14 11:04:33 dbtools01 rc: Starting drbd:  succeeded
Jul 14 11:04:33 dbtools01 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Jul 14 11:04:33 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFBitMapS --> SyncSource
Jul 14 11:04:33 dbtools01 kernel: drbd0: Resync started as SyncSource (need to sync 802816 KB [200704 bits set]).
Jul 14 11:05:06 dbtools01 kernel: drbd0: Secondary/Secondary --> Primary/Secondary

but then showed the transfer as 'stalled' on both ends:

version: 0.7.20 (api:79/proto:74)
SVN Revision: 2260 build by root at build03.syd.optusnet.com.au, 2006-07-14 10:39:06
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:98444 nr:0 dw:6644 dr:115025 al:6 bm:214 lo:0 pe:0 ua:0 ap:0
        [>...................] sync'ed:  0.6% (710976/710976)K
        stalled

Jul 14 11:05:13 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967295
Jul 14 11:05:16 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967294
Jul 14 11:05:19 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967293
Jul 14 11:05:22 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967292
Jul 14 11:05:25 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967291
Jul 14 11:05:28 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967290
....
Jul 14 11:09:16 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967215
Jul 14 11:09:19 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967214
Jul 14 11:09:22 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967213
Jul 14 11:09:25 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967212
Jul 14 11:09:28 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967211

postgres wasn't starting on the primary yet, either. sysrq-T shows the only D state thing being pdflush:

Jul 14 11:09:34 dbtools01 kernel: pdflush       D F7D8C680  2752    47      6            49    46 (L-TLB)
Jul 14 11:09:34 dbtools01 kernel: f7cf2c58 00000046 f6337d80 f7d8c680 f6337d80 c022a224 f7d8c680 f7ce0b68
Jul 14 11:09:34 dbtools01 kernel:        f6337d80 c022a79d f633522c c1816de0 00000001 000336a4 94d89130 00000024
Jul 14 11:09:34 dbtools01 kernel:        f7e110b0 f7cc4130 f7cc429c c1bd965c 00000001 c1bd9284 00000246 c1bd928c
Jul 14 11:09:34 dbtools01 kernel: Call Trace:
Jul 14 11:09:34 dbtools01 kernel:  [<c022a224>] cfq_add_crq_rb+0x3b/0x4d
Jul 14 11:09:34 dbtools01 kernel:  [<c022a79d>] cfq_enqueue+0x33/0x64
Jul 14 11:09:34 dbtools01 kernel:  [<c02cfbb1>] __down+0x81/0xdb
Jul 14 11:09:34 dbtools01 kernel:  [<c011e71b>] default_wake_function+0x0/0xc
Jul 14 11:09:34 dbtools01 kernel:  [<c02cfd28>] __down_failed+0x8/0xc
Jul 14 11:09:34 dbtools01 kernel:  [<f8af934c>] .text.lock.drbd_main+0x41/0x18c [drbd]
Jul 14 11:09:34 dbtools01 kernel:  [<c015e990>] bio_clone+0x84/0x9c
Jul 14 11:09:34 dbtools01 kernel:  [<f8af263c>] drbd_make_request_common+0x499/0x744 [drbd]
Jul 14 11:09:34 dbtools01 kernel:  [<f885e89c>] __split_bio+0xfd/0x103 [dm_mod]
Jul 14 11:09:34 dbtools01 kernel:  [<f8af2aa7>] drbd_make_request_26+0x1c0/0x1c9 [drbd]
Jul 14 11:09:34 dbtools01 kernel:  [<c022431c>] generic_make_request+0x18e/0x19e
Jul 14 11:09:34 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 11:09:34 dbtools01 kernel:  [<c02243f6>] submit_bio+0xca/0xd2
Jul 14 11:09:34 dbtools01 kernel:  [<c015e7c9>] bio_alloc+0x100/0x168
Jul 14 11:09:34 dbtools01 kernel:  [<c015e180>] submit_bh+0x141/0x166
Jul 14 11:09:34 dbtools01 kernel:  [<c015cc59>] __block_write_full_page+0x1f0/0x2ea
Jul 14 11:09:34 dbtools01 kernel:  [<c0160822>] blkdev_get_block+0x0/0x46
Jul 14 11:09:34 dbtools01 kernel:  [<c015dfc8>] block_write_full_page+0xc5/0xce<3>drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967209
Jul 14 11:09:34 dbtools01 kernel:
Jul 14 11:09:34 dbtools01 kernel:  [<c0160822>] blkdev_get_block+0x0/0x46
Jul 14 11:09:34 dbtools01 kernel:  [<c0177bfa>] mpage_writepages+0x1c2/0x314
Jul 14 11:09:34 dbtools01 kernel:  [<c0160915>] blkdev_writepage+0x0/0xc
Jul 14 11:09:34 dbtools01 kernel:  [<c011dcf7>] find_busiest_group+0xdd/0x2ba
Jul 14 11:09:34 dbtools01 kernel:  [<c0144a4d>] do_writepages+0x19/0x27
Jul 14 11:09:34 dbtools01 kernel:  [<c0176564>] __sync_single_inode+0x5f/0x1c1
Jul 14 11:09:34 dbtools01 kernel:  [<c01768f8>] sync_sb_inodes+0x1a7/0x274
Jul 14 11:09:34 dbtools01 kernel:  [<c0145160>] pdflush+0x0/0x1e
Jul 14 11:09:34 dbtools01 kernel:  [<c0176a56>] writeback_inodes+0x91/0xde
Jul 14 11:09:34 dbtools01 kernel:  [<c01448e3>] wb_kupdate+0x7b/0xde
Jul 14 11:09:34 dbtools01 kernel:  [<c01450cc>] __pdflush+0xec/0x180
Jul 14 11:09:34 dbtools01 kernel:  [<c014517a>] pdflush+0x1a/0x1e
Jul 14 11:09:34 dbtools01 kernel:  [<c0144868>] wb_kupdate+0x0/0xde
Jul 14 11:09:35 dbtools01 kernel:  [<c0145160>] pdflush+0x0/0x1e
Jul 14 11:09:35 dbtools01 kernel:  [<c0133ead>] kthread+0x73/0x9b
Jul 14 11:09:35 dbtools01 kernel:  [<c0133e3a>] kthread+0x0/0x9b
Jul 14 11:09:35 dbtools01 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb

I stopped/started drbd on the secondary, and the primary logged:

Jul 14 11:10:02 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967202
Jul 14 11:10:02 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967201
Jul 14 11:10:02 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967200
Jul 14 11:10:02 dbtools01 kernel: drbd0: drbd0_asender [1985]: cstate SyncSource --> NetworkFailure
Jul 14 11:10:02 dbtools01 kernel: drbd0: asender terminated
Jul 14 11:10:02 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate NetworkFailure --> BrokenPipe
Jul 14 11:10:02 dbtools01 kernel: drbd0: short read expecting header on sock: r=-512
Jul 14 11:10:02 dbtools01 kernel: drbd0: worker terminated
Jul 14 11:10:02 dbtools01 kernel: drbd0: _drbd_send_page: size=4096 len=2064 sent=-4
Jul 14 11:10:02 dbtools01 kernel: drbd0: ASSERT( mdev->ee_in_use == 0 ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_receiver.c:1904
Jul 14 11:10:02 dbtools01 kernel: kjournald starting.  Commit interval 5 seconds
Jul 14 11:10:02 dbtools01 kernel: EXT3 FS on drbd0, internal journal
Jul 14 11:10:02 dbtools01 kernel: EXT3-fs: recovery complete.
Jul 14 11:10:02 dbtools01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 14 11:10:02 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate BrokenPipe --> Unconnected
Jul 14 11:10:02 dbtools01 kernel: drbd0: Connection lost.
Jul 14 11:10:02 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate Unconnected --> WFConnection
Jul 14 11:10:03 dbtools01 postgresql: Starting postgresql service:  succeeded
Jul 14 11:10:21 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFConnection --> WFReportParams
Jul 14 11:10:21 dbtools01 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jul 14 11:10:21 dbtools01 kernel: drbd0: Connection established.
Jul 14 11:10:21 dbtools01 kernel: drbd0: I am(P): 1:00000004:00000001:0000001f:00000002:10
Jul 14 11:10:21 dbtools01 kernel: drbd0: Peer(S): 0:00000004:00000001:0000001d:00000002:00
Jul 14 11:10:21 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFReportParams --> WFBitMapS
Jul 14 11:10:21 dbtools01 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Jul 14 11:10:21 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFBitMapS --> SyncSource
Jul 14 11:10:21 dbtools01 kernel: drbd0: Resync started as SyncSource (need to sync 719188 KB [179797 bits set]).

But it still got stuck as 'stalled'. I left it for an hour or so, did another stop/start, and that didn't help.
I then stopped it on the secondary, and shut down drbd on the primary too, for a cold start:

Jul 14 12:37:10 dbtools01 kernel: drbd0: drbdsetup [3816]: cstate WFConnection --> Unconnected
Jul 14 12:37:10 dbtools01 kernel: drbd0: worker terminated
Jul 14 12:37:10 dbtools01 kernel: drbd0: ASSERT( mdev->ee_in_use == 0 ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_receiver.c:1904
Jul 14 12:37:10 dbtools01 kernel: drbd0: drbd0_receiver [3390]: cstate Unconnected --> StandAlone
Jul 14 12:37:10 dbtools01 kernel: drbd0: Connection lost.
Jul 14 12:37:10 dbtools01 kernel: drbd0: Discarding network configuration.
Jul 14 12:37:10 dbtools01 kernel: drbd0: drbd0_receiver [3390]: cstate StandAlone --> StandAlone
Jul 14 12:37:10 dbtools01 kernel: drbd0: receiver terminated
Jul 14 12:37:10 dbtools01 kernel: drbd0: drbdsetup [3816]: cstate StandAlone --> StandAlone
Jul 14 12:37:10 dbtools01 kernel: drbd0: drbdsetup [3816]: cstate StandAlone --> Unconfigured
Jul 14 12:37:10 dbtools01 kernel: drbd0: worker terminated
Jul 14 12:37:10 dbtools01 kernel: drbd0: ee_in_use:2048 ee_vacant:0 epoch_size:0
Jul 14 12:37:10 dbtools01 kernel: drbd0: ASSERT( list_empty(&mdev->net_ee) ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_main.c:1501
Jul 14 12:37:10 dbtools01 kernel: drbd0: ASSERT( mdev->ee_in_use==0 ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_main.c:1617
Jul 14 12:37:10 dbtools01 kernel: drbd0: 2048 EEs in net list found!
Jul 14 12:37:10 dbtools01 kernel: drbd0: ASSERT( mdev->ee_vacant == 0 ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_main.c:1644
Jul 14 12:37:10 dbtools01 kernel: drbd: module cleanup done.

and then started DRBD again on the primary:

Jul 14 12:37:14 dbtools01 kernel: drbd: initialised. Version: 0.7.20 (api:79/proto:74)
Jul 14 12:37:14 dbtools01 kernel: drbd: SVN Revision: 2260 build by root at build03.syd.optusnet.com.au, 2006-07-14 10:39:06
Jul 14 12:37:14 dbtools01 kernel: drbd: registered as block device major 147
Jul 14 12:37:20 dbtools01 kernel: drbd0: resync bitmap: bits=5210112 words=162816
Jul 14 12:37:20 dbtools01 kernel: drbd0: size = 19 GB (20840448 KB)
Jul 14 12:37:20 dbtools01 kernel: klogd 1.4.1, ---------- state change ----------
Jul 14 12:37:20 dbtools01 kernel: drbd0: 694 MB marked out-of-sync by on disk bit-map.
Jul 14 12:37:20 dbtools01 kernel: drbd0: Found 6 transactions (266 active extents) in activity log.
Jul 14 12:37:20 dbtools01 kernel: drbd0: drbdsetup [3866]: cstate Unconfigured --> StandAlone
Jul 14 12:37:20 dbtools01 kernel: drbd0: drbdsetup [3879]: cstate StandAlone --> Unconnected
Jul 14 12:37:20 dbtools01 kernel: drbd0: drbd0_receiver [3880]: cstate Unconnected --> WFConnection

and then brought up heartbeat on the primary, and drbd on the
secondary, and it all synced fine:

Jul 14 12:39:13 dbtools01 kernel: drbd0: drbd0_receiver [3880]: cstate WFReportParams --> WFBitMapS
Jul 14 12:39:13 dbtools01 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Jul 14 12:39:13 dbtools01 kernel: drbd0: drbd0_receiver [3880]: cstate WFBitMapS --> SyncSource
Jul 14 12:39:13 dbtools01 kernel: drbd0: Resync started as SyncSource (need to sync 711360 KB [177840 bits set]).
Jul 14 12:39:36 dbtools01 kernel: drbd0: Resync done (total 23 sec; paused 0 sec; 30928 K/sec)
Jul 14 12:39:36 dbtools01 kernel: drbd0: drbd0_worker [3867]: cstate SyncSource --> Connected

I've seen this before a few times when doing upgrades/reboots, on
multiple different DRBD pairs. However, that was running older versions of
DRBD, and from the release notes for the last few versions I thought that
this was the situation mentioned there, and so would be fixed. However,
I can't get a reproducable test case. Does the above give enough information
to work out what the problem is? This is our test cluster, so I'm happy to
try patches/etc. I'm also going to see if I can reproduce this.

Thanks,


Bradley



More information about the drbd-user mailing list