Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, I've run into a problem with DRBD where it gets stuck. We have two machines, doing the DRBD replication via a crossover cable on eth1. Both machines started off running drbd 0.7.20, heartbeat 1.2.3, and the RHEL4 kernel 2.6.9-34.0.1.ELsmp. The kernel is patched to disable NAPI on the e1000 driver. Both are single CPU boxes, but with hyperthreading enabled. DRBD has ext3 running on top of it, being used for a postgres database. Plan was to upgrade it to 2.6.9-34.0.2.ELsmp. I rebuilt DRBD for the new kernel, and rebooted the standby box. That came up fine: Jul 14 10:52:13 dbtools02 kernel: drbd0: drbd0_receiver [1864]: cstate WFConnection --> WFReportParams Jul 14 10:52:13 dbtools02 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Jul 14 10:52:13 dbtools02 kernel: drbd0: Connection established. Jul 14 10:52:13 dbtools02 kernel: drbd0: I am(S): 1:00000004:00000001:0000001c:00000002:00 Jul 14 10:52:13 dbtools02 kernel: drbd0: Peer(P): 1:00000004:00000001:0000001d:00000002:10 Jul 14 10:52:13 dbtools02 kernel: drbd0: drbd0_receiver [1864]: cstate WFReportParams --> WFBitMapT Jul 14 10:52:13 dbtools02 kernel: drbd0: Secondary/Unknown --> Secondary/Primary Jul 14 10:52:13 dbtools02 rc: Starting drbd: succeeded Jul 14 10:52:13 dbtools02 kernel: drbd0: drbd0_receiver [1864]: cstate WFBitMapT --> SyncTarget Jul 14 10:52:13 dbtools02 kernel: drbd0: Resync started as SyncTarget (need to sync 316 KB [79 bits set]). Jul 14 10:52:13 dbtools02 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 316 K/sec) Jul 14 10:52:13 dbtools02 kernel: drbd0: drbd0_worker [1851]: cstate SyncTarget --> Connected I then went to do hb_standby on the current primary. This started to shutdown postgresql, but then stopped - postgres and kjournald were stuck in D state. I tried to do a reboot, but that just caused reboot and shutdown processes to end up in D state too. sysreq-T (via /proc/sysrq-trigger) has: Jul 14 10:59:01 dbtools01 kernel: kjournald D F7E17980 1812 9519 1 9704 9185 (L-TLB) Jul 14 10:59:01 dbtools01 kernel: ee3b6d90 00000046 00000038 f7e17980 c1bb7080 c1bb8930 00000019 c1816de0 Jul 14 10:59:01 dbtools01 kernel: c1bb8930 00000000 c180f740 c180ede0 00000000 0000091c c0ff668b 000f22e3 Jul 14 10:59:01 dbtools01 kernel: c1bb8930 f5a6cbb0 f5a6cd1c 00000002 c1bb7080 c180ede0 c0442080 f724a86c Jul 14 10:59:01 dbtools01 kernel: Call Trace: Jul 14 10:59:01 dbtools01 kernel: [<c02d0d52>] io_schedule+0x26/0x30 Jul 14 10:59:01 dbtools01 kernel: [<c015b4dc>] __wait_on_buffer+0x6c/0x83 Jul 14 10:59:01 dbtools01 kernel: [<c015b384>] bh_wake_function+0x0/0x29 Jul 14 10:59:01 dbtools01 kernel: [<c015b384>] bh_wake_function+0x0/0x29 Jul 14 10:59:01 dbtools01 kernel: [<f88706a0>] journal_commit_transaction+0x485/0xfc1 [jbd] Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<f8872e8d>] kjournald+0xc7/0x219 [jbd] Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c011d549>] schedule_tail+0x31/0xa7 Jul 14 10:59:01 dbtools01 kernel: [<f8872dc0>] commit_timeout+0x0/0x5 [jbd] Jul 14 10:59:01 dbtools01 kernel: [<f8872dc6>] kjournald+0x0/0x219 [jbd] Jul 14 10:59:01 dbtools01 kernel: [<c01041f5>] kernel_thread_helper+0x5/0xb Jul 14 10:59:01 dbtools01 kernel: drbd0_receive S F13C1D80 2476 9185 1 9519 26155 (L-TLB) Jul 14 10:59:01 dbtools01 kernel: eaef0d8c 00000046 f13c1d80 f13c1d80 f7e110b0 c027a3e5 cf112800 00000000 Jul 14 10:59:01 dbtools01 kernel: f7e14090 f7e14090 f7e14090 c1816de0 00000001 000004b3 2153565c 000f22dc Jul 14 10:59:01 dbtools01 kernel: f7e110b0 f61e8630 f61e879c 00000286 00000001 f59e6500 7fffffff 00000000 Jul 14 10:59:01 dbtools01 kernel: Call Trace: Jul 14 10:59:01 dbtools01 kernel: [<c027a3e5>] pskb_expand_head+0xce/0x114 Jul 14 10:59:01 dbtools01 kernel: [<c02d0ddf>] schedule_timeout+0x53/0xee Jul 14 10:59:01 dbtools01 kernel: [<c02a23e5>] tcp_clean_rtx_queue+0x1e2/0x35f Jul 14 10:59:01 dbtools01 kernel: [<c027995b>] release_sock+0xf/0x4f Jul 14 10:59:01 dbtools01 kernel: [<c0279490>] sk_wait_data+0x88/0xbd Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c02acb8a>] tcp_v4_do_rcv+0x1b/0xe1 Jul 14 10:59:01 dbtools01 kernel: [<c029df09>] tcp_recvmsg+0x379/0x681 Jul 14 10:59:01 dbtools01 kernel: [<c0279a88>] sock_common_recvmsg+0x30/0x46 Jul 14 10:59:01 dbtools01 kernel: [<c0276754>] sock_recvmsg+0xef/0x10c Jul 14 10:59:01 dbtools01 kernel: [<c01436d0>] buffered_rmqueue+0x17d/0x1a5 Jul 14 10:59:01 dbtools01 kernel: [<f885e89c>] __split_bio+0xfd/0x103 [dm_mod] Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c022431c>] generic_make_request+0x18e/0x19e Jul 14 10:59:01 dbtools01 kernel: [<f8aed2af>] drbd_recv+0x8f/0x14f [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8aed888>] drbd_recv_header+0x12/0xc7 [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8aeebe5>] receive_DataRequest+0x0/0x6a0 [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8af0843>] drbdd+0x14/0x10d [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8af124d>] drbdd_init+0x92/0x15d [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8af5f5b>] drbd_thread_setup+0x9e/0xcf [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8af5ebd>] drbd_thread_setup+0x0/0xcf [drbd] Jul 14 10:59:01 dbtools01 kernel: [<c01041f5>] kernel_thread_helper+0x5/0xb Jul 14 10:59:01 dbtools01 kernel: drbd0_worker S F4338FC0 3080 23649 1 24405 21557 (L-TLB) Jul 14 10:59:01 dbtools01 kernel: f4338f7c 00000046 00000008 f4338fc0 00000000 00000000 00000000 f4338f28 Jul 14 10:59:01 dbtools01 kernel: 00000001 00000000 00000000 c1816de0 00000001 000046ab e49bcc57 000f2307 Jul 14 10:59:01 dbtools01 kernel: f7e110b0 f622c6b0 f622c81c 00000000 00000001 c1bb71f4 c1bb71ec 00000246 Jul 14 10:59:01 dbtools01 kernel: Call Trace: Jul 14 10:59:01 dbtools01 kernel: [<c02cfcc7>] __down_interruptible+0xbc/0x115 Jul 14 10:59:01 dbtools01 kernel: [<c011e71b>] default_wake_function+0x0/0xc Jul 14 10:59:01 dbtools01 kernel: [<c02cfd33>] __down_failed_interruptible+0x7/0xc Jul 14 10:59:01 dbtools01 kernel: [<f8aec455>] .text.lock.drbd_worker+0x37/0x3e [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8af5f5b>] drbd_thread_setup+0x9e/0xcf [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8af5ebd>] drbd_thread_setup+0x0/0xcf [drbd] Jul 14 10:59:01 dbtools01 kernel: [<c01041f5>] kernel_thread_helper+0x5/0xb Jul 14 10:59:01 dbtools01 kernel: drbd0_asender S C02A6134 2760 24405 1 9720 23649 (L-TLB) Jul 14 10:59:01 dbtools01 kernel: f50bad94 00000046 0000ffff c02a6134 f7e110b0 0000ffff c02a6134 00002710 Jul 14 10:59:01 dbtools01 kernel: f620f8e0 f59e6b98 f620f8c0 c1816de0 00000001 000007ae 0435cc3b 000f2339 Jul 14 10:59:01 dbtools01 kernel: f7e110b0 f61a4b30 f61a4c9c 00000000 00000001 fe9bda24 fe9bda24 00000000 Jul 14 10:59:01 dbtools01 kernel: Call Trace: Jul 14 10:59:01 dbtools01 kernel: [<c02a6134>] tcp_cwnd_restart+0x14/0xb7 Jul 14 10:59:01 dbtools01 kernel: [<c02a6134>] tcp_cwnd_restart+0x14/0xb7 Jul 14 10:59:01 dbtools01 kernel: [<c02d0e5f>] schedule_timeout+0xd3/0xee Jul 14 10:59:01 dbtools01 kernel: [<c012a46e>] process_timeout+0x0/0x5 Jul 14 10:59:01 dbtools01 kernel: [<c027995b>] release_sock+0xf/0x4f Jul 14 10:59:01 dbtools01 kernel: [<c0279490>] sk_wait_data+0x88/0xbd Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c02d0ca2>] __cond_resched+0x14/0x39 Jul 14 10:59:01 dbtools01 kernel: [<c029df09>] tcp_recvmsg+0x379/0x681 Jul 14 10:59:01 dbtools01 kernel: [<c0279a88>] sock_common_recvmsg+0x30/0x46 Jul 14 10:59:01 dbtools01 kernel: [<c0276754>] sock_recvmsg+0xef/0x10c Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c011e7a1>] __wake_up+0x29/0x3c Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<f8af63a9>] _drbd_send_cmd+0xae/0x103 [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8aed215>] drbd_recv_short+0x8f/0x9a [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8af1a68>] drbd_asender+0xf6/0x3b2 [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8af5f5b>] drbd_thread_setup+0x9e/0xcf [drbd] Jul 14 10:59:01 dbtools01 kernel: [<f8af5ebd>] drbd_thread_setup+0x0/0xcf [drbd] Jul 14 10:59:01 dbtools01 kernel: [<c01041f5>] kernel_thread_helper+0x5/0xb with postgres doing: Jul 14 10:59:01 dbtools01 kernel: postmaster S 000000D0 1992 9704 1 24412 9179 9519 (NOTLB) Jul 14 10:59:01 dbtools01 kernel: ea8cbec4 00000082 c15c1400 000000d0 00000000 c0328300 ea8cbf58 000000d0 Jul 14 10:59:01 dbtools01 kernel: 00000000 c01437cd 00000000 c1816de0 00000001 0000383f 2638a722 000f232f Jul 14 10:59:01 dbtools01 kernel: f7e110b0 c1bb8eb0 c1bb901c 00000000 00000001 fe9bf77b fe9bf77b 00000000 Jul 14 10:59:01 dbtools01 kernel: Call Trace: Jul 14 10:59:01 dbtools01 kernel: [<c01437cd>] __alloc_pages+0xd5/0x2f7 Jul 14 10:59:01 dbtools01 kernel: [<c02d0e5f>] schedule_timeout+0xd3/0xee Jul 14 10:59:01 dbtools01 kernel: [<c012a46e>] process_timeout+0x0/0x5 Jul 14 10:59:01 dbtools01 kernel: [<c016abd8>] do_select+0x291/0x2c6 Jul 14 10:59:01 dbtools01 kernel: [<c016a80d>] __pollwait+0x0/0x95 Jul 14 10:59:01 dbtools01 kernel: [<c016af00>] sys_select+0x2e0/0x43a Jul 14 10:59:01 dbtools01 kernel: [<c012caf4>] sigprocmask+0xb0/0xca Jul 14 10:59:01 dbtools01 kernel: [<c02d268f>] syscall_call+0x7/0xb Jul 14 10:59:01 dbtools01 kernel: postmaster D 00000001 2612 9720 1 24430 24405 (NOTLB) Jul 14 10:59:01 dbtools01 kernel: f67aede4 00000082 ee2203f0 00000001 f67aedb8 00000246 00000000 c1bbbc00 Jul 14 10:59:01 dbtools01 kernel: c1bbbc00 f007520c ef5e03c8 c180ede0 00000000 00006d5d 7544c5c6 000f22f7 Jul 14 10:59:01 dbtools01 kernel: c0320a80 f62882b0 f628841c f7e17980 00000002 c180ede0 c0442080 f025a660 Jul 14 10:59:01 dbtools01 kernel: Call Trace: Jul 14 10:59:01 dbtools01 kernel: [<c02d0d52>] io_schedule+0x26/0x30 Jul 14 10:59:01 dbtools01 kernel: [<c015b446>] __lock_buffer+0x6c/0x88 Jul 14 10:59:01 dbtools01 kernel: [<c015b384>] bh_wake_function+0x0/0x29 Jul 14 10:59:01 dbtools01 kernel: [<c015b384>] bh_wake_function+0x0/0x29 Jul 14 10:59:01 dbtools01 kernel: [<f886fd20>] journal_invalidatepage+0x86/0x101 [jbd] Jul 14 10:59:01 dbtools01 kernel: [<f88f0486>] ext3_invalidatepage+0x0/0x2d [ext3] Jul 14 10:59:01 dbtools01 kernel: [<c01487e6>] do_invalidatepage+0x16/0x18 Jul 14 10:59:01 dbtools01 kernel: [<c0148800>] truncate_complete_page+0x18/0x3a Jul 14 10:59:01 dbtools01 kernel: [<c014890b>] truncate_inode_pages+0x7a/0x22d Jul 14 10:59:01 dbtools01 kernel: [<f88f56da>] ext3_rename+0x467/0x473 [ext3] Jul 14 10:59:01 dbtools01 kernel: [<c016e4c0>] dput+0x34/0x1a7 Jul 14 10:59:01 dbtools01 kernel: [<c01689c2>] vfs_rename_other+0x11b/0x125 Jul 14 10:59:01 dbtools01 kernel: [<c0170b9a>] generic_delete_inode+0x50/0x104 Jul 14 10:59:01 dbtools01 kernel: [<c0170dc8>] iput+0x5f/0x61 Jul 14 10:59:01 dbtools01 kernel: [<c016e607>] dput+0x17b/0x1a7 Jul 14 10:59:01 dbtools01 kernel: [<c0168e5b>] sys_rename+0x157/0x1e0 Jul 14 10:59:01 dbtools01 kernel: [<c014fd45>] unmap_vma_list+0xe/0x17 Jul 14 10:59:01 dbtools01 kernel: [<c015007f>] do_munmap+0x129/0x137 Jul 14 10:59:01 dbtools01 kernel: [<c02d268f>] syscall_call+0x7/0xb Jul 14 10:59:01 dbtools01 kernel: [<c02d007b>] schedule+0x2f7/0x8d3 Jul 14 10:59:01 dbtools01 kernel: postmaster D 00000000 2664 24412 9704 24942 (NOTLB) Jul 14 10:59:01 dbtools01 kernel: f4fa6cb4 00000082 f5a87680 00000000 fffcf858 fffcf858 b7f0b000 c014d82e Jul 14 10:59:01 dbtools01 kernel: f56d4df8 00000001 b7f0b000 c180ede0 00000000 002e629c f6037c51 000f22e6 Jul 14 10:59:01 dbtools01 kernel: c0320a80 f5985130 f598529c f7e17980 00000002 c180ede0 c0441e80 ef5e02cc Jul 14 10:59:01 dbtools01 kernel: Call Trace: Jul 14 10:59:01 dbtools01 kernel: [<c014d82e>] do_no_page+0x6f/0x2f9 Jul 14 10:59:01 dbtools01 kernel: [<c02d0d52>] io_schedule+0x26/0x30 Jul 14 10:59:01 dbtools01 kernel: [<c015b446>] __lock_buffer+0x6c/0x88 Jul 14 10:59:01 dbtools01 kernel: [<c015b384>] bh_wake_function+0x0/0x29 Jul 14 10:59:01 dbtools01 kernel: [<c015b384>] bh_wake_function+0x0/0x29 Jul 14 10:59:01 dbtools01 kernel: [<f886e837>] do_get_write_access+0x72/0x532 [jbd] Jul 14 10:59:01 dbtools01 kernel: [<c02d0ca2>] __cond_resched+0x14/0x39 Jul 14 10:59:01 dbtools01 kernel: [<c015c77d>] __getblk+0x2b/0x49 Jul 14 10:59:01 dbtools01 kernel: [<f88f13e2>] ext3_get_inode_loc+0x4f/0x226 [ext3] Jul 14 10:59:01 dbtools01 kernel: [<f886ed13>] journal_get_write_access+0x1c/0x2c [jbd] Jul 14 10:59:01 dbtools01 kernel: [<f88f1ef2>] ext3_reserve_inode_write+0x34/0x81 [ext3] Jul 14 10:59:01 dbtools01 kernel: [<f88f1f6a>] ext3_mark_inode_dirty+0x2b/0x41 [ext3] Jul 14 10:59:01 dbtools01 kernel: [<f886e47c>] journal_start+0x78/0x9e [jbd] Jul 14 10:59:01 dbtools01 kernel: [<f88f1fd3>] ext3_dirty_inode+0x53/0x66 [ext3] Jul 14 10:59:01 dbtools01 kernel: [<c0176380>] __mark_inode_dirty+0x28/0x176 Jul 14 10:59:01 dbtools01 kernel: [<c0126674>] current_fs_time+0x44/0x4c Jul 14 10:59:01 dbtools01 kernel: [<c0170e61>] update_atime+0x6e/0x93 Jul 14 10:59:01 dbtools01 kernel: [<c014066a>] do_generic_mapping_read+0x43d/0x445 Jul 14 10:59:01 dbtools01 kernel: [<c01408da>] __generic_file_aio_read+0x19f/0x1bd Jul 14 10:59:01 dbtools01 kernel: [<c0140672>] file_read_actor+0x0/0xc9 Jul 14 10:59:01 dbtools01 kernel: [<c0140938>] generic_file_aio_read+0x40/0x47 Jul 14 10:59:01 dbtools01 kernel: [<c015a335>] do_sync_read+0x97/0xc9 Jul 14 10:59:01 dbtools01 kernel: [<c014e935>] __vma_link+0x59/0x66 Jul 14 10:59:01 dbtools01 kernel: [<c014e986>] vma_link+0x44/0xbc Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c015a41d>] vfs_read+0xb6/0xe2 Jul 14 10:59:01 dbtools01 kernel: [<c015a630>] sys_read+0x3c/0x62 Jul 14 10:59:01 dbtools01 kernel: [<c02d268f>] syscall_call+0x7/0xb Jul 14 10:59:01 dbtools01 kernel: postmaster D F88F1C03 3088 24942 9704 24412 (NOTLB) Jul 14 10:59:01 dbtools01 kernel: f156ed30 00000082 0000003b f88f1c03 f886ed1c f61a4030 00000019 c1816de0 Jul 14 10:59:01 dbtools01 kernel: f61a4030 00000000 c180f740 c180ede0 00000000 00011d5f 42090eb4 000f2313 Jul 14 10:59:01 dbtools01 kernel: f61a4030 f5a6c0b0 f5a6c21c 00000002 c1bb7080 c180ede0 c180b880 f6353ea8 Jul 14 10:59:01 dbtools01 kernel: Call Trace: Jul 14 10:59:01 dbtools01 kernel: [<f88f1c03>] ext3_do_update_inode+0x2f7/0x31e [ext3] Jul 14 10:59:01 dbtools01 kernel: [<f886ed1c>] journal_get_write_access+0x25/0x2c [jbd] Jul 14 10:59:01 dbtools01 kernel: [<c02d0d52>] io_schedule+0x26/0x30 Jul 14 10:59:01 dbtools01 kernel: [<c013ff69>] __lock_page+0xac/0xbf Jul 14 10:59:01 dbtools01 kernel: [<c013fcf9>] page_wake_function+0x0/0x36 Jul 14 10:59:01 dbtools01 kernel: [<c013fcf9>] page_wake_function+0x0/0x36 Jul 14 10:59:01 dbtools01 kernel: [<c0140065>] find_lock_page+0x6d/0x9d Jul 14 10:59:01 dbtools01 kernel: [<c01417de>] generic_file_buffered_write+0xe5/0x47c Jul 14 10:59:01 dbtools01 kernel: [<f88f1c03>] ext3_do_update_inode+0x2f7/0x31e [ext3] Jul 14 10:59:01 dbtools01 kernel: [<f88f576a>] __ext3_journal_stop+0x19/0x34 [ext3] Jul 14 10:59:01 dbtools01 kernel: [<c017643a>] __mark_inode_dirty+0xe2/0x176 Jul 14 10:59:01 dbtools01 kernel: [<c0126674>] current_fs_time+0x44/0x4c Jul 14 10:59:01 dbtools01 kernel: [<c0141efe>] __generic_file_aio_write_nolock+0x389/0x3b7 Jul 14 10:59:01 dbtools01 kernel: [<c0141f65>] generic_file_aio_write_nolock+0x39/0x7f Jul 14 10:59:01 dbtools01 kernel: [<c014214f>] generic_file_aio_write+0x72/0xc6 Jul 14 10:59:01 dbtools01 kernel: [<f88edd7e>] ext3_file_write+0x19/0x8b [ext3] Jul 14 10:59:01 dbtools01 kernel: [<c015a4e0>] do_sync_write+0x97/0xc9 Jul 14 10:59:01 dbtools01 kernel: [<c016a808>] poll_freewait+0x33/0x38 Jul 14 10:59:01 dbtools01 kernel: [<c0164adc>] pipe_readv+0x292/0x29e Jul 14 10:59:01 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 10:59:01 dbtools01 kernel: [<c015a5c8>] vfs_write+0xb6/0xe2 Jul 14 10:59:01 dbtools01 kernel: [<c015a692>] sys_write+0x3c/0x62 Jul 14 10:59:01 dbtools01 kernel: [<c02d268f>] syscall_call+0x7/0xb There weren't any messages from DRBD on either box, and it was in Secondary/Primary state on the secondary. I then did a reboot -f. When the primary came back up, it started to resync (heartbeat wasn't running on the secondary, so there was no takeover attempt while it was rebooting): Jul 14 11:04:33 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFConnection --> WFReportParams Jul 14 11:04:33 dbtools01 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Jul 14 11:04:33 dbtools01 kernel: drbd0: Connection established. Jul 14 11:04:33 dbtools01 kernel: drbd0: I am(S): 1:00000004:00000001:0000001d:00000002:10 Jul 14 11:04:33 dbtools01 kernel: drbd0: Peer(S): 1:00000004:00000001:0000001d:00000002:01 Jul 14 11:04:33 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFReportParams --> WFBitMapS Jul 14 11:04:33 dbtools01 drbd: WARN: stdin/stdout is not a TTY; using /dev/console Jul 14 11:04:33 dbtools01 rc: Starting drbd: succeeded Jul 14 11:04:33 dbtools01 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary Jul 14 11:04:33 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFBitMapS --> SyncSource Jul 14 11:04:33 dbtools01 kernel: drbd0: Resync started as SyncSource (need to sync 802816 KB [200704 bits set]). Jul 14 11:05:06 dbtools01 kernel: drbd0: Secondary/Secondary --> Primary/Secondary but then showed the transfer as 'stalled' on both ends: version: 0.7.20 (api:79/proto:74) SVN Revision: 2260 build by root at build03.syd.optusnet.com.au, 2006-07-14 10:39:06 0: cs:SyncSource st:Primary/Secondary ld:Consistent ns:98444 nr:0 dw:6644 dr:115025 al:6 bm:214 lo:0 pe:0 ua:0 ap:0 [>...................] sync'ed: 0.6% (710976/710976)K stalled Jul 14 11:05:13 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967295 Jul 14 11:05:16 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967294 Jul 14 11:05:19 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967293 Jul 14 11:05:22 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967292 Jul 14 11:05:25 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967291 Jul 14 11:05:28 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967290 .... Jul 14 11:09:16 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967215 Jul 14 11:09:19 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967214 Jul 14 11:09:22 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967213 Jul 14 11:09:25 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967212 Jul 14 11:09:28 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967211 postgres wasn't starting on the primary yet, either. sysrq-T shows the only D state thing being pdflush: Jul 14 11:09:34 dbtools01 kernel: pdflush D F7D8C680 2752 47 6 49 46 (L-TLB) Jul 14 11:09:34 dbtools01 kernel: f7cf2c58 00000046 f6337d80 f7d8c680 f6337d80 c022a224 f7d8c680 f7ce0b68 Jul 14 11:09:34 dbtools01 kernel: f6337d80 c022a79d f633522c c1816de0 00000001 000336a4 94d89130 00000024 Jul 14 11:09:34 dbtools01 kernel: f7e110b0 f7cc4130 f7cc429c c1bd965c 00000001 c1bd9284 00000246 c1bd928c Jul 14 11:09:34 dbtools01 kernel: Call Trace: Jul 14 11:09:34 dbtools01 kernel: [<c022a224>] cfq_add_crq_rb+0x3b/0x4d Jul 14 11:09:34 dbtools01 kernel: [<c022a79d>] cfq_enqueue+0x33/0x64 Jul 14 11:09:34 dbtools01 kernel: [<c02cfbb1>] __down+0x81/0xdb Jul 14 11:09:34 dbtools01 kernel: [<c011e71b>] default_wake_function+0x0/0xc Jul 14 11:09:34 dbtools01 kernel: [<c02cfd28>] __down_failed+0x8/0xc Jul 14 11:09:34 dbtools01 kernel: [<f8af934c>] .text.lock.drbd_main+0x41/0x18c [drbd] Jul 14 11:09:34 dbtools01 kernel: [<c015e990>] bio_clone+0x84/0x9c Jul 14 11:09:34 dbtools01 kernel: [<f8af263c>] drbd_make_request_common+0x499/0x744 [drbd] Jul 14 11:09:34 dbtools01 kernel: [<f885e89c>] __split_bio+0xfd/0x103 [dm_mod] Jul 14 11:09:34 dbtools01 kernel: [<f8af2aa7>] drbd_make_request_26+0x1c0/0x1c9 [drbd] Jul 14 11:09:34 dbtools01 kernel: [<c022431c>] generic_make_request+0x18e/0x19e Jul 14 11:09:34 dbtools01 kernel: [<c0120291>] autoremove_wake_function+0x0/0x2d Jul 14 11:09:34 dbtools01 kernel: [<c02243f6>] submit_bio+0xca/0xd2 Jul 14 11:09:34 dbtools01 kernel: [<c015e7c9>] bio_alloc+0x100/0x168 Jul 14 11:09:34 dbtools01 kernel: [<c015e180>] submit_bh+0x141/0x166 Jul 14 11:09:34 dbtools01 kernel: [<c015cc59>] __block_write_full_page+0x1f0/0x2ea Jul 14 11:09:34 dbtools01 kernel: [<c0160822>] blkdev_get_block+0x0/0x46 Jul 14 11:09:34 dbtools01 kernel: [<c015dfc8>] block_write_full_page+0xc5/0xce<3>drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967209 Jul 14 11:09:34 dbtools01 kernel: Jul 14 11:09:34 dbtools01 kernel: [<c0160822>] blkdev_get_block+0x0/0x46 Jul 14 11:09:34 dbtools01 kernel: [<c0177bfa>] mpage_writepages+0x1c2/0x314 Jul 14 11:09:34 dbtools01 kernel: [<c0160915>] blkdev_writepage+0x0/0xc Jul 14 11:09:34 dbtools01 kernel: [<c011dcf7>] find_busiest_group+0xdd/0x2ba Jul 14 11:09:34 dbtools01 kernel: [<c0144a4d>] do_writepages+0x19/0x27 Jul 14 11:09:34 dbtools01 kernel: [<c0176564>] __sync_single_inode+0x5f/0x1c1 Jul 14 11:09:34 dbtools01 kernel: [<c01768f8>] sync_sb_inodes+0x1a7/0x274 Jul 14 11:09:34 dbtools01 kernel: [<c0145160>] pdflush+0x0/0x1e Jul 14 11:09:34 dbtools01 kernel: [<c0176a56>] writeback_inodes+0x91/0xde Jul 14 11:09:34 dbtools01 kernel: [<c01448e3>] wb_kupdate+0x7b/0xde Jul 14 11:09:34 dbtools01 kernel: [<c01450cc>] __pdflush+0xec/0x180 Jul 14 11:09:34 dbtools01 kernel: [<c014517a>] pdflush+0x1a/0x1e Jul 14 11:09:34 dbtools01 kernel: [<c0144868>] wb_kupdate+0x0/0xde Jul 14 11:09:35 dbtools01 kernel: [<c0145160>] pdflush+0x0/0x1e Jul 14 11:09:35 dbtools01 kernel: [<c0133ead>] kthread+0x73/0x9b Jul 14 11:09:35 dbtools01 kernel: [<c0133e3a>] kthread+0x0/0x9b Jul 14 11:09:35 dbtools01 kernel: [<c01041f5>] kernel_thread_helper+0x5/0xb I stopped/started drbd on the secondary, and the primary logged: Jul 14 11:10:02 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967202 Jul 14 11:10:02 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967201 Jul 14 11:10:02 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967200 Jul 14 11:10:02 dbtools01 kernel: drbd0: drbd0_asender [1985]: cstate SyncSource --> NetworkFailure Jul 14 11:10:02 dbtools01 kernel: drbd0: asender terminated Jul 14 11:10:02 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate NetworkFailure --> BrokenPipe Jul 14 11:10:02 dbtools01 kernel: drbd0: short read expecting header on sock: r=-512 Jul 14 11:10:02 dbtools01 kernel: drbd0: worker terminated Jul 14 11:10:02 dbtools01 kernel: drbd0: _drbd_send_page: size=4096 len=2064 sent=-4 Jul 14 11:10:02 dbtools01 kernel: drbd0: ASSERT( mdev->ee_in_use == 0 ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_receiver.c:1904 Jul 14 11:10:02 dbtools01 kernel: kjournald starting. Commit interval 5 seconds Jul 14 11:10:02 dbtools01 kernel: EXT3 FS on drbd0, internal journal Jul 14 11:10:02 dbtools01 kernel: EXT3-fs: recovery complete. Jul 14 11:10:02 dbtools01 kernel: EXT3-fs: mounted filesystem with ordered data mode. Jul 14 11:10:02 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate BrokenPipe --> Unconnected Jul 14 11:10:02 dbtools01 kernel: drbd0: Connection lost. Jul 14 11:10:02 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate Unconnected --> WFConnection Jul 14 11:10:03 dbtools01 postgresql: Starting postgresql service: succeeded Jul 14 11:10:21 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFConnection --> WFReportParams Jul 14 11:10:21 dbtools01 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Jul 14 11:10:21 dbtools01 kernel: drbd0: Connection established. Jul 14 11:10:21 dbtools01 kernel: drbd0: I am(P): 1:00000004:00000001:0000001f:00000002:10 Jul 14 11:10:21 dbtools01 kernel: drbd0: Peer(S): 0:00000004:00000001:0000001d:00000002:00 Jul 14 11:10:21 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFReportParams --> WFBitMapS Jul 14 11:10:21 dbtools01 kernel: drbd0: Primary/Unknown --> Primary/Secondary Jul 14 11:10:21 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFBitMapS --> SyncSource Jul 14 11:10:21 dbtools01 kernel: drbd0: Resync started as SyncSource (need to sync 719188 KB [179797 bits set]). But it still got stuck as 'stalled'. I left it for an hour or so, did another stop/start, and that didn't help. I then stopped it on the secondary, and shut down drbd on the primary too, for a cold start: Jul 14 12:37:10 dbtools01 kernel: drbd0: drbdsetup [3816]: cstate WFConnection --> Unconnected Jul 14 12:37:10 dbtools01 kernel: drbd0: worker terminated Jul 14 12:37:10 dbtools01 kernel: drbd0: ASSERT( mdev->ee_in_use == 0 ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_receiver.c:1904 Jul 14 12:37:10 dbtools01 kernel: drbd0: drbd0_receiver [3390]: cstate Unconnected --> StandAlone Jul 14 12:37:10 dbtools01 kernel: drbd0: Connection lost. Jul 14 12:37:10 dbtools01 kernel: drbd0: Discarding network configuration. Jul 14 12:37:10 dbtools01 kernel: drbd0: drbd0_receiver [3390]: cstate StandAlone --> StandAlone Jul 14 12:37:10 dbtools01 kernel: drbd0: receiver terminated Jul 14 12:37:10 dbtools01 kernel: drbd0: drbdsetup [3816]: cstate StandAlone --> StandAlone Jul 14 12:37:10 dbtools01 kernel: drbd0: drbdsetup [3816]: cstate StandAlone --> Unconfigured Jul 14 12:37:10 dbtools01 kernel: drbd0: worker terminated Jul 14 12:37:10 dbtools01 kernel: drbd0: ee_in_use:2048 ee_vacant:0 epoch_size:0 Jul 14 12:37:10 dbtools01 kernel: drbd0: ASSERT( list_empty(&mdev->net_ee) ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_main.c:1501 Jul 14 12:37:10 dbtools01 kernel: drbd0: ASSERT( mdev->ee_in_use==0 ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_main.c:1617 Jul 14 12:37:10 dbtools01 kernel: drbd0: 2048 EEs in net list found! Jul 14 12:37:10 dbtools01 kernel: drbd0: ASSERT( mdev->ee_vacant == 0 ) in /usr/src/redhat/BUILD/drbd-0.7.20/drbd/drbd_main.c:1644 Jul 14 12:37:10 dbtools01 kernel: drbd: module cleanup done. and then started DRBD again on the primary: Jul 14 12:37:14 dbtools01 kernel: drbd: initialised. Version: 0.7.20 (api:79/proto:74) Jul 14 12:37:14 dbtools01 kernel: drbd: SVN Revision: 2260 build by root at build03.syd.optusnet.com.au, 2006-07-14 10:39:06 Jul 14 12:37:14 dbtools01 kernel: drbd: registered as block device major 147 Jul 14 12:37:20 dbtools01 kernel: drbd0: resync bitmap: bits=5210112 words=162816 Jul 14 12:37:20 dbtools01 kernel: drbd0: size = 19 GB (20840448 KB) Jul 14 12:37:20 dbtools01 kernel: klogd 1.4.1, ---------- state change ---------- Jul 14 12:37:20 dbtools01 kernel: drbd0: 694 MB marked out-of-sync by on disk bit-map. Jul 14 12:37:20 dbtools01 kernel: drbd0: Found 6 transactions (266 active extents) in activity log. Jul 14 12:37:20 dbtools01 kernel: drbd0: drbdsetup [3866]: cstate Unconfigured --> StandAlone Jul 14 12:37:20 dbtools01 kernel: drbd0: drbdsetup [3879]: cstate StandAlone --> Unconnected Jul 14 12:37:20 dbtools01 kernel: drbd0: drbd0_receiver [3880]: cstate Unconnected --> WFConnection and then brought up heartbeat on the primary, and drbd on the secondary, and it all synced fine: Jul 14 12:39:13 dbtools01 kernel: drbd0: drbd0_receiver [3880]: cstate WFReportParams --> WFBitMapS Jul 14 12:39:13 dbtools01 kernel: drbd0: Primary/Unknown --> Primary/Secondary Jul 14 12:39:13 dbtools01 kernel: drbd0: drbd0_receiver [3880]: cstate WFBitMapS --> SyncSource Jul 14 12:39:13 dbtools01 kernel: drbd0: Resync started as SyncSource (need to sync 711360 KB [177840 bits set]). Jul 14 12:39:36 dbtools01 kernel: drbd0: Resync done (total 23 sec; paused 0 sec; 30928 K/sec) Jul 14 12:39:36 dbtools01 kernel: drbd0: drbd0_worker [3867]: cstate SyncSource --> Connected I've seen this before a few times when doing upgrades/reboots, on multiple different DRBD pairs. However, that was running older versions of DRBD, and from the release notes for the last few versions I thought that this was the situation mentioned there, and so would be fixed. However, I can't get a reproducable test case. Does the above give enough information to work out what the problem is? This is our test cluster, so I'm happy to try patches/etc. I'm also going to see if I can reproduce this. Thanks, Bradley