Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Fortunately the data volume was only mounted but not in use. I found a similar list post on http://lists.linbit.com/pipermail/drbd-user/2008-April/009156.html but it had no replies on what could cause this. I've been thinking the DRBD traffic should be on a separate network but have not set this up yet. Right now the DRBD traffic goes over the same vNetwork that other traffic goes over, including multicast VIP traffic form both LVS and pacemaker clusters. In words the SyncSource node started using a critical load average of resources and became unresponsive. This is a VM setup split over different physical ESX hosts but even the local console was dead. So a forced reset was in order. The cluster services came up fine, corosync+pacemaker+o2cb+ocfs2_dlm. The cluster is Debian Squeeze with corosync, pacemaker, openais and cman from backports. Only corosync and pacemaker are services actually used. Other packages are only installed for access to things like fencing and resource agents. Drbd 8.3.7 is used from Debian stable repository. The drbd config is mostly stock, here is the reource definition. resource shared0 { meta-disk internal; device /dev/drbd1; syncer { verify-alg sha1; } net { allow-two-primaries; } on appserver01 { disk /dev/mapper/shared0_appserver01-lv0; address 10.221.182.31:7789; } on appserver02 { disk /dev/mapper/shared0_appserver02-lv0; address 10.221.182.32:7789; } } The logs on the SyncSource node show the following happening at the time of the failure. Dec 2 02:09:56 appserver01 kernel: [123911.353113] block drbd1: peer( Primary -> Unknown ) conn( SyncSource -> NetworkFailure ) Dec 2 02:09:56 appserver01 kernel: [123911.353123] block drbd1: asender terminated Dec 2 02:09:56 appserver01 kernel: [123911.353126] block drbd1: Terminating drbd1_asender Dec 2 02:09:56 appserver01 kernel: [123911.353967] block drbd1: Connection closed Dec 2 02:09:56 appserver01 kernel: [123911.353974] block drbd1: conn( NetworkFailure -> Unconnected ) Dec 2 02:09:56 appserver01 kernel: [123911.353977] block drbd1: receiver terminated Dec 2 02:09:56 appserver01 kernel: [123911.353978] block drbd1: Restarting drbd1_receiver Dec 2 02:09:56 appserver01 kernel: [123911.353980] block drbd1: receiver (re)started Dec 2 02:09:56 appserver01 kernel: [123911.353983] block drbd1: conn( Unconnected -> WFConnection ) Dec 2 02:13:06 appserver01 kernel: [124101.093326] ocfs2rec D ffff88017e7fa350 0 26221 2 0x00000000 Dec 2 02:13:06 appserver01 kernel: [124101.093330] ffff88017e7fa350 0000000000000046 ffff88018dad4000 0000000000000010 Dec 2 02:13:06 appserver01 kernel: [124101.093333] 0000000000000616 ffffea000455c168 000000000000f9e0 ffff88018dad5fd8 Dec 2 02:13:06 appserver01 kernel: [124101.093335] 0000000000015780 0000000000015780 ffff88017e266350 ffff88017e266648 Dec 2 02:13:06 appserver01 kernel: [124101.093338] Call Trace: Dec 2 02:13:06 appserver01 kernel: [124101.093346] [<ffffffff812fcc4f>] ? rwsem_down_failed_common+0x8c/0xa8 Dec 2 02:13:06 appserver01 kernel: [124101.093348] [<ffffffff812fccb2>] ? rwsem_down_read_failed+0x22/0x2b Dec 2 02:13:06 appserver01 kernel: [124101.093353] [<ffffffff811965f4>] ? call_rwsem_down_read_failed+0x14/0x30 Dec 2 02:13:06 appserver01 kernel: [124101.093359] [<ffffffffa028f0bc>] ? user_dlm_lock+0x0/0x47 [ocfs2_stack_user] Dec 2 02:13:06 appserver01 kernel: [124101.093363] [<ffffffff810b885b>] ? zone_watermark_ok+0x20/0xb1 Dec 2 02:13:06 appserver01 kernel: [124101.093365] [<ffffffff812fc665>] ? down_read+0x17/0x19 Dec 2 02:13:06 appserver01 kernel: [124101.093371] [<ffffffffa02133b6>] ? dlm_lock+0x56/0x149 [dlm] Dec 2 02:13:06 appserver01 kernel: [124101.093374] [<ffffffff810c79c0>] ? zone_statistics+0x3c/0x5d Dec 2 02:13:06 appserver01 kernel: [124101.093377] [<ffffffffa028f0fe>] ? user_dlm_lock+0x42/0x47 [ocfs2_stack_user] Dec 2 02:13:06 appserver01 kernel: [124101.093380] [<ffffffffa028f000>] ? fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user] Dec 2 02:13:06 appserver01 kernel: [124101.093382] [<ffffffffa028f02d>] ? fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user] Dec 2 02:13:06 appserver01 kernel: [124101.093391] [<ffffffffa031587a>] ? __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2] Dec 2 02:13:06 appserver01 kernel: [124101.093395] [<ffffffff8100f657>] ? __switch_to+0x140/0x297 Dec 2 02:13:06 appserver01 kernel: [124101.093402] [<ffffffffa0315cd8>] ? ocfs2_cluster_lock+0x15/0x17 [ocfs2] Dec 2 02:13:06 appserver01 kernel: [124101.093408] [<ffffffffa03195c2>] ? ocfs2_super_lock+0xc7/0x2a9 [ocfs2] Dec 2 02:13:06 appserver01 kernel: [124101.093415] [<ffffffffa03195c2>] ? ocfs2_super_lock+0xc7/0x2a9 [ocfs2] Dec 2 02:13:06 appserver01 kernel: [124101.093421] [<ffffffffa0329f9e>] ? __ocfs2_recovery_thread+0x0/0x122b [ocfs2] Dec 2 02:13:06 appserver01 kernel: [124101.093428] [<ffffffffa032a07f>] ? __ocfs2_recovery_thread+0xe1/0x122b [ocfs2] Dec 2 02:13:06 appserver01 kernel: [124101.093430] [<ffffffff812fba90>] ? thread_return+0x79/0xe0 Dec 2 02:13:06 appserver01 kernel: [124101.093433] [<ffffffff8103a403>] ? activate_task+0x22/0x28 Dec 2 02:13:06 appserver01 kernel: [124101.093436] [<ffffffff8104a44f>] ? try_to_wake_up+0x289/0x29b Dec 2 02:13:06 appserver01 kernel: [124101.093443] [<ffffffffa0329f9e>] ? __ocfs2_recovery_thread+0x0/0x122b [ocfs2] Dec 2 02:13:06 appserver01 kernel: [124101.093446] [<ffffffff81064d79>] ? kthread+0x79/0x81 Dec 2 02:13:06 appserver01 kernel: [124101.093449] [<ffffffff81011baa>] ? child_rip+0xa/0x20 Dec 2 02:13:06 appserver01 kernel: [124101.093451] [<ffffffff81064d00>] ? kthread+0x0/0x81 Dec 2 02:13:06 appserver01 kernel: [124101.093453] [<ffffffff81011ba0>] ? child_rip+0x0/0x20 Then a few moments passed. Dec 2 02:13:32 appserver01 kernel: [124127.071151] block drbd1: Handshake successful: Agreed network protocol version 91 Dec 2 02:13:32 appserver01 kernel: [124127.071157] block drbd1: conn( WFConnection -> WFReportParams ) Dec 2 02:13:32 appserver01 kernel: [124127.076732] block drbd1: Starting asender thread (from drbd1_receiver [7526]) Dec 2 02:13:32 appserver01 kernel: [124127.078447] block drbd1: data-integrity-alg: <not-used> Dec 2 02:13:32 appserver01 kernel: [124127.078456] block drbd1: drbd_sync_handshake: Dec 2 02:13:32 appserver01 kernel: [124127.078459] block drbd1: self 7843E95E721AF0ED:54BC6F3AD7F42585:52FF69A8720BCEAC:BA309D9B7FCA3C07 bits:115301551 flags:0 Dec 2 02:13:32 appserver01 kernel: [124127.078461] block drbd1: peer 54BC6F3AD7F42584:0000000000000000:0000000000000000:0000000000000000 bits:115314775 flags:2 Dec 2 02:13:32 appserver01 kernel: [124127.078464] block drbd1: uuid_compare()=1 by rule 70 Dec 2 02:13:32 appserver01 kernel: [124127.078465] block drbd1: Becoming sync source due to disk states. Dec 2 02:13:32 appserver01 kernel: [124127.078469] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) Dec 2 02:13:39 appserver01 kernel: [124134.091066] block drbd1: conn( WFBitMapS -> SyncSource ) Dec 2 02:13:39 appserver01 kernel: [124134.091078] block drbd1: Began resync as SyncSource (will sync 461259100 KB [115314775 bits set]). And after yet some more moments passing it started to repeatedly post call traces. Here is just one cycle of these traces. At this point the load was critical and I must assume the server was unresponsive because the status of the alarms didn't change until manual intervention. It kept posting call traces for 4 minutes and then I must assume DRBD died because it was quiet until reboot. Dec 2 02:15:06 appserver01 kernel: [124220.996240] ocfs2rec D ffff88017e7fa350 0 26221 2 0x00000000 Dec 2 02:15:06 appserver01 kernel: [124220.996244] ffff88017e7fa350 0000000000000046 ffff88018dad4000 0000000000000010 Dec 2 02:15:06 appserver01 kernel: [124220.996247] 0000000000000616 ffffea000455c168 000000000000f9e0 ffff88018dad5fd8 Dec 2 02:15:06 appserver01 kernel: [124220.996250] 0000000000015780 0000000000015780 ffff88017e266350 ffff88017e266648 Dec 2 02:15:06 appserver01 kernel: [124220.996252] Call Trace: Dec 2 02:15:06 appserver01 kernel: [124220.996260] [<ffffffff812fcc4f>] ? rwsem_down_failed_common+0x8c/0xa8 Dec 2 02:15:06 appserver01 kernel: [124220.996262] [<ffffffff812fccb2>] ? rwsem_down_read_failed+0x22/0x2b Dec 2 02:15:06 appserver01 kernel: [124220.996267] [<ffffffff811965f4>] ? call_rwsem_down_read_failed+0x14/0x30 Dec 2 02:15:06 appserver01 kernel: [124220.996273] [<ffffffffa028f0bc>] ? user_dlm_lock+0x0/0x47 [ocfs2_stack_user] Dec 2 02:15:06 appserver01 kernel: [124220.996277] [<ffffffff810b885b>] ? zone_watermark_ok+0x20/0xb1 Dec 2 02:15:06 appserver01 kernel: [124220.996279] [<ffffffff812fc665>] ? down_read+0x17/0x19 Dec 2 02:15:06 appserver01 kernel: [124220.996285] [<ffffffffa02133b6>] ? dlm_lock+0x56/0x149 [dlm] Dec 2 02:15:06 appserver01 kernel: [124220.996289] [<ffffffff810c79c0>] ? zone_statistics+0x3c/0x5d Dec 2 02:15:06 appserver01 kernel: [124220.996291] [<ffffffffa028f0fe>] ? user_dlm_lock+0x42/0x47 [ocfs2_stack_user] Dec 2 02:15:06 appserver01 kernel: [124220.996294] [<ffffffffa028f000>] ? fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user] Dec 2 02:15:06 appserver01 kernel: [124220.996297] [<ffffffffa028f02d>] ? fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user] Dec 2 02:15:06 appserver01 kernel: [124220.996305] [<ffffffffa031587a>] ? __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996310] [<ffffffff8100f657>] ? __switch_to+0x140/0x297 Dec 2 02:15:06 appserver01 kernel: [124220.996317] [<ffffffffa0315cd8>] ? ocfs2_cluster_lock+0x15/0x17 [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996323] [<ffffffffa03195c2>] ? ocfs2_super_lock+0xc7/0x2a9 [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996330] [<ffffffffa03195c2>] ? ocfs2_super_lock+0xc7/0x2a9 [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996337] [<ffffffffa0329f9e>] ? __ocfs2_recovery_thread+0x0/0x122b [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996343] [<ffffffffa032a07f>] ? __ocfs2_recovery_thread+0xe1/0x122b [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996346] [<ffffffff812fba90>] ? thread_return+0x79/0xe0 Dec 2 02:15:06 appserver01 kernel: [124220.996349] [<ffffffff8103a403>] ? activate_task+0x22/0x28 Dec 2 02:15:06 appserver01 kernel: [124220.996352] [<ffffffff8104a44f>] ? try_to_wake_up+0x289/0x29b Dec 2 02:15:06 appserver01 kernel: [124220.996359] [<ffffffffa0329f9e>] ? __ocfs2_recovery_thread+0x0/0x122b [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996362] [<ffffffff81064d79>] ? kthread+0x79/0x81 Dec 2 02:15:06 appserver01 kernel: [124220.996364] [<ffffffff81011baa>] ? child_rip+0xa/0x20 Dec 2 02:15:06 appserver01 kernel: [124220.996366] [<ffffffff81064d00>] ? kthread+0x0/0x81 Dec 2 02:15:06 appserver01 kernel: [124220.996368] [<ffffffff81011ba0>] ? child_rip+0x0/0x20 Dec 2 02:15:06 appserver01 kernel: [124220.996556] ls D ffff8801bb5a2a60 0 26318 26317 0x00000000 Dec 2 02:15:06 appserver01 kernel: [124220.996559] ffff8801bb5a2a60 0000000000000082 ffff8801bb7734c8 ffffffff81103ab9 Dec 2 02:15:06 appserver01 kernel: [124220.996561] ffff88016843dd58 ffff88016843ddf8 000000000000f9e0 ffff88016843dfd8 Dec 2 02:15:06 appserver01 kernel: [124220.996563] 0000000000015780 0000000000015780 ffff8801bcf1a350 ffff8801bcf1a648 Dec 2 02:15:06 appserver01 kernel: [124220.996566] Call Trace: Dec 2 02:15:06 appserver01 kernel: [124220.996570] [<ffffffff81103ab9>] ? mntput_no_expire+0x23/0xee Dec 2 02:15:06 appserver01 kernel: [124220.996573] [<ffffffff810f75af>] ? __link_path_walk+0x6f0/0x6f5 Dec 2 02:15:06 appserver01 kernel: [124220.996580] [<ffffffffa03296af>] ? ocfs2_wait_for_recovery+0x9d/0xb7 [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996582] [<ffffffff81065046>] ? autoremove_wake_function+0x0/0x2e Dec 2 02:15:06 appserver01 kernel: [124220.996589] [<ffffffffa0319923>] ? ocfs2_inode_lock_full_nested+0x16b/0xb2c [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996596] [<ffffffffa0324f2d>] ? ocfs2_inode_revalidate+0x145/0x221 [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996603] [<ffffffffa03208d9>] ? ocfs2_getattr+0x79/0x16a [ocfs2] Dec 2 02:15:06 appserver01 kernel: [124220.996606] [<ffffffff810f2591>] ? vfs_fstatat+0x43/0x57 Dec 2 02:15:06 appserver01 kernel: [124220.996609] [<ffffffff810f25fb>] ? sys_newlstat+0x11/0x30 Dec 2 02:15:06 appserver01 kernel: [124220.996612] [<ffffffff812ff306>] ? do_page_fault+0x2e0/0x2fc Dec 2 02:15:06 appserver01 kernel: [124220.996614] [<ffffffff812fd1a5>] ? page_fault+0x25/0x30 Dec 2 02:15:06 appserver01 kernel: [124220.996616] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b Dec 2 02:17:06 appserver01 kernel: [124340.899149] events/0 D ffff88017e7faa60 0 6 2 0x00000000 Dec 2 02:17:06 appserver01 kernel: [124340.899153] ffff88017e7faa60 0000000000000046 ffff880006e157e8 ffff8801bf09e388 Dec 2 02:17:06 appserver01 kernel: [124340.899157] ffff8801bc88f1b8 ffff8801bc88f1a8 000000000000f9e0 ffff8801bf0b3fd8 Dec 2 02:17:06 appserver01 kernel: [124340.899160] 0000000000015780 0000000000015780 ffff8801bf09e350 ffff8801bf09e648 Dec 2 02:17:06 appserver01 kernel: [124340.899162] Call Trace: Dec 2 02:17:06 appserver01 kernel: [124340.899169] [<ffffffff812fba90>] ? thread_return+0x79/0xe0 Dec 2 02:17:06 appserver01 kernel: [124340.899172] [<ffffffff812fcc4f>] ? rwsem_down_failed_common+0x8c/0xa8 Dec 2 02:17:06 appserver01 kernel: [124340.899175] [<ffffffff812fccb2>] ? rwsem_down_read_failed+0x22/0x2b Dec 2 02:17:06 appserver01 kernel: [124340.899179] [<ffffffff811965f4>] ? call_rwsem_down_read_failed+0x14/0x30 Dec 2 02:17:06 appserver01 kernel: [124340.899185] [<ffffffffa028f0bc>] ? user_dlm_lock+0x0/0x47 [ocfs2_stack_user] Dec 2 02:17:06 appserver01 kernel: [124340.899188] [<ffffffff812fc665>] ? down_read+0x17/0x19 Dec 2 02:17:06 appserver01 kernel: [124340.899193] [<ffffffffa02133b6>] ? dlm_lock+0x56/0x149 [dlm] Dec 2 02:17:06 appserver01 kernel: [124340.899198] [<ffffffff810168c1>] ? sched_clock+0x5/0x8 Dec 2 02:17:06 appserver01 kernel: [124340.899202] [<ffffffff81049412>] ? update_rq_clock+0xf/0x28 Dec 2 02:17:06 appserver01 kernel: [124340.899205] [<ffffffff8104a44f>] ? try_to_wake_up+0x289/0x29b Dec 2 02:17:06 appserver01 kernel: [124340.899209] [<ffffffff810fd0ce>] ? pollwake+0x53/0x59 Dec 2 02:17:06 appserver01 kernel: [124340.899211] [<ffffffff8104a461>] ? default_wake_function+0x0/0x9 Dec 2 02:17:06 appserver01 kernel: [124340.899214] [<ffffffffa028f0fe>] ? user_dlm_lock+0x42/0x47 [ocfs2_stack_user] Dec 2 02:17:06 appserver01 kernel: [124340.899217] [<ffffffffa028f000>] ? fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user] Dec 2 02:17:06 appserver01 kernel: [124340.899219] [<ffffffffa028f02d>] ? fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user] Dec 2 02:17:06 appserver01 kernel: [124340.899228] [<ffffffffa031587a>] ? __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2] Dec 2 02:17:06 appserver01 kernel: [124340.899231] [<ffffffff812fba90>] ? thread_return+0x79/0xe0 Dec 2 02:17:06 appserver01 kernel: [124340.899237] [<ffffffffa0315cd8>] ? ocfs2_cluster_lock+0x15/0x17 [ocfs2] Dec 2 02:17:06 appserver01 kernel: [124340.899244] [<ffffffffa0317472>] ? ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2] Dec 2 02:17:06 appserver01 kernel: [124340.899250] [<ffffffffa0317472>] ? ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2] Dec 2 02:17:06 appserver01 kernel: [124340.899257] [<ffffffffa0328abe>] ? ocfs2_queue_orphan_scan+0x29/0x126 [ocfs2] Dec 2 02:17:06 appserver01 kernel: [124340.899259] [<ffffffff812fc3c6>] ? mutex_lock+0xd/0x31 Dec 2 02:17:06 appserver01 kernel: [124340.899266] [<ffffffffa0328be0>] ? ocfs2_orphan_scan_work+0x25/0x4d [ocfs2] Dec 2 02:17:06 appserver01 kernel: [124340.899270] [<ffffffff81061a13>] ? worker_thread+0x188/0x21d Dec 2 02:17:06 appserver01 kernel: [124340.899276] [<ffffffffa0328bbb>] ? ocfs2_orphan_scan_work+0x0/0x4d [ocfs2] Dec 2 02:17:06 appserver01 kernel: [124340.899280] [<ffffffff81065046>] ? autoremove_wake_function+0x0/0x2e Dec 2 02:17:06 appserver01 kernel: [124340.899282] [<ffffffff8106188b>] ? worker_thread+0x0/0x21d Dec 2 02:17:06 appserver01 kernel: [124340.899284] [<ffffffff81064d79>] ? kthread+0x79/0x81 Dec 2 02:17:06 appserver01 kernel: [124340.899287] [<ffffffff81011baa>] ? child_rip+0xa/0x20 Dec 2 02:17:06 appserver01 kernel: [124340.899289] [<ffffffff81064d00>] ? kthread+0x0/0x81 Dec 2 02:17:06 appserver01 kernel: [124340.899291] [<ffffffff81011ba0>] ? child_rip+0x0/0x20 -- Hälsningar / Greetings Stefan Midjich [De omnibus dubitandum] -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20121202/d8286164/attachment.htm>