Fortunately the data volume was only mounted but not in use. <br><br>I found a similar list post on <a href="http://lists.linbit.com/pipermail/drbd-user/2008-April/009156.html">http://lists.linbit.com/pipermail/drbd-user/2008-April/009156.html</a> but it had no replies on what could cause this. I've been thinking the DRBD traffic should be on a separate network but have not set this up yet. Right now the DRBD traffic goes over the same vNetwork that other traffic goes over, including multicast VIP traffic form both LVS and pacemaker clusters. <br>
<br>In words the SyncSource node started using a critical load average of resources and became unresponsive. This is a VM setup split over different physical ESX hosts but even the local console was dead. So a forced reset was in order. <br>
<br>The cluster services came up fine, corosync+pacemaker+o2cb+ocfs2_dlm. The cluster is Debian Squeeze with corosync, pacemaker, openais and cman from backports. Only corosync and pacemaker are services actually used. Other packages are only installed for access to things like fencing and resource agents. Drbd 8.3.7 is used from Debian stable repository. <br>
<br>The drbd config is mostly stock, here is the reource definition. <br><br>resource shared0 {<br> meta-disk internal;<br> device /dev/drbd1;<br> syncer {<br> verify-alg sha1;<br> }<br> net {<br> allow-two-primaries;<br>
}<br> on appserver01 {<br> disk /dev/mapper/shared0_appserver01-lv0;<br> address <a href="http://10.221.182.31:7789">10.221.182.31:7789</a>;<br> }<br> on appserver02 {<br> disk /dev/mapper/shared0_appserver02-lv0;<br>
address <a href="http://10.221.182.32:7789">10.221.182.32:7789</a>;<br> }<br>}<br><br>The logs on the SyncSource node show the following happening at the time of the failure. <br><br>Dec 2 02:09:56 appserver01 kernel: [123911.353113] block drbd1: peer( Primary -> Unknown ) conn( SyncSource -> NetworkFailure ) <br>
Dec 2 02:09:56 appserver01 kernel: [123911.353123] block drbd1: asender terminated<br>Dec 2 02:09:56 appserver01 kernel: [123911.353126] block drbd1: Terminating drbd1_asender<br>Dec 2 02:09:56 appserver01 kernel: [123911.353967] block drbd1: Connection closed<br>
Dec 2 02:09:56 appserver01 kernel: [123911.353974] block drbd1: conn( NetworkFailure -> Unconnected ) <br>Dec 2 02:09:56 appserver01 kernel: [123911.353977] block drbd1: receiver terminated<br>Dec 2 02:09:56 appserver01 kernel: [123911.353978] block drbd1: Restarting drbd1_receiver<br>
Dec 2 02:09:56 appserver01 kernel: [123911.353980] block drbd1: receiver (re)started<br>Dec 2 02:09:56 appserver01 kernel: [123911.353983] block drbd1: conn( Unconnected -> WFConnection ) <br>Dec 2 02:13:06 appserver01 kernel: [124101.093326] ocfs2rec D ffff88017e7fa350 0 26221 2 0x00000000<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093330] ffff88017e7fa350 0000000000000046 ffff88018dad4000 0000000000000010<br>Dec 2 02:13:06 appserver01 kernel: [124101.093333] 0000000000000616 ffffea000455c168 000000000000f9e0 ffff88018dad5fd8<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093335] 0000000000015780 0000000000015780 ffff88017e266350 ffff88017e266648<br>Dec 2 02:13:06 appserver01 kernel: [124101.093338] Call Trace:<br>Dec 2 02:13:06 appserver01 kernel: [124101.093346] [<ffffffff812fcc4f>] ? rwsem_down_failed_common+0x8c/0xa8<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093348] [<ffffffff812fccb2>] ? rwsem_down_read_failed+0x22/0x2b<br>Dec 2 02:13:06 appserver01 kernel: [124101.093353] [<ffffffff811965f4>] ? call_rwsem_down_read_failed+0x14/0x30<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093359] [<ffffffffa028f0bc>] ? user_dlm_lock+0x0/0x47 [ocfs2_stack_user]<br>Dec 2 02:13:06 appserver01 kernel: [124101.093363] [<ffffffff810b885b>] ? zone_watermark_ok+0x20/0xb1<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093365] [<ffffffff812fc665>] ? down_read+0x17/0x19<br>Dec 2 02:13:06 appserver01 kernel: [124101.093371] [<ffffffffa02133b6>] ? dlm_lock+0x56/0x149 [dlm]<br>Dec 2 02:13:06 appserver01 kernel: [124101.093374] [<ffffffff810c79c0>] ? zone_statistics+0x3c/0x5d<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093377] [<ffffffffa028f0fe>] ? user_dlm_lock+0x42/0x47 [ocfs2_stack_user]<br>Dec 2 02:13:06 appserver01 kernel: [124101.093380] [<ffffffffa028f000>] ? fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user]<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093382] [<ffffffffa028f02d>] ? fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user]<br>Dec 2 02:13:06 appserver01 kernel: [124101.093391] [<ffffffffa031587a>] ? __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2]<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093395] [<ffffffff8100f657>] ? __switch_to+0x140/0x297<br>Dec 2 02:13:06 appserver01 kernel: [124101.093402] [<ffffffffa0315cd8>] ? ocfs2_cluster_lock+0x15/0x17 [ocfs2]<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093408] [<ffffffffa03195c2>] ? ocfs2_super_lock+0xc7/0x2a9 [ocfs2]<br>Dec 2 02:13:06 appserver01 kernel: [124101.093415] [<ffffffffa03195c2>] ? ocfs2_super_lock+0xc7/0x2a9 [ocfs2]<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093421] [<ffffffffa0329f9e>] ? __ocfs2_recovery_thread+0x0/0x122b [ocfs2]<br>Dec 2 02:13:06 appserver01 kernel: [124101.093428] [<ffffffffa032a07f>] ? __ocfs2_recovery_thread+0xe1/0x122b [ocfs2]<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093430] [<ffffffff812fba90>] ? thread_return+0x79/0xe0<br>Dec 2 02:13:06 appserver01 kernel: [124101.093433] [<ffffffff8103a403>] ? activate_task+0x22/0x28<br>Dec 2 02:13:06 appserver01 kernel: [124101.093436] [<ffffffff8104a44f>] ? try_to_wake_up+0x289/0x29b<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093443] [<ffffffffa0329f9e>] ? __ocfs2_recovery_thread+0x0/0x122b [ocfs2]<br>Dec 2 02:13:06 appserver01 kernel: [124101.093446] [<ffffffff81064d79>] ? kthread+0x79/0x81<br>
Dec 2 02:13:06 appserver01 kernel: [124101.093449] [<ffffffff81011baa>] ? child_rip+0xa/0x20<br>Dec 2 02:13:06 appserver01 kernel: [124101.093451] [<ffffffff81064d00>] ? kthread+0x0/0x81<br>Dec 2 02:13:06 appserver01 kernel: [124101.093453] [<ffffffff81011ba0>] ? child_rip+0x0/0x20<br>
<br>Then a few moments passed.<br><br>Dec 2 02:13:32 appserver01 kernel: [124127.071151] block drbd1: Handshake successful: Agreed network protocol version 91<br>Dec 2 02:13:32 appserver01 kernel: [124127.071157] block drbd1: conn( WFConnection -> WFReportParams ) <br>
Dec 2 02:13:32 appserver01 kernel: [124127.076732] block drbd1: Starting asender thread (from drbd1_receiver [7526])<br>Dec 2 02:13:32 appserver01 kernel: [124127.078447] block drbd1: data-integrity-alg: <not-used><br>
Dec 2 02:13:32 appserver01 kernel: [124127.078456] block drbd1: drbd_sync_handshake:<br>Dec 2 02:13:32 appserver01 kernel: [124127.078459] block drbd1: self 7843E95E721AF0ED:54BC6F3AD7F42585:52FF69A8720BCEAC:BA309D9B7FCA3C07 bits:115301551 flags:0<br>
Dec 2 02:13:32 appserver01 kernel: [124127.078461] block drbd1: peer 54BC6F3AD7F42584:0000000000000000:0000000000000000:0000000000000000 bits:115314775 flags:2<br>Dec 2 02:13:32 appserver01 kernel: [124127.078464] block drbd1: uuid_compare()=1 by rule 70<br>
Dec 2 02:13:32 appserver01 kernel: [124127.078465] block drbd1: Becoming sync source due to disk states.<br>Dec 2 02:13:32 appserver01 kernel: [124127.078469] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) <br>
Dec 2 02:13:39 appserver01 kernel: [124134.091066] block drbd1: conn( WFBitMapS -> SyncSource ) <br>Dec 2 02:13:39 appserver01 kernel: [124134.091078] block drbd1: Began resync as SyncSource (will sync 461259100 KB [115314775 bits set]).<br>
<br>And after yet some more moments passing it started to repeatedly post call traces. Here is just one cycle of these traces. At this point the load was critical and I must assume the server was unresponsive because the status of the alarms didn't change until manual intervention. It kept posting call traces for 4 minutes and then I must assume DRBD died because it was quiet until reboot. <br>
<br>Dec 2 02:15:06 appserver01 kernel: [124220.996240] ocfs2rec D ffff88017e7fa350 0 26221 2 0x00000000<br>Dec 2 02:15:06 appserver01 kernel: [124220.996244] ffff88017e7fa350 0000000000000046 ffff88018dad4000 0000000000000010<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996247] 0000000000000616 ffffea000455c168 000000000000f9e0 ffff88018dad5fd8<br>Dec 2 02:15:06 appserver01 kernel: [124220.996250] 0000000000015780 0000000000015780 ffff88017e266350 ffff88017e266648<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996252] Call Trace:<br>Dec 2 02:15:06 appserver01 kernel: [124220.996260] [<ffffffff812fcc4f>] ? rwsem_down_failed_common+0x8c/0xa8<br>Dec 2 02:15:06 appserver01 kernel: [124220.996262] [<ffffffff812fccb2>] ? rwsem_down_read_failed+0x22/0x2b<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996267] [<ffffffff811965f4>] ? call_rwsem_down_read_failed+0x14/0x30<br>Dec 2 02:15:06 appserver01 kernel: [124220.996273] [<ffffffffa028f0bc>] ? user_dlm_lock+0x0/0x47 [ocfs2_stack_user]<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996277] [<ffffffff810b885b>] ? zone_watermark_ok+0x20/0xb1<br>Dec 2 02:15:06 appserver01 kernel: [124220.996279] [<ffffffff812fc665>] ? down_read+0x17/0x19<br>Dec 2 02:15:06 appserver01 kernel: [124220.996285] [<ffffffffa02133b6>] ? dlm_lock+0x56/0x149 [dlm]<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996289] [<ffffffff810c79c0>] ? zone_statistics+0x3c/0x5d<br>Dec 2 02:15:06 appserver01 kernel: [124220.996291] [<ffffffffa028f0fe>] ? user_dlm_lock+0x42/0x47 [ocfs2_stack_user]<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996294] [<ffffffffa028f000>] ? fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user]<br>Dec 2 02:15:06 appserver01 kernel: [124220.996297] [<ffffffffa028f02d>] ? fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user]<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996305] [<ffffffffa031587a>] ? __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2]<br>Dec 2 02:15:06 appserver01 kernel: [124220.996310] [<ffffffff8100f657>] ? __switch_to+0x140/0x297<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996317] [<ffffffffa0315cd8>] ? ocfs2_cluster_lock+0x15/0x17 [ocfs2]<br>Dec 2 02:15:06 appserver01 kernel: [124220.996323] [<ffffffffa03195c2>] ? ocfs2_super_lock+0xc7/0x2a9 [ocfs2]<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996330] [<ffffffffa03195c2>] ? ocfs2_super_lock+0xc7/0x2a9 [ocfs2]<br>Dec 2 02:15:06 appserver01 kernel: [124220.996337] [<ffffffffa0329f9e>] ? __ocfs2_recovery_thread+0x0/0x122b [ocfs2]<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996343] [<ffffffffa032a07f>] ? __ocfs2_recovery_thread+0xe1/0x122b [ocfs2]<br>Dec 2 02:15:06 appserver01 kernel: [124220.996346] [<ffffffff812fba90>] ? thread_return+0x79/0xe0<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996349] [<ffffffff8103a403>] ? activate_task+0x22/0x28<br>Dec 2 02:15:06 appserver01 kernel: [124220.996352] [<ffffffff8104a44f>] ? try_to_wake_up+0x289/0x29b<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996359] [<ffffffffa0329f9e>] ? __ocfs2_recovery_thread+0x0/0x122b [ocfs2]<br>Dec 2 02:15:06 appserver01 kernel: [124220.996362] [<ffffffff81064d79>] ? kthread+0x79/0x81<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996364] [<ffffffff81011baa>] ? child_rip+0xa/0x20<br>Dec 2 02:15:06 appserver01 kernel: [124220.996366] [<ffffffff81064d00>] ? kthread+0x0/0x81<br>Dec 2 02:15:06 appserver01 kernel: [124220.996368] [<ffffffff81011ba0>] ? child_rip+0x0/0x20<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996556] ls D ffff8801bb5a2a60 0 26318 26317 0x00000000<br>Dec 2 02:15:06 appserver01 kernel: [124220.996559] ffff8801bb5a2a60 0000000000000082 ffff8801bb7734c8 ffffffff81103ab9<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996561] ffff88016843dd58 ffff88016843ddf8 000000000000f9e0 ffff88016843dfd8<br>Dec 2 02:15:06 appserver01 kernel: [124220.996563] 0000000000015780 0000000000015780 ffff8801bcf1a350 ffff8801bcf1a648<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996566] Call Trace:<br>Dec 2 02:15:06 appserver01 kernel: [124220.996570] [<ffffffff81103ab9>] ? mntput_no_expire+0x23/0xee<br>Dec 2 02:15:06 appserver01 kernel: [124220.996573] [<ffffffff810f75af>] ? __link_path_walk+0x6f0/0x6f5<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996580] [<ffffffffa03296af>] ? ocfs2_wait_for_recovery+0x9d/0xb7 [ocfs2]<br>Dec 2 02:15:06 appserver01 kernel: [124220.996582] [<ffffffff81065046>] ? autoremove_wake_function+0x0/0x2e<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996589] [<ffffffffa0319923>] ? ocfs2_inode_lock_full_nested+0x16b/0xb2c [ocfs2]<br>Dec 2 02:15:06 appserver01 kernel: [124220.996596] [<ffffffffa0324f2d>] ? ocfs2_inode_revalidate+0x145/0x221 [ocfs2]<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996603] [<ffffffffa03208d9>] ? ocfs2_getattr+0x79/0x16a [ocfs2]<br>Dec 2 02:15:06 appserver01 kernel: [124220.996606] [<ffffffff810f2591>] ? vfs_fstatat+0x43/0x57<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996609] [<ffffffff810f25fb>] ? sys_newlstat+0x11/0x30<br>Dec 2 02:15:06 appserver01 kernel: [124220.996612] [<ffffffff812ff306>] ? do_page_fault+0x2e0/0x2fc<br>Dec 2 02:15:06 appserver01 kernel: [124220.996614] [<ffffffff812fd1a5>] ? page_fault+0x25/0x30<br>
Dec 2 02:15:06 appserver01 kernel: [124220.996616] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b<br>Dec 2 02:17:06 appserver01 kernel: [124340.899149] events/0 D ffff88017e7faa60 0 6 2 0x00000000<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899153] ffff88017e7faa60 0000000000000046 ffff880006e157e8 ffff8801bf09e388<br>Dec 2 02:17:06 appserver01 kernel: [124340.899157] ffff8801bc88f1b8 ffff8801bc88f1a8 000000000000f9e0 ffff8801bf0b3fd8<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899160] 0000000000015780 0000000000015780 ffff8801bf09e350 ffff8801bf09e648<br>Dec 2 02:17:06 appserver01 kernel: [124340.899162] Call Trace:<br>Dec 2 02:17:06 appserver01 kernel: [124340.899169] [<ffffffff812fba90>] ? thread_return+0x79/0xe0<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899172] [<ffffffff812fcc4f>] ? rwsem_down_failed_common+0x8c/0xa8<br>Dec 2 02:17:06 appserver01 kernel: [124340.899175] [<ffffffff812fccb2>] ? rwsem_down_read_failed+0x22/0x2b<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899179] [<ffffffff811965f4>] ? call_rwsem_down_read_failed+0x14/0x30<br>Dec 2 02:17:06 appserver01 kernel: [124340.899185] [<ffffffffa028f0bc>] ? user_dlm_lock+0x0/0x47 [ocfs2_stack_user]<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899188] [<ffffffff812fc665>] ? down_read+0x17/0x19<br>Dec 2 02:17:06 appserver01 kernel: [124340.899193] [<ffffffffa02133b6>] ? dlm_lock+0x56/0x149 [dlm]<br>Dec 2 02:17:06 appserver01 kernel: [124340.899198] [<ffffffff810168c1>] ? sched_clock+0x5/0x8<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899202] [<ffffffff81049412>] ? update_rq_clock+0xf/0x28<br>Dec 2 02:17:06 appserver01 kernel: [124340.899205] [<ffffffff8104a44f>] ? try_to_wake_up+0x289/0x29b<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899209] [<ffffffff810fd0ce>] ? pollwake+0x53/0x59<br>Dec 2 02:17:06 appserver01 kernel: [124340.899211] [<ffffffff8104a461>] ? default_wake_function+0x0/0x9<br>Dec 2 02:17:06 appserver01 kernel: [124340.899214] [<ffffffffa028f0fe>] ? user_dlm_lock+0x42/0x47 [ocfs2_stack_user]<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899217] [<ffffffffa028f000>] ? fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user]<br>Dec 2 02:17:06 appserver01 kernel: [124340.899219] [<ffffffffa028f02d>] ? fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user]<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899228] [<ffffffffa031587a>] ? __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2]<br>Dec 2 02:17:06 appserver01 kernel: [124340.899231] [<ffffffff812fba90>] ? thread_return+0x79/0xe0<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899237] [<ffffffffa0315cd8>] ? ocfs2_cluster_lock+0x15/0x17 [ocfs2]<br>Dec 2 02:17:06 appserver01 kernel: [124340.899244] [<ffffffffa0317472>] ? ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2]<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899250] [<ffffffffa0317472>] ? ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2]<br>Dec 2 02:17:06 appserver01 kernel: [124340.899257] [<ffffffffa0328abe>] ? ocfs2_queue_orphan_scan+0x29/0x126 [ocfs2]<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899259] [<ffffffff812fc3c6>] ? mutex_lock+0xd/0x31<br>Dec 2 02:17:06 appserver01 kernel: [124340.899266] [<ffffffffa0328be0>] ? ocfs2_orphan_scan_work+0x25/0x4d [ocfs2]<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899270] [<ffffffff81061a13>] ? worker_thread+0x188/0x21d<br>Dec 2 02:17:06 appserver01 kernel: [124340.899276] [<ffffffffa0328bbb>] ? ocfs2_orphan_scan_work+0x0/0x4d [ocfs2]<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899280] [<ffffffff81065046>] ? autoremove_wake_function+0x0/0x2e<br>Dec 2 02:17:06 appserver01 kernel: [124340.899282] [<ffffffff8106188b>] ? worker_thread+0x0/0x21d<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899284] [<ffffffff81064d79>] ? kthread+0x79/0x81<br>Dec 2 02:17:06 appserver01 kernel: [124340.899287] [<ffffffff81011baa>] ? child_rip+0xa/0x20<br>Dec 2 02:17:06 appserver01 kernel: [124340.899289] [<ffffffff81064d00>] ? kthread+0x0/0x81<br>
Dec 2 02:17:06 appserver01 kernel: [124340.899291] [<ffffffff81011ba0>] ? child_rip+0x0/0x20<br clear="all"><br>-- <br>Hälsningar / Greetings<br><br>Stefan Midjich<br>[De omnibus dubitandum]<br>