Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Sun, Dec 02, 2012 at 11:52:47AM +0100, Stefan Midjich wrote: > Fortunately the data volume was only mounted but not in use. > > I found a similar list post on > http://lists.linbit.com/pipermail/drbd-user/2008-April/009156.html but it > had no replies on what could cause this. I've been thinking the DRBD > traffic should be on a separate network but have not set this up yet. Right > now the DRBD traffic goes over the same vNetwork that other traffic goes > over, including multicast VIP traffic form both LVS and pacemaker clusters. > > In words the SyncSource node started using a critical load average of > resources and became unresponsive. This is a VM setup split over different > physical ESX hosts but even the local console was dead. So a forced reset > was in order. > > The cluster services came up fine, corosync+pacemaker+o2cb+ocfs2_dlm. The With cluster file systems, you need tested and confirmed working fencing, aka STONITH. Fencing/STONITH is a hard requirement. This is not negotiable. If you try to get away without it, and the network layer has so much as a hickup, your IO will block. Hard. Up to here, this was not even considering DRBD... If you want to use cluster file systems on top of DRBD, you *additionally* need to integrate DRBD replication link breakage into your fencing setup. Some keywords to search for: fencing resource-and-stonith; fence-peer handler; obliterate-peer; > cluster is Debian Squeeze with corosync, pacemaker, openais and cman from > backports. Only corosync and pacemaker are services actually used. Other > packages are only installed for access to things like fencing and resource > agents. Drbd 8.3.7 is used from Debian stable repository. > > The drbd config is mostly stock, here is the reource definition. > > resource shared0 { > meta-disk internal; > device /dev/drbd1; > syncer { > verify-alg sha1; > } > net { > allow-two-primaries; > } > on appserver01 { > disk /dev/mapper/shared0_appserver01-lv0; > address 10.221.182.31:7789; > } > on appserver02 { > disk /dev/mapper/shared0_appserver02-lv0; > address 10.221.182.32:7789; > } > } > > The logs on the SyncSource node show the following happening at the time of > the failure. > > Dec 2 02:09:56 appserver01 kernel: [123911.353113] block drbd1: peer( > Primary -> Unknown ) conn( SyncSource -> NetworkFailure ) > Dec 2 02:09:56 appserver01 kernel: [123911.353123] block drbd1: asender > terminated > Dec 2 02:09:56 appserver01 kernel: [123911.353126] block drbd1: > Terminating drbd1_asender > Dec 2 02:09:56 appserver01 kernel: [123911.353967] block drbd1: Connection > closed > Dec 2 02:09:56 appserver01 kernel: [123911.353974] block drbd1: conn( > NetworkFailure -> Unconnected ) > Dec 2 02:09:56 appserver01 kernel: [123911.353977] block drbd1: receiver > terminated > Dec 2 02:09:56 appserver01 kernel: [123911.353978] block drbd1: Restarting > drbd1_receiver > Dec 2 02:09:56 appserver01 kernel: [123911.353980] block drbd1: receiver > (re)started > Dec 2 02:09:56 appserver01 kernel: [123911.353983] block drbd1: conn( > Unconnected -> WFConnection ) > Dec 2 02:13:06 appserver01 kernel: [124101.093326] ocfs2rec D > ffff88017e7fa350 0 26221 2 0x00000000 > Dec 2 02:13:06 appserver01 kernel: [124101.093330] ffff88017e7fa350 > 0000000000000046 ffff88018dad4000 0000000000000010 > Dec 2 02:13:06 appserver01 kernel: [124101.093333] 0000000000000616 > ffffea000455c168 000000000000f9e0 ffff88018dad5fd8 > Dec 2 02:13:06 appserver01 kernel: [124101.093335] 0000000000015780 > 0000000000015780 ffff88017e266350 ffff88017e266648 > Dec 2 02:13:06 appserver01 kernel: [124101.093338] Call Trace: > Dec 2 02:13:06 appserver01 kernel: [124101.093346] [<ffffffff812fcc4f>] ? > rwsem_down_failed_common+0x8c/0xa8 > Dec 2 02:13:06 appserver01 kernel: [124101.093348] [<ffffffff812fccb2>] ? > rwsem_down_read_failed+0x22/0x2b > Dec 2 02:13:06 appserver01 kernel: [124101.093353] [<ffffffff811965f4>] ? > call_rwsem_down_read_failed+0x14/0x30 > Dec 2 02:13:06 appserver01 kernel: [124101.093359] [<ffffffffa028f0bc>] ? > user_dlm_lock+0x0/0x47 [ocfs2_stack_user] > Dec 2 02:13:06 appserver01 kernel: [124101.093363] [<ffffffff810b885b>] ? > zone_watermark_ok+0x20/0xb1 > Dec 2 02:13:06 appserver01 kernel: [124101.093365] [<ffffffff812fc665>] ? > down_read+0x17/0x19 > Dec 2 02:13:06 appserver01 kernel: [124101.093371] [<ffffffffa02133b6>] ? > dlm_lock+0x56/0x149 [dlm] > Dec 2 02:13:06 appserver01 kernel: [124101.093374] [<ffffffff810c79c0>] ? > zone_statistics+0x3c/0x5d > Dec 2 02:13:06 appserver01 kernel: [124101.093377] [<ffffffffa028f0fe>] ? > user_dlm_lock+0x42/0x47 [ocfs2_stack_user] > Dec 2 02:13:06 appserver01 kernel: [124101.093380] [<ffffffffa028f000>] ? > fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user] > Dec 2 02:13:06 appserver01 kernel: [124101.093382] [<ffffffffa028f02d>] ? > fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user] > Dec 2 02:13:06 appserver01 kernel: [124101.093391] [<ffffffffa031587a>] ? > __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2] > Dec 2 02:13:06 appserver01 kernel: [124101.093395] [<ffffffff8100f657>] ? > __switch_to+0x140/0x297 > Dec 2 02:13:06 appserver01 kernel: [124101.093402] [<ffffffffa0315cd8>] ? > ocfs2_cluster_lock+0x15/0x17 [ocfs2] > Dec 2 02:13:06 appserver01 kernel: [124101.093408] [<ffffffffa03195c2>] ? > ocfs2_super_lock+0xc7/0x2a9 [ocfs2] > Dec 2 02:13:06 appserver01 kernel: [124101.093415] [<ffffffffa03195c2>] ? > ocfs2_super_lock+0xc7/0x2a9 [ocfs2] > Dec 2 02:13:06 appserver01 kernel: [124101.093421] [<ffffffffa0329f9e>] ? > __ocfs2_recovery_thread+0x0/0x122b [ocfs2] > Dec 2 02:13:06 appserver01 kernel: [124101.093428] [<ffffffffa032a07f>] ? > __ocfs2_recovery_thread+0xe1/0x122b [ocfs2] > Dec 2 02:13:06 appserver01 kernel: [124101.093430] [<ffffffff812fba90>] ? > thread_return+0x79/0xe0 > Dec 2 02:13:06 appserver01 kernel: [124101.093433] [<ffffffff8103a403>] ? > activate_task+0x22/0x28 > Dec 2 02:13:06 appserver01 kernel: [124101.093436] [<ffffffff8104a44f>] ? > try_to_wake_up+0x289/0x29b > Dec 2 02:13:06 appserver01 kernel: [124101.093443] [<ffffffffa0329f9e>] ? > __ocfs2_recovery_thread+0x0/0x122b [ocfs2] > Dec 2 02:13:06 appserver01 kernel: [124101.093446] [<ffffffff81064d79>] ? > kthread+0x79/0x81 > Dec 2 02:13:06 appserver01 kernel: [124101.093449] [<ffffffff81011baa>] ? > child_rip+0xa/0x20 > Dec 2 02:13:06 appserver01 kernel: [124101.093451] [<ffffffff81064d00>] ? > kthread+0x0/0x81 > Dec 2 02:13:06 appserver01 kernel: [124101.093453] [<ffffffff81011ba0>] ? > child_rip+0x0/0x20 > > Then a few moments passed. > > Dec 2 02:13:32 appserver01 kernel: [124127.071151] block drbd1: Handshake > successful: Agreed network protocol version 91 > Dec 2 02:13:32 appserver01 kernel: [124127.071157] block drbd1: conn( > WFConnection -> WFReportParams ) > Dec 2 02:13:32 appserver01 kernel: [124127.076732] block drbd1: Starting > asender thread (from drbd1_receiver [7526]) > Dec 2 02:13:32 appserver01 kernel: [124127.078447] block drbd1: > data-integrity-alg: <not-used> > Dec 2 02:13:32 appserver01 kernel: [124127.078456] block drbd1: > drbd_sync_handshake: > Dec 2 02:13:32 appserver01 kernel: [124127.078459] block drbd1: self > 7843E95E721AF0ED:54BC6F3AD7F42585:52FF69A8720BCEAC:BA309D9B7FCA3C07 > bits:115301551 flags:0 > Dec 2 02:13:32 appserver01 kernel: [124127.078461] block drbd1: peer > 54BC6F3AD7F42584:0000000000000000:0000000000000000:0000000000000000 > bits:115314775 flags:2 > Dec 2 02:13:32 appserver01 kernel: [124127.078464] block drbd1: > uuid_compare()=1 by rule 70 > Dec 2 02:13:32 appserver01 kernel: [124127.078465] block drbd1: Becoming > sync source due to disk states. > Dec 2 02:13:32 appserver01 kernel: [124127.078469] block drbd1: peer( > Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) > Dec 2 02:13:39 appserver01 kernel: [124134.091066] block drbd1: conn( > WFBitMapS -> SyncSource ) > Dec 2 02:13:39 appserver01 kernel: [124134.091078] block drbd1: Began > resync as SyncSource (will sync 461259100 KB [115314775 bits set]). > > And after yet some more moments passing it started to repeatedly post call > traces. Here is just one cycle of these traces. At this point the load was > critical and I must assume the server was unresponsive because the status > of the alarms didn't change until manual intervention. It kept posting call > traces for 4 minutes and then I must assume DRBD died because it was quiet > until reboot. > > Dec 2 02:15:06 appserver01 kernel: [124220.996240] ocfs2rec D > ffff88017e7fa350 0 26221 2 0x00000000 > Dec 2 02:15:06 appserver01 kernel: [124220.996244] ffff88017e7fa350 > 0000000000000046 ffff88018dad4000 0000000000000010 > Dec 2 02:15:06 appserver01 kernel: [124220.996247] 0000000000000616 > ffffea000455c168 000000000000f9e0 ffff88018dad5fd8 > Dec 2 02:15:06 appserver01 kernel: [124220.996250] 0000000000015780 > 0000000000015780 ffff88017e266350 ffff88017e266648 > Dec 2 02:15:06 appserver01 kernel: [124220.996252] Call Trace: > Dec 2 02:15:06 appserver01 kernel: [124220.996260] [<ffffffff812fcc4f>] ? > rwsem_down_failed_common+0x8c/0xa8 > Dec 2 02:15:06 appserver01 kernel: [124220.996262] [<ffffffff812fccb2>] ? > rwsem_down_read_failed+0x22/0x2b > Dec 2 02:15:06 appserver01 kernel: [124220.996267] [<ffffffff811965f4>] ? > call_rwsem_down_read_failed+0x14/0x30 > Dec 2 02:15:06 appserver01 kernel: [124220.996273] [<ffffffffa028f0bc>] ? > user_dlm_lock+0x0/0x47 [ocfs2_stack_user] > Dec 2 02:15:06 appserver01 kernel: [124220.996277] [<ffffffff810b885b>] ? > zone_watermark_ok+0x20/0xb1 > Dec 2 02:15:06 appserver01 kernel: [124220.996279] [<ffffffff812fc665>] ? > down_read+0x17/0x19 > Dec 2 02:15:06 appserver01 kernel: [124220.996285] [<ffffffffa02133b6>] ? > dlm_lock+0x56/0x149 [dlm] > Dec 2 02:15:06 appserver01 kernel: [124220.996289] [<ffffffff810c79c0>] ? > zone_statistics+0x3c/0x5d > Dec 2 02:15:06 appserver01 kernel: [124220.996291] [<ffffffffa028f0fe>] ? > user_dlm_lock+0x42/0x47 [ocfs2_stack_user] > Dec 2 02:15:06 appserver01 kernel: [124220.996294] [<ffffffffa028f000>] ? > fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user] > Dec 2 02:15:06 appserver01 kernel: [124220.996297] [<ffffffffa028f02d>] ? > fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user] > Dec 2 02:15:06 appserver01 kernel: [124220.996305] [<ffffffffa031587a>] ? > __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996310] [<ffffffff8100f657>] ? > __switch_to+0x140/0x297 > Dec 2 02:15:06 appserver01 kernel: [124220.996317] [<ffffffffa0315cd8>] ? > ocfs2_cluster_lock+0x15/0x17 [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996323] [<ffffffffa03195c2>] ? > ocfs2_super_lock+0xc7/0x2a9 [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996330] [<ffffffffa03195c2>] ? > ocfs2_super_lock+0xc7/0x2a9 [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996337] [<ffffffffa0329f9e>] ? > __ocfs2_recovery_thread+0x0/0x122b [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996343] [<ffffffffa032a07f>] ? > __ocfs2_recovery_thread+0xe1/0x122b [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996346] [<ffffffff812fba90>] ? > thread_return+0x79/0xe0 > Dec 2 02:15:06 appserver01 kernel: [124220.996349] [<ffffffff8103a403>] ? > activate_task+0x22/0x28 > Dec 2 02:15:06 appserver01 kernel: [124220.996352] [<ffffffff8104a44f>] ? > try_to_wake_up+0x289/0x29b > Dec 2 02:15:06 appserver01 kernel: [124220.996359] [<ffffffffa0329f9e>] ? > __ocfs2_recovery_thread+0x0/0x122b [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996362] [<ffffffff81064d79>] ? > kthread+0x79/0x81 > Dec 2 02:15:06 appserver01 kernel: [124220.996364] [<ffffffff81011baa>] ? > child_rip+0xa/0x20 > Dec 2 02:15:06 appserver01 kernel: [124220.996366] [<ffffffff81064d00>] ? > kthread+0x0/0x81 > Dec 2 02:15:06 appserver01 kernel: [124220.996368] [<ffffffff81011ba0>] ? > child_rip+0x0/0x20 > Dec 2 02:15:06 appserver01 kernel: [124220.996556] ls D > ffff8801bb5a2a60 0 26318 26317 0x00000000 > Dec 2 02:15:06 appserver01 kernel: [124220.996559] ffff8801bb5a2a60 > 0000000000000082 ffff8801bb7734c8 ffffffff81103ab9 > Dec 2 02:15:06 appserver01 kernel: [124220.996561] ffff88016843dd58 > ffff88016843ddf8 000000000000f9e0 ffff88016843dfd8 > Dec 2 02:15:06 appserver01 kernel: [124220.996563] 0000000000015780 > 0000000000015780 ffff8801bcf1a350 ffff8801bcf1a648 > Dec 2 02:15:06 appserver01 kernel: [124220.996566] Call Trace: > Dec 2 02:15:06 appserver01 kernel: [124220.996570] [<ffffffff81103ab9>] ? > mntput_no_expire+0x23/0xee > Dec 2 02:15:06 appserver01 kernel: [124220.996573] [<ffffffff810f75af>] ? > __link_path_walk+0x6f0/0x6f5 > Dec 2 02:15:06 appserver01 kernel: [124220.996580] [<ffffffffa03296af>] ? > ocfs2_wait_for_recovery+0x9d/0xb7 [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996582] [<ffffffff81065046>] ? > autoremove_wake_function+0x0/0x2e > Dec 2 02:15:06 appserver01 kernel: [124220.996589] [<ffffffffa0319923>] ? > ocfs2_inode_lock_full_nested+0x16b/0xb2c [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996596] [<ffffffffa0324f2d>] ? > ocfs2_inode_revalidate+0x145/0x221 [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996603] [<ffffffffa03208d9>] ? > ocfs2_getattr+0x79/0x16a [ocfs2] > Dec 2 02:15:06 appserver01 kernel: [124220.996606] [<ffffffff810f2591>] ? > vfs_fstatat+0x43/0x57 > Dec 2 02:15:06 appserver01 kernel: [124220.996609] [<ffffffff810f25fb>] ? > sys_newlstat+0x11/0x30 > Dec 2 02:15:06 appserver01 kernel: [124220.996612] [<ffffffff812ff306>] ? > do_page_fault+0x2e0/0x2fc > Dec 2 02:15:06 appserver01 kernel: [124220.996614] [<ffffffff812fd1a5>] ? > page_fault+0x25/0x30 > Dec 2 02:15:06 appserver01 kernel: [124220.996616] [<ffffffff81010b42>] ? > system_call_fastpath+0x16/0x1b > Dec 2 02:17:06 appserver01 kernel: [124340.899149] events/0 D > ffff88017e7faa60 0 6 2 0x00000000 > Dec 2 02:17:06 appserver01 kernel: [124340.899153] ffff88017e7faa60 > 0000000000000046 ffff880006e157e8 ffff8801bf09e388 > Dec 2 02:17:06 appserver01 kernel: [124340.899157] ffff8801bc88f1b8 > ffff8801bc88f1a8 000000000000f9e0 ffff8801bf0b3fd8 > Dec 2 02:17:06 appserver01 kernel: [124340.899160] 0000000000015780 > 0000000000015780 ffff8801bf09e350 ffff8801bf09e648 > Dec 2 02:17:06 appserver01 kernel: [124340.899162] Call Trace: > Dec 2 02:17:06 appserver01 kernel: [124340.899169] [<ffffffff812fba90>] ? > thread_return+0x79/0xe0 > Dec 2 02:17:06 appserver01 kernel: [124340.899172] [<ffffffff812fcc4f>] ? > rwsem_down_failed_common+0x8c/0xa8 > Dec 2 02:17:06 appserver01 kernel: [124340.899175] [<ffffffff812fccb2>] ? > rwsem_down_read_failed+0x22/0x2b > Dec 2 02:17:06 appserver01 kernel: [124340.899179] [<ffffffff811965f4>] ? > call_rwsem_down_read_failed+0x14/0x30 > Dec 2 02:17:06 appserver01 kernel: [124340.899185] [<ffffffffa028f0bc>] ? > user_dlm_lock+0x0/0x47 [ocfs2_stack_user] > Dec 2 02:17:06 appserver01 kernel: [124340.899188] [<ffffffff812fc665>] ? > down_read+0x17/0x19 > Dec 2 02:17:06 appserver01 kernel: [124340.899193] [<ffffffffa02133b6>] ? > dlm_lock+0x56/0x149 [dlm] > Dec 2 02:17:06 appserver01 kernel: [124340.899198] [<ffffffff810168c1>] ? > sched_clock+0x5/0x8 > Dec 2 02:17:06 appserver01 kernel: [124340.899202] [<ffffffff81049412>] ? > update_rq_clock+0xf/0x28 > Dec 2 02:17:06 appserver01 kernel: [124340.899205] [<ffffffff8104a44f>] ? > try_to_wake_up+0x289/0x29b > Dec 2 02:17:06 appserver01 kernel: [124340.899209] [<ffffffff810fd0ce>] ? > pollwake+0x53/0x59 > Dec 2 02:17:06 appserver01 kernel: [124340.899211] [<ffffffff8104a461>] ? > default_wake_function+0x0/0x9 > Dec 2 02:17:06 appserver01 kernel: [124340.899214] [<ffffffffa028f0fe>] ? > user_dlm_lock+0x42/0x47 [ocfs2_stack_user] > Dec 2 02:17:06 appserver01 kernel: [124340.899217] [<ffffffffa028f000>] ? > fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user] > Dec 2 02:17:06 appserver01 kernel: [124340.899219] [<ffffffffa028f02d>] ? > fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user] > Dec 2 02:17:06 appserver01 kernel: [124340.899228] [<ffffffffa031587a>] ? > __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2] > Dec 2 02:17:06 appserver01 kernel: [124340.899231] [<ffffffff812fba90>] ? > thread_return+0x79/0xe0 > Dec 2 02:17:06 appserver01 kernel: [124340.899237] [<ffffffffa0315cd8>] ? > ocfs2_cluster_lock+0x15/0x17 [ocfs2] > Dec 2 02:17:06 appserver01 kernel: [124340.899244] [<ffffffffa0317472>] ? > ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2] > Dec 2 02:17:06 appserver01 kernel: [124340.899250] [<ffffffffa0317472>] ? > ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2] > Dec 2 02:17:06 appserver01 kernel: [124340.899257] [<ffffffffa0328abe>] ? > ocfs2_queue_orphan_scan+0x29/0x126 [ocfs2] > Dec 2 02:17:06 appserver01 kernel: [124340.899259] [<ffffffff812fc3c6>] ? > mutex_lock+0xd/0x31 > Dec 2 02:17:06 appserver01 kernel: [124340.899266] [<ffffffffa0328be0>] ? > ocfs2_orphan_scan_work+0x25/0x4d [ocfs2] > Dec 2 02:17:06 appserver01 kernel: [124340.899270] [<ffffffff81061a13>] ? > worker_thread+0x188/0x21d > Dec 2 02:17:06 appserver01 kernel: [124340.899276] [<ffffffffa0328bbb>] ? > ocfs2_orphan_scan_work+0x0/0x4d [ocfs2] > Dec 2 02:17:06 appserver01 kernel: [124340.899280] [<ffffffff81065046>] ? > autoremove_wake_function+0x0/0x2e > Dec 2 02:17:06 appserver01 kernel: [124340.899282] [<ffffffff8106188b>] ? > worker_thread+0x0/0x21d > Dec 2 02:17:06 appserver01 kernel: [124340.899284] [<ffffffff81064d79>] ? > kthread+0x79/0x81 > Dec 2 02:17:06 appserver01 kernel: [124340.899287] [<ffffffff81011baa>] ? > child_rip+0xa/0x20 > Dec 2 02:17:06 appserver01 kernel: [124340.899289] [<ffffffff81064d00>] ? > kthread+0x0/0x81 > Dec 2 02:17:06 appserver01 kernel: [124340.899291] [<ffffffff81011ba0>] ? > child_rip+0x0/0x20 > > -- > Hälsningar / Greetings > > Stefan Midjich > [De omnibus dubitandum] -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. __ please don't Cc me, but send to list -- I'm subscribed