On Sun, Dec 02, 2012 at 11:52:47AM +0100, Stefan Midjich wrote:
> Fortunately the data volume was only mounted but not in use.
>
> I found a similar list post on
> http://lists.linbit.com/pipermail/drbd-user/2008-April/009156.html but it
> had no replies on what could cause this. I've been thinking the DRBD
> traffic should be on a separate network but have not set this up yet. Right
> now the DRBD traffic goes over the same vNetwork that other traffic goes
> over, including multicast VIP traffic from both LVS and pacemaker clusters.
>
> In short, the SyncSource node's load average climbed to a critical level
> and it became unresponsive. This is a VM setup split over different
> physical ESX hosts, but even the local console was dead, so a forced reset
> was in order.
>
> The cluster services came up fine, corosync+pacemaker+o2cb+ocfs2_dlm. The
With cluster file systems,
you need tested and confirmed working fencing, aka STONITH.
Fencing/STONITH is a hard requirement.
This is not negotiable.
If you try to get away without it,
and the network layer has so much as a hiccup,
your IO will block.
Hard.
Up to here, this was not even considering DRBD...
If you want to use cluster file systems on top of DRBD,
you *additionally* need to integrate DRBD
replication link breakage into your fencing setup.
Some keywords to search for:
fencing resource-and-stonith; fence-peer handler; obliterate-peer;
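
As a rough sketch of where those keywords end up in the resource
definition (DRBD 8.3 syntax assumed; the two handler script paths are
assumptions about what your packages install, so check that they exist
and that the handler you pick really fences the peer in a dual-primary
setup before relying on it):

```
resource shared0 {
  disk {
    # On loss of the replication link: freeze IO and call the
    # fence-peer handler; IO resumes only once the handler has returned.
    fencing resource-and-stonith;
  }
  handlers {
    # ASSUMPTION: script locations; adjust to your installation.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  net {
    allow-two-primaries;
  }
  # meta-disk, device, syncer and the "on <host>" sections stay as they are.
}
```

This only covers the DRBD side. Node-level fencing (STONITH) still has
to be configured and tested in the cluster manager itself.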
> cluster is Debian Squeeze with corosync, pacemaker, openais and cman from
> backports. Only corosync and pacemaker are services actually used. Other
> packages are only installed for access to things like fencing and resource
> agents. DRBD 8.3.7 is used from the Debian stable repository.
>
> The DRBD config is mostly stock; here is the resource definition.
>
> resource shared0 {
>   meta-disk internal;
>   device /dev/drbd1;
>   syncer {
>     verify-alg sha1;
>   }
>   net {
>     allow-two-primaries;
>   }
>   on appserver01 {
>     disk /dev/mapper/shared0_appserver01-lv0;
>     address 10.221.182.31:7789;
>   }
>   on appserver02 {
>     disk /dev/mapper/shared0_appserver02-lv0;
>     address 10.221.182.32:7789;
>   }
> }
>
> The logs on the SyncSource node show the following happening at the time of
> the failure.
>
> Dec 2 02:09:56 appserver01 kernel: [123911.353113] block drbd1: peer(
> Primary -> Unknown ) conn( SyncSource -> NetworkFailure )
> Dec 2 02:09:56 appserver01 kernel: [123911.353123] block drbd1: asender
> terminated
> Dec 2 02:09:56 appserver01 kernel: [123911.353126] block drbd1:
> Terminating drbd1_asender
> Dec 2 02:09:56 appserver01 kernel: [123911.353967] block drbd1: Connection
> closed
> Dec 2 02:09:56 appserver01 kernel: [123911.353974] block drbd1: conn(
> NetworkFailure -> Unconnected )
> Dec 2 02:09:56 appserver01 kernel: [123911.353977] block drbd1: receiver
> terminated
> Dec 2 02:09:56 appserver01 kernel: [123911.353978] block drbd1: Restarting
> drbd1_receiver
> Dec 2 02:09:56 appserver01 kernel: [123911.353980] block drbd1: receiver
> (re)started
> Dec 2 02:09:56 appserver01 kernel: [123911.353983] block drbd1: conn(
> Unconnected -> WFConnection )
> Dec 2 02:13:06 appserver01 kernel: [124101.093326] ocfs2rec D
> ffff88017e7fa350 0 26221 2 0x00000000
> Dec 2 02:13:06 appserver01 kernel: [124101.093330] ffff88017e7fa350
> 0000000000000046 ffff88018dad4000 0000000000000010
> Dec 2 02:13:06 appserver01 kernel: [124101.093333] 0000000000000616
> ffffea000455c168 000000000000f9e0 ffff88018dad5fd8
> Dec 2 02:13:06 appserver01 kernel: [124101.093335] 0000000000015780
> 0000000000015780 ffff88017e266350 ffff88017e266648
> Dec 2 02:13:06 appserver01 kernel: [124101.093338] Call Trace:
> Dec 2 02:13:06 appserver01 kernel: [124101.093346] [<ffffffff812fcc4f>] ?
> rwsem_down_failed_common+0x8c/0xa8
> Dec 2 02:13:06 appserver01 kernel: [124101.093348] [<ffffffff812fccb2>] ?
> rwsem_down_read_failed+0x22/0x2b
> Dec 2 02:13:06 appserver01 kernel: [124101.093353] [<ffffffff811965f4>] ?
> call_rwsem_down_read_failed+0x14/0x30
> Dec 2 02:13:06 appserver01 kernel: [124101.093359] [<ffffffffa028f0bc>] ?
> user_dlm_lock+0x0/0x47 [ocfs2_stack_user]
> Dec 2 02:13:06 appserver01 kernel: [124101.093363] [<ffffffff810b885b>] ?
> zone_watermark_ok+0x20/0xb1
> Dec 2 02:13:06 appserver01 kernel: [124101.093365] [<ffffffff812fc665>] ?
> down_read+0x17/0x19
> Dec 2 02:13:06 appserver01 kernel: [124101.093371] [<ffffffffa02133b6>] ?
> dlm_lock+0x56/0x149 [dlm]
> Dec 2 02:13:06 appserver01 kernel: [124101.093374] [<ffffffff810c79c0>] ?
> zone_statistics+0x3c/0x5d
> Dec 2 02:13:06 appserver01 kernel: [124101.093377] [<ffffffffa028f0fe>] ?
> user_dlm_lock+0x42/0x47 [ocfs2_stack_user]
> Dec 2 02:13:06 appserver01 kernel: [124101.093380] [<ffffffffa028f000>] ?
> fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user]
> Dec 2 02:13:06 appserver01 kernel: [124101.093382] [<ffffffffa028f02d>] ?
> fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user]
> Dec 2 02:13:06 appserver01 kernel: [124101.093391] [<ffffffffa031587a>] ?
> __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2]
> Dec 2 02:13:06 appserver01 kernel: [124101.093395] [<ffffffff8100f657>] ?
> __switch_to+0x140/0x297
> Dec 2 02:13:06 appserver01 kernel: [124101.093402] [<ffffffffa0315cd8>] ?
> ocfs2_cluster_lock+0x15/0x17 [ocfs2]
> Dec 2 02:13:06 appserver01 kernel: [124101.093408] [<ffffffffa03195c2>] ?
> ocfs2_super_lock+0xc7/0x2a9 [ocfs2]
> Dec 2 02:13:06 appserver01 kernel: [124101.093415] [<ffffffffa03195c2>] ?
> ocfs2_super_lock+0xc7/0x2a9 [ocfs2]
> Dec 2 02:13:06 appserver01 kernel: [124101.093421] [<ffffffffa0329f9e>] ?
> __ocfs2_recovery_thread+0x0/0x122b [ocfs2]
> Dec 2 02:13:06 appserver01 kernel: [124101.093428] [<ffffffffa032a07f>] ?
> __ocfs2_recovery_thread+0xe1/0x122b [ocfs2]
> Dec 2 02:13:06 appserver01 kernel: [124101.093430] [<ffffffff812fba90>] ?
> thread_return+0x79/0xe0
> Dec 2 02:13:06 appserver01 kernel: [124101.093433] [<ffffffff8103a403>] ?
> activate_task+0x22/0x28
> Dec 2 02:13:06 appserver01 kernel: [124101.093436] [<ffffffff8104a44f>] ?
> try_to_wake_up+0x289/0x29b
> Dec 2 02:13:06 appserver01 kernel: [124101.093443] [<ffffffffa0329f9e>] ?
> __ocfs2_recovery_thread+0x0/0x122b [ocfs2]
> Dec 2 02:13:06 appserver01 kernel: [124101.093446] [<ffffffff81064d79>] ?
> kthread+0x79/0x81
> Dec 2 02:13:06 appserver01 kernel: [124101.093449] [<ffffffff81011baa>] ?
> child_rip+0xa/0x20
> Dec 2 02:13:06 appserver01 kernel: [124101.093451] [<ffffffff81064d00>] ?
> kthread+0x0/0x81
> Dec 2 02:13:06 appserver01 kernel: [124101.093453] [<ffffffff81011ba0>] ?
> child_rip+0x0/0x20
>
> Then a few moments passed.
>
> Dec 2 02:13:32 appserver01 kernel: [124127.071151] block drbd1: Handshake
> successful: Agreed network protocol version 91
> Dec 2 02:13:32 appserver01 kernel: [124127.071157] block drbd1: conn(
> WFConnection -> WFReportParams )
> Dec 2 02:13:32 appserver01 kernel: [124127.076732] block drbd1: Starting
> asender thread (from drbd1_receiver [7526])
> Dec 2 02:13:32 appserver01 kernel: [124127.078447] block drbd1:
> data-integrity-alg: <not-used>
> Dec 2 02:13:32 appserver01 kernel: [124127.078456] block drbd1:
> drbd_sync_handshake:
> Dec 2 02:13:32 appserver01 kernel: [124127.078459] block drbd1: self
> 7843E95E721AF0ED:54BC6F3AD7F42585:52FF69A8720BCEAC:BA309D9B7FCA3C07
> bits:115301551 flags:0
> Dec 2 02:13:32 appserver01 kernel: [124127.078461] block drbd1: peer
> 54BC6F3AD7F42584:0000000000000000:0000000000000000:0000000000000000
> bits:115314775 flags:2
> Dec 2 02:13:32 appserver01 kernel: [124127.078464] block drbd1:
> uuid_compare()=1 by rule 70
> Dec 2 02:13:32 appserver01 kernel: [124127.078465] block drbd1: Becoming
> sync source due to disk states.
> Dec 2 02:13:32 appserver01 kernel: [124127.078469] block drbd1: peer(
> Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
> Dec 2 02:13:39 appserver01 kernel: [124134.091066] block drbd1: conn(
> WFBitMapS -> SyncSource )
> Dec 2 02:13:39 appserver01 kernel: [124134.091078] block drbd1: Began
> resync as SyncSource (will sync 461259100 KB [115314775 bits set]).
>
> After some more time had passed, it started to log call traces repeatedly.
> Here is just one cycle of these traces. At this point the load was
> critical, and I assume the server was unresponsive because the status
> of the alarms didn't change until manual intervention. It kept logging call
> traces for 4 minutes, and then I assume DRBD died because the log was quiet
> until reboot.
>
> Dec 2 02:15:06 appserver01 kernel: [124220.996240] ocfs2rec D
> ffff88017e7fa350 0 26221 2 0x00000000
> Dec 2 02:15:06 appserver01 kernel: [124220.996244] ffff88017e7fa350
> 0000000000000046 ffff88018dad4000 0000000000000010
> Dec 2 02:15:06 appserver01 kernel: [124220.996247] 0000000000000616
> ffffea000455c168 000000000000f9e0 ffff88018dad5fd8
> Dec 2 02:15:06 appserver01 kernel: [124220.996250] 0000000000015780
> 0000000000015780 ffff88017e266350 ffff88017e266648
> Dec 2 02:15:06 appserver01 kernel: [124220.996252] Call Trace:
> Dec 2 02:15:06 appserver01 kernel: [124220.996260] [<ffffffff812fcc4f>] ?
> rwsem_down_failed_common+0x8c/0xa8
> Dec 2 02:15:06 appserver01 kernel: [124220.996262] [<ffffffff812fccb2>] ?
> rwsem_down_read_failed+0x22/0x2b
> Dec 2 02:15:06 appserver01 kernel: [124220.996267] [<ffffffff811965f4>] ?
> call_rwsem_down_read_failed+0x14/0x30
> Dec 2 02:15:06 appserver01 kernel: [124220.996273] [<ffffffffa028f0bc>] ?
> user_dlm_lock+0x0/0x47 [ocfs2_stack_user]
> Dec 2 02:15:06 appserver01 kernel: [124220.996277] [<ffffffff810b885b>] ?
> zone_watermark_ok+0x20/0xb1
> Dec 2 02:15:06 appserver01 kernel: [124220.996279] [<ffffffff812fc665>] ?
> down_read+0x17/0x19
> Dec 2 02:15:06 appserver01 kernel: [124220.996285] [<ffffffffa02133b6>] ?
> dlm_lock+0x56/0x149 [dlm]
> Dec 2 02:15:06 appserver01 kernel: [124220.996289] [<ffffffff810c79c0>] ?
> zone_statistics+0x3c/0x5d
> Dec 2 02:15:06 appserver01 kernel: [124220.996291] [<ffffffffa028f0fe>] ?
> user_dlm_lock+0x42/0x47 [ocfs2_stack_user]
> Dec 2 02:15:06 appserver01 kernel: [124220.996294] [<ffffffffa028f000>] ?
> fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user]
> Dec 2 02:15:06 appserver01 kernel: [124220.996297] [<ffffffffa028f02d>] ?
> fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user]
> Dec 2 02:15:06 appserver01 kernel: [124220.996305] [<ffffffffa031587a>] ?
> __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996310] [<ffffffff8100f657>] ?
> __switch_to+0x140/0x297
> Dec 2 02:15:06 appserver01 kernel: [124220.996317] [<ffffffffa0315cd8>] ?
> ocfs2_cluster_lock+0x15/0x17 [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996323] [<ffffffffa03195c2>] ?
> ocfs2_super_lock+0xc7/0x2a9 [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996330] [<ffffffffa03195c2>] ?
> ocfs2_super_lock+0xc7/0x2a9 [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996337] [<ffffffffa0329f9e>] ?
> __ocfs2_recovery_thread+0x0/0x122b [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996343] [<ffffffffa032a07f>] ?
> __ocfs2_recovery_thread+0xe1/0x122b [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996346] [<ffffffff812fba90>] ?
> thread_return+0x79/0xe0
> Dec 2 02:15:06 appserver01 kernel: [124220.996349] [<ffffffff8103a403>] ?
> activate_task+0x22/0x28
> Dec 2 02:15:06 appserver01 kernel: [124220.996352] [<ffffffff8104a44f>] ?
> try_to_wake_up+0x289/0x29b
> Dec 2 02:15:06 appserver01 kernel: [124220.996359] [<ffffffffa0329f9e>] ?
> __ocfs2_recovery_thread+0x0/0x122b [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996362] [<ffffffff81064d79>] ?
> kthread+0x79/0x81
> Dec 2 02:15:06 appserver01 kernel: [124220.996364] [<ffffffff81011baa>] ?
> child_rip+0xa/0x20
> Dec 2 02:15:06 appserver01 kernel: [124220.996366] [<ffffffff81064d00>] ?
> kthread+0x0/0x81
> Dec 2 02:15:06 appserver01 kernel: [124220.996368] [<ffffffff81011ba0>] ?
> child_rip+0x0/0x20
> Dec 2 02:15:06 appserver01 kernel: [124220.996556] ls D
> ffff8801bb5a2a60 0 26318 26317 0x00000000
> Dec 2 02:15:06 appserver01 kernel: [124220.996559] ffff8801bb5a2a60
> 0000000000000082 ffff8801bb7734c8 ffffffff81103ab9
> Dec 2 02:15:06 appserver01 kernel: [124220.996561] ffff88016843dd58
> ffff88016843ddf8 000000000000f9e0 ffff88016843dfd8
> Dec 2 02:15:06 appserver01 kernel: [124220.996563] 0000000000015780
> 0000000000015780 ffff8801bcf1a350 ffff8801bcf1a648
> Dec 2 02:15:06 appserver01 kernel: [124220.996566] Call Trace:
> Dec 2 02:15:06 appserver01 kernel: [124220.996570] [<ffffffff81103ab9>] ?
> mntput_no_expire+0x23/0xee
> Dec 2 02:15:06 appserver01 kernel: [124220.996573] [<ffffffff810f75af>] ?
> __link_path_walk+0x6f0/0x6f5
> Dec 2 02:15:06 appserver01 kernel: [124220.996580] [<ffffffffa03296af>] ?
> ocfs2_wait_for_recovery+0x9d/0xb7 [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996582] [<ffffffff81065046>] ?
> autoremove_wake_function+0x0/0x2e
> Dec 2 02:15:06 appserver01 kernel: [124220.996589] [<ffffffffa0319923>] ?
> ocfs2_inode_lock_full_nested+0x16b/0xb2c [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996596] [<ffffffffa0324f2d>] ?
> ocfs2_inode_revalidate+0x145/0x221 [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996603] [<ffffffffa03208d9>] ?
> ocfs2_getattr+0x79/0x16a [ocfs2]
> Dec 2 02:15:06 appserver01 kernel: [124220.996606] [<ffffffff810f2591>] ?
> vfs_fstatat+0x43/0x57
> Dec 2 02:15:06 appserver01 kernel: [124220.996609] [<ffffffff810f25fb>] ?
> sys_newlstat+0x11/0x30
> Dec 2 02:15:06 appserver01 kernel: [124220.996612] [<ffffffff812ff306>] ?
> do_page_fault+0x2e0/0x2fc
> Dec 2 02:15:06 appserver01 kernel: [124220.996614] [<ffffffff812fd1a5>] ?
> page_fault+0x25/0x30
> Dec 2 02:15:06 appserver01 kernel: [124220.996616] [<ffffffff81010b42>] ?
> system_call_fastpath+0x16/0x1b
> Dec 2 02:17:06 appserver01 kernel: [124340.899149] events/0 D
> ffff88017e7faa60 0 6 2 0x00000000
> Dec 2 02:17:06 appserver01 kernel: [124340.899153] ffff88017e7faa60
> 0000000000000046 ffff880006e157e8 ffff8801bf09e388
> Dec 2 02:17:06 appserver01 kernel: [124340.899157] ffff8801bc88f1b8
> ffff8801bc88f1a8 000000000000f9e0 ffff8801bf0b3fd8
> Dec 2 02:17:06 appserver01 kernel: [124340.899160] 0000000000015780
> 0000000000015780 ffff8801bf09e350 ffff8801bf09e648
> Dec 2 02:17:06 appserver01 kernel: [124340.899162] Call Trace:
> Dec 2 02:17:06 appserver01 kernel: [124340.899169] [<ffffffff812fba90>] ?
> thread_return+0x79/0xe0
> Dec 2 02:17:06 appserver01 kernel: [124340.899172] [<ffffffff812fcc4f>] ?
> rwsem_down_failed_common+0x8c/0xa8
> Dec 2 02:17:06 appserver01 kernel: [124340.899175] [<ffffffff812fccb2>] ?
> rwsem_down_read_failed+0x22/0x2b
> Dec 2 02:17:06 appserver01 kernel: [124340.899179] [<ffffffff811965f4>] ?
> call_rwsem_down_read_failed+0x14/0x30
> Dec 2 02:17:06 appserver01 kernel: [124340.899185] [<ffffffffa028f0bc>] ?
> user_dlm_lock+0x0/0x47 [ocfs2_stack_user]
> Dec 2 02:17:06 appserver01 kernel: [124340.899188] [<ffffffff812fc665>] ?
> down_read+0x17/0x19
> Dec 2 02:17:06 appserver01 kernel: [124340.899193] [<ffffffffa02133b6>] ?
> dlm_lock+0x56/0x149 [dlm]
> Dec 2 02:17:06 appserver01 kernel: [124340.899198] [<ffffffff810168c1>] ?
> sched_clock+0x5/0x8
> Dec 2 02:17:06 appserver01 kernel: [124340.899202] [<ffffffff81049412>] ?
> update_rq_clock+0xf/0x28
> Dec 2 02:17:06 appserver01 kernel: [124340.899205] [<ffffffff8104a44f>] ?
> try_to_wake_up+0x289/0x29b
> Dec 2 02:17:06 appserver01 kernel: [124340.899209] [<ffffffff810fd0ce>] ?
> pollwake+0x53/0x59
> Dec 2 02:17:06 appserver01 kernel: [124340.899211] [<ffffffff8104a461>] ?
> default_wake_function+0x0/0x9
> Dec 2 02:17:06 appserver01 kernel: [124340.899214] [<ffffffffa028f0fe>] ?
> user_dlm_lock+0x42/0x47 [ocfs2_stack_user]
> Dec 2 02:17:06 appserver01 kernel: [124340.899217] [<ffffffffa028f000>] ?
> fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user]
> Dec 2 02:17:06 appserver01 kernel: [124340.899219] [<ffffffffa028f02d>] ?
> fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user]
> Dec 2 02:17:06 appserver01 kernel: [124340.899228] [<ffffffffa031587a>] ?
> __ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2]
> Dec 2 02:17:06 appserver01 kernel: [124340.899231] [<ffffffff812fba90>] ?
> thread_return+0x79/0xe0
> Dec 2 02:17:06 appserver01 kernel: [124340.899237] [<ffffffffa0315cd8>] ?
> ocfs2_cluster_lock+0x15/0x17 [ocfs2]
> Dec 2 02:17:06 appserver01 kernel: [124340.899244] [<ffffffffa0317472>] ?
> ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2]
> Dec 2 02:17:06 appserver01 kernel: [124340.899250] [<ffffffffa0317472>] ?
> ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2]
> Dec 2 02:17:06 appserver01 kernel: [124340.899257] [<ffffffffa0328abe>] ?
> ocfs2_queue_orphan_scan+0x29/0x126 [ocfs2]
> Dec 2 02:17:06 appserver01 kernel: [124340.899259] [<ffffffff812fc3c6>] ?
> mutex_lock+0xd/0x31
> Dec 2 02:17:06 appserver01 kernel: [124340.899266] [<ffffffffa0328be0>] ?
> ocfs2_orphan_scan_work+0x25/0x4d [ocfs2]
> Dec 2 02:17:06 appserver01 kernel: [124340.899270] [<ffffffff81061a13>] ?
> worker_thread+0x188/0x21d
> Dec 2 02:17:06 appserver01 kernel: [124340.899276] [<ffffffffa0328bbb>] ?
> ocfs2_orphan_scan_work+0x0/0x4d [ocfs2]
> Dec 2 02:17:06 appserver01 kernel: [124340.899280] [<ffffffff81065046>] ?
> autoremove_wake_function+0x0/0x2e
> Dec 2 02:17:06 appserver01 kernel: [124340.899282] [<ffffffff8106188b>] ?
> worker_thread+0x0/0x21d
> Dec 2 02:17:06 appserver01 kernel: [124340.899284] [<ffffffff81064d79>] ?
> kthread+0x79/0x81
> Dec 2 02:17:06 appserver01 kernel: [124340.899287] [<ffffffff81011baa>] ?
> child_rip+0xa/0x20
> Dec 2 02:17:06 appserver01 kernel: [124340.899289] [<ffffffff81064d00>] ?
> kthread+0x0/0x81
> Dec 2 02:17:06 appserver01 kernel: [124340.899291] [<ffffffff81011ba0>] ?
> child_rip+0x0/0x20
>
> --
> Hälsningar / Greetings
>
> Stefan Midjich
> [De omnibus dubitandum]
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed