[DRBD-user] NetworkFailure after 2 days of operation in active/active

Stefan Midjich swehack at gmail.com
Sun Dec 2 11:52:47 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Fortunately the data volume was only mounted but not in use.

I found a similar list post on
http://lists.linbit.com/pipermail/drbd-user/2008-April/009156.html but it
had no replies on what could cause this. I've been thinking the DRBD
traffic should be on a separate network but have not set this up yet. Right
now the DRBD traffic goes over the same vNetwork that other traffic goes
over, including multicast VIP traffic form both LVS and pacemaker clusters.

In words the SyncSource node started using a critical load average of
resources and became unresponsive. This is a VM setup split over different
physical ESX hosts but even the local console was dead. So a forced reset
was in order.

The cluster services came up fine, corosync+pacemaker+o2cb+ocfs2_dlm. The
cluster is Debian Squeeze with corosync, pacemaker, openais and cman from
backports. Only corosync and pacemaker are services actually used. Other
packages are only installed for access to things like fencing and resource
agents. Drbd 8.3.7 is used from Debian stable repository.

The drbd config is mostly stock, here is the reource definition.

resource shared0 {
 meta-disk internal;
 device  /dev/drbd1;
 syncer {
  verify-alg sha1;
 }
 net {
  allow-two-primaries;
 }
 on appserver01 {
  disk   /dev/mapper/shared0_appserver01-lv0;
  address  10.221.182.31:7789;
 }
 on appserver02 {
  disk   /dev/mapper/shared0_appserver02-lv0;
  address  10.221.182.32:7789;
 }
}

The logs on the SyncSource node show the following happening at the time of
the failure.

Dec  2 02:09:56 appserver01 kernel: [123911.353113] block drbd1: peer(
Primary -> Unknown ) conn( SyncSource -> NetworkFailure )
Dec  2 02:09:56 appserver01 kernel: [123911.353123] block drbd1: asender
terminated
Dec  2 02:09:56 appserver01 kernel: [123911.353126] block drbd1:
Terminating drbd1_asender
Dec  2 02:09:56 appserver01 kernel: [123911.353967] block drbd1: Connection
closed
Dec  2 02:09:56 appserver01 kernel: [123911.353974] block drbd1: conn(
NetworkFailure -> Unconnected )
Dec  2 02:09:56 appserver01 kernel: [123911.353977] block drbd1: receiver
terminated
Dec  2 02:09:56 appserver01 kernel: [123911.353978] block drbd1: Restarting
drbd1_receiver
Dec  2 02:09:56 appserver01 kernel: [123911.353980] block drbd1: receiver
(re)started
Dec  2 02:09:56 appserver01 kernel: [123911.353983] block drbd1: conn(
Unconnected -> WFConnection )
Dec  2 02:13:06 appserver01 kernel: [124101.093326] ocfs2rec      D
ffff88017e7fa350     0 26221      2 0x00000000
Dec  2 02:13:06 appserver01 kernel: [124101.093330]  ffff88017e7fa350
0000000000000046 ffff88018dad4000 0000000000000010
Dec  2 02:13:06 appserver01 kernel: [124101.093333]  0000000000000616
ffffea000455c168 000000000000f9e0 ffff88018dad5fd8
Dec  2 02:13:06 appserver01 kernel: [124101.093335]  0000000000015780
0000000000015780 ffff88017e266350 ffff88017e266648
Dec  2 02:13:06 appserver01 kernel: [124101.093338] Call Trace:
Dec  2 02:13:06 appserver01 kernel: [124101.093346]  [<ffffffff812fcc4f>] ?
rwsem_down_failed_common+0x8c/0xa8
Dec  2 02:13:06 appserver01 kernel: [124101.093348]  [<ffffffff812fccb2>] ?
rwsem_down_read_failed+0x22/0x2b
Dec  2 02:13:06 appserver01 kernel: [124101.093353]  [<ffffffff811965f4>] ?
call_rwsem_down_read_failed+0x14/0x30
Dec  2 02:13:06 appserver01 kernel: [124101.093359]  [<ffffffffa028f0bc>] ?
user_dlm_lock+0x0/0x47 [ocfs2_stack_user]
Dec  2 02:13:06 appserver01 kernel: [124101.093363]  [<ffffffff810b885b>] ?
zone_watermark_ok+0x20/0xb1
Dec  2 02:13:06 appserver01 kernel: [124101.093365]  [<ffffffff812fc665>] ?
down_read+0x17/0x19
Dec  2 02:13:06 appserver01 kernel: [124101.093371]  [<ffffffffa02133b6>] ?
dlm_lock+0x56/0x149 [dlm]
Dec  2 02:13:06 appserver01 kernel: [124101.093374]  [<ffffffff810c79c0>] ?
zone_statistics+0x3c/0x5d
Dec  2 02:13:06 appserver01 kernel: [124101.093377]  [<ffffffffa028f0fe>] ?
user_dlm_lock+0x42/0x47 [ocfs2_stack_user]
Dec  2 02:13:06 appserver01 kernel: [124101.093380]  [<ffffffffa028f000>] ?
fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user]
Dec  2 02:13:06 appserver01 kernel: [124101.093382]  [<ffffffffa028f02d>] ?
fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user]
Dec  2 02:13:06 appserver01 kernel: [124101.093391]  [<ffffffffa031587a>] ?
__ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2]
Dec  2 02:13:06 appserver01 kernel: [124101.093395]  [<ffffffff8100f657>] ?
__switch_to+0x140/0x297
Dec  2 02:13:06 appserver01 kernel: [124101.093402]  [<ffffffffa0315cd8>] ?
ocfs2_cluster_lock+0x15/0x17 [ocfs2]
Dec  2 02:13:06 appserver01 kernel: [124101.093408]  [<ffffffffa03195c2>] ?
ocfs2_super_lock+0xc7/0x2a9 [ocfs2]
Dec  2 02:13:06 appserver01 kernel: [124101.093415]  [<ffffffffa03195c2>] ?
ocfs2_super_lock+0xc7/0x2a9 [ocfs2]
Dec  2 02:13:06 appserver01 kernel: [124101.093421]  [<ffffffffa0329f9e>] ?
__ocfs2_recovery_thread+0x0/0x122b [ocfs2]
Dec  2 02:13:06 appserver01 kernel: [124101.093428]  [<ffffffffa032a07f>] ?
__ocfs2_recovery_thread+0xe1/0x122b [ocfs2]
Dec  2 02:13:06 appserver01 kernel: [124101.093430]  [<ffffffff812fba90>] ?
thread_return+0x79/0xe0
Dec  2 02:13:06 appserver01 kernel: [124101.093433]  [<ffffffff8103a403>] ?
activate_task+0x22/0x28
Dec  2 02:13:06 appserver01 kernel: [124101.093436]  [<ffffffff8104a44f>] ?
try_to_wake_up+0x289/0x29b
Dec  2 02:13:06 appserver01 kernel: [124101.093443]  [<ffffffffa0329f9e>] ?
__ocfs2_recovery_thread+0x0/0x122b [ocfs2]
Dec  2 02:13:06 appserver01 kernel: [124101.093446]  [<ffffffff81064d79>] ?
kthread+0x79/0x81
Dec  2 02:13:06 appserver01 kernel: [124101.093449]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Dec  2 02:13:06 appserver01 kernel: [124101.093451]  [<ffffffff81064d00>] ?
kthread+0x0/0x81
Dec  2 02:13:06 appserver01 kernel: [124101.093453]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20

Then a few moments passed.

Dec  2 02:13:32 appserver01 kernel: [124127.071151] block drbd1: Handshake
successful: Agreed network protocol version 91
Dec  2 02:13:32 appserver01 kernel: [124127.071157] block drbd1: conn(
WFConnection -> WFReportParams )
Dec  2 02:13:32 appserver01 kernel: [124127.076732] block drbd1: Starting
asender thread (from drbd1_receiver [7526])
Dec  2 02:13:32 appserver01 kernel: [124127.078447] block drbd1:
data-integrity-alg: <not-used>
Dec  2 02:13:32 appserver01 kernel: [124127.078456] block drbd1:
drbd_sync_handshake:
Dec  2 02:13:32 appserver01 kernel: [124127.078459] block drbd1: self
7843E95E721AF0ED:54BC6F3AD7F42585:52FF69A8720BCEAC:BA309D9B7FCA3C07
bits:115301551 flags:0
Dec  2 02:13:32 appserver01 kernel: [124127.078461] block drbd1: peer
54BC6F3AD7F42584:0000000000000000:0000000000000000:0000000000000000
bits:115314775 flags:2
Dec  2 02:13:32 appserver01 kernel: [124127.078464] block drbd1:
uuid_compare()=1 by rule 70
Dec  2 02:13:32 appserver01 kernel: [124127.078465] block drbd1: Becoming
sync source due to disk states.
Dec  2 02:13:32 appserver01 kernel: [124127.078469] block drbd1: peer(
Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Dec  2 02:13:39 appserver01 kernel: [124134.091066] block drbd1: conn(
WFBitMapS -> SyncSource )
Dec  2 02:13:39 appserver01 kernel: [124134.091078] block drbd1: Began
resync as SyncSource (will sync 461259100 KB [115314775 bits set]).

And after yet some more moments passing it started to repeatedly post call
traces. Here is just one cycle of these traces. At this point the load was
critical and I must assume the server was unresponsive because the status
of the alarms didn't change until manual intervention. It kept posting call
traces for 4 minutes and then I must assume DRBD died because it was quiet
until reboot.

Dec  2 02:15:06 appserver01 kernel: [124220.996240] ocfs2rec      D
ffff88017e7fa350     0 26221      2 0x00000000
Dec  2 02:15:06 appserver01 kernel: [124220.996244]  ffff88017e7fa350
0000000000000046 ffff88018dad4000 0000000000000010
Dec  2 02:15:06 appserver01 kernel: [124220.996247]  0000000000000616
ffffea000455c168 000000000000f9e0 ffff88018dad5fd8
Dec  2 02:15:06 appserver01 kernel: [124220.996250]  0000000000015780
0000000000015780 ffff88017e266350 ffff88017e266648
Dec  2 02:15:06 appserver01 kernel: [124220.996252] Call Trace:
Dec  2 02:15:06 appserver01 kernel: [124220.996260]  [<ffffffff812fcc4f>] ?
rwsem_down_failed_common+0x8c/0xa8
Dec  2 02:15:06 appserver01 kernel: [124220.996262]  [<ffffffff812fccb2>] ?
rwsem_down_read_failed+0x22/0x2b
Dec  2 02:15:06 appserver01 kernel: [124220.996267]  [<ffffffff811965f4>] ?
call_rwsem_down_read_failed+0x14/0x30
Dec  2 02:15:06 appserver01 kernel: [124220.996273]  [<ffffffffa028f0bc>] ?
user_dlm_lock+0x0/0x47 [ocfs2_stack_user]
Dec  2 02:15:06 appserver01 kernel: [124220.996277]  [<ffffffff810b885b>] ?
zone_watermark_ok+0x20/0xb1
Dec  2 02:15:06 appserver01 kernel: [124220.996279]  [<ffffffff812fc665>] ?
down_read+0x17/0x19
Dec  2 02:15:06 appserver01 kernel: [124220.996285]  [<ffffffffa02133b6>] ?
dlm_lock+0x56/0x149 [dlm]
Dec  2 02:15:06 appserver01 kernel: [124220.996289]  [<ffffffff810c79c0>] ?
zone_statistics+0x3c/0x5d
Dec  2 02:15:06 appserver01 kernel: [124220.996291]  [<ffffffffa028f0fe>] ?
user_dlm_lock+0x42/0x47 [ocfs2_stack_user]
Dec  2 02:15:06 appserver01 kernel: [124220.996294]  [<ffffffffa028f000>] ?
fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user]
Dec  2 02:15:06 appserver01 kernel: [124220.996297]  [<ffffffffa028f02d>] ?
fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user]
Dec  2 02:15:06 appserver01 kernel: [124220.996305]  [<ffffffffa031587a>] ?
__ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996310]  [<ffffffff8100f657>] ?
__switch_to+0x140/0x297
Dec  2 02:15:06 appserver01 kernel: [124220.996317]  [<ffffffffa0315cd8>] ?
ocfs2_cluster_lock+0x15/0x17 [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996323]  [<ffffffffa03195c2>] ?
ocfs2_super_lock+0xc7/0x2a9 [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996330]  [<ffffffffa03195c2>] ?
ocfs2_super_lock+0xc7/0x2a9 [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996337]  [<ffffffffa0329f9e>] ?
__ocfs2_recovery_thread+0x0/0x122b [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996343]  [<ffffffffa032a07f>] ?
__ocfs2_recovery_thread+0xe1/0x122b [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996346]  [<ffffffff812fba90>] ?
thread_return+0x79/0xe0
Dec  2 02:15:06 appserver01 kernel: [124220.996349]  [<ffffffff8103a403>] ?
activate_task+0x22/0x28
Dec  2 02:15:06 appserver01 kernel: [124220.996352]  [<ffffffff8104a44f>] ?
try_to_wake_up+0x289/0x29b
Dec  2 02:15:06 appserver01 kernel: [124220.996359]  [<ffffffffa0329f9e>] ?
__ocfs2_recovery_thread+0x0/0x122b [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996362]  [<ffffffff81064d79>] ?
kthread+0x79/0x81
Dec  2 02:15:06 appserver01 kernel: [124220.996364]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Dec  2 02:15:06 appserver01 kernel: [124220.996366]  [<ffffffff81064d00>] ?
kthread+0x0/0x81
Dec  2 02:15:06 appserver01 kernel: [124220.996368]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Dec  2 02:15:06 appserver01 kernel: [124220.996556] ls            D
ffff8801bb5a2a60     0 26318  26317 0x00000000
Dec  2 02:15:06 appserver01 kernel: [124220.996559]  ffff8801bb5a2a60
0000000000000082 ffff8801bb7734c8 ffffffff81103ab9
Dec  2 02:15:06 appserver01 kernel: [124220.996561]  ffff88016843dd58
ffff88016843ddf8 000000000000f9e0 ffff88016843dfd8
Dec  2 02:15:06 appserver01 kernel: [124220.996563]  0000000000015780
0000000000015780 ffff8801bcf1a350 ffff8801bcf1a648
Dec  2 02:15:06 appserver01 kernel: [124220.996566] Call Trace:
Dec  2 02:15:06 appserver01 kernel: [124220.996570]  [<ffffffff81103ab9>] ?
mntput_no_expire+0x23/0xee
Dec  2 02:15:06 appserver01 kernel: [124220.996573]  [<ffffffff810f75af>] ?
__link_path_walk+0x6f0/0x6f5
Dec  2 02:15:06 appserver01 kernel: [124220.996580]  [<ffffffffa03296af>] ?
ocfs2_wait_for_recovery+0x9d/0xb7 [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996582]  [<ffffffff81065046>] ?
autoremove_wake_function+0x0/0x2e
Dec  2 02:15:06 appserver01 kernel: [124220.996589]  [<ffffffffa0319923>] ?
ocfs2_inode_lock_full_nested+0x16b/0xb2c [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996596]  [<ffffffffa0324f2d>] ?
ocfs2_inode_revalidate+0x145/0x221 [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996603]  [<ffffffffa03208d9>] ?
ocfs2_getattr+0x79/0x16a [ocfs2]
Dec  2 02:15:06 appserver01 kernel: [124220.996606]  [<ffffffff810f2591>] ?
vfs_fstatat+0x43/0x57
Dec  2 02:15:06 appserver01 kernel: [124220.996609]  [<ffffffff810f25fb>] ?
sys_newlstat+0x11/0x30
Dec  2 02:15:06 appserver01 kernel: [124220.996612]  [<ffffffff812ff306>] ?
do_page_fault+0x2e0/0x2fc
Dec  2 02:15:06 appserver01 kernel: [124220.996614]  [<ffffffff812fd1a5>] ?
page_fault+0x25/0x30
Dec  2 02:15:06 appserver01 kernel: [124220.996616]  [<ffffffff81010b42>] ?
system_call_fastpath+0x16/0x1b
Dec  2 02:17:06 appserver01 kernel: [124340.899149] events/0      D
ffff88017e7faa60     0     6      2 0x00000000
Dec  2 02:17:06 appserver01 kernel: [124340.899153]  ffff88017e7faa60
0000000000000046 ffff880006e157e8 ffff8801bf09e388
Dec  2 02:17:06 appserver01 kernel: [124340.899157]  ffff8801bc88f1b8
ffff8801bc88f1a8 000000000000f9e0 ffff8801bf0b3fd8
Dec  2 02:17:06 appserver01 kernel: [124340.899160]  0000000000015780
0000000000015780 ffff8801bf09e350 ffff8801bf09e648
Dec  2 02:17:06 appserver01 kernel: [124340.899162] Call Trace:
Dec  2 02:17:06 appserver01 kernel: [124340.899169]  [<ffffffff812fba90>] ?
thread_return+0x79/0xe0
Dec  2 02:17:06 appserver01 kernel: [124340.899172]  [<ffffffff812fcc4f>] ?
rwsem_down_failed_common+0x8c/0xa8
Dec  2 02:17:06 appserver01 kernel: [124340.899175]  [<ffffffff812fccb2>] ?
rwsem_down_read_failed+0x22/0x2b
Dec  2 02:17:06 appserver01 kernel: [124340.899179]  [<ffffffff811965f4>] ?
call_rwsem_down_read_failed+0x14/0x30
Dec  2 02:17:06 appserver01 kernel: [124340.899185]  [<ffffffffa028f0bc>] ?
user_dlm_lock+0x0/0x47 [ocfs2_stack_user]
Dec  2 02:17:06 appserver01 kernel: [124340.899188]  [<ffffffff812fc665>] ?
down_read+0x17/0x19
Dec  2 02:17:06 appserver01 kernel: [124340.899193]  [<ffffffffa02133b6>] ?
dlm_lock+0x56/0x149 [dlm]
Dec  2 02:17:06 appserver01 kernel: [124340.899198]  [<ffffffff810168c1>] ?
sched_clock+0x5/0x8
Dec  2 02:17:06 appserver01 kernel: [124340.899202]  [<ffffffff81049412>] ?
update_rq_clock+0xf/0x28
Dec  2 02:17:06 appserver01 kernel: [124340.899205]  [<ffffffff8104a44f>] ?
try_to_wake_up+0x289/0x29b
Dec  2 02:17:06 appserver01 kernel: [124340.899209]  [<ffffffff810fd0ce>] ?
pollwake+0x53/0x59
Dec  2 02:17:06 appserver01 kernel: [124340.899211]  [<ffffffff8104a461>] ?
default_wake_function+0x0/0x9
Dec  2 02:17:06 appserver01 kernel: [124340.899214]  [<ffffffffa028f0fe>] ?
user_dlm_lock+0x42/0x47 [ocfs2_stack_user]
Dec  2 02:17:06 appserver01 kernel: [124340.899217]  [<ffffffffa028f000>] ?
fsdlm_lock_ast_wrapper+0x0/0x2d [ocfs2_stack_user]
Dec  2 02:17:06 appserver01 kernel: [124340.899219]  [<ffffffffa028f02d>] ?
fsdlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_user]
Dec  2 02:17:06 appserver01 kernel: [124340.899228]  [<ffffffffa031587a>] ?
__ocfs2_cluster_lock+0x47c/0x8c5 [ocfs2]
Dec  2 02:17:06 appserver01 kernel: [124340.899231]  [<ffffffff812fba90>] ?
thread_return+0x79/0xe0
Dec  2 02:17:06 appserver01 kernel: [124340.899237]  [<ffffffffa0315cd8>] ?
ocfs2_cluster_lock+0x15/0x17 [ocfs2]
Dec  2 02:17:06 appserver01 kernel: [124340.899244]  [<ffffffffa0317472>] ?
ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2]
Dec  2 02:17:06 appserver01 kernel: [124340.899250]  [<ffffffffa0317472>] ?
ocfs2_orphan_scan_lock+0x5d/0xa8 [ocfs2]
Dec  2 02:17:06 appserver01 kernel: [124340.899257]  [<ffffffffa0328abe>] ?
ocfs2_queue_orphan_scan+0x29/0x126 [ocfs2]
Dec  2 02:17:06 appserver01 kernel: [124340.899259]  [<ffffffff812fc3c6>] ?
mutex_lock+0xd/0x31
Dec  2 02:17:06 appserver01 kernel: [124340.899266]  [<ffffffffa0328be0>] ?
ocfs2_orphan_scan_work+0x25/0x4d [ocfs2]
Dec  2 02:17:06 appserver01 kernel: [124340.899270]  [<ffffffff81061a13>] ?
worker_thread+0x188/0x21d
Dec  2 02:17:06 appserver01 kernel: [124340.899276]  [<ffffffffa0328bbb>] ?
ocfs2_orphan_scan_work+0x0/0x4d [ocfs2]
Dec  2 02:17:06 appserver01 kernel: [124340.899280]  [<ffffffff81065046>] ?
autoremove_wake_function+0x0/0x2e
Dec  2 02:17:06 appserver01 kernel: [124340.899282]  [<ffffffff8106188b>] ?
worker_thread+0x0/0x21d
Dec  2 02:17:06 appserver01 kernel: [124340.899284]  [<ffffffff81064d79>] ?
kthread+0x79/0x81
Dec  2 02:17:06 appserver01 kernel: [124340.899287]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Dec  2 02:17:06 appserver01 kernel: [124340.899289]  [<ffffffff81064d00>] ?
kthread+0x0/0x81
Dec  2 02:17:06 appserver01 kernel: [124340.899291]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20

-- 
Hälsningar / Greetings

Stefan Midjich
[De omnibus dubitandum]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20121202/d8286164/attachment.htm>


More information about the drbd-user mailing list