I have been digging a bit more into the current state, and these are my findings: just as a migration is about to finish, the secondary node receives the signal to become primary, and it is at that point, when drbdsetup tries to switch the role to secondary, that it hangs on the primary.

This is the drbdsetup process in D state, on hostname6:

root 27691 0.0 0.0 3964 536 ? D 00:06 0:00 drbdsetup /dev/drbd1 secondary

hostname6 1 inst-test3.google.com disk/0 secondary hostname5
hostname5 0 inst-test3.google.com disk/0 primary hostname6

node: hostname5
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r---b-
node: hostname6
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r---b-

[xen-test] root at hostname5:~# drbdsetup /dev/drbd0 status
<resource minor="0" cs="Connected" ro1="Primary" ro2="Primary" ds1="UpToDate" ds2="UpToDate" />

For hostname6's trace: http://db.tt/FI8rmrpw

And hostname5's relevant pieces. If more are needed, I can post them.

Feb 27 16:08:43 hostname5 kernel: [64271.712343] drbd0_receiver S ffff88003e411dc0 0 19682 2 0x00000000
Feb 27 16:08:43 hostname5 kernel: [64271.712351] ffff88002b205970 0000000000000246 ffff88002b2058f0 ffff880016daea00
Feb 27 16:08:43 hostname5 kernel: [64271.712360] 000000000000000c 0000000000000000 ffff8800020fd160 0000000000011dc0
Feb 27 16:08:43 hostname5 kernel: [64271.712369] ffff88002b205fd8 ffff88002b204010 ffff88002b205fd8 0000000000011dc0
Feb 27 16:08:43 hostname5 kernel: [64271.712377] Call Trace:
Feb 27 16:08:43 hostname5 kernel: [64271.712382] [<ffffffff810d1057>] ? kfree+0x17/0xc0
Feb 27 16:08:43 hostname5 kernel: [64271.712388] [<ffffffff8149878a>] schedule+0x3a/0x60
Feb 27 16:08:43 hostname5 kernel: [64271.712393] [<ffffffff81498b95>] schedule_timeout+0x185/0x1e0
Feb 27 16:08:43 hostname5 kernel: [64271.712400] [<ffffffff8104c3e2>] ? local_bh_enable_ip+0x22/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712406] [<ffffffff8149a444>] ? _raw_spin_unlock_bh+0x14/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712413] [<ffffffff813b63f1>] sk_wait_data+0xd1/0xe0
Feb 27 16:08:43 hostname5 kernel: [64271.712419] [<ffffffff810622b0>] ? wake_up_bit+0x40/0x40
Feb 27 16:08:43 hostname5 kernel: [64271.712425] [<ffffffff8104be62>] ? local_bh_enable+0x22/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712431] [<ffffffff81403d01>] tcp_recvmsg+0x651/0xc80
Feb 27 16:08:43 hostname5 kernel: [64271.712437] [<ffffffff81424f7a>] inet_recvmsg+0x4a/0x80
Feb 27 16:08:43 hostname5 kernel: [64271.712444] [<ffffffff81005485>] ? arbitrary_virt_to_machine+0x85/0xb0
Feb 27 16:08:43 hostname5 kernel: [64271.712450] [<ffffffff813b1961>] sock_recvmsg+0xc1/0xf0
Feb 27 16:08:43 hostname5 kernel: [64271.712456] [<ffffffff81003129>] ? xen_end_context_switch+0x19/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712462] [<ffffffff81009915>] ? __switch_to+0x145/0x370
Feb 27 16:08:43 hostname5 kernel: [64271.712468] [<ffffffff8103e75e>] ? finish_task_switch+0x5e/0xc0
Feb 27 16:08:43 hostname5 kernel: [64271.712475] [<ffffffff814981ea>] ? __schedule+0x29a/0x760
Feb 27 16:08:43 hostname5 kernel: [64271.712481] [<ffffffff8149a279>] ? _raw_spin_unlock_irqrestore+0x19/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712493] [<ffffffffa00569ec>] drbd_recv+0x8c/0x230 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712505] [<ffffffffa0059bac>] ? drbd_may_finish_epoch+0x9c/0x3a0 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712517] [<ffffffffa0057a0e>] drbd_recv_header+0x2e/0x130 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712528] [<ffffffffa00583e6>] drbdd+0x46/0x200 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712540] [<ffffffffa005e2e5>] drbdd_init+0x85/0x130 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712551] [<ffffffffa006a174>] drbd_thread_setup+0x64/0xf0 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712563] [<ffffffffa006a110>] ? _drbd_thread_stop+0x100/0x100 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712569] [<ffffffff81061e06>] kthread+0x96/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712576] [<ffffffff8149ceb4>] kernel_thread_helper+0x4/0x10
Feb 27 16:08:43 hostname5 kernel: [64271.712582] [<ffffffff8149af76>] ? int_ret_from_sys_call+0x7/0x1b
Feb 27 16:08:43 hostname5 kernel: [64271.712589] [<ffffffff8149a6bc>] ? retint_restore_args+0x5/0x6
Feb 27 16:08:43 hostname5 kernel: [64271.712595] [<ffffffff8149ceb0>] ? gs_change+0x13/0x13
Feb 27 16:08:43 hostname5 kernel: [64271.712599] drbd0_asender S ffff88003e411dc0 0 19689 2 0x00000000
Feb 27 16:08:43 hostname5 kernel: [64271.712607] ffff8800249739a0 0000000000000246 ffff880024973920 ffffffff810cd8c8
Feb 27 16:08:43 hostname5 kernel: [64271.712616] 000000000000a570 ffff880016ea94b0 ffff8800020f8000 0000000000011dc0
Feb 27 16:08:43 hostname5 kernel: [64271.712625] ffff880024973fd8 ffff880024972010 ffff880024973fd8 0000000000011dc0
Feb 27 16:08:43 hostname5 kernel: [64271.712633] Call Trace:
Feb 27 16:08:43 hostname5 kernel: [64271.712637] [<ffffffff810cd8c8>] ? ksize+0x18/0xc0
Feb 27 16:08:43 hostname5 kernel: [64271.712665] [<ffffffff8149a22a>] ? _raw_spin_lock_irqsave+0x2a/0x40
Feb 27 16:08:43 hostname5 kernel: [64271.712671] [<ffffffff8149878a>] schedule+0x3a/0x60
Feb 27 16:08:43 hostname5 kernel: [64271.712677] [<ffffffff81498b45>] schedule_timeout+0x135/0x1e0
Feb 27 16:08:43 hostname5 kernel: [64271.712684] [<ffffffff81052540>] ? add_timer_on+0xa0/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712690] [<ffffffff813b63f1>] sk_wait_data+0xd1/0xe0
Feb 27 16:08:43 hostname5 kernel: [64271.712696] [<ffffffff810622b0>] ? wake_up_bit+0x40/0x40
Feb 27 16:08:43 hostname5 kernel: [64271.712702] [<ffffffff8104be62>] ? local_bh_enable+0x22/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712708] [<ffffffff81403d01>] tcp_recvmsg+0x651/0xc80
Feb 27 16:08:43 hostname5 kernel: [64271.712715] [<ffffffff810085d1>] ? m2p_remove_override+0x251/0x2f0
Feb 27 16:08:43 hostname5 kernel: [64271.712721] [<ffffffff81424f7a>] inet_recvmsg+0x4a/0x80
Feb 27 16:08:43 hostname5 kernel: [64271.712727] [<ffffffff813b1961>] sock_recvmsg+0xc1/0xf0
Feb 27 16:08:43 hostname5 kernel: [64271.712733] [<ffffffff810d0ac5>] ? kmem_cache_free+0x15/0x90
Feb 27 16:08:43 hostname5 kernel: [64271.712740] [<ffffffff81096442>] ? mempool_free_slab+0x12/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712747] [<ffffffff81096555>] ? mempool_free+0x85/0x90
Feb 27 16:08:43 hostname5 kernel: [64271.712758] [<ffffffffa005ff0a>] ? _req_may_be_done+0x12a/0x4e0 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712765] [<ffffffff8103a02e>] ? __wake_up+0x4e/0x70
Feb 27 16:08:43 hostname5 kernel: [64271.712771] [<ffffffff8149a279>] ? _raw_spin_unlock_irqrestore+0x19/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712777] [<ffffffff8103a02e>] ? __wake_up+0x4e/0x70
Feb 27 16:08:43 hostname5 kernel: [64271.712788] [<ffffffffa00541b3>] drbd_recv_short+0x73/0x90 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712800] [<ffffffffa0059569>] drbd_asender+0x189/0x730 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712806] [<ffffffff8103e75e>] ? finish_task_switch+0x5e/0xc0
Feb 27 16:08:43 hostname5 kernel: [64271.712818] [<ffffffffa006a174>] drbd_thread_setup+0x64/0xf0 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712829] [<ffffffffa006a110>] ? _drbd_thread_stop+0x100/0x100 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712836] [<ffffffff81061e06>] kthread+0x96/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712842] [<ffffffff8149ceb4>] kernel_thread_helper+0x4/0x10
Feb 27 16:08:43 hostname5 kernel: [64271.712849] [<ffffffff8149af76>] ? int_ret_from_sys_call+0x7/0x1b
Feb 27 16:08:43 hostname5 kernel: [64271.712855] [<ffffffff8149a6bc>] ? retint_restore_args+0x5/0x6
Feb 27 16:08:43 hostname5 kernel: [64271.712861] [<ffffffff8149ceb0>] ? gs_change+0x13/0x13

--
climent () gmail ! com

On Wed, Feb 27, 2013 at 9:25 AM, Jesus Climent <climent at gmail.com> wrote:
> I have managed to create a repro case, when the system is under a high
> load of I/O. From a set of 4 test clusters, all except one got a
> "drbdsetup /dev/drbdX secondary" hang.
>
> What other information should I send to the list in order to evaluate
> this problem?
>
> On Wed, Feb 27, 2013 at 7:45 AM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
>> On Tue, Feb 26, 2013 at 06:01:04PM -0500, Jesus Climent wrote:
>>> Has anybody taken a look into it?
>>
>> Yep.
>> Nothing obvious, sorry.
>>
>> Lars
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>
>
> --
> climent () gmail ! com
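In case it helps anyone trying to collect the same evidence on their own cluster: this is not from the report above, just a minimal sketch (assuming a Linux host where /proc/<pid>/stack is available, which needs root) that finds processes stuck in uninterruptible sleep, such as the hung drbdsetup, and dumps the kernel stack each one is blocked in:

```shell
#!/bin/sh
# Find processes in uninterruptible sleep (ps state "D") and, where the
# kernel exposes it, print the in-kernel stack each one is blocked in.
# Reading /proc/<pid>/stack generally requires root.
ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/' | while read -r pid stat comm; do
    echo "== $pid ($comm) state=$stat =="
    cat "/proc/$pid/stack" 2>/dev/null || echo "  (kernel stack unavailable)"
done
```

Alternatively, `echo w > /proc/sysrq-trigger` makes the kernel log blocked-task traces for all D-state tasks at once, in the same format as the drbd0_receiver/drbd0_asender dumps above.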