[DRBD-user] Several cases of hangups found. I have some stack traces to send.

Jesus Climent climent at gmail.com
Fri Mar 1 20:42:33 CET 2013


Any luck with the traces I sent?

On Wed, Feb 27, 2013 at 1:50 PM, Jesus Climent <climent at gmail.com> wrote:
> I have been digging a bit more into the current state, and these are my
> findings:
>
> Just as a migration is about to finish, the secondary node receives the
> signal to become primary. It is when drbdsetup then tries to set the
> secondary role that it fails on the primary.
>
> This is the drbdsetup process in D state (uninterruptible sleep) on hostname6:
> root     27691  0.0  0.0   3964   536 ?        D    00:06   0:00
> drbdsetup /dev/drbd1 secondary
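
For anyone trying to spot the same hang on their own nodes, here is a minimal sketch of how the D-state check above could be scripted. It assumes `ps aux`-style input (11 columns, STAT in the eighth); the function name `find_dstate` is made up for illustration and is not part of DRBD or any tool mentioned in this thread.

```python
def find_dstate(ps_lines):
    """Return (pid, command) for each process in uninterruptible sleep.

    Assumes `ps aux` column order:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    hung = []
    for line in ps_lines:
        # Split into at most 11 fields so the command keeps its arguments.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header row and malformed lines
        if fields[7].startswith("D"):
            hung.append((int(fields[1]), fields[10]))
    return hung


# The ps line quoted above, rejoined after mail wrapping.
sample = ("root     27691  0.0  0.0   3964   536 ?        D    00:06   0:00 "
          "drbdsetup /dev/drbd1 secondary")
result = find_dstate([sample])
print(result)
```

On a live system the input could come from `subprocess.check_output(["ps", "aux"], text=True).splitlines()`. On kernels that expose it, reading `/proc/<pid>/stack` as root for each PID found may give the kernel stack of the stuck process.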
>
> hostname6  1 inst-test3.google.com disk/0 secondary hostname5
> hostname5  0 inst-test3.google.com disk/0 primary hostname6
>
> node: hostname5
>   0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r---b-
>
> node: hostname6
>   1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r---b-
>
> [xen-test] root at hostname5:~# drbdsetup /dev/drbd0 status
> <resource minor="0" cs="Connected" ro1="Primary" ro2="Primary"
> ds1="UpToDate" ds2="UpToDate" />
>
> For hostname6's trace:
>
> http://db.tt/FI8rmrpw
>
> And here are the relevant pieces from hostname5. If more are needed, I can
> post them.
>
> Feb 27 16:08:43 hostname5 kernel: [64271.712343] drbd0_receiver  S
> ffff88003e411dc0     0 19682      2 0x00000000
> Feb 27 16:08:43 hostname5 kernel: [64271.712351]  ffff88002b205970
> 0000000000000246 ffff88002b2058f0 ffff880016daea00
> Feb 27 16:08:43 hostname5 kernel: [64271.712360]  000000000000000c
> 0000000000000000 ffff8800020fd160 0000000000011dc0
> Feb 27 16:08:43 hostname5 kernel: [64271.712369]  ffff88002b205fd8
> ffff88002b204010 ffff88002b205fd8 0000000000011dc0
> Feb 27 16:08:43 hostname5 kernel: [64271.712377] Call Trace:
> Feb 27 16:08:43 hostname5 kernel: [64271.712382]  [<ffffffff810d1057>]
> ? kfree+0x17/0xc0
> Feb 27 16:08:43 hostname5 kernel: [64271.712388]  [<ffffffff8149878a>]
> schedule+0x3a/0x60
> Feb 27 16:08:43 hostname5 kernel: [64271.712393]  [<ffffffff81498b95>]
> schedule_timeout+0x185/0x1e0
> Feb 27 16:08:43 hostname5 kernel: [64271.712400]  [<ffffffff8104c3e2>]
> ? local_bh_enable_ip+0x22/0xa0
> Feb 27 16:08:43 hostname5 kernel: [64271.712406]  [<ffffffff8149a444>]
> ? _raw_spin_unlock_bh+0x14/0x20
> Feb 27 16:08:43 hostname5 kernel: [64271.712413]  [<ffffffff813b63f1>]
> sk_wait_data+0xd1/0xe0
> Feb 27 16:08:43 hostname5 kernel: [64271.712419]  [<ffffffff810622b0>]
> ? wake_up_bit+0x40/0x40
> Feb 27 16:08:43 hostname5 kernel: [64271.712425]  [<ffffffff8104be62>]
> ? local_bh_enable+0x22/0xa0
> Feb 27 16:08:43 hostname5 kernel: [64271.712431]  [<ffffffff81403d01>]
> tcp_recvmsg+0x651/0xc80
> Feb 27 16:08:43 hostname5 kernel: [64271.712437]  [<ffffffff81424f7a>]
> inet_recvmsg+0x4a/0x80
> Feb 27 16:08:43 hostname5 kernel: [64271.712444]  [<ffffffff81005485>]
> ? arbitrary_virt_to_machine+0x85/0xb0
> Feb 27 16:08:43 hostname5 kernel: [64271.712450]  [<ffffffff813b1961>]
> sock_recvmsg+0xc1/0xf0
> Feb 27 16:08:43 hostname5 kernel: [64271.712456]  [<ffffffff81003129>]
> ? xen_end_context_switch+0x19/0x20
> Feb 27 16:08:43 hostname5 kernel: [64271.712462]  [<ffffffff81009915>]
> ? __switch_to+0x145/0x370
> Feb 27 16:08:43 hostname5 kernel: [64271.712468]  [<ffffffff8103e75e>]
> ? finish_task_switch+0x5e/0xc0
> Feb 27 16:08:43 hostname5 kernel: [64271.712475]  [<ffffffff814981ea>]
> ? __schedule+0x29a/0x760
> Feb 27 16:08:43 hostname5 kernel: [64271.712481]  [<ffffffff8149a279>]
> ? _raw_spin_unlock_irqrestore+0x19/0x20
> Feb 27 16:08:43 hostname5 kernel: [64271.712493]  [<ffffffffa00569ec>]
> drbd_recv+0x8c/0x230 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712505]  [<ffffffffa0059bac>]
> ? drbd_may_finish_epoch+0x9c/0x3a0 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712517]  [<ffffffffa0057a0e>]
> drbd_recv_header+0x2e/0x130 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712528]  [<ffffffffa00583e6>]
> drbdd+0x46/0x200 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712540]  [<ffffffffa005e2e5>]
> drbdd_init+0x85/0x130 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712551]  [<ffffffffa006a174>]
> drbd_thread_setup+0x64/0xf0 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712563]  [<ffffffffa006a110>]
> ? _drbd_thread_stop+0x100/0x100 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712569]  [<ffffffff81061e06>]
> kthread+0x96/0xa0
> Feb 27 16:08:43 hostname5 kernel: [64271.712576]  [<ffffffff8149ceb4>]
> kernel_thread_helper+0x4/0x10
> Feb 27 16:08:43 hostname5 kernel: [64271.712582]  [<ffffffff8149af76>]
> ? int_ret_from_sys_call+0x7/0x1b
> Feb 27 16:08:43 hostname5 kernel: [64271.712589]  [<ffffffff8149a6bc>]
> ? retint_restore_args+0x5/0x6
> Feb 27 16:08:43 hostname5 kernel: [64271.712595]  [<ffffffff8149ceb0>]
> ? gs_change+0x13/0x13
> Feb 27 16:08:43 hostname5 kernel: [64271.712599] drbd0_asender   S
> ffff88003e411dc0     0 19689      2 0x00000000
> Feb 27 16:08:43 hostname5 kernel: [64271.712607]  ffff8800249739a0
> 0000000000000246 ffff880024973920 ffffffff810cd8c8
> Feb 27 16:08:43 hostname5 kernel: [64271.712616]  000000000000a570
> ffff880016ea94b0 ffff8800020f8000 0000000000011dc0
> Feb 27 16:08:43 hostname5 kernel: [64271.712625]  ffff880024973fd8
> ffff880024972010 ffff880024973fd8 0000000000011dc0
> Feb 27 16:08:43 hostname5 kernel: [64271.712633] Call Trace:
> Feb 27 16:08:43 hostname5 kernel: [64271.712637]  [<ffffffff810cd8c8>]
> ? ksize+0x18/0xc0
> Feb 27 16:08:43 hostname5 kernel: [64271.712665]  [<ffffffff8149a22a>]
> ? _raw_spin_lock_irqsave+0x2a/0x40
> Feb 27 16:08:43 hostname5 kernel: [64271.712671]  [<ffffffff8149878a>]
> schedule+0x3a/0x60
> Feb 27 16:08:43 hostname5 kernel: [64271.712677]  [<ffffffff81498b45>]
> schedule_timeout+0x135/0x1e0
> Feb 27 16:08:43 hostname5 kernel: [64271.712684]  [<ffffffff81052540>]
> ? add_timer_on+0xa0/0xa0
> Feb 27 16:08:43 hostname5 kernel: [64271.712690]  [<ffffffff813b63f1>]
> sk_wait_data+0xd1/0xe0
> Feb 27 16:08:43 hostname5 kernel: [64271.712696]  [<ffffffff810622b0>]
> ? wake_up_bit+0x40/0x40
> Feb 27 16:08:43 hostname5 kernel: [64271.712702]  [<ffffffff8104be62>]
> ? local_bh_enable+0x22/0xa0
> Feb 27 16:08:43 hostname5 kernel: [64271.712708]  [<ffffffff81403d01>]
> tcp_recvmsg+0x651/0xc80
> Feb 27 16:08:43 hostname5 kernel: [64271.712715]  [<ffffffff810085d1>]
> ? m2p_remove_override+0x251/0x2f0
> Feb 27 16:08:43 hostname5 kernel: [64271.712721]  [<ffffffff81424f7a>]
> inet_recvmsg+0x4a/0x80
> Feb 27 16:08:43 hostname5 kernel: [64271.712727]  [<ffffffff813b1961>]
> sock_recvmsg+0xc1/0xf0
> Feb 27 16:08:43 hostname5 kernel: [64271.712733]  [<ffffffff810d0ac5>]
> ? kmem_cache_free+0x15/0x90
> Feb 27 16:08:43 hostname5 kernel: [64271.712740]  [<ffffffff81096442>]
> ? mempool_free_slab+0x12/0x20
> Feb 27 16:08:43 hostname5 kernel: [64271.712747]  [<ffffffff81096555>]
> ? mempool_free+0x85/0x90
> Feb 27 16:08:43 hostname5 kernel: [64271.712758]  [<ffffffffa005ff0a>]
> ? _req_may_be_done+0x12a/0x4e0 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712765]  [<ffffffff8103a02e>]
> ? __wake_up+0x4e/0x70
> Feb 27 16:08:43 hostname5 kernel: [64271.712771]  [<ffffffff8149a279>]
> ? _raw_spin_unlock_irqrestore+0x19/0x20
> Feb 27 16:08:43 hostname5 kernel: [64271.712777]  [<ffffffff8103a02e>]
> ? __wake_up+0x4e/0x70
> Feb 27 16:08:43 hostname5 kernel: [64271.712788]  [<ffffffffa00541b3>]
> drbd_recv_short+0x73/0x90 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712800]  [<ffffffffa0059569>]
> drbd_asender+0x189/0x730 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712806]  [<ffffffff8103e75e>]
> ? finish_task_switch+0x5e/0xc0
> Feb 27 16:08:43 hostname5 kernel: [64271.712818]  [<ffffffffa006a174>]
> drbd_thread_setup+0x64/0xf0 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712829]  [<ffffffffa006a110>]
> ? _drbd_thread_stop+0x100/0x100 [drbd]
> Feb 27 16:08:43 hostname5 kernel: [64271.712836]  [<ffffffff81061e06>]
> kthread+0x96/0xa0
> Feb 27 16:08:43 hostname5 kernel: [64271.712842]  [<ffffffff8149ceb4>]
> kernel_thread_helper+0x4/0x10
> Feb 27 16:08:43 hostname5 kernel: [64271.712849]  [<ffffffff8149af76>]
> ? int_ret_from_sys_call+0x7/0x1b
> Feb 27 16:08:43 hostname5 kernel: [64271.712855]  [<ffffffff8149a6bc>]
> ? retint_restore_args+0x5/0x6
> Feb 27 16:08:43 hostname5 kernel: [64271.712861]  [<ffffffff8149ceb0>]
> ? gs_change+0x13/0x13
>
>
> --
> climent () gmail ! com
>
> On Wed, Feb 27, 2013 at 9:25 AM, Jesus Climent <climent at gmail.com> wrote:
>> I have managed to create a repro case, with the system under heavy
>> I/O load. Out of a set of 4 test clusters, all except one hit a
>> "drbdsetup /dev/drbdX secondary" hang.
>>
>> What other information should I send to the list in order to evaluate
>> this problem?
>>
>> On Wed, Feb 27, 2013 at 7:45 AM, Lars Ellenberg
>> <lars.ellenberg at linbit.com> wrote:
>>> On Tue, Feb 26, 2013 at 06:01:04PM -0500, Jesus Climent wrote:
>>>> Has anybody taken a look into it?
>>>
>>> Yep.
>>> Nothing obvious, sorry.
>>>
>>>         Lars
>>
>>
>>
>> --
>> climent () gmail ! com
>
>
>
> --
> climent () gmail ! com



--
climent () gmail ! com


