[DRBD-user] Several cases of hangups found. I have some stack traces to send.

Jesus Climent climent at gmail.com
Mon Mar 4 22:44:26 CET 2013



Friendly ping

On Fri, Mar 1, 2013 at 2:42 PM, Jesus Climent <climent at gmail.com> wrote:
> Any luck with the traces I sent?
>
> On Wed, Feb 27, 2013 at 1:50 PM, Jesus Climent <climent at gmail.com> wrote:
>> I have been digging a bit more into the current state, and these are my findings:
>>
>> Once a migration is just about to finish, the secondary node receives
>> the signal to become primary. The failure happens when drbdsetup tries
>> to set the secondary role: it fails on the primary.
>>
>> This is the drbdsetup process in D state on hostname6:
>> root     27691  0.0  0.0   3964   536 ?        D    00:06   0:00 drbdsetup /dev/drbd1 secondary
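
A hung process like the one quoted above can be picked out of `ps aux` output by looking at the STAT column. The following is only a minimal sketch, assuming the usual `ps aux` column layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND). The function name `find_d_state` is made up for illustration and is not part of any DRBD tooling.

```python
def find_d_state(ps_lines):
    """Return (pid, command) for each process in uninterruptible sleep (STAT starts with D)."""
    hung = []
    for line in ps_lines:
        # Split into at most 11 fields so that COMMAND keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something we cannot parse
        if fields[7].startswith("D"):
            hung.append((int(fields[1]), fields[10]))
    return hung


# Sample input: a ps header plus the process line from the report above.
sample = [
    "USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND",
    "root     27691  0.0  0.0   3964   536 ?        D    00:06   0:00 drbdsetup /dev/drbd1 secondary",
]

if __name__ == "__main__":
    for pid, command in find_d_state(sample):
        print(pid, command)
```

In practice the input would come from the real `ps aux` output on each node rather than from a hard-coded sample.
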
>>
>> hostname6  1 inst-test3.google.com disk/0 secondary hostname5
>> hostname5  0 inst-test3.google.com disk/0 primary hostname6
>>
>> node: hostname5
>>   0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r---b-
>>
>> node: hostname6
>>   1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r---b-
>>
>> [xen-test] root at hostname5:~# drbdsetup /dev/drbd0 status
>> <resource minor="0" cs="Connected" ro1="Primary" ro2="Primary" ds1="UpToDate" ds2="UpToDate" />
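
The two status lines quoted above disagree: hostname5 reports its peer as Primary, while hostname6 reports itself as Secondary. A check for that kind of mismatch could look roughly like the sketch below. It assumes the `ro:<local>/<peer>` field format shown in the quoted status lines; the helper names `parse_roles` and `views_agree` are hypothetical.

```python
import re

# Matches the "ro:<local>/<peer>" field of a status line.
RO_PATTERN = re.compile(r"\bro:(\w+)/(\w+)")


def parse_roles(status_line):
    """Return (local_role, peer_role) from a status line."""
    match = RO_PATTERN.search(status_line)
    if match is None:
        raise ValueError("no ro: field in line: %r" % status_line)
    return match.group(1), match.group(2)


def views_agree(line_a, line_b):
    """True if each node's view of its peer matches the peer's view of itself."""
    local_a, peer_a = parse_roles(line_a)
    local_b, peer_b = parse_roles(line_b)
    return local_a == peer_b and peer_a == local_b


# The two status lines from the report above.
hostname5 = "  0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r---b-"
hostname6 = "  1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r---b-"

if __name__ == "__main__":
    print("views agree:", views_agree(hostname5, hostname6))
```

For the two lines above the check reports a disagreement, which matches what is visible by eye.
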
>>
>> For hostname6's trace:
>>
>> http://db.tt/FI8rmrpw
>>
>> And hostname5's relevant pieces. If more are needed, I can post them.
>>
>> Feb 27 16:08:43 hostname5 kernel: [64271.712343] drbd0_receiver  S
>> ffff88003e411dc0     0 19682      2 0x00000000
>> Feb 27 16:08:43 hostname5 kernel: [64271.712351]  ffff88002b205970
>> 0000000000000246 ffff88002b2058f0 ffff880016daea00
>> Feb 27 16:08:43 hostname5 kernel: [64271.712360]  000000000000000c
>> 0000000000000000 ffff8800020fd160 0000000000011dc0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712369]  ffff88002b205fd8
>> ffff88002b204010 ffff88002b205fd8 0000000000011dc0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712377] Call Trace:
>> Feb 27 16:08:43 hostname5 kernel: [64271.712382]  [<ffffffff810d1057>]
>> ? kfree+0x17/0xc0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712388]  [<ffffffff8149878a>]
>> schedule+0x3a/0x60
>> Feb 27 16:08:43 hostname5 kernel: [64271.712393]  [<ffffffff81498b95>]
>> schedule_timeout+0x185/0x1e0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712400]  [<ffffffff8104c3e2>]
>> ? local_bh_enable_ip+0x22/0xa0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712406]  [<ffffffff8149a444>]
>> ? _raw_spin_unlock_bh+0x14/0x20
>> Feb 27 16:08:43 hostname5 kernel: [64271.712413]  [<ffffffff813b63f1>]
>> sk_wait_data+0xd1/0xe0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712419]  [<ffffffff810622b0>]
>> ? wake_up_bit+0x40/0x40
>> Feb 27 16:08:43 hostname5 kernel: [64271.712425]  [<ffffffff8104be62>]
>> ? local_bh_enable+0x22/0xa0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712431]  [<ffffffff81403d01>]
>> tcp_recvmsg+0x651/0xc80
>> Feb 27 16:08:43 hostname5 kernel: [64271.712437]  [<ffffffff81424f7a>]
>> inet_recvmsg+0x4a/0x80
>> Feb 27 16:08:43 hostname5 kernel: [64271.712444]  [<ffffffff81005485>]
>> ? arbitrary_virt_to_machine+0x85/0xb0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712450]  [<ffffffff813b1961>]
>> sock_recvmsg+0xc1/0xf0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712456]  [<ffffffff81003129>]
>> ? xen_end_context_switch+0x19/0x20
>> Feb 27 16:08:43 hostname5 kernel: [64271.712462]  [<ffffffff81009915>]
>> ? __switch_to+0x145/0x370
>> Feb 27 16:08:43 hostname5 kernel: [64271.712468]  [<ffffffff8103e75e>]
>> ? finish_task_switch+0x5e/0xc0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712475]  [<ffffffff814981ea>]
>> ? __schedule+0x29a/0x760
>> Feb 27 16:08:43 hostname5 kernel: [64271.712481]  [<ffffffff8149a279>]
>> ? _raw_spin_unlock_irqrestore+0x19/0x20
>> Feb 27 16:08:43 hostname5 kernel: [64271.712493]  [<ffffffffa00569ec>]
>> drbd_recv+0x8c/0x230 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712505]  [<ffffffffa0059bac>]
>> ? drbd_may_finish_epoch+0x9c/0x3a0 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712517]  [<ffffffffa0057a0e>]
>> drbd_recv_header+0x2e/0x130 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712528]  [<ffffffffa00583e6>]
>> drbdd+0x46/0x200 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712540]  [<ffffffffa005e2e5>]
>> drbdd_init+0x85/0x130 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712551]  [<ffffffffa006a174>]
>> drbd_thread_setup+0x64/0xf0 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712563]  [<ffffffffa006a110>]
>> ? _drbd_thread_stop+0x100/0x100 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712569]  [<ffffffff81061e06>]
>> kthread+0x96/0xa0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712576]  [<ffffffff8149ceb4>]
>> kernel_thread_helper+0x4/0x10
>> Feb 27 16:08:43 hostname5 kernel: [64271.712582]  [<ffffffff8149af76>]
>> ? int_ret_from_sys_call+0x7/0x1b
>> Feb 27 16:08:43 hostname5 kernel: [64271.712589]  [<ffffffff8149a6bc>]
>> ? retint_restore_args+0x5/0x6
>> Feb 27 16:08:43 hostname5 kernel: [64271.712595]  [<ffffffff8149ceb0>]
>> ? gs_change+0x13/0x13
>> Feb 27 16:08:43 hostname5 kernel: [64271.712599] drbd0_asender   S
>> ffff88003e411dc0     0 19689      2 0x00000000
>> Feb 27 16:08:43 hostname5 kernel: [64271.712607]  ffff8800249739a0
>> 0000000000000246 ffff880024973920 ffffffff810cd8c8
>> Feb 27 16:08:43 hostname5 kernel: [64271.712616]  000000000000a570
>> ffff880016ea94b0 ffff8800020f8000 0000000000011dc0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712625]  ffff880024973fd8
>> ffff880024972010 ffff880024973fd8 0000000000011dc0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712633] Call Trace:
>> Feb 27 16:08:43 hostname5 kernel: [64271.712637]  [<ffffffff810cd8c8>]
>> ? ksize+0x18/0xc0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712665]  [<ffffffff8149a22a>]
>> ? _raw_spin_lock_irqsave+0x2a/0x40
>> Feb 27 16:08:43 hostname5 kernel: [64271.712671]  [<ffffffff8149878a>]
>> schedule+0x3a/0x60
>> Feb 27 16:08:43 hostname5 kernel: [64271.712677]  [<ffffffff81498b45>]
>> schedule_timeout+0x135/0x1e0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712684]  [<ffffffff81052540>]
>> ? add_timer_on+0xa0/0xa0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712690]  [<ffffffff813b63f1>]
>> sk_wait_data+0xd1/0xe0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712696]  [<ffffffff810622b0>]
>> ? wake_up_bit+0x40/0x40
>> Feb 27 16:08:43 hostname5 kernel: [64271.712702]  [<ffffffff8104be62>]
>> ? local_bh_enable+0x22/0xa0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712708]  [<ffffffff81403d01>]
>> tcp_recvmsg+0x651/0xc80
>> Feb 27 16:08:43 hostname5 kernel: [64271.712715]  [<ffffffff810085d1>]
>> ? m2p_remove_override+0x251/0x2f0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712721]  [<ffffffff81424f7a>]
>> inet_recvmsg+0x4a/0x80
>> Feb 27 16:08:43 hostname5 kernel: [64271.712727]  [<ffffffff813b1961>]
>> sock_recvmsg+0xc1/0xf0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712733]  [<ffffffff810d0ac5>]
>> ? kmem_cache_free+0x15/0x90
>> Feb 27 16:08:43 hostname5 kernel: [64271.712740]  [<ffffffff81096442>]
>> ? mempool_free_slab+0x12/0x20
>> Feb 27 16:08:43 hostname5 kernel: [64271.712747]  [<ffffffff81096555>]
>> ? mempool_free+0x85/0x90
>> Feb 27 16:08:43 hostname5 kernel: [64271.712758]  [<ffffffffa005ff0a>]
>> ? _req_may_be_done+0x12a/0x4e0 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712765]  [<ffffffff8103a02e>]
>> ? __wake_up+0x4e/0x70
>> Feb 27 16:08:43 hostname5 kernel: [64271.712771]  [<ffffffff8149a279>]
>> ? _raw_spin_unlock_irqrestore+0x19/0x20
>> Feb 27 16:08:43 hostname5 kernel: [64271.712777]  [<ffffffff8103a02e>]
>> ? __wake_up+0x4e/0x70
>> Feb 27 16:08:43 hostname5 kernel: [64271.712788]  [<ffffffffa00541b3>]
>> drbd_recv_short+0x73/0x90 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712800]  [<ffffffffa0059569>]
>> drbd_asender+0x189/0x730 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712806]  [<ffffffff8103e75e>]
>> ? finish_task_switch+0x5e/0xc0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712818]  [<ffffffffa006a174>]
>> drbd_thread_setup+0x64/0xf0 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712829]  [<ffffffffa006a110>]
>> ? _drbd_thread_stop+0x100/0x100 [drbd]
>> Feb 27 16:08:43 hostname5 kernel: [64271.712836]  [<ffffffff81061e06>]
>> kthread+0x96/0xa0
>> Feb 27 16:08:43 hostname5 kernel: [64271.712842]  [<ffffffff8149ceb4>]
>> kernel_thread_helper+0x4/0x10
>> Feb 27 16:08:43 hostname5 kernel: [64271.712849]  [<ffffffff8149af76>]
>> ? int_ret_from_sys_call+0x7/0x1b
>> Feb 27 16:08:43 hostname5 kernel: [64271.712855]  [<ffffffff8149a6bc>]
>> ? retint_restore_args+0x5/0x6
>> Feb 27 16:08:43 hostname5 kernel: [64271.712861]  [<ffffffff8149ceb0>]
>> ? gs_change+0x13/0x13
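
Since the mailer wrapped every log line in two, the traces above are easier to read after re-joining them and dropping the frames marked with `?` (those are addresses found on the stack that the kernel could not confirm as part of the call chain). A rough sketch follows, assuming syslog-style lines that start with a `Mon DD HH:MM:SS` timestamp and with the `>> ` quote prefix already stripped. The names `rejoin` and `reliable_frames` are invented for this example.

```python
import re

# A syslog line starts with e.g. "Feb 27 16:08:43 ".
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")
# A stack frame: "[<address>] [? ]function+0xoff/0xsize".
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def rejoin(lines):
    """Glue wrapped continuation lines back onto the log line they belong to."""
    joined = []
    for line in lines:
        if TIMESTAMP.match(line) or not joined:
            joined.append(line.rstrip())
        else:
            joined[-1] += " " + line.strip()
    return joined


def reliable_frames(lines):
    """Return function names of frames that are not marked with '?'."""
    frames = []
    for line in rejoin(lines):
        match = FRAME.search(line)
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames


# A few wrapped lines copied from the drbd0_receiver trace above.
sample = [
    "Feb 27 16:08:43 hostname5 kernel: [64271.712382]  [<ffffffff810d1057>]",
    "? kfree+0x17/0xc0",
    "Feb 27 16:08:43 hostname5 kernel: [64271.712388]  [<ffffffff8149878a>]",
    "schedule+0x3a/0x60",
    "Feb 27 16:08:43 hostname5 kernel: [64271.712393]  [<ffffffff81498b95>]",
    "schedule_timeout+0x185/0x1e0",
    "Feb 27 16:08:43 hostname5 kernel: [64271.712413]  [<ffffffff813b63f1>]",
    "sk_wait_data+0xd1/0xe0",
]

if __name__ == "__main__":
    print(reliable_frames(sample))
```

Read this way, both drbd0_receiver and drbd0_asender on hostname5 appear to be sleeping in tcp_recvmsg, that is, waiting for data from the peer, though that reading is only as good as the frames the kernel printed.
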
>>
>>
>> --
>> climent () gmail ! com
>>
>> On Wed, Feb 27, 2013 at 9:25 AM, Jesus Climent <climent at gmail.com> wrote:
>>> I have managed to create a repro case: it happens when the system is
>>> under high I/O load. Out of a set of 4 test clusters, all except one
>>> got a "drbdsetup /dev/drbdX secondary" hang.
>>>
>>> What other information should I send to the list in order to evaluate
>>> this problem?
>>>
>>> On Wed, Feb 27, 2013 at 7:45 AM, Lars Ellenberg
>>> <lars.ellenberg at linbit.com> wrote:
>>>> On Tue, Feb 26, 2013 at 06:01:04PM -0500, Jesus Climent wrote:
>>>>> Has anybody taken a look into it?
>>>>
>>>> Yep.
>>>> Nothing obvious, sorry.
>>>>
>>>>         Lars
>>>> _______________________________________________
>>>> drbd-user mailing list
>>>> drbd-user at lists.linbit.com
>>>> http://lists.linbit.com/mailman/listinfo/drbd-user



--
climent () gmail ! com


