[DRBD-user] Several cases of hangups found. I have some stack traces to send.

Jesus Climent climent at gmail.com
Wed Feb 27 19:50:17 CET 2013



I have been digging a bit more into the current state, and these are my findings:

When a migration is just about to finish, the secondary node receives
the signal to become primary. It is at this point, when drbdsetup is
trying to set the other node to secondary, that it fails on the primary.

This is the drbdsetup process in D state on hostname6:
root     27691  0.0  0.0   3964   536 ?        D    00:06   0:00 drbdsetup /dev/drbd1 secondary
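
A minimal sketch, assuming the usual 11-column `ps aux` layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND), of how a process stuck in uninterruptible sleep like the one above could be picked out of a saved process listing. The function name `d_state_processes` is made up for illustration and is not part of any DRBD tooling.

```python
def d_state_processes(ps_aux_output):
    """Return (pid, command) for every process whose STAT field starts with 'D'."""
    hung = []
    for line in ps_aux_output.splitlines():
        # Split into at most 11 fields so the command keeps its own spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something that is not a process row
        if fields[7].startswith("D"):  # STAT column: D = uninterruptible sleep
            hung.append((int(fields[1]), fields[10]))
    return hung


# The line reported on hostname6, joined back onto one row.
sample = (
    "root     27691  0.0  0.0   3964   536 ?        D    00:06   0:00 "
    "drbdsetup /dev/drbd1 secondary"
)
print(d_state_processes(sample))
```

Run against a full `ps aux` dump, this would list every D-state process, not only drbdsetup, which may help show whether other tasks are blocked at the same time.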

hostname6  1 inst-test3.google.com disk/0 secondary hostname5
hostname5  0 inst-test3.google.com disk/0 primary hostname6

node: hostname5
  0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r---b-

node: hostname6
  1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r---b-

[xen-test] root at hostname5:~# drbdsetup /dev/drbd0 status
<resource minor="0" cs="Connected" ro1="Primary" ro2="Primary"
ds1="UpToDate" ds2="UpToDate" />
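
The two per-node status lines above do not describe the same picture: hostname5 reports its peer as Primary, while hostname6 reports itself as Secondary. A small sketch, assuming the `ro:` field is written as local/peer (as in DRBD 8.x `/proc/drbd` output), that makes this cross-check explicit. The helper names `parse_status` and `roles_agree` are hypothetical.

```python
import re

# Matches the cs:/ro:/ds: fields of a /proc/drbd-style status line.
STATUS_RE = re.compile(r"cs:(\S+)\s+ro:(\w+)/(\w+)\s+ds:(\w+)/(\w+)")


def parse_status(line):
    """Extract connection state, roles and disk states from one status line."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError("not a DRBD status line: %r" % line)
    cs, ro_local, ro_peer, ds_local, ds_peer = match.groups()
    return {
        "cs": cs,
        "ro_local": ro_local,
        "ro_peer": ro_peer,
        "ds_local": ds_local,
        "ds_peer": ds_peer,
    }


def roles_agree(view_a, view_b):
    """True if each node's view of its peer matches the peer's view of itself."""
    return (
        view_a["ro_local"] == view_b["ro_peer"]
        and view_a["ro_peer"] == view_b["ro_local"]
    )


h5 = parse_status("0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r---b-")
h6 = parse_status("1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r---b-")
print(roles_agree(h5, h6))  # False: hostname5 sees its peer as Primary, hostname6 says Secondary
```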

hostname6's trace is here:

http://db.tt/FI8rmrpw

And here are the relevant pieces from hostname5. If more are needed, I can post them.

Feb 27 16:08:43 hostname5 kernel: [64271.712343] drbd0_receiver  S
ffff88003e411dc0     0 19682      2 0x00000000
Feb 27 16:08:43 hostname5 kernel: [64271.712351]  ffff88002b205970
0000000000000246 ffff88002b2058f0 ffff880016daea00
Feb 27 16:08:43 hostname5 kernel: [64271.712360]  000000000000000c
0000000000000000 ffff8800020fd160 0000000000011dc0
Feb 27 16:08:43 hostname5 kernel: [64271.712369]  ffff88002b205fd8
ffff88002b204010 ffff88002b205fd8 0000000000011dc0
Feb 27 16:08:43 hostname5 kernel: [64271.712377] Call Trace:
Feb 27 16:08:43 hostname5 kernel: [64271.712382]  [<ffffffff810d1057>]
? kfree+0x17/0xc0
Feb 27 16:08:43 hostname5 kernel: [64271.712388]  [<ffffffff8149878a>]
schedule+0x3a/0x60
Feb 27 16:08:43 hostname5 kernel: [64271.712393]  [<ffffffff81498b95>]
schedule_timeout+0x185/0x1e0
Feb 27 16:08:43 hostname5 kernel: [64271.712400]  [<ffffffff8104c3e2>]
? local_bh_enable_ip+0x22/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712406]  [<ffffffff8149a444>]
? _raw_spin_unlock_bh+0x14/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712413]  [<ffffffff813b63f1>]
sk_wait_data+0xd1/0xe0
Feb 27 16:08:43 hostname5 kernel: [64271.712419]  [<ffffffff810622b0>]
? wake_up_bit+0x40/0x40
Feb 27 16:08:43 hostname5 kernel: [64271.712425]  [<ffffffff8104be62>]
? local_bh_enable+0x22/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712431]  [<ffffffff81403d01>]
tcp_recvmsg+0x651/0xc80
Feb 27 16:08:43 hostname5 kernel: [64271.712437]  [<ffffffff81424f7a>]
inet_recvmsg+0x4a/0x80
Feb 27 16:08:43 hostname5 kernel: [64271.712444]  [<ffffffff81005485>]
? arbitrary_virt_to_machine+0x85/0xb0
Feb 27 16:08:43 hostname5 kernel: [64271.712450]  [<ffffffff813b1961>]
sock_recvmsg+0xc1/0xf0
Feb 27 16:08:43 hostname5 kernel: [64271.712456]  [<ffffffff81003129>]
? xen_end_context_switch+0x19/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712462]  [<ffffffff81009915>]
? __switch_to+0x145/0x370
Feb 27 16:08:43 hostname5 kernel: [64271.712468]  [<ffffffff8103e75e>]
? finish_task_switch+0x5e/0xc0
Feb 27 16:08:43 hostname5 kernel: [64271.712475]  [<ffffffff814981ea>]
? __schedule+0x29a/0x760
Feb 27 16:08:43 hostname5 kernel: [64271.712481]  [<ffffffff8149a279>]
? _raw_spin_unlock_irqrestore+0x19/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712493]  [<ffffffffa00569ec>]
drbd_recv+0x8c/0x230 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712505]  [<ffffffffa0059bac>]
? drbd_may_finish_epoch+0x9c/0x3a0 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712517]  [<ffffffffa0057a0e>]
drbd_recv_header+0x2e/0x130 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712528]  [<ffffffffa00583e6>]
drbdd+0x46/0x200 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712540]  [<ffffffffa005e2e5>]
drbdd_init+0x85/0x130 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712551]  [<ffffffffa006a174>]
drbd_thread_setup+0x64/0xf0 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712563]  [<ffffffffa006a110>]
? _drbd_thread_stop+0x100/0x100 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712569]  [<ffffffff81061e06>]
kthread+0x96/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712576]  [<ffffffff8149ceb4>]
kernel_thread_helper+0x4/0x10
Feb 27 16:08:43 hostname5 kernel: [64271.712582]  [<ffffffff8149af76>]
? int_ret_from_sys_call+0x7/0x1b
Feb 27 16:08:43 hostname5 kernel: [64271.712589]  [<ffffffff8149a6bc>]
? retint_restore_args+0x5/0x6
Feb 27 16:08:43 hostname5 kernel: [64271.712595]  [<ffffffff8149ceb0>]
? gs_change+0x13/0x13
Feb 27 16:08:43 hostname5 kernel: [64271.712599] drbd0_asender   S
ffff88003e411dc0     0 19689      2 0x00000000
Feb 27 16:08:43 hostname5 kernel: [64271.712607]  ffff8800249739a0
0000000000000246 ffff880024973920 ffffffff810cd8c8
Feb 27 16:08:43 hostname5 kernel: [64271.712616]  000000000000a570
ffff880016ea94b0 ffff8800020f8000 0000000000011dc0
Feb 27 16:08:43 hostname5 kernel: [64271.712625]  ffff880024973fd8
ffff880024972010 ffff880024973fd8 0000000000011dc0
Feb 27 16:08:43 hostname5 kernel: [64271.712633] Call Trace:
Feb 27 16:08:43 hostname5 kernel: [64271.712637]  [<ffffffff810cd8c8>]
? ksize+0x18/0xc0
Feb 27 16:08:43 hostname5 kernel: [64271.712665]  [<ffffffff8149a22a>]
? _raw_spin_lock_irqsave+0x2a/0x40
Feb 27 16:08:43 hostname5 kernel: [64271.712671]  [<ffffffff8149878a>]
schedule+0x3a/0x60
Feb 27 16:08:43 hostname5 kernel: [64271.712677]  [<ffffffff81498b45>]
schedule_timeout+0x135/0x1e0
Feb 27 16:08:43 hostname5 kernel: [64271.712684]  [<ffffffff81052540>]
? add_timer_on+0xa0/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712690]  [<ffffffff813b63f1>]
sk_wait_data+0xd1/0xe0
Feb 27 16:08:43 hostname5 kernel: [64271.712696]  [<ffffffff810622b0>]
? wake_up_bit+0x40/0x40
Feb 27 16:08:43 hostname5 kernel: [64271.712702]  [<ffffffff8104be62>]
? local_bh_enable+0x22/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712708]  [<ffffffff81403d01>]
tcp_recvmsg+0x651/0xc80
Feb 27 16:08:43 hostname5 kernel: [64271.712715]  [<ffffffff810085d1>]
? m2p_remove_override+0x251/0x2f0
Feb 27 16:08:43 hostname5 kernel: [64271.712721]  [<ffffffff81424f7a>]
inet_recvmsg+0x4a/0x80
Feb 27 16:08:43 hostname5 kernel: [64271.712727]  [<ffffffff813b1961>]
sock_recvmsg+0xc1/0xf0
Feb 27 16:08:43 hostname5 kernel: [64271.712733]  [<ffffffff810d0ac5>]
? kmem_cache_free+0x15/0x90
Feb 27 16:08:43 hostname5 kernel: [64271.712740]  [<ffffffff81096442>]
? mempool_free_slab+0x12/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712747]  [<ffffffff81096555>]
? mempool_free+0x85/0x90
Feb 27 16:08:43 hostname5 kernel: [64271.712758]  [<ffffffffa005ff0a>]
? _req_may_be_done+0x12a/0x4e0 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712765]  [<ffffffff8103a02e>]
? __wake_up+0x4e/0x70
Feb 27 16:08:43 hostname5 kernel: [64271.712771]  [<ffffffff8149a279>]
? _raw_spin_unlock_irqrestore+0x19/0x20
Feb 27 16:08:43 hostname5 kernel: [64271.712777]  [<ffffffff8103a02e>]
? __wake_up+0x4e/0x70
Feb 27 16:08:43 hostname5 kernel: [64271.712788]  [<ffffffffa00541b3>]
drbd_recv_short+0x73/0x90 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712800]  [<ffffffffa0059569>]
drbd_asender+0x189/0x730 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712806]  [<ffffffff8103e75e>]
? finish_task_switch+0x5e/0xc0
Feb 27 16:08:43 hostname5 kernel: [64271.712818]  [<ffffffffa006a174>]
drbd_thread_setup+0x64/0xf0 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712829]  [<ffffffffa006a110>]
? _drbd_thread_stop+0x100/0x100 [drbd]
Feb 27 16:08:43 hostname5 kernel: [64271.712836]  [<ffffffff81061e06>]
kthread+0x96/0xa0
Feb 27 16:08:43 hostname5 kernel: [64271.712842]  [<ffffffff8149ceb4>]
kernel_thread_helper+0x4/0x10
Feb 27 16:08:43 hostname5 kernel: [64271.712849]  [<ffffffff8149af76>]
? int_ret_from_sys_call+0x7/0x1b
Feb 27 16:08:43 hostname5 kernel: [64271.712855]  [<ffffffff8149a6bc>]
? retint_restore_args+0x5/0x6
Feb 27 16:08:43 hostname5 kernel: [64271.712861]  [<ffffffff8149ceb0>]
? gs_change+0x13/0x13


--
climent () gmail ! com

On Wed, Feb 27, 2013 at 9:25 AM, Jesus Climent <climent at gmail.com> wrote:
> I have managed to create a repro case, when the system is under a high
> load of I/O. From a set of 4 test clusters, all except one got a
> "drbdsetup /dev/drbdX secondary" hang.
>
> What other information should I send to the list in order to evaluate
> this problem?
>
> On Wed, Feb 27, 2013 at 7:45 AM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
>> On Tue, Feb 26, 2013 at 06:01:04PM -0500, Jesus Climent wrote:
>>> Has anybody taken a look into it?
>>
>> Yep.
>> Nothing obvious, sorry.
>>
>>         Lars
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>
>
> --
> climent () gmail ! com





