[DRBD-user] Found a new disk flush error code

Tue Dec 23 13:17:50 CET 2008

On Tue, Dec 23, 2008 at 10:08:16AM +0100, Thomas Reinhold wrote:
> Hi,
>
> unfortunately, I have encountered some rather serious problems.
>
> While running a series of io benchmarks / stress tests, I got the  
> following lockup:
>
>> Dec 22 17:53:19 host kernel: BUG: soft lockup detected on CPU#5!
>> Dec 22 17:53:19 host kernel:
>> Dec 22 17:53:19 host kernel: Call Trace:
>> Dec 22 17:53:19 host kernel:  <IRQ> [<ffffffff802a360e>] softlockup_tick+0xdb/0xed
>> Dec 22 17:53:19 host kernel:  [<ffffffff8028783c>] update_process_times+0x42/0x68
>> Dec 22 17:53:19 host kernel:  [<ffffffff8026c30d>] smp_local_timer_interrupt+0x23/0x47
>> Dec 22 17:53:19 host kernel:  [<ffffffff8026ca01>] smp_apic_timer_interrupt+0x41/0x47
>> Dec 22 17:53:19 host kernel:  [<ffffffff8025878a>] apic_timer_interrupt+0x66/0x6c
>> Dec 22 17:53:19 host kernel:  <EOI> [<ffffffff8835478a>]:xfs:xfs_trans_update_ail+0x78/0xcd
>> Dec 22 17:53:19 host kernel:  [<ffffffff88353961>]:xfs:xfs_trans_chunk_committed+0x9f/0xe4
>> Dec 22 17:53:19 host kernel:  [<ffffffff883539f0>]:xfs:xfs_trans_committed+0x4a/0xdd
>> Dec 22 17:53:19 host kernel:  [<ffffffff88349830>]:xfs:xlog_state_do_callback+0x173/0x31c
>> Dec 22 17:53:19 host kernel:  [<ffffffff88360868>]:xfs:xfs_buf_iodone_work+0x0/0x37
>> Dec 22 17:53:19 host kernel:  [<ffffffff88349ac1>] :xfs:xlog_iodone+0xe8/0x10b
>> Dec 22 17:53:19 host kernel:  [<ffffffff80249152>] run_workqueue+0x94/0xe5
>> Dec 22 17:53:19 host kernel:  [<ffffffff80245aec>] worker_thread+0x0/0x122
>> Dec 22 17:53:19 host kernel:  [<ffffffff8028f823>] keventd_create_kthread+0x0/0x61
>> Dec 22 17:53:19 host kernel:  [<ffffffff80245bdc>] worker_thread+0xf0/0x122
>> Dec 22 17:53:19 host kernel:  [<ffffffff8027c8e1>] default_wake_function+0x0/0xe
>> Dec 22 17:53:19 host kernel:  [<ffffffff8028f823>] keventd_create_kthread+0x0/0x61
>> Dec 22 17:53:19 host kernel:  [<ffffffff8028f823>] keventd_create_kthread+0x0/0x61
>> Dec 22 17:53:19 host kernel:  [<ffffffff8023057c>] kthread+0xd4/0x107
>> Dec 22 17:53:19 host kernel:  [<ffffffff80258aa0>] child_rip+0xa/0x12
>> Dec 22 17:53:19 host kernel:  [<ffffffff8028f823>] keventd_create_kthread+0x0/0x61
>> Dec 22 17:53:19 host kernel:  [<ffffffff802304a8>] kthread+0x0/0x107
>> Dec 22 17:53:19 host kernel:  [<ffffffff80258a96>] child_rip+0x0/0x12
>
> This happened on two CPU cores at the same time. The system is
> responsive, but the respective xfs and pdflush threads entered state D
> and cannot be stopped. Neither can the filesystem be unmounted nor the
> system be gracefully shut down.

and _where_ do they get stuck?
ps -eo pid,state,wchan:40,cmd 
echo w > /proc/sysrq-trigger
# echo t > /proc/sysrq-trigger # if you really can read those
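
A minimal sketch of narrowing that ps output down to the stuck tasks only.
The helper name dstate_tasks is made up for illustration; it assumes nothing
beyond the column layout produced by the ps call above.

```shell
#!/bin/sh
# dstate_tasks: hypothetical helper, not part of any tool.
# Reads `ps -eo pid,state,wchan:40,cmd` output on stdin and prints only
# tasks in uninterruptible sleep (state D), including the kernel function
# (wchan) they are waiting in.
dstate_tasks() {
    # NR > 1 skips the header line; $2 is the state column.
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# Usage on a live system:
#   ps -eo pid,state,wchan:40,cmd | dstate_tasks
```

If the wchan column shows the same function for all stuck threads, that is
usually the first place to look.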

> The question is if that could be related to DRBD?

hard to say yes or no without more information
about your setup and the nature of your stress tests.
you have to investigate that yourself, I guess.

drbd status during such periods?

drbd messages or other "interesting" messages?

does it help to disconnect (physically, if necessary) drbd?

does it also happen with disconnected drbd (StandAlone)?
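
A rough sketch of how the questions above could be checked from a shell.
It assumes a DRBD 8.x setup where /proc/drbd exists and each device line
carries a cs: (connection state) field; the helper name drbd_cstates is
made up for illustration.

```shell
#!/bin/sh
# drbd_cstates: hypothetical helper for illustration.
# Reads /proc/drbd-style text on stdin and prints "minor: connection state"
# for each device line, e.g. "0: Connected" or "0: StandAlone".
drbd_cstates() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^cs:/) { print $1, substr($i, 4) }
    }'
}

# Usage on a live system (assumes DRBD 8.x with /proc/drbd):
#   drbd_cstates < /proc/drbd     # drbd status during the lockup
#   dmesg | grep -i drbd          # drbd messages around that time
#   drbdadm disconnect all        # go StandAlone, then re-run the test
```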

> I'm getting more and more convinced that this issue is due to the
> "certified" scsi driver not working properly,

why is that?

> but I just want to rule out that DRBD is involved.

then reproduce without DRBD ;)

what DRBD adds to the picture is its own bugs, potentially, of course.
then, some overhead of its own,
network and disk io at the same time,
some more memory pressure,
some different timing.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed