[DRBD-user] Found a new disk flush error code

Thomas Reinhold it-beratung at thomasreinhold.de
Tue Dec 23 10:08:16 CET 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

unfortunately, I have encountered a rather serious problem.

While running a series of I/O benchmarks / stress tests, I got the
following lockup:

> Dec 22 17:53:19 host kernel: BUG: soft lockup detected on CPU#5!
> Dec 22 17:53:19 host kernel:
> Dec 22 17:53:19 host kernel: Call Trace:
> Dec 22 17:53:19 host kernel:  <IRQ> [<ffffffff802a360e>] softlockup_tick+0xdb/0xed
> Dec 22 17:53:19 host kernel:  [<ffffffff8028783c>] update_process_times+0x42/0x68
> Dec 22 17:53:19 host kernel:  [<ffffffff8026c30d>] smp_local_timer_interrupt+0x23/0x47
> Dec 22 17:53:19 host kernel:  [<ffffffff8026ca01>] smp_apic_timer_interrupt+0x41/0x47
> Dec 22 17:53:19 host kernel:  [<ffffffff8025878a>] apic_timer_interrupt+0x66/0x6c
> Dec 22 17:53:19 host kernel:  <EOI> [<ffffffff8835478a>] :xfs:xfs_trans_update_ail+0x78/0xcd
> Dec 22 17:53:19 host kernel:  [<ffffffff88353961>] :xfs:xfs_trans_chunk_committed+0x9f/0xe4
> Dec 22 17:53:19 host kernel:  [<ffffffff883539f0>] :xfs:xfs_trans_committed+0x4a/0xdd
> Dec 22 17:53:19 host kernel:  [<ffffffff88349830>] :xfs:xlog_state_do_callback+0x173/0x31c
> Dec 22 17:53:19 host kernel:  [<ffffffff88360868>] :xfs:xfs_buf_iodone_work+0x0/0x37
> Dec 22 17:53:19 host kernel:  [<ffffffff88349ac1>] :xfs:xlog_iodone+0xe8/0x10b
> Dec 22 17:53:19 host kernel:  [<ffffffff80249152>] run_workqueue+0x94/0xe5
> Dec 22 17:53:19 host kernel:  [<ffffffff80245aec>] worker_thread+0x0/0x122
> Dec 22 17:53:19 host kernel:  [<ffffffff8028f823>] keventd_create_kthread+0x0/0x61
> Dec 22 17:53:19 host kernel:  [<ffffffff80245bdc>] worker_thread+0xf0/0x122
> Dec 22 17:53:19 host kernel:  [<ffffffff8027c8e1>] default_wake_function+0x0/0xe
> Dec 22 17:53:19 host kernel:  [<ffffffff8028f823>] keventd_create_kthread+0x0/0x61
> Dec 22 17:53:19 host kernel:  [<ffffffff8028f823>] keventd_create_kthread+0x0/0x61
> Dec 22 17:53:19 host kernel:  [<ffffffff8023057c>] kthread+0xd4/0x107
> Dec 22 17:53:19 host kernel:  [<ffffffff80258aa0>] child_rip+0xa/0x12
> Dec 22 17:53:19 host kernel:  [<ffffffff8028f823>] keventd_create_kthread+0x0/0x61
> Dec 22 17:53:19 host kernel:  [<ffffffff802304a8>] kthread+0x0/0x107
> Dec 22 17:53:19 host kernel:  [<ffffffff80258a96>] child_rip+0x0/0x12

This happened on two CPU cores at the same time. The system is still
responsive, but the affected xfs and pdflush threads have entered
state D and cannot be stopped. The filesystem cannot be unmounted,
nor can the system be shut down gracefully.
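
In case it helps: this is roughly how I found the stuck threads (a
quick sketch; the output columns may differ with your procps version):

  # list tasks in uninterruptible sleep (state D) and what they block on
  ps axo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'

The xfs and pdflush threads show up there and never leave state D.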


The question is whether this could be related to DRBD. I'm becoming
more and more convinced that the issue is caused by the "certified"
SCSI driver not working properly, but I want to rule out that DRBD
is involved.
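
To rule it out, my plan is to take DRBD out of the stack and rerun the
same load directly against the backing device (a sketch only: "r0" and
/dev/sdb are placeholders from my setup, and the dd run is destructive,
so scratch devices only):

  drbdadm down r0    # stop the DRBD resource
  # same O_DIRECT write load as the benchmark, with an fsync at the end
  dd if=/dev/zero of=/dev/sdb bs=1M count=4096 oflag=direct conv=fsync

If the lockup reproduces without DRBD anywhere in the stack, the SCSI
driver is the prime suspect.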

Thanks,

   Thomas
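
PS: For the archives, this is roughly what the flush-disable discussed
below looks like in drbd.conf (DRBD 8.x syntax; "r0" is just my
example resource name):

  resource r0 {
    disk {
      no-disk-flushes;  # do not issue cache flushes to the data device
      no-md-flushes;    # do not issue cache flushes for the meta-data
    }
  }

As discussed below, this is only safe if the cache is battery-backed
or you can tolerate losing cached writes on power failure.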


On 19.12.2008, at 20:44, Thomas Reinhold wrote:

>
> On 18.12.2008, at 17:27, Lars Ellenberg wrote:
>
>> On Thu, Dec 18, 2008 at 04:46:10PM +0100, Thomas Reinhold wrote:
>>> Hi,
>>>
>>> I've done a little further testing and ran DRBD directly on top
>>> of the RAID set (without using dm_crypt). I still got the same
>>> disk flush errors with flushing enabled.
>>>
>>> So can I assume that either the lower-level SCSI driver
>>> megaraid_sas (Debian 2.6.18.6-amd64) or the RAID controller
>>> (LSI MegaRAID 1078) does not support flushing?
>>
>> absolutely.
>>
>>> And another question: can disabling flushing in DRBD cause any
>>> problems other than data corruption at power loss?
>>
>> I'd say "no" if you promise not to sue me in case I'm wrong.
>
> How could I sue you for using a free product? I would have to buy a  
> support contract first ;-)
>
> Anyway, thanks for your help! I have disabled the RAID controller
> cache for now, as I dislike the idea of having too much data in
> cache (even though we are using a UPS).
>
> The disk caches are still enabled, however, as the performance  
> impact of disabling both caches would be too great. We'll see how  
> that works with XFS.
>
> If we encounter any problems, I'll get back to the list.
>
>
> Regards,
>
>   Thomas
>
>
>>
>>
>> -- 
>> : Lars Ellenberg
>> : LINBIT | Your Way to High Availability
>> : DRBD/HA support and consulting http://www.linbit.com
>>
>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>> __
>> please don't Cc me, but send to list   --   I'm subscribed
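
PPS: Regarding the cache settings quoted above, the per-disk write
cache can be checked with sdparm if the drives are visible to the OS
(which may not be the case behind every MegaRAID configuration):

  # query the WCE (write cache enable) bit of a drive
  sdparm --get=WCE /dev/sda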


