Lars Ellenberg wrote:
> / 2004-10-22 12:55:46 +0100
> \ Matthew Hodgson:
>
>>>hm.
>>>could you kick the kernel log daemon (klogd -i), and
>>>then trigger a sysrq Task dump (echo t > /proc/sysrq-trigger) ?
>>
>># klogd -i
>># echo t > /proc/sysrq-trigger
>># cat /var/log/kern.log
>>Oct 22 12:49:46 kernel: drbd0_receive D 00000001 4416 328 1 4542 280 (L-TLB)
>>Oct 22 12:49:46 kernel: Call Trace: [<c0105d12>] [<c0105eac>] [<f896ad33>] [<f897a1f9>] [<f896fe7b>]
>>Oct 22 12:49:46 kernel: [<f8969bad>] [<f897a1f9>] [<f89663ab>] [<f8969de8>] [<f897a7f0>] [<f896ff5a>]
>>Oct 22 12:49:46 kernel: [<c010578e>] [<f896fee0>]
>>Oct 22 12:49:46 kernel: drbd0_worker S 00000002 4572 4542 1 6151 328 (L-TLB)
>>Oct 22 12:49:46 kernel: Call Trace: [<c0311ca6>] [<c0105de9>] [<c0105eb7>] [<f8965174>] [<f897a46d>]
>>Oct 22 12:49:46 kernel: [<f896ff5a>] [<c010578e>] [<f896fee0>]
>>Oct 22 12:49:46 kernel: drbdsetup D 4000A490 0 6151 1 4542 (NOTLB)
>
> there we are.
> drbdsetup and drbd0_receiver deadlocking each other.
> wtf.
>
> btw, my hope was that if you had klogd running, those funny numbers
> would get decoded to kernel symbols...
oops - my bad; I haven't played with the command-line options for
klogd before - here we go with the addresses resolved to symbols from
(hopefully the correct) System.map:
(apologies again for the supersize lines...)
Oct 24 15:37:52 kernel: drbd0_receive D 00000001 4416 328 1 4542 280 (L-TLB)
Oct 24 15:37:52 kernel: Call Trace: [__down+114/192] [__down_failed+8/12] [drbd:drbd_asender+1923/2032] [drbd:__insmod_drbd_S.rodata_L838+25195/31698] [drbd:_set_cstate+139/560]
Oct 24 15:37:52 kernel: [drbd:drbd_send_handshake+173/656] [drbd:__insmod_drbd_S.rodata_L838+25195/31698] [drbd:drbd_connect+523/14688] [drbd:drbdd_init+88/2080] [drbd:__insmod_drbd_S.rodata_L838+26722/31698] [drbd:_set_cstate+362/560]
Oct 24 15:37:52 kernel: [arch_kernel_thread+46/64] [drbd:_set_cstate+240/560]
Oct 24 15:37:52 kernel: drbd0_worker S 00000002 4572 4542 1 6151 328 (L-TLB)
Oct 24 15:37:52 kernel: Call Trace: [sense_data_texts+934/1024] [__down_interruptible+137/240] [__down_failed_interruptible+7/12] [drbd:drbd_worker+1220/1776] [drbd:__insmod_drbd_S.rodata_L838+25823/31698]
Oct 24 15:37:52 kernel: [drbd:_set_cstate+362/560] [arch_kernel_thread+46/64] [drbd:_set_cstate+240/560]
Oct 24 15:37:52 kernel: drbdsetup D 4000A490 0 6151 1 11057 4542 (NOTLB)
Oct 24 15:37:52 kernel: Call Trace: [__down+114/192] [__down_failed+8/12] [drbd:restore_old_sigset+367/942] [drbd:drbd_send_sync_param+98/224] [drbd:drbd_set_state+1516/2304]
Oct 24 15:37:52 kernel: [drbd:drbd_ioctl+1918/4048] [blkdev_ioctl+53/64] [sys_ioctl+245/707] [system_call+51/56]
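To pick the blocked tasks out of a dump like the one above without reading every line, something along these lines might do. It is only a sketch: the function name blocked_tasks is made up for illustration, and it assumes the log layout shown here (three timestamp fields, then "kernel:", then task name and state, with no hostname field in between).

```shell
# Hypothetical helper: list tasks in uninterruptible sleep (state D)
# from a sysrq-t task dump read on stdin.
# Assumed line layout, as in the log excerpt above:
#   Mon DD HH:MM:SS kernel: <task> <state> ...
blocked_tasks() {
    # Field 4 is the "kernel:" tag, field 5 the task name, field 6 its state.
    # "Call Trace:" and continuation lines never have a bare "D" in field 6.
    awk '$4 == "kernel:" && $6 == "D" { print $5 }'
}
```

It would be run as "blocked_tasks < /var/log/kern.log"; if the syslog format carries a hostname, the field numbers shift by one.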
>>>or at least give the output of /proc/drbd, and
>>>ps -eo pid,comm,stat,wchan ?
>>
>># cat /proc/drbd
>>version: 0.7.5 (api:76/proto:74)
>>SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22
>> 0: cs:WFReportParams st:Primary/Unknown ld:Consistent
>> ns:8715928 nr:0 dw:36426099 dr:7245115 al:8923 bm:2370 lo:0 pe:0 ua:0 ap:0
>> 1: cs:Unconfigured
>>
>># ps -eo pid,comm,stat,wchan
>> PID COMMAND STAT WCHAN
>> 328 drbd0_receiver D down
>> 4542 drbd0_worker S down_interruptible
>> 6151 drbdsetup D down
>
>
> exactly.
> one owns the semaphore, and waits for the other to die
> while the other tries to get the first semaphore itself.
> :(
>
> can you (by looking into heartbeat logfiles e.g.) figure out what this
> drbdsetup tries to do?
The drbdsetup you see there at process 6151 was one run by
me a while after the module hung with:
# drbdsetup /dev/drbd0 syncer -r 512000
I forget precisely why I was running it - I guess I was
trying to nudge it into reestablishing a connection to the
slave.
Or do you want to know the syntax of the original setup?
The DRBD was being run entirely on its own - no heartbeat
or drbdadm - just:
# modprobe drbd
# drbdsetup /dev/drbd0 disk /dev/sda3 internal -1
# drbdsetup /dev/drbd0 primary
# drbdsetup /dev/drbd0 net 10.0.0.2:7788 10.0.0.1:7788 C
# drbdsetup /dev/drbd0 syncer -r 512000
# mount /dev/drbd0 /mnt
to get the master up and running.
> in short, I think we need to down_interruptible sometimes where we
> currently use down.
If there's anything more I can do in trying to reproduce
or investigate where things have hung, just say.
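For spotting the same situation again, a rough filter over the ps output requested earlier could look like the following. This is a sketch, not a tested tool: the name stuck_tasks is hypothetical, and it assumes command names contain no spaces, since awk splits on whitespace.

```shell
# Hypothetical helper: given the output of
#   ps -eo pid,comm,stat,wchan
# on stdin, print pid, command and wait channel of every task whose
# STAT begins with D (uninterruptible sleep).
stuck_tasks() {
    # NR > 1 skips the header line; field 3 is STAT, field 4 is WCHAN.
    awk 'NR > 1 && $3 ~ /^D/ { print $1, $2, $4 }'
}
```

Usage would be "ps -eo pid,comm,stat,wchan | stuck_tasks". Two tasks both sitting in "down" for a long time, as here, would be the hint that they may be waiting on each other.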
best regards,
Matthew.
--
______________________________________________________________
Matthew Hodgson matthew at mxtelecom.com Tel: +44 845 6667778
Systems Analyst, MX Telecom Ltd.