Lars Ellenberg wrote:
> / 2004-10-22 12:55:46 +0100
> \ Matthew Hodgson:
>
>>>hm.
>>>could you kick the kernel log daemon (klogd -i), and
>>>then trigger a sysrq Task dump (echo t > /proc/sysrq-trigger) ?
>>
>># klogd -i
>># echo t > /proc/sysrq-trigger
>># cat /var/log/kern.log
>>Oct 22 12:49:46 kernel: drbd0_receive D 00000001 4416 328 1 4542 280 (L-TLB)
>>Oct 22 12:49:46 kernel: Call Trace: [<c0105d12>] [<c0105eac>] [<f896ad33>] [<f897a1f9>] [<f896fe7b>]
>>Oct 22 12:49:46 kernel: [<f8969bad>] [<f897a1f9>] [<f89663ab>] [<f8969de8>] [<f897a7f0>] [<f896ff5a>]
>>Oct 22 12:49:46 kernel: [<c010578e>] [<f896fee0>]
>>Oct 22 12:49:46 kernel: drbd0_worker S 00000002 4572 4542 1 6151 328 (L-TLB)
>>Oct 22 12:49:46 kernel: Call Trace: [<c0311ca6>] [<c0105de9>] [<c0105eb7>] [<f8965174>] [<f897a46d>]
>>Oct 22 12:49:46 kernel: [<f896ff5a>] [<c010578e>] [<f896fee0>]
>>Oct 22 12:49:46 kernel: drbdsetup D 4000A490 0 6151 1 4542 (NOTLB)
>
> there we are.
> drbdsetup and drbd0_receiver deadlocking each other.
> wtf.
>
> btw, my hope was that if you had klogd running, those funny numbers
> would get decoded to kernel symbols...

oops - my bad; I haven't played with the commandline options for klogd
before - here we go with the symbols being dereferenced from (hopefully
the correct) System.map:

(apologies again for supersize lines...)

Oct 24 15:37:52 kernel: drbd0_receive D 00000001 4416 328 1 4542 280 (L-TLB)
Oct 24 15:37:52 kernel: Call Trace: [__down+114/192] [__down_failed+8/12] [drbd:drbd_asender+1923/2032] [drbd:__insmod_drbd_S.rodata_L838+25195/31698] [drbd:_set_cstate+139/560]
Oct 24 15:37:52 kernel: [drbd:drbd_send_handshake+173/656] [drbd:__insmod_drbd_S.rodata_L838+25195/31698] [drbd:drbd_connect+523/14688] [drbd:drbdd_init+88/2080] [drbd:__insmod_drbd_S.rodata_L838+26722/31698] [drbd:_set_cstate+362/560]
Oct 24 15:37:52 kernel: [arch_kernel_thread+46/64] [drbd:_set_cstate+240/560]
Oct 24 15:37:52 kernel: drbd0_worker S 00000002 4572 4542 1 6151 328 (L-TLB)
Oct 24 15:37:52 kernel: Call Trace: [sense_data_texts+934/1024] [__down_interruptible+137/240] [__down_failed_interruptible+7/12] [drbd:drbd_worker+1220/1776] [drbd:__insmod_drbd_S.rodata_L838+25823/31698]
Oct 24 15:37:52 kernel: [drbd:_set_cstate+362/560] [arch_kernel_thread+46/64] [drbd:_set_cstate+240/560]
Oct 24 15:37:52 kernel: drbdsetup D 4000A490 0 6151 1 11057 4542 (NOTLB)
Oct 24 15:37:52 kernel: Call Trace: [__down+114/192] [__down_failed+8/12] [drbd:restore_old_sigset+367/942] [drbd:drbd_send_sync_param+98/224] [drbd:drbd_set_state+1516/2304]
Oct 24 15:37:52 kernel: [drbd:drbd_ioctl+1918/4048] [blkdev_ioctl+53/64] [sys_ioctl+245/707] [system_call+51/56]

>>>or at least give the output of /proc/drbd, and
>>>ps -eo pid,comm,stat,wchan ?
>>
>># cat /proc/drbd
>>version: 0.7.5 (api:76/proto:74)
>>SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22
>> 0: cs:WFReportParams st:Primary/Unknown ld:Consistent
>>    ns:8715928 nr:0 dw:36426099 dr:7245115 al:8923 bm:2370 lo:0 pe:0 ua:0 ap:0
>> 1: cs:Unconfigured
>>
>># ps -eo pid,comm,stat,wchan
>>  PID COMMAND         STAT WCHAN
>>  328 drbd0_receiver  D    down
>> 4542 drbd0_worker    S    down_interruptible
>> 6151 drbdsetup       D    down
>
>
> exactly.
> one owns the semaphore, and waits for the other to die
> while the other tries to get the first semaphore itself.
> :(
>
> can you (by looking into heartbeat logfiles e.g.) figure out what this
> drbdsetup tries to do?

The drbdsetup you see there at process 6151 was one run by me a while
after the module hung with:

# drbdsetup /dev/drbd0 syncer -r 512000

I forget precisely why I was running it - I guess I was trying to nudge
it into reestablishing a connection to the slave. Or do you want to know
the syntax of the original setup? The DRBD was being run entirely on its
own - no heartbeat or drbdadm - just:

# modprobe drbd
# drbdsetup /dev/drbd0 disk /dev/sda3 internal -1
# drbdsetup /dev/drbd0 primary
# drbdsetup /dev/drbd0 net 10.0.0.2:7788 10.0.0.1:7788 C
# drbdsetup /dev/drbd0 syncer -r 512000
# mount /dev/drbd0 /mnt

to get the master up and running.

> in short, I think we need to down_interruptible sometimes where we
> currently use down.

If there's anything more I can do in trying to reproduce or investigate
where things have hung, just say.

best regards,

Matthew.

-- 
______________________________________________________________
Matthew Hodgson   matthew at mxtelecom.com   Tel: +44 845 6667778
Systems Analyst, MX Telecom Ltd.
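
For readers following the thread, the hang pattern Lars describes (one
thread owns the semaphore and waits for the other to die, while the
other blocks trying to take that same semaphore) can be modelled in user
space. The following is only a rough Python analogy, not DRBD or kernel
code: run_scenario, holder and waiter are hypothetical names, a
threading.Event stands in for a pending signal, and a short polling
timeout stands in for the way down_interruptible() can return early
where down() cannot.

```python
import threading


def run_scenario(interruptible, join_timeout=0.3):
    """Return True if the waiter exits while the holder still owns the semaphore.

    Hypothetical user-space model: the calling thread is the "holder",
    the spawned thread is the "waiter".
    """
    sem = threading.Semaphore(1)
    stop = threading.Event()      # stands in for a signal sent to the waiter
    entered = threading.Event()   # lets the holder know the waiter is running

    def waiter():
        entered.set()
        if interruptible:
            # Analogue of down_interruptible(): keep trying, but give up
            # as soon as a stop request is pending.
            while not stop.is_set():
                if sem.acquire(timeout=0.01):
                    sem.release()
                    return
        else:
            # Analogue of down(): block until the semaphore is free,
            # ignoring any stop request.
            sem.acquire()
            sem.release()

    sem.acquire()                 # holder takes the semaphore first
    t = threading.Thread(target=waiter, daemon=True)
    t.start()
    entered.wait()

    stop.set()                    # ask the waiter to die...
    t.join(join_timeout)          # ...and wait for it while still holding sem
    waiter_exited = not t.is_alive()

    # Demo only: release the semaphore so the blocked thread can finish.
    # In the reported hang nothing plays this role.
    sem.release()
    t.join()
    return waiter_exited


if __name__ == "__main__":
    print("uninterruptible waiter exited:", run_scenario(interruptible=False))
    print("interruptible waiter exited:  ", run_scenario(interruptible=True))
```

With the uninterruptible wait the holder's join times out, which is the
user-space counterpart of both tasks sitting in D state; with the
interruptible wait the waiter notices the stop request and exits, so the
holder can proceed. Whether this is the right change at every call site
in DRBD is a separate question that the sketch does not address.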