Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
/ 2004-04-26 18:01:12 +0400
\ Eugene Crosser:
> This I got from today's CVS version when a would be secondary booted and
> tried to connect. nfsa1 was primary, nfsa2 was freshly booted. The
> block device apparently stopped responding.
>
> Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428336000,<NULL>)
<NULL> should read "WRITE" or "READ"...
corresponding code reads: printk("... %s)\n", ..., rw ? "WRITE" : "READ");
and I thought I had disabled that debugging aid again.
will get commented out in one of the next commits, I guess.
though the <NULL> is strange nevertheless, and worries me a little bit.
please make sure you got a clean checkout and compile.
> Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428331017,<NULL>)
> Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: al_ext->lc_number = 19682 in drivers/block/drbd/drbd_actlog.c:191
> Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: mdev->act_log->new_number = 19828 in drivers/block/drbd/drbd_actlog.c:192
this is also some debug message which I think was commented out again?
it should be, anyways.
> Apr 26 17:41:28 nfsa2.mail.back ntpd[127]: kernel time discipline status 0040
oh? ntpd? whats up? anyways, seems to be unrelated...
> Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428336000,<NULL>)
> Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428331020,<NULL>)
> Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: al_ext->lc_number = 19683 in drivers/block/drbd/drbd_actlog.c:191
> Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: mdev->act_log->new_number = 19831 in drivers/block/drbd/drbd_actlog.c:192
> Apr 26 17:41:28 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428331008,<NULL>)
> Apr 26 17:41:28 nfsa2.mail.back kernel: drbd0: size = 214165504 KB
> Apr 26 17:41:29 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428331101,<NULL>)
> Apr 26 17:41:29 nfsa2.mail.back kernel: sync_page_io(,428332588,<NULL>)
oops, syslog lost messages? kernel prink buffer wrapped around?
> Apr 26 17:41:29 nfsa1.mail.back kernel: drbd0: Connection established.
established while drbdsetup is still creating the bitmap area?
we need to have a look here.
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock_sendmsg returned -104
#define ECONNRESET 104 /* Connection reset by peer */
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short sent ReportBitMap size=4096 sent=3696
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: drbd0_receiver [141]:drbd_md_sync_page_io(,428331008,<NULL>)
receiver updates GCs.
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock_sendmsg returned -32
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short sent WriteHint size=8 sent=0
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: A work callback returned not ok!
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: worker terminated
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: meta connection shut down by peer.
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: asender terminated
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock was shut down by peer
there is the verbose error reporting, too.
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short read expecting header on sock: r=0
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: drbd0_receiver [141]:drbd_md_sync_page_io(,428331008,<NULL>)
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: Connection lost.
exactly.
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332632,<NULL>)
...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332639,<NULL>)
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: Connection established.
aha. again, should not yet connect, still not setup!
but maybe even the debug printk is the problem here, so...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332640,<NULL>)
...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332647,<NULL>)
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock_sendmsg returned -104
again.
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short sent ReportBitMap size=4096 sent=3424
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: drbd0_receiver [141]:drbd_md_sync_page_io(,428331008,<NULL>)
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: meta connection shut down by peer.
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: asender terminated
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock was shut down by peer
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short read expecting header on sock: r=0
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332648,<NULL>)
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332649,<NULL>)
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_synge_io(,428343933,<NULL>)
again, lost messages...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343934,<NULL>)
...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343946,<NULL>)
And, for the fun of it, a nice little oops.
> Apr 26 17:41:30 nfsa1.mail.back kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000038
> Apr 26 17:41:30 nfsa1.mail.back kernel: printing eip:
> Apr 26 17:41:30 nfsa1.mail.back kernel: c02b1983
> Apr 26 17:41:30 nfsa1.mail.back kernel: *pde = 00000000
> Apr 26 17:41:30 nfsa1.mail.back kernel: Oops: 0000 [#1]
> Apr 26 17:41:30 nfsa1.mail.back kernel: SMP
> Apr 26 17:41:30 nfsa1.mail.back kernel: CPU: 1
> Apr 26 17:41:30 nfsa1.mail.back kernel: EIP: 0060:[<c02b1983>] Not tainted
> Apr 26 17:41:30 nfsa1.mail.back kernel: EFLAGS: 00010286 (2.6.6-rc2)
> Apr 26 17:41:30 nfsa1.mail.back kernel: EIP is at sock_sendmsg+0x9b/0xca
> Apr 26 17:41:30 nfsa1.mail.back kernel: eax: 00000000 ebx: 00000008 ecx: c728c380 edx: f68ad460
> Apr 26 17:41:30 nfsa1.mail.back kernel: esi: d263defc edi: d263de2c ebp: 00000000 esp: d263de1c
> Apr 26 17:41:30 nfsa1.mail.back kernel: ds: 007b es: 007b ss: 0068
> Apr 26 17:41:30 nfsa1.mail.back kernel: Process drbd0_worker (pid: 548, threadinfo=d263c000 task=f68ad460)
> Apr 26 17:41:30 nfsa1.mail.back kernel: Stack: d263de2c c728c380 d263defc 00000008 f68ad460 00002697 00000000 00000001
> Apr 26 17:41:30 nfsa1.mail.back kernel: ffffffff 00000000 f76dde5c 00000000 c038e8a0 d263df6c d263df6c f68ad460
> Apr 26 17:41:30 nfsa1.mail.back kernel: c02f458c d263dea0 f5f2f000 d263df6c 00000008 00000000 00004100 00000008
> Apr 26 17:41:30 nfsa1.mail.back kernel: Call Trace:
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c02f458c>] inet_recvmsg+0x5a/0x75
but what is this?
recvmsg? we are sending here!
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c0317f48>] schedule+0x380/0x64f
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c02483a0>] drbd_send+0xc2/0x271
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c024746b>] _drbd_send_cmd+0xc5/0x14d
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c02475c8>] drbd_send_cmd+0xd5/0x1d4
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c024bf47>] w_send_write_hint+0x42/0x44
aha? no. code looks correct.
well, there is this remote possibility on an SMP,
that asender died, and reset the mdev->asender.task to NULL,
right between the two statements in wake_asender(mdev):
| static inline void wake_asender(drbd_dev *mdev) {
| if(mdev->asender.task) { // could be wrong if asender just terminated.
HERE.
on SMP, now asender dies, and sets asender.task to NULL,
and later in force_sig_info it will be dereferenced.
| force_sig(DRBD_SIG, mdev->asender.task);
| }
|}
but this should show in the backtrace,
which says it oopsed in sock_sendmsg...
strange.
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c024c637>] drbd_worker+0xfa/0x24e
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c02471a3>] drbd_thread_setup+0x3d/0x5b
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c0247166>] drbd_thread_setup+0x0/0x5b
> Apr 26 17:41:30 nfsa1.mail.back kernel: [<c01022a9>] kernel_thread_helper+0x5/0xb
> Apr 26 17:41:30 nfsa1.mail.back kernel:
> Apr 26 17:41:30 nfsa1.mail.back kernel: Code: ff 50 38 3d ef fd ff ff 74 1c 8b 9c 24 b0 00 00 00 8b b4 24
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343947,<NULL>)
...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343964,<NULL>)
sorry, no idea.