[DRBD-user] 0.7 another Oops trace

Lars Ellenberg Lars.Ellenberg at linbit.com
Mon Apr 26 17:02:24 CEST 2004

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


/ 2004-04-26 18:01:12 +0400
\ Eugene Crosser:
> This I got from today's CVS version when a would be secondary booted and
> tried to connect.  nfsa1 was primary, nfsa2 was freshly booted.  The
> block device apparently stopped responding.
> 
> Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428336000,<NULL>)

<NULL> should read "WRITE" or "READ"...
corresponding code reads: printk("... %s)\n", ..., rw ? "WRITE" : "READ");

and I thought I had disabled that debugging aid again.
will get commented out in one of the next commits, I guess.
though the <NULL> is strange nevertheless, and worries me a little bit.
please make sure you got a clean checkout and compile.


> Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428331017,<NULL>)
> Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: al_ext->lc_number = 19682 in drivers/block/drbd/drbd_actlog.c:191
> Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: mdev->act_log->new_number = 19828 in drivers/block/drbd/drbd_actlog.c:192

this is also some debug message which I think was commented out again?
it should be, anyways.

> Apr 26 17:41:28 nfsa2.mail.back ntpd[127]: kernel time discipline status 0040

oh? ntpd? whats up? anyways, seems to be unrelated...

> Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428336000,<NULL>)
> Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428331020,<NULL>)
> Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: al_ext->lc_number = 19683 in drivers/block/drbd/drbd_actlog.c:191
> Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: mdev->act_log->new_number = 19831 in drivers/block/drbd/drbd_actlog.c:192
> Apr 26 17:41:28 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428331008,<NULL>)
> Apr 26 17:41:28 nfsa2.mail.back kernel: drbd0: size = 214165504 KB
> Apr 26 17:41:29 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428331101,<NULL>)
> Apr 26 17:41:29 nfsa2.mail.back kernel: sync_page_io(,428332588,<NULL>)

oops, syslog lost messages? kernel prink buffer wrapped around?

> Apr 26 17:41:29 nfsa1.mail.back kernel: drbd0: Connection established.

established while drbdsetup is still creating the bitmap area?
we need to have a look here.

> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock_sendmsg returned -104

#define ECONNRESET      104     /* Connection reset by peer */

> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short sent ReportBitMap size=4096 sent=3696
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: drbd0_receiver [141]:drbd_md_sync_page_io(,428331008,<NULL>)

receiver updates GCs.

> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock_sendmsg returned -32
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short sent WriteHint size=8 sent=0
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: A work callback returned not ok!
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: worker terminated
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: meta connection shut down by peer.
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: asender terminated
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock was shut down by peer

there is the verbose error reporting, too.

> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short read expecting header on sock: r=0
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: drbd0_receiver [141]:drbd_md_sync_page_io(,428331008,<NULL>)
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: Connection lost.

exactly.

> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332632,<NULL>)
...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332639,<NULL>)
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: Connection established.

aha. again, should not yet connect, still not setup!
but maybe even the debug printk is the problem here, so... 

> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332640,<NULL>)
...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332647,<NULL>)
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock_sendmsg returned -104

again.
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short sent ReportBitMap size=4096 sent=3424
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: drbd0_receiver [141]:drbd_md_sync_page_io(,428331008,<NULL>)
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: meta connection shut down by peer.
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: asender terminated
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock was shut down by peer
> Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short read expecting header on sock: r=0
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332648,<NULL>)
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332649,<NULL>)
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_synge_io(,428343933,<NULL>)
again, lost messages...

> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343934,<NULL>)
...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343946,<NULL>)


And, for the fun of it, a nice little oops.

> Apr 26 17:41:30 nfsa1.mail.back kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000038
> Apr 26 17:41:30 nfsa1.mail.back kernel:  printing eip:
> Apr 26 17:41:30 nfsa1.mail.back kernel: c02b1983
> Apr 26 17:41:30 nfsa1.mail.back kernel: *pde = 00000000
> Apr 26 17:41:30 nfsa1.mail.back kernel: Oops: 0000 [#1]
> Apr 26 17:41:30 nfsa1.mail.back kernel: SMP 
> Apr 26 17:41:30 nfsa1.mail.back kernel: CPU:    1
> Apr 26 17:41:30 nfsa1.mail.back kernel: EIP:    0060:[<c02b1983>]    Not tainted
> Apr 26 17:41:30 nfsa1.mail.back kernel: EFLAGS: 00010286   (2.6.6-rc2) 
> Apr 26 17:41:30 nfsa1.mail.back kernel: EIP is at sock_sendmsg+0x9b/0xca
> Apr 26 17:41:30 nfsa1.mail.back kernel: eax: 00000000   ebx: 00000008   ecx: c728c380   edx: f68ad460
> Apr 26 17:41:30 nfsa1.mail.back kernel: esi: d263defc   edi: d263de2c   ebp: 00000000   esp: d263de1c
> Apr 26 17:41:30 nfsa1.mail.back kernel: ds: 007b   es: 007b   ss: 0068
> Apr 26 17:41:30 nfsa1.mail.back kernel: Process drbd0_worker (pid: 548, threadinfo=d263c000 task=f68ad460)
> Apr 26 17:41:30 nfsa1.mail.back kernel: Stack: d263de2c c728c380 d263defc 00000008 f68ad460 00002697 00000000 00000001 
> Apr 26 17:41:30 nfsa1.mail.back kernel:        ffffffff 00000000 f76dde5c 00000000 c038e8a0 d263df6c d263df6c f68ad460 
> Apr 26 17:41:30 nfsa1.mail.back kernel:        c02f458c d263dea0 f5f2f000 d263df6c 00000008 00000000 00004100 00000008 
> Apr 26 17:41:30 nfsa1.mail.back kernel: Call Trace:
> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c02f458c>] inet_recvmsg+0x5a/0x75

but what is this?
recvmsg? we are sending here!

> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c0317f48>] schedule+0x380/0x64f
> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c02483a0>] drbd_send+0xc2/0x271
> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c024746b>] _drbd_send_cmd+0xc5/0x14d
> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c02475c8>] drbd_send_cmd+0xd5/0x1d4
> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c024bf47>] w_send_write_hint+0x42/0x44

aha? no. code looks correct.
well, there is this remote possibility on an SMP,
that asender died, and reset the mdev->asender.task to NULL,
right between the two statements in wake_asender(mdev):

| static inline void wake_asender(drbd_dev *mdev) {
|        if(mdev->asender.task) { // could be wrong if asender just terminated.

	HERE.
on SMP, now asender dies, and sets asender.task to NULL,
and later in force_sig_info it will be dereferenced.

|		force_sig(DRBD_SIG, mdev->asender.task);
|       }
|}

but this should show in the backtrace,
which says it oopsed in sock_sendmsg...
strange.

> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c024c637>] drbd_worker+0xfa/0x24e
> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c02471a3>] drbd_thread_setup+0x3d/0x5b
> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c0247166>] drbd_thread_setup+0x0/0x5b
> Apr 26 17:41:30 nfsa1.mail.back kernel:  [<c01022a9>] kernel_thread_helper+0x5/0xb
> Apr 26 17:41:30 nfsa1.mail.back kernel: 
> Apr 26 17:41:30 nfsa1.mail.back kernel: Code: ff 50 38 3d ef fd ff ff 74 1c 8b 9c 24 b0 00 00 00 8b b4 24 


> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343947,<NULL>)
...
> Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343964,<NULL>)


sorry, no idea.



More information about the drbd-user mailing list