Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
/ 2004-04-26 18:01:12 +0400 \ Eugene Crosser: > This I got from today's CVS version when a would be secondary booted and > tried to connect. nfsa1 was primary, nfsa2 was freshly booted. The > block device apparently stopped responding. > > Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428336000,<NULL>) <NULL> should read "WRITE" or "READ"... corresponding code reads: printk("... %s)\n", ..., rw ? "WRITE" : "READ"); and I thought I had disabled that debugging aid again. will get commented out in one of the next commits, I guess. though the <NULL> is strange nevertheless, and worries me a little bit. please make sure you got a clean checkout and compile. > Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428331017,<NULL>) > Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: al_ext->lc_number = 19682 in drivers/block/drbd/drbd_actlog.c:191 > Apr 26 17:41:27 nfsa1.mail.back kernel: drbd0: mdev->act_log->new_number = 19828 in drivers/block/drbd/drbd_actlog.c:192 this is also some debug message which I think was commented out again? it should be, anyways. > Apr 26 17:41:28 nfsa2.mail.back ntpd[127]: kernel time discipline status 0040 oh? ntpd? whats up? anyways, seems to be unrelated... > Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428336000,<NULL>) > Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: nfsd [518]:drbd_md_sync_page_io(,428331020,<NULL>) > Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: al_ext->lc_number = 19683 in drivers/block/drbd/drbd_actlog.c:191 > Apr 26 17:41:28 nfsa1.mail.back kernel: drbd0: mdev->act_log->new_number = 19831 in drivers/block/drbd/drbd_actlog.c:192 > Apr 26 17:41:28 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428331008,<NULL>) > Apr 26 17:41:28 nfsa2.mail.back kernel: drbd0: size = 214165504 KB > Apr 26 17:41:29 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428331101,<NULL>) > Apr 26 17:41:29 nfsa2.mail.back kernel: sync_page_io(,428332588,<NULL>) oops, syslog lost messages? kernel prink buffer wrapped around? > Apr 26 17:41:29 nfsa1.mail.back kernel: drbd0: Connection established. established while drbdsetup is still creating the bitmap area? we need to have a look here. > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock_sendmsg returned -104 #define ECONNRESET 104 /* Connection reset by peer */ > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short sent ReportBitMap size=4096 sent=3696 > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: drbd0_receiver [141]:drbd_md_sync_page_io(,428331008,<NULL>) receiver updates GCs. > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock_sendmsg returned -32 > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short sent WriteHint size=8 sent=0 > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: A work callback returned not ok! > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: worker terminated > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: meta connection shut down by peer. > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: asender terminated > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock was shut down by peer there is the verbose error reporting, too. > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short read expecting header on sock: r=0 > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: drbd0_receiver [141]:drbd_md_sync_page_io(,428331008,<NULL>) > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: Connection lost. exactly. > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332632,<NULL>) ... > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332639,<NULL>) > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: Connection established. aha. again, should not yet connect, still not setup! but maybe even the debug printk is the problem here, so... > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332640,<NULL>) ... > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332647,<NULL>) > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock_sendmsg returned -104 again. > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short sent ReportBitMap size=4096 sent=3424 > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: drbd0_receiver [141]:drbd_md_sync_page_io(,428331008,<NULL>) > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: meta connection shut down by peer. > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: asender terminated > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: sock was shut down by peer > Apr 26 17:41:30 nfsa1.mail.back kernel: drbd0: short read expecting header on sock: r=0 > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332648,<NULL>) > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428332649,<NULL>) > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_synge_io(,428343933,<NULL>) again, lost messages... > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343934,<NULL>) ... > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343946,<NULL>) And, for the fun of it, a nice little oops. > Apr 26 17:41:30 nfsa1.mail.back kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000038 > Apr 26 17:41:30 nfsa1.mail.back kernel: printing eip: > Apr 26 17:41:30 nfsa1.mail.back kernel: c02b1983 > Apr 26 17:41:30 nfsa1.mail.back kernel: *pde = 00000000 > Apr 26 17:41:30 nfsa1.mail.back kernel: Oops: 0000 [#1] > Apr 26 17:41:30 nfsa1.mail.back kernel: SMP > Apr 26 17:41:30 nfsa1.mail.back kernel: CPU: 1 > Apr 26 17:41:30 nfsa1.mail.back kernel: EIP: 0060:[<c02b1983>] Not tainted > Apr 26 17:41:30 nfsa1.mail.back kernel: EFLAGS: 00010286 (2.6.6-rc2) > Apr 26 17:41:30 nfsa1.mail.back kernel: EIP is at sock_sendmsg+0x9b/0xca > Apr 26 17:41:30 nfsa1.mail.back kernel: eax: 00000000 ebx: 00000008 ecx: c728c380 edx: f68ad460 > Apr 26 17:41:30 nfsa1.mail.back kernel: esi: d263defc edi: d263de2c ebp: 00000000 esp: d263de1c > Apr 26 17:41:30 nfsa1.mail.back kernel: ds: 007b es: 007b ss: 0068 > Apr 26 17:41:30 nfsa1.mail.back kernel: Process drbd0_worker (pid: 548, threadinfo=d263c000 task=f68ad460) > Apr 26 17:41:30 nfsa1.mail.back kernel: Stack: d263de2c c728c380 d263defc 00000008 f68ad460 00002697 00000000 00000001 > Apr 26 17:41:30 nfsa1.mail.back kernel: ffffffff 00000000 f76dde5c 00000000 c038e8a0 d263df6c d263df6c f68ad460 > Apr 26 17:41:30 nfsa1.mail.back kernel: c02f458c d263dea0 f5f2f000 d263df6c 00000008 00000000 00004100 00000008 > Apr 26 17:41:30 nfsa1.mail.back kernel: Call Trace: > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c02f458c>] inet_recvmsg+0x5a/0x75 but what is this? recvmsg? we are sending here! > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c0317f48>] schedule+0x380/0x64f > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c02483a0>] drbd_send+0xc2/0x271 > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c024746b>] _drbd_send_cmd+0xc5/0x14d > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c02475c8>] drbd_send_cmd+0xd5/0x1d4 > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c024bf47>] w_send_write_hint+0x42/0x44 aha? no. code looks correct. well, there is this remote possibility on an SMP, that asender died, and reset the mdev->asender.task to NULL, right between the two statements in wake_asender(mdev): | static inline void wake_asender(drbd_dev *mdev) { | if(mdev->asender.task) { // could be wrong if asender just terminated. HERE. on SMP, now asender dies, and sets asender.task to NULL, and later in force_sig_info it will be dereferenced. | force_sig(DRBD_SIG, mdev->asender.task); | } |} but this should show in the backtrace, which says it oopsed in sock_sendmsg... strange. > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c024c637>] drbd_worker+0xfa/0x24e > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c02471a3>] drbd_thread_setup+0x3d/0x5b > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c0247166>] drbd_thread_setup+0x0/0x5b > Apr 26 17:41:30 nfsa1.mail.back kernel: [<c01022a9>] kernel_thread_helper+0x5/0xb > Apr 26 17:41:30 nfsa1.mail.back kernel: > Apr 26 17:41:30 nfsa1.mail.back kernel: Code: ff 50 38 3d ef fd ff ff 74 1c 8b 9c 24 b0 00 00 00 8b b4 24 > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343947,<NULL>) ... > Apr 26 17:41:30 nfsa2.mail.back kernel: drbd0: drbdsetup [139]:drbd_md_sync_page_io(,428343964,<NULL>) sorry, no idea.