Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, First, my apologies for what its going to be an incomplete and probably unhelpful problem report. Some minutes ago one of my test boxes paniced leaving the following information behind: Jul 30 10:30:12 sdev01 kernel: drbd1: IO error on backing device! Jul 30 10:30:12 sdev01 kernel: Kernel panic: drbd1: IO error on backing device! Jul 30 10:30:12 sdev01 kernel: Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 435580 Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 438990 Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 7351 Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 7352 Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 7353 Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 7354 Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 7355 Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 7356 Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 7357 Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 7358 Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:18 sdev01 kernel: REISERFS: panic (device drbd1): journal-601, buffer write failed Jul 30 10:30:18 sdev01 kernel: Jul 30 10:30:18 sdev01 kernel: ------------[ cut here ]------------ Jul 30 10:30:18 sdev01 kernel: kernel BUG at /usr/src/linux-2.6.8-rc2-bk1-vs1.9.2.5/fs/reiserfs/prints.c:362! Jul 30 10:30:18 sdev01 kernel: invalid operand: 0000 [#1] Jul 30 10:30:18 sdev01 kernel: PREEMPT SMP Jul 30 10:30:18 sdev01 kernel: Modules linked in: drbd nfsd exportfs lockd sunrpc md5 ipv6 uhci_hcd ehci_hcd usbcore serverworks ide_core sworks_agp evdev ext3jbd mbcache raid1 e1000 tg3 e100 reiserfs dm_mod raid5 md xor ext2 aic7xxx Jul 30 10:30:18 sdev01 kernel: CPU: 0 Jul 30 10:30:18 sdev01 kernel: EIP: 0060:[__crc_pm_idle+265817/384520] Not tainted Jul 30 10:30:18 sdev01 kernel: EFLAGS: 00010246 (2.6.8-rc2-bk1-vs1.9.2.7) Jul 30 10:30:18 sdev01 kernel: EIP is at reiserfs_panic+0x48/0x70 [reiserfs] Jul 30 10:30:18 sdev01 kernel: eax: 00000045 ebx: f896dad7 ecx: c02f0fd4 edx: c19cbe00 Jul 30 10:30:18 sdev01 kernel: esi: f7f64a00 edi: f7f64b44 ebp: c19cbe14 esp: c19cbdfc Jul 30 10:30:18 sdev01 kernel: ds: 007b es: 007b ss: 0068 Jul 30 10:30:18 sdev01 kernel: Process pdflush (pid: 58, threadinfo=c19ca000 task=c19c71a0) Jul 30 10:30:18 sdev01 kernel: Stack: f8969fd4 f7f64b44 f8978660 f7f64a00 c3d69360 f7f64a00 c19cbe48 f895cdb7 Jul 30 10:30:18 sdev01 kernel: f7f64a00 f896b1e8 00000282 f8d70070 c3d69378 00000001 f7f64a00 d1974a7c Jul 30 10:30:18 sdev01 kernel: f7f64a00 f8d70104 f6305960 c19cbec0 f89619e0 00000000 00000246 f51eb280 Jul 30 10:30:18 sdev01 kernel: Call Trace: Jul 30 10:30:18 sdev01 kernel: [show_stack+122/144] show_stack+0x7a/0x90 Jul 30 10:30:18 sdev01 kernel: [show_registers+343/448] show_registers+0x157/0x 1c0 Jul 30 10:30:18 sdev01 kernel: [die+166/288] die+0xa6/0x120 Jul 30 10:30:18 sdev01 kernel: [do_invalid_op+134/160] do_invalid_op+0x86/0xa0 Jul 30 10:30:18 sdev01 kernel: [error_code+45/56] error_code+0x2d/0x38 Jul 30 10:30:18 sdev01 kernel: [__crc_pm_idle+315192/384520] flush_commit_list+0x407/0x440 [reiserfs] Jul 30 10:30:18 sdev01 kernel: [__crc_pm_idle+334689/384520] do_journal_end+0xa70/0xafc [reiserfs] Jul 30 10:30:18 sdev01 kernel: [__crc_pm_idle+329382/384520] journal_end_sync+0x45/0x90 [reiserfs] Jul 30 10:30:18 sdev01 kernel: [__crc_pm_idle+251489/384520] reiserfs_sync_fs+0x60/0xc0 [reiserfs] Jul 30 10:30:18 sdev01 kernel: [sync_supers+267/288] sync_supers+0x10b/0x120 Jul 30 10:30:18 sdev01 kernel: [wb_kupdate+49/272] wb_kupdate+0x31/0x110 Jul 30 10:30:18 sdev01 kernel: [__pdflush+248/560] __pdflush+0xf8/0x230 Jul 30 10:30:18 sdev01 kernel: [pdflush+30/32] pdflush+0x1e/0x20 Jul 30 10:30:18 sdev01 kernel: [kthread+150/224] kthread+0x96/0xe0 Jul 30 10:30:18 sdev01 kernel: [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10 Jul 30 10:30:18 sdev01 kernel: Code: 0f 0b 6a 01 f8 9f 96 f8 c7 04 24 34 a0 96 f8 85 f6 b9 60 86 Jul 30 10:30:18 sdev01 kernel: <4>ReiserFS: dm-2: warning: clm-2100: nesting info a different FS Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1 Jul 30 10:30:31 sdev01 kernel: drbd2: sock was shut down by peer Jul 30 10:30:31 sdev01 kernel: drbd2: short read expecting header on sock: r=0 Jul 30 10:30:35 sdev01 kernel: drbd1: sock was shut down by peer Jul 30 10:30:35 sdev01 kernel: drbd1: short read expecting header on sock: r=0 The other box reported: Jul 30 10:26:51 sdev02 kernel: drbd0: [pdflush/58] sock_sendmsg time expired, ko = 4294967295 Jul 30 10:26:54 sdev02 kernel: drbd0: PingAck did not arrive in time. Jul 30 10:26:54 sdev02 kernel: drbd0: [pdflush/58] sock_sendmsg time expired, ko = 4294967294 Jul 30 10:26:54 sdev02 kernel: drbd0: short read expecting header on sock: r=-512 Jul 30 10:26:54 sdev02 kernel: drbd0: asender terminated Jul 30 10:26:54 sdev02 kernel: drbd0: worker terminated Jul 30 10:26:54 sdev02 kernel: drbd0: _drbd_send_page: size=4096 len=2040 sent=-4 Jul 30 10:26:54 sdev02 kernel: drbd0: Connection lost. Jul 30 10:27:02 sdev02 kernel: drbd2: PingAck did not arrive in time. Jul 30 10:27:02 sdev02 kernel: drbd2: short read expecting header on sock: r=-512 Jul 30 10:27:02 sdev02 kernel: drbd2: asender terminated Jul 30 10:27:02 sdev02 kernel: drbd2: worker terminated Jul 30 10:27:02 sdev02 kernel: drbd2: Connection lost. Jul 30 10:27:06 sdev02 kernel: drbd1: PingAck did not arrive in time. Jul 30 10:27:06 sdev02 kernel: drbd1: short read expecting header on sock: r=-512 Jul 30 10:27:06 sdev02 kernel: drbd1: asender terminated Jul 30 10:27:06 sdev02 kernel: drbd1: worker terminated Jul 30 10:27:06 sdev02 kernel: drbd1: Connection lost. Time between the two boxes if off by some minutes and the above logs do represent the same event. The initial problem appears to be the: "drbd1: IO error on backing device!" which is simply not possible. The backing device for drbd1 is a lvm partition ontop of a software raid5. The raid5 should no be able by definition to cause an I/O error as long as at least 2 of the 3 underlying devices are OK. /proc/mdstat shows that all underlying devices are online. sdev01:~# cat /proc/mdstat Personalities : [raid1] [raid5] md1 : active raid5 sdb2[0] sda2[2] sdc2[1] 8225152 blocks level 5, 32k chunk, algorithm 0 [3/3] [UUU] md0 : active raid1 sdb1[0] sda1[2] sdc1[1] 128384 blocks [3/3] [UUU] md2 : active raid5 sdb3[0] sda3[2] sdc3[1] 27069184 blocks level 5, 128k chunk, algorithm 0 [3/3] [UUU] sdev01:~# cat /proc/drbd version: 0.7.0 svn $Rev: 1442 $ (api:74/proto:74) 0: cs:Connected st:Secondary/Primary ld:Consistent ns:0 nr:1349444 dw:1349444 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 1: cs:BrokenPipe st:Primary/Unknown ld:Consistent ns:944572 nr:0 dw:942460 dr:208912 al:14 bm:23 lo:1 pe:0 ua:0 ap:1 2: cs:BrokenPipe st:Primary/Unknown ld:Consistent ns:1188288 nr:0 dw:1188288 dr:1031684 al:112 bm:0 lo:0 pe:0 ua:0 ap:0 sdev02:~# cat /proc/mdstat Personalities : [raid1] [raid5] md1 : active raid5 sdb2[0] sdc2[2] sda2[1] 8225152 blocks level 5, 32k chunk, algorithm 0 [3/3] [UUU] md0 : active raid1 sdb1[0] sda1[2] sdc1[1] 128384 blocks [3/3] [UUU] md2 : active raid5 sdb3[0] sdc3[2] sda3[1] 27069184 blocks level 5, 128k chunk, algorithm 0 [3/3] [UUU] sdev02:~# cat /proc/drbd version: 0.7.0 svn $Rev: 1442 $ (api:74/proto:74) 0: cs:WFConnection st:Primary/Unknown ld:Consistent ns:1349696 nr:0 dw:1368272 dr:69932 al:83 bm:2 lo:0 pe:0 ua:0 ap:0 1: cs:WFConnection st:Secondary/Unknown ld:Consistent ns:0 nr:944572 dw:944572 dr:0 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 2: cs:WFConnection st:Secondary/Unknown ld:Consistent ns:0 nr:1188288 dw:1188288 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 Kernel: 2.6.8-rc2-bk1 + vserver-1.9.2.7 + drbd-svn #1442 Any idea what happend here? Regards Andreas -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: signature URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20040730/e62286f3/attachment.pgp>