Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,
First, my apologies for what its going to be an incomplete and probably
unhelpful problem report.
Some minutes ago one of my test boxes paniced leaving the following
information behind:
Jul 30 10:30:12 sdev01 kernel: drbd1: IO error on backing device!
Jul 30 10:30:12 sdev01 kernel: Kernel panic: drbd1: IO error on backing
device!
Jul 30 10:30:12 sdev01 kernel:
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
435580
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
438990
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
7351
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
7352
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
7353
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
7354
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
7355
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
7356
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
7357
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block
7358
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: REISERFS: panic (device drbd1): journal-601,
buffer write failed
Jul 30 10:30:18 sdev01 kernel:
Jul 30 10:30:18 sdev01 kernel: ------------[ cut here ]------------
Jul 30 10:30:18 sdev01 kernel: kernel BUG
at /usr/src/linux-2.6.8-rc2-bk1-vs1.9.2.5/fs/reiserfs/prints.c:362!
Jul 30 10:30:18 sdev01 kernel: invalid operand: 0000 [#1]
Jul 30 10:30:18 sdev01 kernel: PREEMPT SMP
Jul 30 10:30:18 sdev01 kernel: Modules linked in: drbd nfsd exportfs lockd
sunrpc md5 ipv6 uhci_hcd ehci_hcd usbcore serverworks ide_core sworks_agp
evdev ext3jbd mbcache raid1 e1000 tg3 e100 reiserfs dm_mod raid5 md xor ext2
aic7xxx
Jul 30 10:30:18 sdev01 kernel: CPU: 0
Jul 30 10:30:18 sdev01 kernel: EIP: 0060:[__crc_pm_idle+265817/384520]
Not tainted
Jul 30 10:30:18 sdev01 kernel: EFLAGS: 00010246 (2.6.8-rc2-bk1-vs1.9.2.7)
Jul 30 10:30:18 sdev01 kernel: EIP is at reiserfs_panic+0x48/0x70 [reiserfs]
Jul 30 10:30:18 sdev01 kernel: eax: 00000045 ebx: f896dad7 ecx: c02f0fd4
edx: c19cbe00
Jul 30 10:30:18 sdev01 kernel: esi: f7f64a00 edi: f7f64b44 ebp: c19cbe14
esp: c19cbdfc
Jul 30 10:30:18 sdev01 kernel: ds: 007b es: 007b ss: 0068
Jul 30 10:30:18 sdev01 kernel: Process pdflush (pid: 58, threadinfo=c19ca000
task=c19c71a0)
Jul 30 10:30:18 sdev01 kernel: Stack: f8969fd4 f7f64b44 f8978660 f7f64a00
c3d69360 f7f64a00 c19cbe48 f895cdb7
Jul 30 10:30:18 sdev01 kernel: f7f64a00 f896b1e8 00000282 f8d70070
c3d69378 00000001 f7f64a00 d1974a7c
Jul 30 10:30:18 sdev01 kernel: f7f64a00 f8d70104 f6305960 c19cbec0
f89619e0 00000000 00000246 f51eb280
Jul 30 10:30:18 sdev01 kernel: Call Trace:
Jul 30 10:30:18 sdev01 kernel: [show_stack+122/144] show_stack+0x7a/0x90
Jul 30 10:30:18 sdev01 kernel: [show_registers+343/448]
show_registers+0x157/0x
1c0
Jul 30 10:30:18 sdev01 kernel: [die+166/288] die+0xa6/0x120
Jul 30 10:30:18 sdev01 kernel: [do_invalid_op+134/160]
do_invalid_op+0x86/0xa0
Jul 30 10:30:18 sdev01 kernel: [error_code+45/56] error_code+0x2d/0x38
Jul 30 10:30:18 sdev01 kernel: [__crc_pm_idle+315192/384520]
flush_commit_list+0x407/0x440 [reiserfs]
Jul 30 10:30:18 sdev01 kernel: [__crc_pm_idle+334689/384520]
do_journal_end+0xa70/0xafc [reiserfs]
Jul 30 10:30:18 sdev01 kernel: [__crc_pm_idle+329382/384520]
journal_end_sync+0x45/0x90 [reiserfs]
Jul 30 10:30:18 sdev01 kernel: [__crc_pm_idle+251489/384520]
reiserfs_sync_fs+0x60/0xc0 [reiserfs]
Jul 30 10:30:18 sdev01 kernel: [sync_supers+267/288] sync_supers+0x10b/0x120
Jul 30 10:30:18 sdev01 kernel: [wb_kupdate+49/272] wb_kupdate+0x31/0x110
Jul 30 10:30:18 sdev01 kernel: [__pdflush+248/560] __pdflush+0xf8/0x230
Jul 30 10:30:18 sdev01 kernel: [pdflush+30/32] pdflush+0x1e/0x20
Jul 30 10:30:18 sdev01 kernel: [kthread+150/224] kthread+0x96/0xe0
Jul 30 10:30:18 sdev01 kernel: [kernel_thread_helper+5/16]
kernel_thread_helper+0x5/0x10
Jul 30 10:30:18 sdev01 kernel: Code: 0f 0b 6a 01 f8 9f 96 f8 c7 04 24 34 a0 96
f8 85 f6 b9 60 86
Jul 30 10:30:18 sdev01 kernel: <4>ReiserFS: dm-2: warning: clm-2100: nesting
info a different FS
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:31 sdev01 kernel: drbd2: sock was shut down by peer
Jul 30 10:30:31 sdev01 kernel: drbd2: short read expecting header on sock: r=0
Jul 30 10:30:35 sdev01 kernel: drbd1: sock was shut down by peer
Jul 30 10:30:35 sdev01 kernel: drbd1: short read expecting header on sock: r=0
The other box reported:
Jul 30 10:26:51 sdev02 kernel: drbd0: [pdflush/58] sock_sendmsg time expired,
ko = 4294967295
Jul 30 10:26:54 sdev02 kernel: drbd0: PingAck did not arrive in time.
Jul 30 10:26:54 sdev02 kernel: drbd0: [pdflush/58] sock_sendmsg time expired,
ko = 4294967294
Jul 30 10:26:54 sdev02 kernel: drbd0: short read expecting header on sock:
r=-512
Jul 30 10:26:54 sdev02 kernel: drbd0: asender terminated
Jul 30 10:26:54 sdev02 kernel: drbd0: worker terminated
Jul 30 10:26:54 sdev02 kernel: drbd0: _drbd_send_page: size=4096 len=2040
sent=-4
Jul 30 10:26:54 sdev02 kernel: drbd0: Connection lost.
Jul 30 10:27:02 sdev02 kernel: drbd2: PingAck did not arrive in time.
Jul 30 10:27:02 sdev02 kernel: drbd2: short read expecting header on sock:
r=-512
Jul 30 10:27:02 sdev02 kernel: drbd2: asender terminated
Jul 30 10:27:02 sdev02 kernel: drbd2: worker terminated
Jul 30 10:27:02 sdev02 kernel: drbd2: Connection lost.
Jul 30 10:27:06 sdev02 kernel: drbd1: PingAck did not arrive in time.
Jul 30 10:27:06 sdev02 kernel: drbd1: short read expecting header on sock:
r=-512
Jul 30 10:27:06 sdev02 kernel: drbd1: asender terminated
Jul 30 10:27:06 sdev02 kernel: drbd1: worker terminated
Jul 30 10:27:06 sdev02 kernel: drbd1: Connection lost.
Time between the two boxes if off by some minutes and the above logs do
represent the same event.
The initial problem appears to be the: "drbd1: IO error on backing device!"
which is simply not possible. The backing device for drbd1 is a lvm partition
ontop of a software raid5. The raid5 should no be able by definition to cause
an I/O error as long as at least 2 of the 3 underlying devices are
OK. /proc/mdstat shows that all underlying devices are online.
sdev01:~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md1 : active raid5 sdb2[0] sda2[2] sdc2[1]
8225152 blocks level 5, 32k chunk, algorithm 0 [3/3] [UUU]
md0 : active raid1 sdb1[0] sda1[2] sdc1[1]
128384 blocks [3/3] [UUU]
md2 : active raid5 sdb3[0] sda3[2] sdc3[1]
27069184 blocks level 5, 128k chunk, algorithm 0 [3/3] [UUU]
sdev01:~# cat /proc/drbd
version: 0.7.0 svn $Rev: 1442 $ (api:74/proto:74)
0: cs:Connected st:Secondary/Primary ld:Consistent
ns:0 nr:1349444 dw:1349444 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
1: cs:BrokenPipe st:Primary/Unknown ld:Consistent
ns:944572 nr:0 dw:942460 dr:208912 al:14 bm:23 lo:1 pe:0 ua:0 ap:1
2: cs:BrokenPipe st:Primary/Unknown ld:Consistent
ns:1188288 nr:0 dw:1188288 dr:1031684 al:112 bm:0 lo:0 pe:0 ua:0 ap:0
sdev02:~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md1 : active raid5 sdb2[0] sdc2[2] sda2[1]
8225152 blocks level 5, 32k chunk, algorithm 0 [3/3] [UUU]
md0 : active raid1 sdb1[0] sda1[2] sdc1[1]
128384 blocks [3/3] [UUU]
md2 : active raid5 sdb3[0] sdc3[2] sda3[1]
27069184 blocks level 5, 128k chunk, algorithm 0 [3/3] [UUU]
sdev02:~# cat /proc/drbd
version: 0.7.0 svn $Rev: 1442 $ (api:74/proto:74)
0: cs:WFConnection st:Primary/Unknown ld:Consistent
ns:1349696 nr:0 dw:1368272 dr:69932 al:83 bm:2 lo:0 pe:0 ua:0 ap:0
1: cs:WFConnection st:Secondary/Unknown ld:Consistent
ns:0 nr:944572 dw:944572 dr:0 al:0 bm:23 lo:0 pe:0 ua:0 ap:0
2: cs:WFConnection st:Secondary/Unknown ld:Consistent
ns:0 nr:1188288 dw:1188288 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
Kernel: 2.6.8-rc2-bk1 + vserver-1.9.2.7 + drbd-svn #1442
Any idea what happend here?
Regards
Andreas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: signature
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20040730/e62286f3/attachment.pgp>