[DRBD-user] spurios IO error on backing device with 0.7.0 svn $Rev: 1442

Andreas Schultz aschultz at tpip.net
Fri Jul 30 10:50:42 CEST 2004

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

First, my apologies for what its going to be an incomplete and probably 
unhelpful problem report.

Some minutes ago one of my test boxes paniced leaving the following 
information behind:

Jul 30 10:30:12 sdev01 kernel: drbd1: IO error on backing device!
Jul 30 10:30:12 sdev01 kernel: Kernel panic: drbd1: IO error on backing 
device!
Jul 30 10:30:12 sdev01 kernel:
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
435580
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
438990
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
7351
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
7352
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
7353
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
7354
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
7355
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
7356
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
7357
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: Buffer I/O error on device drbd1, logical block 
7358
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:18 sdev01 kernel: REISERFS: panic (device drbd1): journal-601, 
buffer write failed
Jul 30 10:30:18 sdev01 kernel:
Jul 30 10:30:18 sdev01 kernel: ------------[ cut here ]------------
Jul 30 10:30:18 sdev01 kernel: kernel BUG 
at /usr/src/linux-2.6.8-rc2-bk1-vs1.9.2.5/fs/reiserfs/prints.c:362!
Jul 30 10:30:18 sdev01 kernel: invalid operand: 0000 [#1]
Jul 30 10:30:18 sdev01 kernel: PREEMPT SMP
Jul 30 10:30:18 sdev01 kernel: Modules linked in: drbd nfsd exportfs lockd 
sunrpc md5 ipv6 uhci_hcd ehci_hcd usbcore serverworks ide_core sworks_agp 
evdev ext3jbd mbcache raid1 e1000 tg3 e100 reiserfs dm_mod raid5 md xor ext2 
aic7xxx
Jul 30 10:30:18 sdev01 kernel: CPU:    0
Jul 30 10:30:18 sdev01 kernel: EIP:    0060:[__crc_pm_idle+265817/384520]    
Not tainted
Jul 30 10:30:18 sdev01 kernel: EFLAGS: 00010246   (2.6.8-rc2-bk1-vs1.9.2.7)
Jul 30 10:30:18 sdev01 kernel: EIP is at reiserfs_panic+0x48/0x70 [reiserfs]
Jul 30 10:30:18 sdev01 kernel: eax: 00000045   ebx: f896dad7   ecx: c02f0fd4   
edx: c19cbe00
Jul 30 10:30:18 sdev01 kernel: esi: f7f64a00   edi: f7f64b44   ebp: c19cbe14   
esp: c19cbdfc
Jul 30 10:30:18 sdev01 kernel: ds: 007b   es: 007b   ss: 0068
Jul 30 10:30:18 sdev01 kernel: Process pdflush (pid: 58, threadinfo=c19ca000 
task=c19c71a0)
Jul 30 10:30:18 sdev01 kernel: Stack: f8969fd4 f7f64b44 f8978660 f7f64a00 
c3d69360 f7f64a00 c19cbe48 f895cdb7
Jul 30 10:30:18 sdev01 kernel:        f7f64a00 f896b1e8 00000282 f8d70070 
c3d69378 00000001 f7f64a00 d1974a7c
Jul 30 10:30:18 sdev01 kernel:        f7f64a00 f8d70104 f6305960 c19cbec0 
f89619e0 00000000 00000246 f51eb280
Jul 30 10:30:18 sdev01 kernel: Call Trace:
Jul 30 10:30:18 sdev01 kernel:  [show_stack+122/144] show_stack+0x7a/0x90
Jul 30 10:30:18 sdev01 kernel:  [show_registers+343/448] 
show_registers+0x157/0x
1c0
Jul 30 10:30:18 sdev01 kernel:  [die+166/288] die+0xa6/0x120
Jul 30 10:30:18 sdev01 kernel:  [do_invalid_op+134/160] 
do_invalid_op+0x86/0xa0
Jul 30 10:30:18 sdev01 kernel:  [error_code+45/56] error_code+0x2d/0x38
Jul 30 10:30:18 sdev01 kernel:  [__crc_pm_idle+315192/384520] 
flush_commit_list+0x407/0x440 [reiserfs]
Jul 30 10:30:18 sdev01 kernel:  [__crc_pm_idle+334689/384520] 
do_journal_end+0xa70/0xafc [reiserfs]
Jul 30 10:30:18 sdev01 kernel:  [__crc_pm_idle+329382/384520] 
journal_end_sync+0x45/0x90 [reiserfs]
Jul 30 10:30:18 sdev01 kernel:  [__crc_pm_idle+251489/384520] 
reiserfs_sync_fs+0x60/0xc0 [reiserfs]
Jul 30 10:30:18 sdev01 kernel:  [sync_supers+267/288] sync_supers+0x10b/0x120
Jul 30 10:30:18 sdev01 kernel:  [wb_kupdate+49/272] wb_kupdate+0x31/0x110
Jul 30 10:30:18 sdev01 kernel:  [__pdflush+248/560] __pdflush+0xf8/0x230
Jul 30 10:30:18 sdev01 kernel:  [pdflush+30/32] pdflush+0x1e/0x20
Jul 30 10:30:18 sdev01 kernel:  [kthread+150/224] kthread+0x96/0xe0
Jul 30 10:30:18 sdev01 kernel:  [kernel_thread_helper+5/16] 
kernel_thread_helper+0x5/0x10
Jul 30 10:30:18 sdev01 kernel: Code: 0f 0b 6a 01 f8 9f 96 f8 c7 04 24 34 a0 96 
f8 85 f6 b9 60 86
Jul 30 10:30:18 sdev01 kernel:  <4>ReiserFS: dm-2: warning: clm-2100: nesting 
info a different FS
Jul 30 10:30:18 sdev01 kernel: lost page write due to I/O error on drbd1
Jul 30 10:30:31 sdev01 kernel: drbd2: sock was shut down by peer
Jul 30 10:30:31 sdev01 kernel: drbd2: short read expecting header on sock: r=0
Jul 30 10:30:35 sdev01 kernel: drbd1: sock was shut down by peer
Jul 30 10:30:35 sdev01 kernel: drbd1: short read expecting header on sock: r=0

The other box reported:
Jul 30 10:26:51 sdev02 kernel: drbd0: [pdflush/58] sock_sendmsg time expired, 
ko = 4294967295
Jul 30 10:26:54 sdev02 kernel: drbd0: PingAck did not arrive in time.
Jul 30 10:26:54 sdev02 kernel: drbd0: [pdflush/58] sock_sendmsg time expired, 
ko = 4294967294
Jul 30 10:26:54 sdev02 kernel: drbd0: short read expecting header on sock: 
r=-512
Jul 30 10:26:54 sdev02 kernel: drbd0: asender terminated
Jul 30 10:26:54 sdev02 kernel: drbd0: worker terminated
Jul 30 10:26:54 sdev02 kernel: drbd0: _drbd_send_page: size=4096 len=2040 
sent=-4
Jul 30 10:26:54 sdev02 kernel: drbd0: Connection lost.
Jul 30 10:27:02 sdev02 kernel: drbd2: PingAck did not arrive in time.
Jul 30 10:27:02 sdev02 kernel: drbd2: short read expecting header on sock: 
r=-512
Jul 30 10:27:02 sdev02 kernel: drbd2: asender terminated
Jul 30 10:27:02 sdev02 kernel: drbd2: worker terminated
Jul 30 10:27:02 sdev02 kernel: drbd2: Connection lost.
Jul 30 10:27:06 sdev02 kernel: drbd1: PingAck did not arrive in time.
Jul 30 10:27:06 sdev02 kernel: drbd1: short read expecting header on sock: 
r=-512
Jul 30 10:27:06 sdev02 kernel: drbd1: asender terminated
Jul 30 10:27:06 sdev02 kernel: drbd1: worker terminated
Jul 30 10:27:06 sdev02 kernel: drbd1: Connection lost.

Time between the two boxes if off by some minutes and the above logs do 
represent the same event.

The initial problem appears to be the: "drbd1: IO error on backing device!" 
which is simply not possible. The backing device for drbd1 is a lvm partition 
ontop of a software raid5. The raid5 should no be able by definition to cause 
an I/O error as long as at least 2 of the 3 underlying devices are 
OK. /proc/mdstat shows that all underlying devices are online.

sdev01:~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md1 : active raid5 sdb2[0] sda2[2] sdc2[1]
      8225152 blocks level 5, 32k chunk, algorithm 0 [3/3] [UUU]

md0 : active raid1 sdb1[0] sda1[2] sdc1[1]
      128384 blocks [3/3] [UUU]

md2 : active raid5 sdb3[0] sda3[2] sdc3[1]
      27069184 blocks level 5, 128k chunk, algorithm 0 [3/3] [UUU]

sdev01:~# cat /proc/drbd
version: 0.7.0 svn $Rev: 1442 $ (api:74/proto:74)

 0: cs:Connected st:Secondary/Primary ld:Consistent
    ns:0 nr:1349444 dw:1349444 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
 1: cs:BrokenPipe st:Primary/Unknown ld:Consistent
    ns:944572 nr:0 dw:942460 dr:208912 al:14 bm:23 lo:1 pe:0 ua:0 ap:1
 2: cs:BrokenPipe st:Primary/Unknown ld:Consistent
    ns:1188288 nr:0 dw:1188288 dr:1031684 al:112 bm:0 lo:0 pe:0 ua:0 ap:0

sdev02:~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md1 : active raid5 sdb2[0] sdc2[2] sda2[1]
      8225152 blocks level 5, 32k chunk, algorithm 0 [3/3] [UUU]

md0 : active raid1 sdb1[0] sda1[2] sdc1[1]
      128384 blocks [3/3] [UUU]

md2 : active raid5 sdb3[0] sdc3[2] sda3[1]
      27069184 blocks level 5, 128k chunk, algorithm 0 [3/3] [UUU]

sdev02:~# cat /proc/drbd
version: 0.7.0 svn $Rev: 1442 $ (api:74/proto:74)

 0: cs:WFConnection st:Primary/Unknown ld:Consistent
    ns:1349696 nr:0 dw:1368272 dr:69932 al:83 bm:2 lo:0 pe:0 ua:0 ap:0
 1: cs:WFConnection st:Secondary/Unknown ld:Consistent
    ns:0 nr:944572 dw:944572 dr:0 al:0 bm:23 lo:0 pe:0 ua:0 ap:0
 2: cs:WFConnection st:Secondary/Unknown ld:Consistent
    ns:0 nr:1188288 dw:1188288 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0

Kernel: 2.6.8-rc2-bk1 + vserver-1.9.2.7 + drbd-svn #1442

Any idea what happend here?

Regards

Andreas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: signature
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20040730/e62286f3/attachment.pgp>


More information about the drbd-user mailing list