Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi everyone,
Before writing this message I spent a few weeks browsing the net
searching for answers to my problem, and since I found no answers at
all, I'm writing this ;)
So, the problem appeared about a month ago. Back then we were using
heartbeat+drbd+reiserfs, and everything had been running fine until one
day the primary node failed: it simply hung, and heartbeat could not
take it out of the cluster. When we forcibly cut the primary's
connection to the secondary and ran "hb_takeover" on the secondary
node, everything seemed to run OK.
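From memory, the takeover itself boiled down to this on serv2
(heartbeat 1.x ships hb_takeover under /usr/lib/heartbeat on our Debian
boxes; the exact path and the "all" argument may differ with your
version):

  # on serv2, after the link to the hung primary was cut
  /usr/lib/heartbeat/hb_takeover all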
Restarting the ex-primary node and digging through its logs gave me this:
Aug 28 16:58:22 serv1 kernel: ------------[ cut here ]------------
Aug 28 16:58:22 serv1 kernel: kernel BUG at fs/reiserfs/journal.c:2809!
Aug 28 16:58:22 serv1 kernel: invalid operand: 0000 [#1]
Aug 28 16:58:22 serv1 kernel: SMP
Aug 28 16:58:22 serv1 kernel: Modules linked in:
Aug 28 16:58:22 serv1 kernel: CPU: 0
Aug 28 16:58:22 serv1 kernel: EIP: 0060:[journal_begin+227/240] Not tainted VLI
Aug 28 16:58:22 serv1 kernel: EFLAGS: 00010246 (2.6.12.5)
Aug 28 16:58:22 serv1 kernel: EIP is at journal_begin+0xe3/0xf0
Aug 28 16:58:22 serv1 kernel: eax: 00000000 ebx: c3775ea8 ecx: 00000000 edx: f8d2b000
Aug 28 16:58:22 serv1 kernel: esi: c3775ef8 edi: c3774000 ebp: e0e44400 esp: c3775e68
Aug 28 16:58:22 serv1 kernel: ds: 007b es: 007b ss: 0068
Aug 28 16:58:22 serv1 kernel: Process rm (pid: 513, threadinfo=c3774000 task=cd8ae530)
Aug 28 16:58:22 serv1 kernel: Stack: 00000024 00000000 00000000 00000000 00000012 e3ac652c 00000000 c3775ea8
Aug 28 16:58:22 serv1 kernel:        c01a6fba c3775ea8 e0e44400 00000012 00000000 00000000 00000000 00000000
Aug 28 16:58:22 serv1 kernel:        00000000 00000000 00000000 f8d2b000 c3775ef8 00000000 0002f504 c3775f68
Aug 28 16:58:22 serv1 kernel: Call Trace:
Aug 28 16:58:22 serv1 kernel:  [remove_save_link+58/272] remove_save_link+0x3a/0x110
Aug 28 16:58:22 serv1 kernel:  [journal_end+170/256] journal_end+0xaa/0x100
Aug 28 16:58:22 serv1 kernel:  [reiserfs_delete_inode+215/224] reiserfs_delete_inode+0xd7/0xe0
Aug 28 16:58:22 serv1 kernel:  [reiserfs_delete_inode+0/224] reiserfs_delete_inode+0x0/0xe0
Aug 28 16:58:22 serv1 kernel:  [generic_delete_inode+115/256] generic_delete_inode+0x73/0x100
Aug 28 16:58:22 serv1 kernel:  [iput+99/144] iput+0x63/0x90
Aug 28 16:58:22 serv1 kernel:  [sys_unlink+214/304] sys_unlink+0xd6/0x130
Aug 28 16:58:22 serv1 kernel:  [sysenter_past_esp+84/117] sysenter_past_esp+0x54/0x75
Aug 28 16:58:22 serv1 kernel: Code: 2a 40 b9 09 00 00 00 89 df 89 46 04 f3 a5 83 7b 04 01 7e 04 31 c0 eb c9 89 2c 24 b8 80 0e 31 c0 89 44 24 04 e8 3f 12 ff ff eb e9 <0f> 0b f9 0a d5 a5 30 c0 eb cc 8d 76 00 55 57 56 53 83 ec 28 8b
When we ran fsck on the DRBD-backed disks of the ex-primary node, we
found corruption and data loss on both drives. Shortly afterwards the
ex-secondary node failed with the same symptoms. The configuration is
still pretty much the same, except for the filesystem: we are now
using ext3.
This morning, while running fsck (`fsck.ext3 -fn`) on the secondary
node's partition, I stumbled upon errors like these:
Illegal block #1013 (1380868438) in inode 16581436. IGNORED.
Illegal block #1014 (1097283444) in inode 16581436. IGNORED.
Illegal block #1015 (2001232215) in inode 16581436. IGNORED.
Illegal block #1016 (1111574607) in inode 16581436. IGNORED.
Illegal block #1017 (1500009284) in inode 16581436. IGNORED.
Illegal block #1018 (1865438273) in inode 16581436. IGNORED.
Too many illegal blocks in inode 16581436.
Clear inode? no
and
Entry 'sizelist' in /mail1/m/m/m (593495) has deleted/unused inode
592556. Clear? no
There are a LOT more entries like the ones above.
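For the record, that pass was read-only, so nothing on disk was
modified: -f forces a check even on a "clean" filesystem, and -n
answers "no" to every question. Roughly (the device name here is just
for illustration):

  # read-only check of one drbd backing device on the secondary
  fsck.ext3 -fn /dev/sdb1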
And entries like the ones below keep appearing in kern.log:
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x04 { DriveStatusError }
Somehow the systems still keep working, and I wonder how long that
will last.
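Next I want to pull the SMART data off the drives to see whether they
are dying; something like this, assuming smartmontools is installed
(with libata on these kernels smartctl needs -d ata, if I remember
right):

  # print health status, attributes and the drive's own error log
  smartctl -d ata -a /dev/sdb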
My current configuration is:
DRBD - 0.7.10 (api:77/proto:74)
Linux serv1 2.6.11.11 #1 SMP
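(The DRBD line above is just the first line of /proc/drbd:)

  # head -1 /proc/drbd
  version: 0.7.10 (api:77/proto:74)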
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 957M 129M 829M 14% /
tmpfs 506M 0 506M 0% /dev/shm
/dev/sda5 95M 37M 58M 40% /boot
/dev/sda9 1.9G 246M 1.7G 13% /home
/dev/sda7 957M 33M 925M 4% /tmp
/dev/sda6 3.8G 821M 3.0G 22% /usr
/dev/sda8 9.4G 484M 8.9G 6% /var
/dev/drbd0 184G 21G 154G 12% /mail1
/dev/drbd1 184G 21G 155G 12% /mail2
and drbd.conf follows:
global {
    minor-count 1;
}

resource drbd0 {
    protocol C;
    incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

    disk {
        on-io-error panic;
    }
    syncer {
        rate 500M;
        group 1;
        al-extents 997;
    }
    on serv1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.10.10.11:7788;
        meta-disk /dev/sda3[0];
    }
    on serv2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.10.10.12:7788;
        meta-disk /dev/sda11[0];
    }
}

resource drbd1 {
    protocol C;
    incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

    disk {
        on-io-error panic;
    }
    syncer {
        rate 500M;
        group 1;
        al-extents 997;
    }
    on serv1 {
        device    /dev/drbd1;
        disk      /dev/sdc1;
        address   10.10.10.11:7789;
        meta-disk /dev/sda3[1];
    }
    on serv2 {
        device    /dev/drbd1;
        disk      /dev/sdc1;
        address   10.10.10.12:7789;
        meta-disk /dev/sda11[1];
    }
}
Please give me a hint where to look and/or what I should do.
P.S.
Sorry for the long mail and my poor English ;)