Hi everyone, before writing this message I spent a few weeks browsing the net and searching for answers to my problem, and since I found no answers at all, I'm writing this ;)

So, the problem appeared about a month ago. Back then we were using heartbeat+drbd+reiserfs, and everything had been running fine, but one day the primary node failed: it just hung, and heartbeat couldn't disconnect it. When we forcibly cut the primary's connection to the secondary and used "hb_takeover" on the secondary node, everything seemed to run OK. Restarting the ex-primary node and digging through the logs gave me this:

Aug 28 16:58:22 serv1 kernel: ------------[ cut here ]------------
Aug 28 16:58:22 serv1 kernel: kernel BUG at fs/reiserfs/journal.c:2809!
Aug 28 16:58:22 serv1 kernel: invalid operand: 0000 [#1]
Aug 28 16:58:22 serv1 kernel: SMP
Aug 28 16:58:22 serv1 kernel: Modules linked in:
Aug 28 16:58:22 serv1 kernel: CPU: 0
Aug 28 16:58:22 serv1 kernel: EIP: 0060:[journal_begin+227/240] Not tainted VLI
Aug 28 16:58:22 serv1 kernel: EFLAGS: 00010246 (2.6.12.5)
Aug 28 16:58:22 serv1 kernel: EIP is at journal_begin+0xe3/0xf0
Aug 28 16:58:22 serv1 kernel: eax: 00000000 ebx: c3775ea8 ecx: 00000000 edx: f8d2b000
Aug 28 16:58:22 serv1 kernel: esi: c3775ef8 edi: c3774000 ebp: e0e44400 esp: c3775e68
Aug 28 16:58:22 serv1 kernel: ds: 007b es: 007b ss: 0068
Aug 28 16:58:22 serv1 kernel: Process rm (pid: 513, threadinfo=c3774000 task=cd8ae530)
Aug 28 16:58:22 serv1 kernel: Stack: 00000024 00000000 00000000 00000000 00000012 e3ac652c 00000000 c3775ea8
Aug 28 16:58:22 serv1 kernel:        c01a6fba c3775ea8 e0e44400 00000012 00000000 00000000 00000000 00000000
Aug 28 16:58:22 serv1 kernel:        00000000 00000000 00000000 f8d2b000 c3775ef8 00000000 0002f504 c3775f68
Aug 28 16:58:22 serv1 kernel: Call Trace:
Aug 28 16:58:22 serv1 kernel: [remove_save_link+58/272] remove_save_link+0x3a/0x110
Aug 28 16:58:22 serv1 kernel: [journal_end+170/256] journal_end+0xaa/0x100
Aug 28 16:58:22 serv1 kernel: [reiserfs_delete_inode+215/224] reiserfs_delete_inode+0xd7/0xe0
Aug 28 16:58:22 serv1 kernel: [reiserfs_delete_inode+0/224] reiserfs_delete_inode+0x0/0xe0
Aug 28 16:58:22 serv1 kernel: [generic_delete_inode+115/256] generic_delete_inode+0x73/0x100
Aug 28 16:58:22 serv1 kernel: [iput+99/144] iput+0x63/0x90
Aug 28 16:58:22 serv1 kernel: [sys_unlink+214/304] sys_unlink+0xd6/0x130
Aug 28 16:58:22 serv1 kernel: [sysenter_past_esp+84/117] sysenter_past_esp+0x54/0x75
Aug 28 16:58:22 serv1 kernel: Code: 2a 40 b9 09 00 00 00 89 df 89 46 04 f3 a5 83 7b 04 01 7e 04 31 c0 eb c9 89 2c 24 b8 80 0e 31 c0 89 44 24 04 e8 3f 12 ff ff eb e9 <0f> 0b f9 0a d5 a5 30 c0 eb cc 8d 76 00 55 57 56 53 83 ec 28 8b

When doing an fsck on the disks used by drbd on the ex-primary node, we found corruption and data loss on both drives. Shortly after, the ex-secondary node failed with the same symptoms. The configuration is pretty much the same except for the filesystem; now we're using EXT3. This morning, when doing an fsck (`fsck.ext3 -fn`) on the secondary node's partition, I stumbled upon errors like this:

Illegal block #1013 (1380868438) in inode 16581436. IGNORED.
Illegal block #1014 (1097283444) in inode 16581436. IGNORED.
Illegal block #1015 (2001232215) in inode 16581436. IGNORED.
Illegal block #1016 (1111574607) in inode 16581436. IGNORED.
Illegal block #1017 (1500009284) in inode 16581436. IGNORED.
Illegal block #1018 (1865438273) in inode 16581436. IGNORED.
Too many illegal blocks in inode 16581436.
Clear inode? no

and

Entry 'sizelist' in /mail1/m/m/m (593495) has deleted/unused inode 592556. Clear? no

There are a LOT more entries like the one above. And there are entries like the ones below appearing in kern.log:

ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x04 { DriveStatusError }

Somehow, the systems still keep working, and I wonder how long that will last.
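Those ata3 lines suggest the drive itself is reporting I/O errors, so before trusting either replica I've been thinking of checking the disks directly. A rough sketch of what I have in mind (assuming smartmontools is installed; the log path and device names are my assumptions, taken from the drbd.conf below — adjust for your setup):

```shell
#!/bin/sh
# Count how many ATA error lines have hit the kernel log so far
# (log path is an assumption; adjust to wherever your syslog writes kernel messages)
grep -c 'ata[0-9]*: error=' /var/log/kern.log

# Ask each backing disk for its own SMART verdict (requires smartmontools;
# device names assumed from the drbd.conf below)
for d in /dev/sdb /dev/sdc; do
    smartctl -H "$d"   # overall health self-assessment
    smartctl -A "$d"   # raw attributes: watch Reallocated_Sector_Ct, Current_Pending_Sector, etc.
done
```

If the error count keeps climbing or SMART flags a failing attribute, the corruption would be coming from the hardware rather than from DRBD or the filesystem.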
My current configuration is:

DRBD - 0.7.10 (api:77/proto:74)
Linux serv1 2.6.11.11 #1 SMP

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             957M  129M  829M  14% /
tmpfs                 506M     0  506M   0% /dev/shm
/dev/sda5              95M   37M   58M  40% /boot
/dev/sda9             1.9G  246M  1.7G  13% /home
/dev/sda7             957M   33M  925M   4% /tmp
/dev/sda6             3.8G  821M  3.0G  22% /usr
/dev/sda8             9.4G  484M  8.9G   6% /var
/dev/drbd0            184G   21G  154G  12% /mail1
/dev/drbd1            184G   21G  155G  12% /mail2

and drbd.conf follows:

global {
    minor-count 1;
}

resource drbd0 {
    protocol C;
    incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
    disk {
        on-io-error panic;
    }
    syncer {
        rate 500M;
        group 1;
        al-extents 997;
    }
    on serv1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.10.10.11:7788;
        meta-disk /dev/sda3[0];
    }
    on serv2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.10.10.12:7788;
        meta-disk /dev/sda11[0];
    }
}

resource drbd1 {
    protocol C;
    incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
    disk {
        on-io-error panic;
    }
    syncer {
        rate 500M;
        group 1;
        al-extents 997;
    }
    on serv1 {
        device    /dev/drbd1;
        disk      /dev/sdc1;
        address   10.10.10.11:7789;
        meta-disk /dev/sda3[1];
    }
    on serv2 {
        device    /dev/drbd1;
        disk      /dev/sdc1;
        address   10.10.10.12:7789;
        meta-disk /dev/sda11[1];
    }
}

Please give me a hint about where to look and/or what I should do.

P.S. Sorry for the long mail and my poor English ;)