Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,
A couple of weeks ago I upgraded from DRBD 0.7.25 to 8.2.4 and everything seemed to be
working fine. Over the last two weeks I got my hands on some free hardware and
upgraded one of the nodes. I was about to promote the new node from DRBD
secondary (HA standby) to primary (HA active) when I remembered that DRBD 8.2.4
can do an online verify. So I ran "drbdadm verify data1" on the
primary node, and it all seemed to go swimmingly. However, after the verify
completed (/proc/drbd showed oos:0, at least for the first 99% of the verify that
I watched), the secondary node (the one with the new hardware) suffered a kernel
oops. I brought the secondary node back up, invalidated its data, connected it,
let it completely synchronise and performed an online verify again from the
primary, and a kernel oops occurred again after the verify completed. I repeated
this once more and got a kernel oops for the third time. The oops occurs about
2 minutes after the verify completes. The data on the primary appears to be
completely fine. Any idea why this is occurring?
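For reference, each iteration went roughly like this (reconstructed from memory,
so the exact ordering of the invalidate/connect steps may be slightly off):
# On the primary: start the online verify and watch its progress.
drbdadm verify data1
cat /proc/drbd
# On the secondary, after rebooting it out of the oops: force a full resync.
drbdadm invalidate data1
drbdadm connect data1
cat /proc/drbd   # wait for cs:Connected ds:UpToDate/UpToDate, then verify again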
A few other notes: after upgrading the hardware and prior to performing the
online verify, I also did a couple of other things. In hindsight, I should not
have tried to do so many things in one upgrade, because now I can't be sure
whether any of these other changes has anything to do with the kernel oops.
1. The primary node is running Fedora 7 with kernel 2.6.22.9-91.fc7. The
upgraded node is running Fedora 8 with kernel 2.6.23.14-107.fc8.
2. I went from having two DRBD resources (data1 and data2) to just one resource
(data1). The resource data2 is completely unconfigured and shows up in
/proc/drbd as " 1: cs:Unconfigured". The other node has only ever had the one
resource configured.
3. I have set up LVM2 on top of DRBD; this was the reason for point 2 above. I
plan on having one big DRBD partition and using LVM to chop it up into smaller
file systems. I'm not using LVM anywhere else on this system. After the upgrade
I plan to expand the /dev/drbd0 partition (which requires an online software
RAID expansion of /dev/md2 underneath), which should be fun; the rough steps I
have in mind are sketched after this list. As an aside, does it look like I have
set up LVM2 on DRBD okay?
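For what it's worth, the expansion I have in mind would go roughly like this
(completely untested so far, and the lvextend syntax is from memory, so treat it
as a sketch):
# Grow the underlying software RAID device on both nodes.
mdadm --grow /dev/md2 --size=max
# With both nodes connected, tell DRBD to use the extra space (run on the primary).
drbdadm resize data1
# Grow the PV, the LV and finally the ext3 file system (online).
pvresize /dev/drbd0
lvextend -l +100%FREE /dev/VolGroup00/LogVol00   # or an explicit number of extents
resize2fs /dev/VolGroup00/LogVol00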
All the gory details are below.
Cheers,
Jeff.
========== Kernel Oops ==========
Message from syslogd at sauron at Feb 5 08:27:47 ...
kernel: Oops: 0000 [#1] SMP
kernel: EIP: 0060:[<c047d9a0>] Not tainted VLI
kernel: CPU: 0
kernel: EFLAGS: 00010086 (2.6.23.14-107.fc8 #1)
kernel: EIP is at kmem_cache_alloc+0x5a/0x99
kernel: esi: c0735230 edi: 00000292 ebp: 000080d0 esp: d15c1ebc
kernel: ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
kernel: eax: 00000000 ebx: 83c38953 ecx: deb7720e edx: c119d8e0
kernel: Stack: 00000000 00000000 db4d3800 00000000 d1877be0 d1877be0 deb7720e db4d3800
kernel:        00000000 cd424718 c04af479 d1877be0 cd424718 00000000 c04af43d c047fbc1
kernel: Process exim (pid: 5915, ti=d15c1000 task=cbb63230 task.ti=d15c1000)
kernel:        ddcf9e00 c9c23110 d1877be0 ffffff9c d15c1f30 00000004 c047fcf2 d1877be0
kernel: [<deb7720e>] if6_seq_open+0x14/0x46 [ipv6]
kernel: Call Trace:
kernel: [<c04af479>] proc_reg_open+0x3c/0x4c
kernel: [<c04af43d>] proc_reg_open+0x0/0x4c
kernel: [<c047fbc1>] __dentry_open+0xd5/0x18c
kernel: [<c047fcf2>] nameidata_to_filp+0x24/0x33
kernel: [<c047fa72>] get_unused_fd_flags+0x52/0xc5
kernel: [<c047fd87>] do_sys_open+0x48/0xca
kernel: [<c047fd38>] do_filp_open+0x37/0x3e
kernel: [<c040518a>] syscall_call+0x7/0xb
kernel: =======================
kernel: [<c047fe42>] sys_open+0x1c/0x1e
kernel: Code: 00 00 00 85 d2 74 06 83 7a 0c 00 75 17 89 54 24 04 89 f0 89 ea 89 0c 24 83 c9 ff e8 37 fa ff ff 89 c3 eb 0d 8b 5a 0c 0f b7 42 0a <8b> 04 83 89 42 0c 89 f8 50 9d 8d 04 05 00 00 00 00 90 66 85 ed
kernel: EIP: [<c047d9a0>] kmem_cache_alloc+0x5a/0x99 SS:ESP 0068:d15c1ebc
========== /etc/drbd.conf ==========
resource data1 {
  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    wfc-timeout 0;          # Infinite!
    degr-wfc-timeout 120;   # 2 minutes.
  }

  disk {
    on-io-error detach;
  }

  net {
    # timeout 60;           # 6 seconds (unit = 0.1 seconds)
    # connect-int 10;       # 10 seconds (unit = 1 second)
    # ping-int 10;          # 10 seconds (unit = 1 second)
    # max-buffers 2048;
    # max-epoch-size 2048;
    # ko-count 4;
    # on-disconnect reconnect;
  }

  syncer {
    rate 50M;               # 50 MByte/s
    verify-alg crc32c;
    # al-extents 257;
  }

  on sauron.whiterabbit.com.au {
    device /dev/drbd0;
    disk /dev/md2;
    address 172.16.0.10:7788;
    meta-disk internal;
  }

  on shelob.whiterabbit.com.au {
    device /dev/drbd0;
    disk /dev/md2;
    address 172.16.0.11:7788;
    meta-disk internal;
  }
}
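In case it matters, after editing drbd.conf I applied the changes to the running
resource with something like the following on both nodes (from memory):
# Show the configuration as DRBD parses it.
drbdadm dump data1
# Apply the new settings (e.g. verify-alg) to the running resource.
drbdadm adjust data1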
========== /etc/lvm/lvm.conf ==========
Just listing the modified/new entries. Everything else is stock lvm.conf.
# (JG) Filter for DRBD devices only
filter = [ "a/drbd.*/" , "r/.*/" ]
# (JG) Types to allow DRBD block devices
types = [ "drbd", 16 ]
# (JG) Don't automatically activate any VGs or LVs as this is done by Heartbeat
volume_list = ""
========== /etc/ha.d/haresources ==========
shelob.whiterabbit.com.au drbddisk::data1 LVM::VolGroup00
Filesystem::/dev/VolGroup00/LogVol00::/mnt/home::ext3 172.16.0.9 60.241.247.218
nfs cyrus-imapd httpd
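As I understand it, on failover that line makes Heartbeat do roughly the
equivalent of the following by hand, plus taking over the two IP addresses and
starting nfs, cyrus-imapd and httpd:
# What the drbddisk, LVM and Filesystem resource scripts boil down to:
drbdadm primary data1
vgchange -a y VolGroup00
mount -t ext3 /dev/VolGroup00/LogVol00 /mnt/home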
========== LVM Steps ==========
# On the primary node:
pvcreate /dev/drbd0
vgcreate VolGroup00 /dev/drbd0
vgdisplay VolGroup00
  --- Volume group ---
  VG Name               VolGroup00
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               19.08 GB
  PE Size               4.00 MB
  Total PE              4884
  Alloc PE / Size       4884 / 19.08 GB
  Free  PE / Size       0 / 0
  VG UUID               Mivt0j-0tk4-1DLM-9981-ilRy-yVFq-bptQLk
lvcreate --extents 4884 --name LogVol00 VolGroup00
lvscan
ACTIVE '/dev/VolGroup00/LogVol00' [19.08 GB] inherit
mkfs.ext3 -b 4096 -L "/mnt/data1" /dev/VolGroup00/LogVol00
# On the secondary node
lvscan
No volume groups found
vgscan
Reading all physical volumes. This may take a while...
No volume groups found
pvscan
No matching physical volumes found
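I assume the empty result on the secondary is expected: DRBD refuses to let the
device be opened while it is in the Secondary role, so LVM cannot read the PV
label there. My expectation (not yet tested on the new node) is that after a
failover it would look something like this:
# On the newly promoted node:
drbdadm primary data1
vgscan                     # should now find VolGroup00
vgchange -a y VolGroup00
lvscan                     # should show /dev/VolGroup00/LogVol00 as ACTIVE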
====================
As an aside, have I set up LVM2 on DRBD okay?