Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

A couple of weeks ago I upgraded from DRBD 0.7.25 to 8.2.4 and all seems to be working fine. In the last two weeks I got my hands on some free hardware and upgraded one of the nodes. I was about to change the new node from DRBD secondary (HA standby) to primary (HA active) when I remembered that with DRBD 8.2.4 I could do an online verify. So I entered "drbdadm verify data1" on the primary node, and it all seemed to go swimmingly. However, after the verify completed (oos:0, at least for the first 99% of the verify), the secondary node (the one with the new hardware) suffered a kernel oops.

I brought the secondary node back up, invalidated its data, connected it, let it completely synchronise and performed an online verify again on the primary, with a kernel oops occurring again after the verify completed. I repeated this once more and got a kernel oops for the third time. The oops occurs about 2 minutes after the verify completes. The data on the primary appears to be completely fine. Any idea why this is occurring?

A few other notes: after upgrading the hardware and prior to performing the online verify, I also did a couple of other things. In hindsight, I should not have tried to do so many things in one upgrade, because now I can't be sure whether any of the other changes have had anything to do with the kernel oops.

1. The primary node is running Fedora 7 with kernel 2.6.22.9-91.fc7. The upgraded node is running Fedora 8 with kernel 2.6.23.14-107.fc8.

2. I went from having two DRBD resources (data1 and data2) to just one resource (data1). The resource data2 is completely unconfigured and shows up in /proc/drbd as " 1: cs:Unconfigured". The other node has only ever had the one resource configured.

3. I have set up LVM2 on top of DRBD; this was the reason for point 2 above. I plan on having one big DRBD partition and using LVM to chop it up and create smaller file systems. I'm not using LVM anywhere else on this system. After the upgrade I plan to expand the /dev/drbd0 partition (which requires an online software RAID expansion of /dev/md2 underneath), which should be fun; a rough sketch of the steps I have in mind is below, after the drbd.conf.

As an aside, does it look like I have set up LVM2 on DRBD okay?

All the gory details are below.

Cheers,
Jeff.
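========== Verify / resync cycle ==========

For reference, this is roughly what I ran on each of the three attempts (from memory, so the exact invocations may be slightly off):

# On the primary (sauron):
drbdadm verify data1        # online verify; progress watched in /proc/drbd
# verify completes with oos:0; ~2 minutes later the secondary oopses

# On the secondary (shelob), after bringing it back up:
drbdadm invalidate data1    # discard the local copy
drbdadm connect data1       # reconnect and let the full resync run
cat /proc/drbd              # waited for cs:Connected, UpToDate/UpToDate

# Then another "drbdadm verify data1" on the primary -> same oops again.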
========== Kernel Oops ==========

Each of the lines below arrived on the console as "Message from syslogd at sauron at Feb 5 08:27:47 ..."; the repeated prefix is omitted here for readability:

kernel: Oops: 0000 [#1] SMP
kernel: EIP: 0060:[<c047d9a0>] Not tainted VLI
kernel: CPU: 0
kernel: EFLAGS: 00010086 (2.6.23.14-107.fc8 #1)
kernel: EIP is at kmem_cache_alloc+0x5a/0x99
kernel: esi: c0735230 edi: 00000292 ebp: 000080d0 esp: d15c1ebc
kernel: ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
kernel: eax: 00000000 ebx: 83c38953 ecx: deb7720e edx: c119d8e0
kernel: Stack: 00000000 00000000 db4d3800 00000000 d1877be0 d1877be0 deb7720e db4d3800
kernel: 00000000 cd424718 c04af479 d1877be0 cd424718 00000000 c04af43d c047fbc1
kernel: Process exim (pid: 5915, ti=d15c1000 task=cbb63230 task.ti=d15c1000)
kernel: ddcf9e00 c9c23110 d1877be0 ffffff9c d15c1f30 00000004 c047fcf2 d1877be0
kernel: [<deb7720e>] if6_seq_open+0x14/0x46 [ipv6]
kernel: Call Trace:
kernel: [<c04af479>] proc_reg_open+0x3c/0x4c
kernel: [<c04af43d>] proc_reg_open+0x0/0x4c
kernel: [<c047fbc1>] __dentry_open+0xd5/0x18c
kernel: [<c047fcf2>] nameidata_to_filp+0x24/0x33
kernel: [<c047fa72>] get_unused_fd_flags+0x52/0xc5
kernel: [<c047fd87>] do_sys_open+0x48/0xca
kernel: [<c047fd38>] do_filp_open+0x37/0x3e
kernel: [<c040518a>] syscall_call+0x7/0xb
kernel: =======================
kernel: [<c047fe42>] sys_open+0x1c/0x1e
kernel: Code: 00 00 00 85 d2 74 06 83 7a 0c 00 75 17 89 54 24 04 89 f0 89 ea 89 0c 24 83 c9 ff e8 37 fa ff ff 89 c3 eb 0d 8b 5a 0c 0f b7 42 0a <8b> 04 83 89 42 0c 89 f8 50 9d 8d 04 05 00 00 00 00 90 66 85 ed
kernel: EIP: [<c047d9a0>] kmem_cache_alloc+0x5a/0x99 SS:ESP 0068:d15c1ebc

========== /etc/drbd.conf ==========

resource data1 {
  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer      "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    wfc-timeout 0;          # Infinite!
    degr-wfc-timeout 120;   # 2 minutes.
  }

  disk {
    on-io-error detach;
  }

  net {
    # timeout 60;           # 6 seconds (unit = 0.1 seconds)
    # connect-int 10;       # 10 seconds (unit = 1 second)
    # ping-int 10;          # 10 seconds (unit = 1 second)
    # max-buffers 2048;
    # max-epoch-size 2048;
    # ko-count 4;
    # on-disconnect reconnect;
  }

  syncer {
    rate 50M;               # 50 MByte/s
    verify-alg crc32c;
    # al-extents 257;
  }

  on sauron.whiterabbit.com.au {
    device    /dev/drbd0;
    disk      /dev/md2;
    address   172.16.0.10:7788;
    meta-disk internal;
  }

  on shelob.whiterabbit.com.au {
    device    /dev/drbd0;
    disk      /dev/md2;
    address   172.16.0.11:7788;
    meta-disk internal;
  }
}
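========== Planned expansion of /dev/drbd0 (untested sketch) ==========

For the expansion mentioned in point 3, this is the rough sequence I have in mind. I have not tried any of it yet, and the exact lvextend/resize2fs arguments are just my reading of the man pages, so treat it as a sketch:

# Grow the underlying software RAID on both nodes first.
mdadm --grow /dev/md2 --size=max

# With both nodes connected and in sync, let DRBD take up the new space.
drbdadm resize data1

# Then grow the PV, the LV and finally the ext3 filesystem, all online.
pvresize /dev/drbd0
lvextend -l +100%FREE /dev/VolGroup00/LogVol00
resize2fs /dev/VolGroup00/LogVol00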
========== /etc/lvm/lvm.conf ==========

Just listing the modified/new entries. Everything else is stock lvm.conf.

# (JG) Filter for DRBD devices only
filter = [ "a/drbd.*/" , "r/.*/" ]

# (JG) Types to allow DRBD block devices
types = [ "drbd", 16 ]

# (JG) Don't automatically activate any VGs or LVs as this is done by Heartbeat
volume_list = ""

========== /etc/ha.d/haresources ==========

shelob.whiterabbit.com.au drbddisk::data1 LVM::VolGroup00 Filesystem::/dev/VolGroup00/LogVol00::/mnt/home::ext3 172.16.0.9 60.241.247.218 nfs cyrus-imapd httpd

========== LVM Steps ==========

# On the primary node:
pvcreate /dev/drbd0
vgcreate VolGroup00 /dev/drbd0
vgdisplay VolGroup00
  --- Volume group ---
  VG Name               VolGroup00
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               19.08 GB
  PE Size               4.00 MB
  Total PE              4884
  Alloc PE / Size       4884 / 19.08 GB
  Free  PE / Size       0 / 0
  VG UUID               Mivt0j-0tk4-1DLM-9981-ilRy-yVFq-bptQLk

lvcreate --extents 4884 --name LogVol00 VolGroup00
lvscan
  ACTIVE            '/dev/VolGroup00/LogVol00' [19.08 GB] inherit
mkfs.ext3 -b 4096 -L "/mnt/data1" /dev/VolGroup00/LogVol00

# On the secondary node:
lvscan
  No volume groups found
vgscan
  Reading all physical volumes.  This may take a while...
  No volume groups found
pvscan
  No matching physical volumes found

====================

As an aside, have I set up LVM2 on DRBD okay?
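For what it's worth, my understanding (please correct me if this is wrong) is that the "No volume groups found" on the secondary is expected: while a node is Secondary, DRBD won't allow reads from /dev/drbd0, and the lvm.conf filter rejects every other device, so LVM simply can't see the PV there. The VG should only become visible on that node once the resource is promoted, which is what the drbddisk/LVM entries in haresources arrange on failover. Roughly, on the secondary (after demoting the current primary):

drbdadm primary data1    # promote this node
pvscan                   # should now list /dev/drbd0 as a PV
vgscan                   # should now find VolGroup00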