Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,
A couple of weeks ago I upgraded from DRBD 0.7.25 to 8.2.4 and everything seemed to be
working fine. Over the last two weeks I got my hands on some free hardware and
upgraded one of the nodes. I was about to promote the new node from DRBD
secondary (HA standby) to primary (HA active) when I remembered that DRBD 8.2.4
can do an online verify. So I ran "drbdadm verify data1" on the
primary node, and it all seemed to go swimmingly. However, after the verify
completed (/proc/drbd showed oos:0, at least for the first 99% of the verify that
I watched), the secondary node (the one with the new hardware) suffered a kernel
oops. I brought the secondary node back up, invalidated its data, connected it,
let it completely synchronise and performed an online verify again from the
primary, and a kernel oops occurred again after the verify completed. I repeated
this once more and got a kernel oops for the third time. The oops occurs about
2 minutes after the verify completes. The data on the primary appears to be
completely fine. Any idea why this is occurring?
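For reference, each iteration went roughly like this (reconstructed from memory,
so the exact ordering of the invalidate/connect steps may be slightly off):
# On the primary: start the online verify and watch its progress.
drbdadm verify data1
cat /proc/drbd
# On the secondary, after rebooting it out of the oops: force a full resync.
drbdadm invalidate data1
drbdadm connect data1
cat /proc/drbd   # wait for cs:Connected ds:UpToDate/UpToDate, then verify again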
A few other notes: after upgrading the hardware and prior to performing the
online verify, I also did a couple of other things. In hindsight, I should not
have tried to do so many things in one upgrade, because now I can't be sure
whether any of these other changes has anything to do with the kernel oops.
1. The primary node is running Fedora 7 with kernel 2.6.22.9-91.fc7. The
upgraded node is running Fedora 8 with kernel 2.6.23.14-107.fc8.
2. I went from having two DRBD resources (data1 and data2) to just one resource
(data1). The resource data2 is completely unconfigured and shows up in
/proc/drbd as " 1: cs:Unconfigured". The other node has only ever had the one
resource configured.
3. I have set up LVM2 on top of DRBD; this was the reason for point 2 above. I
plan on having one big DRBD partition and using LVM to chop it up into smaller
file systems. I'm not using LVM anywhere else on this system. After the upgrade
I plan to expand the /dev/drbd0 partition (which requires an online software
RAID expansion of /dev/md2 underneath), which should be fun; the rough steps I
have in mind are sketched after this list. As an aside, does it look like I have
set up LVM2 on DRBD okay?
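For what it's worth, the expansion I have in mind would go roughly like this
(completely untested so far, and the lvextend syntax is from memory, so treat it
as a sketch):
# Grow the underlying software RAID device on both nodes.
mdadm --grow /dev/md2 --size=max
# With both nodes connected, tell DRBD to use the extra space (run on the primary).
drbdadm resize data1
# Grow the PV, the LV and finally the ext3 file system (online).
pvresize /dev/drbd0
lvextend -l +100%FREE /dev/VolGroup00/LogVol00   # or an explicit number of extents
resize2fs /dev/VolGroup00/LogVol00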
All the gory details are below.
Cheers,
Jeff.
========== Kernel Oops ==========
Message from syslogd at sauron at Feb 5 08:27:47 ...
kernel: Oops: 0000 [#1] SMP
kernel: EIP: 0060:[<c047d9a0>] Not tainted VLI
kernel: CPU: 0
kernel: EFLAGS: 00010086 (2.6.23.14-107.fc8 #1)
kernel: EIP is at kmem_cache_alloc+0x5a/0x99
kernel: esi: c0735230 edi: 00000292 ebp: 000080d0 esp: d15c1ebc
kernel: ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
kernel: eax: 00000000 ebx: 83c38953 ecx: deb7720e edx: c119d8e0
kernel: Stack: 00000000 00000000 db4d3800 00000000 d1877be0 d1877be0 deb7720e db4d3800
kernel:        00000000 cd424718 c04af479 d1877be0 cd424718 00000000 c04af43d c047fbc1
kernel: Process exim (pid: 5915, ti=d15c1000 task=cbb63230 task.ti=d15c1000)
kernel:        ddcf9e00 c9c23110 d1877be0 ffffff9c d15c1f30 00000004 c047fcf2 d1877be0
kernel: [<deb7720e>] if6_seq_open+0x14/0x46 [ipv6]
kernel: Call Trace:
kernel: [<c04af479>] proc_reg_open+0x3c/0x4c
kernel: [<c04af43d>] proc_reg_open+0x0/0x4c
kernel: [<c047fbc1>] __dentry_open+0xd5/0x18c
kernel: [<c047fcf2>] nameidata_to_filp+0x24/0x33
kernel: [<c047fa72>] get_unused_fd_flags+0x52/0xc5
kernel: [<c047fd87>] do_sys_open+0x48/0xca
kernel: [<c047fd38>] do_filp_open+0x37/0x3e
kernel: [<c040518a>] syscall_call+0x7/0xb
kernel: =======================
kernel: [<c047fe42>] sys_open+0x1c/0x1e
kernel: Code: 00 00 00 85 d2 74 06 83 7a 0c 00 75 17 89 54 24 04 89 f0 89 ea 89 0c 24 83 c9 ff e8 37 fa ff ff 89 c3 eb 0d 8b 5a 0c 0f b7 42 0a <8b> 04 83 89 42 0c 89 f8 50 9d 8d 04 05 00 00 00 00 90 66 85 ed
kernel: EIP: [<c047d9a0>] kmem_cache_alloc+0x5a/0x99 SS:ESP 0068:d15c1ebc
========== /etc/drbd.conf ==========
resource data1 {
  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    wfc-timeout 0;          # Infinite!
    degr-wfc-timeout 120;   # 2 minutes.
  }

  disk {
    on-io-error detach;
  }

  net {
    # timeout 60;           # 6 seconds (unit = 0.1 seconds)
    # connect-int 10;       # 10 seconds (unit = 1 second)
    # ping-int 10;          # 10 seconds (unit = 1 second)
    # max-buffers 2048;
    # max-epoch-size 2048;
    # ko-count 4;
    # on-disconnect reconnect;
  }

  syncer {
    rate 50M;               # 50 MByte/s
    verify-alg crc32c;
    # al-extents 257;
  }

  on sauron.whiterabbit.com.au {
    device /dev/drbd0;
    disk /dev/md2;
    address 172.16.0.10:7788;
    meta-disk internal;
  }

  on shelob.whiterabbit.com.au {
    device /dev/drbd0;
    disk /dev/md2;
    address 172.16.0.11:7788;
    meta-disk internal;
  }
}
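In case it matters, after editing drbd.conf I applied the changes to the running
resource with something like the following on both nodes (from memory):
# Show the configuration as DRBD parses it.
drbdadm dump data1
# Apply the new settings (e.g. verify-alg) to the running resource.
drbdadm adjust data1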
========== /etc/lvm/lvm.conf ==========
Just listing the modified/new entries. Everything else is stock lvm.conf.
# (JG) Filter for DRBD devices only
filter = [ "a/drbd.*/" , "r/.*/" ]
# (JG) Types to allow DRBD block devices
types = [ "drbd", 16 ]
# (JG) Don't automatically activate any VGs or LVs as this is done by Heartbeat
volume_list = ""
========== /etc/ha.d/haresources ==========
shelob.whiterabbit.com.au drbddisk::data1 LVM::VolGroup00
Filesystem::/dev/VolGroup00/LogVol00::/mnt/home::ext3 172.16.0.9 60.241.247.218
nfs cyrus-imapd httpd
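As I understand it, on failover that line makes Heartbeat do roughly the
equivalent of the following by hand, plus taking over the two IP addresses and
starting nfs, cyrus-imapd and httpd:
# What the drbddisk, LVM and Filesystem resource scripts boil down to:
drbdadm primary data1
vgchange -a y VolGroup00
mount -t ext3 /dev/VolGroup00/LogVol00 /mnt/home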
========== LVM Steps ==========
# On the primary node:
pvcreate /dev/drbd0
vgcreate VolGroup00 /dev/drbd0
vgdisplay VolGroup00
  --- Volume group ---
  VG Name               VolGroup00
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               19.08 GB
  PE Size               4.00 MB
  Total PE              4884
  Alloc PE / Size       4884 / 19.08 GB
  Free  PE / Size       0 / 0
  VG UUID               Mivt0j-0tk4-1DLM-9981-ilRy-yVFq-bptQLk
lvcreate --extents 4884 --name LogVol00 VolGroup00
lvscan
ACTIVE '/dev/VolGroup00/LogVol00' [19.08 GB] inherit
mkfs.ext3 -b 4096 -L "/mnt/data1" /dev/VolGroup00/LogVol00
# On the secondary node
lvscan
No volume groups found
vgscan
Reading all physical volumes. This may take a while...
No volume groups found
pvscan
No matching physical volumes found
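I assume the empty result on the secondary is expected: DRBD refuses to let the
device be opened while it is in the Secondary role, so LVM cannot read the PV
label there. My expectation (not yet tested on the new node) is that after a
failover it would look something like this:
# On the newly promoted node:
drbdadm primary data1
vgscan                     # should now find VolGroup00
vgchange -a y VolGroup00
lvscan                     # should show /dev/VolGroup00/LogVol00 as ACTIVE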
====================
As an aside, have I set up LVM2 on DRBD okay?