[DRBD-user] First DRBD attempt -- HELP pls

Trey Dockendorf treydock at gmail.com
Mon Jan 23 01:34:06 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Sun, Jan 22, 2012 at 4:46 PM, Trey Dockendorf <treydock at gmail.com> wrote:

> I'm in the process of setting up DRBD from scratch for the first time.
>  The first phase of my project is simply to sync one VM host to another
> in primary/secondary, with no intention of facilitating failover.  The
> primary (cllakvm2) has 7 VMs in production on it.  The secondary (cllakvm1)
> has an LV of exactly the same size with nothing currently stored on it.
>
> I've created the resource on both nodes, but what's strange is that once I
> created the resource on my "primary" and started the DRBD daemon, it began
> to sync before I had told it "drbdadm primary --force <resource>".  Now
> every command I attempt to run just hangs.  I've also tried "kill -9 <pid>"
> on the processes with no luck.  I also can't remount the /vmstore partition
> (the LV where all the virtual disks live).  I've tried "drbdadm down r0"
> and "drbdadm disconnect --force r0", and nothing will stop the processes;
> they hang and never exit.  The process list of the drbd processes is toward
> the bottom of this email.
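>
> (A note for anyone diagnosing: the "D" state in the process list toward
> the bottom means uninterruptible sleep, which is why kill -9 is ignored.
> The only checks I know of are the kernel-side stack, if the kernel exposes
> /proc/<pid>/stack, and the kernel log; a sketch, using a PID from that
> list:)
> ==============
> # cat /proc/1099/stack   # kernel stack of one stuck drbdsetup process
> # dmesg | tail -n 50     # recent kernel messages; look for drbd lines
> ==============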
>
> This is on CentOS 6 with DRBD 8.4.1.  Here are my relevant configs.
>
> global_common.conf
> ===============
> global {
>         usage-count no;
>         # minor-count dialog-refresh disable-ip-verification
> }
>
> common {
>         handlers {
>                 # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>                 # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>                 # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>                 # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>                 # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>                 # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>                 # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
>                 # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>         }
>
>         startup {
>                 # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>         }
>
>         options {
>                 # cpu-mask on-no-data-accessible
>         }
>
>         disk {
>                 # size max-bio-bvecs on-io-error fencing disk-barrier disk-flushes
>                 # disk-drain md-flushes resync-rate resync-after al-extents
>                 # c-plan-ahead c-delay-target c-fill-target c-max-rate
>                 # c-min-rate disk-timeout
>         }
>
>         net {
>                 # protocol timeout max-epoch-size max-buffers unplug-watermark
>                 # connect-int ping-int sndbuf-size rcvbuf-size ko-count
>                 # allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri
>                 # after-sb-1pri after-sb-2pri always-asbp rr-conflict
>                 # ping-timeout data-integrity-alg tcp-cork on-congestion
>                 # congestion-fill congestion-extents csums-alg verify-alg
>                 # use-rle
>
>                 protocol C;
>         }
> }
> ===============
>
> r0.res
> ===============
> resource r0 {
>   on cllakvm2.tamu.edu {
>     device    /dev/drbd1;
>     disk      /dev/vg_cllakvm2/lv_vmstore;
>     address   128.194.115.76:7789;
>     meta-disk internal;
>   }
>   on cllakvm1.tamu.edu {
>     device    /dev/drbd1;
>     disk      /dev/vg_cllakvm1/lv_vmstore;
>     address   165.91.253.227:7789;
>     meta-disk internal;
>   }
> }
> =================
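>
> (For what it's worth, a quick way to confirm what drbdadm actually
> resolves from these two files is to dump the merged view of the resource:)
> =================
> # drbdadm dump r0     # prints r0 as drbdadm parses it
> =================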
>
> Since both LVs were built before the DRBD configuration, I had to shrink
> each filesystem in its LV by about 70M to make room for the internal
> metadata. I then ran "drbdadm create-md r0".  When I tried to start the
> drbd service on cllakvm2, I got the following:
> ==============
> # service drbd start
> Starting DRBD resources: [
>      create res: r0
>    prepare disk: r0
>     adjust disk: r0:failed(attach:10)
>      adjust net: r0
> ]
> .
> =============
>
> I then unmounted "/vmstore" (with all VMs stopped), re-ran create-md, and
> restarted drbd, which produced no errors.  Right after that, no drbdadm
> commands would respond, and I saw from "/proc/drbd" that the status
> showed syncing, even though the resource had not yet been promoted to
> primary.
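>
> (For reference, the sequence that got the resource to come up was roughly
> the following. The resize step is illustrative: it assumes an ext
> filesystem, and <new-size> stands in for the LV size minus the ~70M of
> internal-metadata headroom:)
> ==============
> # umount /vmstore                         # with all VMs stopped
> # e2fsck -f /dev/vg_cllakvm2/lv_vmstore   # required before shrinking
> # resize2fs /dev/vg_cllakvm2/lv_vmstore <new-size>
> # drbdadm create-md r0                    # write internal metadata
> # service drbd start
> ==============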
>
> This is the current status on cllakvm1 (the secondary):
> ==============
> # service drbd status
> drbd driver loaded OK; device status:
> version: 8.4.1 (api:1/proto:86-100)
> GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by dag at Build64R6, 2011-12-21 06:08:50
> m:res  cs         ro                   ds                         p  mounted  fstype
> 1:r0   Connected  Secondary/Secondary  Inconsistent/Inconsistent  C
>
> # cat /proc/drbd
> version: 8.4.1 (api:1/proto:86-100)
> GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by dag at Build64R6, 2011-12-21 06:08:50
>
>  1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:1132853452
> ===========
>
> On the primary, cllakvm2, this happens:
>
> =============
>
> # service drbd status
> drbd driver loaded OK; device status:
> version: 8.4.1 (api:1/proto:86-100)
> GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by dag at Build64R6, 2011-12-21 06:08:50
> *< HANGS HERE >*
>
> # cat /proc/drbd
> version: 8.4.1 (api:1/proto:86-100)
> GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by dag at Build64R6, 2011-12-21 06:08:50
>
>  1: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r-----
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:1132853452
>         [>....................] sync'ed:  0.1% (1106300/1106300)M
>         finish: 756809:02:32 speed: 0 (0) K/sec
>
> ====================
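>
> (The speed of 0 (0) K/sec means no blocks are moving even though the nodes
> show Connected; two basic link checks, using the peer address and port
> 7789 from my config above:)
> ====================
> # netstat -tn | grep 7789    # is the replication TCP session ESTABLISHED?
> # ping -c 3 165.91.253.227   # basic reachability to the peer
> ====================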
>
> =========
> # ps aux | grep drbd
> root      1099  0.0  0.0   4140   512 pts/10   D    16:21   0:00 drbdsetup sh-status 1
> root      1560  0.0  0.0 103220   872 pts/10   S+   16:40   0:00 grep drbd
> root      4484  0.0  0.0   4140   508 pts/10   D    16:23   0:00 drbdsetup sh-status 1
> root      6542  0.0  0.0   4140   512 pts/10   D    16:24   0:00 drbdsetup primary 1 --force
> root      7959  0.0  0.0   4140   512 pts/10   D    16:25   0:00 drbdsetup down r0
> root     10581  0.0  0.0   4140   512 pts/10   D    16:27   0:00 drbdsetup disconnect ipv4:128.194.115.76:7789 ipv4:165.91.253.227:7789
> root     10783  0.0  0.0   4140   512 pts/10   D    16:27   0:00 drbdsetup disconnect ipv4:128.194.115.76:7789 ipv4:165.91.253.227:7789 --force
> root     12652  0.0  0.0      0     0 ?        S    16:09   0:00 [drbd_w_r0]
> root     12654  0.0  0.0      0     0 ?        S    16:09   0:00 [drbd_r_r0]
> root     12659  0.0  0.0      0     0 ?        S    16:09   0:00 [drbd_a_r0]
> root     26059  0.0  0.0  11284   664 pts/10   S    16:36   0:00 /bin/bash /etc/init.d/drbd status
> root     26062  0.0  0.0   4140   508 pts/10   D    16:36   0:00 drbdsetup sh-status 1
> root     27570  0.0  0.0  11284   664 pts/10   S    16:37   0:00 /bin/bash /etc/init.d/drbd status
> root     27573  0.0  0.0   4140   512 pts/10   D    16:37   0:00 drbdsetup sh-status 1
> root     32255  0.0  0.0   4140   552 pts/10   D    16:20   0:00 drbdsetup r0 down
> ==============
>
> Any advice is greatly welcome.  I'm having a mild panic attack: the VMs
> were paused long enough to resize the filesystem to allow an internal
> meta-disk, but now I can't remount the LV, and the VMs can't be started
> back up.
>
> Thanks
> - Trey
>

Sorry to reply to my own post, but I got around the uninterruptible-sleep
processes by rebooting.  Now, however, the problematic system can't attach
to the DRBD resource.  I've switched to drbd83 instead of drbd84 from
ELRepo, and the problem is the same.
================
# drbdadm attach r0
0: Failure: (104) Can not open backing device.
Command 'drbdsetup 0 disk /dev/vg_cllakvm2/lv_vmstore /dev/vg_cllakvm2/lv_vmstore internal --set-defaults --create-device' terminated with exit code 10
================
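
(My understanding is that this failure means the kernel could not open the
backing LV itself, for example because it is inactive, missing, or held
open exclusively; the checks I can think of are all plain LVM/util-linux
ones:)
================
# lvs vg_cllakvm2                            # is lv_vmstore listed and active?
# lvchange -ay /dev/vg_cllakvm2/lv_vmstore   # activate it if it is not
# ls -l /dev/vg_cllakvm2/lv_vmstore          # does the device node exist?
# fuser -v /dev/vg_cllakvm2/lv_vmstore       # is something holding it open?
# mount | grep vmstore                       # e.g. still mounted as /vmstore
================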

The status of drbd on cllakvm2 (the primary, with the data):
================
# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag at Build64R6, 2011-11-20 10:57:03
 0: cs:Connected ro:Secondary/Secondary ds:Diskless/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
================

Using drbd84 gave the same "Can not open backing device" error with exit code 10.

The strange part is that these systems are identical in every way, except
that their volume groups are named after their hostnames.  The DRBD setup
is identical as well, since I'm managing it with Puppet.  Any advice on how
to troubleshoot or resolve this?
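
(In case it helps, a rough way to compare the two hosts side by side, using
only standard commands; the grep patterns are just from my naming above:)
================
# lvscan | grep vmstore         # run on each host: the LV should be ACTIVE
# drbdadm dump r0 | grep disk   # the backing disk drbdadm resolved per host
================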

Thanks
- Trey