[DRBD-user] Problem with stacked resource failing

Andreas Kurz andreas at hastexo.com
Fri Mar 2 13:52:11 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

On 03/01/2012 09:37 PM, envisionrx wrote:
> 
> Hey all, I have a two node single primary with offsite disaster recovery (dr)
> node configuration using stacked resources that I'm having weird issues
> with.  Twice in the last week the primary node stopped responding and I had
> to disconnect/reconnect the dr node to get it working again.  When it fails
> I get the following in the primary nodes logs:
> 
> kern.err<3>: Feb 29 20:21:20 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294966565
> 
> There are no relevant log entries on the DR node.
> 
> I see these messages in the logs from time to time, but usually they just
> last for a few seconds and it's all cleared up on it's own.
> 
> can anyone give me some idea of what direction to go in to try to figure out
> what the issue might be?  I've included my global.conf, drbd.conf and more
> logs from around the time it failed last.  Please let me know if any
> additional information would be helpful!

Any chance your DR node has significant different hardware setup,
especially regarding disk and raid controller capabilities? If your DR
node is under high (i/o load) because of e.g. a backup job it might be
unable to cope with DRBD replication at the same time because your i/o
stack is completely overloaded. Add something like "ko-count 6;" to the
net section, this will prevent your primary to block for too long time
though it will also go into Standalone mode which has to be resolved
manually.

Regards,
Andreas

-- 
Need help with DRBD?
http://www.hastexo.com/now

> 
> Thanks! 
> 
> here is my global.conf file:
> global {
>         usage-count yes;
>         # minor-count dialog-refresh disable-ip-verification
> }
> 
> common {
>         protocol C;
> 
>         handlers {
>                 pri-on-incon-degr
> "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
> reboot -f";
>                 pri-lost-after-sb
> "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
> reboot -f";
>                 local-io-error "/usr/lib/drbd/notify-io-error.sh;
> /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ;
> halt -f";
> #               fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>                 # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>                 # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>                 # before-resync-target
> "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
> #               after-resync-target
> /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>         }
> 
>         startup {
>                 # wfc-timeout degr-wfc-timeout outdated-wfc-timeout
> wait-after-sb
>         }
> 
>         disk {
>                 # on-io-error fencing use-bmbv no-disk-barrier
> no-disk-flushes
>                 # no-disk-drain no-md-flushes max-bio-bvecs
>    on-io-error detach;
> #   fencing resource-only;
>         }
> 
>         net {
>                 # sndbuf-size rcvbuf-size timeout connect-int ping-int
> ping-timeout max-buffers
>                 # max-epoch-size ko-count allow-two-primaries cram-hmac-alg
> shared-secret
>                 # after-sb-0pri after-sb-1pri after-sb-2pri
> data-integrity-alg no-tcp-cork
> #   data-integrity-alg crc32c;
>    after-sb-0pri discard-zero-changes;
>    after-sb-1pri consensus;
>    after-sb-2pri disconnect;
>         }
> 
>         syncer {
>                 # rate after al-extents use-rle cpu-mask verify-alg
> csums-alg
> rate 100M;
> csums-alg crc32c;
> verify-alg crc32c;
> use-rle;
>         }
> }
> 
> drbd.conf file excerpt ( i have a total of 12 resources, 6 lowers and 6
> uppers, meta and data1-data5, all are configured the same as the two shown
> here)
> 
> include "drbd.d/global_common.conf";
> include "drbd.d/*.res";
> resource meta_lower {
>  disk /dev/backingvg/metabacking;
>  device /dev/drbd0;
>  meta-disk internal;
>  disk {
>    fencing resource-only;
>  }
>  handlers {
>    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>  }
>  on openfiler1 {
>     address 10.50.153.1:7788;
>  }
>  on openfiler2 {
>     address 10.50.153.2:7788;
>  }
> }
> resource data1_lower {
>  device /dev/drbd1;
>  disk /dev/backingvg/256data1backing;
>  meta-disk internal;
>  disk {
>    fencing resource-only;
>  }
>  handlers {
>    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>  }
>  on openfiler1 {
>     address 10.50.153.1:7789;
>  }
>  on openfiler2 {
>     address 10.50.153.2:7789;
>  }
> }
> ...
> 
> resource meta {
>  protocol A;
>  device /dev/drbd10;
>  meta-disk internal;
>  syncer {
>    rate 1000k;
>  }
>  stacked-on-top-of meta_lower {
>    address 10.50.150.101:7788;
>  }
>  on openfiler3 {
>    disk /dev/backingvg/metabacking;
>    address 10.50.250.4:7788;
>  }
> }
> resource data1 {
>  protocol A;
>  device /dev/drbd11;
>  meta-disk internal;
>  handlers {
>     before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
>     after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
>  }
>  net {
>    sndbuf-size 512k;
>    on-congestion pull-ahead;
>    congestion-fill 500k;
>  }
>  syncer {
>    rate 1000k;
>  }
>  stacked-on-top-of data1_lower {
>    address 10.50.150.101:7789;
>  }
>  on openfiler3 {
>    disk /dev/backingvg/256data1backing;
>    address 10.50.250.4:7789;
>  }
> }
> ...
> 
> Here the log right around the time it failed:
> 
> 
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn(
> WFConnection -> WFReportParams )
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Starting
> asender thread (from drbd10_receiver [4007])
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10:
> data-integrity-alg: <not-used>
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10:
> drbd_sync_handshake:
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: self
> D64C8B1CF54765C3:6A4AC00929A719C7:BAA2C9167F6DE4B7:BAA1C9167F6DE4B7 bits:0
> flags:0
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer
> 6A4AC00929A719C6:0000000000000000:BAA2C9167F6DE4B6:BAA1C9167F6DE4B7 bits:0
> flags:0
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10:
> uuid_compare()=1 by rule 70
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer( Unknown
> -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown ->
> Consistent )
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: send bitmap
> stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression: 100.0%
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: receive
> bitmap stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression:
> 100.0%
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper
> command: /sbin/drbdadm before-resync-source minor-10
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper
> command: /sbin/drbdadm before-resync-source minor-10 exit code 0 (0x0)
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn(
> WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Began resync
> as SyncSource (will sync 0 KB [0 bits set]).
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated sync
> UUID D64C8B1CF54765C3:6A4BC00929A719C7:6A4AC00929A719C7:BAA2C9167F6DE4B7
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Resync done
> (total 1 sec; paused 0 sec; 0 K/sec)
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated UUIDs
> D64C8B1CF54765C3:0000000000000000:6A4BC00929A719C7:6A4AC00929A719C7
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn(
> SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: bitmap WRITE
> of 0 pages took 0 jiffies
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: 0 KB (0 bits)
> marked out-of-sync by on disk bit-map.
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: Resync done
> (total 98 sec; paused 0 sec; 0 K/sec)
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: updated UUIDs
> 42842875FAED516F:0000000000000000:4ED75BEBA8150A1D:4ED65BEBA8150A1D
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: conn(
> SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: bitmap WRITE
> of 0 pages took 0 jiffies
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: 0 KB (0 bits)
> marked out-of-sync by on disk bit-map.
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: Resync done
> (total 117 sec; paused 0 sec; 0 K/sec)
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: updated UUIDs
> 226B7A8BEE6FD74D:0000000000000000:FB859CF1270E0AB1:FB849CF1270E0AB1
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: conn(
> SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: bitmap WRITE
> of 0 pages took 0 jiffies
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: 0 KB (0 bits)
> marked out-of-sync by on disk bit-map.
> kern.err<3>: Feb 29 19:08:20 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967295
> kern.err<3>: Feb 29 19:08:26 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967294
> kern.err<3>: Feb 29 19:08:32 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967293
> kern.err<3>: Feb 29 19:08:38 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967292
> kern.err<3>: Feb 29 19:08:44 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967291
> kern.err<3>: Feb 29 19:08:50 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967290
> kern.err<3>: Feb 29 19:08:56 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967289
> kern.err<3>: Feb 29 19:09:02 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967288
> kern.err<3>: Feb 29 19:09:08 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967287
> kern.err<3>: Feb 29 19:09:14 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967286
> kern.err<3>: Feb 29 19:09:20 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967285
> kern.err<3>: Feb 29 19:09:26 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967284
> kern.err<3>: Feb 29 19:09:32 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967283
> kern.err<3>: Feb 29 19:09:38 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967282
> kern.err<3>: Feb 29 19:09:44 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967281
> kern.err<3>: Feb 29 19:09:50 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967280
> kern.err<3>: Feb 29 19:09:56 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967279
> kern.err<3>: Feb 29 19:10:02 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967278
> kern.err<3>: Feb 29 19:10:08 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967277
> kern.err<3>: Feb 29 19:10:14 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967276
> 
> 
> these are the only messages show until i reset the link between nodes by
> doing drbdadm down all on the dr node.
> 
> 
> 
>  



-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 222 bytes
Desc: OpenPGP digital signature
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120302/c6b5c787/attachment.pgp>


More information about the drbd-user mailing list