Hello,
On 03/01/2012 09:37 PM, envisionrx wrote:
>
> Hey all, I have a two-node, single-primary setup with an offsite disaster
> recovery (DR) node, using stacked resources, and I'm having weird issues
> with it. Twice in the last week the primary node stopped responding and I had
> to disconnect/reconnect the DR node to get it working again. When it fails
> I get the following in the primary node's logs:
>
> kern.err<3>: Feb 29 20:21:20 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294966565
>
> There are no relevant log entries on the DR node.
>
> I see these messages in the logs from time to time, but usually they just
> last for a few seconds and it all clears up on its own.
>
> Can anyone give me some idea of which direction to go in to figure out
> what the issue might be? I've included my global.conf, drbd.conf and more
> logs from around the time it last failed. Please let me know if any
> additional information would be helpful!
Is there any chance your DR node has a significantly different hardware
setup, especially regarding disk and RAID controller capabilities? If your
DR node is under high I/O load, e.g. because of a backup job, it might be
unable to cope with DRBD replication at the same time because its I/O
stack is completely overloaded. Add something like "ko-count 6;" to the
net section. This will prevent your primary from blocking for too long,
though it will also go into StandAlone mode, which has to be resolved
manually.
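A minimal sketch of that suggestion is shown below. The "timeout" line is
an assumption added for illustration: it only restates what should be the
default (60, in tenths of a second), which matches the 6-second spacing of
the "sock_sendmsg time expired" messages in the log. With these values the
primary would give up on an unresponsive peer after roughly 6 x 6 = 36
seconds instead of blocking indefinitely. The very large "ko" values in the
log (4294967295 is 2^32 - 1, counting down by one per message) look like an
unsigned counter decremented from 0, which would fit ko-count currently
being unset, though that reading is an inference from the log alone.

```
net {
    # Illustrative values only; tune them to your link and workload.
    timeout   60;   # tenths of a second, i.e. 6 s (assumed to be the default)
    ko-count  6;    # give up on the peer after 6 expired timeouts for one write
}
```

The change would then need to be applied on the nodes, for example with
"drbdadm adjust" (stacked resources may need the --stacked option); check
the drbdadm man page for your version before relying on this.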
Regards,
Andreas
>
> Thanks!
>
> Here is my global.conf file:
> global {
> usage-count yes;
> # minor-count dialog-refresh disable-ip-verification
> }
>
> common {
> protocol C;
>
> handlers {
> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
> # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
> # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
> # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
> }
>
> startup {
> # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
> }
>
> disk {
> # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
> # no-disk-drain no-md-flushes max-bio-bvecs
> on-io-error detach;
> # fencing resource-only;
> }
>
> net {
> # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
> # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
> # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
> # data-integrity-alg crc32c;
> after-sb-0pri discard-zero-changes;
> after-sb-1pri consensus;
> after-sb-2pri disconnect;
> }
>
> syncer {
> # rate after al-extents use-rle cpu-mask verify-alg csums-alg
> rate 100M;
> csums-alg crc32c;
> verify-alg crc32c;
> use-rle;
> }
> }
>
> drbd.conf file excerpt (I have a total of 12 resources, 6 lower and 6
> upper: meta and data1-data5. All are configured the same as the two shown
> here):
>
> include "drbd.d/global_common.conf";
> include "drbd.d/*.res";
> resource meta_lower {
> disk /dev/backingvg/metabacking;
> device /dev/drbd0;
> meta-disk internal;
> disk {
> fencing resource-only;
> }
> handlers {
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> }
> on openfiler1 {
> address 10.50.153.1:7788;
> }
> on openfiler2 {
> address 10.50.153.2:7788;
> }
> }
> resource data1_lower {
> device /dev/drbd1;
> disk /dev/backingvg/256data1backing;
> meta-disk internal;
> disk {
> fencing resource-only;
> }
> handlers {
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> }
> on openfiler1 {
> address 10.50.153.1:7789;
> }
> on openfiler2 {
> address 10.50.153.2:7789;
> }
> }
> ...
>
> resource meta {
> protocol A;
> device /dev/drbd10;
> meta-disk internal;
> syncer {
> rate 1000k;
> }
> stacked-on-top-of meta_lower {
> address 10.50.150.101:7788;
> }
> on openfiler3 {
> disk /dev/backingvg/metabacking;
> address 10.50.250.4:7788;
> }
> }
> resource data1 {
> protocol A;
> device /dev/drbd11;
> meta-disk internal;
> handlers {
> before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
> after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
> }
> net {
> sndbuf-size 512k;
> on-congestion pull-ahead;
> congestion-fill 500k;
> }
> syncer {
> rate 1000k;
> }
> stacked-on-top-of data1_lower {
> address 10.50.150.101:7789;
> }
> on openfiler3 {
> disk /dev/backingvg/256data1backing;
> address 10.50.250.4:7789;
> }
> }
> ...
>
> Here is the log from right around the time it failed:
>
>
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn( WFConnection -> WFReportParams )
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Starting asender thread (from drbd10_receiver [4007])
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: data-integrity-alg: <not-used>
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: drbd_sync_handshake:
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: self D64C8B1CF54765C3:6A4AC00929A719C7:BAA2C9167F6DE4B7:BAA1C9167F6DE4B7 bits:0 flags:0
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer 6A4AC00929A719C6:0000000000000000:BAA2C9167F6DE4B6:BAA1C9167F6DE4B7 bits:0 flags:0
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: uuid_compare()=1 by rule 70
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression: 100.0%
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression: 100.0%
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper command: /sbin/drbdadm before-resync-source minor-10
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper command: /sbin/drbdadm before-resync-source minor-10 exit code 0 (0x0)
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Began resync as SyncSource (will sync 0 KB [0 bits set]).
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated sync UUID D64C8B1CF54765C3:6A4BC00929A719C7:6A4AC00929A719C7:BAA2C9167F6DE4B7
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated UUIDs D64C8B1CF54765C3:0000000000000000:6A4BC00929A719C7:6A4AC00929A719C7
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: bitmap WRITE of 0 pages took 0 jiffies
> kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: Resync done (total 98 sec; paused 0 sec; 0 K/sec)
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: updated UUIDs 42842875FAED516F:0000000000000000:4ED75BEBA8150A1D:4ED65BEBA8150A1D
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: bitmap WRITE of 0 pages took 0 jiffies
> kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: Resync done (total 117 sec; paused 0 sec; 0 K/sec)
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: updated UUIDs 226B7A8BEE6FD74D:0000000000000000:FB859CF1270E0AB1:FB849CF1270E0AB1
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: bitmap WRITE of 0 pages took 0 jiffies
> kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> kern.err<3>: Feb 29 19:08:20 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967295
> kern.err<3>: Feb 29 19:08:26 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967294
> kern.err<3>: Feb 29 19:08:32 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967293
> kern.err<3>: Feb 29 19:08:38 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967292
> kern.err<3>: Feb 29 19:08:44 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967291
> kern.err<3>: Feb 29 19:08:50 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967290
> kern.err<3>: Feb 29 19:08:56 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967289
> kern.err<3>: Feb 29 19:09:02 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967288
> kern.err<3>: Feb 29 19:09:08 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967287
> kern.err<3>: Feb 29 19:09:14 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967286
> kern.err<3>: Feb 29 19:09:20 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967285
> kern.err<3>: Feb 29 19:09:26 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967284
> kern.err<3>: Feb 29 19:09:32 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967283
> kern.err<3>: Feb 29 19:09:38 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967282
> kern.err<3>: Feb 29 19:09:44 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967281
> kern.err<3>: Feb 29 19:09:50 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967280
> kern.err<3>: Feb 29 19:09:56 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967279
> kern.err<3>: Feb 29 19:10:02 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967278
> kern.err<3>: Feb 29 19:10:08 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967277
> kern.err<3>: Feb 29 19:10:14 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967276
>
>
> These are the only messages shown until I reset the link between the nodes
> by running "drbdadm down all" on the DR node.