Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello, On 03/01/2012 09:37 PM, envisionrx wrote: > > Hey all, I have a two node single primary with offsite disaster recovery (dr) > node configuration using stacked resources that I'm having weird issues > with. Twice in the last week the primary node stopped responding and I had > to disconnect/reconnect the dr node to get it working again. When it fails > I get the following in the primary nodes logs: > > kern.err<3>: Feb 29 20:21:20 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294966565 > > There are no relevant log entries on the DR node. > > I see these messages in the logs from time to time, but usually they just > last for a few seconds and it's all cleared up on it's own. > > can anyone give me some idea of what direction to go in to try to figure out > what the issue might be? I've included my global.conf, drbd.conf and more > logs from around the time it failed last. Please let me know if any > additional information would be helpful! Any chance your DR node has significant different hardware setup, especially regarding disk and raid controller capabilities? If your DR node is under high (i/o load) because of e.g. a backup job it might be unable to cope with DRBD replication at the same time because your i/o stack is completely overloaded. Add something like "ko-count 6;" to the net section, this will prevent your primary to block for too long time though it will also go into Standalone mode which has to be resolved manually. Regards, Andreas -- Need help with DRBD? http://www.hastexo.com/now > > Thanks! > > here is my global.conf file: > global { > usage-count yes; > # minor-count dialog-refresh disable-ip-verification > } > > common { > protocol C; > > handlers { > pri-on-incon-degr > "/usr/lib/drbd/notify-pri-on-incon-degr.sh; > /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; > reboot -f"; > pri-lost-after-sb > "/usr/lib/drbd/notify-pri-lost-after-sb.sh; > /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; > reboot -f"; > local-io-error "/usr/lib/drbd/notify-io-error.sh; > /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; > halt -f"; > # fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; > # split-brain "/usr/lib/drbd/notify-split-brain.sh root"; > # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root"; > # before-resync-target > "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k"; > # after-resync-target > /usr/lib/drbd/unsnapshot-resync-target-lvm.sh; > } > > startup { > # wfc-timeout degr-wfc-timeout outdated-wfc-timeout > wait-after-sb > } > > disk { > # on-io-error fencing use-bmbv no-disk-barrier > no-disk-flushes > # no-disk-drain no-md-flushes max-bio-bvecs > on-io-error detach; > # fencing resource-only; > } > > net { > # sndbuf-size rcvbuf-size timeout connect-int ping-int > ping-timeout max-buffers > # max-epoch-size ko-count allow-two-primaries cram-hmac-alg > shared-secret > # after-sb-0pri after-sb-1pri after-sb-2pri > data-integrity-alg no-tcp-cork > # data-integrity-alg crc32c; > after-sb-0pri discard-zero-changes; > after-sb-1pri consensus; > after-sb-2pri disconnect; > } > > syncer { > # rate after al-extents use-rle cpu-mask verify-alg > csums-alg > rate 100M; > csums-alg crc32c; > verify-alg crc32c; > use-rle; > } > } > > drbd.conf file excerpt ( i have a total of 12 resources, 6 lowers and 6 > uppers, meta and data1-data5, all are configured the same as the two shown > here) > > include "drbd.d/global_common.conf"; > include "drbd.d/*.res"; > resource meta_lower { > disk /dev/backingvg/metabacking; > device /dev/drbd0; > meta-disk internal; > disk { > fencing resource-only; > } > handlers { > fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; > after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; > } > on openfiler1 { > address 10.50.153.1:7788; > } > on openfiler2 { > address 10.50.153.2:7788; > } > } > resource data1_lower { > device /dev/drbd1; > disk /dev/backingvg/256data1backing; > meta-disk internal; > disk { > fencing resource-only; > } > handlers { > fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; > after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; > } > on openfiler1 { > address 10.50.153.1:7789; > } > on openfiler2 { > address 10.50.153.2:7789; > } > } > ... > > resource meta { > protocol A; > device /dev/drbd10; > meta-disk internal; > syncer { > rate 1000k; > } > stacked-on-top-of meta_lower { > address 10.50.150.101:7788; > } > on openfiler3 { > disk /dev/backingvg/metabacking; > address 10.50.250.4:7788; > } > } > resource data1 { > protocol A; > device /dev/drbd11; > meta-disk internal; > handlers { > before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh"; > after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh"; > } > net { > sndbuf-size 512k; > on-congestion pull-ahead; > congestion-fill 500k; > } > syncer { > rate 1000k; > } > stacked-on-top-of data1_lower { > address 10.50.150.101:7789; > } > on openfiler3 { > disk /dev/backingvg/256data1backing; > address 10.50.250.4:7789; > } > } > ... > > Here the log right around the time it failed: > > > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn( > WFConnection -> WFReportParams ) > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Starting > asender thread (from drbd10_receiver [4007]) > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: > data-integrity-alg: <not-used> > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: > drbd_sync_handshake: > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: self > D64C8B1CF54765C3:6A4AC00929A719C7:BAA2C9167F6DE4B7:BAA1C9167F6DE4B7 bits:0 > flags:0 > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer > 6A4AC00929A719C6:0000000000000000:BAA2C9167F6DE4B6:BAA1C9167F6DE4B7 bits:0 > flags:0 > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: > uuid_compare()=1 by rule 70 > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer( Unknown > -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> > Consistent ) > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: send bitmap > stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression: 100.0% > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: receive > bitmap stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression: > 100.0% > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper > command: /sbin/drbdadm before-resync-source minor-10 > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper > command: /sbin/drbdadm before-resync-source minor-10 exit code 0 (0x0) > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn( > WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Began resync > as SyncSource (will sync 0 KB [0 bits set]). > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated sync > UUID D64C8B1CF54765C3:6A4BC00929A719C7:6A4AC00929A719C7:BAA2C9167F6DE4B7 > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Resync done > (total 1 sec; paused 0 sec; 0 K/sec) > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated UUIDs > D64C8B1CF54765C3:0000000000000000:6A4BC00929A719C7:6A4AC00929A719C7 > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn( > SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: bitmap WRITE > of 0 pages took 0 jiffies > kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: 0 KB (0 bits) > marked out-of-sync by on disk bit-map. > kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: Resync done > (total 98 sec; paused 0 sec; 0 K/sec) > kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: updated UUIDs > 42842875FAED516F:0000000000000000:4ED75BEBA8150A1D:4ED65BEBA8150A1D > kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: conn( > SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) > kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: bitmap WRITE > of 0 pages took 0 jiffies > kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: 0 KB (0 bits) > marked out-of-sync by on disk bit-map. > kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: Resync done > (total 117 sec; paused 0 sec; 0 K/sec) > kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: updated UUIDs > 226B7A8BEE6FD74D:0000000000000000:FB859CF1270E0AB1:FB849CF1270E0AB1 > kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: conn( > SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) > kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: bitmap WRITE > of 0 pages took 0 jiffies > kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: 0 KB (0 bits) > marked out-of-sync by on disk bit-map. > kern.err<3>: Feb 29 19:08:20 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967295 > kern.err<3>: Feb 29 19:08:26 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967294 > kern.err<3>: Feb 29 19:08:32 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967293 > kern.err<3>: Feb 29 19:08:38 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967292 > kern.err<3>: Feb 29 19:08:44 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967291 > kern.err<3>: Feb 29 19:08:50 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967290 > kern.err<3>: Feb 29 19:08:56 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967289 > kern.err<3>: Feb 29 19:09:02 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967288 > kern.err<3>: Feb 29 19:09:08 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967287 > kern.err<3>: Feb 29 19:09:14 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967286 > kern.err<3>: Feb 29 19:09:20 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967285 > kern.err<3>: Feb 29 19:09:26 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967284 > kern.err<3>: Feb 29 19:09:32 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967283 > kern.err<3>: Feb 29 19:09:38 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967282 > kern.err<3>: Feb 29 19:09:44 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967281 > kern.err<3>: Feb 29 19:09:50 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967280 > kern.err<3>: Feb 29 19:09:56 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967279 > kern.err<3>: Feb 29 19:10:02 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967278 > kern.err<3>: Feb 29 19:10:08 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967277 > kern.err<3>: Feb 29 19:10:14 openfiler2 kernel: block drbd14: > [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967276 > > > these are the only messages show until i reset the link between nodes by > doing drbdadm down all on the dr node. > > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 222 bytes Desc: OpenPGP digital signature URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120302/c6b5c787/attachment.pgp>