Hey all,

I have a two-node, single-primary configuration with an offsite disaster recovery (DR) node, using stacked resources, and I'm having some odd issues with it. Twice in the last week the primary node stopped responding and I had to disconnect/reconnect the DR node to get it working again. When it fails I get the following in the primary node's logs:

kern.err<3>: Feb 29 20:21:20 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294966565

There are no relevant log entries on the DR node. I do see these messages in the logs from time to time, but usually they only last a few seconds and everything clears up on its own. Can anyone give me some idea of what direction to go in to try to figure out what the issue might be? I've included my global.conf, an excerpt from my drbd.conf, and more logs from around the time it last failed. Please let me know if any additional information would be helpful! Thanks!

Here is my global.conf file:

global {
    usage-count yes;
    # minor-count dialog-refresh disable-ip-verification
}

common {
    protocol C;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }

    startup {
        # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
    }

    disk {
        # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
        # no-disk-drain no-md-flushes max-bio-bvecs
        on-io-error detach;
        # fencing resource-only;
    }

    net {
        # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
        # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
        # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
        # data-integrity-alg crc32c;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri consensus;
        after-sb-2pri disconnect;
    }

    syncer {
        # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        rate 100M;
        csums-alg crc32c;
        verify-alg crc32c;
        use-rle;
    }
}

Here is an excerpt from my drbd.conf (I have a total of 12 resources, 6 lowers and 6 uppers: meta and data1 through data5; all are configured the same as the two shown here):

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

resource meta_lower {
    disk /dev/backingvg/metabacking;
    device /dev/drbd0;
    meta-disk internal;

    disk {
        fencing resource-only;
    }

    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    on openfiler1 {
        address 10.50.153.1:7788;
    }
    on openfiler2 {
        address 10.50.153.2:7788;
    }
}

resource data1_lower {
    device /dev/drbd1;
    disk /dev/backingvg/256data1backing;
    meta-disk internal;

    disk {
        fencing resource-only;
    }

    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    on openfiler1 {
        address 10.50.153.1:7789;
    }
    on openfiler2 {
        address 10.50.153.2:7789;
    }
}

...
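One aside before I paste the stacked resources, in case it points somewhere useful: the ko value in those errors counts down from 4294967295, which is -1 as an unsigned 32-bit number, and as far as I understand that is what you get when ko-count is left unset, so DRBD just keeps retrying the send forever instead of ever declaring the peer dead. I may well be misreading it. If it is relevant, I believe something along these lines in the common net section would make DRBD drop the connection instead of blocking; the values below are purely illustrative and are not what I am running today:

    net {
        # illustrative values only, not my current config
        timeout      60;   # 6.0 seconds (unit is tenths of a second)
        ping-int     10;   # seconds between keep-alive packets
        ping-timeout  5;   # 0.5 seconds to wait for a ping reply
        ko-count      7;   # give up on the peer after 7 missed write timeouts
                           # (0, the default, means keep retrying forever)
    }

Whether disconnecting earlier is actually the right thing for the DR link I am not sure; it might just hide whatever is stalling the send in the first place.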
And here are the corresponding stacked (upper) resources:

resource meta {
    protocol A;
    device /dev/drbd10;
    meta-disk internal;

    syncer {
        rate 1000k;
    }

    stacked-on-top-of meta_lower {
        address 10.50.150.101:7788;
    }

    on openfiler3 {
        disk /dev/backingvg/metabacking;
        address 10.50.250.4:7788;
    }
}

resource data1 {
    protocol A;
    device /dev/drbd11;
    meta-disk internal;

    handlers {
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
        after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
    }

    net {
        sndbuf-size 512k;
        on-congestion pull-ahead;
        congestion-fill 500k;
    }

    syncer {
        rate 1000k;
    }

    stacked-on-top-of data1_lower {
        address 10.50.150.101:7789;
    }

    on openfiler3 {
        disk /dev/backingvg/256data1backing;
        address 10.50.250.4:7789;
    }
}

...

Here is the log from right around the time it failed:

kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn( WFConnection -> WFReportParams )
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Starting asender thread (from drbd10_receiver [4007])
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: data-integrity-alg: <not-used>
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: drbd_sync_handshake:
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: self D64C8B1CF54765C3:6A4AC00929A719C7:BAA2C9167F6DE4B7:BAA1C9167F6DE4B7 bits:0 flags:0
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer 6A4AC00929A719C6:0000000000000000:BAA2C9167F6DE4B6:BAA1C9167F6DE4B7 bits:0 flags:0
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: uuid_compare()=1 by rule 70
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression: 100.0%
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression: 100.0%
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper command: /sbin/drbdadm before-resync-source minor-10
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper command: /sbin/drbdadm before-resync-source minor-10 exit code 0 (0x0)
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Began resync as SyncSource (will sync 0 KB [0 bits set]).
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated sync UUID D64C8B1CF54765C3:6A4BC00929A719C7:6A4AC00929A719C7:BAA2C9167F6DE4B7
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated UUIDs D64C8B1CF54765C3:0000000000000000:6A4BC00929A719C7:6A4AC00929A719C7
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: bitmap WRITE of 0 pages took 0 jiffies
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: Resync done (total 98 sec; paused 0 sec; 0 K/sec)
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: updated UUIDs 42842875FAED516F:0000000000000000:4ED75BEBA8150A1D:4ED65BEBA8150A1D
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: bitmap WRITE of 0 pages took 0 jiffies
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: Resync done (total 117 sec; paused 0 sec; 0 K/sec)
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: updated UUIDs 226B7A8BEE6FD74D:0000000000000000:FB859CF1270E0AB1:FB849CF1270E0AB1
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: bitmap WRITE of 0 pages took 0 jiffies
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kern.err<3>: Feb 29 19:08:20 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967295
kern.err<3>: Feb 29 19:08:26 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967294
kern.err<3>: Feb 29 19:08:32 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967293
kern.err<3>: Feb 29 19:08:38 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967292
kern.err<3>: Feb 29 19:08:44 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967291
kern.err<3>: Feb 29 19:08:50 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967290
kern.err<3>: Feb 29 19:08:56 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967289
kern.err<3>: Feb 29 19:09:02 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967288
kern.err<3>: Feb 29 19:09:08 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967287
kern.err<3>: Feb 29 19:09:14 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967286
kern.err<3>: Feb 29 19:09:20 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967285
kern.err<3>: Feb 29 19:09:26 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967284
kern.err<3>: Feb 29 19:09:32 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967283
kern.err<3>: Feb 29 19:09:38 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967282
kern.err<3>: Feb 29 19:09:44 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967281
kern.err<3>: Feb 29 19:09:50 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967280
kern.err<3>: Feb 29 19:09:56 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967279
kern.err<3>: Feb 29 19:10:02 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967278
kern.err<3>: Feb 29 19:10:08 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967277
kern.err<3>: Feb 29 19:10:14 openfiler2 kernel: block drbd14: [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967276

These are the only messages shown until I reset the link between the nodes by running drbdadm down all on the DR node.
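For completeness, here is roughly what I run to recover and what I look at while it is hung. This is only a sketch; the drbd14 = data4 mapping and the port number are guesses on my part based on how the other resources are numbered:

    # On the DR node (openfiler3), this is what gets things moving again:
    drbdadm down all
    drbdadm up all

    # On the primary (openfiler2), while it is wedged:
    cat /proc/drbd                    # connection/disk state of every minor
    drbdadm --stacked cstate data4    # drbd14 should be the stacked data4 resource (my assumption)
    netstat -tn | grep 7792           # assuming data4 follows the 7788+n port pattern;
                                      # a large Send-Q here would point at the TCP link itself

If there is something smarter to look at than the Send-Q on that socket while it is stuck, I am all ears.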