Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wed 19 Aug 2015 09:37:25, Viktor Villafuerte wrote:
> Hi Lars,
>
> thanks for your answer. Please see more details below.
>
> On Tue 18 Aug 2015 18:59:18, Lars Ellenberg wrote:
> > On Fri, Aug 14, 2015 at 05:00:48PM +1000, Viktor Villafuerte wrote:
> > > Hi all,
> > >
> > > I've got a very similar problem to the one in
> > >
> > > http://lists.linbit.com/pipermail/drbd-user/2015-July/022243.html
> > >
> > > OS: Red Hat Enterprise Linux Server release 6.7 (Santiago)
> > > kernel: kernel-2.6.32-573.1.1.el6.x86_64
> > > DRBD: kmod-drbd84-8.4.6-1.el6.elrepo.x86_64
> > >
> > > Two-node cluster where, after an upgrade to the latest kernel +
> > > Corosync + Pacemaker + DRBD, I've encountered problems.
> >
> > Did you upgrade both nodes yet?
>
> Yes, both were upgraded. When first failing over Corosync + friends onto
> the newly upgraded node (from the not-yet-upgraded node), only a partial
> sync happened and no problems were found. After upgrading the second node
> a full resync was requested and I never recovered from that.
>
> > > The Secondary does a full resync and seems to finish it successfully,
> > > but then 'PingAck did not arrive in time' happens and the resync
> > > repeats. This keeps happening in a loop thereafter.
> >
> > You could increase the "ping timeout" for now,
> > and see if that gets you over this odd issue.
>
> I'll give this a go, thanks. I assume that will require a restart..?

I've increased the ping-timeout value to 5 seconds just before the sync
finished, and only on the SyncTarget side (rough commands are noted further
below). After the sync finished, the resync didn't start again, but things
ended up in the state shown in the logs below (see more explanation after
the logs).

SyncSource (dbpg02):

Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: PingAck did not arrive in time.
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
Aug 19 12:29:27 dbpg02 crmd[6415]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: process_pe_message: Calculated Transition 57: /var/lib/pacemaker/pengine/pe-input-77.bz2
Aug 19 12:29:27 dbpg02 crmd[6415]: notice: run_graph: Transition 57 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-77.bz2): Complete
Aug 19 12:29:27 dbpg02 crmd[6415]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: ack_receiver terminated
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected )
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: receiver terminated
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: Restarting receiver thread
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: receiver (re)started
Aug 19 12:29:28 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:32 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:35 dbpg02 kernel: device eth1 entered promiscuous mode
Aug 19 12:29:36 dbpg02 crmd[6415]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: process_pe_message: Calculated Transition 58: /var/lib/pacemaker/pengine/pe-input-78.bz2
Aug 19 12:29:36 dbpg02 crmd[6415]: notice: run_graph: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-78.bz2): Complete
Aug 19 12:29:36 dbpg02 crmd[6415]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:37 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:50 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:54 dbpg02 pacemakerd[8284]: notice: crm_add_logfile: Additional logging available in /var/log/corosync/corosync.log
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:55 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:30:00 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:30:05 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:30:10 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:30:21 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:30:22 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:30:22 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:30:22 dbpg02 kernel: drbd drbd0: Connection closed
..
.. repeat sock_recvmsg ..
..

SyncTarget (dbpg01):

Aug 19 12:29:27 dbpg01 kernel: block drbd0: Resync done (total 24958 sec; paused 0 sec; 32264 K/sec)
Aug 19 12:29:27 dbpg01 kernel: block drbd0: updated UUIDs BFF7A490315C8E1C:0000000000000000:08956592516D4BB2:08946592516D4BB3
Aug 19 12:29:27 dbpg01 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Aug 19 12:29:27 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Aug 19 12:29:27 dbpg01 crm-unfence-peer.sh[5847]: invoked for drbd0
Aug 19 12:29:27 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Aug 19 12:29:27 dbpg01 kernel: drbd drbd0: sock was shut down by peer
Aug 19 12:29:27 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug 19 12:29:27 dbpg01 kernel: drbd drbd0: short read (expected size 16)
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: meta connection shut down by peer.
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: ack_receiver terminated
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: Connection closed
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: receiver terminated
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: Restarting receiver thread
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: receiver (re)started
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:32 dbpg01 kernel: device eth1 entered promiscuous mode
Aug 19 12:29:36 dbpg01 attrd[4345]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd0 (1000)
Aug 19 12:29:36 dbpg01 attrd[4345]: notice: attrd_perform_update: Sent update 14: master-drbd0=1000
Aug 19 12:29:38 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:29:38 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:29:56 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:01 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:06 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:06 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:11 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:29 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:29 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:35 dbpg01 crmd[6239]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 19 12:30:47 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:47 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:04 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:09 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:09 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:12 dbpg01 kernel: device eth1 left promiscuous mode
Aug 19 12:31:30 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:35 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:35 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:49 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:32:06 dbpg01 kernel: drbd drbd0: initial packet S crossed
..
.. repeat initial packet S crossed ..

However, the sync didn't start again from 0%, which was a (good) change. I
thought the initial packets were crossing but nothing was happening
afterwards, so maybe the timeout was too long. So I reset it to 1 second,
after which the secondary caught on, did a partial sync of whatever was out
of sync (about 5 seconds), and now both nodes are UpToDate :)

Would you have any possible explanation for this? I did read the man pages
about the timeouts etc., but I wonder why sometimes, when the connection is
broken after the sync, it does a full sync, and this time it did only a
partial sync? The connection was broken just the same (according to the
logs).

> > > Please note that I could not observe any network problems, no problems
> > > with the resync or any other services, except the very end of the
> > > resync.
> > >
> > > I've tried rebooting the secondary (target) node, invalidating the data
> > > and forcing a full resync by hand, waiting about 4 rounds of the loop :)
> > > But to no avail.
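
(For reference, this is roughly how the manual full resync and the
ping-timeout change described above can be done; the resource name "drbd0"
is an assumption here, substitute your own:

    # on the node whose data is to be discarded (it becomes SyncTarget):
    # mark the local data Inconsistent and trigger a full resync
    drbdadm invalidate drbd0

    # raise the ping timeout to 5 s; the value is in tenths of a second
    # (default 5 = 0.5 s), set in the resource's "net" section ...
    #     net { ping-timeout 50; }
    # ... then re-apply the running configuration, no restart needed:
    drbdadm adjust drbd0

Going back down to 1 s is the same with "ping-timeout 10;" followed by
another "drbdadm adjust".)
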
> > > Currently I'm in the process of downgrading the kernel to see what
> > > happens, but as it's a production server this will take a little time
> > > (outage) :/
>
> Downgraded both nodes to kernel-2.6.32-504.23.4.el6.x86_64 but no
> change, as the sync just did one more turn of the loop :/
>
> > > I'm also fairly new to DRBD so please point out anything obvious I may
> > > be missing.
> > >
> > > thanks
> > >
> > > v
> >
> > --
> > : Lars Ellenberg
> > : http://www.LINBIT.com | Your Way to High Availability
> > : DRBD, Linux-HA and Pacemaker support and consulting
> >
> > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > __
> > please don't Cc me, but send to list -- I'm subscribed
> >
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
>
> --
> Regards
>
> Viktor Villafuerte
> Optus Internet Engineering
> t: +61 2 80825265
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user

--
Regards

Viktor Villafuerte
Optus Internet Engineering
t: +61 2 80825265