[DRBD-user] DRBD keeps doing full resync

Viktor Villafuerte viktor.villafuerte at optusnet.com.au
Wed Aug 19 06:35:26 CEST 2015

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Wed 19 Aug 2015 09:37:25, Viktor Villafuerte wrote:
> Hi Lars,
> 
> thanks for your answer. Please, see more details below
> 
> On Tue 18 Aug 2015 18:59:18, Lars Ellenberg wrote:
> > On Fri, Aug 14, 2015 at 05:00:48PM +1000, Viktor Villafuerte wrote:
> > > Hi all,
> > > 
> > > I've got a problem very similar to
> > > 
> > > http://lists.linbit.com/pipermail/drbd-user/2015-July/022243.html
> > > 
> > > 
> > > OS: Red Hat Enterprise Linux Server release 6.7 (Santiago)
> > > kernel: kernel-2.6.32-573.1.1.el6.x86_64
> > > DRBD: kmod-drbd84-8.4.6-1.el6.elrepo.x86_64
> > > 
> > > 
> > > Two-node cluster where, after upgrading to the latest kernel +
> > > Corosync + Pacemaker + DRBD, I've encountered problems.
> > 
> > Did you upgrade both nodes yet?
> 
> Yes, both were upgraded. When first failing Corosync + friends over
> onto the newly upgraded node (from the not-yet-upgraded node), only a
> partial sync happened and no problems were found. After upgrading the
> second node, a full resync was requested and it never recovered from
> that.
> 
> > 
> > > The secondary does a full resync and seems to finish it
> > > successfully, but then 'PingAck did not arrive in time' happens and
> > > the resync repeats. This keeps happening in a loop thereafter.
> > 
> > You could increase the "ping timeout" for now,
> > and see if that gets you over this odd issue.
> 
> I'll give this a go, thanks. I assume that will require a restart?

I've increased the ping-timeout value to 5 seconds, just before the
sync finished, and only on the SyncTarget side.
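
For the record, the change amounts to something like this (ping-timeout
is given in tenths of a second, so 5 seconds is a value of 50; drbd0 is
the resource name from the logs):

  # change the net option at runtime, on this node only (DRBD 8.4)
  drbdadm net-options --ping-timeout=50 drbd0

  # or, persistently: set it in the net section of the resource file
  # and run "drbdadm adjust drbd0":
  #
  #   net {
  #       ping-timeout 50;   # tenths of a second
  #   }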

After the sync finished, it didn't start again but ended up in the
state shown in the logs below.

(see more explanation after the logs)

SyncSource

Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: PingAck did not arrive in
time.
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown )
conn( SyncSource -> NetworkFailure ) 
Aug 19 12:29:27 dbpg02 crmd[6415]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 19 12:29:27 dbpg02 pengine[6414]:   notice: update_validation:
pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Aug 19 12:29:27 dbpg02 pengine[6414]:   notice: update_validation:
Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with
upgrade-1.3.xsl
Aug 19 12:29:27 dbpg02 pengine[6414]:   notice: update_validation:
Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Aug 19 12:29:27 dbpg02 pengine[6414]:   notice: unpack_config: On loss
of CCM Quorum: Ignore
Aug 19 12:29:27 dbpg02 pengine[6414]:   notice: process_pe_message:
Calculated Transition 57: /var/lib/pacemaker/pengine/pe-input-77.bz2
Aug 19 12:29:27 dbpg02 crmd[6415]:   notice: run_graph: Transition 57
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-77.bz2): Complete
Aug 19 12:29:27 dbpg02 crmd[6415]:   notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: ack_receiver terminated
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: conn( NetworkFailure ->
Unconnected ) 
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: receiver terminated
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: Restarting receiver thread
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: receiver (re)started
Aug 19 12:29:28 dbpg02 kernel: drbd drbd0: conn( Unconnected ->
WFConnection ) 
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: conn( WFConnection ->
BrokenPipe ) 
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: conn( BrokenPipe ->
Unconnected ) 
Aug 19 12:29:32 dbpg02 kernel: drbd drbd0: conn( Unconnected ->
WFConnection ) 
Aug 19 12:29:35 dbpg02 kernel: device eth1 entered promiscuous mode
Aug 19 12:29:36 dbpg02 crmd[6415]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 19 12:29:36 dbpg02 pengine[6414]:   notice: update_validation:
pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Aug 19 12:29:36 dbpg02 pengine[6414]:   notice: update_validation:
Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with
upgrade-1.3.xsl
Aug 19 12:29:36 dbpg02 pengine[6414]:   notice: update_validation:
Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Aug 19 12:29:36 dbpg02 pengine[6414]:   notice: unpack_config: On loss
of CCM Quorum: Ignore
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: conn( WFConnection ->
BrokenPipe ) 
Aug 19 12:29:36 dbpg02 pengine[6414]:   notice: process_pe_message:
Calculated Transition 58: /var/lib/pacemaker/pengine/pe-input-78.bz2
Aug 19 12:29:36 dbpg02 crmd[6415]:   notice: run_graph: Transition 58
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-78.bz2): Complete
Aug 19 12:29:36 dbpg02 crmd[6415]:   notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: conn( BrokenPipe ->
Unconnected ) 
Aug 19 12:29:37 dbpg02 kernel: drbd drbd0: conn( Unconnected ->
WFConnection ) 
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: conn( WFConnection ->
BrokenPipe ) 
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: conn( BrokenPipe ->
Unconnected ) 
Aug 19 12:29:50 dbpg02 kernel: drbd drbd0: conn( Unconnected ->
WFConnection ) 
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: conn( WFConnection ->
BrokenPipe ) 
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:54 dbpg02 pacemakerd[8284]:   notice: crm_add_logfile:
Additional logging available in /var/log/corosync/corosync.log
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: conn( BrokenPipe ->
Unconnected ) 
Aug 19 12:29:55 dbpg02 kernel: drbd drbd0: conn( Unconnected ->
WFConnection ) 
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: conn( WFConnection ->
BrokenPipe ) 
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: conn( BrokenPipe ->
Unconnected ) 
Aug 19 12:30:00 dbpg02 kernel: drbd drbd0: conn( Unconnected ->
WFConnection ) 
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: conn( WFConnection ->
BrokenPipe ) 
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: conn( BrokenPipe ->
Unconnected ) 
Aug 19 12:30:05 dbpg02 kernel: drbd drbd0: conn( Unconnected ->
WFConnection ) 
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: conn( WFConnection ->
BrokenPipe ) 
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: conn( BrokenPipe ->
Unconnected ) 
Aug 19 12:30:10 dbpg02 kernel: drbd drbd0: conn( Unconnected ->
WFConnection ) 
Aug 19 12:30:21 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:30:22 dbpg02 kernel: drbd drbd0: conn( WFConnection ->
BrokenPipe ) 
Aug 19 12:30:22 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:30:22 dbpg02 kernel: drbd drbd0: Connection closed
..
.. repeat sock_recvmsg..
..




SyncTarget

Aug 19 12:29:27 dbpg01 kernel: block drbd0: Resync done (total 24958
sec; paused 0 sec; 32264 K/sec)
Aug 19 12:29:27 dbpg01 kernel: block drbd0: updated UUIDs
BFF7A490315C8E1C:0000000000000000:08956592516D4BB2:08946592516D4BB3
Aug 19 12:29:27 dbpg01 kernel: block drbd0: conn( SyncTarget ->
Connected ) disk( Inconsistent -> UpToDate ) 
Aug 19 12:29:27 dbpg01 kernel: block drbd0: helper command:
/sbin/drbdadm after-resync-target minor-0
Aug 19 12:29:27 dbpg01 crm-unfence-peer.sh[5847]: invoked for drbd0
Aug 19 12:29:27 dbpg01 kernel: block drbd0: helper command:
/sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Aug 19 12:29:27 dbpg01 kernel: drbd drbd0: sock was shut down by peer
Aug 19 12:29:27 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown )
conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Aug 19 12:29:27 dbpg01 kernel: drbd drbd0: short read (expected size 16)
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: meta connection shut down by
peer.
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: ack_receiver terminated
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: Connection closed
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: conn( BrokenPipe ->
Unconnected ) 
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: receiver terminated
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: Restarting receiver thread
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: receiver (re)started
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: conn( Unconnected ->
WFConnection ) 
Aug 19 12:29:32 dbpg01 kernel: device eth1 entered promiscuous mode
Aug 19 12:29:36 dbpg01 attrd[4345]:   notice: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd0 (1000)
Aug 19 12:29:36 dbpg01 attrd[4345]:   notice: attrd_perform_update: Sent
update 14: master-drbd0=1000
Aug 19 12:29:38 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:29:38 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:29:56 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:01 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:06 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:06 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:11 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:29 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:29 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:35 dbpg01 crmd[6239]:   notice: crm_add_logfile: Additional
logging available in /var/log/pacemaker.log
Aug 19 12:30:47 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:47 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:04 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:09 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:09 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:12 dbpg01 kernel: device eth1 left promiscuous mode
Aug 19 12:31:30 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:35 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:35 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:49 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:32:06 dbpg01 kernel: drbd drbd0: initial packet S crossed
..
.. repeat initial packet S crossed
..



However, the sync didn't start again from 0%, which was a (good)
change. I thought the initial packets were crossing but nothing was
happening afterwards, so maybe the increased timeout was too long. So I
reset it to 1 second, after which the secondary caught up, did a
partial sync of whatever was oos (about 5 seconds), and now both nodes
are UpToDate :)
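
The reset was the same mechanism in reverse, roughly:

  # back down to 1 second, i.e. a value of 10 (tenths of a second)
  drbdadm net-options --ping-timeout=10 drbd0

  # then watch the partial resync catch up and the final state
  cat /proc/drbd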



Would you have any possible explanation for this? I did read the man
pages about timeouts etc., but I wonder why, sometimes, when the
connection is broken after the sync, it does a full resync, while this
time it did only a partial sync? The connection was broken just the
same (according to the logs).
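
From my reading of the man pages, the choice between a full and a
bitmap-based partial resync is driven by the data generation UUIDs
(like the ones logged above) and the on-disk bitmap; if it helps, they
can be inspected on each node with:

  # print the current data generation identifiers
  drbdadm get-gi drbd0

  # the same, with a textual explanation of each value
  drbdadm show-gi drbd0

but I can't see why they would lead to a full resync in one case and a
partial one in the other.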


> 
> > 
> > > Please note that I could not observe any network problems, nor any
> > > problems with the resync or any other services, except at the very
> > > end of the resync.
> > > 
> > > I've tried rebooting the secondary (target) node, invalidating the
> > > data and forcing a full resync by hand, and waiting through about 4
> > > rounds of the loop :) But to no avail.
> > > 
> > > Currently I'm in the process of downgrading the kernel to see what
> > > happens, but as it's a production server this will take a little
> > > time (outage) :/
> 
> Downgraded both nodes to kernel-2.6.32-504.23.4.el6.x86_64, but no
> change: the sync just did one more turn of the loop :/
> 
> > > 
> > > I'm also fairly new to DRBD so please point out anything obvious I may
> > > be missing.
> > > 
> > > thanks
> > > 
> > > v
> > 
> > -- 
> > : Lars Ellenberg
> > : http://www.LINBIT.com | Your Way to High Availability
> > : DRBD, Linux-HA  and  Pacemaker support and consulting
> > 
> > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > __
> > please don't Cc me, but send to list   --   I'm subscribed
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
> 
> -- 
> Regards
> 
> Viktor Villafuerte
> Optus Internet Engineering
> t: +61 2 80825265
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user

-- 
Regards

Viktor Villafuerte
Optus Internet Engineering
t: +61 2 80825265


