[DRBD-user] DRBD keeps doing full resync

Viktor Villafuerte viktor.villafuerte at optusnet.com.au
Fri Aug 14 09:00:48 CEST 2015

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi all,

I've got very similar problem to

http://lists.linbit.com/pipermail/drbd-user/2015-July/022243.html

that.


OS: Red Hat Enterprise Linux Server release 6.7 (Santiago)
kernel: kernel-2.6.32-573.1.1.el6.x86_64
DRBD: kmod-drbd84-8.4.6-1.el6.elrepo.x86_64


Two node cluster where after upgrade to the lastest kernel + Corosync
Pacemaker DRBD I've encoutered problems.

Secondary does a full resync, seems to successfully finish it but then
'PingAck did not arrive in time' happens and the resync repeats. This
keeps happening in a loop thereafter.

Please note that I could not observe any network problems, no problems
with the resync or any other services, except the very end of the
resync.

I've tried reboot the secondary (target) node, invalidating the data
and forcing full resync by hand, wating about 4 rounds of the loop :)
But to no avail.

Currently I'm in process of downgrading the kernel to see what happens
but as it's a production server this will take little time (outage) :/

I'm also fairly new to DRBD so please point out anything obvious I may
be missing.

thanks

v


Logs:

primary node (SyncSource)

Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: PingAck did not arrive in time.
Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure ) 
Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: Connection closed
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected ) 
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) 
Aug 12 15:22:47 dbpg02 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101
Aug 12 15:22:47 dbpg02 kernel: drbd drbd0: Agreed to support TRIM on protocol level
Aug 12 15:22:48 dbpg02 kernel: drbd drbd0: conn( WFConnection -> WFReportParams ) 
Aug 12 15:22:48 dbpg02 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [3465])
Aug 12 15:22:48 dbpg02 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:48 dbpg02 kernel: block drbd0: self 453310593995C7C3:FAE7FAB0FA09B197:FAE6FAB0FA09B197:2B42A0D9A256A63A bits:152 flags:0
Aug 12 15:22:49 dbpg02 kernel: block drbd0: peer 453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0
Aug 12 15:22:49 dbpg02 kernel: block drbd0: was SyncSource, missed the resync finished event, corrected myself:
Aug 12 15:22:49 dbpg02 kernel: block drbd0: self 453310593995C7C3:0000000000000000:FAE7FAB0FA09B197:FAE6FAB0FA09B197 bits:162 flags:0
Aug 12 15:22:49 dbpg02 kernel: block drbd0: uuid_compare()=1 by rule 34
Aug 12 15:22:49 dbpg02 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Inconsistent -> Consistent ) 
Aug 12 15:22:49 dbpg02 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg02 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Aug 12 15:22:50 dbpg02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Aug 12 15:22:50 dbpg02 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) 
Aug 12 15:22:50 dbpg02 kernel: block drbd0: Began resync as SyncSource (will sync 672 KB [168 bits set]).
Aug 12 15:22:51 dbpg02 kernel: block drbd0: updated sync UUID 453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197:FAE6FAB0FA09B197
Aug 12 15:22:52 dbpg02 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 672 K/sec)
Aug 12 15:22:52 dbpg02 kernel: block drbd0: updated UUIDs 453310593995C7C3:0000000000000000:803193A96DCAEE9C:FAE7FAB0FA09B197
Aug 12 15:22:52 dbpg02 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
Aug 12 15:22:52 dbpg02 crmd[5746]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: sock was shut down by peer
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: short read (expected size 16)
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: meta connection shut down by peer.
Aug 12 15:22:53 dbpg02 kernel: block drbd0: new current UUID EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Connection closed
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected ) 
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: helper command: /sbin/drbdadm fence-peer drbd0
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 12 15:22:54 dbpg02 crm-fence-peer.sh[5817]: invoked for drbd0
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101
Aug 12 15:22:54 dbpg02 crmd[2719]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 12 15:22:54 dbpg02 crm-fence-peer.sh[5817]: INFO peer is reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-drbd0-ms-drbd0'
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: process_pe_message: Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-20.bz2
Aug 12 15:22:54 dbpg02 crmd[2719]:   notice: run_graph: Transition 8 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-20.bz2): Complete
Aug 12 15:22:54 dbpg02 crmd[2719]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: Agreed to support TRIM on protocol level
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: helper command: /sbin/drbdadm fence-peer drbd0 exit code 4 (0x400)
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: fence-peer helper returned 4 (peer was fenced)
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: pdsk( DUnknown -> Outdated ) 
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: conn( WFConnection -> WFReportParams ) 
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [3465])
Aug 12 15:22:55 dbpg02 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:55 dbpg02 kernel: block drbd0: self EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197 bits:32 flags:0
Aug 12 15:22:56 dbpg02 kernel: block drbd0: peer 803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0
Aug 12 15:22:56 dbpg02 kernel: block drbd0: uuid_compare()=2 by rule 80
Aug 12 15:22:56 dbpg02 kernel: block drbd0: Becoming sync source due to disk states.
Aug 12 15:22:56 dbpg02 kernel: block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
Aug 12 15:22:56 dbpg02 kernel: block drbd0: bitmap WRITE of 6144 pages took 218 jiffies
Aug 12 15:22:57 dbpg02 kernel: block drbd0: 768 GB (201320439 bits) marked out-of-sync by on disk bit-map.



secondary node (SyncTarget)


Aug 12 15:22:45 dbpg01 kernel: block drbd0: Resync done (total 8 sec; paused 0 sec; 13212 K/sec)
Aug 12 15:22:45 dbpg01 kernel: block drbd0: updated UUIDs 453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197
Aug 12 15:22:45 dbpg01 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
Aug 12 15:22:45 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Aug 12 15:22:45 dbpg01 crm-unfence-peer.sh[7419]: invoked for drbd0
Aug 12 15:22:45 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: sock was shut down by peer
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: short read (expected size 16)
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: meta connection shut down by peer.
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Connection closed
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected ) 
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) 
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: Agreed to support TRIM on protocol level
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: conn( WFConnection -> WFReportParams ) 
Aug 12 15:22:48 dbpg01 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [7326])
Aug 12 15:22:48 dbpg01 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:48 dbpg01 kernel: block drbd0: self 453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0
Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer 453310593995C7C3:FAE7FAB0FA09B197:FAE6FAB0FA09B197:2B42A0D9A256A63A bits:152 flags:0
Aug 12 15:22:49 dbpg01 kernel: block drbd0: was SyncTarget, peer missed the resync finished event, corrected peer:
Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer 453310593995C7C3:0000000000000000:FAE7FAB0FA09B197:FAE6FAB0FA09B197 bits:152 flags:0
Aug 12 15:22:49 dbpg01 kernel: block drbd0: uuid_compare()=-1 by rule 35
Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate ) 
Aug 12 15:22:49 dbpg01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Aug 12 15:22:51 dbpg01 kernel: block drbd0: updated sync uuid 803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197
Aug 12 15:22:51 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug 12 15:22:51 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug 12 15:22:51 dbpg01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent ) 
Aug 12 15:22:51 dbpg01 kernel: block drbd0: Began resync as SyncTarget (will sync 672 KB [168 bits set]).
Aug 12 15:22:52 dbpg01 kernel: drbd drbd0: PingAck did not arrive in time.
Aug 12 15:22:52 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Connection closed
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected ) 
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) 
Aug 12 15:22:54 dbpg01 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101
Aug 12 15:22:54 dbpg01 kernel: drbd drbd0: Agreed to support TRIM on protocol level
Aug 12 15:22:54 dbpg01 crmd[6750]:   notice: handle_request: Current ping state: S_NOT_DC
Aug 12 15:22:55 dbpg01 kernel: drbd drbd0: conn( WFConnection -> WFReportParams ) 
Aug 12 15:22:55 dbpg01 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [7326])
Aug 12 15:22:55 dbpg01 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:55 dbpg01 kernel: block drbd0: self 803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0
Aug 12 15:22:56 dbpg01 kernel: block drbd0: peer EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197 bits:32 flags:0
Aug 12 15:22:56 dbpg01 kernel: block drbd0: uuid_compare()=-2 by rule 60
Aug 12 15:22:56 dbpg01 kernel: block drbd0: Becoming sync target due to disk states.
Aug 12 15:22:56 dbpg01 kernel: block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
Aug 12 15:22:56 dbpg01 kernel: block drbd0: bitmap WRITE of 6144 pages took 298 jiffies
Aug 12 15:22:56 dbpg01 kernel: block drbd0: 768 GB (201320439 bits) marked out-of-sync by on disk bit-map.
Aug 12 15:22:57 dbpg01 attrd[6748]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd0 (<null>)
Aug 12 15:22:57 dbpg01 attrd[6748]:   notice: attrd_perform_update: Sent delete 19: node=dbpg01.sdp.syd.optusnet.com.au, attr=master-drbd0, id=<n/a>, set=(null), section=status
Aug 12 15:22:57 dbpg01 attrd[6748]:   notice: attrd_perform_update: Sent delete 21: node=dbpg01.sdp.syd.optusnet.com.au, attr=master-drbd0, id=<n/a>, set=(null), section=status
Aug 12 15:22:57 dbpg01 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Aug 12 15:22:57 dbpg01 crmd[7488]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 12 15:22:57 dbpg01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Aug 12 15:22:58 dbpg01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Aug 12 15:22:58 dbpg01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Aug 12 15:22:59 dbpg01 kernel: block drbd0: updated sync uuid 453410593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197
Aug 12 15:22:59 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug 12 15:22:59 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug 12 15:22:59 dbpg01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) 
Aug 12 15:22:59 dbpg01 kernel: block drbd0: Began resync as SyncTarget (will sync 805281756 KB [201320439 bits set]).





-- 
Regards

Viktor Villafuerte
Optus Internet Engineering
t: +61 2 80825265



More information about the drbd-user mailing list