Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all, I've got very similar problem to http://lists.linbit.com/pipermail/drbd-user/2015-July/022243.html that. OS: Red Hat Enterprise Linux Server release 6.7 (Santiago) kernel: kernel-2.6.32-573.1.1.el6.x86_64 DRBD: kmod-drbd84-8.4.6-1.el6.elrepo.x86_64 Two node cluster where after upgrade to the lastest kernel + Corosync Pacemaker DRBD I've encoutered problems. Secondary does a full resync, seems to successfully finish it but then 'PingAck did not arrive in time' happens and the resync repeats. This keeps happening in a loop thereafter. Please note that I could not observe any network problems, no problems with the resync or any other services, except the very end of the resync. I've tried reboot the secondary (target) node, invalidating the data and forcing full resync by hand, wating about 4 rounds of the loop :) But to no avail. Currently I'm in process of downgrading the kernel to see what happens but as it's a production server this will take little time (outage) :/ I'm also fairly new to DRBD so please point out anything obvious I may be missing. thanks v Logs: primary node (SyncSource) Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: PingAck did not arrive in time. Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure ) Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: ack_receiver terminated Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0 Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: Connection closed Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected ) Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: receiver terminated Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: Restarting receiver thread Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: receiver (re)started Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) Aug 12 15:22:47 dbpg02 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101 Aug 12 15:22:47 dbpg02 kernel: drbd drbd0: Agreed to support TRIM on protocol level Aug 12 15:22:48 dbpg02 kernel: drbd drbd0: conn( WFConnection -> WFReportParams ) Aug 12 15:22:48 dbpg02 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [3465]) Aug 12 15:22:48 dbpg02 kernel: block drbd0: drbd_sync_handshake: Aug 12 15:22:48 dbpg02 kernel: block drbd0: self 453310593995C7C3:FAE7FAB0FA09B197:FAE6FAB0FA09B197:2B42A0D9A256A63A bits:152 flags:0 Aug 12 15:22:49 dbpg02 kernel: block drbd0: peer 453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0 Aug 12 15:22:49 dbpg02 kernel: block drbd0: was SyncSource, missed the resync finished event, corrected myself: Aug 12 15:22:49 dbpg02 kernel: block drbd0: self 453310593995C7C3:0000000000000000:FAE7FAB0FA09B197:FAE6FAB0FA09B197 bits:162 flags:0 Aug 12 15:22:49 dbpg02 kernel: block drbd0: uuid_compare()=1 by rule 34 Aug 12 15:22:49 dbpg02 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Inconsistent -> Consistent ) Aug 12 15:22:49 dbpg02 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0% Aug 12 15:22:50 dbpg02 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0% Aug 12 15:22:50 dbpg02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 Aug 12 15:22:50 dbpg02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) Aug 12 15:22:50 dbpg02 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) Aug 12 15:22:50 dbpg02 kernel: block drbd0: Began resync as SyncSource (will sync 672 KB [168 bits set]). Aug 12 15:22:51 dbpg02 kernel: block drbd0: updated sync UUID 453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197:FAE6FAB0FA09B197 Aug 12 15:22:52 dbpg02 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 672 K/sec) Aug 12 15:22:52 dbpg02 kernel: block drbd0: updated UUIDs 453310593995C7C3:0000000000000000:803193A96DCAEE9C:FAE7FAB0FA09B197 Aug 12 15:22:52 dbpg02 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) Aug 12 15:22:52 dbpg02 crmd[5746]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: sock was shut down by peer Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: short read (expected size 16) Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: meta connection shut down by peer. Aug 12 15:22:53 dbpg02 kernel: block drbd0: new current UUID EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197 Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: ack_receiver terminated Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0 Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Connection closed Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected ) Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: helper command: /sbin/drbdadm fence-peer drbd0 Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: receiver terminated Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Restarting receiver thread Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: receiver (re)started Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) Aug 12 15:22:54 dbpg02 crm-fence-peer.sh[5817]: invoked for drbd0 Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101 Aug 12 15:22:54 dbpg02 crmd[2719]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Aug 12 15:22:54 dbpg02 crm-fence-peer.sh[5817]: INFO peer is reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-drbd0-ms-drbd0' Aug 12 15:22:54 dbpg02 pengine[2718]: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3 Aug 12 15:22:54 dbpg02 pengine[2718]: notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl Aug 12 15:22:54 dbpg02 pengine[2718]: notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0 Aug 12 15:22:54 dbpg02 pengine[2718]: notice: unpack_config: On loss of CCM Quorum: Ignore Aug 12 15:22:54 dbpg02 pengine[2718]: notice: process_pe_message: Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-20.bz2 Aug 12 15:22:54 dbpg02 crmd[2719]: notice: run_graph: Transition 8 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-20.bz2): Complete Aug 12 15:22:54 dbpg02 crmd[2719]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: Agreed to support TRIM on protocol level Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: helper command: /sbin/drbdadm fence-peer drbd0 exit code 4 (0x400) Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: fence-peer helper returned 4 (peer was fenced) Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: pdsk( DUnknown -> Outdated ) Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: conn( WFConnection -> WFReportParams ) Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [3465]) Aug 12 15:22:55 dbpg02 kernel: block drbd0: drbd_sync_handshake: Aug 12 15:22:55 dbpg02 kernel: block drbd0: self EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197 bits:32 flags:0 Aug 12 15:22:56 dbpg02 kernel: block drbd0: peer 803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0 Aug 12 15:22:56 dbpg02 kernel: block drbd0: uuid_compare()=2 by rule 80 Aug 12 15:22:56 dbpg02 kernel: block drbd0: Becoming sync source due to disk states. Aug 12 15:22:56 dbpg02 kernel: block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake. Aug 12 15:22:56 dbpg02 kernel: block drbd0: bitmap WRITE of 6144 pages took 218 jiffies Aug 12 15:22:57 dbpg02 kernel: block drbd0: 768 GB (201320439 bits) marked out-of-sync by on disk bit-map. secondary node (SyncTarget) Aug 12 15:22:45 dbpg01 kernel: block drbd0: Resync done (total 8 sec; paused 0 sec; 13212 K/sec) Aug 12 15:22:45 dbpg01 kernel: block drbd0: updated UUIDs 453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 Aug 12 15:22:45 dbpg01 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) Aug 12 15:22:45 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 Aug 12 15:22:45 dbpg01 crm-unfence-peer.sh[7419]: invoked for drbd0 Aug 12 15:22:45 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0) Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: sock was shut down by peer Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: short read (expected size 16) Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: meta connection shut down by peer. Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: ack_receiver terminated Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0 Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Connection closed Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected ) Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: receiver terminated Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Restarting receiver thread Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: receiver (re)started Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101 Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: Agreed to support TRIM on protocol level Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: conn( WFConnection -> WFReportParams ) Aug 12 15:22:48 dbpg01 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [7326]) Aug 12 15:22:48 dbpg01 kernel: block drbd0: drbd_sync_handshake: Aug 12 15:22:48 dbpg01 kernel: block drbd0: self 453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0 Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer 453310593995C7C3:FAE7FAB0FA09B197:FAE6FAB0FA09B197:2B42A0D9A256A63A bits:152 flags:0 Aug 12 15:22:49 dbpg01 kernel: block drbd0: was SyncTarget, peer missed the resync finished event, corrected peer: Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer 453310593995C7C3:0000000000000000:FAE7FAB0FA09B197:FAE6FAB0FA09B197 bits:152 flags:0 Aug 12 15:22:49 dbpg01 kernel: block drbd0: uuid_compare()=-1 by rule 35 Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate ) Aug 12 15:22:49 dbpg01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0% Aug 12 15:22:50 dbpg01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0% Aug 12 15:22:50 dbpg01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) Aug 12 15:22:51 dbpg01 kernel: block drbd0: updated sync uuid 803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 Aug 12 15:22:51 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 Aug 12 15:22:51 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) Aug 12 15:22:51 dbpg01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent ) Aug 12 15:22:51 dbpg01 kernel: block drbd0: Began resync as SyncTarget (will sync 672 KB [168 bits set]). Aug 12 15:22:52 dbpg01 kernel: drbd drbd0: PingAck did not arrive in time. Aug 12 15:22:52 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: ack_receiver terminated Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0 Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Connection closed Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected ) Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: receiver terminated Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Restarting receiver thread Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: receiver (re)started Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) Aug 12 15:22:54 dbpg01 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101 Aug 12 15:22:54 dbpg01 kernel: drbd drbd0: Agreed to support TRIM on protocol level Aug 12 15:22:54 dbpg01 crmd[6750]: notice: handle_request: Current ping state: S_NOT_DC Aug 12 15:22:55 dbpg01 kernel: drbd drbd0: conn( WFConnection -> WFReportParams ) Aug 12 15:22:55 dbpg01 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [7326]) Aug 12 15:22:55 dbpg01 kernel: block drbd0: drbd_sync_handshake: Aug 12 15:22:55 dbpg01 kernel: block drbd0: self 803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0 Aug 12 15:22:56 dbpg01 kernel: block drbd0: peer EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197 bits:32 flags:0 Aug 12 15:22:56 dbpg01 kernel: block drbd0: uuid_compare()=-2 by rule 60 Aug 12 15:22:56 dbpg01 kernel: block drbd0: Becoming sync target due to disk states. Aug 12 15:22:56 dbpg01 kernel: block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake. Aug 12 15:22:56 dbpg01 kernel: block drbd0: bitmap WRITE of 6144 pages took 298 jiffies Aug 12 15:22:56 dbpg01 kernel: block drbd0: 768 GB (201320439 bits) marked out-of-sync by on disk bit-map. Aug 12 15:22:57 dbpg01 attrd[6748]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd0 (<null>) Aug 12 15:22:57 dbpg01 attrd[6748]: notice: attrd_perform_update: Sent delete 19: node=dbpg01.sdp.syd.optusnet.com.au, attr=master-drbd0, id=<n/a>, set=(null), section=status Aug 12 15:22:57 dbpg01 attrd[6748]: notice: attrd_perform_update: Sent delete 21: node=dbpg01.sdp.syd.optusnet.com.au, attr=master-drbd0, id=<n/a>, set=(null), section=status Aug 12 15:22:57 dbpg01 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Aug 12 15:22:57 dbpg01 crmd[7488]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log Aug 12 15:22:57 dbpg01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% Aug 12 15:22:58 dbpg01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% Aug 12 15:22:58 dbpg01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) Aug 12 15:22:59 dbpg01 kernel: block drbd0: updated sync uuid 453410593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 Aug 12 15:22:59 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 Aug 12 15:22:59 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) Aug 12 15:22:59 dbpg01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) Aug 12 15:22:59 dbpg01 kernel: block drbd0: Began resync as SyncTarget (will sync 805281756 KB [201320439 bits set]). -- Regards Viktor Villafuerte Optus Internet Engineering t: +61 2 80825265