Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, Similar cases occurred twice on the sync replication mode. 1. Env Ver: 9.0.1-1, CentOS 7 Conf: C mode, 2 nodes Test step: 1) node1: primary 2) node2: status ok 3) node1: mount /dev/drbd1 /mnt 4) node1: cp 1Gfile /mnt/1Gfile 5) node1: disconnect by disable network icon on vmware, during copy 6) node2: "Connecting" status 7) node1: VM freezed 2. First test log => result was "vm hang" --------------------------------------------------- Feb 24 09:50:12 drbd9-01 kernel: drbd r0: State change 2572574344: primary_nodes=0, weak_nodes=0 Feb 24 09:50:12 drbd9-01 kernel: drbd r0: Committing cluster-wide state change 2572574344 (1ms) Feb 24 09:50:12 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1: peer does not support WRITE_SAME Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from node 1 found equal UUID (250294391A486048) for nodes 2 3 Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=2 Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=3 Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: drbd_sync_handshake: Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: self 91370AC369EB7EBC:0000000000000000:250294391A486048:64E351232E3278AC bits:0 flags:0 Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: peer 91370AC369EB7EBC:0000000000000000:250294391A486048:0000000000000000 bits:0 flags:120 Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: uuid_compare()=0 by rule 38 Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( DUnknown -> UpToDate ) repl( Off -> Established ) Feb 24 09:50:13 drbd9-01 kernel: drbd r0: Preparing cluster-wide state change 3751236876 (0->-1 3/1) Feb 24 09:50:13 drbd9-01 kernel: drbd r0: State change 3751236876: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC Feb 24 09:50:13 drbd9-01 kernel: drbd r0: Committing cluster-wide state change 3751236876 (0ms) 1) primary Feb 24 09:50:13 drbd9-01 kernel: drbd r0: role( Secondary -> Primary ) Feb 24 09:50:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected Feb 24 09:50:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from node 1 found equal UUID (250294391A486048) for nodes 2 3 Feb 24 09:50:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=2 Feb 24 09:50:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=3 Feb 24 09:50:21 drbd9-01 kernel: loop: module loaded 2) mount Feb 24 09:50:29 drbd9-01 kernel: EXT4-fs (drbd1): mounting ext3 file system using the ext4 subsystem Feb 24 09:50:30 drbd9-01 kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts: (null) 3) copying 4) Disable Vmware network icon Feb 24 09:50:47 drbd9-01 chronyd[894]: Selected source 211.233.84.186 Feb 24 09:50:48 drbd9-01 kernel: vmxnet3 0000:04:00.0 eth1: NIC Link is Down 5) drbd detects network down Feb 24 09:50:59 drbd9-01 kernel: drbd r0 drbd9-02: PingAck did not arrive in time. Feb 24 09:50:59 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown ) Feb 24 09:50:59 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) Feb 24 09:50:59 drbd9-01 kernel: drbd r0 drbd9-02: ack_receiver terminated Feb 24 09:50:59 drbd9-01 kernel: drbd r0 drbd9-02: Terminating ack_recv thread Feb 24 09:51:04 drbd9-01 kernel: drbd r0/0 drbd1: new current UUID: 4FBB71EFAFB55C63 weak: FFFFFFFFFFFFFFFE Feb 24 09:51:06 drbd9-01 gnome-session: ** (gnome-system-monitor:2888): WARNING **: SELinux was found but is not enabled. 5) Vm freezed!!!! 6) Reset Feb 24 09:58:46 drbd9-01 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="901" x-info="http://www.rsyslog.com"] start Feb 24 09:58:43 drbd9-01 journal: Runtime journal is using 6.1M (max 48.9M, leaving 73.4M of free 483.6M, current limit 48.9M). Feb 24 09:58:43 drbd9-01 kernel: Initializing cgroup subsys cpuset Feb 24 09:58:43 drbd9-01 kernel: Initializing cgroup subsys cpu Feb 24 09:58:43 drbd9-01 kernel: Initializing cgroup subsys cpuacct Feb 24 09:58:43 drbd9-01 kernel: Linux version 3.10.0-229.7.2.el7.x86_64 ( builder at kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Tue Jun 23 22:06:11 UTC 2015 3. Second test log => result was "vm hang"--------------------------------------------------- Feb 24 10:14:16 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: helper command: /sbin/drbdadm before-resync-source exit code 0 (0x0) Feb 24 10:14:16 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( Outdated -> Inconsistent ) repl( WFBitMapS -> SyncSource ) Feb 24 10:14:16 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: Began resync as SyncSource (will sync 520192 KB [130048 bits set]). Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: updated UUIDs 4FBB71EFAFB55C62:0000000000000000:91370AC369EB7EBC:250294391A486048 Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from node 1 found equal UUID (91370AC369EB7EBC) for nodes 2 3 Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=2 Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=3 Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: Resync done (total 36 sec; paused 0 sec; 14448 K/sec) Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( Inconsistent -> UpToDate ) repl( SyncSource -> Established ) Feb 24 10:15:08 drbd9-01 kernel: drbd r0: Preparing cluster-wide state change 1752829256 (0->-1 3/1) Feb 24 10:15:08 drbd9-01 kernel: drbd r0: State change 1752829256: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC Feb 24 10:15:08 drbd9-01 kernel: drbd r0: Committing cluster-wide state change 1752829256 (1ms) 1) primary Feb 24 10:15:08 drbd9-01 kernel: drbd r0: role( Secondary -> Primary ) Feb 24 10:15:08 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected Feb 24 10:15:08 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from node 1 found equal UUID (91370AC369EB7EBC) for nodes 2 3 Feb 24 10:15:08 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=2 Feb 24 10:15:08 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=3 2) mountFeb 24 10:15:15 drbd9-01 kernel: EXT4-fs (drbd1): mounting ext3 file system using the ext4 subsystem Feb 24 10:15:15 drbd9-01 kernel: EXT4-fs (drbd1): recovery complete Feb 24 10:15:15 drbd9-01 kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts: (null) 3) Disconnect during copy Feb 24 10:15:36 drbd9-01 kernel: vmxnet3 0000:04:00.0 eth1: NIC Link is Down Feb 24 10:15:46 drbd9-01 kernel: drbd r0 drbd9-02: PingAck did not arrive in time. Feb 24 10:15:46 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown ) Feb 24 10:15:46 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) Feb 24 10:15:46 drbd9-01 kernel: drbd r0 drbd9-02: ack_receiver terminated Feb 24 10:15:46 drbd9-01 kernel: drbd r0 drbd9-02: Terminating ack_recv thread Feb 24 10:15:47 drbd9-01 kernel: drbd r0/0 drbd1: new current UUID: 12FA4E882B5DB5FD weak: FFFFFFFFFFFFFFFE 4) Vm freezed!!! 5) Reset Feb 24 10:18:42 drbd9-01 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="897" x-info="http://www.rsyslog.com"] start Feb 24 10:18:39 drbd9-01 journal: Runtime journal is using 6.1M (max 48.9M, leaving 73.4M of free 483.6M, current limit 48.9M). Feb 24 10:18:39 drbd9-01 kernel: Initializing cgroup subsys cpuset Feb 24 10:18:39 drbd9-01 kernel: Initializing cgroup subsys cpu Feb 24 10:18:39 drbd9-01 kernel: Initializing cgroup subsys cpuacct Feb 24 10:18:39 drbd9-01 kernel: Linux version 3.10.0-229.7.2.el7.x86_64 ( builder at kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Tue Jun 23 22:06:11 UTC 2015 4. Third test log => result was "Normal" !!!!---------------------------------------------------- Feb 24 10:25:51 drbd9-01 kernel: drbd r0: Preparing cluster-wide state change 4126890108 (0->-1 3/1) Feb 24 10:25:51 drbd9-01 kernel: drbd r0: State change 4126890108: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC Feb 24 10:25:51 drbd9-01 kernel: drbd r0: Committing cluster-wide state change 4126890108 (0ms) 1) Primary Feb 24 10:25:51 drbd9-01 kernel: drbd r0: role( Secondary -> Primary ) Feb 24 10:25:51 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected Feb 24 10:25:51 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from node 1 found equal UUID (4FBB71EFAFB55C62) for nodes 2 3 Feb 24 10:25:51 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=2 Feb 24 10:25:51 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=3 2) mount Feb 24 10:26:09 drbd9-01 kernel: EXT4-fs (drbd1): mounting ext3 file system using the ext4 subsystem Feb 24 10:26:09 drbd9-01 kernel: EXT4-fs (drbd1): recovery complete Feb 24 10:26:09 drbd9-01 kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts: (null) 3) Disconnect during copy Feb 24 10:26:33 drbd9-01 kernel: vmxnet3 0000:04:00.0 eth1: NIC Link is Down Feb 24 10:26:43 drbd9-01 kernel: drbd r0 drbd9-02: PingAck did not arrive in time. Feb 24 10:26:43 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown ) Feb 24 10:26:43 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) Feb 24 10:26:43 drbd9-01 kernel: drbd r0 drbd9-02: ack_receiver terminated Feb 24 10:26:43 drbd9-01 kernel: drbd r0 drbd9-02: Terminating ack_recv thread Feb 24 10:26:57 drbd9-01 kernel: drbd r0/0 drbd1: new current UUID: C4E750298906134D weak: FFFFFFFFFFFFFFFE 4) It's normal case. No Vm freeze!! Feb 24 10:26:57 drbd9-01 kernel: drbd r0 drbd9-02: Connection closed Feb 24 10:26:57 drbd9-01 kernel: drbd r0 drbd9-02: conn( NetworkFailure -> Unconnected ) Feb 24 10:26:57 drbd9-01 kernel: drbd r0 drbd9-02: Restarting receiver thread Feb 24 10:26:57 drbd9-01 kernel: drbd r0 drbd9-02: conn( Unconnected -> Connecting ) Feb 24 10:30:02 drbd9-01 systemd: Starting Session 3 of user root. Feb 24 10:30:02 drbd9-01 systemd: Started Session 3 of user root. 5) Connct by enabling VM network icon Feb 24 10:31:12 drbd9-01 kernel: vmxnet3 0000:04:00.0 eth1: NIC Link is Up 10000 Mbps Feb 24 10:31:13 drbd9-01 kernel: drbd r0 drbd9-02: Handshake successful: Agreed network protocol version 111 Feb 24 10:31:13 drbd9-01 kernel: drbd r0 drbd9-02: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME. Feb 24 10:31:13 drbd9-01 kernel: drbd r0 drbd9-02: Starting ack_recv thread (from drbd_r_r0 [2835]) Feb 24 10:31:13 drbd9-01 kernel: drbd r0: Preparing cluster-wide state change 478233498 (0->1 499/145) Feb 24 10:31:13 drbd9-01 kernel: drbd r0: State change 478233498: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC Feb 24 10:31:13 drbd9-01 kernel: drbd r0: Committing cluster-wide state change 478233498 (1ms) Feb 24 10:31:13 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1: peer does not support WRITE_SAME Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from node 1 found equal UUID (4FBB71EFAFB55C62) for nodes 2 3 Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=2 Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have 64E351232E3278AC for node_id=3 Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: drbd_sync_handshake: Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: self C4E750298906134D:12FA4E882B5DB5FC:4FBB71EFAFB55C62:91370AC369EB7EBC bits:204521 flags:0 Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: peer 12FA4E882B5DB5FC:0000000000000000:4FBB71EFAFB55C62:0000000000000000 bits:0 flags:100 Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: uuid_compare()=2 by rule 70 Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( DUnknown -> Consistent ) repl( Off -> WFBitMapS ) Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 124(1), total 124; compression: 100.0% Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( Consistent -> Outdated ) Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 124(1), total 124; compression: 100.0% Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: helper command: /sbin/drbdadm before-resync-source Feb 24 10:31:13 drbd9-01 drbdadm[3274]: Don't know which config file belongs to resource r0, trying default ones... Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: helper command: /sbin/drbdadm before-resync-source exit code 0 (0x0) Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( Outdated -> Inconsistent ) repl( WFBitMapS -> SyncSource ) Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: Began resync as SyncSource (will sync 818084 KB [204521 bits set]). 2016-02-12 16:02 GMT+09:00 김재헌 <jhkim at mantech.co.kr>: > Hi, > > This problem is reproduced on your latest 9.0.1-1 version also. > > 1. Test version > - CentOS-7 Linux 3.10.0-229.7.2.el7.x86_64 > - [root at drbd9-02 etc]# cat /proc/drbd > version: 9.0.1-1 (api:2/proto:86-111) > GIT-hash: f57acfc22d29a95697e683fb6bbacd9a1ad4348e build by > root at drbd9-02, 2016-02-09 09:46:20 > Transports (api:14): tcp (1.0.0) > > > 2. Test scenario > 1) setup primary-seconary > 2) set Async replication mode(sndbuf=1M, congestion-fill=256KB) > 3) copy big file(4GB) > 4) disconnect by disable vmware's network icon while copying > 5) Result: > - VM hang sometimes!!!! > > > 3. Log (/var/log/messages) analysys: > 1) promote to primary > [snip] > 4. Question > - Please see above 10), 11) points, I think It's deadlock. > - the worker and receiver threads may be deadlocked by race condition for > data socket. > > Thanks > > > Please carefully compare the first and second hang logs with the third normal log. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20160224/0d927b29/attachment.htm>