[DRBD-user] [CASE-14] primary node hang by VM-net-disconnect during big file copy

김재헌 jhkim at mantech.co.kr
Wed Feb 24 09:43:00 CET 2016

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

Similar cases occurred twice on the sync replication mode.


1. Env

Ver: 9.0.1-1, CentOS 7

Conf: C mode, 2 nodes

Test step:

1) node1: primary
2) node2: status ok
3) node1: mount /dev/drbd1 /mnt
4) node1: cp 1Gfile /mnt/1Gfile
5) node1: disconnect by disable network icon on vmware, during copy
6) node2: "Connecting" status
7) node1: VM freezed



2. First test log => result was "vm hang"
---------------------------------------------------

Feb 24 09:50:12 drbd9-01 kernel: drbd r0: State change 2572574344:
primary_nodes=0, weak_nodes=0
Feb 24 09:50:12 drbd9-01 kernel: drbd r0: Committing cluster-wide state
change 2572574344 (1ms)
Feb 24 09:50:12 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connecting ->
Connected ) peer( Unknown -> Secondary )
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1: peer does not support
WRITE_SAME
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from
node 1 found equal UUID (250294391A486048) for nodes 2 3
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=2
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=3
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02:
drbd_sync_handshake:
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: self
91370AC369EB7EBC:0000000000000000:250294391A486048:64E351232E3278AC bits:0
flags:0
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: peer
91370AC369EB7EBC:0000000000000000:250294391A486048:0000000000000000 bits:0
flags:120
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: uuid_compare()=0
by rule 38
Feb 24 09:50:12 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( DUnknown
-> UpToDate ) repl( Off -> Established )
Feb 24 09:50:13 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 3751236876 (0->-1 3/1)
Feb 24 09:50:13 drbd9-01 kernel: drbd r0: State change 3751236876:
primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Feb 24 09:50:13 drbd9-01 kernel: drbd r0: Committing cluster-wide state
change 3751236876 (0ms)

1) primary

Feb 24 09:50:13 drbd9-01 kernel: drbd r0: role( Secondary -> Primary )
Feb 24 09:50:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected
Feb 24 09:50:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from
node 1 found equal UUID (250294391A486048) for nodes 2 3
Feb 24 09:50:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=2
Feb 24 09:50:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=3
Feb 24 09:50:21 drbd9-01 kernel: loop: module loaded

2) mount

Feb 24 09:50:29 drbd9-01 kernel: EXT4-fs (drbd1): mounting ext3 file system
using the ext4 subsystem
Feb 24 09:50:30 drbd9-01 kernel: EXT4-fs (drbd1): mounted filesystem with
ordered data mode. Opts: (null)

3) copying

4) Disable Vmware network icon

Feb 24 09:50:47 drbd9-01 chronyd[894]: Selected source 211.233.84.186
Feb 24 09:50:48 drbd9-01 kernel: vmxnet3 0000:04:00.0 eth1: NIC Link is Down

5) drbd detects network down

Feb 24 09:50:59 drbd9-01 kernel: drbd r0 drbd9-02: PingAck did not arrive
in time.
Feb 24 09:50:59 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connected ->
NetworkFailure ) peer( Secondary -> Unknown )
Feb 24 09:50:59 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( UpToDate
-> DUnknown ) repl( Established -> Off )
Feb 24 09:50:59 drbd9-01 kernel: drbd r0 drbd9-02: ack_receiver terminated
Feb 24 09:50:59 drbd9-01 kernel: drbd r0 drbd9-02: Terminating ack_recv
thread

Feb 24 09:51:04 drbd9-01 kernel: drbd r0/0 drbd1: new current UUID:
4FBB71EFAFB55C63 weak: FFFFFFFFFFFFFFFE
Feb 24 09:51:06 drbd9-01 gnome-session: ** (gnome-system-monitor:2888):
WARNING **: SELinux was found but is not enabled.

5) Vm freezed!!!!

6) Reset

Feb 24 09:58:46 drbd9-01 rsyslogd: [origin software="rsyslogd"
swVersion="7.4.7" x-pid="901" x-info="http://www.rsyslog.com"] start
Feb 24 09:58:43 drbd9-01 journal: Runtime journal is using 6.1M (max 48.9M,
leaving 73.4M of free 483.6M, current limit 48.9M).
Feb 24 09:58:43 drbd9-01 kernel: Initializing cgroup subsys cpuset
Feb 24 09:58:43 drbd9-01 kernel: Initializing cgroup subsys cpu
Feb 24 09:58:43 drbd9-01 kernel: Initializing cgroup subsys cpuacct
Feb 24 09:58:43 drbd9-01 kernel: Linux version 3.10.0-229.7.2.el7.x86_64 (
builder at kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat
4.8.3-9) (GCC) ) #1 SMP Tue Jun 23 22:06:11 UTC 2015





3.  Second test log => result was "vm
hang"---------------------------------------------------

Feb 24 10:14:16 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: helper command:
/sbin/drbdadm before-resync-source exit code 0 (0x0)
Feb 24 10:14:16 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( Outdated
-> Inconsistent ) repl( WFBitMapS -> SyncSource )
Feb 24 10:14:16 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: Began resync as
SyncSource (will sync 520192 KB [130048 bits set]).
Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: updated UUIDs
4FBB71EFAFB55C62:0000000000000000:91370AC369EB7EBC:250294391A486048
Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected
Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from
node 1 found equal UUID (91370AC369EB7EBC) for nodes 2 3
Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=2
Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=3
Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: Resync done
(total 36 sec; paused 0 sec; 14448 K/sec)
Feb 24 10:14:53 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk(
Inconsistent -> UpToDate ) repl( SyncSource -> Established )
Feb 24 10:15:08 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 1752829256 (0->-1 3/1)
Feb 24 10:15:08 drbd9-01 kernel: drbd r0: State change 1752829256:
primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Feb 24 10:15:08 drbd9-01 kernel: drbd r0: Committing cluster-wide state
change 1752829256 (1ms)

1) primary

Feb 24 10:15:08 drbd9-01 kernel: drbd r0: role( Secondary -> Primary )
Feb 24 10:15:08 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected
Feb 24 10:15:08 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from
node 1 found equal UUID (91370AC369EB7EBC) for nodes 2 3
Feb 24 10:15:08 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=2
Feb 24 10:15:08 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=3

2) mountFeb 24 10:15:15 drbd9-01 kernel: EXT4-fs (drbd1): mounting ext3
file system using the ext4 subsystem
Feb 24 10:15:15 drbd9-01 kernel: EXT4-fs (drbd1): recovery complete
Feb 24 10:15:15 drbd9-01 kernel: EXT4-fs (drbd1): mounted filesystem with
ordered data mode. Opts: (null)

3) Disconnect during copy

Feb 24 10:15:36 drbd9-01 kernel: vmxnet3 0000:04:00.0 eth1: NIC Link is Down
Feb 24 10:15:46 drbd9-01 kernel: drbd r0 drbd9-02: PingAck did not arrive
in time.
Feb 24 10:15:46 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connected ->
NetworkFailure ) peer( Secondary -> Unknown )
Feb 24 10:15:46 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( UpToDate
-> DUnknown ) repl( Established -> Off )
Feb 24 10:15:46 drbd9-01 kernel: drbd r0 drbd9-02: ack_receiver terminated
Feb 24 10:15:46 drbd9-01 kernel: drbd r0 drbd9-02: Terminating ack_recv
thread
Feb 24 10:15:47 drbd9-01 kernel: drbd r0/0 drbd1: new current UUID:
12FA4E882B5DB5FD weak: FFFFFFFFFFFFFFFE

4) Vm freezed!!!

5) Reset

Feb 24 10:18:42 drbd9-01 rsyslogd: [origin software="rsyslogd"
swVersion="7.4.7" x-pid="897" x-info="http://www.rsyslog.com"] start
Feb 24 10:18:39 drbd9-01 journal: Runtime journal is using 6.1M (max 48.9M,
leaving 73.4M of free 483.6M, current limit 48.9M).
Feb 24 10:18:39 drbd9-01 kernel: Initializing cgroup subsys cpuset
Feb 24 10:18:39 drbd9-01 kernel: Initializing cgroup subsys cpu
Feb 24 10:18:39 drbd9-01 kernel: Initializing cgroup subsys cpuacct
Feb 24 10:18:39 drbd9-01 kernel: Linux version 3.10.0-229.7.2.el7.x86_64 (
builder at kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat
4.8.3-9) (GCC) ) #1 SMP Tue Jun 23 22:06:11 UTC 2015





4.  Third test log  => result was "Normal"
!!!!----------------------------------------------------

Feb 24 10:25:51 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 4126890108 (0->-1 3/1)
Feb 24 10:25:51 drbd9-01 kernel: drbd r0: State change 4126890108:
primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Feb 24 10:25:51 drbd9-01 kernel: drbd r0: Committing cluster-wide state
change 4126890108 (0ms)

1) Primary

Feb 24 10:25:51 drbd9-01 kernel: drbd r0: role( Secondary -> Primary )
Feb 24 10:25:51 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected
Feb 24 10:25:51 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from
node 1 found equal UUID (4FBB71EFAFB55C62) for nodes 2 3
Feb 24 10:25:51 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=2
Feb 24 10:25:51 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=3

2) mount

Feb 24 10:26:09 drbd9-01 kernel: EXT4-fs (drbd1): mounting ext3 file system
using the ext4 subsystem
Feb 24 10:26:09 drbd9-01 kernel: EXT4-fs (drbd1): recovery complete
Feb 24 10:26:09 drbd9-01 kernel: EXT4-fs (drbd1): mounted filesystem with
ordered data mode. Opts: (null)

3) Disconnect during copy

Feb 24 10:26:33 drbd9-01 kernel: vmxnet3 0000:04:00.0 eth1: NIC Link is Down
Feb 24 10:26:43 drbd9-01 kernel: drbd r0 drbd9-02: PingAck did not arrive
in time.
Feb 24 10:26:43 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connected ->
NetworkFailure ) peer( Secondary -> Unknown )
Feb 24 10:26:43 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( UpToDate
-> DUnknown ) repl( Established -> Off )
Feb 24 10:26:43 drbd9-01 kernel: drbd r0 drbd9-02: ack_receiver terminated
Feb 24 10:26:43 drbd9-01 kernel: drbd r0 drbd9-02: Terminating ack_recv
thread
Feb 24 10:26:57 drbd9-01 kernel: drbd r0/0 drbd1: new current UUID:
C4E750298906134D weak: FFFFFFFFFFFFFFFE

4) It's normal case. No Vm freeze!!

Feb 24 10:26:57 drbd9-01 kernel: drbd r0 drbd9-02: Connection closed
Feb 24 10:26:57 drbd9-01 kernel: drbd r0 drbd9-02: conn( NetworkFailure ->
Unconnected )
Feb 24 10:26:57 drbd9-01 kernel: drbd r0 drbd9-02: Restarting receiver
thread
Feb 24 10:26:57 drbd9-01 kernel: drbd r0 drbd9-02: conn( Unconnected ->
Connecting )
Feb 24 10:30:02 drbd9-01 systemd: Starting Session 3 of user root.
Feb 24 10:30:02 drbd9-01 systemd: Started Session 3 of user root.

5) Connct by enabling VM network icon

Feb 24 10:31:12 drbd9-01 kernel: vmxnet3 0000:04:00.0 eth1: NIC Link is Up
10000 Mbps
Feb 24 10:31:13 drbd9-01 kernel: drbd r0 drbd9-02: Handshake successful:
Agreed network protocol version 111
Feb 24 10:31:13 drbd9-01 kernel: drbd r0 drbd9-02: Feature flags enabled on
protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Feb 24 10:31:13 drbd9-01 kernel: drbd r0 drbd9-02: Starting ack_recv thread
(from drbd_r_r0 [2835])
Feb 24 10:31:13 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 478233498 (0->1 499/145)
Feb 24 10:31:13 drbd9-01 kernel: drbd r0: State change 478233498:
primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Feb 24 10:31:13 drbd9-01 kernel: drbd r0: Committing cluster-wide state
change 478233498 (1ms)
Feb 24 10:31:13 drbd9-01 kernel: drbd r0 drbd9-02: conn( Connecting ->
Connected ) peer( Unknown -> Secondary )
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1: peer does not support
WRITE_SAME
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from
node 1 found equal UUID (4FBB71EFAFB55C62) for nodes 2 3
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=2
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
64E351232E3278AC for node_id=3
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02:
drbd_sync_handshake:
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: self
C4E750298906134D:12FA4E882B5DB5FC:4FBB71EFAFB55C62:91370AC369EB7EBC
bits:204521 flags:0
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: peer
12FA4E882B5DB5FC:0000000000000000:4FBB71EFAFB55C62:0000000000000000 bits:0
flags:100
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: uuid_compare()=2
by rule 70
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( DUnknown
-> Consistent ) repl( Off -> WFBitMapS )
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: send bitmap
stats [Bytes(packets)]: plain 0(0), RLE 124(1), total 124; compression:
100.0%
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( Consistent
-> Outdated )
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: receive bitmap
stats [Bytes(packets)]: plain 0(0), RLE 124(1), total 124; compression:
100.0%
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: helper command:
/sbin/drbdadm before-resync-source
Feb 24 10:31:13 drbd9-01 drbdadm[3274]: Don't know which config file
belongs to resource r0, trying default ones...
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: helper command:
/sbin/drbdadm before-resync-source exit code 0 (0x0)
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( Outdated
-> Inconsistent ) repl( WFBitMapS -> SyncSource )
Feb 24 10:31:13 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: Began resync as
SyncSource (will sync 818084 KB [204521 bits set]).


2016-02-12 16:02 GMT+09:00 김재헌 <jhkim at mantech.co.kr>:

> Hi,
>
> This problem is reproduced on your latest 9.0.1-1 version also.
>
> 1. Test version
>  - CentOS-7 Linux 3.10.0-229.7.2.el7.x86_64
> - [root at drbd9-02 etc]# cat /proc/drbd
>       version: 9.0.1-1 (api:2/proto:86-111)
>       GIT-hash: f57acfc22d29a95697e683fb6bbacd9a1ad4348e build by
> root at drbd9-02, 2016-02-09 09:46:20
>       Transports (api:14): tcp (1.0.0)
>
>
> 2. Test scenario
>   1) setup primary-seconary
>   2) set Async replication mode(sndbuf=1M, congestion-fill=256KB)
>   3) copy big file(4GB)
>   4) disconnect by disable vmware's network icon while copying
>   5) Result:
>         - VM hang sometimes!!!!
>
>
> 3. Log (/var/log/messages) analysys:
> 1) promote to primary
>


[snip]



> 4. Question
>  - Please see above 10), 11) points, I think It's deadlock.
>  - the worker and receiver threads may be deadlocked by race condition for
> data socket.
>
> Thanks
>
>
>



Please carefully compare the first and second hang logs with the third
normal log.

Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20160224/0d927b29/attachment.htm>


More information about the drbd-user mailing list