[Drbd-dev] [CASE-7] Sometimes 600 sec timeout occured, please check 2PC(2 Phase Commit) protocol

김재헌 jhkim at mantech.co.kr
Fri Feb 26 13:52:36 CET 2016


Hi,

Let me re-post my old mail to this newgroup because similar errors are
produced.


1. Old mail

subject: Can dradm-down terminate with 600 seconds timeout?

Dear Lars,
Problem:
   - On Linux, sometimes drbdadm-down-command be hanged as follows;
      "Command 'drbdsetup down r0' did not terminate within 600 seconds"

Env:
   - Linux version:  CentOS-7 Linux 3.10.0-229.7.2.el7.x86_64
   - DRBD Engine version: 9.0.0 (api:2/proto:86-110) GIT-hash:
e0e3bc833053a0ed6b4b593b32e40223a9373fb7  Transports (api:8): tcp (1.0.0)

Test Cluster:
   - 4 node Linux VMs, mesh network on Vmware

Test scenario:
   1) drbdadm-up 4 nodes all
   2) Promote one node to primary
   3) Pause all VMs
       ***Notes: Please use "Pause menu" when stops the node on Vmware.
"Suspend " menu some differs from "Pause".
   4) Resume all VMs
   5) Somtimes, more than one node output 2PC(2 Phase commit) messages
periodically.(See attached information)
   6) At this point, If we run drbdadm-down-command on these nodes, they
will be hanged occasionally with error message as follows;
       "Command 'drbdsetup down r0' did not terminate within 600 seconds"

Attached:

node1:

Nov 12 12:58:14 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 3110839547 (0->2 499/145)
Nov 12 12:58:14 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 3110839547 (30000ms) rv = -23
Nov 12 12:58:14 drbd9-01 kernel: drbd r0 drbd9-02: Rejecting concurrent
remote state change 1556788097 because of state change 1200995267
Nov 12 12:58:14 drbd9-01 kernel: drbd r0 drbd9-02: Ignoring P_TWOPC_ABORT
packet 1556788097.
Nov 12 12:58:44 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 1200995267 (0->3 499/145)
Nov 12 12:58:44 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 1200995267 (30001ms) rv = -23
Nov 12 12:58:44 drbd9-01 kernel: drbd r0 drbd9-02: Rejecting concurrent
remote state change 407232822 because of state change 2876480921
Nov 12 12:58:44 drbd9-01 kernel: drbd r0 drbd9-02: Ignoring P_TWOPC_ABORT
packet 407232822.
Nov 12 12:59:14 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 2876480921 (0->2 499/145)
Nov 12 12:59:14 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 2876480921 (30001ms) rv = -23
Nov 12 12:59:44 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 3839850480 (0->3 499/145)
Nov 12 12:59:44 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 3839850480 (30000ms) rv = -23
Nov 12 13:00:14 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 2477663420 (0->2 499/145)
Nov 12 13:00:14 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 2477663420 (30000ms) rv = -23
Nov 12 13:00:14 drbd9-01 kernel: drbd r0 drbd9-02: Rejecting concurrent
remote state change 849320515 because of state change 2698322727
Nov 12 13:00:14 drbd9-01 kernel: drbd r0 drbd9-02: Ignoring P_TWOPC_ABORT
packet 849320515.
Nov 12 13:00:44 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 2698322727 (0->3 499/145)
Nov 12 13:00:44 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 2698322727 (30000ms) rv = -23
Nov 12 13:00:44 drbd9-01 kernel: drbd r0 drbd9-02: Rejecting concurrent
remote state change 743585927 because of state change 2750548712
Nov 12 13:00:44 drbd9-01 kernel: drbd r0 drbd9-02: Ignoring P_TWOPC_ABORT
packet 743585927.
Nov 12 13:01:14 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 2750548712 (0->2 499/145)
Nov 12 13:01:14 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 2750548712 (30000ms) rv = -23
Nov 12 13:01:14 drbd9-01 kernel: drbd r0 drbd9-02: Rejecting concurrent
remote state change 1367793350 because of state change 2283776505
Nov 12 13:01:14 drbd9-01 kernel: drbd r0 drbd9-02: Ignoring P_TWOPC_ABORT
packet 1367793350.
Nov 12 13:01:44 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 2283776505 (0->3 499/145)
Nov 12 13:01:44 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 2283776505 (30000ms) rv = -23
Nov 12 13:01:44 drbd9-01 kernel: drbd r0 drbd9-02: Rejecting concurrent
remote state change 3401865695 because of state change 2283776505
Nov 12 13:01:44 drbd9-01 kernel: drbd r0 drbd9-02: Ignoring P_TWOPC_ABORT
packet 3401865695.
Nov 12 13:02:14 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 52162850 (0->2 499/145)
Nov 12 13:02:14 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 52162850 (30001ms) rv = -23
Nov 12 13:02:44 drbd9-01 kernel: drbd r0: Preparing cluster-wide state
change 452962539 (0->3 499/145)
Nov 12 13:02:44 drbd9-01 kernel: drbd r0: Aborting cluster-wide state
change 452962539 (30000ms) rv = -23


node3:

Nov 12 13:00:14 drbd9-03 kernel: drbd r0 drbd9-01: Ignoring P_TWOPC_ABORT
packet 2477663420.
Nov 12 13:00:14 drbd9-03 kernel: drbd r0 drbd9-04: Ignoring P_TWOPC_ABORT
packet 849320515.
Nov 12 13:00:22 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:00:33 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:00:40 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:00:44 drbd9-03 kernel: drbd r0 drbd9-04: Ignoring P_TWOPC_ABORT
packet 2698322727.
Nov 12 13:00:44 drbd9-03 kernel: drbd r0 drbd9-04: Ignoring P_TWOPC_ABORT
packet 743585927.
Nov 12 13:00:48 drbd9-03 kernel: drbd_r_r0 D ffff88003d613680 0 36203 2
0x00000084
Nov 12 13:00:52 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:01:03 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:01:14 drbd9-03 kernel: drbd r0 drbd9-01: Ignoring P_TWOPC_ABORT
packet 2750548712.
Nov 12 13:01:14 drbd9-03 kernel: drbd r0 drbd9-04: Ignoring P_TWOPC_ABORT
packet 1367793350.
Nov 12 13:01:15 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:01:26 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:01:36 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:01:44 drbd9-03 kernel: drbd r0 drbd9-04: Ignoring P_TWOPC_ABORT
packet 2283776505.
Nov 12 13:01:44 drbd9-03 kernel: drbd r0 drbd9-04: Ignoring P_TWOPC_ABORT
packet 3401865695.
Nov 12 13:01:47 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:01:56 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:02:08 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:02:14 drbd9-03 kernel: drbd r0 drbd9-01: Ignoring P_TWOPC_ABORT
packet 52162850.
Nov 12 13:02:19 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:02:29 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:02:40 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:02:44 drbd9-03 kernel: drbd r0 drbd9-04: Ignoring P_TWOPC_ABORT
packet 452962539.
Nov 12 13:02:48 drbd9-03 kernel: drbd_r_r0 D ffff88003d613680 0 36203 2
0x00000084
Nov 12 13:02:52 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:03:03 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:03:12 drbd9-03 kernel: drbd r0 tcp:drbd9-05: Closing unexpected
connection from 100.100.10.21 to port 7000
Nov 12 13:03:14 drbd9-03 kernel: drbd r0 drbd9-01: Ignoring P_TWOPC_ABORT
packet 4060565678.
Nov 12 13:03:14 drbd9-03 kernel: drbd r0 drbd9-04: Ignoring P_TWOPC_ABORT
packet 3441716583.



2. New question

Ver:
  - drbd-9.0.1-1
  - CentOS 7

Env:
  - configure 4 nodes but use 3 nodes only

Test:
 1) setup node1, 2, 3
 2) node1: primary
 3) node1: mount /dev/drbd1  /mnt
 4) node1 : invalidate-remote
 5) wait for fullsync for all nodes
 6) test network disconnect/connect for all nodes by disable/enable VM's
network icon
 7) node2: down -> up
 8) node3: down is pending until 600 sec timeout!!!

Notes:
  - There is no file copy test.

Logs:
 - node3 log has important messages as follows;

Feb 26 18:36:28 drbd9-03 kernel: INFO: task drbd_r_r0:3114 blocked for more
than 120 seconds.
Feb 26 18:36:28 drbd9-03 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 18:36:28 drbd9-03 kernel: drbd_r_r0       D ffff88003d613680     0
 3114      2 0x00000080
Feb 26 18:36:28 drbd9-03 kernel: ffff8800369a3b70 0000000000000046
ffff88001c8ab8e0 ffff8800369a3fd8
Feb 26 18:36:28 drbd9-03 kernel: ffff8800369a3fd8 ffff8800369a3fd8
ffff88001c8ab8e0 ffff880039f5e960
Feb 26 18:36:28 drbd9-03 kernel: 7fffffffffffffff ffff88001c8ab8e0
0000000000000002 0000000000000000
Feb 26 18:36:28 drbd9-03 kernel: Call Trace:
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff816096a9>] schedule+0x29/0x70
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff816075f9>]
schedule_timeout+0x209/0x2d0
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff810a9512>] ?
default_wake_function+0x12/0x20
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff8109825b>] ?
autoremove_wake_function+0x2b/0x40
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff810b1c7c>] ?
update_curr+0xcc/0x150
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff81608b72>]
__down_common+0xd2/0x14a
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff810b2a6e>] ?
dequeue_task_fair+0x40e/0x620
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff81608c07>] __down+0x1d/0x1f
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff8109d201>] down+0x41/0x50
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffffa04b23c4>]
state_change_lock+0x54/0x60 [drbd]
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffffa04b7700>] ?
__change_peer_disk_states+0x70/0x70 [drbd]
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffffa04b5cc6>]
change_cluster_wide_state+0x46/0xe90 [drbd]
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff8108007e>] ?
try_to_del_timer_sync+0x5e/0x90
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff81080102>] ?
del_timer_sync+0x52/0x60
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff8160756d>] ?
schedule_timeout+0x17d/0x2d0
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff8107ed90>] ?
__internal_add_timer+0x130/0x130
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffffa04b7336>]
change_cstate+0x86/0xc0 [drbd]
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffffa0486de1>]
drbd_receiver+0x151/0x630 [drbd]
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffffa0492521>]
drbd_thread_setup+0x61/0x120 [drbd]
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffffa04924c0>] ?
w_complete+0x20/0x20 [drbd]
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff8109726f>] kthread+0xcf/0xe0
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff810971a0>] ?
kthread_create_on_node+0x140/0x140
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff81614158>]
ret_from_fork+0x58/0x90
Feb 26 18:36:28 drbd9-03 kernel: [<ffffffff810971a0>] ?
kthread_create_on_node+0x140/0x140


Let me attach full log files here:
 - node1: http://pastebin.com/742jzerd
 - node2: http://pastebin.com/dFaaFkeK
 - node3: http://pastebin.com/irk4ykgN

Please see my comments interspersed in above log files.
I think it's very critical problem.

Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-dev/attachments/20160226/4cf04a1c/attachment.htm>


More information about the drbd-dev mailing list