[DRBD-user] drbdsetup occasionally hangs

Andreas Hofmeister andi at collax.com
Wed Sep 19 16:24:01 CEST 2012



Hi,

We run drbd 8.3.11 or 8.3.13 in dual-primary mode on a pacemaker cluster
running kernel 3.0.41.

The cluster just does its work, nothing is stopped or started, and then,
after a week or so, drbdsetup locks up (with the kernel trace below)
when we try to administer a resource.

Usually only one of several resources is affected, sometimes two.

We have seen several such traces, with different drbdsetup sub-commands, 
all ending at the same place.

Could this be the problem addressed by

http://git.drbd.org/gitweb.cgi?p=drbd-8.4.git;a=commit;h=c586d79e49135831dbe0629e2d9a7b3739c615ef
Fix comparison of is_valid_transition()'s return code

in 8.4 ?

We backported that patch to 8.3.13, which is currently running on a
test machine, but since the problem only appears now and then, it is hard
to say whether it is gone.

Does anyone have an idea how one gets into this state?

TIA
   Andi

---8<---
<03>2012 Sep 10 17:17:01 cnode1 [609601.848157] INFO: task
drbdsetup:5670 blocked for more than 120 seconds.
<03>2012 Sep 10 17:17:01 cnode1 [609601.848160] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
<06>2012 Sep 10 17:17:01 cnode1 [609601.848162] drbdsetup       D
0000000000000000     0  5670      1 0x00000004
<04>2012 Sep 10 17:17:01 cnode1 [609601.848166]  ffff88000f423968
0000000000000082 ffff88003ffd7c00 ffff88000f423fd8
<04>2012 Sep 10 17:17:01 cnode1 [609601.848170]  ffff88000f423838
0000000000012340 0000000000012340 0000000000012340
<04>2012 Sep 10 17:17:01 cnode1 [609601.848173]  0000000000012340
0000000000012340 ffff88000ee045c0 0000000000012340
<04>2012 Sep 10 17:17:01 cnode1 [609601.848177] Call Trace:
<04>2012 Sep 10 17:17:01 cnode1 [609601.852026]  [<ffffffff8103960c>] ?
spin_unlock_irqrestore+0x9/0xb
<04>2012 Sep 10 17:17:01 cnode1 [609601.880322]  [<ffffffff810416d6>] ?
__wake_up+0x43/0x50
<04>2012 Sep 10 17:17:01 cnode1 [609601.884293]  [<ffffffffa03a745f>] ?
put_ldev+0x85/0x8a [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.916943]  [<ffffffffa03a7be5>] ?
is_valid_state+0x73/0x1e3 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.916953]  [<ffffffffa03a698f>] ?
spin_unlock_irqrestore+0x9/0xb [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.916969]  [<ffffffffa03a7e22>] ?
_req_st_cond+0xcd/0xdf [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919191]  [<ffffffff815ad428>]
schedule+0x44/0x46
<04>2012 Sep 10 17:17:01 cnode1 [609601.919208]  [<ffffffffa03aadb2>]
drbd_req_state+0x1b6/0x2df [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919224]  [<ffffffff8105f3cc>] ?
wake_up_bit+0x23/0x23
<04>2012 Sep 10 17:17:01 cnode1 [609601.919241]  [<ffffffffa03aaefd>]
_drbd_request_state+0x22/0xb2 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919252]  [<ffffffff810bbcb6>] ?
zone_statistics+0x77/0x7e
<04>2012 Sep 10 17:17:01 cnode1 [609601.920356]  [<ffffffff810ab9da>] ?
set_page_refcounted+0xd/0x1a
<04>2012 Sep 10 17:17:01 cnode1 [609601.920401]  [<ffffffff810ade41>] ?
get_page_from_freelist+0x58b/0x64d
<04>2012 Sep 10 17:17:01 cnode1 [609601.920446]  [<ffffffffa03b1895>]
drbd_nl_invalidate+0xa1/0x133 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.920462]  [<ffffffffa03b1c1d>]
drbd_connector_callback+0x104/0x195 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.924378]  [<ffffffffa026446a>]
cn_rx_skb+0xb0/0xd2 [cn]
<04>2012 Sep 10 17:17:01 cnode1 [609601.936338]  [<ffffffff81514227>]
netlink_unicast+0xe2/0x14b
<04>2012 Sep 10 17:17:01 cnode1 [609601.963889]  [<ffffffff814f1ea6>] ?
memcpy_fromiovec+0x42/0x73
<04>2012 Sep 10 17:17:01 cnode1 [609601.963897]  [<ffffffff8151545c>]
netlink_sendmsg+0x230/0x250
<04>2012 Sep 10 17:17:01 cnode1 [609601.963909]  [<ffffffff814e71c1>]
__sock_sendmsg_nosec+0x55/0x62
<04>2012 Sep 10 17:17:01 cnode1 [609601.963913]  [<ffffffff814e8456>]
__sock_sendmsg+0x39/0x42
<04>2012 Sep 10 17:17:01 cnode1 [609601.963917]  [<ffffffff814e8c2e>]
sock_sendmsg+0xa3/0xbc
<04>2012 Sep 10 17:17:01 cnode1 [609601.963920]  [<ffffffff810c1137>] ?
handle_pte_fault+0x2ef/0x843
<04>2012 Sep 10 17:17:01 cnode1 [609601.963924]  [<ffffffff810c1e32>] ?
handle_mm_fault+0x19c/0x1b3
<04>2012 Sep 10 17:17:01 cnode1 [609601.963936]  [<ffffffff810eedbe>] ?
fget_light+0x2f/0x7c
<04>2012 Sep 10 17:17:01 cnode1 [609601.963939]  [<ffffffff814e8c71>] ?
sockfd_lookup_light+0x1b/0x53
<04>2012 Sep 10 17:17:01 cnode1 [609601.963943]  [<ffffffff814e91b6>]
sys_sendto+0xfa/0x11f
<04>2012 Sep 10 17:17:01 cnode1 [609601.963946]  [<ffffffff8151355b>] ?
netlink_table_ungrab+0x2e/0x30
<04>2012 Sep 10 17:17:01 cnode1 [609601.963949]  [<ffffffff81515609>] ?
netlink_bind+0x106/0x11c
<04>2012 Sep 10 17:17:01 cnode1 [609601.963952]  [<ffffffff814e9c33>] ?
sys_bind+0x7d/0x91
<04>2012 Sep 10 17:17:01 cnode1 [609601.963955]  [<ffffffff810ebd14>] ?
spin_lock+0x9/0xb
<04>2012 Sep 10 17:17:01 cnode1 [609601.963960]  [<ffffffff815b3a92>]
system_call_fastpath+0x16/0x1b
--->8---
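
For comparing several such traces, it may help to strip them down to the
frames the kernel considers reliable. Entries marked with "?" are addresses
found on the stack that the unwinder could not verify as part of the call
chain, so they can be skipped. The following is only a minimal sketch,
assuming the syslog format and mail line-wrapping shown above; the helper
names join_wrapped and reliable_frames are made up for illustration, and
SAMPLE is a short excerpt of the trace above.

```python
import re

# One stack frame: "[<addr>] [? ]symbol+0xoff/0xsize [module]"
FRAME = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)


def join_wrapped(text):
    """Re-join records that the mail client wrapped onto a second line.

    Assumption: every syslog record starts with a "<NN>" priority tag.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if re.match(r"<\d+>", line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def reliable_frames(text):
    """Return (symbol, module) for each frame that is not marked with '?'."""
    frames = []
    for record in join_wrapped(text):
        m = FRAME.search(record)
        if m and not m.group("unreliable"):
            frames.append((m.group("sym"), m.group("mod")))
    return frames


# Excerpt from the trace above, including the mail line-wrapping.
SAMPLE = """\
<04>2012 Sep 10 17:17:01 cnode1 [609601.916969]  [<ffffffffa03a7e22>] ?
_req_st_cond+0xcd/0xdf [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919191]  [<ffffffff815ad428>]
schedule+0x44/0x46
<04>2012 Sep 10 17:17:01 cnode1 [609601.919208]  [<ffffffffa03aadb2>]
drbd_req_state+0x1b6/0x2df [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919224]  [<ffffffff8105f3cc>] ?
wake_up_bit+0x23/0x23
<04>2012 Sep 10 17:17:01 cnode1 [609601.919241]  [<ffffffffa03aaefd>]
_drbd_request_state+0x22/0xb2 [drbd]
"""

if __name__ == "__main__":
    for sym, mod in reliable_frames(SAMPLE):
        print(sym, f"[{mod}]" if mod else "")
```

Run over the full trace above, the frames left after filtering start with
schedule, drbd_req_state, _drbd_request_state, drbd_nl_invalidate and
drbd_connector_callback, which makes it easier to check whether traces from
different drbdsetup sub-commands really end at the same place.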




