Hi,

We have drbd 8.3.11 or 8.3.13 running dual-primary on a pacemaker cluster
with kernel 3.0.41. The cluster just does its work, nothing is stopped or
started, and then, after a week or so, drbdsetup locks up (with the kernel
trace below) when we want to administer a resource. Usually only one of
several resources is affected, sometimes two.

We have seen several such traces, with different drbdsetup sub-commands,
all ending at the same place.

Could this be the problem addressed by
http://git.drbd.org/gitweb.cgi?p=drbd-8.4.git;a=commit;h=c586d79e49135831dbe0629e2d9a7b3739c615ef
Fix comparison of is_valid_transition()'s return code
in 8.4?

We fiddled that patch into an 8.3.13, which is currently running on a test
machine, but since the problem only appears now and then, it is hard to say
whether the problem is gone.

Does anyone have an idea how to get into this state?

TIA
Andi

---8<---
<03>2012 Sep 10 17:17:01 cnode1 [609601.848157] INFO: task drbdsetup:5670 blocked for more than 120 seconds.
<03>2012 Sep 10 17:17:01 cnode1 [609601.848160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<06>2012 Sep 10 17:17:01 cnode1 [609601.848162] drbdsetup D 0000000000000000 0 5670 1 0x00000004
<04>2012 Sep 10 17:17:01 cnode1 [609601.848166] ffff88000f423968 0000000000000082 ffff88003ffd7c00 ffff88000f423fd8
<04>2012 Sep 10 17:17:01 cnode1 [609601.848170] ffff88000f423838 0000000000012340 0000000000012340 0000000000012340
<04>2012 Sep 10 17:17:01 cnode1 [609601.848173] 0000000000012340 0000000000012340 ffff88000ee045c0 0000000000012340
<04>2012 Sep 10 17:17:01 cnode1 [609601.848177] Call Trace:
<04>2012 Sep 10 17:17:01 cnode1 [609601.852026] [<ffffffff8103960c>] ? spin_unlock_irqrestore+0x9/0xb
<04>2012 Sep 10 17:17:01 cnode1 [609601.880322] [<ffffffff810416d6>] ? __wake_up+0x43/0x50
<04>2012 Sep 10 17:17:01 cnode1 [609601.884293] [<ffffffffa03a745f>] ? put_ldev+0x85/0x8a [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.916943] [<ffffffffa03a7be5>] ? is_valid_state+0x73/0x1e3 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.916953] [<ffffffffa03a698f>] ? spin_unlock_irqrestore+0x9/0xb [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.916969] [<ffffffffa03a7e22>] ? _req_st_cond+0xcd/0xdf [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919191] [<ffffffff815ad428>] schedule+0x44/0x46
<04>2012 Sep 10 17:17:01 cnode1 [609601.919208] [<ffffffffa03aadb2>] drbd_req_state+0x1b6/0x2df [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919224] [<ffffffff8105f3cc>] ? wake_up_bit+0x23/0x23
<04>2012 Sep 10 17:17:01 cnode1 [609601.919241] [<ffffffffa03aaefd>] _drbd_request_state+0x22/0xb2 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919252] [<ffffffff810bbcb6>] ? zone_statistics+0x77/0x7e
<04>2012 Sep 10 17:17:01 cnode1 [609601.920356] [<ffffffff810ab9da>] ? set_page_refcounted+0xd/0x1a
<04>2012 Sep 10 17:17:01 cnode1 [609601.920401] [<ffffffff810ade41>] ? get_page_from_freelist+0x58b/0x64d
<04>2012 Sep 10 17:17:01 cnode1 [609601.920446] [<ffffffffa03b1895>] drbd_nl_invalidate+0xa1/0x133 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.920462] [<ffffffffa03b1c1d>] drbd_connector_callback+0x104/0x195 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.924378] [<ffffffffa026446a>] cn_rx_skb+0xb0/0xd2 [cn]
<04>2012 Sep 10 17:17:01 cnode1 [609601.936338] [<ffffffff81514227>] netlink_unicast+0xe2/0x14b
<04>2012 Sep 10 17:17:01 cnode1 [609601.963889] [<ffffffff814f1ea6>] ? memcpy_fromiovec+0x42/0x73
<04>2012 Sep 10 17:17:01 cnode1 [609601.963897] [<ffffffff8151545c>] netlink_sendmsg+0x230/0x250
<04>2012 Sep 10 17:17:01 cnode1 [609601.963909] [<ffffffff814e71c1>] __sock_sendmsg_nosec+0x55/0x62
<04>2012 Sep 10 17:17:01 cnode1 [609601.963913] [<ffffffff814e8456>] __sock_sendmsg+0x39/0x42
<04>2012 Sep 10 17:17:01 cnode1 [609601.963917] [<ffffffff814e8c2e>] sock_sendmsg+0xa3/0xbc
<04>2012 Sep 10 17:17:01 cnode1 [609601.963920] [<ffffffff810c1137>] ? handle_pte_fault+0x2ef/0x843
<04>2012 Sep 10 17:17:01 cnode1 [609601.963924] [<ffffffff810c1e32>] ? handle_mm_fault+0x19c/0x1b3
<04>2012 Sep 10 17:17:01 cnode1 [609601.963936] [<ffffffff810eedbe>] ? fget_light+0x2f/0x7c
<04>2012 Sep 10 17:17:01 cnode1 [609601.963939] [<ffffffff814e8c71>] ? sockfd_lookup_light+0x1b/0x53
<04>2012 Sep 10 17:17:01 cnode1 [609601.963943] [<ffffffff814e91b6>] sys_sendto+0xfa/0x11f
<04>2012 Sep 10 17:17:01 cnode1 [609601.963946] [<ffffffff8151355b>] ? netlink_table_ungrab+0x2e/0x30
<04>2012 Sep 10 17:17:01 cnode1 [609601.963949] [<ffffffff81515609>] ? netlink_bind+0x106/0x11c
<04>2012 Sep 10 17:17:01 cnode1 [609601.963952] [<ffffffff814e9c33>] ? sys_bind+0x7d/0x91
<04>2012 Sep 10 17:17:01 cnode1 [609601.963955] [<ffffffff810ebd14>] ? spin_lock+0x9/0xb
<04>2012 Sep 10 17:17:01 cnode1 [609601.963960] [<ffffffff815b3a92>] system_call_fastpath+0x16/0x1b
--->8---