[DRBD-user] Stalled online verification - drbd0_worker sock_sendmsg time expired

nemanja.dundjerski at vlatacom.com
Wed Apr 20 12:48:01 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello list,

I am experiencing an online verification stall on our two-node DRBD (RHCS)
cluster, running RHEL 6.0 (2.6.32-71.el6.x86_64) and DRBD 8.3.10
(api:88).

The SATA disks are configured as a 1 TB RAID 1 array, with one LVM logical
volume used as the backing device for the DRBD resource, which is formatted
as ext4. The servers are connected directly (bypassing the switch) using two
GigE network cards on each server, bonded in load-balancing (round-robin)
mode.
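
For reference, the bonding on each node is set up roughly like this (the
interface names and address below are illustrative, not copied verbatim from
the actual configuration):

----------%<----------
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=10.2.30.141
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=balance-rr miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth1 (one of the two slave NICs)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
----------%<----------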

When I run the online verification, I see a fair number of "drbd0_worker
sock_sendmsg time expired" messages. My first guess is that the drbd0_worker
process got stuck while trying to send some blocks?
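
For completeness, the verification is started with the standard drbdadm
invocation for our resource:

----------%<----------
# drbdadm verify shared_partition
----------%<----------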

I first noticed this the previous morning, when the output of
/proc/drbd was:

----------%<----------
# cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root at node01,
2011-02-23 17:38:20
 0: cs:VerifyS ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:295912 nr:94208 dw:390256 dr:490915949 al:38 bm:19 lo:1 pe:61424
ua:2048 ap:6 ep:1 wo:b oos:0
        [=========>..........] verified: 51.9% (444468/923796)M
        finish: 434147:56:06 speed: 0 (124) want: 112,640 K/sec (stalled)
----------%<----------

I had to restart the servers to resolve the problem...

I ran the verification once again, but the problem is the same. I can't even
do an 'ls' on the DRBD partition (the command just hangs). So today, when I
ran the verification again, the primary node logged the following:

----------%<----------
Apr 20 11:03:01 node02 kernel: block drbd0: conn( Connected -> VerifyS )
Apr 20 11:03:01 node02 kernel: block drbd0: Starting Online Verify from
sector 0
Apr 20 11:04:20 node02 kernel: block drbd0: [drbd0_worker/1851]
sock_sendmsg time expired, ko = 4294967295
Apr 20 11:04:26 node02 kernel: block drbd0: [drbd0_worker/1851]
sock_sendmsg time expired, ko = 4294967294
.
.
.
Apr 20 11:07:50 node02 kernel: INFO: task jbd2/drbd0-8:3571 blocked for
more than 120 seconds.
Apr 20 11:07:50 node02 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 20 11:07:50 node02 kernel: jbd2/drbd0-8  D ffff88013fc24300     0 
3571      2 0x00000080
Apr 20 11:07:50 node02 kernel: ffff88013957fd10 0000000000000046
0000000000000000 0000000000000008
Apr 20 11:07:50 node02 kernel: 0000000000016980 0000000000016980
ffff88002be10d80 00000001001f1d4f
Apr 20 11:07:50 node02 kernel: ffff880128149a98 ffff88013957ffd8
0000000000010518 ffff880128149a98
Apr 20 11:07:50 node02 kernel: Call Trace:
Apr 20 11:07:50 node02 kernel: [<ffffffffa00d88f0>]
jbd2_journal_commit_transaction+0x1c0/0x14e0 [jbd2]
Apr 20 11:07:50 node02 kernel: [<ffffffff810116e0>] ? __switch_to+0xd0/0x320
Apr 20 11:07:50 node02 kernel: [<ffffffff8107d5ac>] ?
lock_timer_base+0x3c/0x70
Apr 20 11:07:50 node02 kernel: [<ffffffff81091ca0>] ?
autoremove_wake_function+0x0/0x40
Apr 20 11:07:50 node02 kernel: [<ffffffffa00df0b8>] kjournald2+0xb8/0x220
[jbd2]
Apr 20 11:07:50 node02 kernel: [<ffffffff81091ca0>] ?
autoremove_wake_function+0x0/0x40
Apr 20 11:07:50 node02 kernel: [<ffffffffa00df000>] ? kjournald2+0x0/0x220
[jbd2]
Apr 20 11:07:50 node02 kernel: [<ffffffff81091936>] kthread+0x96/0xa0
Apr 20 11:07:50 node02 kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Apr 20 11:07:50 node02 kernel: [<ffffffff810918a0>] ? kthread+0x0/0xa0
Apr 20 11:07:50 node02 kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20
Apr 20 11:07:50 node02 kernel: INFO: task touch:4825 blocked for more than
120 seconds.
Apr 20 11:07:50 node02 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 20 11:07:50 node02 kernel: touch         D ffffffff8119dcf0     0 
4825   4770 0x00000080
Apr 20 11:07:50 node02 kernel: ffff88012623fa08 0000000000000086
ffff88012623f988 00000000a0192deb
Apr 20 11:07:50 node02 kernel: ffff880136b91400 ffff8801362e94e0
ffff8801369b1c00 0000000000000002
Apr 20 11:07:50 node02 kernel: ffff880138a045f8 ffff88012623ffd8
0000000000010518 ffff880138a045f8
Apr 20 11:07:50 node02 kernel: Call Trace:
Apr 20 11:07:50 node02 kernel: [<ffffffff8109b9a9>] ? ktime_get_ts+0xa9/0xe0
Apr 20 11:07:50 node02 kernel: [<ffffffff8119dcf0>] ? sync_buffer+0x0/0x50
Apr 20 11:07:50 node02 kernel: [<ffffffff814c8a23>] io_schedule+0x73/0xc0
Apr 20 11:07:50 node02 kernel: [<ffffffff8119dd30>] sync_buffer+0x40/0x50
Apr 20 11:07:50 node02 kernel: [<ffffffff814c914a>]
__wait_on_bit_lock+0x5a/0xc0
Apr 20 11:07:50 node02 kernel: [<ffffffff8119dcf0>] ? sync_buffer+0x0/0x50
Apr 20 11:07:50 node02 kernel: [<ffffffff814c9228>]
out_of_line_wait_on_bit_lock+0x78/0x90
Apr 20 11:07:50 node02 kernel: [<ffffffff81091ce0>] ?
wake_bit_function+0x0/0x50
Apr 20 11:07:50 node02 kernel: [<ffffffff8119ded6>] __lock_buffer+0x36/0x40
Apr 20 11:07:50 node02 kernel: [<ffffffffa00d82b3>]
do_get_write_access+0x473/0x500 [jbd2]
Apr 20 11:07:50 node02 kernel: [<ffffffff8119d3fc>] ? __getblk+0x2c/0x2e0
Apr 20 11:07:50 node02 kernel: [<ffffffff81200840>] ?
security_inode_alloc+0x40/0x60
Apr 20 11:07:50 node02 kernel: [<ffffffffa00d8491>]
jbd2_journal_get_write_access+0x31/0x50 [jbd2]
Apr 20 11:07:50 node02 kernel: [<ffffffffa012bd48>]
__ext4_journal_get_write_access+0x38/0x80 [ext4]
Apr 20 11:07:50 node02 kernel: [<ffffffffa01036c2>]
ext4_new_inode+0x3a2/0x10e0 [ext4]
Apr 20 11:07:50 node02 kernel: [<ffffffffa00d74e5>] ?
jbd2_journal_start+0xb5/0x100 [jbd2]
Apr 20 11:07:50 node02 kernel: [<ffffffffa0111a60>] ext4_create+0xc0/0x150
[ext4]
Apr 20 11:07:50 node02 kernel: [<ffffffff81179c24>] vfs_create+0xb4/0xe0
Apr 20 11:07:50 node02 kernel: [<ffffffff8117d766>] do_filp_open+0xa46/0xd40
Apr 20 11:07:50 node02 kernel: [<ffffffff81135837>] ?
handle_pte_fault+0xf7/0xad0
Apr 20 11:07:50 node02 kernel: [<ffffffff811892f2>] ? alloc_fd+0x92/0x160
Apr 20 11:07:50 node02 kernel: [<ffffffff81169929>] do_sys_open+0x69/0x140
Apr 20 11:07:50 node02 kernel: [<ffffffff81169a40>] sys_open+0x20/0x30
Apr 20 11:07:50 node02 kernel: [<ffffffff81013172>]
system_call_fastpath+0x16/0x1b
Apr 20 11:07:50 node02 kernel: block drbd0: [drbd0_worker/1851]
sock_sendmsg time expired, ko = 4294967260
Apr 20 11:07:56 node02 kernel: block drbd0: [drbd0_worker/1851]
sock_sendmsg time expired, ko = 4294967259
Apr 20 11:08:02 node02 kernel: block drbd0: [drbd0_worker/1851]
sock_sendmsg time expired, ko = 4294967258
Apr 20 11:08:08 node02 kernel: block drbd0: [drbd0_worker/1851]
sock_sendmsg time expired, ko = 4294967257
Apr 20 11:08:14 node02 kernel: block drbd0: [drbd0_worker/1851]
sock_sendmsg time expired, ko = 4294967256
.
.
.
----------%<----------

The logs on the secondary node show the same "sock_sendmsg time expired"
messages, but without the blocked-task messages for jbd2 and touch...


/proc/drbd on the primary node:

----------%<----------
# cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root at node02,
2011-02-23 17:37:52
 0: cs:VerifyS ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:2120 nr:0 dw:2140 dr:6689633 al:7 bm:0 lo:1 pe:9356 ua:2048 ap:5
ep:1 wo:b oos:0
        [>....................] verified:  0.8% (917276/923796)M
        finish: 642:35:47 speed: 404 (1,464) want: 112,640 K/sec
----------%<----------

DRBD configuration:

----------%<----------
global {
  usage-count no;
}
common {
  protocol C;
}
resource shared_partition {
        handlers {
          split-brain "/lib/drbd/notify-split-brain.sh root";
        }
        syncer {
          rate 110M;
          verify-alg sha1;
        }
  on node01. {
    device    /dev/drbd0;
    disk      /dev/vg_node01/lv_shared;
    address   10.2.30.141:7789;
    meta-disk internal;
  }
  on node02. {
    device    /dev/drbd0;
    disk      /dev/vg_node02/lv_shared;
    address   10.2.30.142:7789;
    meta-disk internal;
  }
}
----------%<----------

Processes on the primary node:

----------%<----------
# ps -eo pid,args,state,wchan | grep drbd
 1851 [drbd0_worker]              S sk_stream_wait_memory
 1859 [drbd0_receiver]            S drbd_pp_alloc
 1878 [drbd0_asender]             S sk_wait_data
 3571 [jbd2/drbd0-8]              D jbd2_journal_commit_transaction
 5872 grep drbd                   R -
----------%<----------

Processes on the secondary node:

----------%<----------
# ps -e -o pid,args,stat,wchan | grep drbd
 1862 [drbd0_worker]              S    sk_stream_wait_memory
 1870 [drbd0_receiver]            S    drbd_pp_alloc
 1889 [drbd0_asender]             S    sk_wait_data
 5159 grep drbd                   S+   pipe_wait
----------%<----------

I have also tested the network throughput over the bonded interface (which
DRBD uses for synchronization and verification); it is around 40 MBps.
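
An iperf run along the following lines is a typical way to measure this (the
invocation below is only illustrative, not necessarily the exact commands
used):

----------%<----------
# on the secondary node
iperf -s

# on the primary node, towards the secondary's bonded address
iperf -c 10.2.30.142 -t 30
----------%<----------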


Does anyone have an idea what is behind this problem?


