[DRBD-user] Dual-primary/ Very slow synchronization

Daniel Meszaros spam at meszi.de
Tue May 24 15:02:01 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Am 20.05.2011 10:22, schrieb Lars Ellenberg:
> According to you, it "had been working" before with the exact same
> hardware and configuration.
>
> Now if it does not anymore, then I strongly suspect network problems.
I am currently checking the network switches together with the DELL 
support. Just to sort out a possible responsibility of the 10GbE switches.

> Did you do some network benchmarks on the replication link recently?
> Packet loss, excessive retransmits, checksum errors, bad cabling?
> Port stats on the switch, any error counters?
> Duplex or other auto-negotiating mismatch?
>
> Use flood ping with both large (>32k) and small packet sizes, iperf,
> your favorite network integrity checker or benchmarking tool...
Up to now I did tests with:

1. netio: up to 230 MByte/s
2. iperf: always below 80MBit/s

...both did not affect the CPU usage stats in any bigger kind.

3. ping: 9k and 18k packages, simultaneously from/to both hosts.

...led to high CPU peaks. Services of the virtual guests not available 
for that time, dmesg log outputs about timeouts - the log entries that I 
posted already.

In the end of this email I attach three logiles of three machines 
starting from the moment when I restarted the resync of the 
split-brain-DRBD.

host1: The machine where the running virtual guests reside. Runs 
Xenserver 5.6.
host2: The machine that lost the DRBD sync. No running virtual machines. 
Runs Xenserver 5.6.
guest: The mailserver as an example for all other virtual guests. Runs 
CentOS 5.5. Resides on host1.

An explanation for the timeline:

13:41:42: DRBD resync started on host2
13:49:25: DRBD resync aborted by disconnecting on host2
13:51:44: DRBD disconnection still not finished. So I tried stopping the 
DRBD service on host2. After that has also been refused, I rebooted host2.

While the DRBD resync attempt was in progress the mailserver was not 
accessible in the network, neither its mail services nor the webmail 
service or the ssh console. In that time top showed just very few CPU 
usage on host1 but around 95% on host2 (drbd_receiver process).

Any further ideas?

I am already thinking about installing a fresh CentOS 5.6 or Debian 6 on 
host2, setting up DRBD as Standalone, installing Xen and moving (system 
as image, data as copy) onto that system ... then setting up host1 
similarly and syncing everything onto its I'd lose fine things like 
XenMotion but I possibly gained much more control over the resources. 
What's your opinion about that? Possibly Xenserver, my hardware setup 
and my virtualization goals simply do not fit? :-/

CU,
Daniel.

---
host1 /var/log messages

May 24 13:41:45 localhost kernel: block drbd0: Handshake successful: 
Agreed network protocol version 94
May 24 13:41:45 localhost kernel: block drbd0: conn( WFConnection -> 
WFReportParams )
May 24 13:41:45 localhost kernel: block drbd0: Starting asender thread 
(from drbd0_receiver [12355])
May 24 13:41:45 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:41:45 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:41:45 localhost kernel: block drbd0: self 
06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE 
bits:1420811336 flags:0
May 24 13:41:45 localhost kernel: block drbd0: peer 
1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5 
bits:3874219255 flags:0
May 24 13:41:45 localhost kernel: block drbd0: uuid_compare()=1 by rule 70
May 24 13:41:45 localhost kernel: block drbd0: Becoming sync source due 
to disk states.
May 24 13:41:45 localhost kernel: block drbd0: peer( Unknown -> 
Secondary ) conn( WFReportParams -> WFBitMapS )
May 24 13:41:47 localhost kernel: block drbd0: peer( Secondary -> 
Unknown ) conn( WFBitMapS -> TearDown )
May 24 13:41:47 localhost kernel: block drbd0: asender terminated
May 24 13:41:47 localhost kernel: block drbd0: Terminating asender thread
May 24 13:41:51 localhost kernel: block drbd0: sock_sendmsg returned -32
May 24 13:41:51 localhost kernel: block drbd0: short sent ReportBitMap 
size=4096 sent=408
May 24 13:41:51 localhost kernel: block drbd0: Connection closed
May 24 13:41:51 localhost kernel: block drbd0: conn( TearDown -> 
Unconnected )
May 24 13:41:51 localhost kernel: block drbd0: receiver terminated
May 24 13:41:51 localhost kernel: block drbd0: Restarting receiver thread
May 24 13:41:51 localhost kernel: block drbd0: receiver (re)started
May 24 13:41:51 localhost kernel: block drbd0: conn( Unconnected -> 
WFConnection )
May 24 13:42:11 localhost kernel: block drbd0: Handshake successful: 
Agreed network protocol version 94
May 24 13:42:11 localhost kernel: block drbd0: conn( WFConnection -> 
WFReportParams )
May 24 13:42:11 localhost kernel: block drbd0: Starting asender thread 
(from drbd0_receiver [12355])
May 24 13:42:11 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:42:11 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:42:11 localhost kernel: block drbd0: self 
06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE 
bits:1420811336 flags:0
May 24 13:42:11 localhost kernel: block drbd0: peer 
1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5 
bits:3874219261 flags:1
May 24 13:42:11 localhost kernel: block drbd0: uuid_compare()=1 by rule 70
May 24 13:42:11 localhost kernel: block drbd0: Becoming sync source due 
to disk states.
May 24 13:42:11 localhost kernel: block drbd0: peer( Unknown -> 
Secondary ) conn( WFReportParams -> WFBitMapS )
May 24 13:49:25 localhost kernel: block drbd0: peer( Secondary -> 
Unknown ) conn( WFBitMapS -> TearDown )
May 24 13:49:25 localhost kernel: block drbd0: asender terminated
May 24 13:49:25 localhost kernel: block drbd0: Terminating asender thread
May 24 13:52:03 localhost kernel: INFO: task drbd0_receiver:12355 
blocked for more than 120 seconds.
May 24 13:52:03 localhost kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 24 13:52:03 localhost kernel: drbd0_receive D c0120d7a     0 
12355      2
May 24 13:52:03 localhost kernel:        ed769f0c 00000246 eab68900 
c0120d7a 000002d0 00000000 ed768000 ed769ec0
May 24 13:52:03 localhost kernel:        c0134ecc 64260759 00014afe 
ed769edc edd5d938 edd5d7b0 edd5d938 c16bdb00
May 24 13:52:03 localhost kernel:        00000000 ea87d740 ed769ed8 
0e2cd9dc 00000000 00000005 004e6cd1 c0346aaf
May 24 13:52:03 localhost kernel: Call Trace:
May 24 13:52:03 localhost kernel:  [<c0120d7a>] ? wake_up_state+0xa/0x10
May 24 13:52:03 localhost kernel:  [<c0134ecc>] ? 
complete_signal+0x17c/0x1b0
May 24 13:52:03 localhost kernel:  [<c0346aaf>] ? wait_for_common+0x1f/0x1c0
May 24 13:52:03 localhost kernel:  [<c0135b5b>] ? force_sig_info+0x9b/0xb0
May 24 13:52:03 localhost kernel:  [<c03473b5>] 
__mutex_lock_slowpath+0x55/0xd0
May 24 13:52:03 localhost kernel:  [<c03471b7>] mutex_lock+0x17/0x20
May 24 13:52:03 localhost kernel:  [<f062306e>] drbd_free_sock+0x1e/0xa0 
[drbd]
May 24 13:52:03 localhost kernel:  [<f0612b85>] 
drbd_disconnect+0x45/0x7d0 [drbd]
May 24 13:52:03 localhost kernel:  [<f06114c6>] ? 
drbd_recv_header+0x16/0xa0 [drbd]
May 24 13:52:03 localhost kernel:  [<c011eae0>] ? __wake_up+0x40/0x50
May 24 13:52:03 localhost kernel:  [<f0614c21>] drbdd_init+0xb1/0x120 [drbd]
May 24 13:52:03 localhost kernel:  [<f0627d22>] 
drbd_thread_setup+0x132/0x1d0 [drbd]
May 24 13:52:03 localhost kernel:  [<f0627bf0>] ? 
drbd_thread_setup+0x0/0x1d0 [drbd]
May 24 13:52:03 localhost kernel:  [<c010569b>] 
kernel_thread_helper+0x7/0x10
May 24 13:52:03 localhost kernel:  =======================
May 24 13:53:21 localhost kernel: block drbd0: short sent ReportBitMap 
size=4096 sent=1392
May 24 13:53:21 localhost kernel: block drbd0: Connection closed
May 24 13:53:21 localhost kernel: block drbd0: conn( TearDown -> 
Unconnected )
May 24 13:53:21 localhost kernel: block drbd0: receiver terminated
May 24 13:53:21 localhost kernel: block drbd0: Restarting receiver thread
May 24 13:53:21 localhost kernel: block drbd0: receiver (re)started
May 24 13:53:21 localhost kernel: block drbd0: conn( Unconnected -> 
WFConnection )

---
host2 /var/log/messages

May 24 13:41:42 localhost kernel: drbd: initialized. Version: 8.3.8.1 
(api:88/proto:86-94)
May 24 13:41:42 localhost kernel: drbd: GIT-hash: 
0d8589fcc32c874df57c930ca1691399b55ec893 build by 
root at localhost.localdomain, 2010-08-12 07:46:27
May 24 13:41:42 localhost kernel: drbd: registered as block device major 147
May 24 13:41:42 localhost kernel: drbd: minor_table @ 0xeb0d6340
May 24 13:41:42 localhost kernel: block drbd0: Starting worker thread 
(from cqueue [6339])
May 24 13:41:42 localhost kernel: block drbd0: disk( Diskless -> Attaching )
May 24 13:41:42 localhost kernel: klogd 1.4.1, ---------- state change 
----------
May 24 13:41:43 localhost kernel: block drbd0: Found 10 transactions 
(582 active extents) in activity log.
May 24 13:41:43 localhost kernel: block drbd0: Method to ensure write 
ordering: drain
May 24 13:41:43 localhost kernel: block drbd0: max_segment_size ( = BIO 
size ) = 32768
May 24 13:41:43 localhost kernel: block drbd0: Adjusting my ra_pages to 
backing device's (32 -> 256)
May 24 13:41:43 localhost kernel: block drbd0: drbd_bm_resize called 
with capacity == 31204648888
May 24 13:41:43 localhost kernel: block drbd0: resync bitmap: 
bits=3900581111 words=121893160
May 24 13:41:43 localhost kernel: block drbd0: size = 15 TB (15602324444 KB)
May 24 13:41:45 localhost kernel: block drbd0: recounting of set bits 
took additional 98 jiffies
May 24 13:41:45 localhost kernel: block drbd0: 2491 GB (3874219255 bits) 
marked out-of-sync by on disk bit-map.
May 24 13:41:45 localhost kernel: block drbd0: disk( Attaching -> 
Inconsistent )
May 24 13:41:45 localhost kernel: block drbd0: conn( StandAlone -> 
Unconnected )
May 24 13:41:45 localhost kernel: block drbd0: Starting receiver thread 
(from drbd0_worker [6355])
May 24 13:41:45 localhost kernel: block drbd0: receiver (re)started
May 24 13:41:45 localhost kernel: block drbd0: conn( Unconnected -> 
WFConnection )
May 24 13:41:45 localhost kernel: block drbd0: Handshake successful: 
Agreed network protocol version 94
May 24 13:41:45 localhost kernel: block drbd0: conn( WFConnection -> 
WFReportParams )
May 24 13:41:45 localhost kernel: block drbd0: Starting asender thread 
(from drbd0_receiver [6367])
May 24 13:41:45 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:41:45 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:41:45 localhost kernel: block drbd0: self 
1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5 
bits:3874219255 flags:0
May 24 13:41:45 localhost kernel: block drbd0: peer 
06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE 
bits:1420811336 flags:0
May 24 13:41:45 localhost kernel: block drbd0: uuid_compare()=-1 by rule 50
May 24 13:41:45 localhost kernel: block drbd0: Becoming sync target due 
to disk states.
May 24 13:41:45 localhost kernel: block drbd0: peer( Unknown -> Primary 
) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
peer_isp( 0 -> 1 )
May 24 13:41:47 localhost kernel: block drbd0: peer( Primary -> Unknown 
) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
peer_isp( 1 -> 0 )
May 24 13:41:47 localhost kernel: block drbd0: error receiving 
ReportBitMap, l: 4088!
May 24 13:41:47 localhost kernel: block drbd0: asender terminated
May 24 13:41:47 localhost kernel: block drbd0: Terminating asender thread
May 24 13:41:47 localhost kernel: block drbd0: Connection closed
May 24 13:41:47 localhost kernel: block drbd0: conn( Disconnecting -> 
StandAlone )
May 24 13:41:47 localhost kernel: block drbd0: receiver terminated
May 24 13:41:47 localhost kernel: block drbd0: Terminating receiver thread
May 24 13:42:11 localhost kernel: block drbd0: conn( StandAlone -> 
Unconnected )
May 24 13:42:11 localhost kernel: block drbd0: Starting receiver thread 
(from drbd0_worker [6355])
May 24 13:42:11 localhost kernel: block drbd0: receiver (re)started
May 24 13:42:11 localhost kernel: block drbd0: conn( Unconnected -> 
WFConnection )
May 24 13:42:11 localhost kernel: block drbd0: Handshake successful: 
Agreed network protocol version 94
May 24 13:42:11 localhost kernel: block drbd0: conn( WFConnection -> 
WFReportParams )
May 24 13:42:11 localhost kernel: block drbd0: Starting asender thread 
(from drbd0_receiver [6412])
May 24 13:42:11 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:42:11 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:42:11 localhost kernel: block drbd0: self 
1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5 
bits:3874219261 flags:0
May 24 13:42:11 localhost kernel: block drbd0: peer 
06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE 
bits:1420811336 flags:0
May 24 13:42:11 localhost kernel: block drbd0: uuid_compare()=-1 by rule 50
May 24 13:42:11 localhost kernel: block drbd0: Becoming sync target due 
to disk states.
May 24 13:42:11 localhost kernel: block drbd0: peer( Unknown -> Primary 
) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
peer_isp( 0 -> 1 )
May 24 13:49:25 localhost kernel: block drbd0: peer( Primary -> Unknown 
) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
peer_isp( 1 -> 0 )
May 24 13:49:35 localhost kernel: block drbd0: process_done_ee() = NOT_OK
May 24 13:49:35 localhost kernel: block drbd0: asender terminated
May 24 13:49:35 localhost kernel: block drbd0: Terminating asender thread

Broadcast message from root (Tue May 24 13:51:44 2011):

The system is going down for reboot NOW!
May 24 13:51:44 localhost shutdown[6941]: shutting down for system reboot
May 24 13:52:02 localhost python: PERFMON: Caught signal 15 - exiting
May 24 13:52:06 localhost mountd[10542]: Caught signal 15, 
un-registering and exiting.
May 24 13:52:07 localhost kernel: nfsd: last server has exited, flushing 
export cache

---
mailserver (virtual guest) /var/log/dmesg

INFO: task mailserver:3114 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mailserver    D ffff8800e5e66860     0  3114      1          3117  3105 
(NOTLB)
  ffff8800d444bc88  0000000000000286  ffff8800ff99e040  ffff880000eec670
  000000000000000a  ffff8800e5e66860  ffffffff804fdb00  000000000001f341
  ffff8800e5e66a48  ffffffff802648f1
Call Trace:
  [<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
  [<ffffffff8022937f>] sync_page+0x0/0x43
  [<ffffffff8022937f>] sync_page+0x0/0x43
  [<ffffffff802635d9>] io_schedule+0x3f/0x67
  [<ffffffff802293bd>] sync_page+0x3e/0x43
  [<ffffffff8026371d>] __wait_on_bit_lock+0x36/0x66
  [<ffffffff80241b63>] __lock_page+0x5e/0x64
  [<ffffffff8029e060>] wake_bit_function+0x0/0x23
  [<ffffffff80214338>] filemap_nopage+0x268/0x360
  [<ffffffff80208e8c>] __handle_mm_fault+0x442/0x1445
  [<ffffffff80267d58>] do_page_fault+0xf7b/0x1324
  [<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
  [<ffffffff8023fa13>] lock_timer_base+0x1b/0x3c
  [<ffffffff8024d2d6>] try_to_del_timer_sync+0x7f/0x88
  [<ffffffff8026082b>] error_exit+0x0/0x6e

INFO: task mailserver:3136 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mailserver    D ffff8800d796c0c0     0  3136      1          3262  3128 
(NOTLB)
  ffff8800d4b83ef8  0000000000000286  ffff8800e4943680  ffff8800ff9914f8
  000000000000000a  ffff8800d796c0c0  ffffffff804fdb00  000000000000b5e2
  ffff8800d796c2a8  ffffffff8021f04a
Call Trace:
  [<ffffffff8021f04a>] __dentry_open+0x101/0x1dc
  [<ffffffff8020e8da>] vfs_getattr+0x62/0xa9
  [<ffffffff802643ea>] __down_write_nested+0x82/0x9a
  [<ffffffff802c971b>] sys_mmap_pgoff+0x55/0xac
  [<ffffffff8026168d>] ia32_sysret+0x0/0x5





More information about the drbd-user mailing list