Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Am 20.05.2011 10:22, schrieb Lars Ellenberg: > According to you, it "had been working" before with the exact same > hardware and configuration. > > Now if it does not anymore, then I strongly suspect network problems. I am currently checking the network switches together with the DELL support. Just to sort out a possible responsibility of the 10GbE switches. > Did you do some network benchmarks on the replication link recently? > Packet loss, excessive retransmits, checksum errors, bad cabling? > Port stats on the switch, any error counters? > Duplex or other auto-negotiating mismatch? > > Use flood ping with both large (>32k) and small packet sizes, iperf, > your favorite network integrity checker or benchmarking tool... Up to now I did tests with: 1. netio: up to 230 MByte/s 2. iperf: always below 80MBit/s ...both did not affect the CPU usage stats in any bigger kind. 3. ping: 9k and 18k packages, simultaneously from/to both hosts. ...led to high CPU peaks. Services of the virtual guests not available for that time, dmesg log outputs about timeouts - the log entries that I posted already. In the end of this email I attach three logiles of three machines starting from the moment when I restarted the resync of the split-brain-DRBD. host1: The machine where the running virtual guests reside. Runs Xenserver 5.6. host2: The machine that lost the DRBD sync. No running virtual machines. Runs Xenserver 5.6. guest: The mailserver as an example for all other virtual guests. Runs CentOS 5.5. Resides on host1. An explanation for the timeline: 13:41:42: DRBD resync started on host2 13:49:25: DRBD resync aborted by disconnecting on host2 13:51:44: DRBD disconnection still not finished. So I tried stopping the DRBD service on host2. After that has also been refused, I rebooted host2. While the DRBD resync attempt was in progress the mailserver was not accessible in the network, neither its mail services nor the webmail service or the ssh console. In that time top showed just very few CPU usage on host1 but around 95% on host2 (drbd_receiver process). Any further ideas? I am already thinking about installing a fresh CentOS 5.6 or Debian 6 on host2, setting up DRBD as Standalone, installing Xen and moving (system as image, data as copy) onto that system ... then setting up host1 similarly and syncing everything onto its I'd lose fine things like XenMotion but I possibly gained much more control over the resources. What's your opinion about that? Possibly Xenserver, my hardware setup and my virtualization goals simply do not fit? :-/ CU, Daniel. --- host1 /var/log messages May 24 13:41:45 localhost kernel: block drbd0: Handshake successful: Agreed network protocol version 94 May 24 13:41:45 localhost kernel: block drbd0: conn( WFConnection -> WFReportParams ) May 24 13:41:45 localhost kernel: block drbd0: Starting asender thread (from drbd0_receiver [12355]) May 24 13:41:45 localhost kernel: block drbd0: data-integrity-alg: crc32c May 24 13:41:45 localhost kernel: block drbd0: drbd_sync_handshake: May 24 13:41:45 localhost kernel: block drbd0: self 06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE bits:1420811336 flags:0 May 24 13:41:45 localhost kernel: block drbd0: peer 1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5 bits:3874219255 flags:0 May 24 13:41:45 localhost kernel: block drbd0: uuid_compare()=1 by rule 70 May 24 13:41:45 localhost kernel: block drbd0: Becoming sync source due to disk states. May 24 13:41:45 localhost kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) May 24 13:41:47 localhost kernel: block drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> TearDown ) May 24 13:41:47 localhost kernel: block drbd0: asender terminated May 24 13:41:47 localhost kernel: block drbd0: Terminating asender thread May 24 13:41:51 localhost kernel: block drbd0: sock_sendmsg returned -32 May 24 13:41:51 localhost kernel: block drbd0: short sent ReportBitMap size=4096 sent=408 May 24 13:41:51 localhost kernel: block drbd0: Connection closed May 24 13:41:51 localhost kernel: block drbd0: conn( TearDown -> Unconnected ) May 24 13:41:51 localhost kernel: block drbd0: receiver terminated May 24 13:41:51 localhost kernel: block drbd0: Restarting receiver thread May 24 13:41:51 localhost kernel: block drbd0: receiver (re)started May 24 13:41:51 localhost kernel: block drbd0: conn( Unconnected -> WFConnection ) May 24 13:42:11 localhost kernel: block drbd0: Handshake successful: Agreed network protocol version 94 May 24 13:42:11 localhost kernel: block drbd0: conn( WFConnection -> WFReportParams ) May 24 13:42:11 localhost kernel: block drbd0: Starting asender thread (from drbd0_receiver [12355]) May 24 13:42:11 localhost kernel: block drbd0: data-integrity-alg: crc32c May 24 13:42:11 localhost kernel: block drbd0: drbd_sync_handshake: May 24 13:42:11 localhost kernel: block drbd0: self 06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE bits:1420811336 flags:0 May 24 13:42:11 localhost kernel: block drbd0: peer 1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5 bits:3874219261 flags:1 May 24 13:42:11 localhost kernel: block drbd0: uuid_compare()=1 by rule 70 May 24 13:42:11 localhost kernel: block drbd0: Becoming sync source due to disk states. May 24 13:42:11 localhost kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) May 24 13:49:25 localhost kernel: block drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> TearDown ) May 24 13:49:25 localhost kernel: block drbd0: asender terminated May 24 13:49:25 localhost kernel: block drbd0: Terminating asender thread May 24 13:52:03 localhost kernel: INFO: task drbd0_receiver:12355 blocked for more than 120 seconds. May 24 13:52:03 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 24 13:52:03 localhost kernel: drbd0_receive D c0120d7a 0 12355 2 May 24 13:52:03 localhost kernel: ed769f0c 00000246 eab68900 c0120d7a 000002d0 00000000 ed768000 ed769ec0 May 24 13:52:03 localhost kernel: c0134ecc 64260759 00014afe ed769edc edd5d938 edd5d7b0 edd5d938 c16bdb00 May 24 13:52:03 localhost kernel: 00000000 ea87d740 ed769ed8 0e2cd9dc 00000000 00000005 004e6cd1 c0346aaf May 24 13:52:03 localhost kernel: Call Trace: May 24 13:52:03 localhost kernel: [<c0120d7a>] ? wake_up_state+0xa/0x10 May 24 13:52:03 localhost kernel: [<c0134ecc>] ? complete_signal+0x17c/0x1b0 May 24 13:52:03 localhost kernel: [<c0346aaf>] ? wait_for_common+0x1f/0x1c0 May 24 13:52:03 localhost kernel: [<c0135b5b>] ? force_sig_info+0x9b/0xb0 May 24 13:52:03 localhost kernel: [<c03473b5>] __mutex_lock_slowpath+0x55/0xd0 May 24 13:52:03 localhost kernel: [<c03471b7>] mutex_lock+0x17/0x20 May 24 13:52:03 localhost kernel: [<f062306e>] drbd_free_sock+0x1e/0xa0 [drbd] May 24 13:52:03 localhost kernel: [<f0612b85>] drbd_disconnect+0x45/0x7d0 [drbd] May 24 13:52:03 localhost kernel: [<f06114c6>] ? drbd_recv_header+0x16/0xa0 [drbd] May 24 13:52:03 localhost kernel: [<c011eae0>] ? __wake_up+0x40/0x50 May 24 13:52:03 localhost kernel: [<f0614c21>] drbdd_init+0xb1/0x120 [drbd] May 24 13:52:03 localhost kernel: [<f0627d22>] drbd_thread_setup+0x132/0x1d0 [drbd] May 24 13:52:03 localhost kernel: [<f0627bf0>] ? drbd_thread_setup+0x0/0x1d0 [drbd] May 24 13:52:03 localhost kernel: [<c010569b>] kernel_thread_helper+0x7/0x10 May 24 13:52:03 localhost kernel: ======================= May 24 13:53:21 localhost kernel: block drbd0: short sent ReportBitMap size=4096 sent=1392 May 24 13:53:21 localhost kernel: block drbd0: Connection closed May 24 13:53:21 localhost kernel: block drbd0: conn( TearDown -> Unconnected ) May 24 13:53:21 localhost kernel: block drbd0: receiver terminated May 24 13:53:21 localhost kernel: block drbd0: Restarting receiver thread May 24 13:53:21 localhost kernel: block drbd0: receiver (re)started May 24 13:53:21 localhost kernel: block drbd0: conn( Unconnected -> WFConnection ) --- host2 /var/log/messages May 24 13:41:42 localhost kernel: drbd: initialized. Version: 8.3.8.1 (api:88/proto:86-94) May 24 13:41:42 localhost kernel: drbd: GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by root at localhost.localdomain, 2010-08-12 07:46:27 May 24 13:41:42 localhost kernel: drbd: registered as block device major 147 May 24 13:41:42 localhost kernel: drbd: minor_table @ 0xeb0d6340 May 24 13:41:42 localhost kernel: block drbd0: Starting worker thread (from cqueue [6339]) May 24 13:41:42 localhost kernel: block drbd0: disk( Diskless -> Attaching ) May 24 13:41:42 localhost kernel: klogd 1.4.1, ---------- state change ---------- May 24 13:41:43 localhost kernel: block drbd0: Found 10 transactions (582 active extents) in activity log. May 24 13:41:43 localhost kernel: block drbd0: Method to ensure write ordering: drain May 24 13:41:43 localhost kernel: block drbd0: max_segment_size ( = BIO size ) = 32768 May 24 13:41:43 localhost kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 256) May 24 13:41:43 localhost kernel: block drbd0: drbd_bm_resize called with capacity == 31204648888 May 24 13:41:43 localhost kernel: block drbd0: resync bitmap: bits=3900581111 words=121893160 May 24 13:41:43 localhost kernel: block drbd0: size = 15 TB (15602324444 KB) May 24 13:41:45 localhost kernel: block drbd0: recounting of set bits took additional 98 jiffies May 24 13:41:45 localhost kernel: block drbd0: 2491 GB (3874219255 bits) marked out-of-sync by on disk bit-map. May 24 13:41:45 localhost kernel: block drbd0: disk( Attaching -> Inconsistent ) May 24 13:41:45 localhost kernel: block drbd0: conn( StandAlone -> Unconnected ) May 24 13:41:45 localhost kernel: block drbd0: Starting receiver thread (from drbd0_worker [6355]) May 24 13:41:45 localhost kernel: block drbd0: receiver (re)started May 24 13:41:45 localhost kernel: block drbd0: conn( Unconnected -> WFConnection ) May 24 13:41:45 localhost kernel: block drbd0: Handshake successful: Agreed network protocol version 94 May 24 13:41:45 localhost kernel: block drbd0: conn( WFConnection -> WFReportParams ) May 24 13:41:45 localhost kernel: block drbd0: Starting asender thread (from drbd0_receiver [6367]) May 24 13:41:45 localhost kernel: block drbd0: data-integrity-alg: crc32c May 24 13:41:45 localhost kernel: block drbd0: drbd_sync_handshake: May 24 13:41:45 localhost kernel: block drbd0: self 1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5 bits:3874219255 flags:0 May 24 13:41:45 localhost kernel: block drbd0: peer 06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE bits:1420811336 flags:0 May 24 13:41:45 localhost kernel: block drbd0: uuid_compare()=-1 by rule 50 May 24 13:41:45 localhost kernel: block drbd0: Becoming sync target due to disk states. May 24 13:41:45 localhost kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 ) May 24 13:41:47 localhost kernel: block drbd0: peer( Primary -> Unknown ) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 ) May 24 13:41:47 localhost kernel: block drbd0: error receiving ReportBitMap, l: 4088! May 24 13:41:47 localhost kernel: block drbd0: asender terminated May 24 13:41:47 localhost kernel: block drbd0: Terminating asender thread May 24 13:41:47 localhost kernel: block drbd0: Connection closed May 24 13:41:47 localhost kernel: block drbd0: conn( Disconnecting -> StandAlone ) May 24 13:41:47 localhost kernel: block drbd0: receiver terminated May 24 13:41:47 localhost kernel: block drbd0: Terminating receiver thread May 24 13:42:11 localhost kernel: block drbd0: conn( StandAlone -> Unconnected ) May 24 13:42:11 localhost kernel: block drbd0: Starting receiver thread (from drbd0_worker [6355]) May 24 13:42:11 localhost kernel: block drbd0: receiver (re)started May 24 13:42:11 localhost kernel: block drbd0: conn( Unconnected -> WFConnection ) May 24 13:42:11 localhost kernel: block drbd0: Handshake successful: Agreed network protocol version 94 May 24 13:42:11 localhost kernel: block drbd0: conn( WFConnection -> WFReportParams ) May 24 13:42:11 localhost kernel: block drbd0: Starting asender thread (from drbd0_receiver [6412]) May 24 13:42:11 localhost kernel: block drbd0: data-integrity-alg: crc32c May 24 13:42:11 localhost kernel: block drbd0: drbd_sync_handshake: May 24 13:42:11 localhost kernel: block drbd0: self 1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5 bits:3874219261 flags:0 May 24 13:42:11 localhost kernel: block drbd0: peer 06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE bits:1420811336 flags:0 May 24 13:42:11 localhost kernel: block drbd0: uuid_compare()=-1 by rule 50 May 24 13:42:11 localhost kernel: block drbd0: Becoming sync target due to disk states. May 24 13:42:11 localhost kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 ) May 24 13:49:25 localhost kernel: block drbd0: peer( Primary -> Unknown ) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 ) May 24 13:49:35 localhost kernel: block drbd0: process_done_ee() = NOT_OK May 24 13:49:35 localhost kernel: block drbd0: asender terminated May 24 13:49:35 localhost kernel: block drbd0: Terminating asender thread Broadcast message from root (Tue May 24 13:51:44 2011): The system is going down for reboot NOW! May 24 13:51:44 localhost shutdown[6941]: shutting down for system reboot May 24 13:52:02 localhost python: PERFMON: Caught signal 15 - exiting May 24 13:52:06 localhost mountd[10542]: Caught signal 15, un-registering and exiting. May 24 13:52:07 localhost kernel: nfsd: last server has exited, flushing export cache --- mailserver (virtual guest) /var/log/dmesg INFO: task mailserver:3114 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mailserver D ffff8800e5e66860 0 3114 1 3117 3105 (NOTLB) ffff8800d444bc88 0000000000000286 ffff8800ff99e040 ffff880000eec670 000000000000000a ffff8800e5e66860 ffffffff804fdb00 000000000001f341 ffff8800e5e66a48 ffffffff802648f1 Call Trace: [<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14 [<ffffffff8022937f>] sync_page+0x0/0x43 [<ffffffff8022937f>] sync_page+0x0/0x43 [<ffffffff802635d9>] io_schedule+0x3f/0x67 [<ffffffff802293bd>] sync_page+0x3e/0x43 [<ffffffff8026371d>] __wait_on_bit_lock+0x36/0x66 [<ffffffff80241b63>] __lock_page+0x5e/0x64 [<ffffffff8029e060>] wake_bit_function+0x0/0x23 [<ffffffff80214338>] filemap_nopage+0x268/0x360 [<ffffffff80208e8c>] __handle_mm_fault+0x442/0x1445 [<ffffffff80267d58>] do_page_fault+0xf7b/0x1324 [<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14 [<ffffffff8023fa13>] lock_timer_base+0x1b/0x3c [<ffffffff8024d2d6>] try_to_del_timer_sync+0x7f/0x88 [<ffffffff8026082b>] error_exit+0x0/0x6e INFO: task mailserver:3136 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mailserver D ffff8800d796c0c0 0 3136 1 3262 3128 (NOTLB) ffff8800d4b83ef8 0000000000000286 ffff8800e4943680 ffff8800ff9914f8 000000000000000a ffff8800d796c0c0 ffffffff804fdb00 000000000000b5e2 ffff8800d796c2a8 ffffffff8021f04a Call Trace: [<ffffffff8021f04a>] __dentry_open+0x101/0x1dc [<ffffffff8020e8da>] vfs_getattr+0x62/0xa9 [<ffffffff802643ea>] __down_write_nested+0x82/0x9a [<ffffffff802c971b>] sys_mmap_pgoff+0x55/0xac [<ffffffff8026168d>] ia32_sysret+0x0/0x5