[DRBD-user] XEN and drbd: connection unstable/CPU freeze under IO load

Theo Band theo.band at greenpeak.com
Tue Sep 14 13:57:11 CEST 2010



I have a simple drbd83 primary/secondary setup between two machines.
Both machines run CentOS 5.5 and drbd, and I can reproduce the problem
on any combination of two hosts out of the four that I currently have
installed.

I see two kinds of errors (digest integrity failures and CPU lockups)
that I would like to understand. I've read a similar thread here:
http://lists.linbit.com/pipermail/drbd-user/2009-February/011357.html

I disabled swap on the VM, but it made no difference.

Xen1
2.6.18-194.3.1.el5xen
drbd83-8.3.8-1.el5.centos
kmod-drbd83-xen-8.3.8-1.el5.centos

xen2
2.6.18-194.11.3.el5xen
drbd83-8.3.8-1.el5.centos
kmod-drbd83-xen-8.3.8-1.el5.centos

xen3
2.6.18-194.11.1.el5xen
drbd83-8.3.8-1.el5.centos
kmod-drbd83-xen-8.3.8-1.el5.centos

xen4
2.6.18-194.3.1.el5xen
drbd83-8.3.8-1.el5.centos
kmod-drbd83-xen-8.3.8-1.el5.centos

On each machine I created a logical volume as the backing store for the
drbd device. The drbd device is given to a Xen virtual machine:

Xen config:
disk = [ "phy:/dev/drbd_gans,xvda,w" ]

As soon as I increase the load in the VM (both CPU and IO at the same
time), the Xen host disconnects and reconnects:

on xen3 (primary)
Sep 14 11:46:50 xenutrecht03 kernel: block drbd1: sock was shut down by peer
Sep 14 11:46:50 xenutrecht03 kernel: block drbd1: peer( Secondary ->
Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Sep 14 11:46:50 xenutrecht03 kernel: block drbd1: short read expecting
header on sock: r=0

on xen1
Sep 14 11:46:50 xenutrecht kernel: block drbd1: Digest integrity check
FAILED.
Sep 14 11:46:50 xenutrecht kernel: block drbd1: error receiving Data, l:
28700!
Sep 14 11:46:50 xenutrecht kernel: block drbd1: peer( Primary -> Unknown
) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Sep 14 11:46:50 xenutrecht kernel: block drbd1: asender terminated


I also tried two other hosts (xen2 and xen4) with another VM:

on xen2 (primary)
Sep 14 12:43:25 xenutrecht02 kernel: block drbd5: Digest integrity check
FAILED.
Sep 14 12:43:25 xenutrecht02 kernel: block drbd5: error receiving Data,
l: 20508!

on xen4
Sep 14 12:43:25 valk kernel: block drbd5: sock was shut down by peer
Sep 14 12:43:25 valk kernel: block drbd5: peer( Secondary -> Unknown )
conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )


High CPU load by itself does not matter. High IO bandwidth (dd
if=/dev/zero of=/tmp/test) also does not seem to matter. But real-life
use triggers this behavior consistently (once or twice a day). I found
an easy way to reproduce it by running pi 22 (version 2.0 of super_pi
for Linux). This happens to write a lot of data files to /tmp, which
triggers this unwanted behavior after 10 seconds.
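
For anyone without super_pi at hand, a rough stand-in for the "many
files written to /tmp" part of that workload could look like the sketch
below. It is only an assumption about what matters in the load: the
function name, file names, file count and file size are made up for
illustration, it does not reproduce super_pi's real IO pattern or its
CPU load, and it has not been shown to trigger the disconnect.

```python
import os
import tempfile


def write_many_files(directory, count=200, size=64 * 1024):
    """Write `count` files of `size` random bytes each, fsyncing every file.

    Returns the total number of bytes written. Random data is used
    instead of zeros so the workload differs from the dd test.
    """
    total = 0
    for i in range(count):
        path = os.path.join(directory, f"chunk_{i:04d}.dat")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
            f.flush()
            os.fsync(f.fileno())  # push the data down to the block device
        total += size
    return total


if __name__ == "__main__":
    # Files are created in a scratch directory under /tmp and removed afterwards.
    with tempfile.TemporaryDirectory(dir="/tmp") as workdir:
        written = write_many_files(workdir)
        print(f"wrote {written} bytes to {workdir}")
```

Run inside the VM, this would need a CPU-bound process next to it to
resemble the combined CPU and IO load described above.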

The four Xen hosts are all different brands of hardware, but all Intel.
What they have in common is the managed switch that connects them, and
each is equipped with a single on-board Ethernet adapter. I checked the
packet error logs in the switch, but that is not likely the problem.
Errors in the switch should also not influence the drbd channel, as the
TCP packets would just be retransmitted. Performance is currently not my
biggest concern.

The verify-alg was added to the configuration because of unresponsive
VMs. Before this statement was added, I had to manually disconnect and
reconnect to unblock the disk for the VM. The drbd state looked fine,
but the VMs were frozen due to blocked IO. Currently the problem repairs
itself, but I'm not happy with these error messages. Even now I get the
following in the log on one machine, which indicates a CPU lockup:

xen4 (primary)
Sep 14 10:28:19 valk kernel: coretemp coretemp.0: Unable to access MSR
0xEE, for Tjmax,...
block drbd1: PingAck did not arrive in time.
Sep 14 10:28:19 valk kernel: block drbd1: peer( Secondary -> Unknown )
conn( Connected -> NetworkFailure ) pdsk(UpToDate -> DUnknown )
Sep 14 10:28:19 valk kernel: block drbd1: asender terminated
<snip>
Sep 14 10:28:27 valk kernel: NETDEV WATCHDOG: peth0: transmit timed out
Sep 14 10:28:27 valk kernel: block drbd1: Connection closed
Sep 14 10:28:27 valk kernel: block drbd1: conn( NetworkFailure ->
Unconnected )
Sep 14 10:28:27 valk kernel: block drbd1: receiver terminated
Sep 14 10:28:27 valk kernel: block drbd1: Restarting receiver thread
Sep 14 10:28:27 valk kernel: block drbd1: receiver (re)started
Sep 14 10:28:27 valk kernel: block drbd1: conn( Unconnected ->
WFConnection )
Sep 14 10:28:28 valk kernel: BUG: soft lockup - CPU#0 stuck for 11s!
[swapper:0]
Sep 14 10:28:28 valk kernel: CPU 0:
Sep 14 10:28:28 valk kernel: Modules linked in: coretemp(U)
iptable_filter ip_tables loop md5 nls_utf8 hfsplus tun ip6table_filter
ip6_tables xt_physdev netloop netbk blktap blkbk bridge drbd(U) autofs4
hwmon_vid hidp l2cap bluetooth lockd sunrpc ipt_REJECT ip6t_REJECT
xt_tcpudp x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio
cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2
scsi_transport_iscsi cpufreq_ondemand acpi_cpufreq freq_table
dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec
dell_wmi wmi button battery asus_acpi ac parport_pc lp parport
snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_mixer_oss i2c_i801 snd_pcm snd_timer
snd_page_alloc snd_hwdep i2c_core snd r8169 shpchp serio_raw soundcore
mii pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache
dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci ata_piix libata sd_mod
scsi_mod ext3 jbd uhc
Sep 14 10:28:28 valk kernel: _hcd ohci_hcd ehci_hcd
Sep 14 10:28:28 valk kernel: Pid: 0, comm: swapper Tainted: G     
2.6.18-194.3.1.el5xen #1
Sep 14 10:28:28 valk kernel: RIP: e030:[<ffffffff802063aa>] 
[<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
Sep 14 10:28:28 valk kernel: RSP: e02b:ffffffff80643f58  EFLAGS: 00000246
Sep 14 10:28:28 valk kernel: RAX: 0000000000000000 RBX: 0000000000000000
RCX: ffffffff802063aa
Sep 14 10:28:28 valk kernel: RDX: 0000000000000001 RSI: 0000000000000000
RDI: 0000000000000001
Sep 14 10:28:28 valk kernel: RBP: 0000000000000000 R08: 00000000000000b8
R09: 000000010e0ed9b8
Sep 14 10:28:28 valk kernel: R10: ffff8800103b73e0 R11: 0000000000000246
R12: 0000000000000000
Sep 14 10:28:28 valk kernel: R13: 0000000000000000 R14: 0000000000000000
R15: 0000000000000000
Sep 14 10:28:28 valk kernel: FS:  00002abb927854d0(0000)
GS:ffffffff805d2000(0000) knlGS:0000000000000000
Sep 14 10:28:28 valk kernel: CS:  e033 DS: 0000 ES: 0000
Sep 14 10:28:28 valk kernel:
Sep 14 10:28:28 valk kernel: Call Trace:
Sep 14 10:28:28 valk kernel:  [<ffffffff8026f4eb>] raw_safe_halt+0x84/0xa8
Sep 14 10:28:28 valk kernel:  [<ffffffff8026ca80>] xen_idle+0x38/0x4a
Sep 14 10:28:28 valk kernel:  [<ffffffff8024b0aa>] cpu_idle+0x97/0xba
Sep 14 10:28:28 valk kernel:  [<ffffffff8064cb0f>] start_kernel+0x21f/0x224
Sep 14 10:28:28 valk kernel:  [<ffffffff8064c1e5>] _sinittext+0x1e5/0x1eb
Sep 14 10:28:28 valk kernel:
Sep 14 10:28:28 valk kernel: r8169: peth0: link up

xen2 (secondary)
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: PingAck did not arrive
in time.
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: peer( Primary ->
Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: asender terminated
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: Terminating asender thread
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: short read expecting
header on sock: r=-512
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: Connection closed
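
To compare the state changes on both peers, the transitions in log
excerpts like the ones above can be pulled out with a small script. This
is a minimal sketch: the function name is hypothetical, and the regular
expression only assumes the "name( Old -> New )" form visible in these
excerpts, not any documented log format. It tolerates the line wrapping
introduced by the mail client.

```python
import re
import sys

# Matches e.g. "conn( Connected -> BrokenPipe )", also when wrapped
# across lines or written without a space after the parenthesis.
TRANSITION = re.compile(
    r"\b(peer|conn|pdsk|disk|role)\(\s*(\w+)\s*->\s*(\w+)\s*\)"
)


def drbd_transitions(log_text):
    """Return (field, old_state, new_state) tuples in order of appearance."""
    return [m.groups() for m in TRANSITION.finditer(log_text)]


if __name__ == "__main__":
    # Usage: python transitions.py < /var/log/messages
    for field, old, new in drbd_transitions(sys.stdin.read()):
        print(f"{field}: {old} -> {new}")
```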


The drbd.conf:

global {
  usage-count yes;
}
common {
  protocol C;
  syncer {
    rate 10M;
    verify-alg md5;
  }
  disk { on-io-error detach; }
  startup {
    wfc-timeout  30;
  }
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    data-integrity-alg crc32c;
  }
}

resource r_gans {
  meta-disk internal;
  startup { become-primary-on xenutrecht03; }
  device    /dev/drbd_gans minor 1;
  on xenutrecht03 {
    disk      /dev/vghost/gans;
    address   192.168.48.12:7789;
  }
  on xenutrecht {
    disk      /dev/vghost/gans;
    address   192.168.48.2:7789;
  }
}

Any help or tips are appreciated.

Thanks,
Theo



