Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I have a simple drbd83 setup in primary/secondary between two machines. Both machines run CentOS 5.5 and DRBD, and I can reproduce the problem on any combination of two hosts out of the four I currently have installed. I see two kinds of errors (digest integrity failure and CPU lockup) that I would like to understand.

I've read similar threads here:
http://lists.linbit.com/pipermail/drbd-user/2009-February/011357.html
I disabled swap in the VM, but it made no difference.

xen1  2.6.18-194.3.1.el5xen   drbd83-8.3.8-1.el5.centos  kmod-drbd83-xen-8.3.8-1.el5.centos
xen2  2.6.18-194.11.3.el5xen  drbd83-8.3.8-1.el5.centos  kmod-drbd83-xen-8.3.8-1.el5.centos
xen3  2.6.18-194.11.1.el5xen  drbd83-8.3.8-1.el5.centos  kmod-drbd83-xen-8.3.8-1.el5.centos
xen4  2.6.18-194.3.1.el5xen   drbd83-8.3.8-1.el5.centos  kmod-drbd83-xen-8.3.8-1.el5.centos

On each machine I created a logical volume as backing store for the DRBD device. The DRBD device is handed to a Xen virtual machine.

Xen config:
disk = [ "phy:/dev/drbd_gans,xvda,w" ]

As soon as I increase the load in the VM (both CPU and IO at the same time), the Xen host disconnects and reconnects:

on xen3 (primary)
Sep 14 11:46:50 xenutrecht03 kernel: block drbd1: sock was shut down by peer
Sep 14 11:46:50 xenutrecht03 kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Sep 14 11:46:50 xenutrecht03 kernel: block drbd1: short read expecting header on sock: r=0

on xen1
Sep 14 11:46:50 xenutrecht kernel: block drbd1: Digest integrity check FAILED.
Sep 14 11:46:50 xenutrecht kernel: block drbd1: error receiving Data, l: 28700!
Sep 14 11:46:50 xenutrecht kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Sep 14 11:46:50 xenutrecht kernel: block drbd1: asender terminated

And I tried this on two other hosts (xen2 and xen4) with another VM:

on xen2 (primary)
Sep 14 12:43:25 xenutrecht02 kernel: block drbd5: Digest integrity check FAILED.
Sep 14 12:43:25 xenutrecht02 kernel: block drbd5: error receiving Data, l: 20508!

on xen4
Sep 14 12:43:25 valk kernel: block drbd5: sock was shut down by peer
Sep 14 12:43:25 valk kernel: block drbd5: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )

High CPU load alone does not matter. High IO bandwidth alone (dd if=/dev/zero of=/tmp/test) also does not seem to matter. But real-life load triggers this behavior all the time (one or two times a day). I found an easy way to reproduce it by running pi 22 (version 2.0 of super_pi for Linux). That happens to write a lot of data files to /tmp, which triggers the unwanted behavior after about 10 seconds.

The four Xen hosts are all different brands of hardware, but all Intel. What they have in common is the managed switch that connects them, as each host has a single on-board Ethernet adapter. I checked the packet error counters in the switch, but that is not likely the problem. Errors in the switch should also not affect the DRBD channel, since the TCP packets would simply be retransmitted.

Performance is currently not my biggest concern. The verify-alg was added to the configuration because of unresponsive VMs: before that statement was added, I had to manually disconnect and reconnect to unblock the disk for the VM. The DRBD state looked fine, but the VMs were frozen on blocked IO. Currently the problem repairs itself, but I'm not happy with these error messages.
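For completeness, this is roughly how I trigger the problem and how I used to recover; the super_pi invocation assumes the binary sits in the current directory, and r_gans is the resource from the config further down:

    # inside the VM: CPU load plus lots of small writes to /tmp
    ./pi 22

    # on the primary host: what I did to unblock the VM's IO,
    # before the connection started recovering by itself
    drbdadm disconnect r_gans
    drbdadm connect r_gans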
Even now I get the following in the log on one machine, which indicates a CPU lockup:

on xen4 (primary)
Sep 14 10:28:19 valk kernel: coretemp coretemp.0: Unable to access MSR 0xEE, for Tjmax, ...
Sep 14 10:28:19 valk kernel: block drbd1: PingAck did not arrive in time.
Sep 14 10:28:19 valk kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 14 10:28:19 valk kernel: block drbd1: asender terminated
<snip>
Sep 14 10:28:27 valk kernel: NETDEV WATCHDOG: peth0: transmit timed out
Sep 14 10:28:27 valk kernel: block drbd1: Connection closed
Sep 14 10:28:27 valk kernel: block drbd1: conn( NetworkFailure -> Unconnected )
Sep 14 10:28:27 valk kernel: block drbd1: receiver terminated
Sep 14 10:28:27 valk kernel: block drbd1: Restarting receiver thread
Sep 14 10:28:27 valk kernel: block drbd1: receiver (re)started
Sep 14 10:28:27 valk kernel: block drbd1: conn( Unconnected -> WFConnection )
Sep 14 10:28:28 valk kernel: BUG: soft lockup - CPU#0 stuck for 11s! [swapper:0]
Sep 14 10:28:28 valk kernel: CPU 0:
Sep 14 10:28:28 valk kernel: Modules linked in: coretemp(U) iptable_filter ip_tables loop md5 nls_utf8 hfsplus tun ip6table_filter ip6_tables xt_physdev netloop netbk blktap blkbk bridge drbd(U) autofs4 hwmon_vid hidp l2cap bluetooth lockd sunrpc ipt_REJECT ip6t_REJECT xt_tcpudp x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac parport_pc lp parport snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss i2c_i801 snd_pcm snd_timer snd_page_alloc snd_hwdep i2c_core snd r8169 shpchp serio_raw soundcore mii pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci ata_piix libata sd_mod scsi_mod ext3 jbd uhc
Sep 14 10:28:28 valk kernel: _hcd ohci_hcd ehci_hcd
Sep 14 10:28:28 valk kernel: Pid: 0, comm: swapper Tainted: G 2.6.18-194.3.1.el5xen #1
Sep 14 10:28:28 valk kernel: RIP: e030:[<ffffffff802063aa>] [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
Sep 14 10:28:28 valk kernel: RSP: e02b:ffffffff80643f58 EFLAGS: 00000246
Sep 14 10:28:28 valk kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff802063aa
Sep 14 10:28:28 valk kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000001
Sep 14 10:28:28 valk kernel: RBP: 0000000000000000 R08: 00000000000000b8 R09: 000000010e0ed9b8
Sep 14 10:28:28 valk kernel: R10: ffff8800103b73e0 R11: 0000000000000246 R12: 0000000000000000
Sep 14 10:28:28 valk kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Sep 14 10:28:28 valk kernel: FS: 00002abb927854d0(0000) GS:ffffffff805d2000(0000) knlGS:0000000000000000
Sep 14 10:28:28 valk kernel: CS: e033 DS: 0000 ES: 0000
Sep 14 10:28:28 valk kernel:
Sep 14 10:28:28 valk kernel: Call Trace:
Sep 14 10:28:28 valk kernel: [<ffffffff8026f4eb>] raw_safe_halt+0x84/0xa8
Sep 14 10:28:28 valk kernel: [<ffffffff8026ca80>] xen_idle+0x38/0x4a
Sep 14 10:28:28 valk kernel: [<ffffffff8024b0aa>] cpu_idle+0x97/0xba
Sep 14 10:28:28 valk kernel: [<ffffffff8064cb0f>] start_kernel+0x21f/0x224
Sep 14 10:28:28 valk kernel: [<ffffffff8064c1e5>] _sinittext+0x1e5/0x1eb
Sep 14 10:28:28 valk kernel:
Sep 14 10:28:28 valk kernel: r8169: peth0: link up

on xen2 (secondary)
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: PingAck did not arrive in time.
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: asender terminated
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: Terminating asender thread
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: short read expecting header on sock: r=-512
Sep 14 10:28:19 xenutrecht02 kernel: block drbd1: Connection closed

The drbd.conf:

global {
    usage-count yes;
}

common {
    protocol C;
    syncer {
        rate 10M;
        verify-alg md5;
    }
    disk {
        on-io-error detach;
    }
    startup {
        wfc-timeout 30;
    }
    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        data-integrity-alg crc32c;
    }
}

resource r_gans {
    meta-disk internal;
    startup {
        become-primary-on xenutrecht03;
    }
    device /dev/drbd_gans minor 1;
    on xenutrecht03 {
        disk /dev/vghost/gans;
        address 192.168.48.12:7789;
    }
    on xenutrecht {
        disk /dev/vghost/gans;
        address 192.168.48.2:7789;
    }
}

Any help or tips are appreciated.

Thanks,
Theo