[DRBD-user] DRBD - one half of Proxmox cluster miscommunicating

Mon Jul 30 22:06:17 CEST 2012

Hi,

I moved my Proxmox cluster (essentially two physical servers, two
Cisco NAS units where the KVM VM images live, and two switches)
to a new data centre, where everything now has new IP addresses.

I reconfigured basic networking on the two servers, updated the
IP addresses in the Proxmox config and rebooted the boxes, master
node first.

The storage is set up as /dev/drbdvg0 and /dev/drbdvg1. I didn't
install this myself and I'm not that familiar with DRBD or indeed
iSCSI. Both are used to store KVM guest virtual machine images,
seen by both servers.

Everything looked fine, until I attempted to start a VM on the
second (slave) node. It took ages to start, hanging for thirty
seconds at a time. It was clearly miscommunicating with the NAS.

All of the images, including those set up on the second node,
will run fine on the first (and that's what I'm doing for now).

So the first (master) box has excellent access to the NAS, while
the second (slave) has trouble reading from it.

On the first box, /proc/drbd looks like this:

version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:27568823 dr:156762105 al:309656 bm:309639 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:10184632
 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:2451648 dr:14918745 al:1244 bm:1211 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:1152564

And on the second, troublesome box:

version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
    ns:0 nr:0 dw:0 dr:1705944 al:0 bm:107 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:954596
 1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
    ns:0 nr:0 dw:0 dr:1821288 al:0 bm:107 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:520192

So it looks like at some level they aren't talking to each other
- I don't see the usual "UpToDate/UpToDate".
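
In case it helps anyone compare the two nodes, here is a minimal
Python sketch that pulls the connection state (cs), roles (ro), disk
states (ds) and out-of-sync counter (oos) out of /proc/drbd text in
the 8.3-style layout shown above. The function names are my own, not
part of DRBD, and I'm assuming oos is reported in KiB, which is how I
read the DRBD 8.3 documentation. It only reads status and changes
nothing.

```python
import re

# Matches a per-device status line such as:
#  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)
# The oos counter may sit on a wrapped continuation line, so search for it
# separately and attach it to the most recent device seen.
OOS_RE = re.compile(r"\boos:(?P<oos>\d+)")


def parse_proc_drbd(text):
    """Return one dict per DRBD minor device found in /proc/drbd text."""
    resources = []
    for line in text.splitlines():
        status = STATUS_RE.match(line)
        if status:
            resources.append({
                "minor": int(status.group("minor")),
                "cs": status.group("cs"),
                "ro": status.group("ro"),
                "ds": status.group("ds"),
                "oos_kib": None,  # assumption: oos is in KiB
            })
            continue
        oos = OOS_RE.search(line)
        if oos and resources:
            resources[-1]["oos_kib"] = int(oos.group("oos"))
    return resources


def is_in_sync(resource):
    """True only when the peers are connected and both disks are UpToDate."""
    return (resource["cs"] == "Connected"
            and resource["ds"] == "UpToDate/UpToDate")


def describe(resources):
    """Build one human-readable summary line per device."""
    lines = []
    for r in resources:
        flag = "ok" if is_in_sync(r) else "NOT in sync"
        lines.append(
            f"drbd{r['minor']}: cs={r['cs']} ro={r['ro']} ds={r['ds']} "
            f"oos={r['oos_kib']} KiB [{flag}]"
        )
    return lines
```

To use it on a node, read the file and print the summary, e.g.
print("\n".join(describe(parse_proc_drbd(open("/proc/drbd").read())))).
With the output above, both devices on both nodes would be flagged as
not in sync.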

I'm also seeing lots of messages like this on the second node:

  connection1:0: ping timeout of 5 secs expired, recv timeout 5,
  last rx 4329026692, last ping 4329027942, now 4329029192
  connection1:0: detected conn error (1011)

Can anyone suggest what might have gone wrong here? A cabling
issue maybe? Or how to fix it? I'm particularly anxious to avoid
losing updates to the images as seen by the first node if they
manage to sync up - don't want to lose or corrupt the VM images!

I inherited this setup and I'm not that familiar with DRBD, though
keen to learn. Very grateful for any advice.

Thanks,
James