[DRBD-user] DRBD 8.2.6 disk access issue

Wed Jan 21 14:23:48 CET 2009

I don't think it is a network problem because this issue occur only when I
write data from virtual machine, not if write data on drbd device on Dom0
[root at vm1 ~]# dd if=/dev/zero of=/tmp/expand bs=1024k count=10000 => issue
occur
[root at xen1 ~]# dd if=/dev/zero of=/srv/share/expand bs=1024k count=10000 =>
issue don't occur

-----Message d'origine-----
De : Robert Dunkley [mailto:Robert at saq.co.uk] 
Envoyé : mercredi 21 janvier 2009 13:56
À : Baptiste Agasse; drbd-user at lists.linbit.com
Objet : RE: [DRBD-user] DRBD 8.2.6 disk access issue

Have you tried a benchmark tool like IPerf over the bonded NIC
interface? (Just trying to confirm if this is a problem related to any
heavy traffic over the link or a problem only with heavy DRBD traffic
over the link).

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com
[mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Baptiste Agasse
Sent: 21 January 2009 11:53
To: drbd-user at lists.linbit.com
Subject: [DRBD-user] DRBD 8.2.6 disk access issue

Hi all,

I have a problem with disk access with DRBD (8.2.6). I have a 2 node
cluster
under CentOS 5.2 and cluster is managed by RedHat Cluster Suite tools.

I have a 270Gb device shared with DRBD in primary/primary mode and on
top of
this, GFS2 filesystem. 

This partition is used to store xen virtual machines image files. 

For replication, DRBD use a two 1Gb NIC bonded interface (only used by
drbd)
directly linked between two nodes.

My DRBD configuration :

Xen1 : 192.168.2.6 (IP address on LAN), 10.1.1.1 (DRBD direct link)
Xen2 : 192.168.2.8 (IP address on LAN), 10.1.1.2 (DRBD direct link)

[root at xen2 ~]# cat /etc/drbd.conf
global {
  usage-count no;
}
common {
  syncer {
    rate 400M;
    verify-alg "sha1";
  }
  protocol C;
  startup {
    become-primary-on both;
    wfc-timeout 120;
    degr-wfc-timeout 120;
  }
  net {
    allow-two-primaries;
    cram-hmac-alg "sha1";
    shared-secret "secret";
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  disk {
    on-io-error detach;
    fencing resource-and-stonith;
  }
  # script found at http://people.redhat.com/lhh/obliterate
  handlers {
    outdate-peer "/sbin/obliterate";
  }
}
resource share{
  on xen1 {
    device /dev/drbd0;
    disk /dev/sda3;
    address 10.1.1.1:7789;
    meta-disk internal;
  }
  on xen2 {
    device /dev/drbd0;
    disk /dev/sda3;
    address 10.1.1.2:7789;
    meta-disk internal;
  }
}

[root at xen2 ~]# yum list installed | grep drbd
drbd82.x86_64                            8.2.6-1.el5.centos
installed
kmod-drbd82-xen.x86_64                   8.2.6-2
installed

[root at xen2 ~]# df -h | grep drbd
/dev/drbd0            266G  109G  157G  41% /srv/share

[root at xen2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
buildsvn at c5-x8664-build, 2008-10-03 11:30:32
m:res    cs         st               ds                 p  mounted
fstype
0:share  Connected  Primary/Primary  UpToDate/UpToDate  C  /srv/share
gfs2

My problem is :

When a xen virtual machines have a lot of disk I/O on DRBD device, the
two
NIC used for data replication become down and DRBD see a network failure
and
reboot the peer node

This is what I see in /var/log/message on the node :

Jan 21 09:43:29 xen1 kernel: bonding: bond0: link status definitely down
for interface eth1, disabling it
Jan 21 09:43:29 xen1 kernel: bnx2: eth2 NIC Copper Link is Down
Jan 21 09:43:29 xen1 kernel: bonding: bond0: link status definitely down
for interface eth2, disabling it 
Jan 21 09:43:29 xen1 kernel: bonding: bond0: now running without any
active interface !
Jan 21 09:43:31 xen1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps
full duplex, receive & transmit flow control ON
Jan 21 09:43:31 xen1 kernel: bonding: bond0: link status definitely up
for interface eth1.
Jan 21 09:43:31 xen1 kernel: bonding: bond0: first active interface up!
Jan 21 09:43:32 xen1 kernel: bnx2: eth2 NIC Copper Link is Up, 1000 Mbps
full duplex, receive & transmit flow control ON
Jan 21 09:43:32 xen1 kernel: bonding: bond0: link status definitely up
for interface eth2.
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] The token was lost in the
OPERATIONAL state.
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] Transmit multicast socket
send buffer size (262142 bytes).
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] entering GATHER state from
2.  
Jan 21 09:43:34 xen1 kernel: drbd0: PingAck did not arrive in time.
Jan 21 09:43:34 xen1 kernel: drbd0: peer( Primary -> Unknown ) conn(
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1
)
Jan 21 09:43:34 xen1 kernel: drbd0: asender terminated
Jan 21 09:43:34 xen1 kernel: drbd0: Terminating asender thread
Jan 21 09:43:34 xen1 kernel: drbd0: short read receiving data: read 2216
expected 4096
Jan 21 09:43:34 xen1 kernel: drbd0: error receiving Data, l: 4120!
Jan 21 09:43:34 xen1 kernel: drbd0: Creating new current UUID
Jan 21 09:43:34 xen1 kernel: drbd0: Writing meta data super block now.
Jan 21 09:43:34 xen1 kernel: drbd0: Connection closed
Jan 21 09:43:34 xen1 kernel: drbd0: helper command: /sbin/drbdadm
outdate-peer
Jan 21 09:43:36 xen1 kernel: bnx2: eth1 NIC Copper Link is Down
Jan 21 09:43:36 xen1 kernel: bnx2: eth2 NIC Copper Link is Down
Jan 21 09:43:36 xen1 kernel: bonding: bond0: link status definitely down
for interface eth1, disabling it
Jan 21 09:43:36 xen1 kernel: bonding: bond0: link status definitely down
for interface eth2, disabling it
Jan 21 09:43:36 xen1 kernel: bonding: bond0: now running without any
active interface !
Jan 21 09:43:38 xen1 kernel: bnx2: eth1 NIC Copper Link is Up, 100 Mbps
full duplex, receive & transmit flow control ON
Jan 21 09:43:38 xen1 kernel: bonding: bond0: link status definitely up
for interface eth1.
Jan 21 09:43:38 xen1 kernel: bonding: bond0: first active interface up!
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering GATHER state from
0.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Creating commit token
because I am the rep.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Saving state aru 61 high seq
received 61
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Storing new sequence id for
ring 150
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering COMMIT state.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering RECOVERY state.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] position [0] member
192.168.2.6:
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] previous ring seq 332 rep
192.168.2.6
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] aru 61 high delivered 61
received flag 1
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Did not need to originate
any messages in recovery.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Sending initial ORF token
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ] New Configuration:
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ]     r(0) ip(192.168.2.6)
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ] Members Left:
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ]     r(0) ip(192.168.2.8)
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ] Members Joined:
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ] New Configuration:
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ]     r(0) ip(192.168.2.6)
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ] Members Left:
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ] Members Joined:
Jan 21 09:43:38 xen1 openais[6491]: [SYNC ] This node is within the
primary component and will provide service.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering OPERATIONAL state.
Jan 21 09:43:38 xen1 openais[6491]: [CLM  ] got nodejoin message
192.168.2.6
Jan 21 09:43:38 xen1 openais[6491]: [CPG  ] got joinlist message from
node 2
Jan 21 09:43:38 xen1 kernel: dlm: closing connection to node 1
Jan 21 09:43:38 xen1 fenced[6508]: 192.168.2.8 not a cluster member
after 0 sec post_fail_delay
Jan 21 09:43:38 xen1 fenced[6508]: fencing node "192.168.2.8"
Jan 21 09:43:44 xen1 kernel: bnx2: eth1 NIC Copper Link is Down
Jan 21 09:43:44 xen1 kernel: bonding: bond0: link status definitely down
for interface eth1, disabling it
Jan 21 09:43:44 xen1 kernel: bonding: bond0: now running without any
active interface !
Jan 21 09:43:47 xen1 kernel: bnx2: eth2 NIC Copper Link is Up, 1000 Mbps
full duplex, receive & transmit flow control ON
Jan 21 09:43:47 xen1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps
full duplex, receive & transmit flow control ON
Jan 21 09:43:47 xen1 kernel: bonding: bond0: link status definitely up
for interface eth1.
Jan 21 09:43:47 xen1 kernel: bonding: bond0: link status definitely up
for interface eth2.
Jan 21 09:43:47 xen1 kernel: bonding: bond0: first active interface up!
Jan 21 09:43:52 xen1 fenced[6508]: fence "192.168.2.8" success
Jan 21 09:43:52 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1:
Trying to acquire journal lock...
Jan 21 09:43:53 xen1 fence_node[14010]: Fence of "192.168.2.8" was
successful
Jan 21 09:43:53 xen1 kernel: drbd0: outdate-peer helper returned 7 (peer
was stonithed)
Jan 21 09:43:53 xen1 kernel: drbd0: pdsk( DUnknown -> Outdated )
Jan 21 09:43:53 xen1 kernel: drbd0: tl_clear()
Jan 21 09:43:53 xen1 kernel: drbd0: susp( 1 -> 0 )
Jan 21 09:43:53 xen1 kernel: drbd0: Writing meta data super block now.
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1:
Looking at journal...
Jan 21 09:43:53 xen1 kernel: drbd0: conn( NetworkFailure -> Unconnected
)
Jan 21 09:43:53 xen1 kernel: drbd0: receiver terminated
Jan 21 09:43:53 xen1 kernel: drbd0: receiver (re)started
Jan 21 09:43:53 xen1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1:
Acquiring the transaction lock...
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1:
Replaying journal...
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1:
Replayed 1 of 1 blocks
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1:
Found 0 revoke tags
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1:
Journal replayed in 0s
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Done
Jan 21 09:43:53 xen1 clurgmgrd[11532]: <notice> Taking over service
vm:vm1 from down member 192.168.2.8
Jan 21 09:43:54 xen1 clurgmgrd[11532]: <notice> Taking over service
vm:vm2 from down member 192.168.2.8
Jan 21 09:43:54 xen1 clurgmgrd[11532]: <notice> Taking over service
vm:vm3 from down member 192.168.2.8

Thanks for your answers.

PS : sorry for my bad English.

_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user

The SAQ Group

Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
SAQ is the trading name of SEMTEC Limited. Registered in England & Wales
Company Number: 06481952

http://www.saqnet.co.uk AS29219

SAQ Group Delivers high quality, honestly priced communication and I.T.
services to UK Business.

Broadband : Domains : Email : Hosting : CoLo : Servers : Racks : Transit :
Backups : Managed Networks : Remote Support.

ISPA Member

Find us in http://www.thebestof.co.uk/petersfield