Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I don't think it is a network problem, because the issue occurs only when I write data from a virtual machine, not when I write to the DRBD device from Dom0:

[root@vm1 ~]# dd if=/dev/zero of=/tmp/expand bs=1024k count=10000        => issue occurs
[root@xen1 ~]# dd if=/dev/zero of=/srv/share/expand bs=1024k count=10000 => issue does not occur
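To narrow this down further, I may also repeat the same comparison with the page cache bypassed so that the writes hit DRBD directly. This is only a suggested variant of the test above, assuming a coreutils dd that supports oflag=direct (the CentOS 5 coreutils does):

[root@vm1 ~]#  dd if=/dev/zero of=/tmp/expand bs=1024k count=10000 oflag=direct
[root@xen1 ~]# dd if=/dev/zero of=/srv/share/expand bs=1024k count=10000 oflag=direct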
-----Original Message-----
From: Robert Dunkley [mailto:Robert at saq.co.uk]
Sent: Wednesday, 21 January 2009 13:56
To: Baptiste Agasse; drbd-user at lists.linbit.com
Subject: RE: [DRBD-user] DRBD 8.2.6 disk access issue
Have you tried a benchmark tool like IPerf over the bonded NIC
interface? (Just trying to confirm if this is a problem related to any
heavy traffic over the link or a problem only with heavy DRBD traffic
over the link).
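For example, a quick run over the direct link could look something like this (an illustrative invocation, assuming the classic iperf 2 client/server and the 10.1.1.x addresses used for DRBD):

On xen2 (server side), bound to the bonded DRBD interface:
[root@xen2 ~]# iperf -s -B 10.1.1.2

On xen1 (client side), 60 seconds with 4 parallel streams:
[root@xen1 ~]# iperf -c 10.1.1.2 -t 60 -P 4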
-----Original Message-----
From: drbd-user-bounces at lists.linbit.com
[mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Baptiste Agasse
Sent: 21 January 2009 11:53
To: drbd-user at lists.linbit.com
Subject: [DRBD-user] DRBD 8.2.6 disk access issue
Hi all,
I have a problem with disk access with DRBD (8.2.6). I have a 2-node cluster under CentOS 5.2, managed by the Red Hat Cluster Suite tools. I have a 270 GB device shared with DRBD in primary/primary mode, with a GFS2 filesystem on top of it. This partition is used to store the Xen virtual machine image files. For replication, DRBD uses a bonded interface of two 1 Gb NICs (used only by DRBD), directly linked between the two nodes.
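The bond is set up the usual CentOS 5 way (/etc/modprobe.conf plus ifcfg files). The sketch below only shows the shape of such a configuration; the mode and miimon values are illustrative and not quoted from these hosts, while eth1/eth2 are the slave names that appear in the logs further down:

# /etc/modprobe.conf (bonding mode and miimon shown here are examples only)
alias bond0 bonding
options bond0 mode=balance-rr miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=10.1.1.1
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1 (eth2 is configured the same way)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes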
My DRBD configuration:
Xen1 : 192.168.2.6 (IP address on LAN), 10.1.1.1 (DRBD direct link)
Xen2 : 192.168.2.8 (IP address on LAN), 10.1.1.2 (DRBD direct link)
[root@xen2 ~]# cat /etc/drbd.conf
global {
    usage-count no;
}

common {
    syncer {
        rate 400M;
        verify-alg "sha1";
    }

    protocol C;

    startup {
        become-primary-on both;
        wfc-timeout 120;
        degr-wfc-timeout 120;
    }

    net {
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "secret";
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    disk {
        on-io-error detach;
        fencing resource-and-stonith;
    }

    # script found at http://people.redhat.com/lhh/obliterate
    handlers {
        outdate-peer "/sbin/obliterate";
    }
}

resource share {
    on xen1 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.1.1.1:7789;
        meta-disk internal;
    }
    on xen2 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.1.1.2:7789;
        meta-disk internal;
    }
}
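(After editing this file, a minimal check sequence with the standard DRBD 8.x tools would be the following; this is just a sketch, not something from my logs:)

[root@xen2 ~]# drbdadm dump share    # parse drbd.conf and show the resource as drbdadm reads it
[root@xen2 ~]# drbdadm adjust share  # apply settings that differ from the running configuration
[root@xen2 ~]# cat /proc/drbd        # connection state, roles and sync status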
[root@xen2 ~]# yum list installed | grep drbd
drbd82.x86_64                8.2.6-1.el5.centos          installed
kmod-drbd82-xen.x86_64       8.2.6-2                     installed

[root@xen2 ~]# df -h | grep drbd
/dev/drbd0            266G  109G  157G  41% /srv/share

[root@xen2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-x8664-build, 2008-10-03 11:30:32
m:res    cs         st               ds                 p  mounted     fstype
0:share  Connected  Primary/Primary  UpToDate/UpToDate  C  /srv/share  gfs2
My problem is: when a Xen virtual machine does a lot of disk I/O on the DRBD device, the two NICs used for data replication go down, DRBD sees a network failure, and the peer node gets rebooted (fenced).

This is what I see in /var/log/messages on the node:
Jan 21 09:43:29 xen1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 21 09:43:29 xen1 kernel: bnx2: eth2 NIC Copper Link is Down
Jan 21 09:43:29 xen1 kernel: bonding: bond0: link status definitely down for interface eth2, disabling it
Jan 21 09:43:29 xen1 kernel: bonding: bond0: now running without any active interface !
Jan 21 09:43:31 xen1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:31 xen1 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 21 09:43:31 xen1 kernel: bonding: bond0: first active interface up!
Jan 21 09:43:32 xen1 kernel: bnx2: eth2 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:32 xen1 kernel: bonding: bond0: link status definitely up for interface eth2.
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] The token was lost in the OPERATIONAL state.
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] entering GATHER state from 2.
Jan 21 09:43:34 xen1 kernel: drbd0: PingAck did not arrive in time.
Jan 21 09:43:34 xen1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jan 21 09:43:34 xen1 kernel: drbd0: asender terminated
Jan 21 09:43:34 xen1 kernel: drbd0: Terminating asender thread
Jan 21 09:43:34 xen1 kernel: drbd0: short read receiving data: read 2216 expected 4096
Jan 21 09:43:34 xen1 kernel: drbd0: error receiving Data, l: 4120!
Jan 21 09:43:34 xen1 kernel: drbd0: Creating new current UUID
Jan 21 09:43:34 xen1 kernel: drbd0: Writing meta data super block now.
Jan 21 09:43:34 xen1 kernel: drbd0: Connection closed
Jan 21 09:43:34 xen1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Jan 21 09:43:36 xen1 kernel: bnx2: eth1 NIC Copper Link is Down
Jan 21 09:43:36 xen1 kernel: bnx2: eth2 NIC Copper Link is Down
Jan 21 09:43:36 xen1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 21 09:43:36 xen1 kernel: bonding: bond0: link status definitely down for interface eth2, disabling it
Jan 21 09:43:36 xen1 kernel: bonding: bond0: now running without any active interface !
Jan 21 09:43:38 xen1 kernel: bnx2: eth1 NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:38 xen1 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 21 09:43:38 xen1 kernel: bonding: bond0: first active interface up!
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering GATHER state from 0.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Creating commit token because I am the rep.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Saving state aru 61 high seq received 61
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Storing new sequence id for ring 150
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering COMMIT state.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering RECOVERY state.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] position [0] member 192.168.2.6:
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] previous ring seq 332 rep 192.168.2.6
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] aru 61 high delivered 61 received flag 1
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Did not need to originate any messages in recovery.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Sending initial ORF token
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] New Configuration:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] r(0) ip(192.168.2.6)
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] Members Left:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] r(0) ip(192.168.2.8)
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] Members Joined:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] New Configuration:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] r(0) ip(192.168.2.6)
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] Members Left:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] Members Joined:
Jan 21 09:43:38 xen1 openais[6491]: [SYNC ] This node is within the primary component and will provide service.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering OPERATIONAL state.
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] got nodejoin message 192.168.2.6
Jan 21 09:43:38 xen1 openais[6491]: [CPG ] got joinlist message from node 2
Jan 21 09:43:38 xen1 kernel: dlm: closing connection to node 1
Jan 21 09:43:38 xen1 fenced[6508]: 192.168.2.8 not a cluster member after 0 sec post_fail_delay
Jan 21 09:43:38 xen1 fenced[6508]: fencing node "192.168.2.8"
Jan 21 09:43:44 xen1 kernel: bnx2: eth1 NIC Copper Link is Down
Jan 21 09:43:44 xen1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 21 09:43:44 xen1 kernel: bonding: bond0: now running without any active interface !
Jan 21 09:43:47 xen1 kernel: bnx2: eth2 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:47 xen1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:47 xen1 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 21 09:43:47 xen1 kernel: bonding: bond0: link status definitely up for interface eth2.
Jan 21 09:43:47 xen1 kernel: bonding: bond0: first active interface up!
Jan 21 09:43:52 xen1 fenced[6508]: fence "192.168.2.8" success
Jan 21 09:43:52 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Trying to acquire journal lock...
Jan 21 09:43:53 xen1 fence_node[14010]: Fence of "192.168.2.8" was successful
Jan 21 09:43:53 xen1 kernel: drbd0: outdate-peer helper returned 7 (peer was stonithed)
Jan 21 09:43:53 xen1 kernel: drbd0: pdsk( DUnknown -> Outdated )
Jan 21 09:43:53 xen1 kernel: drbd0: tl_clear()
Jan 21 09:43:53 xen1 kernel: drbd0: susp( 1 -> 0 )
Jan 21 09:43:53 xen1 kernel: drbd0: Writing meta data super block now.
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Looking at journal...
Jan 21 09:43:53 xen1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jan 21 09:43:53 xen1 kernel: drbd0: receiver terminated
Jan 21 09:43:53 xen1 kernel: drbd0: receiver (re)started
Jan 21 09:43:53 xen1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Acquiring the transaction lock...
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Replaying journal...
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Replayed 1 of 1 blocks
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Found 0 revoke tags
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Journal replayed in 0s
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Done
Jan 21 09:43:53 xen1 clurgmgrd[11532]: <notice> Taking over service vm:vm1 from down member 192.168.2.8
Jan 21 09:43:54 xen1 clurgmgrd[11532]: <notice> Taking over service vm:vm2 from down member 192.168.2.8
Jan 21 09:43:54 xen1 clurgmgrd[11532]: <notice> Taking over service vm:vm3 from down member 192.168.2.8
Thanks for your answers.
PS: sorry for my bad English.