Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,

I have a problem with disk access with DRBD (8.2.6). I have a 2-node cluster under CentOS 5.2, managed by the Red Hat Cluster Suite tools. A 270 GB device is shared with DRBD in primary/primary mode, with a GFS2 filesystem on top of it. This partition is used to store Xen virtual machine image files. For replication, DRBD uses a bonded interface made of two 1 Gb NICs (used only by DRBD), directly linked between the two nodes (an illustrative bonding configuration is sketched at the end of this message).

Xen1: 192.168.2.6 (IP address on LAN), 10.1.1.1 (DRBD direct link)
Xen2: 192.168.2.8 (IP address on LAN), 10.1.1.2 (DRBD direct link)

My DRBD configuration:

[root@xen2 ~]# cat /etc/drbd.conf
global {
    usage-count no;
}
common {
    syncer {
        rate 400M;
        verify-alg "sha1";
    }
    protocol C;
    startup {
        become-primary-on both;
        wfc-timeout 120;
        degr-wfc-timeout 120;
    }
    net {
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "secret";
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    disk {
        on-io-error detach;
        fencing resource-and-stonith;
    }
    # script found at http://people.redhat.com/lhh/obliterate
    handlers {
        outdate-peer "/sbin/obliterate";
    }
}
resource share {
    on xen1 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.1.1.1:7789;
        meta-disk internal;
    }
    on xen2 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.1.1.2:7789;
        meta-disk internal;
    }
}

[root@xen2 ~]# yum list installed | grep drbd
drbd82.x86_64              8.2.6-1.el5.centos   installed
kmod-drbd82-xen.x86_64     8.2.6-2              installed

[root@xen2 ~]# df -h | grep drbd
/dev/drbd0            266G  109G  157G  41% /srv/share

[root@xen2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-x8664-build, 2008-10-03 11:30:32
m:res    cs         st               ds                 p  mounted     fstype
0:share  Connected  Primary/Primary  UpToDate/UpToDate  C  /srv/share  gfs2

My problem is this: when a Xen virtual machine does a lot of disk I/O on the DRBD device, the two NICs used for data replication go down, DRBD sees a network failure, and the peer node gets fenced and rebooted (the DRBD ping/timeout settings behind the "PingAck" message are also sketched at the end of this message).

This is what I see in /var/log/messages on the node:

Jan 21 09:43:29 xen1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 21 09:43:29 xen1 kernel: bnx2: eth2 NIC Copper Link is Down
Jan 21 09:43:29 xen1 kernel: bonding: bond0: link status definitely down for interface eth2, disabling it
Jan 21 09:43:29 xen1 kernel: bonding: bond0: now running without any active interface !
Jan 21 09:43:31 xen1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:31 xen1 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 21 09:43:31 xen1 kernel: bonding: bond0: first active interface up!
Jan 21 09:43:32 xen1 kernel: bnx2: eth2 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:32 xen1 kernel: bonding: bond0: link status definitely up for interface eth2.
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] The token was lost in the OPERATIONAL state.
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jan 21 09:43:33 xen1 openais[6491]: [TOTEM] entering GATHER state from 2.
Jan 21 09:43:34 xen1 kernel: drbd0: PingAck did not arrive in time.
Jan 21 09:43:34 xen1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jan 21 09:43:34 xen1 kernel: drbd0: asender terminated
Jan 21 09:43:34 xen1 kernel: drbd0: Terminating asender thread
Jan 21 09:43:34 xen1 kernel: drbd0: short read receiving data: read 2216 expected 4096
Jan 21 09:43:34 xen1 kernel: drbd0: error receiving Data, l: 4120!
Jan 21 09:43:34 xen1 kernel: drbd0: Creating new current UUID
Jan 21 09:43:34 xen1 kernel: drbd0: Writing meta data super block now.
Jan 21 09:43:34 xen1 kernel: drbd0: Connection closed
Jan 21 09:43:34 xen1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Jan 21 09:43:36 xen1 kernel: bnx2: eth1 NIC Copper Link is Down
Jan 21 09:43:36 xen1 kernel: bnx2: eth2 NIC Copper Link is Down
Jan 21 09:43:36 xen1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 21 09:43:36 xen1 kernel: bonding: bond0: link status definitely down for interface eth2, disabling it
Jan 21 09:43:36 xen1 kernel: bonding: bond0: now running without any active interface !
Jan 21 09:43:38 xen1 kernel: bnx2: eth1 NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:38 xen1 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 21 09:43:38 xen1 kernel: bonding: bond0: first active interface up!
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering GATHER state from 0.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Creating commit token because I am the rep.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Saving state aru 61 high seq received 61
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Storing new sequence id for ring 150
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering COMMIT state.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering RECOVERY state.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] position [0] member 192.168.2.6:
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] previous ring seq 332 rep 192.168.2.6
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] aru 61 high delivered 61 received flag 1
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Did not need to originate any messages in recovery.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] Sending initial ORF token
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] New Configuration:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] r(0) ip(192.168.2.6)
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] Members Left:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] r(0) ip(192.168.2.8)
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] Members Joined:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] New Configuration:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] r(0) ip(192.168.2.6)
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] Members Left:
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] Members Joined:
Jan 21 09:43:38 xen1 openais[6491]: [SYNC ] This node is within the primary component and will provide service.
Jan 21 09:43:38 xen1 openais[6491]: [TOTEM] entering OPERATIONAL state.
Jan 21 09:43:38 xen1 openais[6491]: [CLM ] got nodejoin message 192.168.2.6
Jan 21 09:43:38 xen1 openais[6491]: [CPG ] got joinlist message from node 2
Jan 21 09:43:38 xen1 kernel: dlm: closing connection to node 1
Jan 21 09:43:38 xen1 fenced[6508]: 192.168.2.8 not a cluster member after 0 sec post_fail_delay
Jan 21 09:43:38 xen1 fenced[6508]: fencing node "192.168.2.8"
Jan 21 09:43:44 xen1 kernel: bnx2: eth1 NIC Copper Link is Down
Jan 21 09:43:44 xen1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 21 09:43:44 xen1 kernel: bonding: bond0: now running without any active interface !
Jan 21 09:43:47 xen1 kernel: bnx2: eth2 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:47 xen1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 21 09:43:47 xen1 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 21 09:43:47 xen1 kernel: bonding: bond0: link status definitely up for interface eth2.
Jan 21 09:43:47 xen1 kernel: bonding: bond0: first active interface up!
Jan 21 09:43:52 xen1 fenced[6508]: fence "192.168.2.8" success
Jan 21 09:43:52 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Trying to acquire journal lock...
Jan 21 09:43:53 xen1 fence_node[14010]: Fence of "192.168.2.8" was successful
Jan 21 09:43:53 xen1 kernel: drbd0: outdate-peer helper returned 7 (peer was stonithed)
Jan 21 09:43:53 xen1 kernel: drbd0: pdsk( DUnknown -> Outdated )
Jan 21 09:43:53 xen1 kernel: drbd0: tl_clear()
Jan 21 09:43:53 xen1 kernel: drbd0: susp( 1 -> 0 )
Jan 21 09:43:53 xen1 kernel: drbd0: Writing meta data super block now.
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Looking at journal...
Jan 21 09:43:53 xen1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jan 21 09:43:53 xen1 kernel: drbd0: receiver terminated
Jan 21 09:43:53 xen1 kernel: drbd0: receiver (re)started
Jan 21 09:43:53 xen1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Acquiring the transaction lock...
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Replaying journal...
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Replayed 1 of 1 blocks
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Found 0 revoke tags
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Journal replayed in 0s
Jan 21 09:43:53 xen1 kernel: GFS2: fsid=lan_cluster:share.0: jid=1: Done
Jan 21 09:43:53 xen1 clurgmgrd[11532]: <notice> Taking over service vm:vm1 from down member 192.168.2.8
Jan 21 09:43:54 xen1 clurgmgrd[11532]: <notice> Taking over service vm:vm2 from down member 192.168.2.8
Jan 21 09:43:54 xen1 clurgmgrd[11532]: <notice> Taking over service vm:vm3 from down member 192.168.2.8

Thanks for your answers.

PS: sorry for my bad English.
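For completeness, since the bonding configuration is only described above and not pasted: on CentOS 5 a dedicated two-NIC replication bond is usually set up along the lines of the sketch below. The interface names match the logs, but the mode, miimon and netmask values here are assumptions for illustration only, not taken from our actual files.

/etc/modprobe.conf (illustrative):

alias bond0 bonding
# active-backup with 100 ms MII link monitoring is a common choice for a
# point-to-point replication link; our real mode/miimon values may differ
options bond0 mode=active-backup miimon=100

/etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative, node xen1):

DEVICE=bond0
IPADDR=10.1.1.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

/etc/sysconfig/network-scripts/ifcfg-eth1 (and the same for eth2):

DEVICE=eth1
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes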
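About the "drbd0: PingAck did not arrive in time." line: this is DRBD's own keep-alive giving up, independently of the bonding and openais messages. The knobs involved live in the net section of drbd.conf; the sketch below shows what I understand to be the DRBD 8.x defaults (taken from the drbd.conf man page as I read it, please correct me if they differ for 8.2.6; we have not changed any of them):

net {
    # existing options (allow-two-primaries; cram-hmac-alg; ...) omitted
    ping-int      10;  # seconds between DRBD keep-alive pings
    ping-timeout   5;  # tenths of a second to wait for the PingAck reply
    timeout       60;  # tenths of a second before the peer is declared dead
    ko-count       0;  # 0 = never kick out a peer because of stalled resends
    connect-int   10;  # seconds between reconnection attempts
}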