Hi all,
Digimer, thank you very much for your response. Please see the requested file below:
# cat /etc/drbd.d/global_common.conf
global {
        usage-count yes;
}
common {
        protocol C;
        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        }
        startup {
                wfc-timeout 100;
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        disk {
                # on-io-error fencing use-bmbv no-disk-barrier
                no-disk-flushes;
                # no-disk-drain no-md-flushes max-bio-bvecs
        }
        net {
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                ping-timeout 20;
        }
        syncer {
                rate 110M;
        }
}
#
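
To double-check that both nodes are really running this file, I believe these standard drbdadm 8.3 subcommands can be used on each machine (r0 is my resource name):

# drbdadm dump r0     (effective configuration as DRBD parsed it)
# drbdadm cstate r0   (connection state; currently StandAlone here)
# drbdadm show-gi r0  (data generation identifiers, to compare the UUIDs)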
Here are my answers to your questions:
1) It is definitely a split brain, not a network problem. As I demonstrated
in my previous message, I can ping the cluster members and the firewall is
open. When I use telnet and a packet sniffer I can see the nodes trying to
establish a network connection, but they only send reject packets.
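
For reference, I watched the DRBD port roughly like this (the interface name here is a placeholder for whichever NIC carries the 10.10.24.x replication link):

# tcpdump -ni ethX tcp port 7789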
2) Here is the relevant info from the /var/log/messages file:
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: initialized. Version: 8.3.8 (api:88/proto:86-94)
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild at builder10.centos.org, 2010-06-04 08:04:09
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: registered as block device major 147
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: minor_table @ 0xffff8101371471c0
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: Starting worker thread (from cqueue/0 [213])
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: disk( Diskless -> Attaching )
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: Found 4 transactions (70 active extents) in activity log.
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: Method to ensure write ordering: barrier
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: max_segment_size ( = BIO size ) = 32768
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: drbd_bm_resize called with capacity == 629118192
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: resync bitmap: bits=78639774 words=1228747
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: size = 300 GB (314559096 KB)
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: recounting of set bits took additional 10 jiffies
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Marked additional 252 MB as out-of-sync based on AL.
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: disk( Attaching -> UpToDate )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( StandAlone -> Unconnected )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Starting receiver thread (from drbd1_worker [3435])
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: receiver (re)started
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( Unconnected -> WFConnection )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Handshake successful: Agreed network protocol version 94
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( WFConnection -> WFReportParams )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Starting asender thread (from drbd1_receiver [3443])
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: data-integrity-alg: <not-used>
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: drbd_sync_handshake:
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: self B3DE46FD85A4C304:D3D8A848BA989089:F5DB2DE79EFEC3E5:AE5C6A69A1F93A43 bits:64512 flags:0
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: peer CAD3EACF4FCC5066:D3D8A848BA989089:F5DB2DE79EFEC3E4:AE5C6A69A1F93A43 bits:130048 flags:2
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: uuid_compare()=100 by rule 90
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Dec 2 10:04:00 infplsm018 <kern.alert> kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( WFReportParams -> Disconnecting )
Dec 2 10:04:00 infplsm018 <kern.err> kernel: block drbd1: error receiving ReportState, l: 4!
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: asender terminated
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Terminating asender thread
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Connection closed
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( Disconnecting -> StandAlone )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: receiver terminated
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Terminating receiver thread
Dec 2 10:04:01 infplsm018 <kern.info> kernel: block drbd1: role( Secondary -> Primary )
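
From the DRBD 8.3 documentation, my understanding is that the manual recovery, once a split-brain victim is chosen, goes roughly like this (please correct me if I have it wrong). On the node whose changes are to be discarded:

# drbdadm secondary r0
# drbdadm disconnect r0
# drbdadm -- --discard-my-data connect r0

And on the surviving node, if it has also dropped to StandAlone:

# drbdadm connect r0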
3) And here is my /etc/cluster/cluster.conf file:
# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="newnfscl" config_version="224" name="newnfscl">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="20"/>
    <clusternodes>
        <clusternode name="infplsm017-clust" nodeid="1" votes="1">
            <multicast addr="224.0.0.1" interface="eth2"/>
            <fence>
                <method name="1">
                    <device name="manfence" nodename="infplsm017-clust"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="infplsm018-clust" nodeid="2" votes="1">
            <multicast addr="224.0.0.1" interface="eth2"/>
            <fence>
                <method name="1">
                    <device name="manfence" nodename="infplsm018-clust"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1">
        <multicast addr="224.0.0.1"/>
    </cman>
    <fencedevices>
        <fencedevice agent="fence_null" name="nullfence"/>
        <fencedevice agent="fence_manual" name="manfence"/>
    </fencedevices>
    <rm log_facility="syslog" log_level="7">
        <failoverdomains>
            <failoverdomain name="Test Domain" nofailback="0" ordered="1" restricted="1">
                <failoverdomainnode name="infplsm017-clust" priority="1"/>
                <failoverdomainnode name="infplsm018-clust" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <clusterfs device="/dev/Shared/home" force_unmount="0" fsid="58812" fstype="gfs2" mountpoint="/home" name="homegfs" options="rw,localflocks" self_fence="0"/>
            <nfsexport name="homenfs"/>
            <nfsclient allow_recover="1" name="nfsclient" options="rw" target="*"/>
            <ip address="10.10.28.15" monitor_link="1"/>
        </resources>
        <service autostart="1" exclusive="0" name="nfs-over-gfs2" nfslock="1" recovery="relocate">
            <clusterfs ref="homegfs">
                <nfsexport ref="homenfs">
                    <nfsclient ref="nfsclient"/>
                </nfsexport>
            </clusterfs>
            <ip ref="10.10.28.15"/>
        </service>
    </rm>
    <logging debug="on" logfile_priority="debug" syslog_facility="daemon" syslog_priority="info" to_logfile="yes" to_syslog="yes">
        <logging_daemon logfile="/var/log/cluster/qdiskd.log" name="qdiskd"/>
        <logging_daemon logfile="/var/log/cluster/fenced.log" name="fenced"/>
        <logging_daemon logfile="/var/log/cluster/dlm_controld.log" name="dlm_controld"/>
        <logging_daemon logfile="/var/log/cluster/gfs_controld.log" name="gfs_controld"/>
        <logging_daemon logfile="/var/log/cluster/rgmanager.log" name="rgmanager"/>
        <logging_daemon logfile="/var/log/cluster/corosync.log" name="corosync"/>
    </logging>
</cluster>
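
In case it helps, this is how I check the cluster side on both nodes (standard RHCS tools on RHEL 5):

# cman_tool status   (quorum and membership summary)
# cman_tool nodes    (node states as cman sees them)
# clustat            (rgmanager view of the nfs-over-gfs2 service)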
On 12/02/2011 03:05 PM, Digimer wrote:
> On 12/01/2011 07:30 PM, Ivan Pavlenko wrote:
>> Hi ALL,
>>
>> Could you help me to fix a problem with split brain, please?
>>
>> I have a Red Hat cluster based on RHEL 5.7 that provides an
>> nfs-over-gfs2 service. I use DRBD as the storage.
>>
>> # cat /etc/drbd.conf
>> #
>> # please have a look at the example configuration file in
>> # /usr/share/doc/drbd83/drbd.conf
>> #
>> include "/etc/drbd.d/global_common.conf";
> This is a good file to see. Can you share it, please?
>
>> include "/etc/drbd.d/r0.res";
>>
>> # cat /etc/drbd.d/r0.res
>> resource r0 {
>> on infplsm017 {
>> device /dev/drbd1;
>> disk /dev/sdb1;
>> address 10.10.24.10:7789;
>> meta-disk internal;
>> }
>> on infplsm018 {
>> device /dev/drbd1;
>> disk /dev/sdb1;
>> address 10.10.24.11:7789;
>> meta-disk internal;
>> }
>> }
>>
>> As you can see, there is nothing sophisticated here.
>>
>> I have:
>>
>> # cat /proc/drbd
>> version: 8.3.8 (api:88/proto:86-94)
>> GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by
>> mockbuild at builder10.centos.org, 2010-06-04 08:04:09
>>
>> 1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----
>> ns:0 nr:0 dw:0 dr:332 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b
>> oos:524288
>>
>> # ping 10.10.24.11
>> PING 10.10.24.11 (10.10.24.11) 56(84) bytes of data.
>> 64 bytes from 10.10.24.11: icmp_seq=1 ttl=64 time=2.99 ms
>> 64 bytes from 10.10.24.11: icmp_seq=2 ttl=64 time=13.9 ms
>>
>> But when I try to telnet to port 7789 I get:
>>
>> # telnet 10.10.24.11 7789
>> Trying 10.10.24.11...
>> telnet: connect to address 10.10.24.11: Connection refused
>> telnet: Unable to connect to remote host: Connection refused
>>
>> But at the same time:
>>
>> # service iptables status
>> Table: filter
>> Chain INPUT (policy ACCEPT)
>> num target prot opt source destination
>>
>> Chain FORWARD (policy ACCEPT)
>> num target prot opt source destination
>>
>> Chain OUTPUT (policy ACCEPT)
>> num target prot opt source destination
>>
>>
>> I did this from my first server (INFPLSM017) and got exactly the same
>> result from the second one (INFPLSM018). Could you tell me, please, what
>> the possible reason for this problem is and how I can fix it?
>>
>> Thank you in advance,
>> Ivan
> Is this a network or split-brain problem?
>
> What happens when you try to connect?
>
> What state is the other node in?
>
> Anything interesting in /var/log/messages?
>
> How does DRBD tie into the cluster? What is the cluster's configuration?
> Are you using fencing?
>
> More details are needed to provide assistance.
>