Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all. Digimer, thank you very much for your response; please see below:

# cat /etc/drbd.d/global_common.conf
global {
        usage-count yes;
}
common {
        protocol C;
        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        }
        startup {
                wfc-timeout 100;
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        disk {
                # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
                # no-disk-drain no-md-flushes max-bio-bvecs
        }
        net {
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                ping-timeout 20;
        }
        syncer {
                rate 110M;
        }
}

Here are my answers to your questions:

1) It is definitely a split brain, not a network problem. As I showed in my previous message, I can ping both members of the cluster and their firewalls are wide open. When I watch port 7789 with telnet and a sniffer, I can see the nodes trying to establish a network connection, but only reject (RST) packets go out.
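One thing I noticed while testing, if I read the DRBD behaviour correctly: while a resource sits in cs:StandAlone, DRBD does not listen on its port at all, so telnet's "Connection refused" is expected and does not by itself prove a network fault. Assuming the resource is r0, something along these lines should confirm it:

# drbdadm cstate r0
StandAlone
# netstat -ntlp | grep 7789
(no output -- nothing is listening while the resource is StandAlone)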
2) Here is the relevant info from the /var/log/messages file:

Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: initialized. Version: 8.3.8 (api:88/proto:86-94)
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:09
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: registered as block device major 147
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: minor_table @ 0xffff8101371471c0
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: Starting worker thread (from cqueue/0 [213])
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: disk( Diskless -> Attaching )
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: Found 4 transactions (70 active extents) in activity log.
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: Method to ensure write ordering: barrier
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: max_segment_size ( = BIO size ) = 32768
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: drbd_bm_resize called with capacity == 629118192
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: resync bitmap: bits=78639774 words=1228747
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: size = 300 GB (314559096 KB)
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: recounting of set bits took additional 10 jiffies
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Marked additional 252 MB as out-of-sync based on AL.
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: disk( Attaching -> UpToDate )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( StandAlone -> Unconnected )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Starting receiver thread (from drbd1_worker [3435])
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: receiver (re)started
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( Unconnected -> WFConnection )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Handshake successful: Agreed network protocol version 94
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( WFConnection -> WFReportParams )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Starting asender thread (from drbd1_receiver [3443])
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: data-integrity-alg: <not-used>
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: drbd_sync_handshake:
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: self B3DE46FD85A4C304:D3D8A848BA989089:F5DB2DE79EFEC3E5:AE5C6A69A1F93A43 bits:64512 flags:0
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: peer CAD3EACF4FCC5066:D3D8A848BA989089:F5DB2DE79EFEC3E4:AE5C6A69A1F93A43 bits:130048 flags:2
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: uuid_compare()=100 by rule 90
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Dec 2 10:04:00 infplsm018 <kern.alert> kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( WFReportParams -> Disconnecting )
Dec 2 10:04:00 infplsm018 <kern.err> kernel: block drbd1: error receiving ReportState, l: 4!
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: asender terminated
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Terminating asender thread
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Connection closed
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn( Disconnecting -> StandAlone )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: receiver terminated
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Terminating receiver thread
Dec 2 10:04:01 infplsm018 <kern.info> kernel: block drbd1: role( Secondary -> Primary )
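From the "Split-Brain detected but unresolved" line I take it that my after-sb policies cannot resolve this case automatically (both nodes have changed data, and after-sb-2pri is set to disconnect). If I read the DRBD 8.3 docs right, the manual recovery would look roughly like the following -- assuming I pick infplsm018 as the victim whose changes get thrown away, and that I first stop the cluster service / unmount the GFS2 filesystem there so the node can be demoted. Please correct me if this is not the right procedure for a dual-primary setup.

On the split-brain victim (infplsm018):
# drbdadm secondary r0
# drbdadm disconnect r0
# drbdadm -- --discard-my-data connect r0

On the survivor (infplsm017), since it is StandAlone as well:
# drbdadm connect r0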
3) And here is my /etc/cluster/cluster.conf file:

# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="newnfscl" config_version="224" name="newnfscl">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="infplsm017-clust" nodeid="1" votes="1">
                        <multicast addr="224.0.0.1" interface="eth2"/>
                        <fence>
                                <method name="1">
                                        <device name="manfence" nodename="infplsm017-clust"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="infplsm018-clust" nodeid="2" votes="1">
                        <multicast addr="224.0.0.1" interface="eth2"/>
                        <fence>
                                <method name="1">
                                        <device name="manfence" nodename="infplsm018-clust"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1">
                <multicast addr="224.0.0.1"/>
        </cman>
        <fencedevices>
                <fencedevice agent="fence_null" name="nullfence"/>
                <fencedevice agent="fence_manual" name="manfence"/>
        </fencedevices>
        <rm log_facility="syslog" log_level="7">
                <failoverdomains>
                        <failoverdomain name="Test Domain" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="infplsm017-clust" priority="1"/>
                                <failoverdomainnode name="infplsm018-clust" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <clusterfs device="/dev/Shared/home" force_unmount="0" fsid="58812" fstype="gfs2" mountpoint="/home" name="homegfs" options="rw,localflocks" self_fence="0"/>
                        <nfsexport name="homenfs"/>
                        <nfsclient allow_recover="1" name="nfsclient" options="rw" target="*"/>
                        <ip address="10.10.28.15" monitor_link="1"/>
                </resources>
                <service autostart="1" exclusive="0" name="nfs-over-gfs2" nfslock="1" recovery="relocate">
                        <clusterfs ref="homegfs">
                                <nfsexport ref="homenfs">
                                        <nfsclient ref="nfsclient"/>
                                </nfsexport>
                        </clusterfs>
                        <ip ref="10.10.28.15"/>
                </service>
        </rm>
        <logging debug="on" logfile_priority="debug" syslog_facility="daemon" syslog_priority="info" to_logfile="yes" to_syslog="yes">
                <logging_daemon logfile="/var/log/cluster/qdiskd.log" name="qdiskd"/>
                <logging_daemon logfile="/var/log/cluster/fenced.log" name="fenced"/>
                <logging_daemon logfile="/var/log/cluster/dlm_controld.log" name="dlm_controld"/>
                <logging_daemon logfile="/var/log/cluster/gfs_controld.log" name="gfs_controld"/>
                <logging_daemon logfile="/var/log/cluster/rgmanager.log" name="rgmanager"/>
                <logging_daemon logfile="/var/log/cluster/corosync.log" name="corosync"/>
        </logging>
</cluster>
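On fencing: I know I am only using fence_manual at the moment. I am also wondering whether I should hook DRBD itself into the cluster's fencing, something like the sketch below. The helper script and its path are an assumption on my part -- I have seen obliterate-peer.sh mentioned on this list but have not tested it; notify-split-brain.sh does ship with the drbd83 packages:

common {
        disk {
                fencing resource-and-stonith;
        }
        handlers {
                fence-peer "/usr/lib/drbd/obliterate-peer.sh";    # assumed path, untested
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        }
}

Would that be the recommended way to tie DRBD into cman's fencing?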
On 12/02/2011 03:05 PM, Digimer wrote:
> On 12/01/2011 07:30 PM, Ivan Pavlenko wrote:
>> Hi all,
>>
>> Could you help me to fix a problem with split brain, please?
>>
>> I have a Red Hat cluster based on RHEL 5.7 which provides an
>> nfs-over-gfs2 service. I use DRBD as the storage.
>>
>> # cat /etc/drbd.conf
>> #
>> # please have a a look at the example configuration file in
>> # /usr/share/doc/drbd83/drbd.conf
>> #
>> include "/etc/drbd.d/global_common.conf";
> This is a good file to see. Can you share it, please?
>
>> include "/etc/drbd.d/r0.res";
>>
>> # cat /etc/drbd.d/r0.res
>> resource r0 {
>>   on infplsm017 {
>>     device /dev/drbd1;
>>     disk /dev/sdb1;
>>     address 10.10.24.10:7789;
>>     meta-disk internal;
>>   }
>>   on infplsm018 {
>>     device /dev/drbd1;
>>     disk /dev/sdb1;
>>     address 10.10.24.11:7789;
>>     meta-disk internal;
>>   }
>> }
>>
>> As you can see, there is nothing sophisticated here.
>>
>> I have:
>>
>> # cat /proc/drbd
>> version: 8.3.8 (api:88/proto:86-94)
>> GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by
>> mockbuild@builder10.centos.org, 2010-06-04 08:04:09
>>
>> 1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----
>>    ns:0 nr:0 dw:0 dr:332 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b
>>    oos:524288
>>
>> # ping 10.10.24.11
>> PING 10.10.24.11 (10.10.24.11) 56(84) bytes of data.
>> 64 bytes from 10.10.24.11: icmp_seq=1 ttl=64 time=2.99 ms
>> 64 bytes from 10.10.24.11: icmp_seq=2 ttl=64 time=13.9 ms
>>
>> But when I try to use telnet on port 7789 I get:
>>
>> # telnet 10.10.24.11 7789
>> Trying 10.10.24.11...
>> telnet: connect to address 10.10.24.11: Connection refused
>> telnet: Unable to connect to remote host: Connection refused
>>
>> But at the same time:
>>
>> # service iptables status
>> Table: filter
>> Chain INPUT (policy ACCEPT)
>> num  target     prot opt source               destination
>>
>> Chain FORWARD (policy ACCEPT)
>> num  target     prot opt source               destination
>>
>> Chain OUTPUT (policy ACCEPT)
>> num  target     prot opt source               destination
>>
>> I did this from my first server (INFPLSM017), and I get exactly the
>> same result from the second one (INFPLSM018). Could you tell me,
>> please, what the possible reason of this problem is and how I can
>> fix it?
>>
>> Thank you in advance,
>> Ivan
> Is this a network or split-brain problem?
>
> What happens when you try to connect?
>
> What state is the other node in?
>
> Anything interesting in /var/log/messages?
>
> How does DRBD tie into the cluster? What is the cluster's configuration?
> Are you using fencing?
>
> More details are needed to provide assistance.
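P.S. To rule the network out completely, I am also planning to stop DRBD on both nodes and test the replication port itself with netcat -- a rough sketch, and the exact nc syntax may vary with the netcat flavour installed:

On infplsm018, with drbd stopped, listen on the DRBD port:
# nc -l 7789

On infplsm017, try to reach it:
# echo ping | nc 10.10.24.11 7789

If the string comes through, the path between the nodes on 7789 is fine and the problem is purely the unresolved split brain.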