Hello guys,

I'm trying to run a two-node cluster using DRBD 8.x and GFS 1.x on RHEL 5.2 x86_64.

The problem: when one machine (node2) goes down, node1 stops working (even a simple 'ls -l' on the shared mount point hangs) until the second machine returns.

I'm setting up GFS this way:

# gfs_mkfs -t hotsite:gfs-00 -p lock_dlm -j 2 /dev/drbd0
# mount -v /dev/drbd0 /test

I'm causing a failure on the second node this way:

# echo 1 > /proc/sys/kernel/sysrq
# echo b > /proc/sysrq-trigger

==============================================================================
$ cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="hotsite" config_version="4">
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_join_delay="60">
  </fence_daemon>
  <clusternodes>
    <clusternode name="drdb_hotsite-1" nodeid="1">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="drdb_hotsite-2" nodeid="2">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>
</cluster>
==============================================================================
# DRBD Configuration
global {
    usage-count no;
}

resource hotsite {
    protocol C;

    startup {
        wfc-timeout 0;
        degr-wfc-timeout 120;
        become-primary-on both;
    }

    disk {
        fencing resource-and-stonith;
    }

#   handlers {
#       outdate-peer "/sbin/obliterate"; # We'll get back to this.
#   }

    net {
        cram-hmac-alg sha1;
        shared-secret "3n7r3iN at F31r@D at Fru7@";
        timeout 60;
        connect-int 10;
        ping-int 10;
        max-buffers 2048;
        max-epoch-size 2048;
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        rr-conflict violently;
    }

    syncer {
        rate 650M;
    }

    on hotsite-1 {
        device /dev/drbd0;
        disk /dev/vol0/lvm;
        address 192.168.0.3:7789;
        meta-disk internal;
    }

    on hotsite-2 {
        device /dev/drbd0;
        disk /dev/vol0/lvm;
        address 192.168.0.4:7789;
        meta-disk internal;
    }
}
==============================================================================

Here are the logs:

Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: PingAck did not arrive in time.
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: asender terminated
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Terminating asender thread
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: short read expecting header on sock: r=-512
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Creating new current UUID
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Connection closed
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: outdate-peer helper broken, returned 0
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: old = { cs:NetworkFailure st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: new = { cs:Unconnected st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: receiver terminated
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: receiver (re)started
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: old = { cs:Unconnected st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: new = { cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state from 2.
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: drdb_hotsite-2 not a cluster member after 0 sec post_fail_delay
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state from 0.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Creating commit token because I am the rep.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Saving state aru 31 high seq received 31
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Storing new sequence id for ring 168
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering COMMIT state.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering RECOVERY state.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [0] member 192.168.0.3:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 356 rep 192.168.0.3
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 31 high delivered 31 received flag 1
Jun 11 19:59:12 hotsite-bsb-la-1 kernel: dlm: closing connection to node 2
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Did not need to originate any messages in recovery.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Sending initial ORF token
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION CHANGE
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.3)
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.4)
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION CHANGE
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.3)
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [SYNC ] This node is within the primary component and will provide service.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering OPERATIONAL state.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] got nodejoin message 192.168.0.3
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CPG ] got joinlist message from node 1
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:22 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:22 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:27 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:27 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
.....
Jun 11 20:01:32 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state from 11.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Creating commit token because I am the rep.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Saving state aru 14 high seq received 14
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Storing new sequence id for ring 16c
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering COMMIT state.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering RECOVERY state.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [0] member 192.168.0.3:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 360 rep 192.168.0.3
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 14 high delivered 14 received flag 1
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [1] member 192.168.0.4:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 360 rep 192.168.0.4
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 9 high delivered 9 received flag 1
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Did not need to originate any messages in recovery.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Sending initial ORF token
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION CHANGE
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.3)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION CHANGE
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.3)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.4)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.4)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [SYNC ] This node is within the primary component and will provide service.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering OPERATIONAL state.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] got nodejoin message 192.168.0.4
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] got nodejoin message 192.168.0.3
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CPG ] got joinlist message from node 1
Jun 11 20:01:42 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Trying to acquire journal lock...
Jun 11 20:01:42 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Looking at journal...
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Handshake successful: Agreed network protocol version 88
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: old = { cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: new = { cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Starting asender thread (from drbd0_receiver [526])
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: data-integrity-alg: <not-used>
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Outdated )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: tl_clear()
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: susp( 1 -> 0 )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: peer( Secondary -> Primary )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Began resync as SyncSource (will sync 548864 KB [137216 bits set]).
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 20:05:05 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Acquiring the transaction lock...
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Replaying journal...
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Replayed 0 of 1 blocks
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: replays = 0, skips = 0, sames = 1
Jun 11 20:05:10 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Journal replayed in 5s
Jun 11 20:05:10 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Done
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: Resync done (total 15 sec; paused 0 sec; 36588 K/sec)
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: Trying to join cluster "lock_dlm", "hotsite:gfs-00"
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: dlm: Using TCP for communications
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: Joined cluster. Now mounting FS...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Trying to acquire journal lock...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Looking at journal...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Done
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Trying to acquire journal lock...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Looking at journal...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Done
Jun 11 20:07:25 hotsite-bsb-la-1 kernel: dlm: connecting to 2

Thanks!

--
Tiago Cruz
http://everlinux.com
Linux User #282636
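
One detail in the cluster.conf above that may be worth checking: both clusternode entries reference a fence device named "gnbd", while the fencedevices section only declares one named "manual". The following is a minimal Python sketch, not part of the original post, that parses the posted configuration with the standard library and lists device names that have no matching fencedevice declaration. The function name undefined_fence_devices is hypothetical, and whether this mismatch explains the repeated 'fence "drdb_hotsite-2" failed' messages is an assumption that the logs alone do not confirm.

```python
import xml.etree.ElementTree as ET

# cluster.conf as posted above (only whitespace has been reflowed).
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster name="hotsite" config_version="4">
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_join_delay="60">
  </fence_daemon>
  <clusternodes>
    <clusternode name="drdb_hotsite-1" nodeid="1">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="drdb_hotsite-2" nodeid="2">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>
</cluster>
"""


def undefined_fence_devices(conf_xml):
    """Map each node name to the fence device names it uses that are
    not declared under <fencedevices>. Hypothetical helper for illustration."""
    root = ET.fromstring(conf_xml)
    # Names declared in the <fencedevices> section.
    declared = {fd.get("name") for fd in root.iter("fencedevice")}
    missing = {}
    for node in root.iter("clusternode"):
        # Names referenced by <device> elements inside this node's <fence> block.
        used = [dev.get("name") for dev in node.iter("device")]
        undeclared = [name for name in used if name not in declared]
        if undeclared:
            missing[node.get("name")] = undeclared
    return missing


if __name__ == "__main__":
    for node, devices in undefined_fence_devices(CLUSTER_CONF).items():
        print(f"{node}: undeclared fence device(s): {', '.join(devices)}")
```

Run against the configuration as posted, this reports "gnbd" as undeclared for both nodes. Pointing the script at a real /etc/cluster/cluster.conf would only require reading the file contents into the string passed to the function.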