Hello all, my first post. It is quite long, but I'll try to give details.

I have two CentOS 5.2 servers with heartbeat + drbd, two eth channels and no STONITH at the moment (suggestions in this respect are welcome; I have yet to analyze it in depth). Package versions are:

  heartbeat-2.1.3-3.el5.centos
  drbd82-8.2.6-1.el5.centos
  kmod-drbd82-8.2.6-1.2.6.18_92.el5
  kernel-2.6.18-92.el5

I intend to provide NFS services in HA and have consulted many docs, arriving at a specific config (see below). At the moment I'm planning to provide only a primary/secondary service on one NFS resource. I'm trying to consider and simulate various planned/unplanned scenarios, and right now I have the behaviour described below, with doubts about the heartbeat and/or drbd config and the resulting "service" behaviour. Excuse me if this turns out to be off topic because of a heartbeat misconfiguration.

I have the cluster running, nfsnode1 active and master, nfsnode2 active and slave. The heartbeat resource line (haresources) is:

  nfsnode2 drbddisk::drbd-resource-0 \
           Filesystem::/dev/drbd0::/drbd0::ext3 \
           killnfsd \
           nfslock \
           nfs \
           Delay::3::0 \
           IPaddr::10.4.5.103/24/eth0

and ha.cf is:

  keepalive 1
  deadtime 10
  warntime 2
  ucast eth0 10.4.5.102
  ucast eth1 10.4.192.242
  auto_failback off
  node nfsnode1
  node nfsnode2
  respawn hacluster /usr/lib/heartbeat/dopd
  apiauth dopd gid=haclient uid=hacluster
  use_logd yes

Actions:

1) NFS service provided, write operations from clients active on the drbd device
2) shutdown of nfsnode2 (so heartbeat and drbd stop cleanly)
3) write operations continue against the drbd device (on nfsnode1) while nfsnode2 is powered off
4) shutdown of nfsnode1 (so heartbeat and drbd stop cleanly)
5) restart of nfsnode2 with nfsnode1 still powered down (suppose, for example, a wrong action done by an operator during maintenance activities...)
   ===> I think it should not start the services, as it was slave when shut down, and indeed it doesn't. Good.
6) start of nfsnode1
   ===> I would now expect nfsnode1 to carry on the service, as it was the latest master while the other was slave, and both shutdown operations were clean; and in fact the sync correctly happens between the two. But both drbd resources remain Secondary... Bad (in my opinion):

     0:drbd-resource-0  Connected  Secondary/Secondary  UpToDate/UpToDate  C

   so the heartbeat chain doesn't start and the NFS service is not provided.
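If I understand the tools correctly, the manual way out of this Secondary/Secondary state would be something along these lines (just a sketch, untested; resource name taken from my config above):

  # on nfsnode1, promote the resource again so the heartbeat chain can run:
  drbdadm primary drbd-resource-0
  # then ask heartbeat to (re)acquire the resource group:
  /usr/lib/heartbeat/hb_takeover

but of course I would like the cluster to reach this state by itself, given that both shutdowns were clean.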
My drbd.conf at the moment is:

  resource "drbd-resource-0" {
    protocol C;

    handlers {
      pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
      pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
      local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
      outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }

    startup {
      # wfc-timeout  0;      ## Infinite!
      degr-wfc-timeout 120;  ## 2 minutes.
    }

    disk {
      on-io-error detach;
      fencing resource-only;
    }

    net {
      after-sb-0pri disconnect;
      after-sb-1pri disconnect;
      after-sb-2pri disconnect;
      rr-conflict disconnect;
    }

    syncer {
      rate 60M;
      al-extents 257;
    }

    # It is valid to move device, disk and meta-disk to the
    # resource level.
    device    /dev/drbd0;
    disk      /dev/sdb1;
    meta-disk internal;

    on nfsnode1 {
      address 10.4.192.241:7789;
    }
    on nfsnode2 {
      address 10.4.192.242:7789;
    }
  }

Some log data. At the shutdown of nfsnode2 in step 2), nfsnode1 becomes:

  0:drbd-resource-0  WFConnection  Primary/Unknown  UpToDate/Outdated  C  /drbd0

and in messages:

  Aug 22 12:42:04 nfsnode1 kernel: drbd0: State change failed: Refusing to be Primary while peer is not outdated
  Aug 22 12:42:04 nfsnode1 kernel: drbd0: state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- }
  Aug 22 12:42:04 nfsnode1 kernel: drbd0: wanted = { cs:TearDown st:Primary/Unknown ds:UpToDate/DUnknown r--- }
  Aug 22 12:42:04 nfsnode1 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )

At the restart of nfsnode2 in step 5), it waits forever at the OS console for the connection with nfsnode1 (wfc-timeout=0, i.e. infinite). And this is a safe (?) behaviour, since the situation is not a correct one. I stop the wait by typing yes+<ENTER> at the console (I simulate the perfect operator... I have been an operator too in the past, so I can tell... ;-)

The drbd status on nfsnode2 is, and remains:

  0:drbd-resource-0  WFConnection  Secondary/Unknown  Outdated/DUnknown  C

and in the log, 6 times:

  Aug 22 12:57:37 nfsnode2 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
  Aug 22 12:57:37 nfsnode2 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown r--- }
  Aug 22 12:57:37 nfsnode2 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Outdated/DUnknown r--- }
  ...
  Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2683]: ERROR: Return code 1 from /etc/ha.d/resource.d/drbddisk
  Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2684]: CRIT: Giving up resources due to failure of drbddisk::drbd-resource-0
  Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2685]: info: Releasing resource group: nfsnode2 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/drbd0::ext3 killnfsd nfslock nfs Delay::3::0 IPaddr::10.4.5.103/24/eth0
  ...
  Aug 22 12:58:13 nfsnode2 hb_standby[3047]: [3053]: Going standby [foreign].
  Aug 22 12:58:14 nfsnode2 heartbeat: [2204]: info: nfsnode2 wants to go standby [foreign]
  Aug 22 12:58:24 nfsnode2 heartbeat: [2204]: WARN: No reply to standby request. Standby request cancelled.
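By the way, if I wanted to avoid that infinite wait at boot in step 5), I suppose the startup section could get a finite wfc-timeout, something like (sketch, untested):

  startup {
    wfc-timeout      120;  ## give up waiting for the peer after 2 minutes
    degr-wfc-timeout 120;  ## 2 minutes.
  }

For now I have left the default (infinite), since the wait at least forces the operator to think twice before confirming.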
Now the startup of nfsnode1, step 6): nfsnode1 correctly executes the sync of the drbd data towards nfsnode2, which was not aligned:

  Aug 22 13:06:11 nfsnode1 kernel: drbd0: Began resync as SyncSource (will sync 304 KB [76 bits set]).
  Aug 22 13:06:11 nfsnode1 kernel: drbd0: Writing meta data super block now.
  Aug 22 13:06:11 nfsnode1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 304 K/sec)
  Aug 22 13:06:11 nfsnode1 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

At the end, both nodes are in state Secondary:

  0:drbd-resource-0  Connected  Secondary/Secondary  UpToDate/UpToDate  C

and so heartbeat doesn't activate the NFS service and its virtual IP. The heartbeat logs give:

  Aug 22 13:06:12 nfsnode1 heartbeat: [2040]: info: Local status now set to: 'up'
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Link nfsnode2:eth0 up.
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Status update for node nfsnode2: status active
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 100 ms (> 50 ms) (GSource: 0x83b8940)
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Link nfsnode2:eth1 up.
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Comm_now_up(): updating status to active
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Local status now set to: 'active'
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Starting child client "/usr/lib/heartbeat/dopd" (498,496)
  Aug 22 13:06:13 nfsnode1 heartbeat: [2188]: info: Starting "/usr/lib/heartbeat/dopd" as uid 498 gid 496 (pid 2188)
  Aug 22 13:06:13 nfsnode1 harc[2186]: [2196]: info: Running /etc/ha.d/rc.d/status status
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: remote resource transition completed.
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: remote resource transition completed.
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Local Resource acquisition completed. (none)
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Initial resource acquisition complete (T_RESOURCES(them))

Any hints and suggestions? Thanks in advance

Gianluca
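P.S. In case I am misreading the haresources semantics: my understanding is that on takeover the ResourceManager simply runs the resource scripts left to right with "start" (and right to left with "stop" on release), so my chain above should be roughly equivalent to:

  /etc/ha.d/resource.d/drbddisk drbd-resource-0 start    # ~ drbdadm primary drbd-resource-0
  /etc/ha.d/resource.d/Filesystem /dev/drbd0 /drbd0 ext3 start
  /etc/ha.d/resource.d/killnfsd start                    # my helper script
  /etc/init.d/nfslock start
  /etc/init.d/nfs start
  /etc/ha.d/resource.d/Delay 3 0 start
  /etc/ha.d/resource.d/IPaddr 10.4.5.103/24/eth0 start

i.e. everything hinges on the first drbddisk step succeeding, which is exactly what does not happen while both sides stay Secondary.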