Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi List!
I have set up a DRBD installation (version 0.7.18 on Debian etch)
with primary node nas1 and secondary node nas2, which works
fine when both nodes are up and connected.
When I simulate a network failure by unplugging the crossover
connection, the cluster detects this: nas1 goes to the state
Primary/Unknown and nas2 goes to Secondary/Unknown. /dev/drbd0
is still mounted on nas1 as /data1 and working properly. I then
write some files on the primary node to cause some changes
in the filesystem.
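For reference, the writes are nothing special; roughly something like
this on nas1 (the file name and size are just examples):

  # on nas1, while nas2 is disconnected
  dd if=/dev/urandom of=/data1/testfile bs=1M count=50
  sync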
When I reconnect the network between the two nodes I would
expect them to switch back to Primary/Secondary and sync the
changes that were made on nas1 while nas2 was disconnected.
But this does not happen: nas2 reconnects but does not start syncing.
This is what I get in the syslog of nas2 after the network
connection is lost:
drbd0: drbd0_asender [6512]: cstate Connected --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [6359]: cstate NetworkFailure --> BrokenPipe
drbd0: short read expecting header on sock: r=-512
drbd0: worker terminated
drbd0: drbd0_receiver [6359]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [6359]: cstate Unconnected --> WFConnection
When the network connection is restored, I get:
e1000: eth1: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
drbd0: drbd0_receiver [6359]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(S): 1:00000002:00000002:00000021:00000002:01
drbd0: Peer(P): 1:00000002:00000002:00000022:00000002:10
drbd0: drbd0_receiver [6359]: cstate WFReportParams --> WFBitMapT
drbd0: Secondary/Unknown --> Secondary/Primary
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967295
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967294
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967293
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967292
The last lines are repeated over and over again.
I have found some Google hits for this error message, but none of
them seems to point to the same problem I have.
cat /proc/drbd on nas2 shows:
version: 0.7.18 (api:78/proto:74)
SVN Revision: 2176 build by root at nas2, 2006-04-27 18:27:31
0: cs:WFBitMapT st:Secondary/Primary ld:Inconsistent
ns:17 nr:91549 dw:91566 dr:65796 al:0 bm:13 lo:0 pe:0 ua:0 ap:0
nas2 detects that it needs syncing but does not start it.
nas1 even states in the log:
nas1 kernel: drbd0: 57 MB marked out-of-sync by on disk bit-map.
but does not start syncing either.
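For completeness, this is how I check the state on each node (r0 is
the resource name from the config below):

  cat /proc/drbd
  drbdadm cstate r0
  drbdadm state r0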
When I reboot both nodes and restart the drbd services, the syncing
takes place and the cluster goes back to redundant operation.
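Roughly what I do to recover (assuming the usual init script location
on etch, in case drbd is not started automatically at boot):

  # on both nodes
  shutdown -r now
  # after boot:
  /etc/init.d/drbd start
  cat /proc/drbd    # the resync then starts, see the log below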
syslog on nas1 after reboot of both nodes:
drbd0: drbd0_receiver [6586]: cstate WFBitMapS --> SyncSource
drbd0: Resync started as SyncSource (need to sync 58900 KB [14725 bits set]).
drbd0: Resync done (total 1 sec; paused 0 sec; 58900 K/sec)
drbd0: drbd0_worker [6573]: cstate SyncSource --> Connected
Does anybody have an idea how I can get the cluster to resync after
the network connection is restored? The whole point of an HA cluster
is that I shouldn't have to reboot both nodes after a failure.
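Ideally something like the following on nas2 would be enough to drop
the stuck connection and retrigger the handshake and resync. I assume
drbdadm's disconnect/connect is the intended way to do this, but so
far I have not found an incantation that works without a reboot:

  # on nas2 (the side waiting in WFBitMapT)
  drbdadm disconnect r0
  drbdadm connect r0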
Thanks for reading this far.
Best regards,
Andreas
Attached is my drbd.conf:
resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! drdb0 pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    wfc-timeout 120;
    degr-wfc-timeout 30;
  }

  disk {
    on-io-error pass_on;
  }

  net {
    on-disconnect reconnect;
  }

  syncer {
    rate 100M;
    group 1;
    al-extents 257;
  }

  on nas1 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.8.1:7788;
    meta-disk internal;
  }

  on nas2 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.8.2:7788;
    meta-disk internal;
  }
}