Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi List!
I have set up a DRBD installation (version 0.7.18 on Debian etch)
with primary node nas1 and secondary node nas2, which works
fine when both nodes are up and connected.
When I simulate a network failure by unplugging the crossover
connection, the cluster detects this: nas1 goes to the state
Primary/Unknown and nas2 goes to Secondary/Unknown. /dev/drbd0
is still mounted on nas1 as /data1 and working properly. I then
write some files on the primary node to cause some changes
in the filesystem.
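For reference, the writes are nothing special; roughly something like
this on nas1 (the file name and size are just examples):

  # on nas1, while nas2 is disconnected
  dd if=/dev/urandom of=/data1/testfile bs=1M count=50
  sync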
When I reconnect the network between the two nodes I would
expect them to switch back to Primary/Secondary and sync the
changes that were made on nas1 while nas2 was disconnected.
But this does not happen: nas2 reconnects but does not start syncing.
This is what I get in the syslog of nas2 after the network
connection is lost:
drbd0: drbd0_asender [6512]: cstate Connected --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [6359]: cstate NetworkFailure --> BrokenPipe
drbd0: short read expecting header on sock: r=-512
drbd0: worker terminated
drbd0: drbd0_receiver [6359]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [6359]: cstate Unconnected --> WFConnection
When the network connection is restored, I get:
e1000: eth1: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
drbd0: drbd0_receiver [6359]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(S): 1:00000002:00000002:00000021:00000002:01
drbd0: Peer(P): 1:00000002:00000002:00000022:00000002:10
drbd0: drbd0_receiver [6359]: cstate WFReportParams --> WFBitMapT
drbd0: Secondary/Unknown --> Secondary/Primary
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967295
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967294
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967293
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967292
The last lines are repeated over and over again.
I have found some Google hits for this error message, but none of
them seems to point to the same problem I have.
cat /proc/drbd on nas2 shows:
version: 0.7.18 (api:78/proto:74)
SVN Revision: 2176 build by root at nas2, 2006-04-27 18:27:31
0: cs:WFBitMapT st:Secondary/Primary ld:Inconsistent
ns:17 nr:91549 dw:91566 dr:65796 al:0 bm:13 lo:0 pe:0 ua:0 ap:0
nas2 detects that it needs syncing but does not start it.
nas1 even states in the log:
nas1 kernel: drbd0: 57 MB marked out-of-sync by on disk bit-map.
but does not start syncing either.
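For completeness, this is how I check the state on each node (r0 is
the resource name from the config below):

  cat /proc/drbd
  drbdadm cstate r0
  drbdadm state r0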
When I reboot both nodes and restart the drbd services, the syncing
takes place and the cluster goes back to redundant operation.
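Roughly what I do to recover (assuming the usual init script location
on etch, in case drbd is not started automatically at boot):

  # on both nodes
  shutdown -r now
  # after boot:
  /etc/init.d/drbd start
  cat /proc/drbd    # the resync then starts, see the log below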
syslog on nas1 after reboot of both nodes:
drbd0: drbd0_receiver [6586]: cstate WFBitMapS --> SyncSource
drbd0: Resync started as SyncSource (need to sync 58900 KB [14725 bits set]).
drbd0: Resync done (total 1 sec; paused 0 sec; 58900 K/sec)
drbd0: drbd0_worker [6573]: cstate SyncSource --> Connected
Does anybody have an idea how I can get the cluster to resync after
the network connection is restored? The whole point of an HA cluster
is that I shouldn't have to reboot both nodes after a failure.
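Ideally something like the following on nas2 would be enough to drop
the stuck connection and retrigger the handshake and resync. I assume
drbdadm's disconnect/connect is the intended way to do this, but so
far I have not found an incantation that works without a reboot:

  # on nas2 (the side waiting in WFBitMapT)
  drbdadm disconnect r0
  drbdadm connect r0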
Thanks for reading this far.
Best regards,
Andreas
Attached is my drbd.conf:
resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! drdb0 pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    wfc-timeout 120;
    degr-wfc-timeout 30;
  }

  disk {
    on-io-error pass_on;
  }

  net {
    on-disconnect reconnect;
  }

  syncer {
    rate 100M;
    group 1;
    al-extents 257;
  }

  on nas1 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.8.1:7788;
    meta-disk internal;
  }

  on nas2 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.8.2:7788;
    meta-disk internal;
  }
}