Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi list!

I have set up a DRBD installation (version 0.7.18 on Debian etch) with primary node nas1 and secondary node nas2, and it works fine as long as both nodes are up and connected.

When I simulate a network failure by unplugging the crossover connection, the cluster detects this: nas1 goes to the state Primary/Unknown and nas2 goes to Secondary/Unknown. /dev/drbd0 is still mounted on nas1 as /data1 and works properly. I then write some files to the primary node to cause some changes in the filesystem.

When I reconnect the network between the two nodes, I would expect them to switch back to Primary/Secondary and sync the changes that were made on nas1 while nas2 was disconnected. But this does not happen: nas2 reconnects but does not start syncing.

This is what I get in the syslog of nas2 after the network connection is lost:

drbd0: drbd0_asender [6512]: cstate Connected --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [6359]: cstate NetworkFailure --> BrokenPipe
drbd0: short read expecting header on sock: r=-512
drbd0: worker terminated
drbd0: drbd0_receiver [6359]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [6359]: cstate Unconnected --> WFConnection

When the network connection is restored I get:

e1000: eth1: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
drbd0: drbd0_receiver [6359]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(S): 1:00000002:00000002:00000021:00000002:01
drbd0: Peer(P): 1:00000002:00000002:00000022:00000002:10
drbd0: drbd0_receiver [6359]: cstate WFReportParams --> WFBitMapT
drbd0: Secondary/Unknown --> Secondary/Primary
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967295
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967294
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967293
drbd0: [drbd0_receiver/6359] sock_sendmsg time expired, ko = 4294967292

The last lines are repeated over and over again (with the ko count decreasing). I have found some Google hits for this error message, but nothing seems to point to the same problem I have.

cat /proc/drbd on nas2 shows:

version: 0.7.18 (api:78/proto:74)
SVN Revision: 2176 build by root@nas2, 2006-04-27 18:27:31
 0: cs:WFBitMapT st:Secondary/Primary ld:Inconsistent
    ns:17 nr:91549 dw:91566 dr:65796 al:0 bm:13 lo:0 pe:0 ua:0 ap:0

So nas2 detects that it needs syncing but does not start it. nas1 even states in its log:

nas1 kernel: drbd0: 57 MB marked out-of-sync by on disk bit-map.

but it does not start syncing either.

When I reboot both nodes and restart the drbd services, the sync takes place and the cluster goes back to redundant operation. syslog on nas1 after the reboot of both nodes:

drbd0: drbd0_receiver [6586]: cstate WFBitMapS --> SyncSource
drbd0: Resync started as SyncSource (need to sync 58900 KB [14725 bits set]).
drbd0: Resync done (total 1 sec; paused 0 sec; 58900 K/sec)
drbd0: drbd0_worker [6573]: cstate SyncSource --> Connected

Does anybody have an idea how I can get the cluster to resync after the network connection is restored? The whole point of the HA cluster should be that I don't have to reboot both nodes after a failure.
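If it is relevant: the commands I would expect to have to run by hand on nas2 (rather than rebooting) are something like the ones below. I am not at all sure this is the intended way to recover with 0.7, so please correct me if there is a better way:

# on nas2, drop and re-establish the connection to nas1 (resource r0 as in my drbd.conf)
drbdadm disconnect r0
drbdadm connect r0

# or, more drastically, discard nas2's copy and force a full resync from nas1
drbdadm invalidate r0

# then watch the connection state; I would expect to see cs:SyncTarget here
cat /proc/drbd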
Thanks for reading this far.

Best regards,
Andreas

Attached my drbd.conf:

resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! drbd0 pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    wfc-timeout      120;
    degr-wfc-timeout  30;
  }

  disk {
    on-io-error pass_on;
  }

  net {
    on-disconnect reconnect;
  }

  syncer {
    rate       100M;
    group      1;
    al-extents 257;
  }

  on nas1 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.8.1:7788;
    meta-disk internal;
  }

  on nas2 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.8.2:7788;
    meta-disk internal;
  }
}
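P.S. In case the exact recovery steps matter: "reboot both nodes and restart the drbd services" currently means roughly the following (slightly simplified, and assuming nothing beyond the Debian init script is involved; nas1 is made primary and /data1 remounted by hand):

# on both nodes
reboot

# once both nodes are back up, if drbd was not started at boot:
/etc/init.d/drbd start

# on nas1: take the primary role again and remount the data partition
drbdadm primary r0
mount /dev/drbd0 /data1

# watch the resync run to completion (cs:SyncSource on nas1, then cs:Connected)
cat /proc/drbd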