[DRBD-user] Help with recovering from network failure (failover and back again)

Guus Houtzager guus at houtzager.net
Mon Sep 12 18:06:17 CEST 2005



Hi,

For some time now I've been trying to build an HA setup with drbd, and I
keep running into the same problem every time. In the final setup, one
half of the drbd pair will be in a geographically different location,
so I need to test what happens if the network connection between those
locations fails. To that end I've configured 2 machines, fs1 and fs2,
each with 2 NICs. One NIC per machine is dedicated to drbd. I didn't
use a crossover cable, but connected them to a managed switch, so I
can easily shut down ports to simulate a network failure.
When all switch ports are enabled, everything works just fine. Drbd
works, and I can swap the primary to the other machine (first drbdadm
secondary all on the primary, then drbdadm primary all on the
secondary), no problem whatsoever.
I have one drbd resource, r1, everything started and ok (connected and
consistent), fs1 = primary, fs2 = secondary.
At that point I shut down the switch port of the drbd NIC of fs1. A few
seconds later both sides notice they can't connect to the other side and
change the status of that side to Unknown. Now I want fs2 to become
primary (apparently something is wrong with fs1, so I want the
application servers at the location of fs2 to take over with fs2 as
fileserver), so I do a drbdadm primary all on fs2 and a drbdadm
secondary all on fs1 (just to be sure: I can't have 2 primaries when I
re-enable the switch port). Both sides update their status accordingly.
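
The manual role swap described above can be written down as an ordered
plan. This is only a minimal sketch: the function name failover_plan is
made up for illustration, it only prints the commands instead of running
them, and the order (demote fs1 first, then promote fs2) is taken from
the syslog timestamps further down, not from any DRBD requirement.

```python
def failover_plan(old_primary, new_primary, resource="all"):
    """Return the ordered (host, argv) steps for a manual role swap.

    Hypothetical helper: it only describes the commands, it does not
    run them.
    """
    return [
        # Demote the isolated node first so two primaries can never
        # coexist once the link comes back.
        (old_primary, ["drbdadm", "secondary", resource]),
        # Then promote the node that should take over.
        (new_primary, ["drbdadm", "primary", resource]),
    ]


if __name__ == "__main__":
    for host, argv in failover_plan("fs1", "fs2"):
        print(f"{host}# {' '.join(argv)}")
```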
If I then re-enable the switch port, both sides "see" each other again,
but won't reconnect, because fs1 wants to sync as source with fs2 as
target. That seems totally wrong to me. I expect fs1 to become a
secondary with fs2 primary. Fs2 does refuse the sync (as it should) and
aborts. The strange part is that if I stop the drbd device on fs2 and
restart it, it comes up as secondary (correct) and syncs back to fs1
with fs2 as source and fs1 as target, just as it should!
I'm running 0.7.13 on a 2.4 kernel.
I hope someone can help me out with this!

Here are some parts from syslog:

Interface has been shut down; fs1 made secondary:
Sep 12 17:04:07 fs1 kernel: drbd0: Primary/Unknown --> Secondary/Unknown
Likewise, fs2 made primary:
Sep 12 17:04:37 fs2 kernel: drbd0: Secondary/Unknown --> Primary/Unknown

Re-enabled the switch port:

On fs1:

Sep 12 17:05:34 fs1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 12 17:05:34 fs1 kernel: drbd0: Connection established.
Sep 12 17:05:34 fs1 kernel: drbd0: I am(S): 1:00000002:00000001:0000001c:00000010:00
Sep 12 17:05:34 fs1 kernel: drbd0: Peer(P): 1:00000002:00000001:0000001b:00000011:10
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate WFReportParams --> WFBitMapS
Sep 12 17:05:34 fs1 kernel: drbd0: meta connection shut down by peer.
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_asender [4378]: cstate WFBitMapS --> NetworkFailure
Sep 12 17:05:34 fs1 kernel: drbd0: asender terminated
Sep 12 17:05:34 fs1 kernel: drbd0: sock_sendmsg returned -104
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate NetworkFailure --> BrokenPipe
Sep 12 17:05:34 fs1 kernel: drbd0: short sent ReportBitMap size=4096 sent=2104
Sep 12 17:05:34 fs1 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate BrokenPipe --> BrokenPipe
Sep 12 17:05:34 fs1 kernel: drbd0: short read expecting header on sock: r=-512
Sep 12 17:05:34 fs1 kernel: drbd0: worker terminated
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate BrokenPipe --> Unconnected
Sep 12 17:05:34 fs1 kernel: drbd0: Connection lost.
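
To make the fs1 log above easier to follow, the connection-state
changes can be pulled out into a single path. This is a rough sketch
that relies only on the "cstate A --> B" text visible in the log; the
name cstate_path is made up for illustration, and the sample lines are
copied from the excerpt with the syslog prefix dropped.

```python
import re

# cstate lines from the fs1 excerpt, syslog prefix removed.
FS1_LOG = """\
drbd0: drbd0_receiver [4357]: cstate WFReportParams --> WFBitMapS
drbd0: drbd0_asender [4378]: cstate WFBitMapS --> NetworkFailure
drbd0: drbd0_receiver [4357]: cstate NetworkFailure --> BrokenPipe
drbd0: drbd0_receiver [4357]: cstate BrokenPipe --> BrokenPipe
drbd0: drbd0_receiver [4357]: cstate BrokenPipe --> Unconnected
"""

# \s+ also matches a newline, so lines wrapped by a mail client still parse.
CSTATE = re.compile(r"cstate\s+(\w+)\s+-->\s+(\w+)")


def cstate_path(log_text):
    """Return the sequence of connection states, skipping self-transitions."""
    states = []
    for old, new in CSTATE.findall(log_text):
        if not states:
            states.append(old)
        if new != states[-1]:
            states.append(new)
    return states


if __name__ == "__main__":
    print(" -> ".join(cstate_path(FS1_LOG)))
```

Read this way, fs1 gets as far as WFBitMapS (it intends to send its
bitmap as sync source) before the peer closes the connection.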

On fs2:

Sep 12 17:05:34 fs2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 12 17:05:34 fs2 kernel: drbd0: Connection established.
Sep 12 17:05:34 fs2 kernel: drbd0: I am(P): 1:00000002:00000001:0000001b:00000011:10
Sep 12 17:05:34 fs2 kernel: drbd0: Peer(S): 1:00000002:00000001:0000001c:00000010:00
Sep 12 17:05:34 fs2 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Sep 12 17:05:34 fs2 kernel: drbd0: drbd0_receiver [17971]: cstate WFReportParams --> StandAlone
Sep 12 17:05:34 fs2 kernel: drbd0: error receiving ReportParams, l: 72!
Sep 12 17:05:34 fs2 kernel: drbd0: asender terminated
Sep 12 17:05:34 fs2 kernel: drbd0: worker terminated
Sep 12 17:05:34 fs2 kernel: drbd0: drbd0_receiver [17971]: cstate StandAlone --> StandAlone
Sep 12 17:05:34 fs2 kernel: drbd0: Connection lost.
Sep 12 17:05:34 fs2 kernel: drbd0: receiver terminated

After stopping and starting the drbd device on fs2:

On fs1:

Sep 12 17:07:41 fs1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 12 17:07:41 fs1 kernel: drbd0: Connection established.
Sep 12 17:07:41 fs1 kernel: drbd0: I am(S): 1:00000002:00000001:0000001c:00000010:00
Sep 12 17:07:41 fs1 kernel: drbd0: Peer(S): 1:00000002:00000001:0000001c:00000011:00
Sep 12 17:07:41 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate WFReportParams --> WFBitMapT
Sep 12 17:07:41 fs1 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Sep 12 17:07:41 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate WFBitMapT --> SyncTarget
Sep 12 17:07:41 fs1 kernel: drbd0: Resync started as SyncTarget (need to sync 0 KB [0 bits set]).
Sep 12 17:07:41 fs1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Sep 12 17:07:41 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate SyncTarget --> Connected

On fs2:

Sep 12 17:07:41 fs2 kernel: drbd0: Connection established.
Sep 12 17:07:41 fs2 kernel: drbd0: I am(S): 1:00000002:00000001:0000001c:00000011:00
Sep 12 17:07:41 fs2 kernel: drbd0: Peer(S): 1:00000002:00000001:0000001c:00000010:00
Sep 12 17:07:41 fs2 kernel: drbd0: drbd0_receiver [18061]: cstate WFReportParams --> WFBitMapS
Sep 12 17:07:41 fs2 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Sep 12 17:07:41 fs2 kernel: drbd0: drbd0_receiver [18061]: cstate WFBitMapS --> SyncSource
Sep 12 17:07:41 fs2 kernel: drbd0: Resync started as SyncSource (need to sync 0 KB [0 bits set]).
Sep 12 17:07:41 fs2 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Sep 12 17:07:41 fs2 kernel: drbd0: drbd0_receiver [18061]: cstate SyncSource --> Connected
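
The colon-separated counter strings on the "I am" and "Peer" lines are
where the two reconnect attempts differ, so a small sketch that lists
the differing fields may help when reading them. The helper name
differing_fields is made up for illustration, fields are numbered from
0 on the left, and the code says nothing about what each field means.
That DRBD 0.7 picks the sync source from the leftmost differing field
is an assumption on my part, not something these logs state.

```python
# Counter strings copied from the log excerpts.
FS1 = "1:00000002:00000001:0000001c:00000010:00"
FS2_FIRST_TRY = "1:00000002:00000001:0000001b:00000011:10"
FS2_AFTER_RESTART = "1:00000002:00000001:0000001c:00000011:00"


def differing_fields(a, b):
    """Return (index, a_field, b_field) for every field that differs.

    Fields are compared as strings; they are fixed-width hex in the logs.
    """
    pairs = zip(a.split(":"), b.split(":"))
    return [(i, x, y) for i, (x, y) in enumerate(pairs) if x != y]


if __name__ == "__main__":
    for label, fs2 in (("first try", FS2_FIRST_TRY),
                       ("after restart", FS2_AFTER_RESTART)):
        print(label)
        for index, mine, peer in differing_fields(FS1, fs2):
            print(f"  field {index}: fs1={mine} fs2={peer}")
```

On the first attempt the leftmost difference is field 3, where fs1 is
higher (1c against 1b). After the restart of fs2 only field 4 differs,
and there fs2 is higher (11 against 10). Under the assumption above,
that would match the sync directions seen in the logs.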

My /etc/drbd.conf (the same on both machines) looks like this (there's
lots more in the file actually, but all commented out):

resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }
  disk {
    on-io-error   detach;
  }
  net {
  }
  syncer {
    rate 10M;
    group 1;
  }
  on fs1 {
    device     /dev/drbd0;
    disk       /dev/hda3;
    address    10.1.1.1:7788;
    meta-disk  internal;
  }
  on fs2 {
    device    /dev/drbd0;
    disk      /dev/hda3;
    address   10.1.2.1:7788;
    meta-disk internal;
  }
}

I hope someone can help me debug this or tell me what I did wrong. TIA!

Regards,

-- 
Guus Houtzager                           Email: guus at houtzager.net
PGP fingerprint = 5E E6 96 35 F0 64 34 14  CC 03 2B 36 71 FB 4B 5D
Early to rise, early to bed, makes a man healthy, wealthy and dead.
        --Rincewind, The Light Fantastic
