Hi,

For some time now I've been trying to build an HA setup with drbd, and I keep running into the same problem every time. I'm building a setup in which one half of the drbd pair will be in a geographically different location, so I need to test what happens if the network connection between those locations fails.

To that end I've configured 2 machines, fs1 and fs2, each with 2 NICs. One NIC per machine is dedicated to drbd. I didn't use a crossover cable, but connected those NICs to a managed switch, so I can easily shut down ports to simulate a network failure.

When all switch ports are enabled, everything works just fine. Drbd works, and I can swap the primary to the other machine (first drbdadm secondary all on the primary, then drbdadm primary all on the secondary) with no problem whatsoever.

The test goes like this. I have one drbd resource, r1, everything started and OK (connected and consistent), fs1 = primary, fs2 = secondary. At that point I shut down the switch port of the drbd NIC of fs1. A few seconds later both sides notice they can't connect to the other side and change the status of that side to Unknown.

Now I want fs2 to become primary (apparently something is wrong with fs1, so I want the application servers at the location of fs2 to take over with fs2 as fileserver), so I do a drbdadm primary all on fs2 and a drbdadm secondary all on fs1 (just to be sure, I can't have 2 primaries when I re-enable the switch port). Both sides update their status accordingly.

If I then re-enable the switch port, both sides "see" each other again, but won't reconnect, because fs1 wants to sync as source with fs2 as target. That seems totally wrong to me. I expect fs1 to become a secondary with fs2 as primary. Fs2 does refuse the sync (as it should) and aborts.

The strange part is that if I stop the drbd device on fs2 and restart it, it comes up as secondary (correct) and syncs back to fs1 with fs2 as source and fs1 as target, just as it should!

I'm running 0.7.13 on a 2.4 kernel. I hope someone can help me out with this!

Here are some parts from syslog:

Interface has been shutdown, make fs1 secondary:
Sep 12 17:04:07 fs1 kernel: drbd0: Primary/Unknown --> Secondary/Unknown

Idem make fs2 primary:
Sep 12 17:04:37 fs2 kernel: drbd0: Secondary/Unknown --> Primary/Unknown

Re-enabled the switchport:

On fs1:
Sep 12 17:05:34 fs1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 12 17:05:34 fs1 kernel: drbd0: Connection established.
Sep 12 17:05:34 fs1 kernel: drbd0: I am(S): 1:00000002:00000001:0000001c:00000010:00
Sep 12 17:05:34 fs1 kernel: drbd0: Peer(P): 1:00000002:00000001:0000001b:00000011:10
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate WFReportParams --> WFBitMapS
Sep 12 17:05:34 fs1 kernel: drbd0: meta connection shut down by peer.
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_asender [4378]: cstate WFBitMapS --> NetworkFailure
Sep 12 17:05:34 fs1 kernel: drbd0: asender terminated
Sep 12 17:05:34 fs1 kernel: drbd0: sock_sendmsg returned -104
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate NetworkFailure --> BrokenPipe
Sep 12 17:05:34 fs1 kernel: drbd0: short sent ReportBitMap size=4096 sent=2104
Sep 12 17:05:34 fs1 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate BrokenPipe --> BrokenPipe
Sep 12 17:05:34 fs1 kernel: drbd0: short read expecting header on sock: r=-512
Sep 12 17:05:34 fs1 kernel: drbd0: worker terminated
Sep 12 17:05:34 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate BrokenPipe --> Unconnected
Sep 12 17:05:34 fs1 kernel: drbd0: Connection lost.

On fs2:
Sep 12 17:05:34 fs2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 12 17:05:34 fs2 kernel: drbd0: Connection established.
Sep 12 17:05:34 fs2 kernel: drbd0: I am(P): 1:00000002:00000001:0000001b:00000011:10
Sep 12 17:05:34 fs2 kernel: drbd0: Peer(S): 1:00000002:00000001:0000001c:00000010:00
Sep 12 17:05:34 fs2 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Sep 12 17:05:34 fs2 kernel: drbd0: drbd0_receiver [17971]: cstate WFReportParams --> StandAlone
Sep 12 17:05:34 fs2 kernel: drbd0: error receiving ReportParams, l: 72!
Sep 12 17:05:34 fs2 kernel: drbd0: asender terminated
Sep 12 17:05:34 fs2 kernel: drbd0: worker terminated
Sep 12 17:05:34 fs2 kernel: drbd0: drbd0_receiver [17971]: cstate StandAlone --> StandAlone
Sep 12 17:05:34 fs2 kernel: drbd0: Connection lost.
Sep 12 17:05:34 fs2 kernel: drbd0: receiver terminated

After stopping and starting the drbd device on fs2:

On fs1:
Sep 12 17:07:41 fs1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 12 17:07:41 fs1 kernel: drbd0: Connection established.
Sep 12 17:07:41 fs1 kernel: drbd0: I am(S): 1:00000002:00000001:0000001c:00000010:00
Sep 12 17:07:41 fs1 kernel: drbd0: Peer(S): 1:00000002:00000001:0000001c:00000011:00
Sep 12 17:07:41 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate WFReportParams --> WFBitMapT
Sep 12 17:07:41 fs1 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Sep 12 17:07:41 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate WFBitMapT --> SyncTarget
Sep 12 17:07:41 fs1 kernel: drbd0: Resync started as SyncTarget (need to sync 0 KB [0 bits set]).
Sep 12 17:07:41 fs1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Sep 12 17:07:41 fs1 kernel: drbd0: drbd0_receiver [4357]: cstate SyncTarget --> Connected

On fs2:
Sep 12 17:07:41 fs2 kernel: drbd0: Connection established.
Sep 12 17:07:41 fs2 kernel: drbd0: I am(S): 1:00000002:00000001:0000001c:00000011:00
Sep 12 17:07:41 fs2 kernel: drbd0: Peer(S): 1:00000002:00000001:0000001c:00000010:00
Sep 12 17:07:41 fs2 kernel: drbd0: drbd0_receiver [18061]: cstate WFReportParams --> WFBitMapS
Sep 12 17:07:41 fs2 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Sep 12 17:07:41 fs2 kernel: drbd0: drbd0_receiver [18061]: cstate WFBitMapS --> SyncSource
Sep 12 17:07:41 fs2 kernel: drbd0: Resync started as SyncSource (need to sync 0 KB [0 bits set]).
Sep 12 17:07:41 fs2 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Sep 12 17:07:41 fs2 kernel: drbd0: drbd0_receiver [18061]: cstate SyncSource --> Connected

My /etc/drbd.conf (the same on both machines) looks like this (there's lots more in the file actually, but all commented out):

resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    on-io-error detach;
  }

  net {
  }

  syncer {
    rate 10M;
    group 1;
  }

  on fs1 {
    device    /dev/drbd0;
    disk      /dev/hda3;
    address   10.1.1.1:7788;
    meta-disk internal;
  }

  on fs2 {
    device    /dev/drbd0;
    disk      /dev/hda3;
    address   10.1.2.1:7788;
    meta-disk internal;
  }
}

I hope someone can help me debug this or tell me what I did wrong. TIA!

Regards,

-- 
Guus Houtzager
Email: guus at houtzager.net
PGP fingerprint = 5E E6 96 35 F0 64 34 14 CC 03 2B 36 71 FB 4B 5D

Early to rise, early to bed, makes a man healthy, wealthy and dead.
--Rincewind, The Light Fantastic
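
A side note on reading the "I am(...)" / "Peer(...)" lines in the syslog excerpts: the colon-separated values are easier to compare once they are parsed. The following is only a minimal sketch in Python, not anything taken from drbd itself. The helper names (parse_gen_line, first_difference) are made up for illustration, and the idea that the fields are compared left to right, with the side holding the larger value treated as having the newer data, is an assumption that the logs here do not confirm. Each field, including the trailing two-character one, is simply read as a hex number.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Matches the "I am(X): ..." / "Peer(X): ..." part of a syslog line.
GEN_LINE = re.compile(
    r"(?P<who>I am|Peer)\((?P<role>[PS])\):\s*"
    r"(?P<fields>[0-9a-fA-F]+(?::[0-9a-fA-F]+)*)"
)


class GenCounters(NamedTuple):
    who: str                  # "I am" or "Peer"
    role: str                 # "P" (primary) or "S" (secondary)
    values: Tuple[int, ...]   # each colon-separated field, read as hex


def parse_gen_line(line: str) -> GenCounters:
    """Extract the counters from one syslog line (hypothetical helper)."""
    m = GEN_LINE.search(line)
    if m is None:
        raise ValueError(f"no counters found in: {line!r}")
    values = tuple(int(f, 16) for f in m.group("fields").split(":"))
    return GenCounters(m.group("who"), m.group("role"), values)


def first_difference(
    a: GenCounters, b: GenCounters
) -> Optional[Tuple[int, int, int]]:
    """Return (field index, a's value, b's value) for the first field
    that differs, scanning left to right, or None if all fields match."""
    for i, (x, y) in enumerate(zip(a.values, b.values)):
        if x != y:
            return i, x, y
    return None


if __name__ == "__main__":
    # The two lines fs1 logged at 17:05:34, after the port was re-enabled.
    me = parse_gen_line(
        "drbd0: I am(S): 1:00000002:00000001:0000001c:00000010:00")
    peer = parse_gen_line(
        "drbd0: Peer(P): 1:00000002:00000001:0000001b:00000011:10")
    diff = first_difference(me, peer)
    if diff is not None:
        index, mine, theirs = diff
        print(f"first differing field: index {index}, "
              f"fs1={mine:#x}, fs2={theirs:#x}")
```

For the 17:05:34 lines the first field that differs is the fourth one (index 3), where fs1 has 0x1c and fs2 has 0x1b. If the left-to-right assumption holds, that would be consistent with fs1 going to WFBitMapS and fs2 logging "Current Primary shall become sync TARGET!". After the restart of drbd on fs2, that field is 0x1c on both sides and the first difference moves to the next field, where fs2 has the larger value (0x11 against 0x10).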