[DRBD-user] Failure to resync - resulted in data loss

Lars Ellenberg lars.ellenberg at linbit.com
Tue Sep 25 09:52:45 CEST 2007



On Tue, Sep 25, 2007 at 12:14:30PM +1000, Alexander Strachan wrote:
> 
> Quoting Lars Ellenberg <lars.ellenberg at linbit.com>:
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection established.
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: I am(S): 1:00000002:00000001:000000c5:00000033:10
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Peer(P): 1:00000002:00000001:000000c3:00000034:10
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFReportParams --> WFBitMapS
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock_sendmsg returned -32
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFBitMapS --> BrokenPipe
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short sent ReportBitMap size=4096 sent=0
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
> > >
> > > It has detected its partner but then breaks the connection.
> > > This has resulted in serious data loss.
> >
> > no. that _prevented_ data loss. see [*] below.
> >
> > > Is there a way to prevent this?
> >
> > you monitor your raid for degraded mode.
> > then please monitor drbd connection status as well, and fix it in time.
> >
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: meta connection shut down by peer.
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: asender terminated
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock was shut down by peer
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate BrokenPipe --> BrokenPipe
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short read expecting header on sock: r=0
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: worker terminated
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate BrokenPipe --> Unconnected
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection lost.
> > >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate Unconnected --> WFConnection
> > >   Sep 21 10:12:01 sinfids3a1 rc: Starting drbd:  succeeded
> > >
> > > The scenario:
> > >
> > > sinfids3a1 running as master
> > > host sinfids3a1 hung
> > > reboot sinfids3a1
> >
> >  apparently here or
> >
> > > sinfids3a2 running as master
> > > updated filesystem (nagios config)
> > > powered off/on sinfids3a1
> >
> >   there, drbd did not re-establish the connection,
> >   reason given below.
> >   so it was no longer replicating,
> >   sinfids3a1 humming along, just nursing its stale data.
> >
> > > [.....]
> > > failed system over to sinfids3a1
> >
> > and there you decided to go online with that stale data.
> >
> > > updates to the filesystem were LOST !!!
> >
> > right. you asked for it.
> >
> > [*]
> > because of the way the generation counters work in drbd 0.7,
> > after the reboot of sinfids3a1, it would have liked to become sync source.
> > luckily sinfids3a2 detected that this was nonsense,
> > and refused to become sync target, otherwise you'd have synced the stale
> > data to the live node right here and now.
> > that's why I say it _prevented_ data loss.
> >
> > in drbd 8, you'd get the infamous "split brain detected" message.
> >
> > this all only means that there has been a time where both nodes
> > have modified their data independently.
> >
> > sinfids3a1 during its shutdown,
> > when it was still Primary but had already closed the network.
> >
> > sinfids3a2 during the time when it took over.
> >
> > >   Sep 21 10:10:57 sinfids3a2 kernel: bcm5700: eth1 NIC Link is Down
> > >   Sep 21 10:11:00 sinfids3a2 kernel: bcm5700: eth1 NIC Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
> > >   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: drbd0_receiver [5318]: cstate WFConnection --> WFReportParams
> > >   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> > >   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Connection established.
> > >   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: I am(P): 1:00000002:00000001:000000c3:00000034:10
> > >   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Peer(S): 1:00000002:00000001:000000c5:00000033:10
> > >   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
> >
> > right there, you should have been alerted to the degraded
> 
> I will add in some alert checks but this will not stop HB from starting the DRBD
> resource against a secondary/unknown consistent filesystem.
> 
> > (not connected, not replicating) state of drbd,
> > and should have done "drbdadm connect" as appropriate
> > to reestablish connections.
> >
> 
> 
> I am using HB2 with drbddisk; when HB ran 'drbddisk start' it was
> successful and the system came up using the stale filesystem.
> 
> I didn't check the status of DRBD after the reboot of sinfids3a1.  Later on
> I instructed HB to failover by putting the current node to stand-by.
> 
> In summary:
> It is possible to have (after a failed host reboots due to either a
> software or hardware error):
> 
> sinfids3a1
>   DRBD secondary/unknown   consistent
>   HB host online status
>   HB 'drbddisk status' is Okay

     ???
my drbddisk says "running" if something is "Primary",
and "stopped" if it is "Secondary".
again, what am I missing?
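
for illustration, that status logic boils down to something like this
sketch (not the shipped drbddisk script, just an illustration; it
assumes drbd 0.7's "drbdadm state <res>" printing e.g. "Primary/Unknown"):

  #!/bin/sh
  # sketch only: map the local drbd role onto heartbeat's
  # "running"/"stopped" vocabulary, the way drbddisk status does.
  RES="$1"
  ROLE=$(drbdadm state "$RES" 2>/dev/null | cut -d/ -f1)

  case "$ROLE" in
      Primary) echo "running" ;;  # heartbeat considers the resource started
      *)       echo "stopped" ;;  # Secondary (or anything else) counts as stopped
  esac

note that it only looks at the local role; connection state and data
consistency do not enter into it at all.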

> sinfids3a2
>   DRBD primary/unknown     consistent
>   HB host online status
>   HB 'drbddisk status' is Okay
> 
> If host sinfids3a2 is put to HB Standby status then HB will mount the
> stale filesystem of sinfids3a1.
> 
> When sinfids3a1 initially connected with sinfids3a2 it did
> >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
> >   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: meta connection shut down by peer.
> 
> Would you have expected the status of DRBD on sinfids3a1 to have changed to
> Secondary/Unknown   inconsistent ?

it has.  if it did not, that would be a bug.
afaicr, drbd 0.7 does not log every single status change.

> This would have prevented HB from mounting the stale filesystem.

no, it would not.  what makes you think so?
what am I missing?

> You indicated that maybe the network was stopped before drbd; these are the
> relevant S/K init scripts - is the order okay?
> 
> [root at sinfids3b1 ~]# cat /etc/redhat-release
> Red Hat Enterprise Linux AS release 4 (Nahant Update 3)
> 
> [root at sinfids3b1 ~]# find /etc -type l -name "*drbd*" | sort
> /etc/rc.d/rc0.d/K08drbd
> /etc/rc.d/rc0.d/K90network

yes. but as you wrote in your original post,
drbd could not be "stopped" (made secondary),
because it was still in use: you still had it mounted.

> My concern is now that there is a possibility of
>   sinfids3a1  (now Primary) fails due to a software/hardware fault and is hardware-watchdogged
>   sinfids3a2  (is now primary)
>   sinfids3a1 has finished rebooting, HB is happy but DRBD has not reconnected.
>   sinfids3a2  (now Primary) fails due to a software/hardware fault and is hardware-watchdogged
>   sinfids3a1  (is now primary) but using a stale filesystem.

you are talking about multiple error scenarios.
you can always find some scenario that will screw up.

you have a raid5, first one disk fails, then you have a power spike
which destroys another one, data is gone.  now, if you had replaced the
first disk in time, and were lucky that it resynced in time ...

point being, you need to recognize "degraded" mode. in time.
and fix it. in time.
before the next failure pops up.
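
a trivial way to do that for drbd is to poll /proc/drbd, say from
cron, and alert whenever a device is not Connected.  just a sketch --
the mail recipient is a placeholder, adapt the alerting to whatever
you already use for your raid monitoring:

  #!/bin/sh
  # sketch of a "degraded mode" check for drbd, run e.g. from cron.
  # alerts whenever any device in /proc/drbd is not in connection
  # state "Connected" (a resync in progress triggers it too, which
  # is probably what you want).
  ALERT_TO="root"   # placeholder recipient

  BAD=$(grep 'cs:' /proc/drbd | grep -v 'cs:Connected')
  if [ -n "$BAD" ]; then
      echo "$BAD" | mail -s "drbd degraded on $(hostname)" "$ALERT_TO"
  fi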

> How to prevent this from happening?  Is there a way to modify the drbddisk
> script so that it can tell "the partner is there but there is a connectivity
> problem" apart from "the partner is not there"?  For the first scenario,
> when HB issues 'drbddisk start' it should fail.

probably.  but you then still do not cover the corner case of a "real"
split brain (where the heartbeat communication is gone, too).

drbd 8 can be configured to call a user land helper program
we call drbd-peer-outdater, which will (try to) mark the other
node's data as "outdated" via e.g. the redundant heartbeat
communication channels, when a Primary loses connection,
or when an unconnected Secondary shall become Primary.
this will do almost what you ask for,
and should be easily adapted to do whatever you need.
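
roughly, the relevant drbd.conf pieces look like this.  the handler
keyword and the helper's install path differ a bit between drbd 8
releases (later ones call the handler fence-peer), so treat it as a
sketch and check your drbd.conf man page:

  resource r0 {
      disk {
          fencing resource-only;    # let drbd call the fencing handler
      }
      handlers {
          # helper shipped with drbd/heartbeat; it talks over the
          # redundant heartbeat channels and marks the peer's data
          # as "Outdated"
          outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
      }
      # ... rest of the resource definition unchanged
  }

an outdated node then refuses to become Primary until it has resynced,
which is the behaviour you were missing above.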

probably we still have corner cases there: during a rejoin
after a cluster reboot following a cluster partition,
both nodes may have different (possibly diverging) data sets,
both are Secondary,
heartbeat would decide to make one of them Primary,
which may be the "wrong" one,
and which then "outdates" the other one.

one "generic" solution is to not allow the Secondary -> Primary transition
when not connected. only, then you don't have HA failover anymore ;->
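
if you want to experiment with that rule anyway, a wrapper around the
promotion is enough.  again just a sketch, not a drop-in drbddisk
replacement:

  #!/bin/sh
  # hypothetical wrapper: refuse the Secondary -> Primary transition
  # unless the resource is connected to its peer.
  RES="$1"

  CS=$(drbdadm cstate "$RES" 2>/dev/null)
  case "$CS" in
      Connected|SyncSource)
          exec drbdadm primary "$RES"
          ;;
      *)
          # WFConnection, StandAlone, SyncTarget, ...:
          # fail here, so heartbeat does not go live on stale data.
          echo "refusing to promote $RES: connection state is '$CS'" >&2
          exit 1
          ;;
  esac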

happy to hear about your concepts, if they do get it right
in every possible multiple failure scenario.

just remember to not "fix" only the one cornercase you have in mind now,
because that would likely be the wrong fix for some other cornercase.
any solution here needs to be generic.

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please use the "List-Reply" function of your email client.


