[DRBD-user] Failure to resync - resulted in data loss

Lars Ellenberg lars.ellenberg at linbit.com
Mon Sep 24 11:33:29 CEST 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Mon, Sep 24, 2007 at 11:43:13AM +1000, Alex Strachan wrote:
> I have experienced a failure for DRBD to reconnect after a system
> forced power off/on due to a hung host.
>  
> DRBD 0.7.21
> Hearbeat 2.1.2
>  
> DRBD log output
>  
>   Sep 20 17:58:06 sinfids3a1 drbd: Stopping all DRBD resources
>   Sep 20 17:58:06 sinfids3a1 drbd: ioctl(,SET_STATE,) failed:
>  
> The host - on it's way down.
> At this point the filesystem was still active as Heartbeat didn't stop successfully.
>  
> Is there a way for drbd to handle this?
>  
>   Sep 20 17:58:06 sinfids3a1 drbd: Device or resource busy
>   Sep 20 17:58:06 sinfids3a1 drbd: Someone has opened the device for RW access!
>   Sep 20 17:58:06 sinfids3a1 drbd: Command '/sbin/drbdsetup /dev/drbd0 down' terminated with exit code 20
>   Sep 20 17:58:06 sinfids3a1 drbd: drbdsetup exited with code 20
>   Sep 20 17:58:06 sinfids3a1 drbd: ERROR: Module drbd is in use
>   Sep 20 17:58:06 sinfids3a1 drbd: .
>   Sep 20 17:58:06 sinfids3a1 rc: Stopping drbd:  succeeded
>  
> The host - on it's way up.
>  
>   Sep 21 10:11:49 sinfids3a1 drbd: [
>   Sep 21 10:11:49 sinfids3a1 drbd: d0
>   Sep 21 10:11:49 sinfids3a1 kernel: drbd0: Adjusting my ra_pages to backing device's (32 -> 1024)
>   Sep 21 10:11:49 sinfids3a1 kernel: klogd 1.4.1, ---------- state change ----------
>   Sep 21 10:11:49 sinfids3a1 kernel: drbd0: resync bitmap: bits=43350764 words=677356
>   Sep 21 10:11:49 sinfids3a1 kernel: drbd0: size = 165 GB (173403056 KB)
>   Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy
>   Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy
>   Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy
>   Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy
>   Sep 21 10:11:51 sinfids3a1 kernel: drbd0: 180 MB marked out-of-sync by on disk bit-map.
>   Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Found 6 transactions (324 active extents) in activity log.
>   Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Marked additional 983 MB as out-of-sync based on AL.
>   Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5217]: cstate Unconfigured --> StandAlone
>   Sep 21 10:11:51 sinfids3a1 drbd: s0
>   Sep 21 10:11:51 sinfids3a1 drbd: n0
>   Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5318]: cstate StandAlone --> Unconnected
>   Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate Unconnected --> WFConnection
>   Sep 21 10:11:51 sinfids3a1 drbd: ]
>   Sep 21 10:11:51 sinfids3a1 drbd: .
>   Sep 21 10:11:51 sinfids3a1 drbd: WARN: stdin/stdout is not a TTY; using /dev/console
>   Sep 21 10:11:51 sinfids3a1 kernel: drbd0: using degr_wfc_timeout=10 seconds
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFConnection --> WFReportParams
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection established.
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: I am(S): 1:00000002:00000001:000000c5:00000033:10
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Peer(P): 1:00000002:00000001:000000c3:00000034:10
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFReportParams --> WFBitMapS
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock_sendmsg returned -32
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFBitMapS --> BrokenPipe
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short sent ReportBitMap size=4096 sent=0
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Secondary/Unknown --> Secondary/Primary 
>  
> It has detected it's partner but then breaks the connection.
> This has resulted in serious data loss.

no. that _prevented_ data loss. see [*] below.

> Is there a way to prevent this?

you monitor your raid for degraded mode.
then please monitor drbd connection status as well, and fix it in time.  

>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: meta connection shut down by peer.
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: asender terminated
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock was shut down by peer
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate BrokenPipe --> BrokenPipe
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short read expecting header on sock: r=0
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: worker terminated
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate BrokenPipe --> Unconnected
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection lost.
>   Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate Unconnected --> WFConnection
>   Sep 21 10:12:01 sinfids3a1 rc: Starting drbd:  succeeded
>  
> The scenario:
>  
> sinfids3a1 running as master
> host sinfids3a1 hung
> reboot sinfids3a1 

 aparently here or

> sinfids3a2 running as master
> updated filesystem (nagios config)
> powered off/on sinfids3a1

  there, drbd did not re-establish the connection,
  reason given below.
  so it did no longer replicate,
  sinfids3a1 humming along just nursing its stale data.

> [.....]
> failed system over to sinfids3a1

and there you decided to go online with that stale data.

> updates to the filesystem were LOST !!!

right. you asked for it.

[*]
because of the way the generation counters work in drbd 0.7,
after the reboot of sinfids3a1, it would have liked to become sync source.
luckily sinfids3a2 detected that this was nonsense,
and refused to become sync target, otherwise you'd have synced the stale
data to the live node right here and now.
that's why I say it _prevented_ data loss.

in drbd 8, you'd get the infamous "split brain detected" message.

this all only means that there has been a time where both nodes
have modified their data independently.

sinfids3a1 during the time during its shutdown
where it still was primary but already closed the network.

sinfids3a2 during the time when it took over.

>   Sep 21 10:10:57 sinfids3a2 kernel: bcm5700: eth1 NIC Link is Down
>   Sep 21 10:11:00 sinfids3a2 kernel: bcm5700: eth1 NIC Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
>   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: drbd0_receiver [5318]: cstate WFConnection --> WFReportParams
>   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
>   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Connection established.
>   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: I am(P): 1:00000002:00000001:000000c3:00000034:10
>   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Peer(S): 1:00000002:00000001:000000c5:00000033:10
>   Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.

right there, you should have been alerted to the degraded
(not connected, not replicating) state of drbd,
and should have done "drbdadm connect" as aprorpiate
to reestablish connections.

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please use the "List-Reply" function of your email client.



More information about the drbd-user mailing list