[DRBD-user] DRBD I/O problems & corrupted data

Lars Ellenberg lars.ellenberg at linbit.com
Mon Jan 26 10:55:07 CET 2015

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Fri, Jan 16, 2015 at 08:53:16PM +0100, Saso Slavicic wrote:
> Hello,
>  
> We have been using DRBD for about 4 years now, so I have some
> experience with it. Today it was the first time that DRBD actually
> caused data loss…
>  
> We mostly use DRBD in cases where older hardware is reused (upgraded
> with HDDs) and serves as a hot storage in case of primary node
> failure. This also means a lot of usage on md arrays, as usually only
> the primary will have a suitable RAID controller with BBU. I
> understand the shortcomings of such setups and have always been
> successful in achieving suitable performance. Nothing is perfect and
> you need to work with what you have…
>  
> Most of the setups are still on 8.3 (with CentOS 6) and I have been
> testing 8.4 on some nodes and here’s where the problem lies. One of
> the setups actually replicates over WAN with proto A (no proxy…),
> again not perfect but you do what you have to do. The system was
> running stable since 2012…
>  
> I have been using disk-timeout option (set at 900) and ko-count (5).
> Again, this is not perfect (I read the documentation) but we have been
> bitten a few times when a bad sector on a disk rendered the entire
> cluster unresponsive. This is unacceptable. Having some sort of
> mechanism to detach local storage in case of problems seems to be one
> reason for making a cluster in the first place…
>  
> The problem starts at approx. 10:21:
>  
> Jan 16 10:21:33 primary kernel: block drbd2: Local backing device failed to meet the disk-timeout
> Jan 16 10:21:33 primary kernel: block drbd2: disk( UpToDate -> Failed )
> Jan 16 10:21:33 primary kernel: block drbd2: Local IO failed in request_timer_fn. Detaching...
> Jan 16 10:21:33 primary kernel: block drbd2: local WRITE IO error sector 111041344+8 on md22
> Jan 16 10:21:33 primary kernel: block drbd2: helper command: /sbin/drbdadm local-io-error minor-2

From your config below:
>                 local-io-error   "/usr/lib/drbd/notify-io-error.sh; drbdadm detach $DRBD_RESOURCE";

VERY bad idea.
Synchronously calling drbdadm from inside a synchronous handler will block
(until that drbdadm eventually times out, 121 seconds later).
And it is absolutely useless: DRBD will detach after a local I/O error all by itself.

This is the trigger for why your peer does not notice, and therefore does not
bump its UUIDs in time, which in turn is the reason for the wrong sync
direction. It is also the trigger for why DRBD thinks the peer was
non-responsive: the handler blocked the DRBD worker thread for too long.
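
For reference, a handler line along the lines suggested further down
(a minimal sketch: notify only, backgrounded, and no drbdadm call,
because DRBD detaches by itself):

	handlers {
		# just notify, in the background; do NOT call drbdadm here
		local-io-error "/usr/lib/drbd/notify-io-error.sh &";
	}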

> Local device is timing out after 90s (!), without any I/O related errors in
> the kernel. The disks have recently been upgraded to SSDs, so bad blocks are
> not really possible!

Possibly you hit this:
http://git.linbit.com/gitweb.cgi?p=drbd-8.4.git;a=commitdiff;h=795c7d61b3765a625aee8e0996d79f21d4f8a518
| Fix spurious disk-timeout
| 
| (You should not use disk-timeout anyways,
|  see the man page for why...)
| 
| We add incoming requests to the tail of some ring list.
| On local completion, requests are removed from that list.
| The timer looks only at the head of that ring list,
| so is supposed to only see the oldest request.
| All protected by a spinlock.
| 
| The request object is created with timestamps zeroed out.
| The timestamp was only filled in just before the actual submit.
| But to actually submit the request, we need to give up the spinlock.
| 
| If you are unlucky, there is no older still pending request, the timer
| looks at a new request with timestamp still zero (before it even was
| submitted), and 0 + timeout is most likely older than "now".
| 
| Better assign the timestamp right when we put the
| request object on said ring list.

> Jan 16 10:22:33 primary kernel: block drbd2: Remote failed to finish a request within ko-count * timeout
> Jan 16 10:22:33 primary kernel: block drbd2: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
> Jan 16 10:22:33 primary kernel: drbd kvmimages: susp( 0 -> 1 )
> Jan 16 10:22:33 primary kernel: drbd kvmimages: asender terminated
> Jan 16 10:22:33 primary kernel: drbd kvmimages: Terminating drbd_a_kvmimages
>  
> Exactly one minute later (!), the remote stops responding (both fail at the same time?). As on-suspend is configured, all I/O is suspended.
> Ok, unusual, but still ok so far…

That's a side effect of your blocking the worker...

I would expect that you also have a few
	"INFO: task drbd_w_... blocked for more than 120 seconds."
messages, followed by call traces, in the kernel log?
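
(A quick way to check, nothing DRBD specific; just grep the kernel log for
the hung task warnings and the call traces following them:)

	dmesg | grep -A 15 'blocked for more than 120 seconds'
	# or, with syslog as in your excerpts:
	grep -A 15 'blocked for more than' /var/log/messages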

But yes, in case someone blocks the worker,
we should not blame the peer for being "unresponsive",
we are unresponsive ourselves.
Will try to distinguish this in the future.

> Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm local-io-error minor-2 exit code 20 (0x1400)

Finally, that drbdadm times out, returns an exit code saying "did not work, I'm sorry",
and the worker continues. The next step would have been to notify the peer about the failed disk,
but since the connection has already dropped as well, that can no longer be done.

I'm going to change the order here: first notify the peer, then call the handler.

> Jan 16 10:23:34 primary kernel: block drbd2: bitmap WRITE of 0 pages took 0 jiffies
> Jan 16 10:23:34 primary kernel: block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Jan 16 10:23:34 primary kernel: block drbd2: disk( Failed -> Diskless )
> Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm pri-on-incon-degr minor-2
> Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm pri-on-incon-degr minor-2 exit code 0 (0x0)
> Jan 16 10:23:34 primary kernel: drbd kvmimages: Connection closed
> Jan 16 10:23:34 primary kernel: drbd kvmimages: out of mem, failed to invoke fence-peer helper

^^
The kernel changed in incompatible ways.
DRBD has since learned a workaround:
http://git.linbit.com/gitweb.cgi?p=drbd-8.4.git;a=commitdiff;h=e998365475194a8faf31a86081e88034d7bd1a41

> Jan 16 10:23:34 primary kernel: drbd kvmimages: conn( Timeout -> Unconnected )
> Jan 16 10:23:34 primary kernel: drbd kvmimages: receiver terminated
> Jan 16 10:23:34 primary kernel: drbd kvmimages: Restarting receiver thread
> Jan 16 10:23:34 primary kernel: drbd kvmimages: receiver (re)started
> Jan 16 10:23:34 primary kernel: drbd kvmimages: conn( Unconnected -> WFConnection )
> Jan 16 10:23:35 primary kernel: drbd kvmimages: Handshake successful: Agreed network protocol version 101
> Jan 16 10:23:35 primary kernel: drbd kvmimages: Agreed to support TRIM on protocol level
> Jan 16 10:23:35 primary kernel: drbd kvmimages: conn( WFConnection -> WFReportParams )
> Jan 16 10:23:35 primary kernel: drbd kvmimages: Starting asender thread (from drbd_r_kvmimages [42146])
> Jan 16 10:23:35 primary kernel: block drbd2: receiver updated UUIDs to effective data uuid: C7F11A2B6505F460

There.
This is actually something we may need to fix.
But it is in fact a multiple-error scenario, including a misconfiguration, that we had not thought of yet.
Thanks for abusing DRBD in creative ways, and for reporting the outcome ;-)

If a Secondary with intact data connects to a Primary *without* data,
it needs to bump the UUID before the handshake,
just as it bumps the UUID while we are connected,
if the Primary loses its data,
or if we promote a diskless peer to Primary

> Jan 16 10:23:35 primary kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
> Jan 16 10:23:35 primary kernel: block drbd2: Should have called drbd_al_complete_io(, 111041344, 4096), but my Disk seems to have failed :(
> Jan 16 10:23:35 primary kernel: drbd kvmimages: susp( 1 -> 0 )
>  
> Helpers called, remote reconnected, suspend is lifted…
> At this point, Primary is running diskless. Now, here comes the fun
> part…the admin (me) notices this after a couple of hours (I was away)
> and re-attaches the disk on primary node (as per
> documentation: “If using internal meta data, it is sufficient to bind
> the DRBD device to the new hard disk.”). Here’s where everything goes
> south:

Yep.

> Jan 16 12:22:03 primary kernel: block drbd2: disk( Diskless -> Attaching )
> Jan 16 12:22:03 primary kernel: block drbd2: recounting of set bits took additional 3 jiffies
> Jan 16 12:22:03 primary kernel: block drbd2: 4948 MB (1266688 bits) marked out-of-sync by on disk bit-map.
> Jan 16 12:22:03 primary kernel: block drbd2: disk( Attaching -> Negotiating )
> Jan 16 12:22:03 primary kernel: block drbd2: attached to UUIDs C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB
> Jan 16 12:22:03 primary kernel: block drbd2: drbd_sync_handshake:
> Jan 16 12:22:03 primary kernel: block drbd2: self C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1266688 flags:0
> Jan 16 12:22:03 primary kernel: block drbd2: peer C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:53805 flags:0

identical UUIDs would mean "equivalent" data...
which is obviously not the case.

> Jan 16 12:22:03 primary kernel: block drbd2: uuid_compare()=1 by rule 40
> Jan 16 12:22:03 primary kernel: block drbd2: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )
>  
> Why is primary disk now UpToDate, when it was offline for 2 hours? It should be inconsistent and resynced from the Secondary…
> To make matters worse, pull-ahead is triggered at the same time (remember, replicating over WAN), since 5GB was marked out-of-sync by making the disk on primary up to date:

Because the UUIDs have not been bumped.

Because you had the bad idea to be smarter than DRBD
and help it detach while it was detaching,
thereby blocking it for 121 seconds,
which also caused it to declare the peer unresponsive,
and made it impossible to notify it about the event.

Because we did not anticipate that you would block the worker,
and so called the handler before notifying the peer.

And because we did not anticipate a suspended diskless primary
to connect to a Secondary without previously bumped UUIDs.

Because you used disk-timeout, which is documented as don't use,
and because it triggered when it should not have.
But if it had triggered for a reason, the outcome would have been the same.

So, a few BUGs on the side of DRBD, in code paths that would not have been
reached if you had not blocked the worker for two minutes, trying to
do something that DRBD was just about to do for you.

> Jan 16 12:22:03 primary kernel: block drbd2: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )
> Jan 16 12:22:03 primary kernel: block drbd2: Congestion-fill threshold reached
> Jan 16 12:22:03 primary kernel: block drbd2: conn( WFBitMapS -> Ahead )

Uh.  Interesting state change.

> Jan 16 12:22:03 primary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 4846(2), total 4846; compression: 99.9%
>  
> Now the disks stay in this state, the sync is actually not starting:

No, it went "Ahead" already.

> [root at primary] # cat /proc/drbd
> version: 8.4.5 (api:1/proto:86-101)
> GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil at Build64R6, 2014-10-28 10:32:53
>  
> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>     ns:48488816 nr:0 dw:106632540 dr:213443889 al:24231 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 2: cs:Ahead ro:Primary/Secondary ds:UpToDate/Consistent A r-----
>     ns:0 nr:0 dw:564028 dr:163202 al:52275 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5190228
> 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>     ns:13699098 nr:0 dw:34361764 dr:248373080 al:3374 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  
> [root at secondary] # cat /proc/drbd
> version: 8.4.5 (api:1/proto:86-101)
> GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil at Build64R6, 2014-10-28 10:32:53
>  
> 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:0 nr:49240716 dw:61104624 dr:176250476 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 2: cs:Behind ro:Secondary/Primary ds:Outdated/UpToDate A r-----
>     ns:53230 nr:46356310 dw:48639878 dr:55173318 al:886 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5241020
> 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:0 nr:21494090 dw:21494090 dr:195651772 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  
> With both nodes sure that the primary is UpToDate!
> Recognizing the problem, I started shutting down services and copying
> all the data (whatever it was) off the drbd partitions…somewhere along
> the copying, maybe the WAN link failed or something else happened,

It probably just became quiet enough to drop out of "Ahead"...

> drbd started resyncing:

Whatever.  Everything after this is no longer interesting.
The real issue was that it did not bump the UUID when it should have.
Because you blocked it.

> There’s nothing in the logs of the Secondary to indicate any disk problems around 10:22:

If it had been notified, it would have bumped its UUIDs.

> Jan 16 10:22:33 secondary kernel: drbd kvmimages: sock was shut down by peer
> Jan 16 10:22:33 secondary kernel: drbd kvmimages: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
> Jan 16 10:22:33 secondary kernel: drbd kvmimages: short read (expected size 16)
> Jan 16 10:22:33 secondary kernel: drbd kvmimages: meta connection shut down by peer.
> Jan 16 10:22:33 secondary kernel: drbd kvmimages: asender terminated
> Jan 16 10:22:33 secondary kernel: drbd kvmimages: Terminating drbd_a_kvmimages
> Jan 16 10:22:33 secondary kernel: drbd kvmimages: Connection closed
> Jan 16 10:22:33 secondary kernel: drbd kvmimages: conn( BrokenPipe -> Unconnected )
> Jan 16 10:22:33 secondary kernel: drbd kvmimages: receiver terminated
> Jan 16 10:22:34 secondary kernel: drbd kvmimages: Restarting receiver thread
> Jan 16 10:22:34 secondary kernel: drbd kvmimages: receiver (re)started
> Jan 16 10:22:34 secondary kernel: drbd kvmimages: conn( Unconnected -> WFConnection )
> Jan 16 10:23:35 secondary kernel: drbd kvmimages: Handshake successful: Agreed network protocol version 101
> Jan 16 10:23:35 secondary kernel: drbd kvmimages: Agreed to support TRIM on protocol level
> Jan 16 10:23:35 secondary kernel: drbd kvmimages: conn( WFConnection -> WFReportParams )
> Jan 16 10:23:35 secondary kernel: drbd kvmimages: Starting asender thread (from drbd_r_kvmimages [11935])
> Jan 16 10:23:35 secondary kernel: block drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )
>  
> As far as it is concerned, the primary terminated the connection!?

Consistent with what you told before: the Primary thought the peer was no longer responding.
It is probably not the smartest idea to kick out the peer if that's the only data we still have,
but it still is intentional and covers some use cases.
This is also a side effect of your blocking the worker thread.

This handshake would have been the last possible time to bump the UUID,
as mentioned before.  That's a TODO for us.

> After primary was reattached:
>  
> Jan 16 12:22:03 secondary kernel: block drbd2: real peer disk state = Consistent
> Jan 16 12:22:03 secondary kernel: block drbd2: drbd_sync_handshake:
> Jan 16 12:22:03 secondary kernel: block drbd2: self C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:0 flags:0
> Jan 16 12:22:03 secondary kernel: block drbd2: peer C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1266688 flags:2

And there you can observe the implications:
again, identical UUIDs would mean "equivalent" data...
which is obviously not the case.

> Jan 16 12:22:03 secondary kernel: block drbd2: uuid_compare()=-1 by rule 40
> Jan 16 12:22:03 secondary kernel: block drbd2: conn( Connected -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( Diskless -> UpToDate )
> Jan 16 12:22:04 secondary kernel: block drbd2: conn( WFBitMapT -> Behind )

Then you even manage to go "Ahead/Behind" during bitmap exchange.
I'm not sure I've seen that before.

> Jan 16 12:22:04 secondary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 4846(2), total 4846; compression: 99.9%
> Jan 16 12:22:04 secondary kernel: block drbd2: unexpected cstate (Behind) in receive_bitmap

Absolutely.

> And then happily overwrote its data:

Yes.

> Now, sure, the documentation states that disk-timeout is dangerous. I
> accept that, ‘buy better hardware’… but the point here is that there
> doesn’t seem to be anything wrong with the hardware. And if there
> was, drbd should be able to recognize where the UpToDate data is –
> that’s the whole point. The most interesting part is that this was
> actually the second occurrence of a very similar problem. The
> first was on vastly different hardware (PERC H710 with BBU…); there’s
> absolutely no way it would time out:
>  
> Jan 12 19:36:30 glinx kernel: block drbd1: Local backing device failed to meet the disk-timeout
> Jan 12 19:36:30 glinx kernel: block drbd1: disk( UpToDate -> Failed )
> Jan 12 19:36:30 glinx kernel: block drbd1: Local IO failed in request_timer_fn. Detaching...

> Jan 12 19:36:30 glinx kernel: block drbd1: helper command: /sbin/drbdadm local-io-error minor-1

> Jan 12 19:37:30 glinx kernel: block drbd1: Remote failed to finish a request within ko-count * timeout
> Jan 12 19:37:30 glinx kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
> Jan 12 19:37:30 glinx kernel: drbd dbserver: susp( 0 -> 1 )
> Jan 12 19:37:30 glinx kernel: drbd dbserver: asender terminated
> Jan 12 19:37:30 glinx kernel: drbd dbserver: Terminating drbd_a_dbserver

> Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm local-io-error minor-1 exit code 20 (0x1400)

*Again*, your blocking "drbdadm detach" in the local-io-error handler is the trigger.
Don't do that. You should even background the "notify", if you really insist on using it.

> Jan 12 19:38:31 glinx kernel: block drbd1: bitmap WRITE of 0 pages took 0 jiffies
> Jan 12 19:38:31 glinx kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Jan 12 19:38:31 glinx kernel: block drbd1: disk( Failed -> Diskless )
> Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm pri-on-incon-degr minor-1
> Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm pri-on-incon-degr minor-1 exit code 0 (0x0)
> Jan 12 19:38:31 glinx kernel: drbd dbserver: Connection closed
> Jan 12 19:38:31 glinx kernel: drbd dbserver: out of mem, failed to invoke fence-peer helper
> Jan 12 19:38:31 glinx kernel: drbd dbserver: conn( Timeout -> Unconnected )
> Jan 12 19:38:31 glinx kernel: drbd dbserver: receiver terminated
> Jan 12 19:38:31 glinx kernel: drbd dbserver: Restarting receiver thread
> Jan 12 19:38:31 glinx kernel: drbd dbserver: receiver (re)started
> Jan 12 19:38:31 glinx kernel: drbd dbserver: conn( Unconnected -> WFConnection )
> Jan 12 19:38:32 glinx kernel: drbd dbserver: Handshake successful: Agreed network protocol version 101
> Jan 12 19:38:32 glinx kernel: drbd dbserver: Agreed to support TRIM on protocol level
> Jan 12 19:38:32 glinx kernel: drbd dbserver: conn( WFConnection -> WFReportParams )
> Jan 12 19:38:32 glinx kernel: drbd dbserver: Starting asender thread (from drbd_r_dbserver [32682])
> Jan 12 19:38:32 glinx kernel: block drbd1: receiver updated UUIDs to effective data uuid: 43734559ED5920C0
> Jan 12 19:38:32 glinx kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
> Jan 12 19:38:32 glinx kernel: block drbd1: Should have called drbd_al_complete_io(, 29734872, 4096), but my Disk seems to have failed :(
> Jan 12 19:38:32 glinx kernel: drbd dbserver: susp( 1 -> 0 )
>  
> Again the same set of events… the local disk times out, 1 min later the remote
> times out. After reattaching, the local device thinks it’s UpToDate:

And again the same cause.
And the same fix.
Don't block the local-io-error handler,
and all the rest will not happen.

> SyncSource, how can you be SyncSource if the disk was not attached for 15
> mins? Note that this setup is running proto B on local gigabit and no
> pull-ahead.  This is far too common to be a coincidence and unless I’m making
> a big error somewhere, shouldn’t be happening.  Also note that all the
> helpers are left in some weird state (this was captured at 16:57):
>  
> root     103848  0.0  0.0  11336   612 ?        S    10:21   0:00 /bin/bash /usr/lib/drbd/notify-io-error.sh
> root     103849  0.0  0.0   4060   616 ?        S    10:21   0:00 logger -t notify-io-error.sh[103847] -p local5.info
> root     103931  0.0  0.0  11336   612 ?        S    10:23   0:00 /bin/bash /usr/lib/drbd/notify-pri-on-incon-degr.sh
> root     103932  0.0  0.0   4060   616 ?        S    10:23   0:00 logger -t notify-pri-on-incon-degr.sh[103930] -p local5.info

We probably need the equivalent of
http://git.linbit.com/gitweb.cgi?p=drbd-utils.git;a=commitdiff;h=e871d3c52bb110c170068fa496a871e5b187490e
for "notify.sh" as well.
| Bash, or some versions of it, may decide to involve a subshell
| for the output redirection to logger, so the process tree would look
| like
|   main script
|     sub-shell
|       logger
| 
| Sometimes, the sub-shell would linger around,
| and keep logger from exiting, even if the main-shell finished already.
| 
| The second exec replaces the subshell, and logger will notice directly
| when its stdin is closed once the main script exits.
| This avoids the spurious logger processes.
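
The pattern in question looks roughly like this (a sketch, not the literal
script contents; tag and priority taken from your process listing above):

	#!/bin/bash
	# problematic: bash may keep a sub-shell around for the process
	# substitution, and that sub-shell keeps logger alive even after
	# the main script has exited:
	#   exec > >(logger -t "notify-io-error.sh[$$]" -p local5.info) 2>&1
	#
	# fixed as in the commit above: the inner exec replaces the sub-shell
	# with logger itself, so logger sees EOF as soon as the script exits:
	exec > >(exec logger -t "notify-io-error.sh[$$]" -p local5.info) 2>&1

	echo "local I/O error on DRBD resource $DRBD_RESOURCE"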

> Configuration from the first (WAN) cluster setup:
>  
> common {
>         handlers {
>                 pri-on-incon-degr /usr/lib/drbd/notify-pri-on-incon-degr.sh;
>                 pri-lost-after-sb /usr/lib/drbd/notify-pri-lost-after-sb.sh;
>                 local-io-error   "/usr/lib/drbd/notify-io-error.sh; drbdadm detach $DRBD_RESOURCE";

VERY bad idea, also useless, and the reason for all your trouble.
If you insist, rather do "/usr/lib/drbd/notify-io-error.sh &";

>                 out-of-sync      "/etc/scripts/drbd-verify.sh out-of-sync";
>                 split-brain      "/usr/lib/drbd/notify-split-brain.sh root";
>         }
>  
>         startup {
>                 degr-wfc-timeout     30;
>                 outdated-wfc-timeout 30;
>                 wfc-timeout          30;
>         }
>  
>         options {
>                on-no-data-accessible suspend-io;
>         }
>  
>         disk {
>                on-io-error      call-local-io-error;
>                 disk-timeout     900;

Instead of using the DRBD disk-timeout,
please consider reducing the SCSI midlayer timeouts,
so DRBD will see a real error "in time":
	/sys/block/<device-name>/device/timeout
Note that this timeout may need to expire several times
(possibly once for each of the escalating recovery attempts)
until the device finally goes offline.

If you ever hit a real timeout,
but then a successful completion is
delivered after that timeout,
DRBD has to *panic* to avoid random memory corruption.
(I'm not sure whether we would even need to panic
for an error completion as well...  currently we don't.)
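
(For reference, that knob; "sdX" is a placeholder for the actual backing
device, and 30 seconds is just an example value:)

	# current SCSI command timeout of the backing device, in seconds
	cat /sys/block/sdX/device/timeout
	# lower it so real errors surface sooner; not persistent across reboots,
	# so re-apply it from a udev rule or an init script
	echo 30 > /sys/block/sdX/device/timeout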

>                 c-plan-ahead      20;
>                 c-fill-target     3M;
>                 c-min-rate        1M;
>                 c-max-rate       25M;
>         }
>         net {
>                 protocol                        A;
>                 ko-count                        5;
>                 on-congestion           pull-ahead;
>                 congestion-fill         7000K;

Really? For 7 megabytes you want to pull ahead already?
You basically won't have a consistent Secondary, ever.
Why not use rsync or csync2 instead...  Or at least also configure a
before-resync-target handler that takes a snapshot?
(That would also have saved your last good data in this incident.)
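
If you go the snapshot route, the LVM helper scripts shipped with drbd-utils
can be wired in roughly like this (assuming the lower device is an LVM logical
volume with some free space left in its volume group):

	handlers {
		# snapshot the sync target's backing LV before a resync
		# overwrites it, drop the snapshot once the resync finished
		before-resync-target /usr/lib/drbd/snapshot-resync-target-lvm.sh;
		after-resync-target  /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
	}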

>                 congestion-extents      127;
>  
>                 use-rle                         yes;
>                 verify-alg                      sha1;
>                 csums-alg                       sha1;
>         }
> }
>  
> resource kvmimages {
>         on primary {
>                 device    /dev/drbd2;
>                 disk      /dev/md22;
>                 address   172.16.10.1:7790;
>                 meta-disk internal;
>         }
>         on secondary {
>                 device    /dev/drbd2;
>                 disk      /dev/md6;
>                 address   172.16.10.6:7790;
>                 meta-disk internal;
>         }
>         net {
>                 sndbuf-size 10M;
>                 rcvbuf-size 512k;
>         }
> }
>  
> As problems like this never happened on 8.3 (a local disk failing without any
> obvious reason), I can’t say how 8.3 would react in a similar situation
> (diskless on primary)… But still, if drbd is running diskless on the primary, it
> shouldn’t overwrite the secondary’s data after it is back online. Surely the
> reattached disk is not supposed to be invalidated manually before
> reattaching?

No, the UUIDs should get bumped in time,
and all would work as intended.
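
If you want to play it extra safe until that is fixed: after re-attaching,
you could still force the re-attached disk to resync from the peer, roughly:

	drbdadm attach kvmimages
	drbdadm invalidate kvmimages   # marks the local disk Inconsistent,
	                               # forcing a full resync from the peer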

> Also I haven’t really checked why handlers are left running in the processes

Spurious logger subshells.
Apparently this depends on bash and distribution versions.
It has partially been fixed, see above.
Notify.sh would need to add an "exec 1>&-" in front of the "2>&- logger" as well.

> as I have just reverted back to 8.3.16 for now…

disk-timeout (as well as ko-count) in 8.3, and a number of other things,
are broken in different ways. Just saying.

Cheers,

	Lars


