<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";
        mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri","sans-serif";
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri","sans-serif";
        mso-fareast-language:EN-US;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=SL link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span lang=EN-US>Hello,<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>We have been using DRBD for about 4 years now, so I have some experience with it. Today it was the first time that DRBD actually caused data loss…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>We mostly use DRBD in cases where older hardware is reused (upgraded with HDDs) and serves as a hot storage in case of primary node failure. This also means a lot of usage on md arrays, as usually only the primary will have a suitable RAID controller with BBU. I understand the shortcomings of such setups and have always been successful in achieving suitable performance. Nothing is perfect and you need to work with what you have…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Most of the setups is still 8.3 (with CentOS 6) and I have been testing 8.4 on some nodes and here’s where the problem lies. One of the setups actually replicates over WAN with proto A (no proxy…), again not perfect but you do what you have to do. The system was running stable since 2012…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>I have been using disk-timeout option (set at 900) and ko-count (5). Again, this is not perfect (I read the documentation) but we have been bitten a few times when a bad sector on a disk rendered the entire cluster unresponsive. This is unacceptable. 
Having some sort of mechanism to detach local storage in case of problems seems to be one reason for making a cluster in the first place…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>The problem starts at approx. 10:21:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:21:33 primary kernel: block drbd2: Local backing device failed to meet the disk-timeout<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:21:33 primary kernel: block drbd2: disk( UpToDate -> Failed )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:21:33 primary kernel: block drbd2: Local IO failed in request_timer_fn. Detaching...<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:21:33 primary kernel: block drbd2: local WRITE IO error sector 111041344+8 on md22<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:21:33 primary kernel: block drbd2: helper command: /sbin/drbdadm local-io-error minor-2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>The local device times out after 90s (!), without any I/O-related errors in the kernel log. 
The disks were recently upgraded to SSDs, so bad blocks are not really a plausible cause!<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 primary kernel: block drbd2: Remote failed to finish a request within ko-count * timeout<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 primary kernel: block drbd2: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 primary kernel: drbd kvmimages: susp( 0 -> 1 )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 primary kernel: drbd kvmimages: asender terminated<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 primary kernel: drbd kvmimages: Terminating drbd_a_kvmimages<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Exactly one minute later (!), the remote stops responding (both fail at the same time?). 
As on-suspend is configured, all I/O is suspended.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Ok, unusual, but still ok so far…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm local-io-error minor-2 exit code 20 (0x1400)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: block drbd2: bitmap WRITE of 0 pages took 0 jiffies<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: block drbd2: disk( Failed -> Diskless )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm pri-on-incon-degr minor-2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm pri-on-incon-degr minor-2 exit code 0 (0x0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: drbd kvmimages: Connection closed<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: drbd kvmimages: out of mem, failed to invoke fence-peer helper<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: drbd kvmimages: conn( Timeout -> Unconnected )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: drbd kvmimages: receiver terminated<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: drbd kvmimages: Restarting receiver thread<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:34 primary kernel: drbd kvmimages: receiver (re)started<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 
10:23:34 primary kernel: drbd kvmimages: conn( Unconnected -> WFConnection )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 primary kernel: drbd kvmimages: Handshake successful: Agreed network protocol version 101<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 primary kernel: drbd kvmimages: Agreed to support TRIM on protocol level<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 primary kernel: drbd kvmimages: conn( WFConnection -> WFReportParams )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 primary kernel: drbd kvmimages: Starting asender thread (from drbd_r_kvmimages [42146])<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 primary kernel: block drbd2: receiver updated UUIDs to effective data uuid: C7F11A2B6505F460<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 primary kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 primary kernel: block drbd2: Should have called drbd_al_complete_io(, 111041344, 4096), but my Disk seems to have failed :(<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 primary kernel: drbd kvmimages: susp( 1 -> 0 )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Helpers called, remote reconnected, suspend is lifted…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>At this point, Primary is running diskless. Now, here comes the fun part…the admin (me) notices this after a couple of hours (I was away) and re-attaches the disk on primary node (as per documentation: “If using internal meta data, it is sufficient to bind the DRBD device to the new hard disk.”). 
Here’s where everything goes south:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: disk( Diskless -> Attaching )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: recounting of set bits took additional 3 jiffies<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: 4948 MB (1266688 bits) marked out-of-sync by on disk bit-map.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: disk( Attaching -> Negotiating )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: attached to UUIDs C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: drbd_sync_handshake:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: self C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1266688 flags:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: peer C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:53805 flags:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: uuid_compare()=1 by rule 40<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Why is primary disk now UpToDate, when it was offline for 2 hours? 
It should be inconsistent and resynced from the Secondary…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>To make matters worse, pull-ahead is triggered at the same time (remember, replicating over WAN), since 5GB was marked out-of-sync by making the disk on primary up to date:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: Congestion-fill threshold reached<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: conn( WFBitMapS -> Ahead )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 primary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 4846(2), total 4846; compression: 99.9%<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Now the disks stay in this state, the sync is actually not starting:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>[root@primary] # cat /proc/drbd<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>version: 8.4.5 (api:1/proto:86-101)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil@Build64R6, 2014-10-28 10:32:53<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> ns:48488816 nr:0 dw:106632540 dr:213443889 al:24231 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> 2: cs:Ahead 
ro:Primary/Secondary ds:UpToDate/Consistent A r-----<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> ns:0 nr:0 dw:564028 dr:163202 al:52275 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5190228<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> ns:13699098 nr:0 dw:34361764 dr:248373080 al:3374 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>[root@secondary] # cat /proc/drbd<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>version: 8.4.5 (api:1/proto:86-101)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil@Build64R6, 2014-10-28 10:32:53<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US> 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> ns:0 nr:49240716 dw:61104624 dr:176250476 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> 2: cs:Behind ro:Secondary/Primary ds:Outdated/UpToDate A r-----<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> ns:53230 nr:46356310 dw:48639878 dr:55173318 al:886 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5241020<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> ns:0 nr:21494090 dw:21494090 dr:195651772 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>With both nodes sure, that primary is UpToDate!<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Recognizing the problem, I 
started shutting down services and copying all the data (whatever it was) off the drbd partitions…somewhere along the copying, maybe the WAN link failed or something else happened, drbd started resyncing:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:16 primary kernel: block drbd2: new current UUID A30F012BA25C9803:C7F11A2B6505F461:9C59495AB871DD6E:2DCB0BB166FC17AB<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 primary kernel: block drbd2: drbd_sync_handshake:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 primary kernel: block drbd2: self A30F012BA25C9803:C7F11A2B6505F461:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1299480 flags:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 primary kernel: block drbd2: peer C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:1312123 flags:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 primary kernel: block drbd2: uuid_compare()=1 by rule 70<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 primary kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 primary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 6711(2), total 6711; compression: 99.8%<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 primary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 7015(2), total 7015; compression: 99.8%<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 primary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-source minor-2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 primary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-source 
minor-2 exit code 0 (0x0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 primary kernel: block drbd2: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 primary kernel: block drbd2: Began resync as SyncSource (will sync 5248492 KB [1312123 bits set]).<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 primary kernel: block drbd2: updated sync UUID A30F012BA25C9803:C7F21A2B6505F461:C7F11A2B6505F461:9C59495AB871DD6E<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 primary kernel: block drbd2: Resync done (total 289 sec; paused 0 sec; 18160 K/sec)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 primary kernel: block drbd2: 69 % had equal checksums, eliminated: 3672368K; transferred 1576124K total 5248492K<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 primary kernel: block drbd2: updated UUIDs A30F012BA25C9803:0000000000000000:C7F21A2B6505F461:C7F11A2B6505F461<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 primary kernel: block drbd2: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>…and overwritten approx. 
1.5GB of data on the secondary with data from the primary that was probably stale (mixed with new data written to the primary in the meantime).<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>There’s nothing in the logs of the Secondary to indicate any disk problems around 10:22:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 secondary kernel: drbd kvmimages: sock was shut down by peer<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 secondary kernel: drbd kvmimages: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 secondary kernel: drbd kvmimages: short read (expected size 16)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 secondary kernel: drbd kvmimages: meta connection shut down by peer.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 secondary kernel: drbd kvmimages: asender terminated<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 secondary kernel: drbd kvmimages: Terminating drbd_a_kvmimages<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 secondary kernel: drbd kvmimages: Connection closed<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 secondary kernel: drbd kvmimages: conn( BrokenPipe -> Unconnected )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:33 secondary kernel: drbd kvmimages: receiver terminated<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:34 secondary kernel: drbd kvmimages: Restarting receiver thread<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:34 secondary kernel: drbd kvmimages: receiver (re)started<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:22:34 secondary kernel: 
drbd kvmimages: conn( Unconnected -> WFConnection )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 secondary kernel: drbd kvmimages: Handshake successful: Agreed network protocol version 101<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 secondary kernel: drbd kvmimages: Agreed to support TRIM on protocol level<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 secondary kernel: drbd kvmimages: conn( WFConnection -> WFReportParams )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 secondary kernel: drbd kvmimages: Starting asender thread (from drbd_r_kvmimages [11935])<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 10:23:35 secondary kernel: block drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>As far as it is concerned, the primary terminated connection !?<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>After primary was reattached:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 secondary kernel: block drbd2: real peer disk state = Consistent<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 secondary kernel: block drbd2: drbd_sync_handshake:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 secondary kernel: block drbd2: self C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:0 flags:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 secondary kernel: block drbd2: peer C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1266688 flags:2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 secondary kernel: block drbd2: uuid_compare()=-1 by rule 
40<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:03 secondary kernel: block drbd2: conn( Connected -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( Diskless -> UpToDate )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:04 secondary kernel: block drbd2: conn( WFBitMapT -> Behind )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:04 secondary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 4846(2), total 4846; compression: 99.9%<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 12:22:04 secondary kernel: block drbd2: unexpected cstate (Behind) in receive_bitmap<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>And then the secondary happily overwrote its own data:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 secondary kernel: block drbd2: drbd_sync_handshake:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 secondary kernel: block drbd2: self C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:1312123 flags:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 secondary kernel: block drbd2: peer A30F012BA25C9803:C7F11A2B6505F461:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1299480 flags:2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 secondary kernel: block drbd2: uuid_compare()=-1 by rule 50<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:38 secondary kernel: block drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 secondary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 6711(2), total 6711; compression: 99.8%<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 
16 16:36:39 secondary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 7015(2), total 7015; compression: 99.8%<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 secondary kernel: block drbd2: conn( WFBitMapT -> WFSyncUUID )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 secondary kernel: block drbd2: updated sync uuid C7F21A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 secondary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 secondary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2 exit code 0 (0x0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 secondary kernel: block drbd2: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:36:39 secondary kernel: block drbd2: Began resync as SyncTarget (will sync 5248492 KB [1312123 bits set]).<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 secondary kernel: block drbd2: Resync done (total 289 sec; paused 0 sec; 18160 K/sec)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 secondary kernel: block drbd2: 69 % had equal checksums, eliminated: 3672368K; transferred 1576124K total 5248492K<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 secondary kernel: block drbd2: updated UUIDs A30F012BA25C9802:0000000000000000:C7F21A2B6505F460:C7F11A2B6505F461<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 secondary kernel: block drbd2: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 secondary kernel: block drbd2: helper command: /sbin/drbdadm 
after-resync-target minor-2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 16 16:41:29 secondary kernel: block drbd2: helper command: /sbin/drbdadm after-resync-target minor-2 exit code 0 (0x0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Now, sure, the documentation states that disk-timeout is dangerous. I accept that, ‘buy better hardware’…but the point here is that there doesn’t seem to be anything wrong with the hardware. And if there were, DRBD should be able to recognize where the UpToDate data is – that’s the whole point. The most interesting part is that this was actually the second occurrence of a very similar problem. The first was on vastly different hardware (PERC H710 with BBU…), where there’s absolutely no way it would time out:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:36:30 glinx kernel: block drbd1: Local backing device failed to meet the disk-timeout<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:36:30 glinx kernel: block drbd1: disk( UpToDate -> Failed )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:36:30 glinx kernel: block drbd1: Local IO failed in request_timer_fn. 
Detaching...<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:36:30 glinx kernel: block drbd1: helper command: /sbin/drbdadm local-io-error minor-1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:37:30 glinx kernel: block drbd1: Remote failed to finish a request within ko-count * timeout<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:37:30 glinx kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:37:30 glinx kernel: drbd dbserver: susp( 0 -> 1 )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:37:30 glinx kernel: drbd dbserver: asender terminated<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:37:30 glinx kernel: drbd dbserver: Terminating drbd_a_dbserver<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm local-io-error minor-1 exit code 20 (0x1400)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: block drbd1: bitmap WRITE of 0 pages took 0 jiffies<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: block drbd1: disk( Failed -> Diskless )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm pri-on-incon-degr minor-1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm pri-on-incon-degr minor-1 exit code 0 (0x0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: drbd dbserver: Connection closed<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 
glinx kernel: drbd dbserver: out of mem, failed to invoke fence-peer helper<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: drbd dbserver: conn( Timeout -> Unconnected )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: drbd dbserver: receiver terminated<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: drbd dbserver: Restarting receiver thread<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: drbd dbserver: receiver (re)started<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:31 glinx kernel: drbd dbserver: conn( Unconnected -> WFConnection )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:32 glinx kernel: drbd dbserver: Handshake successful: Agreed network protocol version 101<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:32 glinx kernel: drbd dbserver: Agreed to support TRIM on protocol level<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:32 glinx kernel: drbd dbserver: conn( WFConnection -> WFReportParams )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:32 glinx kernel: drbd dbserver: Starting asender thread (from drbd_r_dbserver [32682])<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:32 glinx kernel: block drbd1: receiver updated UUIDs to effective data uuid: 43734559ED5920C0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:32 glinx kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:32 glinx kernel: block drbd1: Should have called drbd_al_complete_io(, 29734872, 4096), but my Disk seems to have failed :(<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:38:32 glinx kernel: drbd dbserver: susp( 1 -> 0 
)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Again the same set of events…local disk timeouts, 1 min later remote timeouts. After reattaching, local device thinks it’s UpToDate:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:17 glinx kernel: block drbd1: disk( Diskless -> Attaching )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:17 glinx kernel: block drbd1: recounting of set bits took additional 4 jiffies<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:17 glinx kernel: block drbd1: 4948 MB (1266688 bits) marked out-of-sync by on disk bit-map.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:17 glinx kernel: block drbd1: disk( Attaching -> Negotiating )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:17 glinx kernel: block drbd1: attached to UUIDs 43734559ED5920C1:0000000000000000:47C0D52F0FB8F92A:47BFD52F0FB8F92A<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: drbd_sync_handshake:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: self 43734559ED5920C1:0000000000000000:47C0D52F0FB8F92A:47BFD52F0FB8F92A bits:1266688 flags:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: peer 43734559ED5920C0:0000000000000000:47C0D52F0FB8F92A:47BFD52F0FB8F92A bits:235 flags:0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: uuid_compare()=1 by rule 40<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: 
send bitmap stats [Bytes(packets)]: plain 0(0), RLE 3752(1), total 3752; compression: 100.0%<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 3752(1), total 3752; compression: 100.0%<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Jan 12 19:53:18 glinx kernel: block drbd1: Began resync as SyncSource (will sync 5066752 KB [1266688 bits set]).<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>SyncSource? How can you be SyncSource if the disk was not attached for 15 mins? Note that this setup is running proto B on local gigabit, with no pull-ahead.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>This is far too common to be a coincidence and, unless I’m making a big error somewhere, it shouldn’t be happening. Also note that all the helpers are left in some weird state (this was captured at 16:57):<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>root 103848 0.0 0.0 11336 612 ? S 10:21 0:00 /bin/bash /usr/lib/drbd/notify-io-error.sh<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>root 103849 0.0 0.0 4060 616 ? S 10:21 0:00 logger -t notify-io-error.sh[103847] -p local5.info<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>root 103931 0.0 0.0 11336 612 ? 
S 10:23 0:00 /bin/bash /usr/lib/drbd/notify-pri-on-incon-degr.sh<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>root 103932 0.0 0.0 4060 616 ? S 10:23 0:00 logger -t notify-pri-on-incon-degr.sh[103930] -p local5.info<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Configuration from the first (WAN) cluster setup:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>common {<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> handlers {<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> pri-on-incon-degr /usr/lib/drbd/notify-pri-on-incon-degr.sh;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> pri-lost-after-sb /usr/lib/drbd/notify-pri-lost-after-sb.sh;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> local-io-error "/usr/lib/drbd/notify-io-error.sh; drbdadm detach $DRBD_RESOURCE";<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> out-of-sync "/etc/scripts/drbd-verify.sh out-of-sync";<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> split-brain "/usr/lib/drbd/notify-split-brain.sh root";<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> }<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US> startup {<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> degr-wfc-timeout 30;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> outdated-wfc-timeout 30;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> wfc-timeout 30;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> }<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US> options {<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> on-no-data-accessible suspend-io;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> }<o:p></o:p></span></p><p class=MsoNormal><span 
lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US> disk { <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> on-io-error call-local-io-error;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> disk-timeout 900;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> c-plan-ahead 20;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> c-fill-target 3M;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> c-min-rate 1M;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> c-max-rate 25M;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> }<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> net {<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> protocol A;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> ko-count 5;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> on-congestion pull-ahead;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> congestion-fill 7000K;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> congestion-extents 127;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US> use-rle yes;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> verify-alg sha1;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> csums-alg sha1;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> }<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>}<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>resource kvmimages {<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> on primary {<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> device /dev/drbd2;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> disk /dev/md22;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> address 172.16.10.1:7790;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> meta-disk internal;<o:p></o:p></span></p><p 
class=MsoNormal><span lang=EN-US> }<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> on secondary {<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> device /dev/drbd2;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> disk /dev/md6;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> address 172.16.10.6:7790;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> meta-disk internal;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> }<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> net {<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> sndbuf-size 10M;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> rcvbuf-size 512k;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> }<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>}<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>As problems like this never happened on 8.3 (local disk failing without any obvious reason), I can’t say how 8.3 would react in a similar situation (diskless on primary)…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>But still, if DRBD is running diskless on the primary, it shouldn’t overwrite the secondary’s data after the disk is back online. Surely the reattached disk is not supposed to be invalidated manually before reattaching?<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Also, I haven’t really checked why the handlers are left running as processes, as I have just reverted to 8.3.16 for now…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Regards,<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Saso Slavicic<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p></div></body></html>