<div dir="ltr">I had a very peculiar situation with DRBD. <div>I performed a fail-over test, which worked, but after I added the failed node back as secondary (discarding its data) and performing another fail-over test, all data on the secondary node turned out to be completely useless (files contained garbage after the first block, files containing only garbage, etc.).</div><div><br></div><div><div><div>Here is my setup:<br><div><br></div><div>I run two DRBD nodes in primary-secondary mode. Usually the primary server is used, but in emergencies, the secondary, less powerful server could be activated manually to provide at least some service (crash-consistent, at least).</div></div></div><div><br></div><div>The primary server uses a LVM volume of 40 GiB, upon which DRBD acts, which is provided for the domU to create partitions on. The domU doesn&#39;t know DRBD is backing it. So inside the DRBD block device, it creates partitions as usual.<br></div><div>My mistake, as I assume, is that I checked the partition size from within the domU, and used that number to create the partition backing DRBD on the secondary server (which is a mere virtual server), which was now too small, since I should have checked the size of the LVM volume backing DRBD on the primary. </div><div><br></div><div>Synchronization worked, and so did fail-over: </div><div>- I disconnected the secondary node.</div><div>- I promoted it to primary.</div><div>- I mounted the DRBD volume (mount -o rw /dev/mapper/&quot;$drbdres&quot;1 /data).</div><div>- I ran tests, wrote to the disk, etc..</div><div><br></div><div><div>May 22 14:27:08 kernel: [935443.236935] block drbd2: peer( Primary -&gt; Unknown ) conn( Connected -&gt; Disconnecting ) pdsk( UpToDate -&gt; DUnknown )</div><div>May 22 14:27:08 kernel: [935443.238259] block drbd2: asender terminated</div><div>May 22 14:27:08 kernel: [935443.238263] block drbd2: Terminating drbd2_asender</div><div>May 22 14:27:08 kernel: [935443.238327] block drbd2: Connection closed</div><div>May 22 14:27:08 kernel: [935443.238331] block drbd2: conn( Disconnecting -&gt; StandAlone )</div><div>May 22 14:27:08 kernel: [935443.238340] block drbd2: receiver terminated</div><div>May 22 14:27:08 kernel: [935443.238342] block drbd2: Terminating drbd2_receiver</div><div>May 22 14:27:26 kernel: [935460.609680] block drbd2: role( Secondary -&gt; Primary )</div><div>May 22 14:28:24 kernel: [935519.197240] device-mapper: multipath: version 1.3.2 loaded</div><div>May 22 14:28:41 kernel: [935535.837046] device-mapper: table: 252:2: drbd2 too small for target: start=81793024, len=2093056, dev_size=83883448</div></div><div><div>May 22 14:38:02 kernel: [936096.382990] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 666094</div><div>May 22 14:38:02 kernel: [936096.384190] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683278</div><div>May 22 14:38:02 kernel: [936096.384199] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683273</div><div>May 22 14:38:02 kernel: [936096.384206] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683272</div><div>May 22 14:38:02 kernel: [936096.384211] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683271</div><div>May 22 14:38:02 kernel: [936096.384216] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683270</div><div>May 22 14:38:02 kernel: [936096.384220] EXT4-fs (dm-0): 6 orphan inodes deleted</div><div>May 22 14:38:02 kernel: [936096.384222] EXT4-fs (dm-0): recovery complete</div><div>May 22 14:38:02 kernel: [936096.385250] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)</div></div><div><br></div><div>Note in particular the message from device-mapper. The swap partition was at the referenced location, so I could work on the data partition, but I wonder if something already went wrong here. I did not try mounting the swap partition.</div><div><br></div><div>Since the now-promoted server would have been functional, I wanted to restore the previous state.</div><div>- I demoted and connected it as a secondary again, with --discard-my-data. </div><div>- About 1.2 GB were synced, and everything was &quot;UpToDate&quot; again.</div><div><br></div><div><div>May 22 15:29:40 kernel: [939194.560779] block drbd2: Split-Brain detected, manually solved. Sync from peer node</div><div>May 22 15:29:40 kernel: [939194.560784] block drbd2: peer( Unknown -&gt; Primary ) conn( WFReportParams -&gt; WFBitMapT ) disk( UpToDate -&gt; Outdated ) pdsk( DUnknown -&gt; UpToDate )</div><div>May 22 15:29:40 kernel: [939194.579725] block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 15(1), total 15; compression: 100.0%</div><div>May 22 15:29:40 kernel: [939194.584810] block drbd2: conn( WFBitMapT -&gt; WFSyncUUID )</div><div>May 22 15:29:40 kernel: [939195.171600] block drbd2: updated sync uuid 41AB15EA146152E0:0000000000000000:22955A9C3A791A4A:22945A9C3A791A4B</div><div>May 22 15:29:40 kernel: [939195.171882] block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2</div><div>May 22 15:29:40 kernel: [939195.173982] block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2 exit code 0 (0x0)</div><div>May 22 15:29:40 kernel: [939195.173988] block drbd2: conn( WFSyncUUID -&gt; SyncTarget ) disk( Outdated -&gt; Inconsistent )</div><div>May 22 15:29:40 kernel: [939195.173993] block drbd2: Began resync as SyncTarget (will sync 1226284 KB [306571 bits set]).</div><div>May 22 15:33:12 kernel: [939406.611609] block drbd2: Resync done (total 211 sec; paused 0 sec; 5808 K/sec)</div><div>May 22 15:33:12 kernel: [939406.611615] block drbd2: updated UUIDs 1A549A81F3FE347E:0000000000000000:41AB15EA146152E0:41AA15EA146152E1</div><div>May 22 15:33:12 kernel: [939406.611620] block drbd2: conn( SyncTarget -&gt; Connected ) disk( Inconsistent -&gt; UpToDate )</div></div><div><br></div><div><br></div><div>I ran the same test again:</div><div><div>- I disconnected the secondary node.</div><div>- I promoted it to primary.</div></div><div>- I mounted the DRBD volume<br></div><div>- I noticed that none of the services could start anymore.</div><div>- Checked the configuration files, which contained garbage data.</div><div>- Checking the syslog, fsck had to correct a few entries during the mount; the files were there, but no useful data was within them.</div><div><br></div><div><div>May 22 16:07:36 kernel: [941470.358138] block drbd2: peer( Primary -&gt; Unknown ) conn( Connected -&gt; Disconnecting ) pdsk( UpToDate -&gt; DUnknown )</div><div>May 22 16:07:36 kernel: [941470.358461] block drbd2: asender terminated</div><div>May 22 16:07:36 kernel: [941470.358464] block drbd2: Terminating drbd2_asender</div><div>May 22 16:07:36 kernel: [941470.358525] block drbd2: Connection closed</div><div>May 22 16:07:36 kernel: [941470.358529] block drbd2: conn( Disconnecting -&gt; StandAlone )</div><div>May 22 16:07:36 kernel: [941470.358580] block drbd2: receiver terminated</div><div>May 22 16:07:36 kernel: [941470.358583] block drbd2: Terminating drbd2_receiver</div><div>May 22 16:07:36 kernel: [941470.361063] block drbd2: role( Secondary -&gt; Primary )</div><div>May 22 16:07:36 kernel: [941470.361220] block drbd2: new current UUID C6E989B8069E233D:1A549A81F3FE347E:41AB15EA146152E0:41AA15EA146152E1</div><div>May 22 16:07:36 kernel: [941470.366794] device-mapper: table: 252:2: drbd2 too small for target: start=81793024, len=2093056, dev_size=83883448</div><div>May 22 16:07:36 kernel: [941470.443423] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 665239</div><div>May 22 16:07:36 kernel: [941470.443796] EXT4-fs error (device dm-0): ext4_mb_generate_buddy:739: group 158, 1528 clusters in bitmap, 5723 in gd</div><div>May 22 16:07:36 kernel: [941470.449068] EXT4-fs (dm-0): ext4_orphan_cleanup: truncating inode 662235 to 757 bytes</div><div>May 22 16:07:36 kernel: [941470.449426] EXT4-fs (dm-0): 1 orphan inode deleted</div><div>May 22 16:07:36 kernel: [941470.449429] EXT4-fs (dm-0): 1 truncate cleaned up</div><div>May 22 16:07:36 kernel: [941470.449431] EXT4-fs (dm-0): recovery complete</div><div>May 22 16:07:36 kernel: [941470.450756] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)</div><div>May 22 16:07:40 kernel: [941474.730794] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #1179664: comm tail: deleted inode referenced: 1247435</div><div>May 22 16:07:40 kernel: [941474.734670] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #1179664: comm tail: deleted inode referenced: 1247435</div><div>May 22 16:09:33 kernel: [941587.487254] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #664780: comm bash: deleted inode referenced: 683837</div><div>May 22 16:09:33 kernel: [941587.727142] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #664780: comm vi: deleted inode referenced: 683837</div><div>[...]</div><div>May 22 16:09:33 kernel: [941587.730099] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #664780: comm vi: deleted inode referenced: 683837</div></div><div><br></div><div>Running online verification found tons of blocks that were out of sync.</div><div><br></div><div><div>May 22 16:18:30 kernel: [942125.119936] block drbd2: Starting Online Verify from sector 0</div><div>May 22 16:18:35 kernel: [942129.459450] block drbd2: Out of sync: start=2056, size=16 (sectors)</div><div>May 22 16:18:52 kernel: [942146.550253] block drbd2: Out of sync: start=10272, size=8 (sectors)</div><div>May 22 16:21:19 kernel: [942293.893023] block drbd2: Out of sync: start=1374144, size=64 (sectors)</div><div>May 22 16:21:20 kernel: [942295.189719] block drbd2: Out of sync: start=1387992, size=8 (sectors)</div><div>May 22 16:21:20 kernel: [942295.193997] block drbd2: Out of sync: start=1388032, size=432 (sectors)</div><div>May 22 16:21:21 kernel: [942295.292691] block drbd2: Out of sync: start=1388768, size=288 (sectors)</div><div>May 22 16:21:21 kernel: [942295.377464] block drbd2: Out of sync: start=1389616, size=16 (sectors)</div><div>May 22 16:21:21 kernel: [942295.380208] block drbd2: Out of sync: start=1389656, size=48 (sectors)</div><div>May 22 16:21:23 kernel: [942297.476909] block drbd2: Out of sync: start=1410272, size=800 (sectors)</div><div>May 22 16:21:34 kernel: [942308.667068] block drbd2: Out of sync: start=1525888, size=8 (sectors)</div><div>May 22 16:21:34 kernel: [942308.669117] block drbd2: Out of sync: start=1525896, size=8 (sectors)</div><div>May 22 16:21:34 kernel: [942309.102272] block drbd2: Out of sync: start=1530080, size=24 (sectors)</div><div>May 22 16:21:34 kernel: [942309.239587] block drbd2: Out of sync: start=1531104, size=656 (sectors)</div><div>May 22 16:21:35 kernel: [942309.681967] block drbd2: Out of sync: start=1536088, size=328 (sectors)</div><div>May 22 16:21:39 kernel: [942313.464188] block drbd2: Out of sync: start=1574936, size=16 (sectors)</div></div><div><br></div><div>No errors were logged by DRBD that would indicate to me that the data was trashed.<br></div><div><br></div><div>My question is: </div><div>- Was the possible mistake that I mentioned above an actual mistake? ( I am guessing that due to internal metadata, the partition in the domU was actually smaller than the space DRBD needed. So I accidentally made the partition on the secondary server slightly too small, and when the secondary node was promoted, it shreded some of DRBD&#39;s metadata somehow?) Could it have been the cause for this?</div><div><br></div><div>I have no knowledge of how DRBD internally works, so I&#39;m at a loss, but perhaps someone can shed some light on this?</div></div><div><br></div><div>Thanks,</div><div>Michael</div></div>