I ran into a very peculiar situation with DRBD. I performed a fail-over test, which worked, but after I added the failed node back as secondary (discarding its data) and performed another fail-over test, all data on the secondary node turned out to be completely useless (files containing garbage after the first block, files containing only garbage, and so on).

Here is my setup: I run two DRBD nodes in primary/secondary mode. Usually the primary server is used, but in an emergency the secondary, less powerful server can be activated manually to provide at least some (crash-consistent) service. The primary server uses an LVM volume of 40 GiB, upon which DRBD sits; the DRBD device is handed to the domU to create partitions on. The domU doesn't know DRBD is backing it, so inside the DRBD block device it creates partitions as usual.

My mistake, I assume, is that I checked the partition size from within the domU and used that number to create the partition backing DRBD on the secondary server (a mere virtual server). That partition was too small: I should have checked the size of the LVM volume backing DRBD on the primary instead.

Synchronization worked, and so did fail-over:

- I disconnected the secondary node.
- I promoted it to primary.
- I mounted the DRBD volume (mount -o rw /dev/mapper/"$drbdres"1 /data).
- I ran tests, wrote to the disk, etc.
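For reference, the fail-over sequence above boils down to something like the following sketch. The resource name r0 and the kpartx step are assumptions on my part (I mount via /dev/mapper, so I assume kpartx-style partition mappings); DRY_RUN=1 only prints the commands so the sequence can be reviewed first:

```shell
# Sketch of the manual fail-over sequence; assumptions: resource is
# named r0, and partition mappings are created with kpartx.
drbdres=${drbdres:-r0}

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run drbdadm disconnect "$drbdres"               # cut the link to the peer
run drbdadm primary "$drbdres"                  # promote this node
run kpartx -a /dev/drbd2                        # map the partitions inside the DRBD device (assumption)
run mount -o rw /dev/mapper/"${drbdres}1" /data # mount the data partition
```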
May 22 14:27:08 kernel: [935443.236935] block drbd2: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
May 22 14:27:08 kernel: [935443.238259] block drbd2: asender terminated
May 22 14:27:08 kernel: [935443.238263] block drbd2: Terminating drbd2_asender
May 22 14:27:08 kernel: [935443.238327] block drbd2: Connection closed
May 22 14:27:08 kernel: [935443.238331] block drbd2: conn( Disconnecting -> StandAlone )
May 22 14:27:08 kernel: [935443.238340] block drbd2: receiver terminated
May 22 14:27:08 kernel: [935443.238342] block drbd2: Terminating drbd2_receiver
May 22 14:27:26 kernel: [935460.609680] block drbd2: role( Secondary -> Primary )
May 22 14:28:24 kernel: [935519.197240] device-mapper: multipath: version 1.3.2 loaded
May 22 14:28:41 kernel: [935535.837046] device-mapper: table: 252:2: drbd2 too small for target: start=81793024, len=2093056, dev_size=83883448
May 22 14:38:02 kernel: [936096.382990] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 666094
May 22 14:38:02 kernel: [936096.384190] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683278
May 22 14:38:02 kernel: [936096.384199] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683273
May 22 14:38:02 kernel: [936096.384206] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683272
May 22 14:38:02 kernel: [936096.384211] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683271
May 22 14:38:02 kernel: [936096.384216] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 683270
May 22 14:38:02 kernel: [936096.384220] EXT4-fs (dm-0): 6 orphan inodes deleted
May 22 14:38:02 kernel: [936096.384222] EXT4-fs (dm-0): recovery complete
May 22 14:38:02 kernel: [936096.385250] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)

Note in particular the message from device-mapper.
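A back-of-the-envelope check of the numbers in that line is interesting. Assuming DRBD 8.x with internal metadata, and using the metadata-size formula from the DRBD User's Guide, M = ceil(C / 262144) * 8 + 72 (all sizes in 512-byte sectors, C being the backing device size):

```shell
# Back-of-the-envelope check; assumes internal metadata, formula from
# the DRBD User's Guide: M = ceil(C / 262144) * 8 + 72 (sizes in sectors).
C=$(( 40 * 1024 * 1024 * 1024 / 512 ))      # 40 GiB backing LV -> 83886080 sectors
M=$(( (C + 262143) / 262144 * 8 + 72 ))     # internal metadata -> 2632 sectors
usable=$(( C - M ))                         # usable DRBD device size
echo "usable=$usable"                       # 83883448, matching dev_size above
echo "partition end=$(( 81793024 + 2093056 ))"  # 83886080, the full 40 GiB
```

If I read this right, the dev_size device-mapper reports is exactly 40 GiB minus DRBD's internal metadata, while the partition table inside claims a partition ending at the full 40 GiB, i.e. overshooting the DRBD device by exactly those 2632 metadata sectors.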
The swap partition was at the referenced location, so I could work on the data partition, but I wonder if something already went wrong here. I did not try mounting the swap partition.

Since the now-promoted server was functional, I wanted to restore the previous state:

- I demoted it and connected it as secondary again, with --discard-my-data.
- About 1.2 GB were synced, and everything was "UpToDate" again.

May 22 15:29:40 kernel: [939194.560779] block drbd2: Split-Brain detected, manually solved. Sync from peer node
May 22 15:29:40 kernel: [939194.560784] block drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
May 22 15:29:40 kernel: [939194.579725] block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 15(1), total 15; compression: 100.0%
May 22 15:29:40 kernel: [939194.584810] block drbd2: conn( WFBitMapT -> WFSyncUUID )
May 22 15:29:40 kernel: [939195.171600] block drbd2: updated sync uuid 41AB15EA146152E0:0000000000000000:22955A9C3A791A4A:22945A9C3A791A4B
May 22 15:29:40 kernel: [939195.171882] block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2
May 22 15:29:40 kernel: [939195.173982] block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2 exit code 0 (0x0)
May 22 15:29:40 kernel: [939195.173988] block drbd2: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
May 22 15:29:40 kernel: [939195.173993] block drbd2: Began resync as SyncTarget (will sync 1226284 KB [306571 bits set]).
May 22 15:33:12 kernel: [939406.611609] block drbd2: Resync done (total 211 sec; paused 0 sec; 5808 K/sec)
May 22 15:33:12 kernel: [939406.611615] block drbd2: updated UUIDs 1A549A81F3FE347E:0000000000000000:41AB15EA146152E0:41AA15EA146152E1
May 22 15:33:12 kernel: [939406.611620] block drbd2: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )

I ran the same test again:

- I disconnected the secondary node.
- I promoted it to primary.
- I mounted the DRBD volume.
- I noticed that none of the services could start anymore.
- I checked the configuration files, which contained garbage data.
- Checking the syslog, fsck had to correct a few entries during the mount; the files were there, but no useful data was within them.

May 22 16:07:36 kernel: [941470.358138] block drbd2: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
May 22 16:07:36 kernel: [941470.358461] block drbd2: asender terminated
May 22 16:07:36 kernel: [941470.358464] block drbd2: Terminating drbd2_asender
May 22 16:07:36 kernel: [941470.358525] block drbd2: Connection closed
May 22 16:07:36 kernel: [941470.358529] block drbd2: conn( Disconnecting -> StandAlone )
May 22 16:07:36 kernel: [941470.358580] block drbd2: receiver terminated
May 22 16:07:36 kernel: [941470.358583] block drbd2: Terminating drbd2_receiver
May 22 16:07:36 kernel: [941470.361063] block drbd2: role( Secondary -> Primary )
May 22 16:07:36 kernel: [941470.361220] block drbd2: new current UUID C6E989B8069E233D:1A549A81F3FE347E:41AB15EA146152E0:41AA15EA146152E1
May 22 16:07:36 kernel: [941470.366794] device-mapper: table: 252:2: drbd2 too small for target: start=81793024, len=2093056, dev_size=83883448
May 22 16:07:36 kernel: [941470.443423] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 665239
May 22 16:07:36 kernel: [941470.443796] EXT4-fs error (device dm-0): ext4_mb_generate_buddy:739: group 158, 1528 clusters in bitmap, 5723 in gd
May 22 16:07:36 kernel: [941470.449068] EXT4-fs (dm-0): ext4_orphan_cleanup: truncating inode 662235 to 757 bytes
May 22 16:07:36 kernel: [941470.449426] EXT4-fs (dm-0): 1 orphan inode deleted
May 22 16:07:36 kernel: [941470.449429] EXT4-fs (dm-0): 1 truncate cleaned up
May 22 16:07:36 kernel: [941470.449431] EXT4-fs (dm-0): recovery complete
May 22 16:07:36 kernel: [941470.450756] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
May 22 16:07:40 kernel: [941474.730794] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #1179664: comm tail: deleted inode referenced: 1247435
May 22 16:07:40 kernel: [941474.734670] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #1179664: comm tail: deleted inode referenced: 1247435
May 22 16:09:33 kernel: [941587.487254] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #664780: comm bash: deleted inode referenced: 683837
May 22 16:09:33 kernel: [941587.727142] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #664780: comm vi: deleted inode referenced: 683837
[...]
May 22 16:09:33 kernel: [941587.730099] EXT4-fs error (device dm-0): ext4_lookup:1047: inode #664780: comm vi: deleted inode referenced: 683837

Running online verification found tons of blocks that were out of sync:

May 22 16:18:30 kernel: [942125.119936] block drbd2: Starting Online Verify from sector 0
May 22 16:18:35 kernel: [942129.459450] block drbd2: Out of sync: start=2056, size=16 (sectors)
May 22 16:18:52 kernel: [942146.550253] block drbd2: Out of sync: start=10272, size=8 (sectors)
May 22 16:21:19 kernel: [942293.893023] block drbd2: Out of sync: start=1374144, size=64 (sectors)
May 22 16:21:20 kernel: [942295.189719] block drbd2: Out of sync: start=1387992, size=8 (sectors)
May 22 16:21:20 kernel: [942295.193997] block drbd2: Out of sync: start=1388032, size=432 (sectors)
May 22 16:21:21 kernel: [942295.292691] block drbd2: Out of sync: start=1388768, size=288 (sectors)
May 22 16:21:21 kernel: [942295.377464] block drbd2: Out of sync: start=1389616, size=16 (sectors)
May 22 16:21:21 kernel: [942295.380208] block drbd2: Out of sync: start=1389656, size=48 (sectors)
May 22 16:21:23 kernel: [942297.476909] block drbd2: Out of sync: start=1410272, size=800 (sectors)
May 22 16:21:34 kernel: [942308.667068] block drbd2: Out of sync: start=1525888, size=8 (sectors)
May 22 16:21:34 kernel: [942308.669117] block drbd2: Out of sync: start=1525896, size=8 (sectors)
May 22 16:21:34 kernel: [942309.102272] block drbd2: Out of sync: start=1530080, size=24 (sectors)
May 22 16:21:34 kernel: [942309.239587] block drbd2: Out of sync: start=1531104, size=656 (sectors)
May 22 16:21:35 kernel: [942309.681967] block drbd2: Out of sync: start=1536088, size=328 (sectors)
May 22 16:21:39 kernel: [942313.464188] block drbd2: Out of sync: start=1574936, size=16 (sectors)

No errors were logged by DRBD that would indicate to me that the data was trashed.

My question is: was the possible mistake I mentioned above an actual mistake? (I am guessing that, due to internal metadata, the partition in the domU was actually smaller than the space DRBD needed, so I accidentally made the partition on the secondary server slightly too small, and when the secondary node was promoted, it shredded some of DRBD's metadata somehow.) Could that have been the cause?

I have no knowledge of how DRBD works internally, so I'm at a loss, but perhaps someone can shed some light on this?

Thanks,
Michael