[Csync2] Issues with synced git clones

Mon Dec 25 14:32:02 CET 2017

On Thu, Dec 21, 2017 at 02:43:58AM +0100, Dominik George wrote:
> What happens is hard to describe, but it looks like the git index gets
> broken. The effect is that all nodes in the cluster end up with a broken
> state, namely with a lot of untracked files in the repository.

Which should be fixable with a simple "git reset", as long as those git
repos are supposed to be identical after that sync.

> Some details that might be of interest:
> 
>  * In the synced directory tree, there are several clones of the same
>    git repository
>  * Changes are made only on one node
>  * csync is configured to prefer the younger copy of a file

> > but let me first ask:
> > git is a distributed version control system,
> > so why are you not using *git* to distribute stuff?
> > as in git push/fetch/pull/remote update and so on?
> 
> Because git is only a coincidence.
> 
> This syncs home directories across a cluster of PXE servers (over a
> relatively slow ADSL link, so distributed filesystems are not an
> option), and I have no control over what users happen to have inside
> their $HOME.

If you run "git ls-files --debug" in a "healthy" git checkout, it shows
you what git caches in that index file. Among other things, that
includes the inode number of each tracked file.
Any copy (or "sync"), even a local one, will change the inode number.
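
To make the inode point concrete, here is a minimal Python sketch. It
assumes a POSIX-like filesystem, and the file names are made up for
illustration. shutil.copy2() stands in for "a sync tool that preserves
mtime and mode"; it is not what csync2 does internally.

```python
import os
import shutil
import tempfile


def inode_after_copy():
    """Return (original inode, copy inode) for a metadata-preserving copy."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "tracked.txt")        # hypothetical tracked file
        dst = os.path.join(tmp, "tracked-copy.txt")   # what a sync would create
        with open(src, "w") as fh:
            fh.write("same content\n")
        # Copies the data plus mtime and mode, but the result is a new file
        # and therefore has a new inode.
        shutil.copy2(src, dst)
        return os.stat(src).st_ino, os.stat(dst).st_ino


orig_ino, copy_ino = inode_after_copy()
print("inode changed:", orig_ino != copy_ino)
```

So even a "perfect" copy that keeps content and timestamps identical
still looks different to anything that remembers the inode.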

Git uses the ctime, mtime, ownership, some more stat info,
and specifically the inode number, only as a "hint" that
the content "might" have been changed.

Whenever git reads the "index" ("cache") file, and checks the "stat"
information of the tracked files, for those files where the stat
information indicates a possible content change,
it will check the actual content (re-hash the file),
and record the new stat info in the index.
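
The refresh logic described above can be sketched roughly like this.
This is a toy model, not git's actual code: CacheEntry, stat_key() and
refresh() are names I made up, the set of stat fields is reduced, and a
plain SHA-1 of the file content stands in for git's object hash.

```python
import hashlib
import os
import shutil
import tempfile
from dataclasses import dataclass


@dataclass
class CacheEntry:
    stat_key: tuple  # (inode, size, mtime_ns, ctime_ns) as last seen
    digest: str      # content hash as last computed


def stat_key(path):
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns, st.st_ctime_ns)


def content_digest(path):
    with open(path, "rb") as fh:
        return hashlib.sha1(fh.read()).hexdigest()


def refresh(cache, path):
    """Return (content_changed, cache_rewritten) for one tracked path."""
    key = stat_key(path)
    entry = cache[path]
    if key == entry.stat_key:
        return False, False            # stat matches: trust the cached hash
    digest = content_digest(path)      # stat differs: content might differ
    changed = digest != entry.digest
    cache[path] = CacheEntry(key, digest)  # new stat info is recorded either way
    return changed, True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    with open(path, "w") as fh:
        fh.write("hello\n")
    cache = {path: CacheEntry(stat_key(path), content_digest(path))}
    first = refresh(cache, path)
    # Simulate a sync: the same bytes arrive as a new file (new inode).
    incoming = path + ".sync"
    shutil.copy2(path, incoming)
    os.replace(incoming, path)
    second = refresh(cache, path)

print(first, second)
```

The second refresh reports "content unchanged, cache rewritten", which
is the case that matters here: the cache file gets modified although
nothing the user cares about has changed.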

Which means that even for "innocent" git commands like "status",
the index file will change, if any of the (relevant) stat information of
the tracked files has changed, even if their content and names have not.

If you do multi-directional sync, and the "newer" file wins, then,
depending on when a git command runs on one of the supposedly
"unchanging" nodes, that node's index file may still have "old" content
(from the last sync) but will get a timestamp of now.

If you now sync that in the other direction, because it has the younger
time stamp, it will win over the "correct" index file, overwriting it
with a "stale" version of the file list and file hashes, doing something
equivalent to "git reset some-point-in-the-past" of your repo checkout.
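
The race can be replayed with two plain files. This is only a
simulation under assumptions: sync_newer_wins() is a made-up stand-in
for a "younger copy wins" policy, not csync2's implementation, and the
two files play the role of .git/index on node A and node B. The
timestamps (200, 300) are arbitrary.

```python
import os
import tempfile


def sync_newer_wins(path_a, path_b):
    """Copy whichever file has the newer mtime over the other, keeping its mtime."""
    st_a, st_b = os.stat(path_a), os.stat(path_b)
    if st_a.st_mtime_ns == st_b.st_mtime_ns:
        return
    if st_a.st_mtime_ns > st_b.st_mtime_ns:
        src, dst, st = path_a, path_b, st_a
    else:
        src, dst, st = path_b, path_a, st_b
    with open(src, "rb") as fh:
        data = fh.read()
    with open(dst, "wb") as fh:
        fh.write(data)
    os.utime(dst, ns=(st.st_atime_ns, st.st_mtime_ns))


def write_with_mtime(path, text, mtime):
    with open(path, "w") as fh:
        fh.write(text)
    os.utime(path, (mtime, mtime))


with tempfile.TemporaryDirectory() as tmp:
    index_a = os.path.join(tmp, "node_a_index")
    index_b = os.path.join(tmp, "node_b_index")
    # t=200: the user commits on node A; A's index describes the new state.
    write_with_mtime(index_a, "file list at commit 2", 200)
    # t=300: before that change reaches node B, some git command there
    # rewrites B's index: still the old file list, but stamped "now".
    write_with_mtime(index_b, "file list at commit 1", 300)
    sync_newer_wins(index_a, index_b)
    with open(index_a) as fh:
        result_on_a = fh.read()

print(result_on_a)
```

After the sync, node A holds the commit 1 file list again, while its
working tree and object store are already at commit 2.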

When synchronizing file trees,
you need to use special rules for metadata files containing stat
information about themselves or other parts of the tree.

I suggest you exclude the git index files, and tell users to "git reset
HEAD" when they encounter "strange" behaviour.  Better yet, exclude git
directories, and (have them) use git to distribute those.

You could also hack git, and make it not use "now" as the timestamp of
an updated git index file, but explicitly set its mtime (via utime())
to that of the most recent tracked or cached file. That way, it would
at least not present "stale" content with a "new" timestamp.
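
To illustrate the idea only: the sketch below applies the same
timestamp rule from the outside, in Python, instead of patching git.
clamp_mtime_to_newest() and the file names are hypothetical, and I have
not checked how this interacts with git's own use of the index
timestamp, so treat it as an experiment, not a recommendation.

```python
import os
import tempfile


def clamp_mtime_to_newest(index_path, tracked_paths):
    """Set the index mtime to the newest mtime among the tracked files."""
    newest = max(os.stat(p).st_mtime_ns for p in tracked_paths)
    st = os.stat(index_path)
    os.utime(index_path, ns=(st.st_atime_ns, newest))  # keep atime, replace mtime
    return newest


with tempfile.TemporaryDirectory() as tmp:
    tracked = []
    for name, mtime in (("a.txt", 100), ("b.txt", 200)):
        p = os.path.join(tmp, name)
        with open(p, "w") as fh:
            fh.write(name)
        os.utime(p, (mtime, mtime))
        tracked.append(p)
    index_path = os.path.join(tmp, "index")   # stands in for .git/index
    with open(index_path, "w") as fh:
        fh.write("cached stat info")
    os.utime(index_path, (300, 300))          # "now", as after a refresh
    clamp_mtime_to_newest(index_path, tracked)
    index_mtime = int(os.stat(index_path).st_mtime)

print(index_mtime)
```

With that rule, an index that was merely refreshed on an "unchanging"
node would no longer look younger than the index from the node where
the files were actually modified.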

    Lars