As I was archiving some documents lately, I thought about using git as my backup tool.
I considered other tools, such as boar (not distributed), git-annex (its binary stopped working because of a broken dependency, so I could no longer use it; I also somehow lost some videos with it) and bup (still considering it :)).
The main advantages of git are that it's a very stable and common program, supported on UNIX and Windows; it's fast, checks the integrity of the repository, and can easily be synced across machines using push and pull.
The disadvantages are that it's not a backup solution. It is not well suited to binary files (it stores them just fine, but it doesn't diff them) and it can't handle large files well. Another disadvantage I encountered (after writing this script) is that when a checksum fails, you have to retrieve the correct object from somewhere else. And finally, it doesn't track metadata such as user id, full permissions (only the executable bit) and modification time.
Especially the latter bothered me; there can be valuable information in mtime and permissions. Hence, based on etckeeper, I devised a small script which keeps track of uid, gid, atime, mtime and permissions. On commit, checkout and update it stores or restores these attributes.
As it tracks all files, commit, checkout and pull operations can take (a lot) more time, because it has to stat or update every file's metadata.
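To make the idea concrete, here is a minimal sketch of storing and restoring these attributes. It is not the actual git-store-metadata.sh: the file name `.metadata`, the function names and the line format are assumptions made for illustration. It relies on GNU coreutils (`stat -c`, `touch -d @epoch`) and on file names without newlines or leading/trailing spaces.

```shell
# Hypothetical sketch, not the real git-store-metadata.sh.
# Assumes GNU coreutils and plain file names.
METADATA_FILE=".metadata"   # made-up name for the attribute list

store_metadata() {
    # One line per tracked file: uid gid octal-mode atime mtime path
    git ls-files | while IFS= read -r path; do
        [ -e "$path" ] || continue
        stat -c '%u %g %a %X %Y %n' -- "$path"
    done > "$METADATA_FILE"
}

restore_metadata() {
    # The path is read last, so spaces inside it are preserved.
    while IFS=' ' read -r uid gid mode atime mtime path; do
        [ -e "$path" ] || continue
        # Changing ownership usually needs root, so failure is ignored.
        chown -- "$uid:$gid" "$path" 2>/dev/null || true
        chmod -- "$mode" "$path"
        touch -a -d "@$atime" -- "$path"   # access time only
        touch -m -d "@$mtime" -- "$path"   # modification time only
    done < "$METADATA_FILE"
}
```

Both functions visit every tracked file, which is why the cost grows with the size of the repository.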
Quick usage instructions: download git-store-metadata.sh. Navigate to the root of the git repository, and run
    # git-store-metadata.sh install
    # git-store-metadata.sh store
    # git push
In the other repositories, do
    # git pull
    # git-store-metadata.sh install
From now on, the attributes should be kept up to date transparently.
I have not tested this script extensively, nor have I checked the edge cases yet. So, use at your own risk.
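How the attributes stay up to date transparently is not spelled out here; one plausible mechanism is a set of git hooks. The sketch below is an assumption, not the script's actual `install` code: the function name `install_hooks` is made up, and the `restore` subcommand is hypothetical (only `install`, `store`, `generate` and `repair` appear in this post).

```shell
# Hypothetical sketch: wire store/restore into git hooks.
# Note: this overwrites existing hooks with the same names.
install_hooks() {
    hooks_dir="$(git rev-parse --git-dir)/hooks" || return 1
    mkdir -p "$hooks_dir"

    # Before each commit, record the current attributes.
    printf '%s\n' '#!/bin/sh' 'git-store-metadata.sh store' \
        > "$hooks_dir/pre-commit"

    # After a checkout or a merge (e.g. via git pull), reapply them.
    # "restore" is an assumed subcommand name.
    for hook in post-checkout post-merge; do
        printf '%s\n' '#!/bin/sh' 'git-store-metadata.sh restore' \
            > "$hooks_dir/$hook"
    done

    chmod +x "$hooks_dir/pre-commit" \
             "$hooks_dir/post-checkout" \
             "$hooks_dir/post-merge"
}
```

Hooks live inside the .git folder and are not transferred by push or pull, which would explain why the install step has to be repeated in every clone.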
Edit: Preliminary par2 support was added for the .git/objects folder to protect against corruption of the repository. It allows you to generate recovery files that can detect and correct errors. These recovery files are stored inside the .git folder, so they are not part of any commit or repository and are local only. As generating them takes a lot of time, I still have to find a good tradeoff between correctability and time.
    # git-store-metadata.sh generate
    # git-store-metadata.sh repair
The par2 support covers the local repository only, i.e. if you push to a remote repository, no recovery files are created there. This is mainly because it seems difficult to keep the par2 files up to date on the remote. For example, a push to a remote repository sometimes succeeds and sometimes fails; whenever it succeeds, the existing par2 files no longer match the new set of files, and repair is no longer possible.
Another critical thing to keep in mind: after a garbage collection has been performed, the par2 recovery files must be regenerated. However, as there is no hook that runs after a garbage collection (there is a pre-auto-gc hook, but it runs only before automatic, i.e. "--auto", garbage collections), this cannot be done automatically. After a gc, a verify and/or repair against the old recovery files will fail.
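Since git offers no hook that runs after a garbage collection, one workaround is to always run gc through a small wrapper that regenerates the recovery files afterwards. This is only a sketch: the function name `gc_with_recovery` and the `METADATA_SCRIPT` variable are assumptions for illustration, and it does not cover automatic collections triggered by other git commands.

```shell
# Path to the script; override it if the script is not on PATH
# (hypothetical variable, not part of the original script).
METADATA_SCRIPT=${METADATA_SCRIPT:-git-store-metadata.sh}

gc_with_recovery() {
    # Regenerate the recovery files only if gc succeeded.
    # Any extra arguments are passed through to git gc.
    git gc "$@" && "$METADATA_SCRIPT" generate
}
```

Setting `gc.auto` to 0 with `git config` disables automatic collections, so that gc only happens through the wrapper.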
Also, if there are many files, gathering or restoring metadata might take a while, so perhaps the script could use the commits to determine which files need updating.
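One way to take the commits into account is to ask git which paths differ between the old and the new commit, and to touch only those. A minimal sketch, with `changed_files` as a made-up helper name:

```shell
# Hypothetical helper: list paths that were added, copied, modified,
# renamed or type-changed between two commits ($1 = old, $2 = new).
# Deleted paths are left out, as there is nothing to restore for them.
changed_files() {
    git diff --name-only --diff-filter=ACMRT "$1" "$2"
}
```

After a pull that merged new commits, `changed_files ORIG_HEAD HEAD` would list the candidates, since git records the previous commit in `ORIG_HEAD` when merging.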
Read further for a detailed working example, and the source code of the script.