Enormous Git Repositories

If you had a 100GB Subversion repository, where a full checkout came to about 10GB of source files, how would you go about migrating it to Git?

One thing you probably wouldn’t do is import the whole thing into a single Git repo; it’s pretty well known that Git isn’t designed for that. But, you know, Git does have some tools that let you pretend it’s a centralised version control system, and huge monolithic repos are cool, and it works in Mercurial… Evidence is worth more than hearsay, so I decided to create a Git repo with 10GB of text files to see what happened. I did get told in #git on Freenode that Git will not cope with a repo that’s larger than available RAM, but I was a little suspicious given the number of multi-gigabyte Git repos in existence.

I adapted a Bash script from here to create random filenames, and used the csmith program to fill those files with nonsense C++ code, until I had 7GB of such gibberish. (I had been aiming for 10GB, but because I used du -s instead of du --apparent-size -s to check the size of my test data, what I actually ended up with was 7GB of content using 10GB of disk space.)
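The generation loop was roughly along these lines. This is a reconstruction rather than the original script: the filename pattern, the .c extension and the in-loop size check are all assumptions based on the description above (and the du call measuring disk usage is the very mistake described in the parenthesis).

#!/bin/bash
# Fill the current directory with randomly-named files of generated code
# until du reports 10GB. Note that du -sm measures disk usage in MB, not
# apparent size (hence the 7GB/10GB discrepancy described above).
while [ "$(du -sm . | cut -f1)" -lt 10240 ]; do
    name=$(tr -dc 'a-z0-9' < /dev/urandom | head -c 8)   # random filename
    csmith > "$name.c"                                    # nonsense code
done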

The test machine was an x86 virtual machine with 2GB of RAM, 1 CPU and no swap. The repo was on a 100GB ext4 volume. Doing a performance benchmark on a virtual machine on shared infrastructure is a bad idea, but I’m testing a bad idea, so whatever. The machine ran Git version 2.5.0.

Results

Generating the initial data: this took all night, perhaps because I included a call to du inside the loop that generated the data, which would take an increasing amount of time on each iteration.

Creating an initial 7GB commit: 95 minutes

$ time git add .
real    90m0.219s
user    84m57.117s
sys     1m6.932s

$ time git status
real    1m15.992s
user    0m4.071s
sys     0m20.728s

$ time git commit -m "Initial commit"
real    4m22.397s
user    0m27.168s
sys     1m5.815s

The git log command is pretty much instant; a git show of this commit takes a minute the first time I run it, and about 5 seconds if I run it again.

Doing git add and git rm to create a second commit is really quick; git status is still slow, but git commit is fast:

$ time git status
real    1m19.937s
user    0m5.063s
sys     0m16.678s

$ time git commit -m "Put all z files in same directory"
real    0m11.317s
user    0m1.639s
sys     0m5.306s

Furthermore, git show of this second commit is quick too.
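For reference, the reshuffle behind that second commit was roughly the following. It’s a reconstruction from the commit message, assuming the generated filenames start with ‘z’ and end in ‘.c’; git mv stages the same add/rm pair mentioned above.

$ mkdir z-files
$ git mv z*.c z-files/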

Next I used git daemon to serve the repo over git:// protocol:

$ git daemon --verbose --export-all --base-path=`pwd`

Doing a full clone from a different machine (with Git 2.4.3, over the intranet): 22 minutes

$ time git clone git://172.16.20.95/huge-repo
Cloning into 'huge-repo'...
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.53 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
Checking connectivity... done.
Checking out files: 100% (46345/46345), done.

real    22m17.734s
user    2m12.606s
sys     0m54.603s

Doing a sparse checkout of a few files: 15 minutes

$ mkdir sparse-checkout
$ cd sparse-checkout
$ git init .
$ git config core.sparsecheckout true
$ echo z-files/ >> .git/info/sparse-checkout

$ time git pull  git://172.16.20.95/huge-repo master
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.58 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
From git://172.16.20.95/huge-repo
 * branch            master     -> FETCH_HEAD

real    14m26.032s
user    1m9.133s
sys     0m22.683s

This is rather unimpressive. I only pulled a 55MB subset of the repo (a single directory), but it still took nearly 15 minutes. Pulling the same subset again from the same git-daemon process took a similar time, and the .git directory of the sparse clone is the same size as that of a full clone.
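You can verify that last point with a quick size comparison of the two clone directories (the paths here assume the full clone and the sparse checkout from above sit side by side):

$ du -sh huge-repo/.git sparse-checkout/.git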

I think these numbers are interesting. They show that the sky doesn’t fall if you put a huge amount of code into Git. At the same time, the ‘sparse checkouts’ feature doesn’t really let you pretend that Git is a centralised version control system, so you can’t actually avoid the consequences of having such a huge repo.

Also, I learned that if you are profiling file size, you should use du --apparent-size to measure that, because file size != disk usage!
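In other words, run both on the same directory and they can disagree substantially (‘test-data’ here is just a stand-in for wherever your test data lives):

$ du -sh test-data
$ du -sh --apparent-size test-data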

Disclaimer: there are better ways to spend your time than trying to use a tool for things that it’s not designed for (sometimes).

Cleaning up stale Git branches

I get bored looking through dozens and dozens of stale Git branches. If git branch --remote takes up more than a screenful of text then I am unhappy!

Here are some shell hacks that can help you when trying to work out what can be deleted.

This shows you all the remote branches which are already merged; those can probably be deleted right away!

git branch --remote --merged
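If you do want to get rid of them in one go, something along these lines should work. It’s a sketch, so review the branch list carefully before adding the destructive last step; the grep keeps origin/HEAD and origin/master out of the list, and the sed strips the ‘origin/’ prefix that git push doesn’t want.

# Review the output of the first three stages before piping into the delete!
git branch --remote --merged \
    | grep -v 'HEAD\|master' \
    | sed 's|^ *origin/||' \
    | xargs git push origin --delete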

These are the remote branches that aren’t merged yet.

git branch --remote --no-merged

Best not to delete those straight away. But some of them are probably totally stale. This snippet will loop through each unmerged branch and tell you (a) when the last commit was made, and (b) how many commits it contains which are not merged to ‘origin/master’.

for b in $(git branch --remote --no-merged); do
    echo $b;
    git show $b --pretty="format:  Last commit: %cd" | head -n 1;
    echo -n "  Commits from 'master': ";
    git log --oneline $(git merge-base $b origin/master)..$b | wc -l;
    echo;
done

The output looks like this:

origin/album-art-to-libtracker-extract
  Last commit: Mon Mar 29 17:22:14 2010 +0100
  Commits from 'master': 1

origin/albumart-qt
  Last commit: Thu Oct 21 11:10:25 2010 +0200
  Commits from 'master': 1

origin/api-cleanup
  Last commit: Thu Feb 20 12:16:43 2014 +0100
  Commits from 'master': 18

...

Two of those haven’t been touched for five years, and only contain a single commit! So they are probably good targets for deletion.

You can also get the above info sorted, with the oldest branches first. First you need to generate a list. This outputs each branch along with the date of its newest commit (as a Unix timestamp), sorts the lines numerically, then strips the timestamp and writes the branch names to a file called ‘unmerged-branches.txt’:

for b in $(git branch --remote --no-merged); do
    git show $b --pretty="format:%ct $b" | head -n 1;
done | sort -n | cut -d ' ' -f 2 > unmerged-branches.txt

Then you can run the formatting command again, but replace the first line with:

for b in $(cat unmerged-branches.txt); do
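Put together, the sorted report loop looks like this:

for b in $(cat unmerged-branches.txt); do
    echo $b;
    git show $b --pretty="format:  Last commit: %cd" | head -n 1;
    echo -n "  Commits from 'master': ";
    git log --oneline $(git merge-base $b origin/master)..$b | wc -l;
    echo;
done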

OK! Now you have a list of all the unmerged branches, sorted by age, and you can send a mail to people saying you’re going to delete all of them older than a certain point unless they beg you not to.
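Once the deadline has passed, the oldest ones can go in one command. The cut-off of twenty branches below is an arbitrary example; as before, the ‘origin/’ prefix has to be stripped before pushing the deletes.

head -n 20 unmerged-branches.txt \
    | sed 's|^origin/||' \
    | xargs git push origin --delete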