If you had a 100GB Subversion repository, where a full checkout came to about 10GB of source files, how would you go about migrating it to Git?
One thing you probably wouldn’t do is import the whole thing into a single Git repo; it’s pretty well known that Git isn’t designed for that. But, you know, Git does have some tools that let you pretend it’s a centralised version control system, and huge monolithic repos are cool, and it works in Mercurial… Evidence is worth more than hearsay, so I decided to create a Git repo with 10GB of text files and see what happened. I did get told in #git on Freenode that Git won’t cope with a repo that’s larger than the available RAM, but I was a little suspicious of that, given the number of multi-gigabyte Git repos in existence.
I adapted a Bash script from here to create random filenames, and used the csmith program to fill those files with nonsense C code, until I had 7GB of such gibberish. (I realised later that, having used du -s instead of du --apparent-size -s to check the size of my test data, I actually had only 7GB of content, which was using 10GB of disk space.)
The test machine was an x86 virtual machine with 2GB of RAM and 1 CPU, with no swap. The repo was on a 100GB ext4 volume. Doing a performance benchmark on a virtual machine on shared infrastructure is a bad idea, but I’m testing a bad idea, so whatever. The machine ran Git version 2.5.0.
Results
Generating the initial data: this took all night, perhaps because I included a call to du inside the loop that generated the data, which would take an increasing amount of time on each iteration.
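I don’t have the original script any more, but the general shape of it was something like this (the naming scheme and the size threshold are reconstructions rather than the real thing; the important part is the du call inside the loop):
$ # keep going until `du -s` reports ~10GB -- checking with du -s rather than
$ # du --apparent-size is exactly what left me with only 7GB of real content
$ while [ "$(du -s . | cut -f1)" -lt $((10 * 1024 * 1024)) ]; do
>     name=$(tr -dc 'a-z0-9' < /dev/urandom | head -c 16)   # random filename
>     csmith > "${name}.c"                                  # fill it with generated C
> done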
Creating an initial 7GB commit: 95 minutes
$ time git add .
real 90m0.219s
user 84m57.117s
sys 1m6.932s
$ time git status
real 1m15.992s
user 0m4.071s
sys 0m20.728s
$ time git commit -m "Initial commit"
real 4m22.397s
user 0m27.168s
sys 1m5.815s
The git log command is pretty much instant; a git show of this commit takes a minute the first time I run it, and about 5 seconds if I run it again.
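For what it’s worth, those timings came from nothing fancier than:
$ time git log
$ time git show HEAD    # HEAD is still the initial commit at this point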
Doing git add and git rm to create a second commit is really quick; git status is still slow, but git commit is quick:
$ time git status
real 1m19.937s
user 0m5.063s
sys 0m16.678s
$ time git commit -m "Put all z files in same directory"
real 0m11.317s
user 0m1.639s
sys 0m5.306s
Furthermore, git show of this second commit is quick too.
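For the record, that second commit just moved all the randomly named files starting with “z” into one directory. I no longer have the exact commands, but it was along these lines (the filenames and extension are illustrative):
$ mkdir z-files
$ for f in z*.c; do
>     mv "$f" z-files/
>     git rm --quiet -- "$f"    # stage the removal of the old path (already gone from the working tree)
>     git add "z-files/$f"      # stage the file at its new path
> done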
Next I used git daemon to serve the repo over the git:// protocol:
$ git daemon --verbose --export-all --base-path=`pwd`
Doing a full clone from a different machine (with Git 2.4.3, over the intranet): 22 minutes
$ time git clone git://172.16.20.95/huge-repo
Cloning into 'huge-repo'...
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.53 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
Checking connectivity... done.
Checking out files: 100% (46345/46345), done.
real 22m17.734s
user 2m12.606s
sys 0m54.603s
Doing a sparse checkout of a few files: 15 minutes
$ mkdir sparse-checkout
$ cd sparse-checkout
$ git init .
$ git config core.sparsecheckout true
$ echo z-files/ >> .git/info/sparse-checkout
$ time git pull git://172.16.20.95/huge-repo master
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.58 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
From git://172.16.20.95/huge-repo
* branch master -> FETCH_HEAD
real 14m26.032s
user 1m9.133s
sys 0m22.683s
This is rather unimpressive. I only pulled a 55MB subset of the repo, a single directory, but it still took nearly 15 minutes. Pulling the same subset again from the same git daemon process took a similar time, and the .git directory of the sparse checkout is the same size as that of a full clone.
I think these numbers are interesting. They show that the sky doesn’t fall if you put a huge amount of code into Git. At the same time, the ‘sparse checkout’ feature doesn’t really let you pretend that Git is a centralised version control system: you still fetch and store the entire history, so you can’t actually avoid the consequences of having such a huge repo.
Also, I learned that if you are profiling file size, you should use du --apparent-size to measure it, because file size != disk usage!
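That is, if you want the total size of the content rather than the blocks it happens to occupy on disk, the comparison looks like this (huge-repo is just the path used above):
$ du -sh huge-repo                    # disk usage: blocks actually allocated
$ du -sh --apparent-size huge-repo    # sum of file sizes, which is usually what "how big" means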
Disclaimer: there are better ways to spend your time than trying to use a tool for things that it’s not designed for (sometimes).