If you had a 100GB Subversion repository, where a full checkout came to about 10GB of source files, how would you go about migrating it to Git?
One thing you probably wouldn’t do is import the whole thing into a single Git repo: it’s fairly well known that Git isn’t designed for that. But Git does have some tools that let you pretend it’s a centralised version control system, huge monolithic repos are cool, and it works in Mercurial… and evidence is worth more than hearsay, so I decided to create a Git repo with 10GB of text files to see what happened. I was told in #git on Freenode that Git will not cope with a repo larger than available RAM, but I was a little suspicious, given the number of multi-gigabyte Git repos in existence.
I adapted a Bash script from here to create random filenames, and used the csmith program to fill those files with nonsense C++ code, until I had 7GB of such gibberish. (I initially thought I had 10GB, but having used du -s instead of du --apparent-size -s to check the size of my test data, it was only 7GB of content that was using 10GB of disk space.)
The test machine was an x86 virtual machine with 2GB of RAM and 1 CPU, with no swap. The repo was on a 100GB ext4 volume. Doing a performance benchmark on a virtual machine on shared infrastructure is a bad idea, but I’m testing a bad idea, so whatever. The machine ran Git version 2.5.0.
Generating the initial data took all night, perhaps because I included a call to du inside the loop that generated the data, which would take an increasing amount of time on each iteration.
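My loop looked roughly like the following sketch; this is a reconstruction, not the exact script, and the directory name and size check are my own. It assumes csmith is on the PATH. Note the du call in the loop condition, which rescans the whole tree on every iteration and so gets slower as the tree grows:

```shell
#!/bin/sh
# Hypothetical reconstruction of the data-generation loop.
# Assumes csmith is installed; names and sizes are illustrative.
mkdir -p huge-repo && cd huge-repo || exit 1
target=$((10 * 1024 * 1024))   # target size in 1K blocks (~10GB)
while [ "$(du -s . | cut -f1)" -lt "$target" ]; do
    # Random 16-character lowercase filename from /dev/urandom.
    name=$(tr -dc 'a-z' < /dev/urandom | head -c 16)
    csmith > "$name.cpp"
done
```

Because du -s walks every file so far on each pass through the loop, the total cost grows quadratically with the number of files, which would explain an overnight run.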
Creating an initial 7GB commit: 95 minutes

$ time git add .

real    90m0.219s
user    84m57.117s
sys     1m6.932s

$ time git status

real    1m15.992s
user    0m4.071s
sys     0m20.728s

$ time git commit -m "Initial commit"

real    4m22.397s
user    0m27.168s
sys     1m5.815s
The git log command is pretty much instant; a git show of this commit takes a minute the first time I run it, and about 5 seconds if I run it again. Using git add and git rm to create a second commit is really quick; git status is still slow, but git commit is quick:
$ time git status

real    1m19.937s
user    0m5.063s
sys     0m16.678s

$ time git commit -m "Put all z files in same directory"

real    0m11.317s
user    0m1.639s
sys     0m5.306s
A git show of this second commit is quick too.
Next I used git daemon to serve the repo over the git:// protocol:
$ git daemon --verbose --export-all --base-path=`pwd`
Doing a full clone from a different machine (with Git 2.4.3, over the intranet): 22 minutes
$ time git clone git://172.16.20.95/huge-repo
Cloning into 'huge-repo'...
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.53 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
Checking connectivity... done.
Checking out files: 100% (46345/46345), done.

real    22m17.734s
user    2m12.606s
sys     0m54.603s
Doing a sparse checkout of a few files: 15 minutes
$ mkdir sparse-checkout
$ cd sparse-checkout
$ git init .
$ git config core.sparsecheckout true
$ echo z-files/ >> .git/info/sparse-checkout
$ time git pull git://172.16.20.95/huge-repo master
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.58 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
From git://172.16.20.95/huge-repo
 * branch            master     -> FETCH_HEAD

real    14m26.032s
user    1m9.133s
sys     0m22.683s
This is rather unimpressive. I only pulled a 55MB subset of the repo, a single directory, but the clone still took nearly 15 minutes. Cloning the same subset again from the same git-daemon process took a similar time, and the .git directory of the sparse clone is the same size as that of a full clone.
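This makes sense once you remember that sparse checkout filters the working tree, not the object database: the pull still fetches every object. One way to confirm this (my own check, not part of the original experiment) is to compare the object stores of the two clones:

```shell
# Run inside each clone. With -v, git prints object-store statistics;
# the "size-pack" line should be the same in the full and sparse clones,
# because sparse checkout limits the working tree, not the objects fetched.
git count-objects -v -H
```

The -H flag just prints the sizes in human-readable units.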
I think these numbers are interesting. They show that the sky doesn’t fall if you put a huge amount of code into Git. At the same time, the ‘sparse checkouts’ feature doesn’t really let you pretend that Git is a centralised version control system, so you can’t actually avoid the consequences of having such a huge repo.
Also, I learned that if you are profiling file size, you should use du --apparent-size to measure it, because file size != disk usage!
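The gap is easy to demonstrate (a standalone demo, not part of the experiment above): a file's content can be much smaller than the disk space it occupies, because filesystems allocate whole blocks, and a tree full of small generated source files compounds that rounding:

```shell
# One byte of content still occupies a whole filesystem block
# (typically 4096 bytes on ext4), so thousands of small files
# inflate plain `du -s` well past the real content size.
printf 'x' > tiny.txt
du --block-size=1 -s tiny.txt                   # disk usage: typically 4096
du --block-size=1 --apparent-size -s tiny.txt   # content size: 1
```

The exact disk-usage figure depends on the filesystem's block size, but the apparent size is always the byte count the application actually wrote.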
Disclaimer: there are better ways to spend your time than trying to use a tool for things that it’s not designed for (sometimes).