Enourmous Git Repositories

If you had a 100GB Subversion repository, where a full checkout came to about 10GB of source files, how would you go about migrating it to Git?

One thing you probably wouldn’t do is import the whole thing into a single Git repo, it’s pretty well known that Git isn’t designed for that. But, you know, Git does have some tools that let you pretend it’s a centralised version control system, and, huge monolithic repos are cool, and it works in Mercurial… evidence is worth more than hearsay, so I decided to create a Git repo with 10GB of text files to see what happened. I did get told in #git on Freenode that Git will not cope with a repo that’s larger than available RAM, but I was a little suspicious given the number of multi-gigabyte Git repos in existance.

I adapted a Bash script from here to create random filenames, and the csmith program to fill those files with nonsense C++ code, until I had 10GB 7GB of such gibberish.(I realised that, having used du -s instead of du --apparent-size -s to check the size of my test data, it was only 7GB of content, that was using 10GB of disk space.)

The test machine was an x86 virtual machine with 2GB of RAM and 1CPU, with no swap. The repo was on a 100GB ext4 volume. Doing a performance benchmark on a virtual machine on shared infrastructure is a bad idea, but I’m testing a bad idea, so whatever. The machine ran Git version 2.5.0.

Results

Generating the initial data: this took all night, perhaps because I included a call to du inside the loop that generated the data, which would take an increasing amount of time on each iteration.

Creating an initial 10GB 7GB commit: 95 minutes

$ time git add .
real    90m0.219s
user    84m57.117s
sys     1m6.932s

$ time git status
real    1m15.992s
user    0m4.071s
sys     0m20.728s

$ time git commit -m "Initial commit"
real    4m22.397s
user    0m27.168s
sys     1m5.815s

The git log command is pretty instant, a git show of this commit takes a minute the first time I run it, about 5 seconds if I run it again.

Doing git add and git rm to create a second commit is really quick, git status is still slow, but git commit is quick:

$ time git status
real    1m19.937s
user    0m5.063s
sys     0m16.678s

$ time git commit -m "Put all z files in same directory"
real    0m11.317s
user    0m1.639s
sys     0m5.306s

Furthermore, git show of this second commit is quick too.

Next I used git daemon to serve the repo over git:// protocol:

$ git daemon --verbose --export-all --base-path=`pwd`

Doing a full clone from a different machine (with Git 2.4.3, over
intranet): 22 minutes

$ time git clone git://172.16.20.95/huge-repo
Cloning into 'huge-repo'...
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.53 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
Checking connectivity... done.
Checking out files: 100% (46345/46345), done.

real    22m17.734s
user    2m12.606s
sys     0m54.603s

Doing a sparse checkout of a few files: 15 minutes

$ mkdir sparse-checkout
$ cd sparse-checkout
$ git init .
$ git config core.sparsecheckout true
$ echo z-files/ >> .git/info/sparse-checkout

$ time git pull  git://172.16.20.95/huge-repo master
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.58 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
From git://172.16.20.95/huge-repo
 * branch            master     -> FETCH_HEAD

real    14m26.032s
user    1m9.133s
sys     0m22.683s

This is rather unimpressive. I only pull a 55MB subset of the repo, a single directory, but the clone still takes nearly 15 minutes. Cloning the same subset again from the same git-daemon process took a similar time. The .git directory of the sparse clone is the same size as with a full clone.

I think these numbers are interesting. They show that the sky doesn’t fall if you put a huge amount of code into Git. At the same time, the ‘sparse checkouts’ feature doesn’t really let you pretend that Git is a centralised version control system, so you can’t actually avoid the consequences of having such a huge repo.

Also, I learned that if you are profiling file size, you should use du --apparent-size to measure that, because file size != disk usage!

Disclaimer: there are better ways to spend your time than trying to use a tool for things that it’s not designed for (sometimes).

Advertisements

About Sam Thursfield

Who's that kid in the back of the room? He's setting all his papers on fire! Where did he get that crazy smile? We all think he's really weird.
This entry was posted in Uncategorized. Bookmark the permalink.

3 Responses to Enourmous Git Repositories

  1. Pingback: Links 16/3/2016: GNU Linux-Libre 4.5, NVIDIA Distro Rumours | Techrights

  2. Rémi Cardona says:

    A couple of things:

    * make sure you use really recent versions of git, a lot of work is ongoing to better handle huge repos, we’ve hit plenty of issues when we (Gentoo) migrated our huge repo to git, so do try to stay really up to date
    * make sure you read into sparse/shallow checkouts, they may actually hurt more than a full clone: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomment-193772935

    Good luck!

  3. NickB says:

    Git is *not* designed for such large repos. While Subversion can easily handle 500gb+ repos, in Git you’d have to split a single >5gb+ repo to smaller ones.[*]

    [*]:https://svnvsgit.com/#git-scalability-for-larger-project-myth

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s