Enormous Git Repositories

If you had a 100GB Subversion repository, where a full checkout came to about 10GB of source files, how would you go about migrating it to Git?

One thing you probably wouldn’t do is import the whole thing into a single Git repo. It’s pretty well known that Git isn’t designed for that. But, you know, Git does have some tools that let you pretend it’s a centralised version control system, and huge monolithic repos are cool, and it works in Mercurial… Evidence is worth more than hearsay, so I decided to create a Git repo with 10GB of text files to see what happened. I did get told in #git on Freenode that Git will not cope with a repo that’s larger than available RAM, but I was a little suspicious given the number of multi-gigabyte Git repos in existence.

I adapted a Bash script from here to create random filenames, and used the csmith program to fill those files with nonsense C++ code, until I had 7GB of such gibberish. (I originally thought it was 10GB, but I had used du -s instead of du --apparent-size -s to check the size of my test data: it was only 7GB of content, using 10GB of disk space.)

The test machine was an x86 virtual machine with 2GB of RAM and 1 CPU, with no swap. The repo was on a 100GB ext4 volume. Doing a performance benchmark on a virtual machine on shared infrastructure is a bad idea, but I’m testing a bad idea, so whatever. The machine ran Git version 2.5.0.


Generating the initial data: this took all night, perhaps because I included a call to du inside the loop that generated the data, which would take an increasing amount of time on each iteration.
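One way to avoid that slowdown would be to keep a running byte count instead of calling du on every iteration. This is only a rough sketch of the idea, not the script I actually used: the function name generate_files is made up, and it fills the files with base64-encoded random bytes rather than csmith output.

```shell
#!/bin/sh
# Hypothetical helper: fill DIR with files until at least TARGET bytes of
# content have been written. The running total is kept in a shell variable,
# so there is no need to re-scan the directory with du on each iteration.
generate_files() {
    dir=$1
    target=$2
    total=0
    n=0
    mkdir -p "$dir"
    while [ "$total" -lt "$target" ]; do
        n=$((n + 1))
        file="$dir/file-$n.txt"
        # Stand-in for csmith: 1024 random bytes, base64-encoded as text.
        head -c 1024 /dev/urandom | base64 > "$file"
        size=$(wc -c < "$file")
        total=$((total + size))
    done
    # Report how many bytes of content were generated in total.
    echo "$total"
}
```

Calling generate_files testdata 7000000000 would then stop once roughly 7GB of content exists, and each iteration costs the same no matter how much data is already there.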

Creating an initial 7GB commit: 95 minutes

$ time git add .
real    90m0.219s
user    84m57.117s
sys     1m6.932s

$ time git status
real    1m15.992s
user    0m4.071s
sys     0m20.728s

$ time git commit -m "Initial commit"
real    4m22.397s
user    0m27.168s
sys     1m5.815s

The git log command is pretty much instant. A git show of this commit takes a minute the first time I run it, and about 5 seconds if I run it again.

Doing git add and git rm to create a second commit is really quick. git status is still slow, but git commit is quick:

$ time git status
real    1m19.937s
user    0m5.063s
sys     0m16.678s

$ time git commit -m "Put all z files in same directory"
real    0m11.317s
user    0m1.639s
sys     0m5.306s

Furthermore, git show of this second commit is quick too.

Next I used git daemon to serve the repo over git:// protocol:

$ git daemon --verbose --export-all --base-path=`pwd`

Doing a full clone from a different machine (with Git 2.4.3, over intranet): 22 minutes

$ time git clone git://
Cloning into 'huge-repo'...
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.53 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
Checking connectivity... done.
Checking out files: 100% (46345/46345), done.

real    22m17.734s
user    2m12.606s
sys     0m54.603s

Doing a sparse checkout of a few files: 15 minutes

$ mkdir sparse-checkout
$ cd sparse-checkout
$ git init .
$ git config core.sparsecheckout true
$ echo z-files/ >> .git/info/sparse-checkout

$ time git pull  git:// master
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.58 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
From git://
 * branch            master     -> FETCH_HEAD

real    14m26.032s
user    1m9.133s
sys     0m22.683s

This is rather unimpressive. I only pull a 55MB subset of the repo, a single directory, but the clone still takes nearly 15 minutes. Cloning the same subset again from the same git-daemon process took a similar time. The .git directory of the sparse clone is the same size as with a full clone.
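You can see the same effect on a toy repo without waiting 15 minutes. This is a small sketch, with made-up directory and file names, that repeats the sparse checkout steps against a two-directory repo on the local disk and then counts the objects stored on each side.

```shell
#!/bin/sh
# Toy reproduction: a sparse checkout limits the working tree,
# but the object store still receives everything.
work=$(mktemp -d)
cd "$work"

# A tiny source repo with two directories (names are illustrative only).
git init -q source
cd source
mkdir a-files z-files
echo "int a;" > a-files/a.c
echo "int z;" > z-files/z.c
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Initial commit"
cd "$work"

# Sparse checkout of z-files/ only, same steps as above.
git init -q sparse
cd sparse
git config core.sparsecheckout true
echo "z-files/" >> .git/info/sparse-checkout
git pull -q "$work/source" HEAD

# Count every object reachable in each repo.
full_objects=$(git -C "$work/source" rev-list --objects --all | wc -l)
sparse_objects=$(git rev-list --objects --all | wc -l)

echo "working tree contains: $(ls)"
echo "objects in source repo: $full_objects"
echo "objects in sparse repo: $sparse_objects"
```

Only z-files shows up in the working tree, but both object counts come out the same, which matches what I saw with the .git directory sizes.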

I think these numbers are interesting. They show that the sky doesn’t fall if you put a huge amount of code into Git. At the same time, the ‘sparse checkouts’ feature doesn’t really let you pretend that Git is a centralised version control system, so you can’t actually avoid the consequences of having such a huge repo.

Also, I learned that if you are profiling file size, you should use du --apparent-size to measure that, because file size != disk usage!
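Here is a quick way to see the difference for yourself. It is a minimal sketch that assumes GNU du (the --apparent-size option is not in every du), and the file name is just for illustration.

```shell
#!/bin/sh
# Compare the apparent size of a tiny file with the disk space it occupies.
tmpdir=$(mktemp -d)
printf '0123456789' > "$tmpdir/small.txt"   # exactly 10 bytes of content

# Apparent size: the number of bytes of content in the file.
apparent=$(du --apparent-size --block-size=1 "$tmpdir/small.txt" | cut -f1)

# Disk usage: the space allocated for the file, which depends on the filesystem.
on_disk=$(du --block-size=1 "$tmpdir/small.txt" | cut -f1)

echo "apparent size: $apparent bytes"
echo "disk usage:    $on_disk bytes"
```

On most filesystems the second number is rounded up to a whole block, so a directory full of small source files can use noticeably more disk space than it has content.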

Disclaimer: there are better ways to spend your time than trying to use a tool for things that it’s not designed for (sometimes).


Codethink is hiring!

We are looking for people who can write code, who match one of these job descriptions at least slightly, and who are willing to relocate to Manchester, UK (so you must either be an EU resident, or able to get a work permit for the UK.) Manchester is number 8 in Lonely Planet’s Best In Travel list for 2016, so really you’d be doing yourself a favour to move here. Remote working is possible if you have lots of contributions to public software projects that demonstrate your amazingness.

There is a nice symmetry to this blog post. I remember reading a similar one quite a few years ago, which led to me applying for a job at Codethink, and I’ve been here ever since, with various trips to exotic countries in between.

If you’re interested, send a CV & cover letter to jobs@codethink.co.uk.


CMake: dependencies between targets and files and custom commands

As I said in my last post about CMake, targets are everything in CMake. Unfortunately, not everything is a target though!

If you’ve tried to do anything non-trivial in CMake using the add_custom_command() command, you may have got stuck in this horrible swamp of confusion. If you want to generate some kind of file at build time, in some manner other than compiling C or C++ code, then you need to use a custom command to generate the file. But files aren’t targets and have all sorts of exciting limitations to make you forget everything you ever knew about dependency management.

What makes it so hard is that there’s not one limitation, but several. Here is a hopefully complete list of things you might want to do in CMake that involve custom commands and custom targets depending on each other, and some explanations as to why things don’t work the way that you might expect.

1. Dependencies between targets

This is CMake at its simplest (and best).

cmake_minimum_required(VERSION 3.2)

add_library(foo foo.c)

add_executable(bar bar.c)
target_link_libraries(bar foo)

You have a library, and a program that depends on it. When you run CMake, both of them get built. Ideal! This is great!

What is “all”, in the dependency graph to the left? It’s a built in target, and it’s the default target. There are also “install” and “test” targets built in (but no “clean” target).

2. Custom targets

If your project is a good one then maybe you use a documentation tool like GTK-Doc or Doxygen to generate documentation from the code.

This is where add_custom_command() enters your life. You may live to regret ever letting it in.

cmake_minimum_required(VERSION 3.2)

add_custom_command(
    OUTPUT docs/doxygen.stamp
    DEPENDS docs/Doxyfile
    COMMAND doxygen docs/Doxyfile
    COMMAND cmake -E touch docs/doxygen.stamp
    COMMENT "Generating API documentation with Doxygen"
)

We have to create a ‘stamp’ file because Doxygen generates lots of different files, and we can’t really tell CMake what to expect. But actually, here’s what to expect: nothing! If you build this, you get no output. Nothing depends on the documentation, so it isn’t built.

So we need to add a dependency between docs/doxygen.stamp and the “all” target. How about using add_dependencies()? No, you can’t use that with any of the built in targets. But as a special case, you can use add_custom_target(ALL) to create a new target attached to the “all” target:

add_custom_target(
    docs ALL
    DEPENDS docs/doxygen.stamp
)


In practice, you might also want to make the custom command depend on all your source code, so it gets regenerated every time you change the code. Or, you might want to remove the ALL from your custom target, so that you have to explicitly run make docs to generate the documentation.

This is also discussed here.

3. Custom commands in different directories

Another use case for add_custom_command() is generating source code files using 3rd party tools.

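### Toplevel CMakeLists.txt
cmake_minimum_required(VERSION 3.2)

add_subdirectory(src)
add_subdirectory(tests)

### src/CMakeLists.txt
add_custom_command(
    OUTPUT foo.c
    COMMAND cmake -E echo "Generate my C code" > foo.c
)

### tests/CMakeLists.txt
add_executable(test-foo
    test-foo.c ${CMAKE_CURRENT_BINARY_DIR}/../src/foo.c
)

add_test(
    NAME test-foo
    COMMAND test-foo
)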

How does this work? Actually it doesn’t! You’ll see the following error when you run CMake:

CMake Error at tests/CMakeLists.txt:1 (add_executable):
  Cannot find source file:


  Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp
  .hxx .in .txx
CMake Error: CMake can not determine linker language for target: test-foo
CMake Error: Cannot determine link language for target "test-foo".

Congratulations, you’ve hit bug 14633! The fun thing here is that generated files don’t behave anything like targets. Actually they can only be referenced in the file that contains the corresponding add_custom_command() call. So when we refer to the generated foo.c in tests/CMakeLists.txt, CMake actually has no idea where it could come from, so it raises an error.


As the corresponding FAQ entry describes, there are two things you need to do to work around this limitation.

The first is to wrap your custom command in a custom target. Are you noticing a pattern yet? Most of the workarounds here are going to involve wrapping custom commands in custom targets. In src/CMakeLists.txt, you do this:

add_custom_target(
    generate-foo
    DEPENDS foo.c
)


Then, in tests/CMakeLists.txt, you can add a dependency between “test-foo” and “generate-foo”:

add_dependencies(test-foo generate-foo)

That’s enough to ensure that foo.c now gets generated before the build of test-foo begins, which is obviously important. If you try to run CMake now, you’ll hit the same error, because CMake still has no idea where that generated foo.c file might come from. The workaround here is to manually set the GENERATED source file property in tests/CMakeLists.txt:

set_source_files_properties(
    ${CMAKE_CURRENT_BINARY_DIR}/../src/foo.c
    PROPERTIES GENERATED TRUE
)



Note that this is a bit of a contrived example. In most cases, the correct solution is to do this:

### src/CMakeLists.txt
add_library(foo foo.c)

### tests/CMakeLists.txt
target_link_libraries(test-foo foo)

Then you don’t have to worry about any of the nonsense above, because libraries are proper targets, and you can use them anywhere.

Even if it’s not practical to make a library containing ‘foo.c’, there must be some other target that links against it in the same directory that it is generated in. So instead of creating a “generate-foo” target, you can make “test-foo” depend on whatever other target links to “foo.c”.

4. Custom commands and parallel make

I ran into this issue while doing something pretty unusual with CMake: wrapping a series of Buildroot builds. Imagine my delight at discovering that, when parallel make was used, my CMake-generated Makefile was running the same Buildroot build multiple times at the same time! That is not what I wanted!

It turns out this is a pretty common issue. The crux of it is that with the “Unix Makefiles” backend, multiple toplevel targets run as independent, parallel make processes. Files aren’t targets, and unless something is a target then it doesn’t get propagated around like you would expect.

Here is the test case:

cmake_minimum_required(VERSION 3.2)

add_custom_command(
    OUTPUT gen
    COMMAND sleep 1
    COMMAND cmake -E echo Hello > gen
)

add_custom_target(
    my-all-1 ALL DEPENDS gen
)

add_custom_target(
    my-all-2 ALL DEPENDS gen
)

If you generate a Makefile from this and run make -j 2, you’ll see the following:

Scanning dependencies of target my-all-2
Scanning dependencies of target my-all-1
[ 50%] Generating gen
[100%] Generating gen
[100%] Built target my-all-2
[100%] Built target my-all-1

If creating ‘gen’ takes a long time, then you really don’t want it to happen multiple times! It may even cause disasters, for example running make twice at once in the same Buildroot build tree is not pretty at all.


As explained in bug 10082, the solution is (guess what!) to wrap the custom command in a custom target!

add_custom_target(make-gen DEPENDS gen)


Then you change the custom targets to depend on “make-gen”, instead of the file ‘gen’. Except! Be careful when doing that — because there is another trap waiting for you!

5. File-level dependencies of custom targets are not propagated

If you read the documentation of add_custom_command() closely, and you look at the DEPENDS keyword argument, you’ll see this text:

If DEPENDS specifies any target (created by the add_custom_target(), add_executable(), or add_library() command) a target-level dependency is created to make sure the target is built before any target using this custom command. Additionally, if the target is an executable or library a file-level dependency is created to cause the custom command to re-run whenever the target is recompiled.

This sounds quite nice, like more or less what you would expect. But the important bit of information here is what CMake doesn’t do: when a custom target depends on another custom target, all the file level dependencies are completely ignored.

Here’s your final example for the evening:

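cmake_minimum_required(VERSION 3.2)

set(SPECIAL_TEXT foo)  # any placeholder value

add_custom_command(
    OUTPUT gen1
    COMMAND cmake -E echo ${SPECIAL_TEXT} > gen1
)

add_custom_target(
    gen1-wrapper
    DEPENDS gen1
)

add_custom_command(
    OUTPUT gen2
    DEPENDS gen1-wrapper
    COMMAND cmake -E copy gen1 gen2
)

add_custom_target(
    all-generated ALL
    DEPENDS gen2
)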

This is subtly wrong, even though you did what you were told, and wrapped the custom command in a custom target.

The first time you build it:

Scanning dependencies of target gen1-wrapper
[ 50%] Generating gen1
[ 50%] Built target gen1-wrapper
Scanning dependencies of target all-generated
[100%] Generating gen2
[100%] Built target all-generated

But then touch the file ‘gen1’, or overwrite it with some other text, or change the value of SPECIAL_TEXT in CMakeLists.txt to something else, and you will see this:

[ 50%] Generating gen1
[ 50%] Built target gen1-wrapper
[100%] Built target all-generated

There’s no file-level dependency created between ‘gen2’ and ‘gen1’, so ‘gen2’ never gets updated, and things get all weird.


You can’t just depend on gen1 instead of gen1-wrapper, because it may end up being built multiple times! See the previous point. Instead, you need to depend on the “gen1-wrapper” target and the file itself:

add_custom_command(
    OUTPUT gen2
    DEPENDS gen1-wrapper gen1
    COMMAND cmake -E copy gen1 gen2
)

As the documentation says, this only applies to targets wrapping add_custom_command() output. If ‘gen1’ was a library created with add_library, things would work how you expect.



Maybe I just have a blunt head, but I found all of this quite difficult to work out. I can understand why CMake works this way, but I think there is plenty of room for improvement in the documentation where this is explained. Hopefully this guide has gone some way to making things clearer.

If you have any other dependency-related traps in CMake that you’ve hit, please comment and I’ll add them to this list…


Some CMake tips

I spent the past few weeks converting a bunch of Make and Autotools-based modules to use CMake instead. This was my first major outing with CMake. Maybe there will be a few blog posts on that subject!

In general I think CMake has a sound design and I quite want to like it. It seems like many of its warts are due to its long history and the need for backwards compatibility, not anything fundamentally broken. To keep a project going for 16 years is impressive and it is pretty widely used now. This is a quick list of things I found in CMake that confused me to start with but ultimately I think are good things.

  1. Targets are everything

    CMake is pretty similar to normal make in that all the things that you care about are ‘targets’. Libraries are targets, programs are targets, subdirectories are targets and custom commands create files which are considered targets. You can also create custom targets which run commands if executed. You need to use the custom targets feature if you want a custom command target to be tied to the default target, which is a little confusing but works OK.

    Targets have properties, which are useful.

  2. Absolute paths to shared libraries

    Traditionally you link to libfoo by passing -lfoo to the linker. Then, if libfoo is in a non-standard location, you pass -L/path/to/foo -lfoo. I don’t think pkg-config actually enforces this pattern but pretty much all the .pc files I have installed use the -L/path -lname pattern.

    CMake makes this quite awkward to do, because it makes every effort to forget about the linker paths. Library ‘targets’ in CMake keep track of associated include paths, dependent libraries, compile flags, and even extra source files, using ‘target properties’. There’s no target property for LINK_DIRECTORIES, though, so outside of the current CMakeLists.txt file they won’t be tracked. There is a global LINK_DIRECTORIES property, confusingly, but it’s specifically marked as “for debugging purposes.”

    So the recommended way to link to libraries is with the absolute path. Which makes sense! Why say with two commandline arguments what you can say with one?

    At least, this will be fine once CMake’s pkg-config integration returns absolute paths to libraries.

  3. Semicolon safety instead of whitespace safety

    CMake has a ‘list’ type which is actually a string with ; (semicolon) used to delimit entities. Spaces are used as an argument separator, but converted to semicolons during argument parsing, I think. Crucially, they seem to be converted before variable expansion is done, which means that filenames with spaces don’t need any special treatment. I like this more than shell code where I have to quote literally every variable (or else Richard Maw shouts at me).

    For example:

    cmake_minimum_required(VERSION 3.2)
    set(path "filename with spaces")
    set(command ls ${path})
    foreach(item ${command})
        message(item: ${item})
    endforeach()

    Output:

    item:ls
    item:filename with spaces

    On the other hand:

    cmake_minimum_required(VERSION 3.2)
    set(path "filename;with\;semicolons")
    set(command ls ${path})
    foreach(item ${command})
        message(item: ${item})
    endforeach()

    Here the unescaped semicolon splits the filename into separate list items.



    Semicolons occur less often in file names, I guess. Most of us are trained to avoid spaces, partly because we know how broken (all?) most shell-based build systems are in those cases. CMake hasn’t actually solved this but just punted the special character to a less often used one, as far as I can see. I guess that’s an improvement? Maybe?

    The semi-colon separator can bite you in other ways, for example, when specifying CMAKE_PREFIX_PATH (library and header search path) you might expect this to work:

    cmake . -DCMAKE_PREFIX_PATH=/opt/path1:/opt/path2

    However, that won’t work (unless you did actually mean that to be one item). Instead, you need to pass this:

    cmake . -DCMAKE_PREFIX_PATH=/opt/path1\;/opt/path2

    Of course, ; is a special character in UNIX shells so must be escaped.

  4. Ninja instead of Make

    CMake supports multiple backends, and Ninja is often faster than GNU Make, so give the Ninja backend a try: cmake -G Ninja.

  5. Policies

    The CMake developers seem pretty good at backwards compatibility. To this end they have introduced the rather obtuse policies framework. The great thing about the policies framework is that you can completely ignore it, as long as you have cmake_minimum_required(VERSION 3.3) at the top of your toplevel CMakeLists.txt. You’ll only need it once you have a massive bank of existing CMakeLists.txt files and you are getting started on porting them to a newer version of CMake.

    Quite a lot of CMake error messages are worded to make you think that you might need to care about policies, but don’t be fooled. Mostly these errors are for situations where there didn’t use to be an error, I think, and so the policy exists to bring back the ‘old’ behaviour, if you need it.

If a tool is weird but internally consistent, I can get on with it. Hopefully, CMake is getting there. I can see there have been a lot of good improvements since CMake 2.x, at least. And at no point so far has it made me more angry than GNU Autotools. It’s not crashed at all (impressive given it’s entirely C++ code!). And it is significantly faster and more widely applicable than Autotools or artisanal craft Makefiles. So I’ll be considering it in future. But I can’t help wishing for a build system that I actually liked.

Edit: you might also be interested in a list of common CMake antipatterns.


Tracker talk slides from GUADEC 2015

I did a talk entitled “Tracker: Introduction and Reflection” at GUADEC 2015. The slides are available from http://afuera.me.uk/talks/tracker-talk-2015.pdf. I don’t know if they will make much sense without the words but at least there are some nice photos in there.


Cleaning up stale Git branches

I get bored looking through dozens and dozens of stale Git branches. If git branch --remote takes up more than a screenful of text then I am unhappy!

Here are some shell hacks that can help you when trying to work out what can be deleted.

This shows you all the remote branches which are already merged, those can probably be deleted right away!

git branch --remote --merged

These are the remote branches that aren’t merged yet.

git branch --remote --no-merged

Best not to delete those straight away. But some of them are probably totally stale. This snippet will loop through each unmerged branch and tell you (a) when the last commit was made, and (b) how many commits it contains which are not merged to ‘origin/master’.

for b in $(git branch --remote --no-merged); do
    echo $b;
    git show $b --pretty="format:  Last commit: %cd" | head -n 1;
    echo -n "  Commits from 'master': ";
    git log --oneline $(git merge-base $b origin/master)..$b | wc -l;
done

The output looks like this:

  Last commit: Mon Mar 29 17:22:14 2010 +0100
  Commits from 'master': 1

  Last commit: Thu Oct 21 11:10:25 2010 +0200
  Commits from 'master': 1

  Last commit: Thu Feb 20 12:16:43 2014 +0100
  Commits from 'master': 18


Two of those haven’t been touched for five years, and only contain a single commit! So they are probably good targets for deletion, for example.

You can also get the above info sorted, with the oldest branches first. First you need to generate a list. This outputs each branch and the date of its newest commit (as a number), sorts it numerically, then filters out the number and writes it to a file called ‘unmerged-branches.txt’:

for b in $(git branch --remote --no-merged); do
    git show $b --pretty="format:%ct $b" | head -n 1;
done | sort -n | cut -d ' ' -f 2 > unmerged-branches.txt

Then you can run the formatting command again, but replace the first line with:

for b in $(cat unmerged-branches.txt); do

OK! You have a list of all the unmerged branches and you can send a mail to people saying you’re going to delete all of them older than a certain point unless they beg you not to.
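If you would rather skip the temporary file, git for-each-ref can do the sorting and the filtering in one go. This is a different technique from the loops above, and only a sketch: the function name stale_branches is made up, and the --no-merged option of for-each-ref needs a reasonably recent Git (2.7 or later, as far as I know).

```shell
#!/bin/sh
# Hypothetical helper: list unmerged remote branches, oldest first, with the
# date of the last commit and the number of commits not yet in the base branch.
# Usage: stale_branches [base]    (base defaults to origin/master)
stale_branches() {
    base="${1:-origin/master}"
    git for-each-ref --sort=committerdate --no-merged "$base" \
        --format='%(refname:short) %(committerdate:short)' refs/remotes |
    while read -r branch date; do
        # Count the commits reachable from the branch but not from the base.
        ahead=$(git rev-list --count "$base..$branch")
        printf '%s  last commit: %s  unmerged commits: %s\n' \
            "$branch" "$date" "$ahead"
    done
}
```

Run stale_branches inside a clone and the oldest candidates for deletion come out at the top.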


.yml files are an anti-pattern

A lot of people are representing data as YAML these days. That’s good! It’s an improvement over the days when everything seemed to be represented as XML, anyway.

But one thing about the YAML format is that it doesn’t require you to embed any information in the file about how the data should be interpreted. So now we have projects where there are hundreds of different .yml files committed to Git and I have no idea what any of them are for.

YAML is popular because it’s minimal and convenient, so I don’t think that requiring that everyone suddenly creates an ontology for the data in these .yml files would be practical. But I would really like to see a convention that the first line of any .yml file was a comment describing what the file did, e.g.

# This is a BOSH deployment manifest, see http://bosh.io/ for more information
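If you wanted to check a project against that convention, something like the following might do. It is a rough sketch with a made-up function name, and it only looks at the very first line, so a file that starts with a --- document marker followed by a comment would still be reported.

```shell
#!/bin/sh
# Hypothetical helper: print every .yml file under a directory whose first
# line is not a comment. Usage: find_undocumented_yml [dir]  (defaults to .)
find_undocumented_yml() {
    find "${1:-.}" -type f -name '*.yml' | while IFS= read -r f; do
        first=$(head -n 1 "$f")
        case "$first" in
            '#'*) ;;                    # starts with a comment: fine
            *) printf '%s\n' "$f" ;;    # anything else: report the file
        esac
    done
}
```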

That’s all!
