Status update 17/03/2023

Hello from my parents place, sitting on the border of Wales & England, listening to this excellent Victor Rice album, thinking of this time last year when I actually got to watch him play at Freedom Sounds Festival, which was one of my first adventures of the post-lockdown 2020s.

I have many distractions at the moment, many being work/life admin but here are some of the more interesting ones:

  • Playing in a new band in Santiago – Killo Karallo – recording some initial music which is to come out next week
  • Preparing new Vladimir Chicken music, also cooked and ready for release in April
  • Figuring out how we can grow the GNOME OpenQA tests while keeping them fun to work with. Here’s an experimental commandline tool which might help with that.
  • Learning about marketing, analytics, and search engine optimization.
  • Trying out the new LLaMA language model and generally trying to keep up with the ongoing revolution in content generation technology.

Also I got to see real snow for the first time in a few years! Thanks Buxton!



A small Rust program

I wrote a small program in Rust called cba_blooper. Its purpose is to download files from this funky looper pedal called the Blooper.

It’s the first time I finished a program in Rust. I find Rust programming a nice experience, after a couple of years of intermittent struggle to adapt my existing mental programming models to Rust’s conventions.

When I finished the tool I was surprised by the output size – initially a 5.6MB binary for a tool that basically just calls into libasound to read and write MIDI. I followed the excellent min-sized-rust guide and got that down to 1.4MB by fixing some obvious mistakes such as actually stripping the binary and building in release mode. But 1.4MB still seems quite big.

Next I ran cargo bloat and found there were two big dependencies:

  • the ‘clap‘ argument parser library
  • the ‘regex’ library, pulled in by ‘env_logger

I got ‘env_logger’ to shrink by disabling its optional features in Cargo.toml:

env_logger = { version ="0.10.0", default_features = false, features = [] }

As for ‘clap’, it seems unavoidable that it adds a few hundred KB to the binary. There’s an open issue here to improve that. I haven’t found a more minimal argument parser that looks anywhere near as nice as ‘clap’, so I guess the final binary will stay at 842KB for the time being. As my bin/directories fill up with Rust programs over the next decade, this overhead will start to add up but it’s fine for now.

It’s thanks to Rust being fun that this tool exists at all. It definitely easier to make a small C program but the story around distribution and dependency management for self-contained C programs is so not-fun that I probably wouldn’t even have bothered writing the tool in the first place.

Status update, 16/01/2023

The tech world is busy building “AI apps” with wild claims of solving all problems. Meanwhile it’s still basically an unsolved problem to get images and text to line up nicely when making presentation slides.

I’m giving a couple of talks at FOSDEM in February so i’ve been preparing slides. I previously used Reveal.js, which has some nice layout options (like r-stretch and r-fit-text), but pretty basic Markdown support such that I ended up writing the slides in raw HTML.

A colleague turned me onto Remark.js, a simpler tool with better Markdown support and a CLI tool (Backslide), but its layout support is less developed than Reveal.js so I ended frustrated at the work necessary to lay things out neatly.

In the end, I’ve built my own tiny tool based around python-markdown and its attr_list extension, to compile Markdown to Reveal.js HTML in a way that attempts to not be hopelessly annoying. It’s a work in progress, but hopefully I can work towards becoming less frustrated while making slides. The code is here, take it if you like it and don’t ask me any questions 🙂

Status update 21/09/22

Last week I attended OSSEU 2022 in Dublin, gave a talk about BuildStream 2.0 and the REAPI, and saw some new and old faces. Good times apart from the common cold I picked up on the way — I was glad that the event mandated face-masks for everyone so I could cover my own face without being the “odd one out”. (And so that we were safer from the 3+ COVID-19 cases reported at the event).

Being in the same room as Javier allowed some progress on our slightly “skunkworks” project to bring OpenQA testing to upstream GNOME. There was enough time to fix the big regressions that had halted testing completely since last year, one being an expired API key and the other, removal of virtio VGA support in upstream’s openqa_worker container. We prefer using the upstream container over maintaining our own fork, in the hope that our limited available time can go on maintaining tests instead, but the containers are provided on a “best effort” basis and since our tests are different to openqa.opensuse.org, regressions like this are to be expected.

I am also hoping to move the tests out of gnome-build-meta into a separate openqa-tests repo. We initially put them in gnome-build-meta because ultimately we’d like to be able to do pre-merge testing of gnome-build-meta branches, but since it takes hours to produce an ISO image from a given commit, it is painfully slow to create and update the OpenQA tests themselves. Now that Gitlab supports child pipelines, we can hopefully satisfy both use cases: one pipeline that quickly runs tests against the prebuilt “s3-image” from os.gnome.org, and a second that is triggered for a specific gnome-build-meta build pipeline and validates that.

First though, we need to update all the existing tests for the visual changes that occurred in the meantime, which are mostly due to gnome-initial-setup now using GTK4. That’s still a slow process as there are many existing needles (screenshots), and each time the tests are run, the Web UI allows updating only the first one to fail. That’s something else we’ll need to figure out before this could be called “production ready”, as any non-trivial style change to Adwaita would imply rerunning this whole update process.

All in all, for now openqa.gnome.org remains an interesting experiment. Perhaps by GUADEC next year there may be something more useful to report.

Team Codethink in the OSSEU 2022 lobby

My main fascination this month besides work has been exploring “AI” image generation. It’s amazing how quickly this technology has spread – it seems we had a big appetite for generative digital images.

I am really interested in the discussion about whether such things are “art”, because I this discussion is soon going to encompass music as well. We know that both OpenAI and Spotify are researching machine-generated music, and it’s particularly convenient for Spotify if they can continue to charge you £10 a month while progressively serving you more music that they generated in-house – and therefore reducing their royalty payments to record labels.

There are two related questions: whether AI-generated content is art, and whether something generated by an AI has the same monetary value as something a human made “by hand”. In my mind the answer is clear, but at the same time not quantifiable. Art is a form of human communication. Whether you use a neural network, a synthesizer, a microphone or a wax cylinder to produce that art is not relevant. Whether you use DALL-E 2 or a paintbrush is not relevant. Whether your art is any good depends on how it makes people feel.

I’ve been using Stable Diffusion to try and illustrate some of sound worlds from my songs, and my favourite results so far are for Don’t Go Into The Zone:

And finally, a teaser for an upcoming song release…

An elephant with a yellow map background

Status update, 16/08/2022

Building Wheels

For the first time this year I got to spend a little paid time on open source work, in this case putting some icing on the delicious and nourishing cake that we call BuildStream 2.

If you’ve tried the 1.9x pre-releases you’ll have seen it depends on a set of C++ tools under the name BuildBox. Some hot codepaths that were part of the Python core are now outsourced to helper tools, specifically data storage (buildbox-casd, buildbox-fuse) and container creation (buildbox-run-bubblewrap). These tools implement remote-apis standards and are useful to other build tools, the only catch is that they are not yet widely available in distros, and neither are the BuildStream 2 prereleases.

Separately, BuildStream 2 has some other hot codepaths written with Cython. If you’ve ever tried to install a Python package from PyPI and wondered why a package manager is running GCC, the answer is usually that it’s installing a source package and has to build some Cython code with your system’s C compiler.

The way to avoid requiring GCC in this case is to ship prebuilt binary packages known as wheels. So that’s what we implemented for BuildStream 2 – and as a bonus, we can bundle prebuilt BuildBox binaries into these packages. The wheels have a platform compatibility tag of “manylinux_2_28.x86_64” so they should work on any x86_64 host with GLIBC 2.28 or later.

Connecting Gnomes

I didn’t participate in GUADEC this year for various reasons, I’m very glad to see it was a success. I was surprised to see six writeups of the Berlin satellite event, and only two of the main event in Mexico (plus one talk transcript and some excellent coverage in LWN) – are Europeans better at blogging? 🙂

I again saw folk mention that connections *between* the local, online and satellite events are lacking – I felt this at LAS this year – and I still think we should take inspiration from 2007 Twitter and put up a few TVs in the venue with chat and microblog windows open.

The story is, that Twitter launched in 2006 to nobody, and it became a hit only after putting up screens at a US music festival in 2007 displaying their site where folk could live-blog their activities.

Hey old people, remember this?! Via webdesignmuseum.org

I’d love to see something similar at conferences next year where online participants can “write on the walls” of the venue (avoiding the ethically dubious website that Twitter has become).

On that note, I just released a new version of my Twitter Without Infinite Scroll extension for Firefox.

Riding Trains

I made it from Galicia to the UK overland, actually for the first time. (I did do the reverse journey already by boat + van in 2018). It was about 27 hours of travel spread across 5 days, including a slow train from Barcelona into the Pyrenees, and a night train onwards to Paris, and I guess cost around 350€. The trip went fine, in comparison to the plane I had booked to return which was cancelled by the airline without notification, so the return flight+bus ended up costing a similar amount and taking nearly 15 hours. No further comment on that but I can recommend the train ride!

High speed main line through the French Pyrenees

Status update 18/07/2022

Summer is here!

All my creative energy has gone into wrapping up a difficult project at Codethink, and the rest of the time I’ve been enjoying sunshine and festivals. I was able to dedicate some time to learning the basics of async Rust but I don’t have much to share from the last month. Instead, let me focus on some projects I’m keeping an eye on.

Firstly, in the Tracker search engine, Carlos Garnacho has landed some important features and refactors. The main one being stream-based serializers and deserializers.

This allows more easily backing up and importing data in and out the tracker-store, and cleaning up some cruft like multiple different implementations of Turtle. It seems ideal having a totally stream-based codec so you can process an effectively infinite amount of data, but there is a tradeoff if you serialize data triple-by-triple – the serialized output is much less human-readable and in some cases larger than if you do some buffering and group related statements together. For this reason we didn’t yet land the JSON-LD support.

Carlos also rewrote the last piece of Vala in libtracker-sparql into C. Vala makes some compromises that aren’t helpful for making a long-term ABI stable C library. There are more lines of code now and we particularly miss Vala’s async support, but this makes maintenance easier as we can now be sure that any misbehaving C code was generated by ourselves and not by the Vla compiler.

There are some other fixes for issues reported by various contributors, thanks to everyone that got involved in the 3.3.0 cycle so far 🙂

Meanwhile, in world of BuildStream there is a lot of activity for the final push towards a BuildStream 2.0 release. I’m only a bystander in the process and to me things look promising. The 2.0 API is finalized and frozen, there’s a small list of blockers remaining – any help to resolve these is welcome! – and then the door is open to a 2.0 release. That’s most likely to be preceded by one or more “1.97” release candidates to allow wider testing. Whatever happens the hope is that the upcoming Freedesktop SDK 22.08 release can be rolled with BuildStream “2.0”.

I’ve been using Bst 1.x for a while and finding it a very helpful tool. The REAPI support is particularly cool as it allows organisations to manage a single distributed build infrastructure that multiple tools can make use of – I hope it gains more traction, and I think it’s already helping to “sell” BuildStream to organisations that are looking at doing distributed builds.

Finally, Kate Bush’s music is super popular again, which is great and I want to share this tidbit from my own youth in the early 2000s right out of Sunderland doing an excellent Hounds of Love cover:

Status update, 17/06/2022

I am currently in the UK – visiting folk, working, and enjoying the nice weather. So my successful travel plans continue for the moment… (corporate mismanagement has led to various transport crises in the UK so we’ll see if I can leave as successfully as I arrived).

I started the Calliope playlist toolkit back in 2016. The goal is to bring open data together and allow making DIY music recommenders, but its rather abstract to explain this via the medium of JSON documents. Coupled with a desire to play with GTK4, which I’ve had no opportunity to do yet, and inspired by a throwaway comment in the MusicBrainz IRC room, I prototyped up a graphical app that shows what kind of open data is available for playlist generation.

This “calliope popup” app can watch MPRIS nofications, or page through an existing playlist. In future it could also page through your Listenbrainz listen history. So far it just shows one type of data:

This screenshot shows MusicBrainz metadata for my test playlist’s first track, which happens to be the song “Under Pressure”. (This is a great test because it is credited to two artists :-). The idea is to flesh out the app with metadata from various different providers, making it easier to see what data is available and detect bad/missing info.

The majority of time spent on this so far has been (re-)learning GTK and figuring out how to represent the data on screen. There was some also work involved making Calliope itself return data more usefully.

Some nice discoveries since I last did anything in GTK are the Blueprint UI language, and the Workbench app. Its also very nice having the GTK Inspector available everywhere, and being able to specify styling via a CSS file. (I’ve probably done more web sites than GTK apps in the last 10 years, so being able to use the same mental model for both is a win for me.). The separation of Libadwaita from GTK also makes sense and helps GTK4 feels more focused, avoiding (mostly) having 2 or 3 widgets for one purpose.

Apart from that, I’ve been editing and mixing new Vladimir Chicken music – I can strongly recommend that you never try to make an eight minute song. This may be the first and last 8 minute song from VC 🙂

Status update, 15/04/2022

As i mentioned last month, I bought one of these Norns audio-computers and a grey grid device to go with it. So now I have this lap-size electronic music apparatus.

Its very fun to develop for – the truth is I’ve never got on well with the “standard” tools of Max/MSP and Pure Data, I suspect flow-based programming just isn’t for me. As within an hour of opening the web-based Lua editor on the Norns device I already had a pretty cool prototype of something I hadn’t even intended to make. More on that when I can devote more time to it.

There is some work to make the open-source Norns software run on the Linux desktop in a container. It seems the status is “working but awkward.” I thought to myself – “How hard could it be to package this as a Flatpak?”

As soon as I saw use of Golang and Node.js I knew the answer – very hard.

In the world of Flatpak apps we are rigorous about doing reproducible builds from controlled inputs. The flatpak-builder tool requires that all the Go and JavaScript dependencies are specified ahead of time, and build tools are prevented from connecting to the internet at build time. This is a good thing, as the results are otherwise unpredictable. The packaging tools in the Go and Node.js ecosystems make this kind of attention to detail difficult to achieve.

There is a solution, which would be to build the components using BuildStream and develop plugins to integrate with the Node.js and Go worlds, so it can fetch deps ahead of time. There’s an open issue for Go plugin. For Node.js I am not sure if anything exists.

This led me to thinking – how come the open-source, community developed world of Flatpaks is so much better and doing reliable, reproducible builds than the corporate environments that I see in my day job?

One last cool thing – thanks to instructions here by @infinitedigits I was able to produce this amazing music video!

Status update, 16/02/2022

January 2022 was the sunniest January i’ve ever experienced. So I spent its precious weekends mostly climbing around in the outside world, and the weekdays preparing for the enourmous Python 3 migration that one of Codethink’s clients is embarking on.

Since I discovered Listenbrainz, I always wanted to integrate it with Calliope, with two main goals. The first, to use an open platform to share and store listen history rather than the proprietary Last.fm. And the second, to have an open, neutral place to share playlists rather than pushing them to a private platform like Spotify or Youtube. Over the last couple of months I found time to start that work, and you can now sync listen history and playlists with two new cpe listenbrainz-history and cpe listenbrainz commands. So far playlists can only be exported *from* Listenbrainz, and the necessary changes to the pylistenbrainz binding are still in review, but its a nice start.

Status update, 17/01/2022

Happy 2022 everyone! Hope it was a safe one. I managed to travel a bit and visit my family while somehow dodging Omicron each step of the way. I guess you cant ask for much more than that.

I am keeping busy at work integrating BuildStream with a rather large, tricky set of components, Makefiles and a custom dependency management system. I am appreciating how much flexibility BuildStream provides. As an example, some internal tools expect builds to happen at a certain path. BuildStream makes it easy to customize the build path by adding this stanza to the .bst file:

variables:
    build-root: /magic/path

I am also experimenting with committing whole build trees as artifacts, as a way to distribute tests which are designed to run within a build tree. I think this will be easier in Bst 2.x, but it’s not impossible in Bst 1.6 either.

Besides that I have been mostly making music, stay tuned for a new Vladimir Chicken EP in the near future.

Status update, November 2021

I am impressed with the well-deserved rise of Sourcehut, a minimalist and open source alternative to Github and Gitlab. I like their unbiased performance comparison with other JavaScript-heavy Git forges. I am impressed by their substantial contributions to Free Software. And I like that the main developers, Drew DeVault and Simon Ser, both post monthly status update blog posts on their respective blogs.

I miss blog posts.

So I am unashamedly copying the format. I am mostly not paid to work on Free Software but sometimes I am so the length of the report will vary wildly.

This month I got a little more Codethink time to work on openqa.gnome.org (shout out Javier Jardón for getting me that time). Status report here.

image

I spoke at the first ever PackagingCon to spread the good word about Freedesktop SDK and BuildStream.

As always, I did a little review and issue triage for Tracker and Tracker Miners.

And I have been spending time working on an audio effect. More about that in another post.

Beginning Rust

I have the privilege of some free time this December and I unexpectedly was inspired to do the first few days of the Advent of Code challenge, by a number of inspiring people including Philip Chimento, Daniel Silverstone and Ed Cragg.

The challenge can be completed in any language, but it’s a great excuse to learn something new. I have read a lot about Rust and never used until a few days ago.

Most of my recent experience is with Python and C, and Rust feels like it has many of the best bits of both languages. I didn’t get on well with Haskell, but the things I liked about that language are also there in Rust. It’s done very well at taking the good parts of these languages and leaving out the bad parts. There’s no camelCaseBullshit, in particular.

As a C programmer, it’s a pleasure to see all of C’s invisible traps made explicit and visible in the code. Even integer overflow is a compile time error. As a Python programmer, I’m used to writing long chains of operations on iterables, and Rust allows me to do pretty much the same thing. Easy!

Rust does invent some new, unique bad parts. I wanted to use Ed’s cool Advent of Code helper crate, but somehow installing this tiny library using Cargo took up almost 900MB of disk space. This appears to be a known problem. It makes me sad that I can’t simply use Meson to build my code. I understand that Cargo’s design brings some cool features, but these are big tradeoffs to make. Still, for now I can simply avoid using 3rd party crates which is anyway a good motivation to learn to work with Rust’s standard library.

I also spend a lot of time figuring out compiler errors. Rust’s compiler errors are very good. If you compare them to C++ compiler errors, then there’s really no comparison at all. In fact, they’re so good that my expectations have increased, and paradoxically this makes me more critical! (Sometimes you have to measure success by how many complaints you get). When the compiler tells me ‘you forgot this semicolon’, part of me thinks “Well you know what I meant — you add the semicolon!”. And while some errors clearly tell you what to fix, others are still pretty cryptic. Here’s an example:

error[E0599]: no method named `product` found for struct `Vec<i64>` in the current scope
   --> day3.rs:82:34
    |
82  |       let count: i64 = tree_counts.product();
    |                                    ^^^^^^^ method not found in `Vec<i64>`
    |
    = note: the method `product` exists but the following trait bounds were not satisfied:
            `Vec<i64>: Iterator`
            which is required by `&mut Vec<i64>: Iterator`
            `[i64]: Iterator`
            which is required by `&mut [i64]: Iterator`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0599`.

What’s the problem here? If you know Rust, maybe it’s obvious that my tree_counts variable is a Vec (list), and I need to call .iter() to produce an iterator. If you’re a beginner, this isn’t a huge help. You might be tempted to call rustc --explain E0599, which will tell you that you might, for example, need to implement the .chocolate() method on your Mouth struct. This doesn’t get you any closer to knowing why you can’t iterate across a list, which is something that you’d expect to be iterable.

Like I said, Rust is lightyears ahead of other compilers in terms of helpful error messages. However, if it’s a goal that “Rust is for students”, then there is still lots of work to do to improve them.

I know enough about software development to know that the existence of Rust is nothing short of a miracle. The Rust community are clearly amazing and deserve ongoing congratulations. I’m also impressed with Advent of Code. December is a busy time, which is why I’ve never got involved before, but if you are looking for something to do then I can recommend it!

You can see my Advent of Code repo here. It may, or may not proceed beyond day 4. It’s useful to check your completed code against some kind of ‘model’, and I’ve been using Daniel’s repo for that. Who else has some code to show?

Writing well

We rely on written language to develop software. I used to joke that I worked as a professional email writer rather than a computer programmer (and it wasn’t really a joke). So if you want to be a better engineer, I recommend that you focus some time on improving your written English.

I recently bought 100 Ways to Improve Your Writing by Gary Provost, which is a compact and rewarding book full of simple and widely applicable guidelines to writers. My advice is to buy a copy!

You can also find plenty of resources online. Start by improving your commit messages. Since we love to automate things, try these shell scripts that catch common writing mistakes. And every time you write a paragraph simply ask yourself: what is the purpose of this paragraph? Is it serving that purpose?

Native speakers and non-native speakers will both find useful advice in Gary Provost’s book. In the UK school system we aren’t taught this stuff particularly well. Many English-as-a-second-language courses don’t teach how to write on a “macro” level either, which is sad because there are many differences from language to language that non-natives need to be aware of. I have seen “Business English” courses that focus on clear and convincing communication, so you may want to look into one of those if you want more than just a book.

Code gets read more than it gets written, so it’s worth taking extra time so that it’s easy for future developers to read. The same is true of emails that you write to project mailing lists. If you want to make a positive change to development of your project, don’t just focus on the code — see if you can find 3 ways to improve the clarity of your writing.

Natural Language Processing

This month I have been thinking about good English sentence and paragraph structure. Non-native English speakers who are learning write in English will often think of what they want to say in their first language and then translate it. This generally results in a mess. The precise structure of the mess will depend on the rules of the student’s first language. The important thing is to teach the conventions of good English writing; but how?

Visualizing a problem helps to solve it. However there doesn’t seem to be a tool available today that can clearly visualize the various concerns writers have to deal with. A paragraph might contain 100 words, each of which relate to each other in some way. How do you visualize that clearly… not like this, anyway.

I did find some useful resources though. I discovered the Paramedic Method, through this blog post from helpscout.net. The Paramedic Method was devised by Richard Lanham and consists of these 6 steps:

  1. Highlight the prepositions.
  2. Highlight the “is” verb forms.
  3. Find the action. (Who is kicking whom?)
  4. Change the action into a simple active verb.
  5. Start fast—no slow windups.
  6. Read the passage out loud with emphasis and feeling.

This is good advice for anyone writing English. It’ll be particularly helpful in my classes in Spain where we need to clean up long strings of relative clauses. (For example, a sentence such as “On the way I met one of the workers from the company where I was going to do the interview that my friend got for me”. I would rewrite this as: “On the way I met a person from Company X, where my friend had recently got me an interview.”

I found a tool called Write Music which I like a lot. The idea is simple: to illustrate and visualize the rule that varying sentence length is important when writing. The creator of the tool, Titus Wormer, seems to be doing some interesting and well documented research.

I looked at a variety of open source tools for natural language processing. These provide good ways to tokenize a text and to identify the “part of speech” (noun, verb, adjective, adverb, etc.) but I didn’t yet find one that could analyze the types of clauses that are used. Which is a shame. My understanding of this is an area of English grammar is still quite weak and I was hoping my laptop might be able teach me by example but it seems not.

I found some surprisingly polished libraries that I’m keen to use for … something. One day I’ll know what. The compromise library for JavaScript can do all kinds of parsing and wordplay and is refreshingly honest about its limitations, and spaCy for Python also looks exciting. People like to interact with a computer through text. We hide the UNIX commandline. But one of the most popular user interfaces in the world is the Google search engine, which is a text box that accepts any kind of natural language and gives the impression of understanding it. In many cases this works brilliantly — I check spellings and convert measurements all the time using this “search engine” interface. Did you realize GNOME Shell can also do unit conversions? Try typing “50lb in kg” into the GNOME Shell search box and look at the result. Very useful! More apps should do helpful things like this.

I found some other amazing natural language technologies too. Inform 7 continues to blow my mind whenever I look at it. Commercial services like IBM Watson can promise incredible things like analysing the sentiments and emotions expressed in a text, and even the relationships expressed between the subjects and objects. It’s been an interesting day of research!

Enourmous Git Repositories

If you had a 100GB Subversion repository, where a full checkout came to about 10GB of source files, how would you go about migrating it to Git?

One thing you probably wouldn’t do is import the whole thing into a single Git repo, it’s pretty well known that Git isn’t designed for that. But, you know, Git does have some tools that let you pretend it’s a centralised version control system, and, huge monolithic repos are cool, and it works in Mercurial… evidence is worth more than hearsay, so I decided to create a Git repo with 10GB of text files to see what happened. I did get told in #git on Freenode that Git will not cope with a repo that’s larger than available RAM, but I was a little suspicious given the number of multi-gigabyte Git repos in existance.

I adapted a Bash script from here to create random filenames, and the csmith program to fill those files with nonsense C++ code, until I had 10GB 7GB of such gibberish.(I realised that, having used du -s instead of du --apparent-size -s to check the size of my test data, it was only 7GB of content, that was using 10GB of disk space.)

The test machine was an x86 virtual machine with 2GB of RAM and 1CPU, with no swap. The repo was on a 100GB ext4 volume. Doing a performance benchmark on a virtual machine on shared infrastructure is a bad idea, but I’m testing a bad idea, so whatever. The machine ran Git version 2.5.0.

Results

Generating the initial data: this took all night, perhaps because I included a call to du inside the loop that generated the data, which would take an increasing amount of time on each iteration.

Creating an initial 10GB 7GB commit: 95 minutes

$ time git add .
real    90m0.219s
user    84m57.117s
sys     1m6.932s

$ time git status
real    1m15.992s
user    0m4.071s
sys     0m20.728s

$ time git commit -m "Initial commit"
real    4m22.397s
user    0m27.168s
sys     1m5.815s

The git log command is pretty instant, a git show of this commit takes a minute the first time I run it, about 5 seconds if I run it again.

Doing git add and git rm to create a second commit is really quick, git status is still slow, but git commit is quick:

$ time git status
real    1m19.937s
user    0m5.063s
sys     0m16.678s

$ time git commit -m "Put all z files in same directory"
real    0m11.317s
user    0m1.639s
sys     0m5.306s

Furthermore, git show of this second commit is quick too.

Next I used git daemon to serve the repo over git:// protocol:

$ git daemon --verbose --export-all --base-path=`pwd`

Doing a full clone from a different machine (with Git 2.4.3, over
intranet): 22 minutes

$ time git clone git://172.16.20.95/huge-repo
Cloning into 'huge-repo'...
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.53 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
Checking connectivity... done.
Checking out files: 100% (46345/46345), done.

real    22m17.734s
user    2m12.606s
sys     0m54.603s

Doing a sparse checkout of a few files: 15 minutes

$ mkdir sparse-checkout
$ cd sparse-checkout
$ git init .
$ git config core.sparsecheckout true
$ echo z-files/ >> .git/info/sparse-checkout

$ time git pull  git://172.16.20.95/huge-repo master
remote: Counting objects: 339412, done.
remote: Compressing objects: 100% (33351/33351), done.
remote: Total 339412 (delta 5436), reused 0 (delta 0)
Receiving objects: 100% (339412/339412), 752.12 MiB | 2.58 MiB/s, done.
Resolving deltas: 100% (5436/5436), done.
From git://172.16.20.95/huge-repo
 * branch            master     -> FETCH_HEAD

real    14m26.032s
user    1m9.133s
sys     0m22.683s

This is rather unimpressive. I only pull a 55MB subset of the repo, a single directory, but the clone still takes nearly 15 minutes. Cloning the same subset again from the same git-daemon process took a similar time. The .git directory of the sparse clone is the same size as with a full clone.

I think these numbers are interesting. They show that the sky doesn’t fall if you put a huge amount of code into Git. At the same time, the ‘sparse checkouts’ feature doesn’t really let you pretend that Git is a centralised version control system, so you can’t actually avoid the consequences of having such a huge repo.

Also, I learned that if you are profiling file size, you should use du --apparent-size to measure that, because file size != disk usage!

Disclaimer: there are better ways to spend your time than trying to use a tool for things that it’s not designed for (sometimes).

Cleaning up stale Git branches

I get bored looking through dozens and dozens of stale Git branches. If git branch --remote takes up more than a screenful of text then I am unhappy!

Here are some shell hacks that can help you when trying to work out what can be deleted.

This shows you all the remote branches which are already merged, those can probably be deleted right away!

git branch --remote --merged

These are the remote branches that aren’t merged yet.

git branch --remote --no-merged

Best not to delete those straight away. But some of them are probably totally stale. This snippet will loop through each unmerged branch and tell you (a) when the last commit was made, and (b) how many commits it contains which are not merged to ‘origin/master’.

for b in $(git branch --remote --no-merged); do
    echo $b;
    git show $b --pretty="format:  Last commit: %cd" | head -n 1;
    echo -n "  Commits from 'master': ";
    git log --oneline $(git merge-base $b origin/master)..$b | wc -l;
    echo;
done

The output looks like this:

origin/album-art-to-libtracker-extract
  Last commit: Mon Mar 29 17:22:14 2010 +0100
  Commits from 'master': 1

origin/albumart-qt
  Last commit: Thu Oct 21 11:10:25 2010 +0200
  Commits from 'master': 1

origin/api-cleanup
  Last commit: Thu Feb 20 12:16:43 2014 +0100
  Commits from 'master': 18

...

Two of those haven’t been touched for five years, and only contain a single commit! So they are probably good targets for deletion, for example.

You can also get the above info sorted, with the oldest branches first. First you need to generate a list. This outputs each branch and the date of its newest commit (as a number), sorts it numerically, then filters out the number and writes it to a file called ‘unmerged-branches.txt’:

for b in $(git branch --remote --no-merged); do
    git show $b --pretty="format:%ct $b" | head -n 1;
done | sort -n | cut -d ' ' -f 2 > unmerged-branches.txt

Then you can run the formatting command again, but replace the first line with:

for b in $(cat unmerged-branches.txt); do

OK! You have a list of all the unmerged branches and you can send a mail to people saying you’re going to delete all of them older than a certain point unless they beg you not to.

.yml files are an anti-pattern

A lot of people are representing data as YAML these days. That’s good! It’s an improvement over the days when everything seemed to be represented as XML, anyway.

But one thing about the YAML format is that it doesn’t require you to embed any information in the file about how the data should be interpreted. So now we have projects where there are hundreds of different .yml files committed to Git and I have no idea what any of them are for.

YAML is popular because it’s minimal and convenient, so I don’t think that requiring that everyone suddenly creates an ontology for the data in these .yml files would be practical. But I would really like to see a convention that the first line of any .yml file was a comment describing what the file did, e.g.

# This is a BOSH deployment manifest, see http://bosh.io/ for more information

That’s all!

Running Firefox in a cgroup (using systemd)

This blog post is very out of date. As of 2020, you can find up to date information about this topic in the LWN article “Resource management for the desktop”

I’m a long time user of Firefox and it’s a pretty good browser but you know how sometimes it eats all of the memory on your computer and uses lots of CPU so the whole thing becomes completely unusable? That is incredibly annoying!

I’ve been using a build system with a web interface recently and it is really a problem there, because build logs can be quite large (40MB) and Firefox handles them really badly.

Linux has been theoretically able to limit how much CPU, RAM and IO that a process can use for some time, with the cgroups mechanism. Its default behavour, at the time of writing, is to let Firefox starve out all other processes so that I am totally unable to kill it and have to force power-off on my computer and restart it. It would make much more sense for Linux’s scheduler to ensure that the user interface always gets some CPU time, so I can kill programs that are going nuts, and also for the out-of-memory killer to actually work properly. There is a proposal to integrate gnome-session with systemd which I hope would solve this problem for me. But in the meantime, here’s a fairly hacky way of making sure that Firefox always runs in a cgroup with a fixed amount of memory, so that it will crash itself when it tries to use too much RAM instead of making your computer completely unusable.

I’m using Fedora 20 right now, but probably any operating system with Linux and systemd will work the same.

First, you need to create a ‘slice’. The documentation for this stuff is quite dense but the concept is simple: your system’s resources get divided up into slices. Slices are heirarchical, and there are some predefined slices that systemd provides including user.slice (for user applications) and system.slice (for system services). So I made a user-firefox.slice:

[Unit]
Description=Firefox Slice
Before=slices.target

[Slice]
MemoryAccounting=true
MemoryLimit=512M
# CPUQuota isn't available in systemd 208 (Fedora 20).
#CPUAccounting=true
#CPUQuota=25%

This should be saved as /etc/systemd/system/user-firefox.slice. Then you can run systemctl daemon-reload && systemctl restart user-firefox.slice and your slice is created with its resource limit!

You can now run a command in this slice using the systemd-run command, as root.

sudo systemd-run --slice user-firefox.slice --scope xterm

The xterm process and anything you run from it will be limited to using 512MB of RAM, and memory allocations will fail for them if more than that is used. Most programs crash when this happens because nobody really checks the result of malloc() (or they do check it, but they never tested the code path that runs if an allocation fails so it probably crashes anyway). If you want to be confident this is working, change the MemoryLimit in user-firefox.slice to 10M and run a desktop application: probably it will crash before it even starts (you need to daemon-reload and restart the .slice after you edit the file for the changes to take effect).

About the --scope argument: a ‘scope’ is basically a way of identifying one or more processes that aren’t being managed directly by systemd. By default, systemd-run would start xterm as a system service, which wouldn’t work because it would be isolated from the X server.

So now you can run Firefox in a cgroup, but it’s a bit shit because you can only do so as the ‘root’ user. You’ll find if you try to use `sudo` or `su` to become your user again that these create a new systemd user session that is outside the user-firefox.slice cgroup. You can use `systemd-cgls` to show the heirarchy of slices, and you’ll see that the commands run under `sudo` or `su` show up in a new scope called something like session-c3.scope, where the scope created by systemd-run that is in the correct slice is called run-1234.scope.

There are various nice ways that we could go about fixing this but today I am not going to be a nice person, instead I just wrote this Python wrapper that becomes my user and then runs Firefox from inside the same scope:

#!/usr/bin/env python3

import os
import pwd


user_info = pwd.getpwnam('sam')
os.setuid(user_info.pw_uid)

env = os.environ.copy()
env['HOME'] = user_info.pw_dir

os.execle('/usr/bin/firefox', 'Firefox (tame)', env)

Now I can run:

sudo systemd-run --slice user-firefox.slice --scope
./user-firefox

This is a massive hack and don’t hold me responsible for anything bad that may come of it. Please contribute to Firefox if you can.

Viruses

I just did a bit of virus removal for a friend, one of the inventive police warning ones. I discovered a couple of useful tricks, since it’s been a while since I had to do this.

http://pogostick.net/~pnh/ntpasswd/ hosts a tool to manipulate a Windows registry from Linux system (Wine itself doesn’t use the real Windows registry format, so it can’t be used for this as I originally expected).

The virus did some trickery to prevent any .exe programs from running (other than some whitelisted ones), preventing access to cmd.exe, taskmgr.exe, regedit.exe or anything else you might be like to use to remove the virus. I forget the mechanism it uses to do this, but one simple fix is to rename regedit.exe to regedit.com.

It was then a disappointingly simple search through the registry to the classic HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Run key where the file tpl_0_c.exe had done a rather poor job of hiding.

I ran Fedora 17 from a memory stick before going back into Windows, and the transition from Gnome Shell on a netbook to Windows XP on a netbook made me understand a lot more some of the design decisions that went into the shell. Windows XP is really fiddly to use on such tiny and crappy hardware whereas the shell felt really comfortable. I’m still not at all convinced that we should be making GNOME run on tablets, but for netbooks it makes perfect sense to maximise things by default and make the buttons bigger. Sadly some of this is coming at the expense of usability on big desktops; I’m ok with having to configure stuff to get a comfortable UI, but the recent thread where side-by-side windows were suggested as a replacement for Nautilus’ split view makes me worried that we’re losing sight of desktop users completely. Side by side windows on a 25″ monitor aren’t remotely comparable to a split window, though they might be on a netbook. OS X has always been dreadful on large screens, we shouldn’t take GNOME down the same path.