The Fundamentals of OSTree

I’ve had some time at work to get my head around the OSTree tool. I summarised the tool in a previous post. This post is an example of its basic functionality (as of January 2014).

The Repository

OSTree repos can be in one of multiple modes. It seems ‘bare’ is for use on deployed systems (as the actual deployment uses hardlinks so disk space usage is kept to a minimum), while ‘archive-z2’ is for server use. The default is ‘bare’.

Every command requires a –repo=… argument, so it makes sense to create an alias to save typing.

$ cd ostree-demo-1
$ mkdir repo
$ alias ost="ostree --repo=`pwd`/repo"

$ ost init
$ ls repo
config  objects  refs  remote-cache  tmp
$ cat repo/config
[core]
repo_version=1
mode=bare

Committing A Filesystem Tree

Create some test data and commit the tree to ‘repo’. Note that none of the commit SHA256s here will match the ones you see when running these commands, unless you are in some kind of time vortex.

$ mkdir tree
$ cd tree
$ echo "x" > 1
$ echo "y" > 2
$ mkdir dir
$ cp /usr/share/dict/words words

$ ost commit --branch=my-branch \
    --subject="Initial commit" \
    --body="This is the first commit."
ce19c41036cc45e49b0cecf6b157523c2105c4de1ce30101def1f759daafcc3e

$ ost ls my-branch
d00755 1002 1002      0 /
-00644 1002 1002      2 /1
-00644 1002 1002      2 /2
-00644 1002 1002 4953680 /words
d00755 1002 1002      0 /dir

Let’s make a second commit.

$ tac words > words.tmp && mv words.tmp words

$ ost commit --branch=my-branch \
    --subject="Reverse 'words'"
    --body="This is the second commit."
67e382b11d213a402a5313e61cbc69dfd5ab93cb07fbb8b71c2e84f79fa5d7dc

$ ost log my-branch
commit 67e382b11d213a402a5313e61cbc69dfd5ab93cb07fbb8b71c2e84f79fa5d7dc
Date:  2014-01-14 12:27:05 +0000

    Reverse 'words'

    This is the second commit.

commit ce19c41036cc45e49b0cecf6b157523c2105c4de1ce30101def1f759daafcc3e
Date:  2014-01-14 12:24:19 +0000

    Initial commit

    This is the first commit.

Now you can see two versions of ‘words’:

$ ost cat my-branch words | head -n 3
ZZZ
zZt
Zz
error: Error writing to file descriptor: Broken pipe

$ ost cat my-branch^ words | head -n 3
1080
10-point
10th
error: Error writing to file descriptor: Broken pipe

OSTree lacks most of the ref convenience tools that you might be used to from Git. For example, where providing a SHA256 you cannot abbreviate it, you must give the full 40 character string. There is an ostree rev-parse command which takes a branch name and returns the commit that branch ref currently points too, and you may use ^ to refer to the parent of a commit as seen above.

Checking Out Different Versions of the Tree

The last command we’ll look at is ostree checkout. This command expects you to be checking out the tree into an empty directory, and if you give it only a branch or commit name it will create a new directory with the same name as that branch or commit:

$ ost checkout my-branch
$ ls my-branch/
1  2  dir  words
$ head -n 3 my-branch/words
ZZZ
zZt
Zz

By default, OSTree will refuse to check out into a directory that already exists:

$ ost checkout my-branch^ my-branch
error: File exists

However, you can also use ‘union mode’, which will overwrite the contents of the directory (keeping any files which already exist that the tree being checked out doesn’t touch, but overwriting any existing files that are also present in the tree):

$ ost checkout --union my-branch^ my-branch
$ ls my-branch/
1  2  dir  words
$ head -n 3 my-branch/words 
1080
10-point
10th

One situation where this is useful is if you are constructing a buildroot that is a combination of several other trees. This is how the Gnome Continuous build system works, as I understand it.

Disk Space

In a bare repository OSTree stores each file that changes in a tree as a new copy of that file, so our two copies of ‘words’ make the repo 9.6MB in total:

$ du -sh my-branch/
4.8M    my-branch/
$ du -sh repo/
9.6M    repo/

It’s clear that repositories will get pretty big when there is a full operating system tree in there and there is no functionality right now that allows you to expire commits that are older than a certain threshold. So if you are looking for somewhere to start hacking …

In Use

OSTree contains a higher layer of commands which operate on an entire sysroot rather than just a repository. These commands are documented in the manual, and a good way to try them out is to download the latest Gnome Continuous virtual machine images.

OS-level version control

At the time of writing (January 2014), the area of OS-level version control is at the “scary jungle of new-looking projects” stage. I’m pretty sure we are at the point where most people would answer the question “Is it useful to be able to snapshot and rollback your operating system?” with “yes”. Faced with the next question, “How do you recommend that we do it?” the answer becomes a bit more difficult.

I have not tried out all of the projects below but hopefully this will be a useful summary.

Snapper

The openSUSE project have created Snapper, which has a specific goal of providing snapshot, rollback and diff of openSUSE installations.

It’s not tied to a specific implementation method, instead there are multiple backends, currently for btrfs, LVM or ext4 using the ‘next4’ branch. Btrfs snapshots seems to be the preferred implementation.

The tool is implemented as a daemon which exposes a D-Bus API. There is a plugin for the Zypper package manager and the YaST config tool which automatically creates a snapshot before each operation. There is a cron job which creates a snapshot every hour, and a PAM module which creates a snapshot on every login. There is also a command-line client which allows you to create and manage snapshots manually (of course, you can also call the D-Bus API directly).

Creating a snapshot every hour might sound expensive but remember that Btrfs is a copy-on-write filesystem, so if nothing has changed on disk then the snapshot comes virtually free. Even so, there is also a cron job set up which cleans up stale snapshots in a configurable way.

The openSuSE user manual contains this rather understated quote:

Since Linux is a multitasking system, processes other than YaST or zypper may modify data in the timeframe between the pre- and the post-snapshot. If this is the case, completely reverting to the pre-snapshot will also undo these changes by other processes. In most cases this would be unwanted — therefore it is strongly recommended to closely review the changes between two snapshots before starting the rollback. If there are changes from other processes you want to keep, select which files to roll back.

OSTree

GNOME/Red Hat developer Colin Walters is behind OSTree, which is notable primarily because it provides snapshotting without depending on any underlying filesystem features. Instead, it implements its own version control system for binary files.

In brief, you can create snapshots and go back to them later on, and compare differences between them. There doesn’t seem to be a way to delete them, yet — unlike Snapper, which is being developed with a strong focus on being usable today, OSTree is being developed bottom-up starting from the “Git for binaries” layer. There is work on the higher level in progress; Colin recently announced a prototype of RPM integration for OSTree.

Beyond allowing local management of filesystem trees, OSTree also has push and a pull command, which allow you to share branches between machines. Users of the GNOME Continuous continuous integration system can use this today to obtain a ready-built GNOME OS image from the build server, and then keep it up to date with the latest version using binary deltas.

OSTree operates atomically, and the root tree is mounted read-only so that other processes cannot write data there. This is a surefire way to avoid losing data, but it does require major rethinking on how Linux-based OSes work. OSTree provides an init ramdisk and a GRUB module, which allows you to choose at boot time which branch of the filesystem to load.

There are various trade-offs between OSTree’s approach versus depending on features in a specific filesystem. OSTree’s manual discusses these tradeoffs in some detail. An additional consideration is that both OSTree and Btrfs are still under active development and therefore you may encounter bugs that you need to triage and fix. OSTree is roughly 28k lines of C, running entirely in userland, while the kernel portion of Btrfs alone is roughly 91k lines of C.

Docker

Docker is a tool for managing Linux container images. As well as providing a helpful wrapper around LXC for running containers, it allows taking snapshots of the state of a container. Until recently it implemented this using the AUFS union file system, which requires patching your kernel to use and is unlikely to make it into mainline Linux any time soon, but as of version 0.7 Docker allows use of multiple storage backends. Alex Larsson (coincidentally another GNOME/Red Hat developer) implemented a practical device-mapper storage backend which works on all Linux-based OSes. There is talk of a Btrfs storage backend too, but I have not seen concrete details since this prototype from May 2013.

It can be kind of confusing to understand Docker’s terminology at first so I’ll run through my understanding of it. The machine you are actually running has a root filesystem and some configuration info, and that’s a container. You create containers by calling docker run <image>, where an image is a root filesystem and some configuration info, but stored. You can have multiple versions of the same image, which are distinguished by tags. For example, Docker provide a base image named ‘ubuntu’ which has a tag for each available version. The rule seems to be that if you’re using it, it’s a container, if it’s a snapshot or something you want to use as a base for something else, it’s an image. You’ll have to train yourself to avoid ever using the term “container image” again.

Docker’s version control functionality is quite primitive. You can call docker commit to create a new image or a new tag of an existing image from a container. You can call docker diff to show an svn status-style list of the differences between a container and the image that it is based on. You can also delete images and containers.

You can also push and pull images from repositories such as the public Docker INDEX. This allows you to get a container going from scratch pretty quickly. However you can create an image from a rootfs tarball using docker import, so you can base your containers off any operating system you like.

On top of all this, Docker provides a simple automation system for creating containers via calls to a package manager or a more advanced provisioning system like Puppet (or just arbitrary shell commands).

Linux Containers (LXC)

Docker is implemented on top of Linux Containers. It’s possible to use these commands directly, too (although unlike Docker, where ignoring the manual and using the default settings will get you to a working container, using LXC commands without having read the manual first is likely to lead to crashing your kernel). LXC provides a snapshotting mechanism too via the lxc-clone and lxc-snap commands. It uses the snapshotting features of either Btrfs or LVM, so it requires that your containers (not just the rootfs, but the configuration files and all) are stored on either a Btrfs or LVM filesystem.

Baserock

The design of Baserock includes atomic snapshot and rollback using Btrfs. Not much time has been spent on this so far, partly because despite being originally being aimed at embedded systems Baserock is seeing quite a lot of use as a server OS.

The component that does exist is tb-diff, which is a tool for comparing two snapshots (two arbitrary filesystems, really) and constructing a binary delta. This is useful for providing upgrades, and could even be used to rebase in-development systems on top of a new base image. While all the above projects provide a ‘diff’ command, Snapper’s simply defers to the UNIX diff command (which can only display differences in text files, not binaries), and Docker’s doesn’t display contents at all, just the names of the files that have changes.

Addendum: Nix

It was remiss of me I think not to have mentioned Nix, which aims to remove the need for snapshots altogether by implementing rollback (and atomic package management transactions) at the package level using hardlinks. Nix is a complex enough beast that you wouldn’t want to use it in a non-developer-focused OS as-is, but you could easily implement a snapshot and rollback mechanism on top.