December 6, 2023 (updated March 19, 2024) Sam Thursfield

Calliope 10.0: creating music playlists using Tracker Miner FS

I just published version 10.0 of the open source playlist generation toolkit, Calliope. This fixes a couple of long-standing issues I wanted to tackle.

SQLite Concurrency

The first of these only manifested itself as intermittent GitLab CI failures when you submitted pull requests. Calliope uses SQLite to cache data, and a cache may be used by multiple concurrent processes. SQLite has a “Write-Ahead Log” journalling mode that should excel at concurrency, but somehow I kept seeing “database is locked” errors from a test that verified the cache with multiple writers. Well – make sure to explicitly *close* database connections in your Python threads.
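Here is a minimal sketch of that lesson using only the Python standard library. It is not Calliope's actual cache code: the table layout and the function names (`open_cache`, `writer`, `run_writers`) are assumptions made up for illustration. The point it shows is that `with conn:` only commits or rolls back, so each thread wraps its connection in `contextlib.closing()` to make sure it is really closed.

```python
import os
import sqlite3
import tempfile
import threading
from contextlib import closing


def open_cache(path):
    """Open a connection in WAL journalling mode (hypothetical helper)."""
    # A generous timeout lets a writer wait for the lock instead of failing.
    conn = sqlite3.connect(path, timeout=30)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def writer(path, worker_id, count):
    # closing() guarantees conn.close() runs even if a write raises.
    # "with conn:" on its own only commits or rolls back the transaction.
    with closing(open_cache(path)) as conn:
        for i in range(count):
            with conn:  # one transaction per write
                conn.execute(
                    "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                    (f"worker{worker_id}-{i}", str(i)),
                )


def run_writers(path, workers=4, count=25):
    """Write from several threads at once and return the final row count."""
    with closing(open_cache(path)) as conn:
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS cache "
                "(key TEXT PRIMARY KEY, value TEXT)"
            )
    threads = [
        threading.Thread(target=writer, args=(path, n, count))
        for n in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    with closing(sqlite3.connect(path)) as conn:
        return conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_writers(os.path.join(tmp, "cache.db")))
```

Each thread opens its own connection and closes it when done, rather than sharing one connection or leaving cleanup to the garbage collector.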

Content resolution with Tracker Miner FS

The second issue was content resolution using Tracker Miner FS, which worked nicely but very slowly. Some background here: “content resolution” involves finding a playable URL for a piece of music, given metadata such as the artist name and track name. Calliope can resolve content against remote services such as Spotify, and can also resolve against a local music collection using the Tracker Miner FS index. The “special mix” example, which generates nightly playlists of excellent music, takes a large list of songs taken from Listenbrainz and tries to resolve each one locally, to check it’s available and get the duration. Until now this took over 30 minutes at 100% CPU.

Why so slow? The short answer is: cpe tracker resolve was not using the Tracker FTS (Full Text Search) engine. Why? Because there are some limitations in Tracker FTS that mean we couldn’t use it in all cases.

About Tracker FTS

The full-text search engine in Tracker uses the SQLite FTS5 module. Any resource type marked with tracker:fullTextIndexed can be queried using a special fts:match predicate. This is how Nautilus search and the tracker3 search command work internally. Try running this command to search your music collection locally for the word “Baby”:

tracker3 sparql --dbus-service org.freedesktop.Tracker3.Miner.Files \
    -q 'SELECT ?title { ?track a nmm:MusicPiece; nie:title ?title; fts:match "Baby" }'

This feature is great for desktop search, but it’s not quite right for resolving music content based on metadata.

Firstly, it is doing a substring match. So if I search for the artist named “Eagles”, it will also match “Eagles of Death Metal” and any other artist that contains the word “Eagles”.

Secondly, symbol matching is very complicated, and the current Tracker FTS code doesn’t always return the results I want. There are at least two open issues, 400 and 171, about bugs. It is tricky to get this right: is ' (U+0027) the same as ʽ (U+02BD)? What about ՚ (U+055A, the “Armenian Apostrophe”)? This might require some time+money investment in Tracker SPARQL before we get a fully polished implementation.
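To make the apostrophe question concrete, here is a small Python sketch of one possible answer: an explicit folding table. This is not what Tracker does. The table contents and the name `fold_apostrophes` are assumptions for illustration, and deciding which characters belong in the table is exactly the hard policy question.

```python
import unicodedata

# Hypothetical folding table: every entry is a judgment call, and a real
# implementation would need many more characters than these three.
APOSTROPHE_LIKE = {
    "\u02bd": "'",  # MODIFIER LETTER REVERSED COMMA
    "\u055a": "'",  # ARMENIAN APOSTROPHE
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
}
_TABLE = str.maketrans(APOSTROPHE_LIKE)


def fold_apostrophes(text):
    """Map apostrophe-like characters to U+0027 before comparing strings."""
    # NFC first, so that composed and decomposed accents compare equal too.
    return unicodedata.normalize("NFC", text).translate(_TABLE)


if __name__ == "__main__":
    for char in ["'"] + list(APOSTROPHE_LIKE):
        print(f"U+{ord(char):04X}", unicodedata.name(char))
```

With a table like this, two spellings of a title match only if every differing character has been listed by hand, which is why a fully polished implementation takes real effort.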

My solution in the meantime is as follows:

Strip all words with symbols from the “artist name” and “track name” fields
If one of the fields is now empty, run the slow lookup_track_by_name query which uses FILTER to do string matching against every track in the database.
Otherwise, run the faster lookup_track_by_name_fts query. This uses both FTS *and* FILTER string matching. If FTS returns extra results, the FILTER query still picks the right one, but we are only doing string matching against the FTS results rather than the whole database.
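The steps above can be sketched in Python roughly like this. Only the two query names, lookup_track_by_name and lookup_track_by_name_fts, come from Calliope; the helper names, the parameter names, the definition of a "symbol" (anything that is not a letter, digit or underscore) and the case-insensitive comparison are all assumptions for illustration.

```python
import re

# Assumption: a "symbol" is any character that is not a letter, digit or
# underscore. In Python 3, \w is Unicode-aware, so "Róisín" has no symbols.
_SYMBOL = re.compile(r"[^\w]")


def strip_words_with_symbols(text):
    """Drop every whitespace-separated word that contains a symbol."""
    return " ".join(w for w in text.split() if not _SYMBOL.search(w))


def choose_lookup(artist, title):
    """Return the query name to run and its (hypothetical) parameters."""
    fts_artist = strip_words_with_symbols(artist)
    fts_title = strip_words_with_symbols(title)
    if not fts_artist or not fts_title:
        # Nothing safe left to hand to FTS: fall back to the slow query.
        return "lookup_track_by_name", {"artist": artist, "title": title}
    return "lookup_track_by_name_fts", {
        "artist": artist,          # still used for exact FILTER matching
        "title": title,
        "artist_fts": fts_artist,  # symbol-free terms for fts:match
        "title_fts": fts_title,
    }


def filter_exact(rows, artist, title):
    """Stand-in for the FILTER step, applied to the FTS candidates only."""
    return [
        row for row in rows
        if row["artist"].casefold() == artist.casefold()
        and row["title"].casefold() == title.casefold()
    ]


if __name__ == "__main__":
    print(choose_lookup("Eagles", "Hotel California")[0])
    print(choose_lookup("Unknown Mortal Orchestra", "Multi-Love")[0])
```

The exact filter is what stops a search for “Eagles” from returning “Eagles of Death Metal”, while the FTS step keeps the number of rows it has to examine small.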

Some unscientific profiling shows the special_mix script took 7 minutes to run last night, down from 35 minutes the night before. Success! And it’d be faster still if people can stop writing songs with punctuation marks in the titles.

Screenshot of a Special Mix playlist — Yesterday’s Special Mix.

Standalone SPARQL queries

You might think Tracker SPARQL and Tracker Miners have stood still since the Tracker 3.0 release in 2020. Not so. Carlos Garnacho has done huge amounts of work all over the two codebases bringing performance improvements, better security and better developer experience. At some point we need to do a review of all this stuff.

Anyway, precompiled queries are one area that improved, and it’s now practical to store all of an app’s queries in separate files. Today most Tracker SPARQL users still use string concatenation to build queries, so the query is hidden away in Python or C code in string fragments, and can’t easily be tested or verified independently. That’s not necessary any more. In Tracker Miners we already migrated to using standalone query files (here and here). I took the opportunity to do the same in Calliope.

The advantages are clear:

no danger of “SPARQL injection” attacks, nor bugs caused by concatenation mistakes
a query is compiled to SQLite bytecode just once per process, instead of happening on every execution
you can check and lint the queries at build time (to do: actually write a SPARQL linter)
you can run and test the queries independently of the app, using tracker3 sparql --file. (Support for setting query parameters due to land Real Soon).
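The first advantage is easy to demonstrate without Tracker at all. The sketch below is an analogy, not Tracker's API: it uses Python's built-in sqlite3 module and SQL rather than SPARQL, and the file name, table and function names are made up for illustration. The query text lives in its own file with a named placeholder, and the value is bound at execution time instead of being concatenated in.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

# Contents of the hypothetical standalone query file.
QUERY_TEXT = "SELECT title FROM tracks WHERE artist = :artist ORDER BY title\n"


def load_query(path):
    """Read a query from its own file, so it can be tested independently."""
    return Path(path).read_text(encoding="utf-8")


def tracks_by_artist(conn, query, artist):
    # The artist name is bound as a parameter, never pasted into the query.
    return [row[0] for row in conn.execute(query, {"artist": artist})]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        query_file = Path(tmp) / "tracks-by-artist.sql"
        query_file.write_text(QUERY_TEXT, encoding="utf-8")
        query = load_query(query_file)

    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute("CREATE TABLE tracks (artist TEXT, title TEXT)")
        conn.executemany(
            "INSERT INTO tracks VALUES (?, ?)",
            [
                ("Eagles", "Hotel California"),
                ("Eagles of Death Metal", "Example Track"),
            ],
        )
        good = tracks_by_artist(conn, query, "Eagles")
        # With string concatenation this input would match every row.
        hostile = tracks_by_artist(conn, query, "Eagles' OR '1'='1")
    return good, hostile


if __name__ == "__main__":
    print(demo())
```

The hostile input is treated as an odd artist name that matches nothing, because it never becomes part of the query text.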

The only catch is some apps have logic in C or Python that affects the query functionality, which will need to be implemented in SPARQL instead. It’s usually possible but not entirely obvious. I got ChatGPT to generate ideas for how to change the SPARQL. Take the easy option! (But don’t trust anything it generates).

Next steps for Calliope

Version 10.0 is a nice milestone for the project. I have some ideas for more playlist generators but I am not sure when I’ll get more time to experiment. In fact I only got time for the work above because I was stuck on the sofa with a head-cold. In a way this is what success looks like.

February 17, 2023 (updated March 20, 2023) Sam Thursfield

Status update, 17/02/2023

This month I attended FOSDEM for the first time since 2017. In addition to eating 4 delicious waffles, I had the honour of presenting two talks, the first in the Testing & Automation devroom on Setting up OpenQA testing for GNOME.

GNOME’s initial OpenQA testing is mostly implemented now and it’s already found its first real bug. The next step is getting more folk interested within GNOME, so we can ensure ongoing maintenance of the tests and infra, and ensure a bus factor of > 1. If you see me at GUADEC then I will probably talk to you about OpenQA, be prepared!! 🙂

My second talk was in the Python devroom, on DIY music recommendations. I intermittently develop a set of playlist generation tools named Calliope, and this talk was mostly aiming to inspire people to start similar fun & small projects, using simple AI techniques that you can learn in a weekend, and taking advantage of the amazing resource that is Musicbrainz. It seemed to indeed inspire some of the audience and led to an interesting chat with Rob Kaye of the Metabrainz Foundation – there is more cool stuff on the way from them.

Here’s a fantastic sketch of the talk by Jeroen Heijmans:

I didn’t link to this in the talk, but apropos of nothing here’s an interesting video entitled Why Spotify Will Eventually Fail.

On the Saturday I met up with Carlos Garnacho and gatecrashed the GNOME docs hackfest, discussing various improvements around search in GNOME. Most of these are now waiting for developer time as they are too large to be done in occasional moments of evening and weekend downtime. Get in touch if you want to find out more!

I must also shout out Marco Trevisan for showing me where to get a decent meal near Madrid Chamartín station on the way home.

Meanwhile at Codethink I have been getting more involved in marketing. It’s a company that exists in two worlds, commercial software services on one side and community-driven open source software on the other, often trying our best to build bridges between the two. There aren’t many marketing graduates who are experts in open source, nor many experienced software developers who want to work full-time on managing social media, so we are still figuring out the details…

Anyway, the initial outcome is that Codethink is now on the Fediverse – follow us here! @codethink@social.codethink.co.uk

June 17, 2022 Sam Thursfield

Status update, 17/06/2022

I am currently in the UK – visiting folk, working, and enjoying the nice weather. So my successful travel plans continue for the moment… (corporate mismanagement has led to various transport crises in the UK so we’ll see if I can leave as successfully as I arrived).

I started the Calliope playlist toolkit back in 2016. The goal is to bring open data together and allow making DIY music recommenders, but it’s rather abstract to explain this via the medium of JSON documents. Coupled with a desire to play with GTK4, which I’ve had no opportunity to do yet, and inspired by a throwaway comment in the MusicBrainz IRC room, I prototyped a graphical app that shows what kind of open data is available for playlist generation.

This “calliope popup” app can watch MPRIS notifications, or page through an existing playlist. In future it could also page through your Listenbrainz listen history. So far it just shows one type of data:

This screenshot shows MusicBrainz metadata for my test playlist’s first track, which happens to be the song “Under Pressure”. (This is a great test because it is credited to two artists :-). The idea is to flesh out the app with metadata from various different providers, making it easier to see what data is available and detect bad/missing info.

The majority of time spent on this so far has been (re-)learning GTK and figuring out how to represent the data on screen. There was also some work involved in making Calliope itself return data more usefully.

Some nice discoveries since I last did anything in GTK are the Blueprint UI language, and the Workbench app. It’s also very nice having the GTK Inspector available everywhere, and being able to specify styling via a CSS file. (I’ve probably done more web sites than GTK apps in the last 10 years, so being able to use the same mental model for both is a win for me.) The separation of Libadwaita from GTK also makes sense and helps GTK4 feel more focused, avoiding (mostly) having 2 or 3 widgets for one purpose.

Apart from that, I’ve been editing and mixing new Vladimir Chicken music – I can strongly recommend that you never try to make an eight minute song. This may be the first and last 8 minute song from VC 🙂

February 16, 2022 Sam Thursfield

Status update, 16/02/2022

January 2022 was the sunniest January I’ve ever experienced. So I spent its precious weekends mostly climbing around in the outside world, and the weekdays preparing for the enormous Python 3 migration that one of Codethink’s clients is embarking on.

Since I discovered Listenbrainz, I always wanted to integrate it with Calliope, with two main goals. The first, to use an open platform to share and store listen history rather than the proprietary Last.fm. And the second, to have an open, neutral place to share playlists rather than pushing them to a private platform like Spotify or Youtube. Over the last couple of months I found time to start that work, and you can now sync listen history and playlists with two new commands, cpe listenbrainz-history and cpe listenbrainz. So far playlists can only be exported *from* Listenbrainz, and the necessary changes to the pylistenbrainz binding are still in review, but it’s a nice start.

April 10, 2021 (updated July 30, 2021) Sam Thursfield

Calliope, slowly building steam

I wrote in December about Calliope, a small toolkit for building music recommendations. It can also be used for some automation tasks.

I added a bandcamp module which lists albums in your Bandcamp collection. I sometimes buy albums and then don’t download them because maybe I forgot or I wasn’t at home when I bought them. So I want to compare my Bandcamp collection against my local music collection and check if something is missing. Here’s how I did it:

# Albums in your online collection that are missing from your local collection.

ONLINE_ALBUMS="cpe bandcamp --user ssssam collection"
LOCAL_ALBUMS="cpe tracker albums"
#LOCAL_ALBUMS="cpe beets albums"

cpe diff --scope=album <($ONLINE_ALBUMS | cpe musicbrainz resolve-ids -) <($LOCAL_ALBUMS)

Like all things in Calliope this outputs a playlist as a JSON stream, in this case, a list of all the albums I need to download:

{
  "album": "Take Her Up To Monto",
  "bandcamp.album_id": 2723242634,
  "location": "https://roisinmurphy.bandcamp.com/album/take-her-up-to-monto",
  "creator": "Róisín Murphy",
  "bandcamp.artist_id": "423189696",
  "musicbrainz.artist_id": "4c56405d-ba8e-4283-99c3-1dc95bdd50e7",
  "musicbrainz.release_id": "0a79f6ee-1978-4a4e-878b-09dfe6eac3f5",
  "musicbrainz.release_group_id": "d94fb84a-2f38-4fbb-971d-895183744064"
}
{
  "album": "LA OLA INTERIOR Spanish Ambient & Acid Exoticism 1983-1990",
  "bandcamp.album_id": 3275122274,
  "location": "https://lesdisquesbongojoe.bandcamp.com/album/la-ola-interior-spanish-ambient-acid-exoticism-1983-1990",
  "creator": "Various Artists",
  "bandcamp.artist_id": "3856789729",
  "meta.warnings": [
    "musicbrainz: Unable to find release on musicbrainz"
  ]
}

There are some interesting complexities to this, and in 12 hours of hacking I didn’t solve them all. Firstly, Bandcamp artist and album names are not normalized. Some artist names have spurious “The”, some album names have “(EP)” or “(single)” appended, so they don’t match your tags. These details are of interest only to librarians, but how can software tell the difference?
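As a rough illustration of the normalization problem, here is a Python sketch of the obvious heuristic. It is not what Calliope does: the function names and the list of suffixes are assumptions, and the approach is lossy on purpose.

```python
import re

# Assumption: only these two suffixes and a leading "The" are handled.
_SUFFIX = re.compile(r"\s*\((?:ep|single)\)\s*$", re.IGNORECASE)
_ARTICLE = re.compile(r"^the\s+", re.IGNORECASE)


def normalize_album(name):
    """Strip a trailing "(EP)" or "(single)" and ignore case."""
    return _SUFFIX.sub("", name).strip().casefold()


def normalize_artist(name):
    """Ignore a leading "The" and ignore case."""
    return _ARTICLE.sub("", name.strip()).casefold()


if __name__ == "__main__":
    print(normalize_album("Multi-Love (single)"))
    print(normalize_artist("The Eagles") == normalize_artist("Eagles"))
```

The catch is that this throws away the very detail that separates an album from a single of the same name, so it trades missed matches for wrong ones.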

The simplest approach is to use Musicbrainz, specifically cpe musicbrainz resolve-ids. By comparing ids where possible we get mostly good results. There are many albums not on Musicbrainz, though, which for now turn up as false positives. Resolving Musicbrainz IDs is a tricky process, too: how do we distinguish Multi-Love (album) from Multi-Love (single) if we only have an album name?
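A minimal Python sketch of comparing by ID where possible might look like the following. It is not the implementation of cpe diff: the choice of musicbrainz.release_group_id as the key, the fallback to a case-insensitive name match, and the function names are all assumptions. The field names themselves are taken from the JSON output shown earlier.

```python
ID_FIELD = "musicbrainz.release_group_id"


def name_key(item):
    """Fallback key for albums that have no Musicbrainz ID."""
    return (item.get("creator", "").casefold(), item.get("album", "").casefold())


def missing_albums(online, local):
    """Return the online albums with no match in the local collection."""
    local_ids = {item[ID_FIELD] for item in local if item.get(ID_FIELD)}
    local_names = {name_key(item) for item in local}
    missing = []
    for item in online:
        if item.get(ID_FIELD) in local_ids:
            continue  # same release group, whatever the tags say
        if name_key(item) in local_names:
            continue  # no shared ID, but the names agree
        missing.append(item)
    return missing


if __name__ == "__main__":
    online = [
        {
            "album": "Take Her Up To Monto",
            "creator": "Róisín Murphy",
            ID_FIELD: "d94fb84a-2f38-4fbb-971d-895183744064",
        },
        {
            "album": "LA OLA INTERIOR Spanish Ambient & Acid Exoticism 1983-1990",
            "creator": "Various Artists",
        },
    ]
    local = [
        {
            "album": "Take Her Up to Monto",
            "creator": "Roisin Murphy",  # tagged differently, same ID
            ID_FIELD: "d94fb84a-2f38-4fbb-971d-895183744064",
        },
    ]
    for album in missing_albums(online, local):
        print(album["album"])
```

The album without an ID can only be matched by name, which is where the false positives come from.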

If you want to try it out, great! It’s still aimed at hackers — you’ll have to install from source with Meson and probably fix some bugs along the way. Please share the fixes!

December 18, 2020 Sam Thursfield

Calliope: Music recommendations for hackers

I started thinking about playlist generation software about 15 years ago. In that time, so much happened that I can’t possibly summarize it all here. I’ll just mention two things. Firstly, Spotify appeared, and proceeded to hire or buy most of the world’s music recommendation experts and make automatic playlists into a commodity. Secondly, I spent a lot of time iterating on a music tool I call Calliope.

Spotify or not?

Spotify’s discovery features can be a great way to find new music, but I’ve always felt like something was missing. The recommendations are opaque. We know broadly how they work, but there’s no way to know why it’s suggesting I listen to ska punk all day, or I try a podcast titled ‘Tu Inglés’, or play some 80’s alternative classics I’m already familiar with. It gets repetitive.

Some of the most original new music isn’t even available on Spotify. Most folk don’t realize that small artists have to pay a distributor to get their music to appear on streaming services like Spotify and Apple Music, a dubious investment when the return for the artist might be a cheque for $0.10 and a little exposure. No wonder that some artists use music purchase sites like Bandcamp exclusively. Of course, this means they’ll never appear in your Discover Weekly playlist.

Algorithms decide which social media posts I see, whether I can get a credit card, and how much I would pay to insure a car. Spotify’s recommendation system is another closed system like the others. But unlike credit agencies and big social networks, the world of music has some very successful repositories of open data. I’ve been saving my listen history to Last.fm since 2006. Shouldn’t I do something with it?

Introducing Calliope

Calliope is an open source tool for hackers who want to generate playlists. Its primary goals are to be a fun side project for me and to produce interesting playlists from my digital music collection. Recently it has begun fulfilling both of those goals so I decided it’s time to share some details.

Querying my music collection with Calliope

The project consists of a set of commandline tools which operate on playlist data. You use a shell pipeline to define the data pipeline. Your local music collection is queried from Tracker or Beets. You can mix in data from Last.fm, Musicbrainz and Spotify. You can output the results as XSPF playlists in your music player. The implementation is Python, but the commandline focus means it can interact with tools in any language that parses JSON.
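To show what “any language that parses JSON” means in practice, here is a small Python reader for a stream of JSON objects written one after another, possibly pretty-printed across several lines. The function name `read_playlist` is an assumption for illustration; it uses only the standard library and does not depend on Calliope.

```python
import io
import json
import sys


def read_playlist(stream):
    """Yield each JSON object from a stream of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    text = stream.read()
    pos = 0
    while True:
        # raw_decode() does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        item, pos = decoder.raw_decode(text, pos)
        yield item


if __name__ == "__main__":
    # Read from a pipe if there is one, otherwise use a built-in sample.
    source = sys.stdin if not sys.stdin.isatty() else io.StringIO(
        '{\n  "album": "A"\n}\n{\n  "album": "B"\n}\n'
    )
    for entry in read_playlist(source):
        print(entry.get("album"))
```

Saved as a script, this could sit at the end of a shell pipeline such as `cpe tracker albums | python3 script.py`.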

The goal is not to replace Spotify here. The goal is to make recommendations open and transparent. That means you’re going to see the details of how they work. My dream would be that this becomes an educational tool to help us understand more about what “algorithms” (used in the journalistic sense) actually do.

I’m developing a series of example playlist generation scripts. I’m particularly enjoying “Music I haven’t listened to in over a year” — that one requires over a year of listen history data to be useful, of course. But even the “One hour random shuffle” playlist is fun.
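The idea behind “Music I haven’t listened to in over a year” can be sketched in a few lines of Python. This is not the actual example script: the `title` field, the shape of the listen history (a dict from creator and title to the last listen time) and the decision to skip tracks that were never played are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def not_heard_recently(tracks, last_listened, now, max_age=timedelta(days=365)):
    """Keep tracks whose most recent listen is older than max_age."""
    cutoff = now - max_age
    result = []
    for track in tracks:
        heard = last_listened.get((track["creator"], track["title"]))
        # Assumption: tracks with no listen history at all are skipped.
        if heard is not None and heard < cutoff:
            result.append(track)
    return result


if __name__ == "__main__":
    now = datetime(2020, 12, 18, tzinfo=timezone.utc)
    tracks = [
        {"creator": "Artist A", "title": "Old Favourite"},
        {"creator": "Artist B", "title": "Current Obsession"},
    ]
    history = {
        ("Artist A", "Old Favourite"): now - timedelta(days=500),
        ("Artist B", "Current Obsession"): now - timedelta(days=3),
    }
    for track in not_heard_recently(tracks, history, now):
        print(track["title"])
```

This also makes it obvious why the playlist needs more than a year of listen history before it returns anything.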

A breakthrough this month was the start of a constraints-based approach for selecting songs. I found a useful model in a paper from 2006 titled “Fast Generation of Optimal Music Playlists using Local Search”, and implemented a subset using the Python simpleai library. Simple things can produce great results. I’m only scratching the surface of what’s possible with this model, using constraints on the duration property to ensure songs and playlists are a suitable length. I expect to show off some more sophisticated examples in future.
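For a feel of how local search with duration constraints works, here is a plain hill-climbing sketch in Python. It uses only the standard library. It is not the simpleai-based code in Calliope and not the algorithm from the paper. The function names, the `duration` field being in seconds, and the limits are assumptions for illustration.

```python
import random


def cost(playlist, target):
    """Distance in seconds between the playlist length and the target."""
    return abs(sum(song["duration"] for song in playlist) - target)


def local_search(songs, target, min_song=120, max_song=420, steps=2000, seed=0):
    """Pick songs whose total duration approaches target, by hill climbing."""
    rng = random.Random(seed)
    # The per-song constraint is treated as hard, so filter up front.
    pool = [s for s in songs if min_song <= s["duration"] <= max_song]
    current = []
    for _ in range(steps):
        if cost(current, target) == 0:
            break
        candidate = list(current)
        unused = [s for s in pool if s not in candidate]
        move = rng.choice(["add", "remove", "swap"])
        if move == "add" and unused:
            candidate.append(rng.choice(unused))
        elif move == "remove" and candidate:
            candidate.pop(rng.randrange(len(candidate)))
        elif move == "swap" and candidate and unused:
            candidate[rng.randrange(len(candidate))] = rng.choice(unused)
        else:
            continue  # this move is not possible right now
        # Accept the neighbour only if it is at least as good.
        if cost(candidate, target) <= cost(current, target):
            current = candidate
    return current


if __name__ == "__main__":
    durations = [95, 180, 200, 240, 300, 150, 210, 600, 260, 190]
    songs = [{"title": f"track {i}", "duration": d} for i, d in enumerate(durations)]
    playlist = local_search(songs, target=900)
    print([s["title"] for s in playlist], sum(s["duration"] for s in playlist))
```

Each step changes the playlist slightly and keeps the change if the total length gets no further from the target, which is the core of any local search. Richer constraints would just add more terms to the cost function.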

I’m not going to talk much more about it here. If it sounds interesting, read the documentation which I’ve recently been working on, clone the source code, and ask me if there are any questions. I’m keen to hear what ideas you have.