Tracker 3.0: The Good and the Bad

This is part 4 of a series. Part 1 is here, part 2 is here, and part 3 is here. Come back next week for my thoughts on the future of desktop search.

I thought this was going to be the last blog post about Tracker 3.0, but it got rather long and I decided to turn this appraisal of the project and its design into a post of its own. So, get ready for some praise and some criticism!

To criticise the design and implementation of a software project we first need to understand its requirements. What is it trying to do? I’ve written some goals and non-goals for the Tracker search engine, and proposed them for inclusion in the README. Now I’m going to compare the goals with the reality. I am of course a biased observer, and I welcome you to make your own assessment, but let’s dive in!

Goal 1. Provide real-time searching and browsing of desktop content.

Tracker achieves this with a background indexing service and a fast SQLite database. Queries are fast and SQLite works well.

For apps that want to use search, we provide a flexible API but not a simple API. You can see this in use in Nautilus, and yes that’s 3 screenfuls of code.
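
To give a feel for the API, here’s a minimal sketch of a full-text search against the filesystem index using the libtracker-sparql C API. The D-Bus name is Tracker 3’s miner service; the query is a simplification of what Nautilus actually sends (graphs, ranking and snippets are where the extra screenfuls go), so treat the ontology details as approximate.

```c
/* Minimal full-text search sketch against Tracker Miner FS, using
 * libtracker-sparql 3.x. Error handling is trimmed and the query is a
 * simplification of what Nautilus really sends. */
#include <libtracker-sparql/tracker-sparql.h>

int
main (void)
{
  g_autoptr (GError) error = NULL;

  /* Connect to the read-only D-Bus endpoint of the filesystem indexer. */
  g_autoptr (TrackerSparqlConnection) conn =
      tracker_sparql_connection_bus_new ("org.freedesktop.Tracker3.Miner.Files",
                                         NULL, NULL, &error);
  if (conn == NULL)
    g_error ("Couldn't connect: %s", error->message);

  /* fts:match does the full-text lookup; nie:isStoredAs links the matched
   * content back to the file, whose URL we return. */
  g_autoptr (TrackerSparqlCursor) cursor =
      tracker_sparql_connection_query (conn,
                                       "SELECT ?url WHERE {"
                                       "  ?content fts:match \"holiday\" ;"
                                       "           nie:isStoredAs ?file ."
                                       "  ?file nie:url ?url ."
                                       "}",
                                       NULL, &error);

  while (cursor != NULL && tracker_sparql_cursor_next (cursor, NULL, &error))
    g_print ("%s\n", tracker_sparql_cursor_get_string (cursor, 0, NULL));

  return 0;
}
```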

I have seen complaints about slow queries, which can happen if the database becomes enormous (multiple GBs) or becomes corrupted, but these are caused by things mostly out of our control — configuration mistakes, filesystem issues, or SQLite bugs. Where Tracker generates performance complaints, it’s almost always about the background indexing. More on that later.

Searching via GNOME Shell is not quite as fast as Tracker itself, as it relies on spawning many app processes to return results. If more apps start storing data using Tracker SPARQL then we may be able to optimise this in future, by having one process that queries all available databases.

Goal 2. Provide searchable data storage for desktop apps.

This was always a goal, but pre-3.0 it was never widely adopted. I wrote about the drawbacks of the old ‘central data store’ model in the first post of the series. A big blocker for apps was that a central data store means a central database schema. It discourages innovation if adding a feature to your app requires first designing it, then negotiating a schema change with the Ontology Overlords, and then trying it out to see how it works.

Tracker 3.0 brings independent app databases, so our story has improved a lot, but would you store app data with Tracker SPARQL? Why not use SQLite directly?

It’s true that SQLite is a great storage solution which you should definitely use if it suits you. Here’s what Tracker SPARQL provides that SQLite does not:

  • Implicitly handled database migrations. These are good for simple cases but come with some restrictions; in particular, any migration that might cause data loss is prohibited.
  • The ability to publish app data as a D-Bus endpoint. This is a speculative benefit for the moment, but may allow richer search results and more efficient Shell searches in future.
  • Similar read/write performance to raw SQLite. The SPARQL translation overhead isn’t huge and can often be avoided altogether with prepared queries (see the sketch after this list).
  • Noticeably higher disk space usage compared to SQLite. See below for more about why this happens.
  • A GObject-based API that is simpler than SQLite’s own.
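
To make that concrete, here’s a rough sketch of an app-private database using the stock Nepomuk ontology, including the prepared-statement pattern mentioned above. The store path, resource IRI and data are invented for illustration.

```c
/* Sketch of an app-private Tracker SPARQL database with a prepared
 * statement. Uses the built-in Nepomuk ontology; all data is invented. */
#include <libtracker-sparql/tracker-sparql.h>

static void
store_and_find_note (GFile *store_dir)
{
  g_autoptr (GError) error = NULL;
  g_autoptr (GFile) ontology = tracker_sparql_get_ontology_nepomuk ();

  /* Opening an existing store against a newer ontology triggers the
   * implicit, loss-free migration described above. */
  g_autoptr (TrackerSparqlConnection) conn =
      tracker_sparql_connection_new (TRACKER_SPARQL_CONNECTION_FLAGS_NONE,
                                     store_dir, ontology, NULL, &error);

  tracker_sparql_connection_update (conn,
                                    "INSERT DATA {"
                                    "  <urn:example:note1> a nie:InformationElement ;"
                                    "    nie:title 'Shopping list' ."
                                    "}",
                                    NULL, &error);

  /* A prepared statement is translated to SQL once; ~title is Tracker's
   * syntax for a bindable parameter. */
  g_autoptr (TrackerSparqlStatement) stmt =
      tracker_sparql_connection_query_statement (conn,
                                                 "SELECT ?note WHERE { ?note nie:title ~title }",
                                                 NULL, &error);
  tracker_sparql_statement_bind_string (stmt, "title", "Shopping list");

  g_autoptr (TrackerSparqlCursor) cursor =
      tracker_sparql_statement_execute (stmt, NULL, &error);
  while (cursor != NULL && tracker_sparql_cursor_next (cursor, NULL, &error))
    g_print ("Found: %s\n", tracker_sparql_cursor_get_string (cursor, 0, NULL));
}
```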

Goal 3. Allow full-text search within common document types.

To do full-text search we need to load data from different file formats. The tracker-extract-3 miner is responsible for this, and you can see the supported formats in the code. It’s rare that I see bugs about this part of the code aside from obscure crashes; in fact, recently we’ve removed more formats than we’ve added (goodbye DVI, MIDI, and source code formats).

Tracker’s full-text search capability is built on the SQLite FTS5 module. We currently don’t expose the full power of the FTS5 query language, supporting only basic keyword matches. And FTS5 is not particularly powerful compared to larger-scale engines like Lucene. However, we are in line with the state of the art in other desktop search engines.
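
Tracker hides FTS5 behind the fts:match property, but it’s easy to see what the module itself does with plain SQLite, independent of Tracker. A self-contained sketch:

```c
/* Standalone FTS5 demo against plain SQLite; nothing Tracker-specific.
 * Build with: cc demo.c -lsqlite3 */
#include <sqlite3.h>
#include <stdio.h>

int
main (void)
{
  sqlite3 *db;
  sqlite3_stmt *stmt = NULL;

  sqlite3_open (":memory:", &db);

  /* An FTS5 virtual table indexes the given columns for keyword search. */
  sqlite3_exec (db,
                "CREATE VIRTUAL TABLE docs USING fts5(title, body);"
                "INSERT INTO docs VALUES ('notes', 'remember the holiday photos');"
                "INSERT INTO docs VALUES ('todo', 'buy milk');",
                NULL, NULL, NULL);

  /* MATCH accepts a richer query language (AND/OR/NEAR/prefix*) than the
   * basic keyword matching Tracker currently exposes. */
  sqlite3_prepare_v2 (db, "SELECT title FROM docs WHERE docs MATCH 'holiday'",
                      -1, &stmt, NULL);
  while (sqlite3_step (stmt) == SQLITE_ROW)
    printf ("%s\n", (const char *) sqlite3_column_text (stmt, 0));

  sqlite3_finalize (stmt);
  sqlite3_close (db);
  return 0;
}
```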

Goal 4. Allow advanced queries using a standard query language.

It’s fundamental to the current design of Tracker that we use SPARQL to query and update the database. The nice thing about that is that it’s a well-maintained standard, reasonably intuitive, and with tutorials available online. It even supports Federated Queries out-of-the-box, something neither SQL nor GraphQL manages.
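
For illustration, here are two flavours of federation expressed as query strings: the standard SPARQL 1.1 SERVICE clause against a remote HTTP endpoint, and Tracker’s dbus: variant of the same clause for reaching another local database. The service name in the second query is invented.

```c
/* Standard SPARQL 1.1 federation: part of the query is answered by a
 * remote endpoint. */
const char *remote_federation =
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
    "SELECT ?label WHERE {"
    "  SERVICE <https://query.wikidata.org/sparql> {"
    "    ?item rdfs:label ?label ."
    "  }"
    "} LIMIT 10";

/* Tracker's local flavour: SERVICE with a dbus: URI queries another
 * app's database over D-Bus. The service name here is made up. */
const char *local_federation =
    "SELECT ?title WHERE {"
    "  SERVICE <dbus:org.example.Notes> {"
    "    ?note nie:title ?title ."
    "  }"
    "}";
```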

However, SPARQL is not the easiest language to work with — there are many ways to typo a query so that it produces no results without telling you why. It’s also verbose. When porting GNOME Photos I ended up creating a templating system and some page-long queries.

The Tracker Miners don’t have special access to the database — they use SPARQL like everyone else, and we could in principle replace the current database engine with one of the many other SPARQL databases in existence, if there were any suitable alternatives, which there aren’t. SPARQL is quite a high barrier for a database engine to support, and those that do support it are either too heavy for desktop/embedded use, like Jena and Virtuoso, or unfinished or abandoned, like 4store and Oxigraph. In the medium term we are likely to stick with our SQLite + SPARQL translation layer approach.

Removing the “advanced query language” requirement would allow switching to a different query API, but that would require rewriting Tracker Miner FS and all the apps that use Tracker, and it wouldn’t open up many more options. Xapian is a possibility, but its query engine is much more limited than our current one, and it has its own issues. The state of the art in open-source web search is the Lucene search engine, but it’s written in Java and, for better or worse, there would be controversy if GNOME came to depend on the Java runtime. A C++ port of Lucene exists, but is sufficiently abandoned as to still be hosted on SourceForge.net.

Goal 5. Be secure and private by default.

Despite the name, Tracker’s privacy credentials are strong. Your data never leaves your machine. Tracker’s index is as private as the rest of your home directory.

Tracker Miner FS scans files in the background, including inside the Downloads folder, which means a bug in tracker-extract or tracker-miner-fs could be exploited by an attacker who tricks you into downloading a file that triggers it. The risk is highest in tracker-extract, which actually opens the files and parses them. To mitigate this, tracker-extract is sandboxed using SECCOMP, preventing it from making network connections or reading files other than the one it’s processing. SECCOMP isn’t an ideal solution and causes occasional breakages, but it’s an important safeguard.
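
To show the shape of the technique (this is not Tracker’s actual filter, which is far more extensive), here is a minimal libseccomp sketch that makes network socket creation fail:

```c
/* Minimal seccomp sketch with libseccomp: allow everything, but make
 * socket() fail with EPERM so no network connections can be opened.
 * Tracker's real filter is much stricter than this. */
#include <seccomp.h>
#include <errno.h>

static int
install_no_network_filter (void)
{
  scmp_filter_ctx ctx = seccomp_init (SCMP_ACT_ALLOW);
  int ret = -1;

  if (ctx == NULL)
    return -1;

  if (seccomp_rule_add (ctx, SCMP_ACT_ERRNO (EPERM), SCMP_SYS (socket), 0) == 0)
    ret = seccomp_load (ctx);

  seccomp_release (ctx);
  return ret;
}
```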

Of course there is also content-based sandboxing provided for Flatpak apps which I wrote about in Tracker 3.0: What’s New?

Goal 6. Be efficient enough for desktop and mobile use.

Tracker Miner FS does well in profiling and is being optimised even more. You will find many reports online about “Tracker high CPU usage”, however. It’s a common enough question that we mention it in the FAQ. So is Tracker slow or not?

In almost all cases, reports of high CPU usage are due to internal bugs, rather than problems in the design. And these bugs can be terrible! It’s disappointing to hear about Tracker Miners causing people’s laptop fans to turn on for hours. Such issues are tricky to reproduce and have varied causes — bugs in the complex miner-fs/extractor code, bugs in dependencies, or misguided attempts to index huge dumps of text like source code. I’ve yet to see one that indicates a fundamental design flaw, so for now the solution is relying on users who experience these problems to take the time to help diagnose, reproduce and fix them.

As for disk space, I am not happy about the amount of space the index can consume, but it’s part of a speed/size tradeoff we made which is currently balanced towards speed. An empty Tracker SPARQL database can start at 3.2MB of disk space. We store the database schema in the database as RDF data, which is where most of this initial overhead comes from. We also use a data storage structure that’s optimised for RDF querying and serialisation, and it’s not the most efficient way to store data in SQLite.
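
If you’re curious what your own index weighs, SQLite can report it through pragmas. A quick sketch; the suggested path is where the Tracker 3 file index lives on my system and may differ on yours:

```c
/* Report a SQLite database's size from its pragmas. Pass the database
 * path as an argument, e.g. ~/.cache/tracker3/files/meta.db (the default
 * location on my system; yours may differ). */
#include <sqlite3.h>
#include <stdio.h>

static sqlite3_int64
query_int (sqlite3 *db, const char *sql)
{
  sqlite3_stmt *stmt = NULL;
  sqlite3_int64 value = 0;

  if (sqlite3_prepare_v2 (db, sql, -1, &stmt, NULL) == SQLITE_OK &&
      sqlite3_step (stmt) == SQLITE_ROW)
    value = sqlite3_column_int64 (stmt, 0);
  sqlite3_finalize (stmt);
  return value;
}

int
main (int argc, char **argv)
{
  sqlite3 *db;

  if (argc < 2 ||
      sqlite3_open_v2 (argv[1], &db, SQLITE_OPEN_READONLY, NULL) != SQLITE_OK)
    return 1;

  /* Total size on disk = number of pages x bytes per page. */
  sqlite3_int64 bytes = query_int (db, "PRAGMA page_count") *
                        query_int (db, "PRAGMA page_size");
  printf ("%lld bytes\n", (long long) bytes);

  sqlite3_close (db);
  return 0;
}
```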

Goal 7. Be maintainable by a small team.

This is the last item on the list but it should be the first, because an unmaintainable project will inevitably meet a sad end.

The bus factor of Tracker is currently low: few people know the codebase well enough to maintain it.

There are around 30,000 lines of code in libtracker-sparql, mostly devoted to translating SPARQL to SQL. Carlos has focused a lot on rewriting this and it’s impressively clean — see the SPARQL parser in particular — but debugging it can be tricky, as it generates very long, hard-to-read SQL queries. It has fairly good test coverage, but more tests would be great.

The Tracker Miner FS code is also complex with about 15,000 lines of code, and it is written with a combination of event callbacks, signals and virtual functions that I like to call “object-oriented spaghetti code”. I tried to graph the flow of execution once and gave up, so instead I spend an hour or more re-discovering how it works every time I do some work there. Deferring to the main loop as much as possible was important in the 2000s when single-core CPUs were still common and a long-running task in a daemon process could cause freezes in the graphical desktop. Now it just overcomplicates the code.

The multiple layers of inheritance are due to an unrealised goal of creating different kinds of content crawlers on top of the existing miner-fs code. In theory a TrackerMinerFS could crawl anything that can be represented as a GFile. In practice this has never been done. GNOME Online Miners do their own thing, and indeed online resources often don’t have a filesystem-like structure. In Tracker 3.0 we made libtracker-miner private, which allows us to refactor it without worrying about API stability, and makes it less likely that anyone will reuse it. Perhaps it’s time to simplify this part of the code.

I’ve tried to focus the majority of my energy on making the project more maintainable, something I wrote about already. In the Bugzilla days it was painful to review and test merge requests, but now that we have GitLab and automated testing, most merge requests are simple and fun to review. What about testing?

[Test status badges for Tracker and Tracker Miners]

The situation is OK but there is a long way to go.

Is Tracker the right answer?

A healthy project needs a stream of new contributors. I don’t currently see this happening.

One issue is that Tracker is not widely used outside GNOME. I try to maintain a list of apps and platforms using Tracker which illustrates this. (Please tell me if there’s something missing from the list!). That said, you can never be sure if more widespread use would help spread the work of maintenance, or simply bring more bug reports.

I write these blog posts in the hope that telling the reality of search in GNOME will bring some more understanding and involvement. On screen, the search interface is a small box, which makes it seem like the backend might be small and simple too — trust me that it isn’t.

It’s my opinion that Tracker remains the most suitable option for powering search in GNOME and further afield. (I’d be happy to be proved wrong, if the result is a better desktop search experience.) The main issue is fixing the last of the bugs that cause high resource usage. Look out soon for a testing initiative, in which we’ll try out recent speed improvements and try to catch and reproduce any remaining performance issues.

Come back next week for the final post, my thoughts on the next ten years of desktop search.
