This post took a lot of research. I wanted to properly answer the question: “Why does Tracker use SPARQL and RDF?”
Search, ’90s style
Gradually, the automated indexers got better. Google’s PageRank algorithm was a breakthrough, using a statistical approach to determine ‘high quality’ websites and rank those first.
The holy scriptures of the Web already proposed a different solution: not only should documents form webs, but the data inside them should too. Instead of a statistical model of ‘relevance’, the information would be present as structured data which you could search and query with a semantic search engine. In 1998 (the same year Google was founded) this idea was formalised as the Semantic Web.
At this point, perhaps you are curious what ‘semantic search’ means, or perhaps your eyes glaze over at the sight of the word ‘semantic’, or perhaps memories of the 2000s have already caused your fists to clench in fury.
The article “Why Machine Learning Needs Semantics Not Just Statistics” gives a good introduction: the word ‘semantic’ is usually used to highlight that, for information retrieval tasks, our current statistical approaches are primitive.
In essence, they are akin to a human shown patterns in a pile of numbers and asked to flag future occurrences without any understanding of what those numbers represent or what the decision involves.
This is one of the reasons that current deep learning systems have been so brittle and easy to fool despite their uncanny power. They search for correlations in data, rather than meaning.
You may be thinking: I know that you can implement today’s machine learning using matchboxes, but that link only tells me what semantic search isn’t, it doesn’t tell me what it is. If so, you’re on the money. In the years following its inception, a frequent criticism of the Semantic Web was that it was under-specified and too “utopian”.
Teaching machines to understand meaning
Does that sound utopian? Well, maybe. A lot of digital ink has been spilled on this topic over the last 20 years in often heated debates. However, there are some level-headed voices.
The idea’s proponents do not escape culpability for these utopian perceptions… Instead of the “let’s just build something that works” attitude that made the Web (and the Internet) such a roaring success … they’ve convinced people interested in these ideas that the first thing we need to do is write standards.
Certainly, the most visible output of the Semantic Web effort has been various standards. Some early efforts are laughable: you would win an ‘obfuscated data format’ contest with JSON Triples, and all the data formats that use XML (a markup language) as a syntax are questionable at best, or, to quote Swartz again, “scourges on the planet, offenses against hardworking programmers”.
Here is some fan-mail that RDF has received over the years.
- “It would be nice as a universal publishing format… far preferable to XML” (Aaron Swartz),
- “A deceptively simple data model [which] trivializes merging of data from multiple sources” (Ian Davis)
- “RDF is a shitty data model. It doesn’t have native support for lists. LISTS for fuck’s sake!” (Manu Sporny, creator of JSON-LD)
- “Someone should describe RDF in 500 words or less as a generalization of INI. That note would spread understanding of RDF, which is simple but often described so abstractly that it seems complicated.” (Mark Evans, Lambda the Ultimate)
I think RDF is a reasonable data model which maps closely to the more intuitive document/key/value model (until you want to make a list, anyway). A more important criticism is whether a data model is what we really needed.
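To make that concrete, here’s a minimal sketch of the RDF data model in plain Python, with no real RDF library involved: a graph is just a set of (subject, predicate, object) triples, so merging data from two sources really is trivial. It also shows the list problem: RDF has no native ordered list, only chains of `rdf:first`/`rdf:rest` cons cells. The names and values are invented for illustration.

```python
# A hypothetical, minimal model of RDF: a graph is a set of
# (subject, predicate, object) triples. Real RDF uses IRIs and
# typed literals; plain strings stand in for them here.

graph_a = {
    ("ex:tracker", "ex:startedIn", "2005"),
    ("ex:tracker", "ex:writtenIn", "C"),
}
graph_b = {
    ("ex:tracker", "ex:queryLanguage", "SPARQL"),
    ("ex:nepomuk", "ex:fundedBy", "EU"),
}

# Merging data from multiple sources is literally set union:
merged = graph_a | graph_b

# Querying is matching triple patterns (None = wildcard):
def match(graph, s=None, p=None, o=None):
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# But an ordered list has no native representation. RDF encodes
# ["a", "b"] as a chain of cons cells, one blank node per element:
rdf_list = {
    ("_:l1", "rdf:first", "a"),
    ("_:l1", "rdf:rest", "_:l2"),
    ("_:l2", "rdf:first", "b"),
    ("_:l2", "rdf:rest", "rdf:nil"),
}
```

Everything you’d want to say about a resource decomposes into these three-part statements, which is why two graphs can always be merged without any schema negotiation, and also why something as mundane as a list turns into four triples and two blank nodes.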
Clay Shirky discussed this in a scathing criticism of the Semantic Web from 2003:
Since it’s hard to make machines think about the world, the new goal is to describe the world in ways that are easy for machines to think about… The Semantic Web takes for granted that many important aspects of the world can be specified in an unambiguous and universally agreed-on fashion, then spends a great deal of time talking about the ideal XML formats for those descriptions.
For a detailed history of the Semantic Web, I recommend this Two Bit History article. Meanwhile, we need to go back to the desktop world.
From desktop search…
The 2000s were also a busy time for GNOME and KDE. During the ’90s desktop search was an afterthought, but in the new millennium, perhaps driven by advances on the web, lots of research took place.
Microsoft introduced WinFS (described by Gates as his biggest disappointment), Apple released Spotlight, even Google briefly weighed in, and the open source world responded as we always do: with several incompatible projects all trying to do the same thing.
Here’s a release timeline of some free desktop search projects:
I was still in school when Eazel created Nautilus and went bust shortly after. They created Medusa to provide full-text search for Nautilus, but without funding the project didn’t get past a 0.5 release. The Xapian library also formed around this time from a much older project. Both aimed to provide background indexing and full-text search, as did the later Beagle.
Tracker began in late 2005, introduced by Jamie McCracken and focusing on a “non-bloated implementation, high performance and low memory usage”. This was mostly a response to Beagle’s dependency on the Mono C# runtime. Tracker 0.1 used MySQL or SQLite, but instead of exposing the SQL engine directly it would translate queries from RDF Query, an XML format which predates SPARQL and is not something you want to type out by hand.
…to the Semantic Desktop
In 2006 the NEPOMUK project began. The goal was not a search engine but “a freely available open-source framework for social semantic desktops”, or, even less simply, a “Networked environment for personal ontology-based management of unified knowledge”. The project had €17 million of funding, much of it from the EU. The Semantic Web mindset had reached the free desktop world.
I don’t know where all the money went! But one output was NEPOMUK-KDE, which aimed to consolidate all your data in a single database to enable new ways of browsing and searching. The first commit was late 2006. Some core KDE apps adopted it, and some use cases, ideas and prototypes emerged.
Meanwhile, Nokia were busy contracting everyone in the Free Software world to work on Maemo, an OS for phones and tablets which would have marked the start of the smartphone era had a certain fruit-related company not beaten them to it.
Nokia began funding six developers to work on Tracker (rather a rare event for a small open source project), and planned to use it for media indexing, search, and app data storage. There was a hackfest where many search projects were represented, and a standardisation effort called XESAM which produced a query language still in use today by Recoll.
Presentations about Tracker from this era show a now-familiar optimism. There’s a plea to store all app data in Tracker’s database, with implied future benefits: data sharing between apps, tagging, and the vague promise of “mashups”. There are the various diagrams of RDF graphs and descriptions of what SPARQL is. But there’s also an increasing degree of pragmatism.
By 2009 it was clear that there was no easy route to the Semantic Desktop valhalla. As search engines and desktop databases became more widely deployed, more and more complaints about performance started to appear, and as a project destined for low-powered mobile devices, Tracker had to be extra careful in this regard.
Where are they now?
A decade after all this great tech was developed, why aren’t you using a Nokia smartphone today? The so-called Elopocalypse marked the end of ‘semantic desktop’ investment. Twenty years later, the biggest change to search in GNOME came from the GNOME Shell overview design, which uses a simple D-Bus API with no ‘semantic desktop’ tech in sight.
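To show just how simple that D-Bus API is, here is a sketch of the logic behind a GNOME Shell search provider. The real thing exports these methods over D-Bus under the `org.gnome.Shell.SearchProvider2` interface; this sketch keeps only the method logic over an invented in-memory corpus and omits the actual D-Bus registration and the provider `.ini` file.

```python
# Hypothetical corpus: result id -> display title. A real provider
# would query its own application data here.
DOCUMENTS = {
    "doc-1": "Tracker release notes",
    "doc-2": "SPARQL cheat sheet",
    "doc-3": "Holiday photos",
}

class SearchProvider:
    """Sketch of the org.gnome.Shell.SearchProvider2 method logic,
    minus the D-Bus wiring."""

    def GetInitialResultSet(self, terms):
        # Shell calls this with the words the user typed;
        # return the ids of matching results.
        return [rid for rid, title in DOCUMENTS.items()
                if all(t.lower() in title.lower() for t in terms)]

    def GetSubsearchResultSet(self, previous_results, terms):
        # Called as the user keeps typing: narrow the previous set
        # rather than searching from scratch.
        return [rid for rid in previous_results
                if all(t.lower() in DOCUMENTS[rid].lower() for t in terms)]

    def GetResultMetas(self, ids):
        # Shell asks for display metadata only for the results
        # it is about to show in the overview.
        return [{"id": rid, "name": DOCUMENTS[rid]} for rid in ids]
```

No graphs, no ontologies: the Shell just asks each app for a list of result ids matching some search terms, then asks for names and icons for the handful it will display.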
Tracker is still here, powering full-text search behind the scenes for many apps, and its longevity is a testament to Nokia’s decision to work fully upstream and share their improvements with everyone. Writing a filesystem indexer is hard and we’re lucky to build on the many years of investment from them. Credit also lies with volunteer maintainers who kept it going since Nokia gave up, particularly Martyn Russell and Carlos Garnacho, and everyone who has contributed to fixing, testing, translating and packaging it.
The Nepomuk data model is still used in Tracker. There was an attempt to form a community to maintain it after the funding ended, but their official home hasn’t seen an update in years and so Tracker keeps its own copy with our local modifications.
NEPOMUK-KDE did not make it to 2020. An LWN commentator summarized the issue:
Nepomuk was 1 big, powerful, triplet-capable database that was meant to hold everything … it got too big and would corrupt sometimes and was slow and unstable…
So when the funding ran out, different ppl worked on it for a long time, trying to make it perform better. They got, well, somewhere, pretty much made the pig fly, but the tech was inherently too powerful to be efficient at the ‘simple’ use case it had to do most of the time: file name and full text search.
Tracker has had its share of performance issues, of course, but the early focus on mobile meant that these were mostly due to coding errors, rather than a fundamentally unsuitable design built around an enterprise-scale database. In 2014 KDE announced the replacement Baloo, a Xapian-based search engine that provides full-text search and little else.
I’m reminded of air travel, where planes are slower than they were sixty years ago.
So why does Tracker use RDF and SPARQL, when you can provide full text search without it?
It’s partly for “historical reasons” — it seemed a good idea at the time, and it’s still a good enough idea that there’s no point creating some new and non-standard interface from scratch. SPARQL is a good standard which suits its purpose and is in wide use in government and the sciences.
RDF is still widely used too. The idea of providing structured data in websites caught on where there’s a business case for it. This mostly means adding schema.org markup so your content appears in Google, and Open Graph tags so it displays nicely in Facebook. There are also some big open data repositories published as RDF.
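To illustrate what that markup looks like in practice, here is a sketch of a minimal schema.org description of a blog post in JSON-LD — the kind of structured data a crawler reads out of a page’s `<script type="application/ld+json">` tag. The date and author are invented values for the example.

```python
import json

# A minimal, hypothetical schema.org description of a blog post.
# Embedded in an HTML page inside a <script type="application/ld+json">
# tag, this is the structured data a search engine crawler consumes.
article = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "Why does Tracker use SPARQL and RDF?",
    "datePublished": "2020-01-01",  # invented date
    "author": {"@type": "Person", "name": "A. Blogger"},  # invented name
}

markup = json.dumps(article, indent=2)
print(markup)
```

Note that JSON-LD is itself an RDF serialization: the `@context` maps these plain JSON keys onto schema.org terms, so the snippet above is a small RDF graph wearing developer-friendly clothes.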
Whatever your perspective, it’s pretty clear that machines still don’t understand meaning, and the majority of data on the web is not open or structured. But that doesn’t mean there’s nothing we can do to improve the desktop search experience!
Come back next week for the final part of this series, my thoughts on the next ten years of Tracker and desktop search in general.