Status update, 15/09/2023

Musically this has been a fun month. One of my favourite things about living in Galicia is that ska-punk never went out of fashion here and you can legitimately go to a festival by the sea and watch Ska-P. Unexpectedly brilliant and chaotic live show. I saw an interview recently where Angelo Moore of Fishbone was asked by a puppet what his favourite music is, and he answered: “I like … I like the Looney Tunes music”. Same energy.

I wrote already this month about my DIY media server and the openQA CLI tool. This post contains some brief thoughts about Nushell and then some lengthy thoughts about the future of the Web. Enjoy!

Nushell everywhere

I read a blog post by serial shell innovator JT entitled “The case for Nushell”. I’ve been using Nushell for data-focused work for a while and the post inspired me to make it my default shell in a few places.

Nushell is really comfortable to use these days. It’s very addictive the first time you construct a one-liner to pretty-print some JSON or XML, select the fields you want and output a table as Markdown that you can paste straight into a GitLab issue. My only complaint is the autocomplete isn’t quite as good as the Fish shell yet. (And that you can’t type rm -R… like chown and chmod only accept -R, and now rm only accepts a lower case -r, how am I supposed to remember that guys???)

I have a load of arcane Bash knowledge that I guess I’ll have to hang onto for a while yet, particularly as my job mostly involves SSH’ing into strange old machines from the 1990s. Perhaps I can try and contribute Nushell binaries that run on HP-UX and Solaris. (For the avoidance of doubt, that previous sentence is a joke).

Kagi Small Web

There’s a new search engine on the block called Kagi, which is marketed as a “premium search engine”: you pay $0.05 per search, and in return the results are ad-free.

I like this idea. I signed up for the free trial of 100 searches, and I haven’t got far with them.

It turns out most of the web searches I do are things I could search on a specific site if I wasn’t so lazy. For example I search “rust stdio” when I could go to the Rust documentation on my local machine and search there. Or I search for a programming problem when I could clearly just search Stack Overflow itself. DuckDuckGo has made me lazy; adding a potential $0.05 cost to each search makes you realize how few you actually need to do. Maybe this is a good thing.

Anyway, Kagi. They just launched something named Kagi Small Web, which is announced here:

Kagi Small Web offers a fresh approach by promoting recently published content from the “small web.” We gather new content, published within the last week, from a handpicked list of blogs and surface it in multiple ways:

  • Directly within Kagi search results for applicable queries (existing Kagi members do not need to do anything, this will be automatic)
  • Via the new Kagi Small Web website
  • Through the Kagi Small Web RSS feed
  • Via our Search API, where results are now part of the news enrichment API

Initially inspired by a vibrant discussion on Hacker News, we began our experiment in late July, highlighting blog posts from HN users within our search results. The positive feedback propelled the initiative forward. Today, our evolving concept boasts a curated list of nearly 6,000 genuine websites featuring people with a wide variety of interests.

When I first saw this my mind jumped to the problematic parts. Who are these guys to suddenly define what the Small Web is, and define it as a club of some 6,000 websites chosen by Hacker News? All sites must be in English, so is the Web only for English speakers now?? More importantly, why is my site not on the list? Why wasn’t I consulted??

There’s also something very inspiring about the project. I try to follow the rule “something is better than nothing”, and this project is a pretty bold step forwards, which inspired a bunch of thoughts about the future of The Web.

Google Search is Dying

Since about 2000, when you think of the Web, you think of Google.

Google Search has been dying a slow, public death for about the last ten years. Google has been too big to innovate since the early 2010s (with one important exception, the Emoji Kitchen).

Google Search remained king until now for two reasons: one, their tech for turning hopelessly vague search queries into useful results was better than anyone’s in the industry, and two, as of 2023, almost nobody else can operate at the scale needed to index all of the text on the Web.

I guess there’s a third reason too, which is spending billions of $$$ to be the default search provider nearly everywhere, to the point that the USA is running an antitrust hearing against them, but let’s focus on the technical aspects.

The current fad for large language models is going to bring big changes to the Web, for better or worse. One of those is that “intent analysis” is suddenly much easier than it was. Note, I’m not talking about prompting an LLM with a question and presenting the randomly generated output as an answer. I’m talking about taking unstructured text, such as “trains to London”, and turning it into an actionable query. A 1990s-era search engine would throw away the “to” and return any website that contained “trains” and “London”. Google Search shows a table of live departure times for trains heading to London. (With some less useful things above and below, since this is the Google Search of 2023).
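
To make the 1990s half of that comparison concrete, here is a toy sketch in Python. It is not how any real search engine worked; the stop-word list, the page URLs and the page text are all made up for illustration. It just drops the filler words and returns every page that contains all the remaining terms.

```python
# Toy keyword search in the 1990s style described above.
# STOP_WORDS and PAGES are invented for this example.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "for"}

PAGES = {
    "example.org/trains": "Model trains for sale in London and Paris",
    "example.org/timetable": "Live departures: trains from Bristol to London",
    "example.org/london": "Things to do in London this weekend",
}


def keywords(query):
    """Lower-case the query and throw away stop words such as 'to'."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


def keyword_search(query, pages):
    """Return the URLs of pages whose text contains every remaining keyword."""
    terms = keywords(query)
    matches = []
    for url, text in pages.items():
        words = text.lower().split()
        if all(term in words for term in terms):
            matches.append(url)
    return matches


print(keywords("trains to London"))
print(keyword_search("trains to London", PAGES))
```

Once the “to” is gone, the model train shop matches just as well as the timetable page. The direction of travel, which is the whole point of the query, has been lost, and recovering it is the “intent analysis” part.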

A small LLM such as Vicuna can kinda just DO this stuff, not perfectly of course, but it’s an order of magnitude easier than a decade ago. Perhaps Google kept their own LLM research internal for so long for fear of losing exactly this edge? The “We have no moat” memo suggests fear.

On to the second thing, indexing all the content on the Web. LLMs don’t make this easier. They make it impossible.

It’s now so easy to generate human-like text on the Web using machines that it doesn’t make sense to index all the text on the Web any more. Much of it is already human-generated garbage aiming to game search ranking algorithms (see “A storefront for robots” for fun examples).

Very soon 99% of text on the web will be machine generated garbage. Welcome to the dark forest.

For a short time I was worried about this, but I think it’s a natural evolution of the Web. This is the end of the Olde World Wide Web. What comes next?

There is more than one Small Web

If you’ve read this far, firstly, thanks and well done. In 2023 it’s hard to read so many paragraphs in one go! I didn’t even put in a single video.

Let me share the insight I had on thinking over Kagi Small Web. Maybe it’s obvious and maybe it isn’t.

A search engine of 6,000 websites is small-scale enough that one person could conceivably run it.

Let’s go back a step. How do you deal with a Web that’s 99% machine-generated noise? I imagine Google will try to solve this by using language models to detect if the page was generated by a language model, triggering another fairly pointless technological arms race against the spammers who will be generating this stuff. This won’t work very well.

The only way for humans to make sense of the new Dark Forest Web is to have lists of websites, curated by humans, and to search through those when we want to find information.

If you’re old you know that this isn’t a new idea. In fact, we kinda had all of this stuff in the form of web rings, link pages on other people’s websites, bookmarking services, link sites like Digg, Fark and Reddit, RSS feeds and feed readers. If you look at the Kagi Small Web reader site, it’s literally a web ring. It’s StumbleUpon. It’s Planet GNOME. But through the lens of 2023, it’s also something new.

So I’m not going to start reading Kagi’s small web, though it may be great. And I’m going to stop capitalising “small web”, because I think we’re going to curate millions of these collectively, in thousands of languages, in infinite online communities. We’re going to have open source tools for searching and archiving high quality online content. Who knows? Perhaps in 10 years we’ll have small web search tools integrated into GNOME.

Further Reading

This year, 2023, is the 25th Year of our Google, and The Verge are publishing a series of excellent articles looking forwards and backwards. I can recommend reading “The end of the Googleverse” as a starting point. Another great one: “Google and YouTube are trying to have it both ways with AI and copyright”.
