Screen reading, part 2: looking at Coqui TTS

I did some research into open-source text-to-speech solutions recently. As part of that I spent a while trying out Coqui TTS and, while trying to get it to sound nice, I learned a few things about voice synthesis.

Coqui TTS provides some speech models which are pre-trained on specific datasets. For text-to-speech, each dataset contains text snippets paired with audio recordings of humans reading the text.

Here is a summary of some of the available open datasets for English:

  • ljspeech: audio recordings taken from LibriVox audio books. One (female) speaker, reading text from 19th and 20th century non-fiction books, with 13K audio clips.
  • VCTK: 110 different speakers, ~400 sentences per speaker
  • Sam: a female voice actor (Mel Murphy) reading a script which was then post-processed to sound non-binary. There’s an interesting presentation about gender of voice assistants accompanying this dataset. ~8 hrs of audio.
  • Blizzard2013: A voice actor (Catherine Byers) reading 19th & 20th century audio books, focusing on emotion. Released as part of Blizzard 2013 challenge. ~200 hrs of audio.
  • Jenny: a single voice actor (again, female) reading various 21st century texts. ~30 hrs of audio.

There’s a clear bias towards female voices here, and a quick search for “gender of voice assistant” will turn up some interesting writing on the topic.

There are then different models that can be trained on a dataset. Most models have two stages, an encoder and a vocoder: the encoder generates a mel spectrogram, which the vocoder then uses to generate audio samples. Common models available in Coqui TTS are:

  • GlowTTS: a “flow-based” encoder model
  • Tacotron: a “DL-based” encoder model. Requires some trick to improve “attention”, the recommended one being DDC (“double decoder consistency”).
  • VITS: an end-to-end model, combining the GlowTTS encoder and HiFiGAN vocoder.
  • Tortoise: a GPT-like end-to-end model, using the Univnet vocoder. Claimed to be very slow compared to VITS.
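To make the encoder/vocoder split concrete, here’s roughly how you’d drive these with the `tts` command-line tool that Coqui TTS installs. The model and vocoder names below follow Coqui’s naming scheme, but the exact set available can vary between releases, so check `--list_models` first:

```shell
# List every pre-trained model and vocoder Coqui TTS knows about
tts --list_models

# Two-stage synthesis: an explicit encoder + vocoder pairing
# (GlowTTS encoder trained on ljspeech, plus a HiFiGAN vocoder)
tts --text "Hello from Coqui TTS" \
    --model_name "tts_models/en/ljspeech/glow-tts" \
    --vocoder_name "vocoder_models/en/ljspeech/hifigan_v2" \
    --out_path glow-hifigan.wav

# End-to-end models like VITS don't need a separate vocoder
tts --text "Hello from Coqui TTS" \
    --model_name "tts_models/en/ljspeech/vits" \
    --out_path vits.wav
```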

I tried some of the models; here are my notes. The test case was a paragraph from LWN, which is the sort of thing I want this to read aloud.

model                                 | size  | time to render | notes
en/sam/tacotron-DDC                   | 324MB | 48.5s          | glitchy in places, listenable
en/blizzard2013/capacitron-t2-c150_v2 | 348MB | 13.6s          | good quality speech, but sounds like a patchwork quilt of different speakers and environments
en/jenny/jenny                        | 1.7GB | 207.3s         | very high quality, slow to render
en/ljspeech/vits--neon                | 139MB | 47.7s          | uneven but intelligible
en/multi-dataset/tortoise-v2          | 4.3GB | –              | didn’t work: error in load_config

(Note that “Jenny” is in fact a VITS model.)

VITS seems to be the current state-of-the-art in publicly available models. This model is also what the Piper TTS engine uses.

The final installment, if I get to it, will be a quick guide on getting Piper TTS to work with Speech Dispatcher. See you then!

State of screen reading on desktop Linux

Reading a computer screen wears out your delicate eye-balls. I would like the computer to read some web-pages aloud for me so I can use my ears instead.

Here’s what I found out recently about the available text-to-speech technology we have on desktop Linux today. (This is not a comprehensive survey, just the result of some basic web searches on the topic).

The Read Aloud browser extension

Read Aloud is a browser extension that can read web pages out for you. That seems a nice way to take a break from screen-staring.

I tried this in Firefox and it worked, but it sounded like a robot made from garbage. It wasn’t pleasant to listen to articles like that.

Read Aloud supports some for-pay cloud services that probably sound better, but I want TTS running on my laptop, not on Amazon or Google’s servers.

Speech Dispatcher

The central component for text-to-speech on Linux is Speech Dispatcher. Firefox uses Speech Dispatcher to implement the TTS part of the Web Speech API. This is what the Read Aloud extension is then using to read webpages.

You can test Speech Dispatcher on your system using the spd-say tool, e.g.

spd-say "Let's see how text-to-speech works"

You might hear the old-skool espeak-ng voice robotically reading out the text. espeak was incredible technology when it was released in 1995 on RISC OS as a 7KB text-to-speech engine. It sounds a little outdated in 2023.
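Speech Dispatcher routes text through pluggable output modules, and spd-say can show you which ones your system has. A couple of options worth knowing (flag names as documented in the spd-say man page; the espeak-ng module is only an example of what might be installed):

```shell
# List the output modules (synthesizers) Speech Dispatcher can use
spd-say -O

# Speak through a specific module, at a slightly slower rate (-100..100)
spd-say -o espeak-ng -r -20 "Testing a specific output module"
```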

Coqui TTS

Mozilla did some significant open research into text-to-speech as part of the “Mozilla TTS” project. After making great progress they stopped development (you may have heard this story before), and the main developers set up Coqui AI to continue working on the project. Today this is available for you as Coqui TTS.

You can try it out fairly easily via a Docker image, the instructions are in the README file. I spent some time playing with Coqui TTS and learned a lot about modern speech synthesis, which I will write up separately.
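For reference, trying it via Docker looks roughly like this. The image name and server script are taken from my reading of the Coqui TTS README, so double-check them there before copying:

```shell
# Pull the CPU image and get a shell inside the container
docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu

# Inside the container: start the demo server with a chosen model,
# then open http://localhost:5002 in a browser to type text and hear it
python3 TTS/server/server.py --model_name tts_models/en/ljspeech/vits
```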

The resource consumption of Coqui TTS is fairly high, at least for the higher quality models. We’re talking GBs of disk space, and minutes to generate audio.

It’s possible that GPU acceleration would help, but I can’t use that on my laptop as it requires a proprietary API that only works on a certain brand of GPU. It’s also likely that exporting the models from PyTorch, using TorchScript or ONNX, would make them a lot more lightweight. This is on the roadmap.

Piper

Thanks to an issue comment I then discovered Piper. This rather amazing project does TTS at a similar quality to Coqui TTS, but it can additionally export the models in ONNX format and use onnxruntime to execute them, which makes them lightweight enough to run on single-board computers like the Raspberry Pi (remember those?).
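Piper itself is pleasantly simple to drive: it reads text on stdin and writes a WAV file. A minimal invocation looks like this (the voice file name is just an example; voices are downloaded separately as .onnx files):

```shell
# Synthesize a sentence with a downloaded Piper voice model
echo "Piper runs comfortably on small machines." | \
  piper --model en_US-lessac-medium.onnx --output_file piper-test.wav
```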

It’s part of a project I wasn’t aware of called Home Assistant, which is building an open-source home assistant and is driven by a company called Nabu Casa. Something to keep an eye on.


Thanks to Piper I can declare success on this mini-project to get some basic screen reading functionality on my desktop. When I get time I will write up how I’ve integrated Piper with Speech Dispatcher – it was a little tricky. And I will write up the short research I did into the different Coqui TTS models that are available. Speak soon!