Screen reading, part 2: looking at Coqui TTS

I did some research into open-source text-to-speech solutions recently. As part of that I spent a while trying out Coqui TTS and, while trying to get it to sound nice, I learned a few things about voice synthesis.

Coqui TTS provides some speech models which are pre-trained on specific datasets. For text-to-speech, each dataset contains text snippets paired with audio recordings of humans reading the text.

Here is a summary of some of the available open datasets for English:

  • ljspeech: audio recordings taken from LibriVox audio books. One (female) speaker, reading text from 19th and 20th century non-fiction books, with 13K audio clips.
  • VCTK: 110 different speakers, ~400 sentences per speaker
  • Sam: a female voice actor (Mel Murphy) reading a script which was then post-processed to sound non-binary. There’s an interesting presentation about gender of voice assistants accompanying this dataset. ~8 hrs of audio.
  • Blizzard2013: A voice actor (Catherine Byers) reading 19th & 20th century audio books, focusing on emotion. Released as part of Blizzard 2013 challenge. ~200 hrs of audio.
  • Jenny: a single voice actor (again, female) reading various 21st century texts. ~30 hrs of audio.

There’s a clear bias towards female voices here, and a quick search for “gender of voice assistant” will turn up some interesting writing on the topic.

There are then different models that can be trained on a dataset. Most models have two stages, an encoder and a vocoder: the encoder generates a mel spectrogram, which the vocoder then uses to generate audio samples. Common models available in Coqui TTS are:

  • GlowTTS: a “flow-based” encoder model
  • Tacotron: a “DL-based” encoder model. Requires some trick to improve “attention”, the recommended one being DDC (“double decoder consistency”).
  • VITS: an end-to-end model, combining the GlowTTS encoder and HiFiGAN vocoder.
  • Tortoise: a GPT-like end-to-end model, using the Univnet vocoder. Claimed to be very slow compared to VITS.
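To make the encoder/vocoder split concrete, here’s roughly how you’d drive these with the `tts` command-line tool that Coqui TTS installs. The model and vocoder names below follow Coqui’s naming scheme, but the exact set available can vary between releases, so check `--list_models` first:

```shell
# List every pre-trained model and vocoder Coqui TTS knows about
tts --list_models

# Two-stage synthesis: an explicit encoder + vocoder pairing
# (GlowTTS encoder trained on ljspeech, plus a HiFiGAN vocoder)
tts --text "Hello from Coqui TTS" \
    --model_name "tts_models/en/ljspeech/glow-tts" \
    --vocoder_name "vocoder_models/en/ljspeech/hifigan_v2" \
    --out_path glow-hifigan.wav

# End-to-end models like VITS don't need a separate vocoder
tts --text "Hello from Coqui TTS" \
    --model_name "tts_models/en/ljspeech/vits" \
    --out_path vits.wav
```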

I tried some of the models; here are my notes. The test case was a paragraph from LWN, which is the sort of thing I want this to read aloud.

model                                 | size  | time to render | notes
en/sam/tacotron-DDC                   | 324MB | 48.5s          | glitchy in places, listenable
en/blizzard2013/capacitron-t2-c150_v2 | 348MB | 13.6s          | good quality speech, but sounds like a patchwork quilt of different speakers and environments
en/jenny/jenny                        | 1.7GB | 207.3s         | very high quality, slow to render
en/ljspeech/vits--neon                | 139MB | 47.7s          | uneven but intelligible
en/multi-dataset/tortoise-v2          | 4.3GB | –              | didn’t work: error in load_config

(Note that “Jenny” is in fact a VITS model.)

VITS seems to be the current state-of-the-art in publicly available models. This model is also what the Piper TTS engine uses.

The final installment, if I get to it, will be a quick guide on getting Piper TTS to work with Speech Dispatcher. See you then!

State of screen reading on desktop Linux

Reading a computer screen wears out your delicate eye-balls. I would like the computer to read some web-pages aloud for me so I can use my ears instead.

Here’s what I found out recently about the available text-to-speech technology we have on desktop Linux today. (This is not a comprehensive survey, just the result of some basic web searches on the topic).

The Read Aloud browser extension

Read Aloud is a browser extension that can read web pages out for you. That seems a nice way to take a break from screen-staring.

I tried this in Firefox and it worked, but it sounded like a robot made from garbage. It wasn’t pleasant to listen to articles like that.

Read Aloud supports some for-pay cloud services that probably sound better, but I want TTS running on my laptop, not on Amazon or Google’s servers.

Speech Dispatcher

The central component for text-to-speech on Linux is Speech Dispatcher. Firefox uses Speech Dispatcher to implement the TTS part of the Web Speech API. This is what the Read Aloud extension is then using to read webpages.

You can test Speech Dispatcher on your system using the spd-say tool, e.g.

spd-say "Let's see how text-to-speech works"

You might hear the old-skool espeak-ng voice robotically reading out the text. espeak was incredible technology when it was released in 1995 on RISC OS as a 7KB text-to-speech engine. It sounds a little outdated in 2023.
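Speech Dispatcher routes text through pluggable output modules, and spd-say can show you which ones your system has. A couple of options worth knowing (flag names as documented in the spd-say man page; the espeak-ng module is only an example of what might be installed):

```shell
# List the output modules (synthesizers) Speech Dispatcher can use
spd-say -O

# Speak through a specific module, at a slightly slower rate (-100..100)
spd-say -o espeak-ng -r -20 "Testing a specific output module"
```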

Coqui TTS

Mozilla did some significant open research into text-to-speech as part of the “Mozilla TTS” project. After making great progress they stopped development (you may have heard this story before), and the main developers set up Coqui AI to continue working on the project. Today this is available for you as Coqui TTS.

You can try it out fairly easily via a Docker image, the instructions are in the README file. I spent some time playing with Coqui TTS and learned a lot about modern speech synthesis, which I will write up separately.
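For reference, trying it via Docker looks roughly like this. The image name and server script are taken from my reading of the Coqui TTS README, so double-check them there before copying:

```shell
# Pull the CPU image and get a shell inside the container
docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu

# Inside the container: start the demo server with a chosen model,
# then open http://localhost:5002 in a browser to type text and hear it
python3 TTS/server/server.py --model_name tts_models/en/ljspeech/vits
```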

The resource consumption of Coqui TTS is fairly high, at least for the higher quality models. We’re talking GBs of disk space, and minutes to generate audio.

It’s possible that GPU acceleration would help, but I can’t use that on my laptop as it requires a proprietary API that only works on a certain brand of GPU. It’s also likely that exporting the models from PyTorch, using TorchScript or ONNX, would make them a lot more lightweight. This is on the roadmap.

Piper

Thanks to an issue comment I then discovered Piper. This rather amazing project does TTS at a similar quality to Coqui TTS, but it can additionally export the models in ONNX format and use onnxruntime to execute them, which makes them lightweight enough to run on single-board computers like the Raspberry Pi (remember those?).
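Piper itself is pleasantly simple to drive: it reads text on stdin and writes a WAV file. A minimal invocation looks like this (the voice file name is just an example; voices are downloaded separately as .onnx files):

```shell
# Synthesize a sentence with a downloaded Piper voice model
echo "Piper runs comfortably on small machines." | \
  piper --model en_US-lessac-medium.onnx --output_file piper-test.wav
```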

It’s part of a project I wasn’t aware of called Home Assistant, which is building an open-source home assistant and is driven by a company called Nabu Casa. Something to keep an eye on.


Thanks to Piper I can declare success on this mini-project to get some basic screen reading functionality on my desktop. When I get time I will write up how I’ve integrated Piper with Speech Dispatcher – it was a little tricky. And I will write up the short research I did into the different Coqui TTS models that are available. Speak soon!