Screen reading, part 2: looking at Coqui TTS

I did some research into open-source text-to-speech solutions recently. As part of that I spent a while trying out Coqui TTS and, while trying to get it to sound nice, I learned a few things about voice synthesis.

Coqui TTS provides some speech models which are pre-trained on specific datasets. For text-to-speech, each dataset contains text snippets paired with audio recordings of humans reading the text.
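
The model names encode the training dataset, so listing the pre-trained models is a quick way to see which dataset/model combinations ship with the library. Here is a minimal sketch using the Python API, assuming the TTS package is installed; the command-line equivalent is tts --list_models, and the exact return type of list_models() has varied between releases:

    # List the pre-trained models bundled with Coqui TTS.
    # Names follow the pattern tts_models/<language>/<dataset>/<model>,
    # e.g. tts_models/en/ljspeech/vits.
    from TTS.api import TTS

    print(TTS().list_models())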

Here is a summary of some of the available open datasets for English:

  • ljspeech: audio recordings taken from LibriVox audio books. One (female) speaker, reading text from 19th and 20th century non-fiction books, with 13K audio clips.
  • VCTK: 110 different speakers, ~400 sentences per speaker
  • Sam: a female voice actor (Mel Murphy) reading a script which was then post-processed to sound non-binary. There’s an interesting presentation about the gender of voice assistants accompanying this dataset. ~8 hrs of audio.
  • Blizzard2013: a voice actor (Catherine Byers) reading 19th & 20th century audio books, focusing on emotion. Released as part of the Blizzard 2013 challenge. ~200 hrs of audio.
  • Jenny: a single voice actor (again, female) reading various 21st century texts. ~30 hrs of audio.

There’s a clear bias towards female voices here, and a quick search for “gender of voice assistant” will turn up some interesting writing on the topic.

There are then different models that can be trained on a dataset. Most models have two stages, an encoder and a vocoder: the encoder generates a mel spectrogram, which the vocoder then turns into audio samples. Common models available in Coqui TTS are listed below (a short usage sketch follows the list):

  • GlowTTS: a “flow-based” encoder model
  • Tacotron: a “DL-based” encoder model. Requires some trick to improve “attention”, the recommended one being DDC (“double decoder consistency”).
  • VITS: an end-to-end model, combining the GlowTTS encoder and HiFiGAN vocoder.
  • Tortoise: a GPT-like end-to-end model, using the Univnet vocoder. Claimed to be very slow compared to VITS.
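
To try one of these, the Python API only needs a model name. A minimal sketch, assuming the TTS package is installed and using the ljspeech VITS model as an example (the first run downloads the model weights):

    # Load a pre-trained model by name and render a sentence to a WAV file.
    # VITS is end-to-end, so there is no separate vocoder to pick.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/vits")
    tts.tts_to_file(
        text="This is a test of the Coqui text to speech pipeline.",
        file_path="sample.wav",
    )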

I tried some of the models; here are my notes. The test case was a paragraph from LWN, which is the sort of thing I want this to read aloud.

model                                  | size  | time to render | notes
en/sam/tacotron-DDC                    | 324MB | 48.5s          | glitchy in places, listenable
en/blizzard2013/capacitron-t2-c150_v2  | 348MB | 13.6s          | good quality speech, but sounds like a patchwork quilt of different speakers and environments
en/jenny/jenny                         | 1.7GB | 207.3s         | very high quality, slow to render
en/ljspeech/vits--neon                 | 139MB | 47.7s          | uneven but intelligible
en/multi-dataset/tortoise-v2           | 4.3GB | –              | didn’t work (error in load_config)

(Note that “Jenny” is in fact a VITS model.)
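
The timings above will obviously depend on hardware, but rough numbers like these are easy to reproduce by wrapping a synthesis call in a wall-clock timer. Here is a sketch along those lines; the model names and test text are placeholders rather than my exact setup:

    # Compare wall-clock render time for a couple of models on the same text.
    # Substitute whichever paragraph you actually want read aloud.
    import time

    from TTS.api import TTS

    TEXT = "A paragraph of article text goes here."

    for name in ("tts_models/en/ljspeech/vits", "tts_models/en/sam/tacotron-DDC"):
        tts = TTS(model_name=name)
        start = time.perf_counter()
        tts.tts_to_file(text=TEXT, file_path="test.wav")
        print(f"{name}: {time.perf_counter() - start:.1f}s to render")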

VITS seems to be the current state of the art among publicly available models; it is also the architecture the Piper TTS engine uses.

The final installment, if I get to it, will be a quick guide on getting Piper TTS to work with Speech Dispatcher. See you then!

4 thoughts on “Screen reading, part 2: looking at Coqui TTS”

  1. As someone with dyslexia who uses the “Read Aloud” extension with the Google Translate voice mentioned in the previous post, I’m really excited to see that someone is experimenting with different speech synthesizers and might be able to find a good one to use with Speech Dispatcher. It would be really nice to have a high-quality locally running screen reader.

    Rant (probably not relevant to this post, but here it is anyway):
    One thing to note is that a full screen reader like Orca, which combines the AT-SPI2 accessibility API and Speech Dispatcher, serves a very different use case from that of someone who only wants occasional text-to-speech for articles (like me). To my knowledge, Orca will try to verbalize all the user interface elements, because it’s designed to assist people with vision problems in using those interfaces. In my case, I can see my screen perfectly fine but have trouble processing the text, which is why I like the extension: I can simply select and right-click a piece of text and have the computer say it, without having every event and button press verbalized as in Orca. I think it’s useful to keep these use cases in mind when developing accessibility tech (though this post was more about experimenting, so it’s probably not relevant here), as otherwise we may end up in a situation where Speech Dispatcher hooks into a great synthesizer, but the apps that use it (like Orca) don’t cover enough use cases.

