I did some research into open-source text-to-speech solutions recently. As part of that I spent a while trying out Coqui TTS and, while I was trying to get it to sound nice, I learned a few things about voice synthesis.
Coqui TTS provides some speech models which are pre-trained on specific datasets. For text-to-speech, each dataset contains text snippets paired with audio recordings of humans reading the text.
Here is a summary of some of the available open datasets for English:
- ljspeech: audio recordings taken from LibriVox audio books. One (female) speaker, reading text from 19th and 20th century non-fiction books, with 13K audio clips.
- VCTK: 110 different speakers, ~400 sentences per speaker
- Sam: a female voice actor (Mel Murphy) reading a script which was then post-processed to sound non-binary. There’s an interesting presentation about gender of voice assistants accompanying this dataset. ~8 hrs of audio.
- Blizzard2013: A voice actor (Catherine Byers) reading 19th & 20th century audio books, focusing on emotion. Released as part of Blizzard 2013 challenge. ~200 hrs of audio.
- Jenny: a single voice actor (again, female) reading various 21st century texts. ~30 hrs of audio.
There’s a clear bias towards female voices here, and a quick search for “gender of voice assistant” will turn up some interesting writing on the topic.
There are then different models that can be trained on a dataset. Most models have two stages, an encoder and a vocoder. The encoder generates a mel spectrogram which the vocoder then uses to generate audio samples. Common models available in Coqui TTS are:
- GlowTTS: a “flow-based” encoder model
- Tacotron: a “DL-based” encoder model. Requires a trick to stabilise “attention”; the recommended one is DDC (“double decoder consistency”).
- VITS: an end-to-end model, combining the GlowTTS encoder and HiFiGAN vocoder.
- Tortoise: a GPT-like end-to-end model, using the Univnet vocoder. Claimed to be very slow compared to VITS.
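To make the encoder/vocoder split above concrete, here is a toy numpy sketch of what a mel spectrogram actually is: a short-time power spectrum mapped onto perceptually spaced “mel” bins, then log-compressed. The parameters (1024-sample FFT, 256-sample hop, 80 mel bins) are common TTS defaults but are illustrative here, not necessarily what any particular Coqui model uses.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters mapping linear FFT bins onto mel-spaced bins."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, centre):
            fb[i - 1, j] = (j - left) / max(centre - left, 1)
        for j in range(centre, right):
            fb[i - 1, j] = (right - j) / max(right - centre, 1)
    return fb

def mel_spectrogram(signal, sr, n_fft=1024, hop=256, n_mels=80):
    """Log-mel spectrogram: the intermediate representation the encoder
    predicts and the vocoder turns back into audio samples."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)  # power spectrum
    power = np.array(frames).T  # shape: (n_fft // 2 + 1, n_frames)
    mel = mel_filterbank(n_mels, n_fft, sr) @ power
    return np.log(mel + 1e-10)

# One second of a 440 Hz tone at 22.05 kHz (a common TTS sample rate)
sr = 22050
t = np.arange(sr) / sr
spec = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr)
```

The encoder’s whole job is to predict an array shaped like `spec` (mel bins × time frames) from text; the vocoder then has the much harder task of inventing plausible phase and fine detail to turn it back into a waveform.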
I tried some of the models; here are my notes. The test case was a paragraph from LWN, which is the sort of thing I want this to read aloud.
model | size | time to render | notes |
---|---|---|---|
en/sam/tacotron-DDC | 324MB | 48.5s | glitchy in places, listenable |
en/blizzard2013/capacitron-t2-c150_v2 | 348MB | 13.6s | good quality speech, but sounds like a patchwork quilt of different speakers and environments |
en/jenny/jenny | 1.7GB | 207.3s | very high quality, slow to render |
en/ljspeech/vits--neon | 139MB | 47.7s | uneven but intelligible |
en/multi-dataset/tortoise-v2 | 4.3GB | n/a | didn’t work: error in load_config |
(Note that “Jenny” is in fact a VITS model.)
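For reference, the “time to render” column is just wall-clock time for a single synthesis run. A minimal harness for that kind of measurement might look like the sketch below; it shells out to Coqui’s `tts` command-line tool, and the model name and sample text are the ones from the table, shown purely as an example invocation.

```python
import shutil
import subprocess
import sys
import time

def time_command(cmd):
    """Run a command and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Only attempt synthesis if Coqui's CLI entry point is installed.
    if shutil.which("tts"):
        elapsed = time_command([
            "tts",
            "--model_name", "tts_models/en/jenny/jenny",
            "--text", "The quick brown fox jumps over the lazy dog.",
            "--out_path", "out.wav",
        ])
        print(f"render took {elapsed:.1f}s")
    else:
        print("tts not found; install Coqui TTS first", file=sys.stderr)
```

Note that the first run of any model also downloads it (hence the sizes in the table), so timings are only meaningful from the second run onwards.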
VITS seems to be the current state of the art among publicly available models. It is also the model the Piper TTS engine uses.
The final installment, if I get to it, will be a quick guide on getting Piper TTS to work with Speech Dispatcher. See you then!