Do Text-to-Speech Voices Really Sound Real?
The secret behind authentic text-to-speech voices
These days, quality isn’t something you sacrifice with text-to-speech voices; it’s something you gain. Text-to-speech now sounds so surprisingly real that most people cannot tell the difference between AI-generated speech and an actual human voice. There are a few reasons why this is the case, and a few areas where AI-powered text-to-speech shines.
What makes text-to-speech voices so unnaturally … natural?
Here are a few ways to ensure that text-to-speech sounds less machine-like and more lifelike.
One of the reasons early text-to-speech engines sounded robotic is that the software pronounced every single word in exactly the same way. When people speak, of course, they vary the way they say words, even the exact same ones. They add inflections, shifts in tone, and different points of emphasis.
“When you think of the human voice, what makes it natural is the inconsistencies,” said Matt Hocking, CEO of WellSaid Labs, an AI-powered text-to-speech platform for learning and development companies.
WellSaid Labs worked with hundreds of voice actors and fed their audio into its system. The result: WellSaid’s text-to-speech voices sound remarkably similar to the people they learned from. The AI practiced speaking by listening to people speak, in many different ways, even for the exact same words.
Another characteristic of human speech is pausing. Humans need air, so they naturally pause to inhale, exhale, and swallow before starting again. These pauses create rhythmic, natural-sounding variation. Early text-to-speech ignored that nuance (software doesn’t need to pause for oxygen), but simulating it makes today’s text-to-speech sound much more lifelike.
In today’s text-to-speech editors, you can simulate these pauses by adding commas, hyphens, periods, and ellipses, instructing the voice to pause just as a human would. The engine treats these punctuation marks as performance notes rather than grammar: they tell the voice where to pause, hold, and create natural stillness, just as humans do.
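To make the idea concrete, here is a minimal sketch of how an editor could translate punctuation-style pause hints into SSML `<break>` tags, which many speech engines accept. The marker conventions below (a double hyphen for a long pause, an ellipsis for a shorter one) are hypothetical illustrations, not WellSaid Labs’ actual editor behavior.

```python
# Sketch: mapping punctuation-style pause hints onto SSML <break> tags.
# The marker syntax and timings are illustrative assumptions.

def add_pauses(text: str) -> str:
    """Replace pause markers with SSML break elements and wrap in <speak>."""
    ssml = text.replace("--", '<break time="700ms"/>')   # long, deliberate pause
    ssml = ssml.replace("...", '<break time="400ms"/>')  # shorter, trailing pause
    return f"<speak>{ssml}</speak>"

print(add_pauses("Take a breath -- then continue... naturally."))
```

An SSML-capable engine would render the breaks as silence of roughly the requested duration, which is what gives the speech its breathing-like rhythm.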
Of course, when you speak, you emphasize certain words with intonation. Today’s text-to-speech does, too. Because the AI learned from people who use intonation, it incorporated intonation into its own way of speaking. It’s like children learning to speak from the adults around them, only in this case the child is a very sophisticated data tool that can analyze many languages and voices at the same time.
If specific words might be unclear to the text-to-speech engine, you can simply mark them in the editor. For example, you can put quotation marks around words, capitalize entire words, or capitalize parts of words you want highlighted. Today’s text-to-speech reads these cues just as a voice actor would and understands where the intonation needs to be adjusted.
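As a sketch of how such cues could work under the hood, the snippet below maps fully capitalized words onto SSML’s standard `<emphasis>` element. The mapping rule is an illustrative assumption, not a documented product feature.

```python
# Sketch: treating ALL-CAPS words as emphasis cues and rewriting them
# as SSML <emphasis> elements. The convention is an assumption.
import re

def mark_emphasis(text: str) -> str:
    """Wrap fully capitalized words (2+ letters) in SSML emphasis tags."""
    def wrap(match: re.Match) -> str:
        return f'<emphasis level="strong">{match.group(0).capitalize()}</emphasis>'
    return re.sub(r"\b[A-Z]{2,}\b", wrap, text)

print(mark_emphasis("This release is REALLY important."))
```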
Another challenge for early text-to-speech was that the same written word can be pronounced differently depending on how it is used. Take the example of “read”: the past tense is pronounced “red”, while the present tense is pronounced “reed”. The text-to-speech of yesteryear may have missed the difference, but today’s text-to-speech captures such subtleties with ease.
If words or acronyms are still ambiguous, you can easily add a phonetic notation in the editor to make sure the text-to-speech picks up the nuance, exactly as you might coach a voice actor. For example, instead of typing “COO”, you could spell “CO-O” so the voice pronounces the letters of the acronym rather than blending them together into a word.
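The “CO-O” trick has a direct analogue in SSML’s standard `<sub>` element, which substitutes a pronounceable alias at speak time. Here is a minimal sketch; the helper name and examples are my own, and engine support for `<sub>` varies.

```python
# Sketch: giving the engine a spoken alias for an acronym, similar to
# spelling out "CO-O" for a voice actor, via SSML's <sub> element.

def respell(word: str, alias: str) -> str:
    """Wrap a word in an SSML <sub> tag carrying a pronounceable alias."""
    return f'<sub alias="{alias}">{word}</sub>'

print(respell("COO", "C O O"))   # spoken letter by letter
print(respell("SQL", "sequel"))  # or as a common spoken form
```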
In many cases, text-to-speech platforms like WellSaid Labs handle long words and numbers even better than human actors. For example, try reading the word “antidisestablishmentarianism” in one go. A text-to-speech voice can join the syllables smoothly to produce natural-sounding pronunciation that most voice actors would miss without a few practice runs.
Pronunciation also differs not just by tense but by location and culture. For example, “caramel” can be pronounced either “care-a-mel” or “car-mel”. Likewise, “aunt” can be pronounced as “ant” or “ont”. By entering an alternative spelling in a text-to-speech editor, you can quickly steer the AI toward the pronunciation you want, overriding its default just as you would a voice actor’s.
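A simple way to picture this override is a lookup table of preferred respellings applied before synthesis. The table and helper below are hypothetical illustrations of the idea, not a WellSaid Labs feature.

```python
# Sketch: applying regional pronunciation preferences via respelling,
# using the "caramel" and "aunt" examples above. Hypothetical helper.

PREFERRED = {
    "caramel": "car-mel",  # vs. "care-a-mel"
    "aunt": "ont",         # vs. "ant"
}

def apply_pronunciations(text: str) -> str:
    """Swap listed words for their preferred phonetic respelling."""
    for word, respelling in PREFERRED.items():
        text = text.replace(word, respelling)
    return text

print(apply_pronunciations("My aunt makes caramel."))
```

A production system would of course match whole words and handle case, but the substitution principle is the same.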
What the research says
Obviously we’re big fans of text-to-speech. But what are the actual listeners saying?
In July 2019, the text-to-speech platform WellSaid Labs asked study participants to listen to a series of randomized recordings made by both synthetic voices and human voice actors. For each file, participants were asked:
“How natural (i.e., human-sounding) is this recording?”
Each recording was then rated on a scale from 1 (poor: completely unnatural speech) to 5 (excellent: completely natural speech).
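This 1-to-5 naturalness rating is the classic mean opinion score (MOS) setup. As a sketch, the score is just the average of all listener ratings; the ratings below are made-up illustrative data, not the study’s actual responses.

```python
# Sketch: computing a mean opinion score (MOS) from 1-5 naturalness
# ratings, as described in the study above. Ratings are invented.

def mean_opinion_score(ratings: list[int]) -> float:
    """Average a list of 1-5 naturalness ratings."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return sum(ratings) / len(ratings)

print(mean_opinion_score([5, 4, 5, 4, 5]))  # 4.6
```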
Voice actors achieved an average score of around 4.5, likely because some recordings contained background noise or mispronunciations.
WellSaid Labs matched that benchmark in June 2020, with its synthetic voices rated just as highly as actual human voice actors. WellSaid Labs even hired a third party to verify the results.
So the data (and the AI) speak for themselves: today’s synthetic text-to-speech sounds undeniably, shockingly human, and, as is the nature of AI, it only gets better.
To hear current examples of human-sounding TTS, check out Voice Actor Comparisons to Synthetic TTS for everything from complex words to numbers, acronyms, punctuation, and more. We think you will be surprised at how hard it is to tell the difference.
Download the eBook Text-to-Speech For L&D Pros: The Next Frontier In Storytelling to learn how to use AI speech generators for your remote learning programs and drive employee engagement. Also, attend the webinar to learn how to update eLearning voiceovers on time and under budget!