Machine Learning Is The Latest Stage Of Text To Speech Technology

Machine learning is drastically advancing the development of text to speech technology. Here's how, and why it's so important.

Matt James
September 19, 2019

Machine learning has played a very important role in the development of technology that has a large impact on our everyday lives. However, machine learning is also influencing the direction of technology that is not as commonplace. Text to speech technology is a prime example.

Text to speech technology predates machine learning by over a century. However, machine learning has made the technology more reliable than ever.

The Progression of Text to Speech Technology in the Machine Learning Era

We live in an era in which audiobooks are gaining ground on traditional printed literature. It therefore comes as no surprise that Text-to-Speech (TTS) technology is also rapidly becoming popular. It caters to those who need it most, including children who struggle with reading and people with disabilities. Large collections of recorded speech data make it possible to train the systems that assist these users.

Other elements of speech synthesis technology also rely on machine learning. It is now so sophisticated that it can even mimic a specific person’s voice.

Text to Speech is a piece of assistive technology (that is, any technology that helps individuals overcome their challenges) that reads text out loud, and it is available on almost every device we use today. It has taken years for the technology to reach this point. Machine learning is now changing its direction. However, its journey started in the late eighteenth century.

Text to Speech – The Early Days

TTS is a complicated technology that has developed over a long period of time. It began with the construction of acoustic resonators, which could produce only vowel sounds. These resonators were developed in 1779 through the dedicated work of Christian Kratzenstein. With the advent of semiconductor technology and improvements in signal processing, computer-based TTS devices started hitting the shelves in the 20th century. There was a lot of fascination surrounding the technology during its infancy. This was primarily why Bell Labs’ Vocoder demonstration found its way into the climactic scene of one of the greatest sci-fi films of all time, 2001: A Space Odyssey.

The Machine Learning Technology That Drives TTS

A couple of years ago, Medium contributor Utkarsh Saxena penned a great article on speech synthesis technology with machine learning. They talked about two very important machine learning approaches: Parametric TTS and Concatenative TTS. They both help with the development of new speech synthesizing techniques.

At its heart, a TTS engine has a front-end and a back-end component, and modern engines depend heavily on machine learning algorithms. The front-end converts raw text into a symbolic linguistic representation: phonetic transcriptions grouped into meaningful phrases and sentences. The back-end, the synthesizer, converts that representation into sound. A good synthesizer is key to a good TTS system, and today it typically relies on deep neural networks. The audio should be both intelligible and natural in order to mimic everyday conversation. Researchers are trying out various techniques to achieve this.
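The front-end and back-end split can be illustrated with a minimal Python sketch. It is not how any particular engine works: the three-word `LEXICON`, the ARPAbet-style phoneme symbols, the letter-by-letter fallback, and the fixed 80 ms duration are all assumptions made for illustration, and the back-end here only produces a timing plan rather than real audio.

```python
import re

# Hypothetical, tiny pronunciation lexicon (ARPAbet-style symbols).
# A real front-end would use a large dictionary plus a trained model.
LEXICON = {
    "text": ["T", "EH", "K", "S", "T"],
    "to": ["T", "UW"],
    "speech": ["S", "P", "IY", "CH"],
}


def normalize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def front_end(text):
    """Map each word to a list of phonemes.

    Unknown words fall back to being spelled out letter by letter,
    which is an assumption of this sketch, not a standard rule.
    """
    return [LEXICON.get(word, list(word.upper())) for word in normalize(text)]


def back_end(phoneme_words, ms_per_phoneme=80):
    """Stand-in for a synthesizer: return (phoneme, duration_ms) pairs.

    A real back-end would turn this symbolic plan into a waveform.
    """
    return [(p, ms_per_phoneme) for word in phoneme_words for p in word]


plan = back_end(front_end("Text to Speech"))
```

The point of the design is the interface between the two halves: the front-end emits symbols, and the back-end never needs to see the original spelling.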

Concatenation synthesis relies on piecing together multiple segments of recorded speech to form coherent sentences. This approach usually yields the most natural-sounding speech. However, it can lose out on intelligibility, because poor segmentation produces audible glitches where segments join. Formant synthesis is used when intelligibility takes precedence over naturalness. This approach does not use human speech samples, and hence sounds noticeably ‘robotic’. Because it needs no speech-sample database, it is relatively lightweight and well suited to embedded systems, where power and memory are scarce. Various other techniques also exist, but the most recent and notable one is machine learning, in which recorded speech data is used to train deep neural networks. Today’s digital assistants use these extensively.
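To make the concatenative idea concrete, here is a rough Python sketch that joins stored segments and applies a short linear crossfade at each boundary, one simple way to soften the glitches described above. The "recordings" are synthetic sine tones standing in for real speech units, and the names `UNITS`, `tone`, and `concatenate`, as well as the 80-sample overlap, are assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def tone(freq_hz, n_samples, rate=SAMPLE_RATE):
    """Generate a sine tone; a placeholder for a recorded speech unit."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n_samples)]


# Hypothetical unit database: unit name -> list of audio samples.
UNITS = {"a": tone(220, 800), "b": tone(330, 800)}


def concatenate(unit_names, units=UNITS, overlap=80):
    """Join units in order, crossfading `overlap` samples at each joint."""
    out = []
    for name in unit_names:
        seg = units[name]
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        # Linear crossfade: fade the tail of `out` down while fading `seg` in.
        for i in range(n):
            w = (i + 1) / (n + 1)
            out[start + i] = (1 - w) * out[start + i] + w * seg[i]
        out.extend(seg[n:])
    return out


audio = concatenate(["a", "b"])
```

With `overlap=0` the segments are simply butted together, which is where abrupt jumps in the waveform tend to appear.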

The Challenges

Contextual understanding of the text on the screen is one of the main challenges for TTS systems. More often than not, human readers are able to understand certain abbreviations without a second thought. However, these are very confusing to computer models. A simple example would be to consider two phrases, “Henry VIII” and “Chapter VIII”. Clearly, the former should be read as Henry the Eighth and the latter should be read as Chapter eight. What seems trivial to us is anything but for the developers who build the front-end of TTS systems at companies like Notevibes.
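The “Henry VIII” versus “Chapter VIII” case can be sketched as a small rule-based normalizer in Python. This is only an illustration of why context matters, not the method any vendor uses: the cue-word list `SECTION_WORDS`, the limit of numerals to 1 through 10, and the rule that a capitalized preceding word signals a name are all assumptions. It would misfire on sentences such as “Then I left”, which is exactly why production systems lean on predictive models instead of hand-written rules.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
CARDINALS = ["", "one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine", "ten"]
ORDINALS = ["", "first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth", "tenth"]

# Hypothetical cue words that suggest a numbered section, not a name.
SECTION_WORDS = {"chapter", "part", "section", "volume"}


def roman_to_int(numeral):
    """Convert a Roman numeral string to an integer (subtractive notation)."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (e.g. IX = 9).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def expand_roman(text):
    """Rewrite '<word> <roman numeral>' based on the preceding word."""
    def repl(match):
        prev, numeral = match.group(1), match.group(2)
        n = roman_to_int(numeral)
        if not 1 <= n <= 10:
            return match.group(0)  # outside this sketch's small word lists
        if prev.lower() in SECTION_WORDS:
            return f"{prev} {CARDINALS[n]}"
        if prev[0].isupper():
            return f"{prev} the {ORDINALS[n].capitalize()}"
        return match.group(0)

    return re.sub(r"\b([A-Za-z]+) ([IVX]+)\b", repl, text)
```

For example, `expand_roman("Henry VIII")` gives `"Henry the Eighth"`, while `expand_roman("Chapter VIII")` gives `"Chapter eight"`.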

They use various predictive models to enhance the user experience. But there is a lack of standard evaluation criteria to judge the accuracy of a TTS system. A lot of variables go into the quality of a particular recording, and these variables are hard to control. This is due to the involvement of both analog and digital processing. However, an increasing number of researchers have begun to evaluate a TTS system based on a fixed set of speech samples.

That, in a nutshell (a rather big one at that), is an overview of Text to Speech systems. With increased emphasis on AI, ML, DL, etc., it is only a matter of time before we are able to synthesize true-to-life speech for use in our ever-evolving network of things.

Machine Learning is the Core of Text to Speech Technology

Machine learning is integral to the development of text to speech technology. As the technology evolves, new speech synthesis tools rely on deep neural networks to provide the highest quality output.