Google’s DeepMind says it has made a breakthrough in computer-generated speech, one that it claims narrows the gap between existing technology and human speech by more than 50 percent. The UK-based company aims to develop computers with “super” artificial intelligence (AI) capabilities.
In a post on its website, the company says it has created machine speech that comes closer to the human voice than anything before it. Dubbed WaveNet, the system models the individual sound waves that humans produce. DeepMind compared its results with existing programs, including Google’s own, and says WaveNet closes the gap to human speech by at least 50 percent, bringing us closer to a more realistic text-to-speech future.
If you’ve read this far and you’re thinking this is just another advanced recorder, it isn’t quite. The aim is to teach machines how humans pronounce words in different languages so that they can form new words of their own. The closer the technology gets to perfection, the more naturally we will be able to interact with machines, just as we do with other people.
A large set of short recordings of human voices is fed into a computer, and systems like WaveNet learn from them to form entirely new words. That is what this technology is about. Companies like Apple are still silent on their plans for AI, although we do know that its digital assistant Siri will now be opened to developers. Even so, this milestone no doubt puts Google a step ahead in the AI future.
How does this differ really from Siri, Cortana or Alexa?
The first thing to note is that they are all digital assistants from tech companies that rely on artificial intelligence to help you with queries. You engage them and they reply in a human voice (we know the voice of Siri at least). This happens through a process called concatenative text to speech (TTS), which DeepMind defines as a system “where a very large database of short speech fragments are recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.”
Put another way, the current Siri and Cortana can only say what has already been recorded for them, and they cannot express feelings through tone, as humans do, without an entirely new database. Concatenative TTS has been largely successful in its own right, but to give it changing tones you would need one or more humans to record every possible sound in many different ways, and that is a daunting task. The other way of doing this is parametric TTS, which is considered too robotic.
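To make the idea of recombining recorded fragments concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the fragment names and the numbers standing in for audio samples are invented, and real assistants use far larger databases and smarter ways of choosing and joining fragments.

```python
# Toy concatenative TTS: recombine pre-recorded fragments into an utterance.
# The "recordings" below are made-up number lists, not real audio.

# Hypothetical database recorded from a single speaker: unit -> samples.
FRAGMENT_DB = {
    "hel": [0.1, 0.3, 0.2],
    "lo": [0.4, 0.1],
    "wor": [0.2, 0.5, 0.3],
    "ld": [0.1, 0.0],
}


def synthesize(units):
    """Concatenate the stored fragments for each requested unit."""
    waveform = []
    for unit in units:
        if unit not in FRAGMENT_DB:
            # Nothing was recorded for this unit, so it cannot be spoken.
            raise KeyError(f"no recording for unit {unit!r}")
        waveform.extend(FRAGMENT_DB[unit])
    return waveform


utterance = synthesize(["hel", "lo", "wor", "ld"])
print(len(utterance))  # 10 samples: 3 + 2 + 3 + 2
```

Notice that the only way to change the voice or its emotion in this sketch is to replace everything in `FRAGMENT_DB`, which is the limitation the quote describes.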
Parametric Text To Speech (TTS)
This is a purely computer-generated model that relies on programmed rules and doesn’t need human voice inputs. Even so, the quality of the output depends on the signal processing method used. As DeepMind puts it, the “contents and characteristics of the speech can be controlled via the inputs to the model.” This approach can be used in embedded systems with limited memory. In DeepMind’s results chart, it underperforms the other methods in English at least. In Mandarin Chinese it is a different story, though the results are still not that good.
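The point that the output is “controlled via the inputs to the model” can be sketched in a few lines of Python. This is a deliberately crude stand-in, not how a production parametric system works: the function name, the parameters (`pitch_hz`, `duration_s`, `loudness`) and the use of a plain sine wave are all assumptions made for illustration.

```python
import math

SAMPLE_RATE = 16000  # samples per second; chosen for illustration


def synthesize_tone(pitch_hz, duration_s, loudness=0.5):
    """Generate a sine wave whose character is set entirely by its inputs."""
    n_samples = round(SAMPLE_RATE * duration_s)
    return [
        loudness * math.sin(2 * math.pi * pitch_hz * t / SAMPLE_RATE)
        for t in range(n_samples)
    ]


# No recordings are stored: changing the inputs changes the sound.
low = synthesize_tone(pitch_hz=110, duration_s=0.01)
high = synthesize_tone(pitch_hz=220, duration_s=0.01, loudness=0.8)
print(len(low), len(high))
```

Because nothing is stored except the rule itself, this style of synthesis fits in very little memory, but a rule this simple also hints at why the results can sound robotic.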
WaveNet
This is the new Google method, which the company says comes closest to the human voice when all the methods are compared side by side on a chart.
WaveNet works quite differently from the two methods used in current AI systems. It learns from human recordings and then independently creates its own voices, and words for that matter. In that sense it goes beyond concatenative TTS to give interaction with machines a “human face”. As humans we pause and breathe when talking, and that’s something WaveNet does too. Taking this a step further, WaveNet is able to learn from sounds and produce entirely new content suited to a different context from the original, and that’s a huge step towards a whole new AI future. Call it AI on steroids if you wish and you won’t be wrong.
Here’s how they put it at Google: “the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.”
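The sampling loop described in that quote can be sketched roughly as follows. This is not DeepMind’s code: `next_step_distribution` is a hypothetical stand-in for the trained network, and the eight amplitude levels are an assumption chosen to keep the example small.

```python
import random

LEVELS = 8  # toy number of amplitude levels; an assumption for illustration


def next_step_distribution(history):
    """Stand-in for a trained network: favour levels near the last sample."""
    last = history[-1] if history else LEVELS // 2
    weights = [1.0 / (1 + abs(level - last)) for level in range(LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(n_steps, seed=0):
    """Build a sequence one sample at a time, feeding each draw back in."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_steps):
        probs = next_step_distribution(samples)
        # Draw one value from the predicted distribution...
        value = rng.choices(range(LEVELS), weights=probs, k=1)[0]
        # ...and feed it back as input for the next prediction.
        samples.append(value)
    return samples


print(generate(10))
```

Each pass through the loop depends on the one before it, so the steps cannot simply be run side by side, which is why building audio this way is expensive.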
Test Results
The scale of measurement runs from 1 to 5, with 1 being unrealistic and 5 being most realistic, based on listeners in 500 blind tests conducted by the team at DeepMind. Listeners rated WaveNet 4.21 in English and 4.08 in Mandarin. Human speech scored 4.55 out of a possible 5. That’s not even a perfect score for humans, but it shows how close WaveNet is getting to the human voice, and it greatly outperformed the concatenative and parametric TTS methods. You can listen to the audio below for yourself:
Parametric TTS
Concatenative TTS
WaveNet
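For readers wondering how a score such as 4.21 comes out of a listening test, it is typically an average of individual 1-to-5 ratings. The sketch below shows that calculation in Python. The ratings in it are invented for illustration and are not DeepMind’s data, and the function name is our own.

```python
def mean_opinion_score(ratings):
    """Average a list of 1-to-5 listener ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from five listeners for one clip (not DeepMind data).
example_ratings = [4, 5, 4, 4, 5]
print(mean_opinion_score(example_ratings))  # 4.4
```

Averaging also explains why real human speech did not reach 5: it only takes a few listeners marking a clip below the top score to pull the mean down.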
Challenges
Taking WaveNet commercial is computationally expensive at the moment. As DeepMind puts it, audio requires a high sampling rate of 16,000 samples per second. Each sample is predicted from the samples that came before it, so the system has to work through the waveform one step at a time, and that is what makes WaveNet’s output quality so costly to produce.
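A quick back-of-the-envelope calculation in Python shows why this adds up. The 16,000 figure comes from the article; the function names and the time per prediction are made-up values used only to show the arithmetic, not measurements of WaveNet.

```python
SAMPLES_PER_SECOND = 16_000  # figure quoted in the article


def sequential_predictions(duration_s):
    """Number of one-at-a-time predictions needed for a clip of this length."""
    return round(SAMPLES_PER_SECOND * duration_s)


def generation_time_s(duration_s, seconds_per_prediction):
    """Rough time estimate; seconds_per_prediction is a hypothetical input."""
    return sequential_predictions(duration_s) * seconds_per_prediction


print(sequential_predictions(1))   # 16000
print(sequential_predictions(10))  # 160000
# With an assumed 1 ms per prediction, a 10-second clip takes minutes.
print(generation_time_s(10, 0.001))
```

Under that assumed speed, generating audio would take far longer than playing it back, which is the kind of gap that has to close before the approach can run inside a digital assistant.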
Future
DeepMind is responsible for AlphaGo, a program developed for the board game Go that beat a top-ranked player this year. All the big tech companies have announced steps to make their digital assistant services more attractive, and WaveNet could eventually be the way to go. With better processing techniques, this could well become the future of AI with respect to digital assistants. About 20 percent of searches on Google are now voice based, and this could eventually lead Google to increase funding for this area of research. It took a while before tech giants started paying considerable attention to mobile, too.
But like the space and weapons races of the 60s and 70s, we may be seeing an AI race to the top among tech companies, and that’s a good thing.
DeepMind is a British artificial intelligence company and was acquired by Google in 2014.