IBM delivered a technological shockwave just five months after Microsoft proudly announced that its speech recognition technology had reached a 5.9 percent word error rate (WER), bringing the technology closer to matching human performance. IBM reported that it had pushed the envelope even further, delivering a remarkable 5.5 percent WER and setting a new record for machine-based speech recognition, previously held by the Microsoft system.
But what is the significance of WER? In speech recognition and translation systems, WER is the standard measure of accuracy: the fraction of words a system gets wrong, so a lower value reflects higher accuracy. Human performance currently stands at around 5.1 percent.
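To make the metric concrete, here is a minimal sketch of how WER is typically computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a system's hypothesis, divided by the reference length. This is illustrative scoring code, not IBM's or Microsoft's evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed via a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One inserted word against a three-word reference gives a WER of 1/3
print(word_error_rate("the cat sat", "the cat sat down"))
```

By this measure, a 5.5 percent WER means roughly one error in every eighteen words of reference speech.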
IBM’s success was forged by integrating two specific language-modeling technologies: Long Short-Term Memory (LSTM) and WaveNet, developed by Google’s DeepMind. WaveNet was designed to generate speech that resembles the human voice as closely as possible, whereas LSTM is a recurrent neural network unit that is highly effective at retaining values over long or short periods. LSTM’s strength lies in its ability to learn from history and, as a result, make better predictions on time-series data such as speech.
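The "remembering" the article describes comes from the LSTM's gated cell state. Below is a toy, scalar sketch of a single LSTM step following the standard formulation; real models use matrix weights over vectors, and the weight values here are purely illustrative, not anything from IBM's system.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell (standard formulation, toy weights)."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])   # forget gate: keep old memory?
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])   # input gate: admit new info?
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])   # output gate: expose memory?
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"]) # candidate value
    c = f * c_prev + i * g      # cell state: the long-term memory the gates protect
    h = o * math.tanh(c)        # hidden state: the short-term output
    return h, c

# Illustrative weights only; all gates share the same small values for brevity
weights = {k: 0.5 for k in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
weights.update({k: 0.0 for k in ("bf", "bi", "bo", "bg")})
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=weights)
```

Because the forget gate multiplies the previous cell state rather than overwriting it, the cell can carry information across many steps, which is what makes LSTMs effective on sequential signals like audio.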
IBM reports that the interplay between these two technologies is what enabled it to achieve a lower WER than Microsoft’s system. However, the tech titans differ on how these figures equate to human parity. Microsoft maintains that its 5.9 percent WER matches the performance of an average person on a speech recognition task, whereas IBM asserts that 5.1 percent is a more fitting representation of human parity, and that is the mark it is aiming for.
In the end, the objective for all players in the field, according to IBM, is ‘human parity’: an error rate equivalent to that of two humans conversing. Many in the industry have claimed to have reached the coveted 5.9 percent WER mark and treated it as synonymous with human parity. IBM, however, argues that this is not yet cause for celebration, having “determined human parity is actually lower than what anyone has yet achieved”, at 5.1 percent. The company continues to challenge itself and others in the groundbreaking race for ultimate speech recognition technology.