Introduction and constant progress
Computers are very close to understanding what you’re saying as well as another human could, even if they don’t yet know what you’re talking about.
“Speech recognition is really close to reaching parity with humans, in the next three years,” Xuedong Huang, Microsoft’s Chief Speech Scientist, told techradar pro.
“If we can achieve this goal it will be a major landmark for civilisation. Language is only something we humans understand and master. The moment a computer can transcribe your conversation over the phone almost as accurately as humans is a major landmark for AI.” And for the typical conversation over the phone, he believes we’ll get there in three years – at least in terms of recognising what’s being said.
“Transcription is different from understanding; understanding is a different story,” he cautions. “To understand the message, the subtlety of what’s being said – that’s a long way off. To understand intent and meaning, we still have a long way to go.”
He’s been working on speech recognition for over 30 years, and he says he’s seen consistent improvements throughout. The benchmark researchers use to measure accuracy is transcribing a telephone conversation between two people, and on that task he’s seen the error rate fall by 20% relative to the previous year, every year.
Thanks to deep learning, the best systems, like Cortana, are now making only twice as many errors as humans do. “The transcription error is around 8% now; that’s about twice as high as human error, which is around 4%. If we can maintain a 25% reduction every year – well, you do the math! I hope the last 4% is not too hard, and in the next three years we can achieve this.”
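Taking Huang up on his invitation to do the math, here is a minimal sketch of the arithmetic. It assumes the error rate shrinks by a fixed relative fraction each year and that the improvements compound annually. The function names are illustrative and not from Microsoft. The 8% and 4% figures and the 25% reduction come from the quote above, and the 20% figure is the historical rate mentioned earlier.

```python
import math


def years_to_reach(start_rate, target_rate, yearly_reduction):
    """Count whole years until an error rate that shrinks by a fixed
    relative fraction each year falls to or below the target."""
    years = 0
    rate = start_rate
    while rate > target_rate:
        rate *= 1.0 - yearly_reduction
        years += 1
    return years


def years_closed_form(start_rate, target_rate, yearly_reduction):
    """Same question solved directly: find n such that
    start_rate * (1 - yearly_reduction) ** n == target_rate."""
    return math.log(target_rate / start_rate) / math.log(1.0 - yearly_reduction)


if __name__ == "__main__":
    machine_error = 8.0  # percent, from the quote
    human_error = 4.0    # percent, from the quote
    for reduction in (0.20, 0.25):
        whole = years_to_reach(machine_error, human_error, reduction)
        exact = years_closed_form(machine_error, human_error, reduction)
        print(f"{reduction:.0%} per year: {exact:.2f} years exactly, {whole} whole years")
```

At a 25% yearly reduction the error rate goes from 8% to 6%, then 4.5%, then about 3.4%, so it drops below the 4% human figure within three years, which matches Huang's estimate. At the 20% rate he cites for past progress, it is still just above 4% after three years and needs a fourth.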
The recent advances in speech recognition are down to a relatively new machine learning technique, deep learning.
“Machine learning as a whole is important, but deep learning has been critical to these improvements,” Huang explains. Now Microsoft is releasing the Computational Network Toolkit (CNTK), which it uses to build systems like Cortana’s speech recognition, for free as open source on GitHub.
“We believe the work we’re doing internally can benefit the whole community. If you have better tools and better recipes, better dishes will be prepared. We believe the tools we’re sharing can accelerate the progress of AI.”
CNTK was previously available to academic researchers for non-commercial projects through the CodePlex site; now anyone can use it to build commercial systems. “We did it in a quiet way, to get feedback,” he says. “Now we’re trying to broaden the audience. This is one of our best kept secrets. We’re moving forward and making it more open.”
CNTK versus rivals
Like the other open source deep learning toolkits from Facebook, Google and various universities, CNTK uses GPUs for speed. Not only is it as fast as or faster than the other toolkits when you run it on one PC with one GPU, it’s nearly twice as fast when you run it on a PC with two GPUs. It’s also the only toolkit that can run on multiple machines at once, and with eight GPUs on two PCs it’s about three times as fast as the competition.
CNTK is faster than other deep learning toolkits and it scales better because you can run it distributed across multiple machines (it should run well on the new Azure GPU service that’s currently in private preview). That performance is important for dealing with the massive amounts of data you need for problems like speech recognition.
“If you want to really develop artificial intelligence, you have to process data at web scale,” he says. “Google brags that they deal with a huge amount of data in a distributed way, but what they’ve open sourced is really a small toolset.”
“Since we adopted CNTK for experimenting with Cortana’s speech recognition, the productivity for the product team has increased by almost a factor of ten. It’s given them a huge boost. Before, it took them weeks to finish one experiment. They said before they adopted it they felt like they were driving a Volkswagen, after they switched it’s like driving a Ferrari.”
Speech recognition has been in Windows since Windows 95, Huang points out. “Thanks to Bill Gates’ vision, as early as the 90s, we invested early in speech recognition. The progress year by year in driving speech recognition errors down has been foundational – if the error rate is too high [to be useful], then having vision doesn’t help!
“But 20 years ago, Microsoft introduced the first speech API in Windows 95 and 20 years after that Microsoft added a range of AI tools going beyond speech into vision and understanding in Azure ML. With CNTK, it’s the same desire to enable developers to take advantage of technology.”
But the speech recognition it was designed to speed up isn’t the only thing CNTK is good at. Microsoft has been trying it out for image recognition as well and, Huang claims, “CNTK is on a par with the best toolset out there for image processing.”
Previously, the Microsoft researchers and developers working on image recognition were using the popular Caffe tool from the University of California, Berkeley. Now they’re switching over to CNTK, and as the latest GPUs arrive its performance is just getting better.
Being good at more than one task is unusual for AI toolkits, which tend to be very specific. “Caffe is just beautiful for image processing,” says Huang, “but it’s almost impossible to adopt that for speech.” Huang is cautious about claiming that CNTK can handle all deep learning tasks – speech recognition, image recognition and natural language understanding are the three areas he’s focusing on, but he’s excited to see what people will do with it in other areas.
He concludes: “This tool is so powerful; it can absolutely deal with bigger challenges. The beauty of the tool is that when we get this into the hands of developers, something totally unexpected could happen that’s just beyond our imagination. I believe they will find very creative ways of using it.
“The Microsoft internal workloads that we’re building with CNTK are unbelievable. If you ask me what the next breakthrough will be, I’d say artificial intelligence – we’ll create truly intelligent services that will help people to do more and reach a new level we’ve never experienced in the past.”