How Amazon teaches Alexa, and what it hopes the virtual assistant will learn someday
Days after Amazon announced that it was bringing Alexa to its iOS shopping app, Amazon senior principal scientist Nikko Strom spoke at the AI NEXT tech conference in Bellevue, Wash., this weekend to share behind-the-scenes details of the company’s voice-enabled assistant and its broader artificial intelligence initiatives.
Strom, a founding member of the team that built Amazon Echo and Alexa, told the audience of AI scientists that the growing number of Alexa-based devices (not publicly disclosed by Amazon but estimated by Consumer Intelligence Research Partners at more than 8 million) has provided Amazon with a significant amount of data to use in improving and refining Alexa-powered devices.
“All of these things that make Alexa great and expanding all the time means that we get lots of data,” he said. “One of the things about this era is that people actually like using these (devices) — I’ve been in the industry for a long time and I worked on telephony systems and people didn’t really like to use them.”
Strom compared the amount of data that Amazon received from the millions of Alexa devices to what a 16-year-old might have heard during their young life. He said that in 16 years, a person might hear — and have “training data” about — as much as 14,016 hours of speech (based on the assumption that about 10 percent of what a person hears in a day is speech).
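The arithmetic behind that figure is straightforward; under the 10 percent assumption Strom cited, it works out as follows (a back-of-the-envelope check, not Amazon's methodology):

```python
# Back-of-the-envelope check of the "16 years of speech" figure,
# assuming ~10% of the audio a person hears each day is speech.
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
SPEECH_FRACTION = 0.10  # assumption cited in the talk

years = 16
speech_hours = years * DAYS_PER_YEAR * HOURS_PER_DAY * SPEECH_FRACTION
print(speech_hours)  # 14016.0
```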
Amazon uses “large-scale distributed training” to analyze the voice data it gets from users’ Alexa-enabled devices in order to improve their speed and accuracy.
“We have all this data – we have thousands of hours of stored data from our customers in Amazon S3 (Amazon Simple Storage Service) and we train these models on AWS EC2 (Amazon Web Services Elastic Compute Cloud) instances,” he said, explaining that the company has to use “distributed training” across 80 GPU (graphics processing unit) instances in order to crunch the massive amount of data it receives.
This large-scale distributed training of the voice recognition model in Alexa allows Amazon to constantly make updates to accuracy and quality.
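Amazon’s actual training stack is not public, but the core idea of data-parallel distributed training can be sketched in a few lines: each worker computes gradients on its own shard of the data, and the averaged gradients update a shared model. The toy least-squares problem below is purely illustrative:

```python
# Conceptual sketch of data-parallel distributed training: each worker
# computes a gradient on its shard of the data, and the averaged gradient
# updates the shared model. Amazon's real setup (80 GPU instances on EC2)
# is not public; this toy example only illustrates the averaging idea.

def worker_gradient(weights, shard):
    # Toy "gradient" for fitting y = w * x by least squares on one shard.
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]

def distributed_step(weights, shards, lr=0.05):
    grads = [worker_gradient(weights, s) for s in shards]  # run in parallel
    avg = [sum(g[i] for g in grads) / len(grads) for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, avg)]

shards = [[(1, 2), (2, 4)], [(3, 6), (4, 8)]]  # data generated with w = 2
w = [0.0]
for _ in range(200):
    w = distributed_step(w, shards)
print(round(w[0], 3))  # converges to 2.0
```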
Strom also took time to address concerns about what, when and how Amazon collects voice data — and stressed that the company is only interested in the voice data necessary to run its services, not in the content of anyone’s conversations.
The issue recently came to public attention in an Arkansas murder case where Bentonville police issued a warrant demanding records for an Echo device belonging to a charged murder suspect. The case has prompted debate about how First Amendment rights should be protected when speech is stored on digital devices. In reply to a general question about the Alexa technology and privacy, Strom suggested that Alexa’s handling of voice data is not always entirely understood by people who read about it in the press.
“What people don’t always get in these articles is that it (Alexa) is listening for a wake word all the time. It is only listening for the wake word,” explained Strom. “It’s only when the blue ring starts spinning — that’s when Alexa has heard the wake word and starts recording you. Only the thing you say after that wake word is ever recorded.”
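The pattern Strom describes can be sketched in a few lines of code: audio frames are inspected locally for the wake word and discarded until it is detected, and only the audio after that point is recorded. The detector below is a hypothetical stand-in; Amazon’s on-device implementation is proprietary.

```python
# Minimal sketch of the "listening only for the wake word" pattern Strom
# describes. The keyword spotter here is a hypothetical placeholder;
# Amazon's on-device detector is not public.

def detects_wake_word(frame):
    # Placeholder for a small on-device keyword spotter.
    return frame == "alexa"

def handle_audio(frames):
    recording = []
    awake = False
    for frame in frames:
        if not awake:
            # Frames are checked locally and discarded; nothing is stored.
            if detects_wake_word(frame):
                awake = True  # blue ring spins; recording starts here
        else:
            recording.append(frame)  # only post-wake-word audio is captured
    return recording

print(handle_audio(["music", "chatter", "alexa", "whats", "the", "weather"]))
```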
Alexa has branched out from Amazon’s Echo and Fire TV devices into a growing number of third-party products. (Amazon Photo)
The range of Alexa devices is also growing, with the technology now in use on smartphones, cars and refrigerators. In addition, Strom said that the number of third-party “skills” (voice-activated apps enabled for use with Alexa) is growing so fast that it’s hard to keep up. Amazon recently said Alexa surpassed 10,000 skills.
“Skills are super-exciting, but they are also a big challenge for us because there’s many of them and we don’t build them ourselves,” he said, as he frankly discussed the need to maintain strong communications with skills developers.
Finally, Strom hinted about what it might take to make Alexa a little smarter — so that it understands what someone means, not just what they say. To do that, Alexa would have to tackle emotions and intonation.
“Alexa cannot capture the emotion in your speech right now, but it can do something indirectly by capturing the meaning of what you say — which can be emotional,” he said. “It will recognize your curse words, for example. We have over 100 scientists working on Alexa on speech in general.”
Strom said the company didn’t yet have anything to announce on emotion recognition, but that it would continue to be an area of interest.
Amazon is enjoying strong sales for its Alexa-enabled devices – and analysts at RBC Capital Markets recently estimated that sales of Alexa devices could hit $5 billion by 2020 (with another $5 billion in annual revenue predicted to come from shopping done via the voice assistant). But it is not alone in the market. Microsoft’s Cortana (included with Windows 10 and available on iOS and Android devices) and Google’s Google Home and Google Assistant (which is pre-loaded on Android smartphones) remain strong competitors.