Recently there's been a lot of comparison between Apple's Siri and Google's Voice Search. Microsoft's voice breakthroughs have also captured headlines. After decades of research and false starts, competition in voice interfaces is now heating up thanks to the arrival of voice on mobile devices, and the race is on to shape the definitive voice interface for the mass market.
But if the ballads to Siri's limitations or the sites dedicated to her often-hilarious interpretations of simple instructions are any indication, we still have a long way to go until a winner is declared. To succeed, the big players will need to conquer the human challenges of voice tech and design a service that most people will happily incorporate into their daily routines.
For any advanced voice service, in addition to great voice recognition and interpretation, you need a compelling and simple interface that feels personal, context awareness that adds depth, and a very clever and fast backend that continuously learns the user's intent. No one service is the ultimate answer – yet.
If this type of voice assistant did exist, it would have the potential to make voice interaction go from niche to mainstream. Here's why.
The human component
Most people like to talk. But when faced with talking to machines, most of us get intimidated. Hand the biggest extrovert a microphone and they tend to clam up. Or just observe someone trying out a voice service for the first time. It simply doesn't feel (or look) easy or natural.
So why don't people like to talk to machines? Feedback (or the lack of it) is a big reason. When talking with another person, there are rich layers of feedback throughout the interaction – facial expressions, body language, tone of voice, and more. Constant real-time feedback is central to human communication, and both speaker and listener are active participants. With voice services, most of this feedback and interaction is stripped out.
Another reason voice services failed to catch on earlier, even though they were common in computer programs, is that there's simply less need for voice on computers than on mobile devices. When using a computer, your hands are already committed, QWERTY text input is pretty efficient, and seeing text as you type confirms it's correct. Voice input or output adds little value there. Smartphones are the turning point for voice. When you're on the move, chances are high that your hands could be doing something else useful if you could use speech to interact with your mobile to find things or get stuff done.