Vlingo: Voice Enable Any Mobile Application

People really hate cell phone keypads for data entry.

Anyone who’s called customer service knows voice-guided phone applications aren’t new, and they can be a good way to navigate menus and enter text. Applications like Spinvox, which uses speech recognition to turn voicemails into written text messages, and TellMe, which uses voice recognition to power local search, are useful and popular.

Cambridge-based Vlingo wants to make voice-enabling applications easier with its own speech-to-text API for J2ME and Brew applications (Windows and Symbian are due later this year). With the API, developers will be able to translate a user’s voice to text and use it in their application as if it had been typed directly into the program. One of their first examples was for local search and shopping. Vlingo voice-enabled a text box in the program, which you fill out by holding down the talk button and saying a phrase like “Pizza in San Francisco”. The system then fills in the form with what you said, letting you edit the text normally if it gets it wrong.

In our trials the system generally worked with my Californian accent. However, an Australian accent had very little luck, highlighting the difficulties of internationalizing speech recognition. Often speech recognition companies make their jobs easier by limiting the vocabulary or training the system on a comprehensive lexicon of words and accents. But due to the breadth of their effort, Vlingo had to take a more general approach, using machine learning through statistical analysis so the system could work in a wider array of uses. There’s a demo below.
http://services.brightcove.com/services/viewer/federated_f8/271548276

Their system starts with a basic statistical language model to make the best guess about what you said. It then improves on that guess by taking into account context, along with positive and negative feedback from each individual user. Context helps by narrowing the number of words you could have said. For instance, if the context is an address, the possible street names are limited to the ones in the city. User feedback, whether correcting the system’s output or leaving it as is, helps the system learn how you speak (e.g., correcting Austin to Boston).
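To make the idea concrete, here is a minimal sketch of how context filtering and per-user feedback could adjust a recognizer's guess. It is not Vlingo's implementation or API: the class name `ContextualRecognizer`, its methods, the candidate scores, and the fixed feedback step are all assumptions made up for illustration.

```python
from collections import defaultdict


class ContextualRecognizer:
    """Toy rescorer (hypothetical): base score + context filter + user feedback."""

    def __init__(self, contexts):
        # contexts maps a context name to the set of words allowed in it,
        # e.g. the city names valid in a "city" field.
        self.contexts = contexts
        # Per-user score adjustments learned from feedback: user -> word -> bonus.
        self.user_bias = defaultdict(lambda: defaultdict(float))

    def best_guess(self, user, context, candidates):
        """candidates maps each hypothesis to a base score from the general model."""
        allowed = self.contexts.get(context)  # None means no restriction
        scored = {}
        for word, score in candidates.items():
            if allowed is not None and word not in allowed:
                continue  # context rules this hypothesis out
            scored[word] = score + self.user_bias[user][word]
        if not scored:
            return None
        return max(scored, key=scored.get)

    def feedback(self, user, guessed, corrected=None, step=0.2):
        """Leaving the guess alone rewards it; a correction shifts weight."""
        if corrected is None or corrected == guessed:
            self.user_bias[user][guessed] += step
        else:
            self.user_bias[user][guessed] -= step
            self.user_bias[user][corrected] += step


# Assumed base scores: the general model slightly prefers "Austin".
recognizer = ContextualRecognizer({"city": {"Austin", "Boston"}})
candidates = {"Austin": 0.55, "Boston": 0.45, "lost in": 0.50}

first = recognizer.best_guess("alice", "city", candidates)
recognizer.feedback("alice", first, corrected="Boston")  # user fixes the text
second = recognizer.best_guess("alice", "city", candidates)
other_user = recognizer.best_guess("bob", "city", candidates)
```

In this sketch the "city" context drops "lost in" before scoring, and the correction only changes the result for the user who made it. A real system would presumably learn probabilities rather than add a fixed step.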

It’s a very ambitious project, but the team behind it comes with some significant experience in the speech recognition space. The two co-founders (Mike Phillips and John Nguyen) worked for SpeechWorks, which was acquired by ScanSoft, which then renamed itself Nuance. Nuance most recently paid $293 million for VoiceSignal, a company using speech recognition for mobile search in 21 languages.

Vlingo plans to monetize the service by charging developers on a per-month or per-user basis. They’re a team of 13 with $6.5 million from CRV and Sigma Ventures.