Vlingo: Voice Enable Any Mobile Application
by Nick Gonzalez on August 21, 2007

People really hate cell phone keypads for data entry.

Anyone who’s called customer service knows voice-guided phone applications aren’t new, but they’re a good way to navigate menus and enter text. Applications like Spinvox, which uses speech recognition to turn voicemails into text messages, and TellMe, which uses voice recognition to power local search, are useful and popular.

Cambridge-based Vlingo wants to make voice-enabling applications easier with its own speech-to-text API for J2ME and Brew (Windows and Symbian support is planned for later this year). With the API, developers will be able to translate a user’s voice to text and use it in their application as if it had been typed directly into the program. One of the first examples was for local search and shopping. Vlingo voice-enabled a text box in the program that you could fill out by holding down the talk button and saying a phrase, like “Pizza in San Francisco”. The system then fills in the form with what you said, letting you edit the text normally if it gets it wrong.

In our trials the system generally worked with my Californian accent. However, an Australian accent had very little luck, highlighting the difficulties of internationalizing speech recognition. Often speech recognition companies make their jobs easier by limiting the vocabulary or training the system on a comprehensive lexicon of words and accents. But due to the breadth of their effort, Vlingo had to take a more general approach, using machine learning through statistical analysis so the system could work in a wider array of uses. There’s a demo below.

Their system starts with a basic statistical language model to make the best guess about what you said. It then improves upon that by taking into account context, and positive and negative user feedback down to the individual. Context helps the system by narrowing the number of possible words you said. For instance, if the context is an address, the number of possible street names is limited to the ones in the city. User feedback, whether correcting the system’s output or leaving it as is, helps the system learn how you speak (e.g., correcting Austin to Boston).
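To make that idea concrete, here is a minimal sketch of how context and per-user feedback could rescore a recognizer's guesses. It is purely illustrative and is not Vlingo's actual method or API: the `ToyRescorer` class, its method names, the 0.5 weighting step, and the example probabilities are all assumptions made up for this example.

```python
from collections import defaultdict


class ToyRescorer:
    """Illustrative only: rescoring guesses with context and user feedback."""

    def __init__(self, context_vocab):
        # Hypothetical mapping: form field name -> words plausible in that field.
        self.context_vocab = context_vocab
        # Per-user net feedback: user -> word -> (times kept) - (times corrected).
        self.feedback = defaultdict(lambda: defaultdict(int))

    def record_feedback(self, user, recognized, final):
        """Learn from what the user left alone or changed."""
        if recognized != final:
            self.feedback[user][recognized] -= 1  # the guess was corrected away
        self.feedback[user][final] += 1           # the text the user ended with

    def best_guess(self, user, field, hypotheses):
        """hypotheses: word -> base probability from the language model."""
        allowed = self.context_vocab.get(field)   # None means no restriction
        scored = {}
        for word, base_prob in hypotheses.items():
            if allowed is not None and word not in allowed:
                continue                          # context rules this word out
            # Assumed weighting: each net vote shifts the score by 50%,
            # with a floor so a word is never ruled out entirely.
            weight = max(0.1, 1.0 + 0.5 * self.feedback[user][word])
            scored[word] = base_prob * weight
        return max(scored, key=scored.get) if scored else None


rescorer = ToyRescorer({"city": {"Austin", "Boston"}})
hypotheses = {"Austin": 0.5, "Austen": 0.45, "Boston": 0.4}

first = rescorer.best_guess("alice", "city", hypotheses)
rescorer.record_feedback("alice", "Austin", "Boston")  # alice fixes the guess
second = rescorer.best_guess("alice", "city", hypotheses)
```

In this toy run the "city" context drops "Austen" before scoring, and a single correction from one user is enough to flip that user's next result from "Austin" to "Boston" while other users are unaffected. A production system would presumably adapt the underlying statistical models rather than keep simple per-word counts.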

It’s a very ambitious project, but the team behind it comes with some significant experience in the speech recognition space. The two co-founders (Mike Phillips and John Nguyen) worked for SpeechWorks, which was acquired by ScanSoft, which then renamed itself Nuance. Nuance most recently paid $293 million for VoiceSignal, a company using speech recognition for mobile search in 21 languages.

Vlingo plans on monetizing the service by charging developers on a cost per month or per user basis. They’re a team of 13 with $6.5 million from CRV and Sigma Ventures.


Comments

  • Would this work on the iPhone?

  • No, since it’s only J2ME and Brew for now. I don’t think you can access the iPhone’s voice capabilities through Safari on the iPhone, so I think a voice-enabled website is also out of the question.

  • hey this is something really good… but accuracy is going to be a big issue…. with different spoken languages…

  • Until today, this kind of voice recognition has left something to be desired…
    If this project really works, it will be a candidate to make lots of money.

  • Looks like a good technology, though as Anujk said, accuracy is a big issue. From the video, it looked good, however.

  • It is interesting to test this technology.

  • This is a very complex technology. Didn’t they take something from Nuance when leaving it? Maybe Nuance will want to sue them sometime.

  • I disagree that accuracy is going to be a big problem.

    Voice-to-Text has been around for a long time. Even on multi-gigahertz / uber RAM computers, the technology has accuracy challenges. Applying V2T to something with as few resources (memory, CPU, etc.) as a mobile device very likely exacerbates the problem.

    Still, if I can dictate one word at a time then the words that are correct are words I don’t have to type. The ones that are not correct will do two things:

    1. Allow me to edit them
    2. Train *ME* to pronounce certain words more articulately–or at least
    in a way that results in the desired behavior.

    My opinion: Cool.

    –Ray

  • Early on, Apple realized that appealing to consumer senses will yield, and has yielded, tremendous profits. Voice-embedded applications will make our lives easier and more fun. If it makes our lives easier and more fun, it will be a success.

    “Beam me up, Scotty!”

  • Can’t you just call the person you’re IMing? I mean, you’re carrying a phone. If cell phones had quicker voicemail access and better user interface (something like the iPhone comes to mind, but there could be improvements there too), this wouldn’t be necessary.

  • I not only agree with Andy, but I’d add that if you’re already accustomed to texting, then you’ll likely continue to text. Vlingo will have a difficult time getting mobile consumers to CHANGE their texting behavior, let alone winning mindshare for voice-based mobile search, which already has lots of competition.

    Based on the video, the Vlingo service works at the prototype level, but it seems to me to be nice-to-have functionality, not a problem-solving application.

  • I’ve been using the Tellme app (FYI – it’s called a multi-modal app – i.e., Voice/Text-in – Text-out) and I simply love it. Now whenever I need to find a local business/restaurant, I just launch the app, speak the name of my city and business, and voilà, I get the phone number AND map.

    Tellme probably works better because it is limited to the current vocab (limited grammar) (city/state and business) but I think it still uses some guesswork when it comes to business names so in that context it might be similar to Vlingo.

    A few startups were working on multi-modal apps during the dot-com days but mobile phones (J2ME) did not support it at that time and unfortunately all of them went under… so I’m glad to see this new crop of startups gaining real traction!

    Oh yeah, ScanSoft did not really rename itself Nuance, though. ScanSoft and Nuance merged and they decided to name the combined entity Nuance… which is not surprising since Nuance is the much better-known brand.

  • Typical IVR system… Many mobile companies have been using this type of technology.

    Gearworks, a B2B LBS provider, has been using voice driven J2ME/Brew applications (complete with API) to capture work orders, time sheet info, and job status from mobile workers for the past 3 years.

    Customers include Pepsi and Roto Rooter.

  • Hello, my LG phone doesn’t have the technology down yet as I finally got frustrated and no longer use the voice command on it. Hopefully this is an improvement and it is as good as the video shows. :)

    Regards, Jared Blake
    JDHL Technologies – SEO Experts
    SEO Blog – Tips and Techniques

  • Looks like the recognition may be done on the network side. The app is only responsible for providing the context and audio to the back-end server. For general IM’ing I suspect all the processing can be done on a cellphone running at 400MHz max.
