People really hate cell phone keypads for data entry.
Anyone who’s called customer service knows voice-guided phone applications aren’t new, but they’re a good way to navigate menus and enter text. Applications like Spinvox, which uses speech recognition to turn voicemails into written text messages, and TellMe, which uses voice recognition to power local search, are useful and popular.
Cambridge-based Vlingo wants to make voice-enabling applications easier with its own speech-to-text API for J2ME and Brew (Windows/Symbian to follow later this year). With the API, developers will be able to translate a user’s voice to text and use it in their application as if it had been typed directly into the program. One of the company’s first examples was for local search and shopping. Vlingo voice-enabled a text box in the program, which you fill out by holding down the talk button and saying a phrase like “Pizza in San Francisco”. The system then fills in the form with what you said, letting you edit the text normally if it gets something wrong.
In our trials the system generally worked with my Californian accent. However, a speaker with an Australian accent had very little luck, highlighting the difficulties of internationalizing speech recognition. Speech recognition companies often make their jobs easier by limiting the vocabulary or training the system on a comprehensive lexicon of words and accents. But due to the breadth of its effort, Vlingo had to take a more general approach, using machine learning through statistical analysis so the system could work in a wider array of uses. There’s a demo below.
It’s a very ambitious project, but the team behind it comes with some significant experience in the speech recognition space. The two co-founders (Mike Phillips and John Nguyen) worked for SpeechWorks, which was acquired by ScanSoft, which then renamed itself Nuance. Nuance most recently paid $293 million for VoiceSignal, a company using speech recognition for mobile search in 21 languages.
Vlingo plans to monetize the service by charging developers on a per-month or per-user basis. They’re a team of 13 with $6.5 million from CRV and Sigma Ventures.

Would this work on the iPhone?
No, since it’s only J2ME and Brew for now. I don’t think you can access the iPhone’s voice capabilities through Safari on the iPhone, so I think a voice-enabled website is also out of the question.
hey this is something really good… but accuracy is going to be a big issue…. with different spoken languages…
Until today, this kind of voice recognition left something to be desired…
If this project really works, it will be a candidate to make lots of money.
Looks like a good technology, though as Anujk said, accuracy is a big issue. From the video, it looked good, however.
It is interesting to test this technology.
This is a very complex technology. Didn’t they take something from Nuance when they left? Maybe Nuance will want to sue them sometime.
I disagree that accuracy is going to be a big problem.
Voice-to-Text has been around for a long time. Even on multi-gigahertz / uber RAM computers, the technology has accuracy challenges. Applying V2T to something with as few resources (memory, CPU, etc.) as a mobile device very likely exacerbates the problem.
Still, if I can dictate one word at a time then the words that are correct are words I don’t have to type. The ones that are not correct will do two things:
1. Allow me to edit them
2. Train *ME* to pronounce certain words more articulately, or at least in a way that results in the desired behavior.
My opinion: Cool.
–Ray
Early on, Apple realized that appealing to consumer senses will yield, and has yielded, tremendous profits. Voice-embedded applications will make our lives easier and more fun. If it makes our lives easier and more fun, it will be a success.
“Beam me up, Scotty!”
Can’t you just call the person you’re IMing? I mean, you’re carrying a phone. If cell phones had quicker voicemail access and better user interface (something like the iPhone comes to mind, but there could be improvements there too), this wouldn’t be necessary.
I not only agree with Andy, but I’d add that if you’re already accustomed to texting, then you’ll likely continue to text. Vlingo will have a difficult time getting mobile consumers to CHANGE their texting behavior, let alone winning mindshare for voice-based mobile search, which already has lots of competition.
Based on the video, the Vlingo service works at the prototype level, but it seems to me to be nice-to-have functionality, not a problem-solving application.
I’ve been using the Tellme app (FYI – it’s called a multi-modal app, i.e. Voice/Text-in – Text-out) and I simply love it. Now whenever I need to find a local business/restaurant, I just launch the app, speak the name of my city and business and voilà, I get the phone number AND map.
Tellme probably works better because it is limited to the current vocab (limited grammar) (city/state and business) but I think it still uses some guesswork when it comes to business names so in that context it might be similar to Vlingo.
A few startups were working on multi-modal apps during the dot-com days but mobile phones (J2ME) did not support it at that time and unfortunately all of them went under… so I’m glad to see this new crop of startups gaining real traction!
Oh yeah, ScanSoft did not really rename itself Nuance, though. ScanSoft and Nuance merged and they decided to name the combined entity Nuance… which is not surprising since Nuance is the much better known brand.
Typical IVR system… Many mobile companies have been using this type of technology.
Gearworks, a B2B LBS provider, has been using voice driven J2ME/Brew applications (complete with API) to capture work orders, time sheet info, and job status from mobile workers for the past 3 years.
Customers include Pepsi and Roto Rooter.
Hello, my LG phone doesn’t have the technology down yet as I finally got frustrated and no longer use the voice command on it. Hopefully this is an improvement and it is as good as the video shows.
Regards, Jared Blake
Looks like the recognition may be done on the network side. The app is only responsible for providing the context and audio to the back-end server. For general IM’ing I suspect all the processing can be done on a cellphone running at 400MHz max.