Captchas are well known for keeping automated spammers out and letting humans in. However, ReCaptcha is a rather clever service using them to help digitize books scanned into the Internet Archive as well. It’s a project from the School of Computer Science at Carnegie Mellon.
The Internet Archive is home to over 200,000 scanned copies of classic books. Some of them are gorgeously crafted, like this children’s book, but fancy styling can make it difficult for computers to translate the books into an indexable digital text. Much like a Mechanical Turk application, ReCaptcha uses humans to translate images of scanned words that a computer couldn’t understand. Notably, Mechanical Turk has been used in the searches for Jim Gray and Steve Fossett.
The scanned words are placed alongside a normal captcha widget so users decode both words at the same time. Each word can be shown to multiple people to cut down on errors. Captchas also offer the opportunity to convert a lot of words. ReCaptcha’s founders, Luis von Ahn and Ben Maurer, estimate that about 60 million CAPTCHAs are solved every day. Assuming that each CAPTCHA takes 10 seconds to solve, this is over 160,000 human hours per day (that’s about 19 years).
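The estimate is easy to check. Here is a quick sketch of the arithmetic, using only the two figures quoted above (60 million CAPTCHAs a day, 10 seconds each); the function and constant names are made up for illustration.

```python
# Figures quoted in the article; everything else here is illustrative.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10


def human_time_per_day(captchas=CAPTCHAS_PER_DAY, seconds_each=SECONDS_PER_CAPTCHA):
    """Return (hours, years) of human effort spent on CAPTCHAs in one day."""
    total_seconds = captchas * seconds_each
    hours = total_seconds / 3600
    years = hours / (24 * 365)  # ignoring leap years
    return hours, years


hours, years = human_time_per_day()
print(f"{hours:,.0f} hours per day, about {years:.1f} years")
```

That comes out to roughly 166,667 hours, or about 19 years of continuous effort, spent every single day.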

To harness all this time and effort, ReCaptcha is opening their service through captcha widgets and an API. They also have a service for protecting email addresses posted online. You can protect your address by going here and entering it. ReCaptcha then gives you some code to paste your protected address to the web like this, n…@techcrunch.com. To get the address, click the three dots and answer the Captcha.
It’s great to see projects like this harnessing just a bit of our time to solve some important and complex problems.

I recently tried to use this service for a site that was targeting an audience in 3 countries (US, Canada, and India). The performance (when using the solution hosted on their servers) was poor and forced us to host our own version of captcha. It’s easy to use, but I wish they also provided a readily accessible hosted version.
Welcome to four months ago!
What a brilliant idea! I wonder what other human-required problems could be CAPTCHA’ed?
Old news… Get with it.
It took you long enough. But, better late than never.
reCaptcha is an awesome idea and a system. I showed it to some of the web developers and decision makers at the company for which I work, and now it is being used pretty much anywhere there is a need to prevent automation and bots.
that’s awesome. why doesn’t techcrunch use this?
We use this service, and have found it amazingly good!
So now I have to type two words? How does this make my life easier?
They may as well send me the damn book for me to type out.
They’ve actually shown that it’s faster to do two English words than 6-8 random characters (as most CAPTCHAs ask you to do).
Didn’t anyone notice the obvious flaw? IT LETS MINOR MISSPELLINGS THROUGH. Replace lowercase i with l, for example, and see what happens.
Since this is a system to educate machines it is completely unsuited to verifying humanity. Any amateur CAPTCHA reading bot will get past this. Even when it’s wrong.
We use this service too. The idea is great and the api is easy to implement.
Hi Floogy,
I’m an engineer on reCAPTCHA.
We let minor misspellings through on purpose. Our testing has found that our CAPTCHAs are strong enough against OCR attacks that we can give users some slack (to account for typos).
We monitor our system for signs of abuse. This is much more secure than most CAPTCHAs because we have a global overview of any attack that might take place. We can take action based on such information.
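For readers wondering what "some slack" could look like in practice, here is a minimal sketch. It assumes a tolerance of one edit (Levenshtein distance of 1); that threshold and the function names are assumptions for illustration, not reCAPTCHA's actual rule, which isn't stated here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def accepts(answer: str, expected: str, max_typos: int = 1) -> bool:
    """Hypothetical lenient check: pass if the answer is within max_typos edits."""
    normalized = answer.strip().lower()
    return edit_distance(normalized, expected.strip().lower()) <= max_typos


print(accepts("dlgitize", "digitize"))  # an 'i' typed as 'l' is one substitution
print(accepts("banana", "digitize"))
```

Under this assumed rule the i-for-l swap passes while an unrelated word does not, which matches the behavior described in the comments above.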
Hi Karthik,
I’m an engineer on reCAPTCHA.
We have servers on the two coasts of the US, giving us excellent response time within the US and very good response times in other key locations. We’re investigating opening up data centers in other countries to reduce the latency of reCAPTCHA requests.
We’d always love to hear about performance issues. Email us at support@recaptcha.net and we’d be happy to take a look.
Look at the cost of spam. Captchas destroy the life’s work (160K hours, $2M lost productivity) of a person every day. If we didn’t have spammers we wouldn’t need them. Let’s build more jail cells for spammers.
We switched to reCaptcha because it has an audio version, and has worked very well for us.
So that’s why Facebook’s CAPTCHAs are so weird!
I guess it won’t take too long until somebody comes up with a Pwntcha-style hack to speed up that book digitalization process…
I tried reCaptcha for a while on my blog and was very happy with it. I’m not using it anymore due to the complaints of a regular reader who is dyslexic and was having trouble passing the captcha bar. If it hadn’t been for that, I’d still be using it.
This is a great service but I worry about the uptime and availability of the service.
Hi Ian,
I’m an engineer on reCAPTCHA.
We run our service out of two datacenters (and are looking into getting more) for uptime. Should there be unexpected downtime of a datacenter, our traffic automatically switches over. We’ve done drills of failover to make sure our techniques work.
Hi Ben,
Thank you for the info Ben. It’s great that you guys are taking availability seriously!
You should put up the information you posted here on recaptcha.net. I think it will help you land larger sites. (It may already be there but I did not see it).
Again reCAPTCHA is a great service!
Great to see this finally get the publicity it deserves.
This has been around for a while now! I love the coverage of it, and I honestly think it is a very noble cause, but it’s a bit late to be covering it I think. We’ve been using it for quite a little while now on 20DC.com. But keep with it. I am very happy to see anything local to Pittsburgh and CMU!
Maybe Ben can answer this. I never understood how this system would know the correct answer for the first person to receive a new unknown word. It can’t. So you always have to ask twice — give one word the system doesn’t yet know and one it does. And because of possible entry errors, the unknown word would have to be redundantly given to some number of people.
If that’s true, then why not just ask one CAPTCHA and ask people to volunteer 10 seconds of their time digitizing a new word? Why pretend the CAPTCHA is two words?
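The two-word scheme described in this comment can be sketched roughly as follows: one control word gates the submission, and answers for the unknown word are tallied until enough people agree. The class name, the `min_votes` threshold, and the exact-match check on the control word are all assumptions for illustration, not reCAPTCHA's real implementation.

```python
from collections import Counter, defaultdict


class WordPairChallenge:
    """Hypothetical sketch: a known control word plus an unknown scanned word."""

    def __init__(self, min_votes: int = 3):
        self.min_votes = min_votes          # assumed agreement threshold
        self.votes = defaultdict(Counter)   # unknown word id -> transcription counts

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Return True if the user passes; record their guess only if they do."""
        if control_answer.strip().lower() != control_truth.strip().lower():
            return False  # failed the known word, so the other guess is ignored
        self.votes[unknown_id][unknown_answer.strip().lower()] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed transcription, or None if there is no consensus yet."""
        counts = self.votes.get(unknown_id)
        if not counts:
            return None
        word, n = counts.most_common(1)[0]
        return word if n >= self.min_votes else None


challenge = WordPairChallenge(min_votes=2)
challenge.submit("morning", "morning", "scan-1", "fulfils")
challenge.submit("morning", "morning", "scan-1", "fulfils")
print(challenge.resolved("scan-1"))
```

In this sketch, only users who get the known word right contribute a vote, so a guess on the unknown word carries weight only when it comes from someone who already passed the check.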
A little offtopic –
Just a few days back I too had covered reCaptcha on our blog. I still have to test it on my wordpress blog though. I find akismet good enough to find spam, hence I don’t feel like adding captcha functionality.
Avi : Yes, that’s how it works according to their website.
I was wondering how it could read out a new word?
Ok, I should have tried before asking, it reads out something entirely different and barely understandable.
I think it is a little unfair that Techcrunch does not hat tip people that guide them to the stories behind their articles. About 15 hours ago I thought they did, when I sent them the link to reCAPTCHA and I had already written about it on my blog…
This is a great idea! Just wonder how they manage to catalogue all the data that has been entered. Must be a really huge database somewhere.
Great concept. I tried using it on my site, but it was very slow and non-responsive. Maybe I’ll go back if it improves.
BTW, if CAPTCHA is machine scanned text, how can you provide audio for the CAPTCHA words? Does a human read and digitize those? Then it defeats the very purpose of humans helping digitize books, right?
It takes about 3 clicks (click into their website and try a captcha) to figure out that they do a different thing for audio captchas…
Pretty interesting… but the captchas it produces are tough!
I do not use CAPTCHA. I have a bullet proof alternative to CAPTCHA and it is less annoying and has no accessibility issues.
http://w3net.eu/?p=32
Attila, your solution only works because nobody else uses it and because spammers have not targeted your site. If somebody had an actual incentive, it is trivial to write a program that can answer the questions you propose. CAPTCHAs are built to protect sites even if dedicated spammers target them.
It’s ‘centuries’ and ‘fulfils’.
Do I win?
So you finally got to reading back issues of Wired magazine? I thought TechCrunch was about breaking stories.
Fred, you are wrong. My contact form is submitted at least 3000 times a month and I’ve not received any SPAM yet. I store form fields of the submitted form into a database and log the IP address of the spammer. The page is spammed by at least 20 different spammers. One of the spammers submits the form 400 times each month. I’ve banned his IP address recently to save traffic. If you need details I can send you the database. None of the spammers could figure out the correct answer yet.
Attila,
What Fred is saying is that the spammers attacking your site aren’t doing anything more than their standard automated attack. Many webmasters find that there are spammers (and other abusive persons) who are making an effort to attack their site specifically. In this case, a secure technique is needed.
The idea behind reCAPTCHA is that we choose a challenge which we know current technology can’t beat. A site using reCAPTCHA is secure against an attacker who actively targets that site.
Ben,
CAPTCHA could be an appropriate solution to protect sign up forms of popular web sites, I agree. But the bottom line is that CAPTCHA is not necessary for contact forms of less popular web sites. No spammer would invest time into attacking a single contact form; it makes no sense.
I can assure you small sites (traffic wise) are spammed every single day by spammers. An example is our contact form.
I am trying to integrate recaptcha on the form to stop this but am having some difficulty. I will persevere, though, as the form is spammed many times over every single day.
Personal and business computers don’t need to have similar characteristics.