
Google has announced that it will now begin including scanned documents in its search results - a feat that requires an immense amount of processing power and advanced image recognition technology. Unlike standard text documents, scanned files don’t contain any text data that Google’s spiders can index. Instead, Google has employed Optical Character Recognition (OCR) technology, converting photos of words into digital text files.
In the past, Google would attempt to index these image files as well as possible, but could typically search only file titles and nearby metadata - not the contents of the documents. From now on, Google searches will include the text within these scanned images in normal search results. When you encounter a scanned document you’ll be able to view it in its original form as a PDF, or as a converted text file (click “View As HTML”).
Such technology has existed for quite a while, but accuracy has always been an issue - and the fact that Google is doing it on such a massive scale makes it a very impressive accomplishment. It also opens the door to much more thorough searching, especially for content that is often found in printed documents (like academic papers).
Here’s an example (the first result is a scanned document): Repairing Aluminum Wiring
For more, check out the announcement here.










I may not be smart enough to fully understand the impact of this but it sure seems like it is off the charts huge… especially for SEO purposes.
http://www.ngagelive.com
Yeah, you’re just smart enough to know you’re an link spam idiot.
i need to find a way to stop google from searching through my scanned text docs
http://12tb.com
And your genius allows you to know when to use “an” in a sentence?
good now i can find scanned ebooks better.
Well, I think this is not particularly useful for SEO purposes, as you can do the same by using OCR software yourself. This is more of a usability advantage for search engine users, who will now be able to find documents whose owners did not consider providing content in a search-engine-friendly manner.
Hats off to more googomopoly business. But to say “a feat that requires an immense amount of processing power” is complete BS! Adobe pdf has been doing this efficiently for a while, and many others. Google still at 20x earnings and dropping is a sign the company is going down. I also recently received this email from them:
“Dear Publisher,
We understand that the recent economic turmoil has created a lot of uncertainty in the lives of AdSense publishers. During these difficult times, we’re continuing to invest in innovations that improve publisher monetization and advertiser value in the content network.”
blah blah… I hope Google goes down!
Google Text (video comment: http://www.seesmic.com/video/KLgBIk1hVi)
if they could put this type of technology in their appliance they would probably be able to sell it to the enterprise. heck I can think of a number of uses if you could push an image to an api.
I would think that most modern documents are represented in a search-friendly form, particularly if the authors want those documents to be found. I see this more as a way of addressing legacy documents. I’m sure this plays well with their massive book-scanning project.
I do wonder what the implications are for CAPTCHA scanning tasks.
So, if you host your own website, you can then use the query “site:yourdomain.com” and search image PDFs as OCR’ed documents? That is HUGE. Now, they need to implement this into Google Desktop. Adobe Acrobat, RIP…
Clever. Though there are free online OCR services already. Though I don’t know of any that proactively crawl your site.
Would you point me to a free online OCR service?
http://www.free-ocr.com for example.
sounds like they can make do with something sophisticated like EndNote which is pretty awesome at this kind of thing..
I meant Evernote ..my bad!
I wonder if people will try to do SEO with scanned documents now, by putting links in scanned documents?
that’d be so cool.
that is good news, I think it is difficult
PDF’s “View as HTML” has been on Google for months.
As for the scanned images thing, let’s see.
Google are a schizophrenic company. Half the time they go to immense lengths to find new indexable content, such as scanning copyrighted books or - now - searching through image scans of text (how many are there of those)? Meanwhile, half the time they do their utmost to bury masses of content with sandboxes and penalties and their increasingly useless ranking algorithm.
Should be really useful in the long run… I wish Google could come to some agreement with academic databases, however. I use Google Scholar to locate articles because it’s the easiest and most accurate searching method, then I have to go to my University student login, head through to the databases and find that article within their…
Perhaps the best solution would be if those academic databases could use a custom Google search engine, but till then it’s a major pain.
Can someone explain to me why it is “a feat that requires an immense amount of processing power and advanced image recognition technology”? Adobe Acrobat (among other programs) has had the ability to automatically OCR scanned images for years, and these programs are lightning fast these days on even a low-spec laptop. I know, this is what I do with all important paper docs that cross my desk these days. My desktop search tools index the OCR’d text in these docs really easily as well. These are all fairly lightweight, background processes these days.
Why is this so impressive?
Buying good OCR software is expensive.
Google has just saved us all a big expense… thanks Google!
Also making e-books from out of print material should be easier now.
Have fun everyone,
Michael
I hope they find a way for people who publish their email addresses as pictures to avoid getting these passed on to spammers. This may require a rule on the lines of:
1) Only images/scans containing substantial amount of text will be converted (>10 words), OR
2) Images with text formatted as email address won’t be converted.
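The two rules proposed above could be sketched roughly as follows. This is only an illustration in Python, not anything Google has described: the names `MIN_WORDS`, `EMAIL_RE`, `passes_rule_1` and `passes_rule_2` are hypothetical, the input is assumed to be text that an OCR step has already extracted, and the email pattern is deliberately loose.

```python
import re

# Hypothetical threshold, taken from the "(>10 words)" suggestion above.
MIN_WORDS = 10

# Loose illustrative pattern for things that look like email addresses.
# It is not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def passes_rule_1(ocr_text: str) -> bool:
    """Rule 1: convert only if the image holds more than MIN_WORDS words."""
    return len(ocr_text.split()) > MIN_WORDS


def passes_rule_2(ocr_text: str) -> bool:
    """Rule 2: convert only if no email-like string appears in the text."""
    return EMAIL_RE.search(ocr_text) is None


if __name__ == "__main__":
    samples = [
        "bob@example.com",
        "Repairing aluminum wiring requires special connectors "
        "rated for use with both copper and aluminum conductors",
    ]
    for text in samples:
        print(passes_rule_1(text), passes_rule_2(text), "-", text[:30])
```

Either rule alone would keep an image that contains nothing but an email address out of the converted text. Rule 1 would also skip short captions and logos, while rule 2 would skip a full scanned page that happens to include a contact address.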
Are they really doing OCR? The way I read this is that they are just exporting the OCR information that is stored in a PDF file. So if you post a PDF file without the OCR embedded in the file, this new Google feature will not really do anything for you. Now if they start performing OCR on normal images like PNG or JPEG I will be impressed.
If Google were to include OCR indexing with their Search Appliance, my company would have no trouble cutting a purchase order for one.
How long before someone figures out how to “optimize” scanned documents and pictures for SEO purposes? How long before Google has to adjust their algorithms to fend off said “scanner spammers”? How many times will Google have to adjust the algorithms once the SEOs figure out how to circumvent the changes? How long before every bit of Google’s index looks like an overgrown spam farm, aka Google Blog Search?
that’s good news!
Google has been experimenting with this feature via Google Books and Google Catalogs. It’s pretty amazing what their OCR can do. It’s beyond any desktop OCR application. Try a search yourself. I searched for Calvin Klein underwear and Google actually highlighted “calvin klein” on the model’s underwear!