It Turns Out That Google Even Has A Competitive Advantage In Scanning Books
by Erick Schonfeld on May 2, 2009

Google is serious about scanning books. Despite the objections raised over the years by authors and publishers, and the more recent delays in its settlement with the Authors Guild, Google has kept scanning millions of books, trying to digitize as many as it possibly can. It is so serious about capturing and indexing the knowledge stored in books that it has a patent, issued on March 24, 2009, on how to scan books faster than was previously possible.

The basic technique uses two infrared cameras to determine how flat or curved each page is, and then adjusts the optical character recognition software that reads the text accordingly. In other words, the infrared cameras help figure out a book’s three-dimensional shape and then back out any resulting distortions. This results in much faster book scanning since each page doesn’t need to be flattened by glass plates and spines don’t need to be broken.
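To make the "back out any resulting distortions" step concrete, here is a minimal one-dimensional sketch in Python. It is not the method in Google's patent, whose details the article does not give. It assumes the page's height profile along one scanline has already been recovered (for example, from the two cameras), and it simply resamples that scanline so that output pixels are evenly spaced along the curved paper surface rather than along the camera's flat view. The function names `arc_lengths` and `flatten_row` and the parameter `dx` are hypothetical.

```python
import math


def arc_lengths(heights, dx=1.0):
    """Cumulative distance along the page surface at each sample column.

    heights: page height at each column (assumed already measured).
    dx: horizontal spacing between columns, in the same units as heights.
    """
    s = [0.0]
    for i in range(1, len(heights)):
        dz = heights[i] - heights[i - 1]
        s.append(s[-1] + math.hypot(dx, dz))
    return s


def flatten_row(row, heights, dx=1.0):
    """Resample one image row so pixels are evenly spaced along the paper.

    row: pixel intensities as seen from above (same length as heights).
    Returns a new row, usually longer, as if the page had been laid flat.
    """
    s = arc_lengths(heights, dx)
    total = s[-1]
    n_out = int(round(total / dx)) + 1
    out = []
    j = 0
    for k in range(n_out):
        target = min(k * dx, total)
        # Advance to the segment [s[j], s[j + 1]] that contains the target.
        while j < len(s) - 2 and s[j + 1] < target:
            j += 1
        t = (target - s[j]) / (s[j + 1] - s[j])
        # Linear interpolation between the two neighbouring pixels.
        out.append(row[j] * (1 - t) + row[j + 1] * t)
    return out


if __name__ == "__main__":
    # A page that rises steeply toward the spine looks compressed from above.
    print(flatten_row([0, 50, 100], heights=[0, 4, 8], dx=3))
```

On a flat page the row comes back unchanged; on a sloped or curved page the compressed region is stretched back out. A real system would work in two dimensions and account for camera perspective, which this sketch ignores.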

There are other book scanning projects besides the Google Book Project. The Internet Archive, for instance, runs 18 scanning centers around the world, which together digitize only 1,000 books a day. I am not sure what kind of technology the Internet Archive uses, but I wouldn’t be surprised if Google’s scanning operation is much faster. Books hold billions of pages of high-quality information just waiting to be indexed and searched. For Google, the faster it can get those books scanned, the faster it can start to serve ads against those searches. Now, I wonder how it flips the pages.

(Hat tip to Buzznewsroom).

Responses

  • Now that Google has the patent, I’d be surprised if they don’t license it to the Internet Archive.

    • It’s serious time for publishers.

      • It’d be nice if they did. What would be better is if they made it possible for public users to use such technology (perhaps with their own scanners) and the software Google provides to add more of our own works to some sort of public library.

    • RIT has something similar, I interviewed a guy about it last year at the 1st annual Imagine RIT festival (2nd one was just yesterday) http://rit.edu/imagine/

      Here’s a picture of it, I’ll try and find the video if I can dig it up somewhere.

      http://www.flic...ube/2588101330/

    • I am not an expert in the “traditional” publishing industry.

      But wouldn’t it be easier, and yield higher-quality books, if Google got the original digital files (at least for recent books) rather than trying to scan the paper versions of them?

      • Like that’s likely to ever happen. Most of the books are not produced in a convenient electronic format (even Microsoft Word is NOT an open format.)

        The book publishers are NOT your friends.

        They would rather you, the person who has to shell out for it, went without after a book goes out of print than to let Google have at it.

        They would rather you, the person who has to shell out for it, went without rather than being able to find out about it on the internet.

        They have never cared about you.

        They will never care about you.

        They don’t even care about the books themselves since they don’t care to print them on acid free paper.

        I have NO IDEA what they might care about, but you’re not it.

        • Well, as the husband of an author who has published 57 books, I can tell you that there are some very serious copyright issues involved. Google went ahead and scanned an ENTIRE book that my wife had published (and that is still in circulation, not out of print), without making any effort to inform her or the publisher. It took a threat of legal action to get them to back off. Now if Google would compensate the author for their work, I would have no objection. I don’t even have any objection to them scanning and showing some sample pages, but the entire book was clearly out of line. They did remove the extra pages and now there are only sample pages, which I feel is fair.

        • This does sound like digital books might well go the way of music and films – the fight between the digital copy providers (whether it’s iTunes store or BitTorrent) and the physical copy publishers. I’m not saying Google is right to infringe the copyright of authors and publishers, but at the same time, consumers will find some way to get their hands on free stuff that is easily available on the Internet rather than having to go out to the bookstore to buy the book. Google is probably trying to bridge that for the consumers, perhaps with lesser regard for the copyright holders. Perhaps a solution should be found for this, before even books go the way of music and movies, where you get publishers hauling digital content providers to court.

      • @Enzo Silva – Prior to the 70s, most books were not produced digitally, so there are no PDF files of books written before that! What Google is doing is going to the universities and scanning the books stored in the stacks. I work for a software developer that makes solutions for people wanting to convert these scans into printable form (deskew, clean-up) – http://docs.goo...29gf2×23c9

    • Erik said…
      In other words, the infrared cameras help figure out a book’s three-dimensional shape and then back out any resulting distortions.

      That looks similar to the algorithm described in the following paper (accepted for publication in 2003):

      Correcting common distortions in camera-imaged library materials

      If any developer is working on, or planning, a project that needs this kind of image distortion correction when scanning books for digital archiving, post back here and I can give you some other titles (i.e., published research papers) on how to do it. Most of those papers are in various computer vision journals, which take time to search through.

      I have developed image distortion correction for scanning materials into digital archives, so I can point any developer to where to look for interesting algorithms to implement.

  • That’s pretty damn interesting.

  • I always wonder: why couldn’t Google reach a deal with publishers so it gets the electronic version, saving

    • Because not every book in the world has a soft copy that Google could negotiate with publishers for.

    • Because we’re about to be overrun by thieving criminal book pirates who want authors to starve in the street like dogs! We must form many organizations immediately with the express purpose to sue individuals who trade books online and to take down thegooglebay! I mean google…

  • This week on Democracy Now! there was an interesting interview with Brewster Kahle (from archive.org) about antitrust allegations stemming from Google’s book projects:

    http://www.demo...n_for_agreement

  • It’s so creative! Maybe this will change the xerox logical.
    With Google is always like this. Everything they’re doing they’ll improve.

  • Use a digital camera and run OCR. The trick is how fast and how correctly you can flip pages.

  • I bet those little things (305 & 310 in diagram) are like air jets or something that flips the pages of the books.

    Then again, the pages could be flipped by Google wizards…

  • Why did they get a patent on this? I can think of multiple technologies that have used dual cameras to map topography. Facial recognition and Lasik come to mind.

    The setup with stereo cameras is nothing special. The patent should be for the software that adjusts the OCR algorithm and nothing more. You also can’t patent algorithms, so the book scanning should only be a trade secret.

    I am not a lawyer.

  • The music alone makes this book scanning (1500 pages per hour) ScanRobot video worth it…

    http://www.yout...h?v=hlOQuuLYavY

    Awww yeah… hard core scanning action. Check out the hot actuator shots.

  • Why not set it free? YouTube wants your video content for free, but Google Books keeps this only for itself.

  • Google is getting bigger and bigger. Its search engine monopoly will be difficult to break in the coming years. With Google scanning so much every day, I am sure the gap will keep getting wider.

  • The concept of Google Books has always excited me. I have done a lot of reading on its background. The scanning process (use of cameras, infrared and whatnot) might not be the most exciting part. A lot of Chinese companies do that, and it can be outsourced.
    Google has its own unique process for converting the scanned images into a lightweight format (which is not PDF). I think the patent has been filed for that process.
    Does anyone have more information on this? Google has a few papers on it, I believe.

  • Book scanning is interesting and all, but the publishers have everything recent in e-copy. If they banded together, Google’s monopoly wouldn’t exist very long.

    Unfortunately though, I suspect print publishers drink the same soup the music industry does.

  • …I could have sworn Vernor Vinge portrayed a rather more dramatized account of competing digital archiving techniques in Rainbows End…

  • This is old news. NewScientist and various other outlets reported on this back in early April:

    http://www.news...rlds-books.html

  • Fundamentally this is an old trick. Photoshop artists use similar techniques everyday. Whoever patented this must be very proud.

  • This must be a boring job eih?
    Turning pages over for books to be scanned…
    Or do they have a machine for that?

  • “Now, I wonder how it flips the pages.”

    Sergey and Larry take turns.

  • So when can we start reading? can’t wait to see some samples of this scanning thing..

  • “It Turns Out That Google Even Has A Competitive Advantage In Scanning Books”

    That title is about as goofy as it gets. You can get a patent for your grandma’s diaper disposal if you write the claims through a good patent attorney. Patents are a dime a dozen… most patents are only good when someone else tries to sue you.

    Competitive advantage – this is more like calling up a thief who can pick the lock when the owners are out of town and you don’t have the keys to open the door. Tell me how it improves efficiency relative to other methods. Ever heard of calling up a publisher and getting a scanned copy from them? How much more efficient is this patent?

    Stop it already… you and the rest of the bunch who have pre-mature ejaculations just hearing the word “patent”.

  • I believe both Google and Microsoft (when Microsoft had its book digitizing project) outsourced the whole book scanning operation to Kirtas Technologies (http://www.kirtas.com/), which owns most of the patents on the apparatus Erick displayed in the article.

  • And btw… when you go to kirtas.com you will get the answer as to how the pages are being flipped.

  • I actually saw a public-facing tech talk about this. They use humans to flip the pages because the best machines they can find still occasionally tear pages (something like 1 in 100 pages). And this technique wasn’t really developed for *faster* scanning; it was developed so that they could scan books from libraries that the libraries didn’t want destroyed. So they needed a way to scan the books that didn’t involve chopping the spine off.

  • I seem to remember that John Warnock (co-founder of Adobe Systems) formed another company (www.octavo.com) over a decade ago that, among other things, had software that did exactly this. I have no idea if they patented it, but they were doing this long ago.

  • Most of you have no idea what you are talking about.

  • There are several page flipping mechanisms…..

    A good/common one is suction priming and a spatula….

  • The Internet Archive’s approach to scanning is somewhat different than Google’s, as these two videos (Brewster Kahle presenting) demonstrate:

    http://www.ted....index.php/talks/brewster_kahle_builds_a_free_digital_library.html

    http://www.loc....ures/kahle.html

    (The TED video is about 20 minutes long, the other video, from a Library of Congress guest lecturer series, is over an hour.)

    As a long-time Google Books user, I find the quality of their OCR work is much better of late than it was just a year ago.

  • Google, if you need someone to flip those pages I’m available ALL summer.

  • that is really cool. lol. i sure hope they aren’t wasting all this time and effort.

  • Compared to what Google uses, I believe placing the book in a V shape is much better for OCR because the pages are curvature-free in the first place.

    Using software to correct page curvature is not 100% accurate. Try it for yourself at http://snapter.atiz.com/. You can also check out this V-shaped book scanner at http://atiz.com/

  • Really interesting….

  • Their competitive advantage comes from another place: they can take every scanned phrase and look it up in their index to see whether the OCR made any mistakes. The resulting OCR quality will be much better.

  • Nice. I’m glad someone is scanning this stuff.

  • No matter how fast they scan it, they still have to turn the pages.

  • some one scanning but still there are people try to do some spam
