In a blog post today Google says they’ve identified 1 trillion unique URLs on the web. It’s actually more, they say, but some web pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other.
What they note way down in the fourth paragraph, however, is that they don’t actually index all of those pages, so you can’t find them on Google. Estimates on the true size of the Google index are a mere 40 billion pages or so.
Why don’t they index all the pages they’ve found? Some of them are spam. But it’s also very expensive to index sites. And the fact that Google indexes many news sites, blogs and other rapidly changing web sites every 15 minutes makes all that indexing even more expensive. So they make value judgments on what to actually index and what not to. And most of the web is left out.
Google also says “But we’re proud to have the most comprehensive index of any search engine.”
That may be true today, but it probably won’t be true next week (check back here then). Google knows that as well as we do, and that’s why they posted this today.

Way to waffle on the post title. Very Favre of you.
c’mon, Michael. Spill the info. You know you want to.
MS+Y! ?
Google has the best engine until Sunday?
What time Sunday?
Hey….what will happen next week?
tell me
Powerset?
The problem with Google’s index is that if you want to remove your pages or do anything manually, it definitely does not take 15 minutes. With Yahoo you could at least pay for expedited service; with Google you’re stuck with their uber-anal anti-webmaster policies (cuz they think they are the smartest, so they can teach everyone how dumb they are), which make life hell for noobies like me…. Besides, what if I had pages pulling from a DB, say, and I made them static? Google wants me to wait a year or so until it figures out what happened before it finishes indexing the site. I am just almost pissed at Google for taking over >this< world.
Hey, what happens next week? tell us…
MS + Y! ?
Can’t wait for what will happen next week. Maybe MSFT is up to something; today they just announced BrowseRank. I think they desperately wanted Yahoo search to integrate this and launch it with a 30% user base rather than the small user base they have now.
And what makes that misleading?
I read through google’s post and found nothing misleading. They start out with a title that says the web is big, then they describe how big it is, then they say they don’t index all the pages…
hotbot is back
Proofreading FTL
Mike sure likes to tease. And I obviously have no idea what that’s about, but if it’s a larger index that someone is going to show off, I don’t think I’m impressed. Better ranking, yeah. Stockpiling of data, not so much.
Yes, I’ll check back next week…
Good scoop Mike, I wonder what’s going to happen next or what you meant by later this week?
TL – http://www.offu...rThanTechCrunch
My guess is it is MSFT indexing Facebook. If FB has 100 pages per user (easily doable when you consider photos, shares, notes, events, groups etc), then they could have 10b pages quite soon, which is 25% of Google’s index right there.
I don’t know of any public, impartial estimates that Google has only 40 billion pages indexed, Mike. Personally, I suspect they’re well larger than that. But I don’t really want to go back to the stupid size game, as I covered in my own post today. Any search engine that trots out a claim they are bigger than someone else isn’t doing searchers a service. Size alone is an indication of nothing.
yo danny!!!
ask your gf/wife if size matters!!!! you don’t have to be the biggest, but size definitely matters… assuming you know wat the hell you’re doing!!!!
ain’t it amazin’ that tech can emulate real life!!!
peace!
It is time that people realize that having one massive company focused on indexing the world’s information (1 trillion pages today, growing at a billion a day) is not going to work out, just because of the scale of the problem.
There will have to be vertically focused engines (for example job search, health search, etc.) that will be the starting points for your search. Google could still show, as the first result, a link to another search engine focused on a vertical.
Yahoo said in 2005 that their index was 20 billion pages. Google said a few weeks later that their search index was more than three times larger than that of any other search engine. That would be 60 billion pages in 2005 …
It’s not the size of the index that matters… it is how useful your results are to the average user… I guess no one can beat Google on that…
Have you used Gigablast? I find it more useful and easier to use than Google or Yahoo.
You must be affiliated with Gigablast, which is not really comparable to G/Y/M, either in recall or precision. Spelling check is a complete joke. I’m all for alt SEs, but the gap has actually been widening over the last few years, instead of narrowing.
Well, to say the least, I don’t think Google would have any problem at all relaunching with a more user-friendly categorization of the index (for those who don’t know all the commands, such as filetype: and site:, and the operators), because really, if you know how to use Google, there’s no one close to their well-established and highly evolved automation and speed of service.
One of the things I wonder is why Google, with such an extensive list of search utilities and APIs, does not have AJAX search… My friend, who knows technology better than me, thinks it may be a bandwidth issue. But is it really possible that Google would even think about bandwidth issues….
The problem with AJAX, similar to Flash’s, is deep-linking. Google can’t index something it has no access to. Fortunately, this problem has recently been solved, thanks to a Bulgarian coder and his script allowing deep-linking in Flash ActionScript and AJAX scripts. Therefore you’ll see more and more AJAX subpages and Flash website subpages indexed by Google soon.
Whoa, what’s happening next week?
I know what’s happening next week too. It’s not that big a deal.
> What they note way down in the fourth paragraph, however, is that they don’t actually index all of those pages, so you can’t find them on Google.
If “find” means “can be returned in a search result”, that’s wrong for both Google and Yahoo and probably Microsoft as well.
The text displayed with a link tells you what someone/something wants you to know about where the link points. (Remember googlebombs.) That text can be indexed. If a query matches it, the linked doc can be returned as a search result.
It’s unclear when such docs are returned in search results. After all, GYM try to index important docs, so unindexed docs are, by default, less important, but that’s a policy question.
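To make the anchor-text point concrete, here is a minimal sketch, not a description of how Google, Yahoo or Microsoft actually do it. It assumes a toy crawl represented as (source URL, anchor text, target URL) triples; the function names and the example.com URLs are all made up for illustration. The target page is never fetched, yet it can still be returned for a query that matches the words other pages used to link to it.

```python
from collections import defaultdict

def build_anchor_index(links):
    """Map each anchor-text term to the set of URLs that the text links to.

    `links` is an iterable of (source_url, anchor_text, target_url) triples.
    Only the anchor text is indexed; the target page itself is never read.
    """
    index = defaultdict(set)
    for _source, anchor_text, target in links:
        for term in anchor_text.lower().split():
            index[term].add(target)
    return index

def search(index, query):
    """Return target URLs whose inbound anchor text contains every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical link data: two crawled pages point at an uncrawled page.
links = [
    ("http://a.example.com/", "river cooled servers", "http://uncrawled.example.com/grid"),
    ("http://b.example.com/", "cooled server grid", "http://uncrawled.example.com/grid"),
    ("http://b.example.com/", "job search", "http://jobs.example.com/"),
]
index = build_anchor_index(links)
hits = search(index, "river cooled")
```

Here `hits` contains only the uncrawled page, because the words "river" and "cooled" both appear in anchor text pointing at it. Whether a real engine chooses to surface such a page is the policy question raised above.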
> And the fact that Google indexes many news sites, blogs and other rapidly changing web sites every 15 minutes makes all that indexing even more expensive.
I’d be surprised if more than 1-20% of Google’s indexing “spend” was on such sites. (And, I’ll bet that the rate is doc-specific, not site-specific, so they don’t waste time on slow-changing docs on hot sites.) If that’s true, those docs are less than 1% of Google’s index, so their indexing cost has a larger processing+bandwidth component than that of a doc that Google crawls once a day or less often. The latter’s cost will have a larger storage component.
Rebuilding the web graph several times a day is impressive. I’m surprised that they start from scratch each time as only a small fraction of the links change each day and the evaluation of old docs and links is fairly stable.
> I’m surprised that they start from scratch each time
My guess is that perhaps their index is sorted by rank to speed things up. If a search returned 2 million pages, you don’t want to look up each ranking to reorder the results.
Michael, let me guess, your post next week will have something to do with Microsoft?
You are the most biased blog writer I read on a daily basis, with your strong preference for Facebook/Microsoft and your delight in bashing Google. Every time I read a post like this, I think it will be the last time I read your blog… I think the time is actually coming when this will be true.
do u know what club is after party?
oooh, Techcrunch is a thriller blog now
Am I the only one considering a short on GOOG?
Have u checked with your lawyers that leaking that much is Ok. Just kidding. Don’t know if you intended to, but your speculations will ultimately churn the rumor mill to its hilt
guy kawasaki bought altavista and collects user-submitted pages via twitter?
Mike, you come off as a bit pro Microsoft and anti Google. Why do I feel that way?
I am betting that ‘The One’ going to make the announcement is cuill.com
Did you know that “couille” in French means testicle? It’s pronounced the same way as Cuill. This domain name will be a disaster in France.
Nice non-post.
Wow. That was stupid. If you really thought that post was misleading, you really have problems. I hope you were just trying to get some attention for your little inside information with a shocking title.
Is Cuill coming out of stealth mode next week?
Something big is coming out next week? Wow, I can’t wait. Cuill? Personally, I have to say it’s a bad choice for naming.
“Couille” means testicle in French and is pronounced the same way. What a joke!
That’s hilarious!! hahaha.
I like this game
Come on, Google has river-cooled velcro-suspended server grids the Borg would envy.
As far as I’m concerned there’s only one search engine. Y!+MSN barely show up in my stats, less than 1% of my traffic. Y!+MSN are negligible at this point.
Anybody who claims next week to have a bigger index than Google would be lying, as Google dominates all access log stats so far.
I recall reading about some ex-Googlers who had some better way of crawling the Internet. Could that be the big news next week?
Ok, I guess Cuill are those ex-Googlers.
Keep in mind that he did say ‘probably’. It may be, but maybe not. 1 trillion URLs is a ton, and even if Google has only 40 billion pages indexed (a figure many would dispute, since they were around that figure a few years ago), anyone who could pass them would need massive computing power. Cuill wouldn’t be able to do that without raising some serious eyebrows, and only Microsoft could do that. By indexing Facebook? Good lord, what a waste!
@Jason Bogovich @Momchil get a clue. Michael is one of the few tech writers who does not just take everything that Google’s PR folks tell him to write at face value. This blog post has been misrepresented (exactly as Google intended) all over the place (check out http://www.info...e_index_re.html – amazing how a supposed tech publication missed the basic point), and I think Michael nailed what Google was trying to do. Google is using exactly the same FUD tricks Microsoft popularized in the nineties to try to gain even a minor advantage over anybody trying to compete with them.
If Cuill is indeed launching next week, what are the odds that Google just happened to write a blog post about how many web pages they know about (after 2+ years of silence on this front) one week before a new startup on this front is launching? Ignoring for a second the ethics of the press folks under embargo who would have told Google that this was happening (happens all the time – press folks leaking embargoed stories to large companies so that they get future favors), what do you think about a company that supposedly focuses on innovation and does not think about its competition doing this to try to create confusion around the launch of a new company that is trying to create at least a bit more competition in the search space?
Amazing how quickly Google is turning into the new Microsoft, and thanks Techcrunch for being one of the few publications that keeps trying to report the facts vs. just copying press releases.
Those who actually crawl and index data know that there is a link between the number of discovered URLs and the number of them crawled. We have discovered 211 billion unique URLs in a crawl of around 43 billion pages (only successfully retrieved pages counted), which is pretty representative of web scale. Using a ratio of 5 URLs per crawled page, we arrive at a figure of 200 billion pages crawled by Google, in which they discovered 1 trillion unique URLs. Their index might be around 40 billion, but they certainly crawled a lot more than that and included only the best candidates in the full-text index.
So I reckon this article is as misleading as Google’s post.
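The back-of-the-envelope estimate in the comment above can be checked with a few lines of arithmetic. This is only a sketch of the commenter's reasoning, using the figures quoted in the thread; the variable and function names are made up here, and the key assumption (that Google's crawl has the same URLs-per-page ratio as the commenter's crawl) is the commenter's, not an established fact.

```python
# Figures quoted in the comment (the commenter's own crawl).
discovered_urls = 211e9   # unique URLs discovered
crawled_pages = 43e9      # pages successfully retrieved

# Observed ratio of discovered URLs per crawled page (roughly 4.9).
observed_ratio = discovered_urls / crawled_pages

def estimate_crawled_pages(known_urls, urls_per_page):
    """Estimate pages crawled from URLs known, assuming a fixed URLs-per-page ratio."""
    return known_urls / urls_per_page

google_known_urls = 1e12  # the 1 trillion unique URLs from Google's post

# Using the rounded ratio of 5 from the comment, and then the observed ratio.
estimate_rounded = estimate_crawled_pages(google_known_urls, 5)
estimate_observed = estimate_crawled_pages(google_known_urls, observed_ratio)

# Share of that estimated crawl that a 40 billion page index would represent.
index_share = 40e9 / estimate_rounded
```

With the rounded ratio the estimate comes out at 200 billion crawled pages, so a 40 billion page index would be about a fifth of the crawl. The unrounded ratio gives a slightly higher estimate, a little under 204 billion.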
good things is
If your final comment was about CUIL, then I’m sorry to say CUIL failed to impress me even a bit (even the ontology is something quite old now; something like searchme.com is far better).