Google Processing 20,000 Terabytes A Day, And Growing
Erick Schonfeld
A recent white paper by some Google engineers puts some numbers around the massive amount of computation that Google does every day to index the Web, process search results, and serve up ads, among other things. As of last September, Google was processing 20,000 terabytes of data (20 petabytes) a day. This large-scale computing capability is a big part of Google’s competitive advantage over Yahoo, Microsoft, and everyone else.
Niall Kennedy reports the breakdown of how Google’s large-scale computing has grown, and estimates that hardware cost for each large-scale computing job (known as MapReduce) is about $1 million. The number of such jobs grew nearly an order of magnitude (10X) between 2004 and 2006, and then another order of magnitude a year and a half later. See the chart below:
Impressive, though not insurmountable…
Fascinating article, thank you.
In case anyone else misinterpreted this statement from Erick:
“hardware cost for each large-scale computing job (known as MapReduce) is about $1 million.”
That seems to imply a cost for every time the job runs.
The referenced article says:
“The average MapReduce job runs across a $1 million hardware cluster, not including bandwidth fees, datacenter costs, or staffing.”
So it is a one-time cost for the hardware that a single job runs on.
Amazing technology!
Is the white paper link pointing to an urchin tracker some sort of bad joke?
One copy is here, per mention from a commenter on niallkennedy’s blog -
http://burtonator.files.wordpr.....7-dean.pdf
(if it’s still there)
20 petabytes is just for MapReduce problems, I would imagine their search processes 10x to 100x that amount daily.
Your whitepaper is broken…
Here’s the real link.
http://feedblog.org/2008/01/06.....-clusters/
Technicle, please don’t deep link
… I can’t view stats on that page
Kevin
By ‘real link’ I mean the real link to the whitepaper:
http://feedblog.org/2008/01/06.....-clusters/
It’s pretty good. Mostly a refresher.
Niall pried out some good info though.
Onward!
I know these numbers are huge, but can someone put these numbers into context? What’s a comparison the average person could understand?
Love them or hate them, Google is the most impressive computer/tech company. Though I would be really interested to see how they secure their system. I wonder what the private thoughts are at Yahoo and Microsoft when they see this. Does anyone know how this would compare to, say, a government agency system, like one for monitoring weather?
The huge costs can’t be an advantage over Yahoo or MS, since MS could afford to have that much power should they want to.
The reason Goog is still better than Yahoo/MS, though only marginally, is that Goog still delivers more accurate results.
Somebody needs to come up with a new search engine. Goog is the best technology can offer, yet we have all the eBay spam and shopping spam coming up with every search.
———
http://www.xenbet.com
Kevin - sorry.. I just “Copy link location” while mouseover’ing the link… would’ve pointed it to your blog article but there was no ref of it in http://www.niallkennedy.com/bl.....mment94130 where i learned of the link…
my apology!
> What’s a comparison the average person could understand?
a) 20PB or 20,000TB == 40,000 disk drives worth of data, assuming 500GB/drive (latest max-capacity per drive is 1TB/drive, ie., 1000GB/drive)
or another perspective -
b) 1,500 disk drives working perfectly in parallel over 24 hours to move that much data, assuming 1.5Gbits/sec SATA drives all working [IMPOSSIBLY] perfectly at their theoretical design limit, i.e., some 150MBytes/sec (in practice, probably 30% efficiency or even less, so we are talking about at least 5,000 drives working in parallel for 3600×24 seconds, and that is just the data i/o part.)
Very roughly, of course.
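For anyone who wants to check or tweak those two estimates, here is a minimal sketch of the arithmetic. Every input is the commenter's assumption (decimal units, 500GB drives, 150MBytes/sec per drive, 30% efficiency), and the function names are made up for illustration.

```python
# Back-of-envelope check of the two comparisons above.
# All inputs are the commenter's assumptions, not measured values.

DAILY_BYTES = 20_000 * 10**12          # 20,000 TB (20 PB), decimal units
DRIVE_CAPACITY_BYTES = 500 * 10**9     # assumed 500 GB per drive
DRIVE_BYTES_PER_SEC = 150 * 10**6      # assumed 150 MB/s per drive (1.5 Gbit/s SATA)
SECONDS_PER_DAY = 3600 * 24


def drives_to_store(total_bytes, capacity_bytes):
    """Number of drives needed just to hold the data."""
    return total_bytes / capacity_bytes


def drives_to_stream(total_bytes, bytes_per_sec, seconds, efficiency=1.0):
    """Number of drives needed to move the data within `seconds`,
    with each drive running at `efficiency` of its peak rate."""
    return total_bytes / (bytes_per_sec * efficiency * seconds)


if __name__ == "__main__":
    print(f"a) drives to store one day of data: "
          f"{drives_to_store(DAILY_BYTES, DRIVE_CAPACITY_BYTES):,.0f}")
    print(f"b) drives at 100% efficiency: "
          f"{drives_to_stream(DAILY_BYTES, DRIVE_BYTES_PER_SEC, SECONDS_PER_DAY):,.0f}")
    print(f"b) drives at 30% efficiency: "
          f"{drives_to_stream(DAILY_BYTES, DRIVE_BYTES_PER_SEC, SECONDS_PER_DAY, 0.3):,.0f}")
```

Changing the capacity or efficiency constants shows how sensitive the drive counts are to those guesses.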
Thanks Technicle.
The same techniques are in use at many other organizations on large volumes of data.
My team at Yahoo! has invested in making the open source Apache Hadoop map-reduce framework run at very large scale (http://lucene.apache.org/hadoop/).
This is used by Yahoo! and other organizations with similar data analysis needs. No one is doing as much map-reduce computation as Google, to the best of my knowledge, but others are running individual jobs at the same scale on similar data sets.
Eric, can you please explain when’s the best scenario to use map-reduce? thanks
To put things into perspective, it would be very interesting to have comparable measures from Yahoo and other Google competitors.
> Eric, can you please explain when’s the best scenario to use map-reduce?
A couple of links:
http://en.wikipedia.org/wiki/MapReduce#Uses
http://open.blogs.nytimes.com/tag/hadoop/
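As a rough illustration of the kind of problem that fits map-reduce (lots of independent records, then an aggregation per key), here is a toy word count. It runs in a single process and does not use Google's or Hadoop's actual APIs; the function names are hypothetical and only meant to show the map, shuffle, and reduce phases.

```python
from collections import defaultdict


def map_words(document):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in order over a list of documents."""
    pairs = (pair for doc in documents for pair in map_words(doc))
    return dict(reduce_counts(key, values)
                for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the web is big", "the index is bigger"]))
```

In a real cluster the map calls would run on many machines at once and the shuffle would move data between them; the sketch only shows the shape of the computation.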
Is 20 petabytes/day the compressed or uncompressed size?
I think the increase in daily use of Google products has some effect on it. Sure, everyone uses Google to, well, Google, but think about when people use Adsense or Gmail. Don’t they then use Google as their primary search engine? I know I do. I switched from Yahoo! to Google after my use of Google products increased.
@2 Shawn
“The average MapReduce job runs across a $1 million hardware cluster, not including bandwidth fees, datacenter costs, or staffing.”
Is this the equivalent of saying a car costs $20,000 to own, but excludes the cost of gas, maintenance, and a chauffeur?
Hadoop is also being used by Barney Pell at Powerset…
Anyone have an idea what the Google datacenter (or perhaps a dataplex) costs to operate on an annual basis?
From these numbers you can deduce that they have over 11,000 machines used for these jobs (the exact number depends on the percentage of computing power used and on machine downtime).
Actually 11,081 machine years / (2217 jobs x 395 sec = .0278 years) implies 399,000 machines. Since this doubles about every 6 months I guess they are up to about 600K machines by now.
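The calculation in the comment above can be reproduced with a few lines. This is only a sketch of that commenter's reasoning: it takes the figures exactly as quoted (11,081 machine years, 2217 jobs, 395 seconds each) and assumes the jobs run one after another, so the result is very sensitive to those assumptions. The function name is made up for illustration.

```python
# Reproduces the commenter's estimate; inputs are taken as quoted in the comment.
SECONDS_PER_YEAR = 365 * 24 * 3600


def implied_machines(machine_years, jobs, avg_job_seconds):
    """Machines implied if `jobs` jobs of `avg_job_seconds` each ran back to back
    and together consumed `machine_years` of machine time (commenter's assumption)."""
    wall_clock_years = jobs * avg_job_seconds / SECONDS_PER_YEAR
    return machine_years / wall_clock_years


if __name__ == "__main__":
    wall_clock = 2217 * 395 / SECONDS_PER_YEAR
    print(f"total wall-clock time: {wall_clock:.4f} years")
    print(f"implied machines: {implied_machines(11_081, 2217, 395):,.0f}")
```

If the jobs overlap in time, or the job count is in different units than assumed, the implied machine count changes accordingly.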