Yahoo is following in Google’s footsteps again in search. Today, it is shifting a crucial part of its search engine to Hadoop, software that handles large-scale distributed computing tasks particularly well. Hadoop is an open-source implementation of Google’s MapReduce software and file system. It takes all the links on the Web found by a search engine’s crawlers and “reduces” them to a map of the Web so that ranking algorithms can be run against them. (Update: A second try at explaining this. What MapReduce and Hadoop do is break up a computation problem into manageable chunks and distribute them to different processors; that is the “map” part, which maps the data. Once all of the individual results are in, they are combined into one big result; that is the “reduce” part. Search engines, in turn, use this technique to literally map the Web. Sorry for any confusion my paraphrasing might have caused.)
Yahoo is replacing its own software with Hadoop and running it on a Linux server cluster with more than 10,000 processor cores. The Hadoop software does the same job 34 percent faster than the old software. Yahoo is also providing some other interesting stats that give us a view into the computing infrastructure behind its search engine:
Some Webmap size data:
* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
Compare this to some data from Google on its MapReduce computing infrastructure (it is not quite apples to apples, but Google was processing 20 petabytes a day back in September 2007 and outputting 14,000 terabytes in compressed data per month).
Hadoop is a project of the Apache Software Foundation. It also works for large-scale computing problems beyond search. For instance, IBM is using Hadoop as a foundation for its cloud computing initiative. Competing with Google using open-source software where it can is a smart move on Yahoo’s part, especially when that software outperforms its own.

To be fair, Yahoo! has been one of the largest proponents of and developers on Hadoop; it’s not like they’re suddenly switching gears.
Maybe they’ll move to skynet after hadoop. Or, maybe skynet will do it FOR them!
http://www.info...apreduce-skynet
“It takes all the links on the Web found by a search engine’s crawlers and “reduces” them to a map of the Web so that ranking algorithms can be run against them.”
No. MapReduce is a generic computation framework, and is certainly not specific to web crawling and indexing. While the incarnation popularized by Google is similar to the one implemented by Hadoop, the paradigm is much older than Google. Take a look:
http://en.wikip.../wiki/MapReduce
what do they mean by hadoop … I mean how did they arrive at this name ??
my guess .. had object oriented programming (for breakfast ..may be!!??)
sounds crazy!!
As was already said,
“It takes all the links on the Web found by a search engine’s crawlers and “reduces” them to a map of the Web so that ranking algorithms can be run against them.” is probably the single most incorrect description of MapReduce I’ve ever read. It reminds me of the student who gets cold called by the professor and asked for a definition. “MapReduce… ugh, that’s when you reduce something to a map”
For one, the “map” in “MapReduce” does not mean the same thing as the map in maps.google.com.
For another, http://en.wikip.../wiki/MapReduce
Journabloggers need to do more research before writing about something they obviously know nothing about.
I’m looking forward to an Uncov post (fingers crossed) mocking you for this level of blunder.
Yeah, Erick, better fix that MapReduce gaffe quickly. A quick Google search on it would have saved a great deal of embarrassment here.
A good move, whether producing better search results or not — presumably it will.
i hope the postgres people get all up in hadoop’s grill about using an elephant in the logo.
Erick, this is a pretty lousy post. I thought Duncan was horrible at understanding technology, but you are obviously even worse.
No “apples” to “apples”! Those numbers aren’t even oranges to apples.
This post is just plain lousy.
Everyone seems to be a little hard on Erick.., granted, that is a pretty good quote to poke fun at.., but maybe you should be commenting over at Fred Wilson’s blog.
By the way AK.., Hadoop is named after a stuffed elephant of Doug Cutting’s daughter.., he is the creator of Lucene, Nutch and Hadoop, and is currently employed by Microsoft.., uh.., I mean Yahoo!
Yahoo! has been dedicating engineers to Hadoop for a while now.., initial focus has been on scaling Hadoop so it would be able to operate on very, very large clusters. They have been utilizing it during these initial phases to process logs/stats for various groups within the company. They were, and probably still are, running a ~10K machine cluster where different groups within the company shove a bunch of log data for processing.
A key point here isn’t that Yahoo! is riding the coattails of Google in map reduce type processing.., but that Yahoo! has dedicated major resources to furthering this project, which is becoming widely used by many startups for processing large amounts of data.., usually log data for stats (I know of Facebook and AdMob using it). By Yahoo! taking the reins and giving back.., they are providing the development community with a very cool.., and powerful tool.
I have posted a nice little presentation Yahoo! did at OSCON last year on hadoop.., link here:
http://kaiyzen.com/?p=77
This is great news.. as just this morning I was wondering how come Google is referring 10 times as much traffic to my site as Yahoo is. I hope that this new algorithm will help my targeted audience locate my site more easily when entering certain keywords pertaining to my site.
Interesting that it is more efficient – should increase overall search throughput – shouldn’t affect quality of results though. Not that that matters much. As far as I (a sample of one) can tell, Google’s constant tinkering with search algorithms has led to worse and worse results over the past two years. Yahoo’s results have been better for well over a year now. It may not be Google’s fault entirely. There is so much attempted manipulation of its results that playing defense against that may be hurting the overall quality.
Well, this is not some impulse idea from Yahoo!. It was already obvious this day would come when they hired Doug Cutting over a year ago. Yahoo!, with Doug leading the way, has since been the single biggest proponent and developer of Hadoop.
Hey, I totally bought the MapReduce quote…looks like you can’t slip much past this crowd though.
Yahoo! does still have lots of bright engineers.. only a few noisy guys at the top make themselves look stupid.
It really runs the other way. Yahoo needed to update their infrastructure and decided to invest in an open source infrastructure rather than a proprietary one. Yahoo now has the advantage that undergraduates at schools like UCB and UW are being taught to program using Hadoop.
The Yahoo webmap is one Hadoop application, but it is far from the only one at Yahoo. One of the side benefits of having a good distributed computing infrastructure is that it becomes much easier to write ad-hoc programs that run on a large cluster of machines. There are more than 40,000 cores running Hadoop at Yahoo.
Hadoop was the name of Doug Cutting’s *son’s* stuffed elephant.
@3 and @6, I clearly state in the post that Hadoop “also works for large-scale computing problems beyond search”
Also the “It” in that sentence you both cite refers to Hadoop, not MapReduce, and is a paraphrase of Yahoo’s own description of what Hadoop does:
http://develope...ion-hadoop.html
“The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster and produces data that is now used in every Yahoo! Web search query.
“The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.”
I read that to mean that Hadoop creates a map of the Web and reduces (i.e. compresses) it into a manageable set of data that can be placed into a database so that the ranking algorithms can do their work.
If this is wrong, someone please explain (and, no, that does not mean linking to Wikipedia). Explain it in English.
In other news… Yahoo’s YUI 2.5 is released.
http://yuiblog....i-250-released/
TechCrunch… It takes all the technologies on the Web 2.0 found by Erick and “crunches” them down into inaccurate blog posts.
@19 –
1) http://feedblog...large-clusters/
(and click on the pdf file)
2) http://wiki.apa...HadoopMapReduce
the map step breaks the problem down into many small chunks and sends them to individual boxes for computation.
each box then does computation on its small chunk of data
the reduce step takes the results from all the small chunks and combines them back into one big solution.
that’s the rough explanation of what’s going on.
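For anyone who wants to see those three steps as code, here is a minimal single-machine sketch in Python. It is not Hadoop's API or Yahoo's Webmap code; the function names, the chunk size, and the example link pairs are all made up for illustration, and the per-chunk work simply runs in a loop where a real cluster would hand each chunk to a different box.

```python
from collections import Counter

def split_into_chunks(items, chunk_size):
    """Break the full input into small, independent chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def count_inbound_links(chunk):
    """Work done on one 'box': count inbound links per target page in this chunk."""
    return Counter(target for _source, target in chunk)

def combine(partial_counts):
    """Reduce step: merge every per-chunk result into one overall result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# Hypothetical (source page, target page) link pairs, invented for this sketch.
links = [
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("b.example", "c.example"),
    ("c.example", "a.example"),
    ("d.example", "c.example"),
]

chunks = split_into_chunks(links, chunk_size=2)
# On a real cluster each chunk would go to a separate machine; here it is a plain loop.
partials = [count_inbound_links(chunk) for chunk in chunks]
inbound = combine(partials)
print(inbound.most_common())
```

The point of the structure is that count_inbound_links never looks outside its own chunk, which is what lets the chunks be processed in any order and on any machine.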
Erick, the confusion comes from your use of the words “map” and “reduce” in a different way than what they mean in MapReduce. In MapReduce, “map” refers to mapping a function over a set of data. This is an operation that can be quite easily parallelized, which is why Google is able to bring such a large array of processors to bear. The “reduce” part takes all the results of the mapping step and recombines them into the end result (or set of results).
Your paraphrasing would be fine if you just didn’t use the words “map” and “reduce”.
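To make that sense of “map” and “reduce” concrete, here is a small Python sketch using only the standard library. It shows the functional meaning of the two words, not how Hadoop or Google's MapReduce is implemented; the page_length function and the sample strings are assumptions made up for the example, and a thread pool stands in for the large array of processors.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def page_length(text):
    """The function being mapped; it needs nothing but its own input."""
    return len(text.split())

# Hypothetical page texts, invented for this sketch.
pages = ["yahoo moves to hadoop", "map then reduce", "open source wins"]

# "map": apply page_length to every element. Each call is independent,
# so the calls can be spread across workers without coordination.
with ThreadPoolExecutor(max_workers=3) as pool:
    lengths = list(pool.map(page_length, pages))

# "reduce": fold the mapped results into a single end result.
total_words = reduce(operator.add, lengths, 0)
print(lengths, total_words)
```

Nothing here builds a map of anything; “map” only means applying the same function to each item.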
@Metabass: And apparently it takes even less to comment on here.
@Erick: Still an interesting post. Thanks.
I don’t get the fuss.
Yahoo can do one thing to gain on Google: get rid of that slow page design on the search pages. Get rid of the JavaScript and the gazillion cute images. Show only 7 results and 3 more links on the bottom. That will make it load faster than Google. And they should not index subdomains and biz and info sites. Most of these are spam. But of course, I’m probably not the first to suggest these improvements. I’m sure their engineers do suggest similar changes, but these suggestions die in the huge bureaucracy Yahoo has become.
For those interested, there is a well written explanation (in classic Joel style) of how “Map” and “Reduce” work here:
http://www.joel...2006/08/01.html
Hi Folks, thanks for the write-up. A couple of comments:
Comparing a single job to all jobs run at Google is not too informative. Google clearly has a larger plant than we do and has undoubtedly run much larger jobs than ours, but we have been producing a comparably sized web search index for many years and the point of our announcement was that Hadoop can now support jobs of the scale needed to build Yahoo or Google scale services. Doing some math on the Google numbers you quoted…
Their average job produces 14,000TB/2,217,000 jobs -> 0.0063TB / job compared to our 300TB job. So we can conclude that we have run a job 47,000 times larger than their average job! Again, the point is that we can do full web work on Hadoop, not to compare our plant to theirs.
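The arithmetic in that comparison can be checked in a few lines of Python. The three input figures are the ones quoted in the comment and the post (300 TB is given as a lower bound, “over 300 TB”); the variable names are just labels chosen for this sketch.

```python
# Figures as quoted in the thread; names are illustrative only.
google_output_tb = 14_000      # compressed output, in TB
google_jobs = 2_217_000        # job count used in the comment's division
yahoo_job_output_tb = 300      # output of the single Webmap job ("over 300 TB")

avg_google_job_tb = google_output_tb / google_jobs
ratio = yahoo_job_output_tb / avg_google_job_tb

print(f"average Google job: {avg_google_job_tb:.4f} TB")
print(f"Yahoo job vs. average Google job: about {ratio:,.0f}x")
```

The result lands at roughly 0.0063 TB per average job and a ratio a little above 47,000, which matches the rounded figures in the comment.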
Also our team is very proud of our investment in the Hadoop platform. It is our contributions to the Hadoop project that have made it possible to do full scale web search work on it. We’ve been working towards this milestone for several years.
http://develope...hoo-hadoop.html
http://radar.or...s-bet-on-h.html
Thank you Qian Wang for setting me straight. I’ve updated the post with a clarification.
Haha, today’s XKCD comic is just so perfect for this thread:
http://xkcd.com/386/
awesome.
Interesting article, looks like you missed another similar story at the beginning of February when Hypertable.org launched….. Seeing that they have had nearly 300 downloads and are talking with some very hip Valley companies…. why would you not have seen or covered this… (http://onotech.blogspot.com/)
Fred could not possibly be right, could he….
You guys are the lamest of all Google fanboys. Yahoo! follows Google’s footsteps? Yahoo! has been one of the largest proponents and developers on Hadoop.
Google has nothing to do with Hadoop. They do closed source software. They should be marching in Yahoo!’s footsteps … but, wait. Why would Google do anything in open source? They are a black box *non* evil empire.
Happy 1984 to all the fanboys.
Thanks for this post and the clarification. Yahoo itself played a little bit with the words map and reduce so ordinary people understand and remember it. Good to see that Yahoo made some decisions in the past that now add value to the web. Combine this with some tough Microsoft deal makers and distribution opportunities and the other Company will get some serious competition for the first time. Yahoo, you made my day ;
Reading about the sheer size and amount of data is mind-boggling.