Slowly but surely, Amazon keeps adding capabilities to its cloud computing services. What started out as pay-by-the-drink storage (S3) and computational processing (EC2) now includes a simple database (SimpleDB), a content delivery network (CloudFront), and computer-to-computer messaging (SQS). And today Amazon added a web-scale data processing engine with Amazon Elastic MapReduce. (It is a framework for accessing data stored in file systems and databases.)
This is actually a big deal because it allows developers to better take advantage of the massive computing power Amazon has to offer and create applications which process huge reservoirs of data (conveniently stored in Amazon S3) in parallel. MapReduce is the name of the data processing framework Google created to index and search the Web. It literally breaks up huge computational tasks and spreads them to different servers. This is called mapping the data. Once each processor is done with its portion of the math problem, it sends the result back so that all the different partial answers can be combined and then “reduced” into one final answer.
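To make the map and reduce steps concrete, here is a minimal single-process sketch of the classic word-count example in Python. The function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are illustrative assumptions, not part of Hadoop or Amazon Elastic MapReduce, and a real job would spread the map and reduce calls across many machines rather than run them in a loop.

```python
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: combine all partial counts for one word."""
    return word, sum(counts)


def word_count(documents):
    # Each document could be mapped on a different server; here it is a loop.
    pairs = [pair for doc in documents for pair in map_words(doc)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())
```

Calling `word_count` on a list of strings returns a dictionary of word totals. The grouping in `shuffle` is the part a framework normally handles for the developer.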
Amazon is using Hadoop, which is the open-source version of MapReduce. Yahoo also started using Hadoop last year. While Google and Yahoo use this technique for searching the Web, it can be used for any data-intensive computational problem. Amazon lists the following examples: “web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.” Indeed, Hadoop is also the underlying technology used by IBM in its Blue Cloud initiative.
There is even a startup called Cloudera, which offers its own Hadoop computational services on top of Amazon’s EC2. They just got a huge competitor. But more startups can now create Web-scale applications at a fraction of the cost they could before.
Amazon gets more and more interesting every day. Most people think of it as an e-commerce play, but its recent transformation into an infrastructure play has been truly remarkable and a great example of a company thinking very much outside of the box.
Anjali Sen
Usually your writeups are excellent. But this one is quite inaccurate. MapReduce is not a file system. I’d start by reading the wiki entry here:
http://en.wikip.../wiki/MapReduce
okay, what would you call it then? And don’t say “framework”
mapreduce really is a ‘framework’
they took some old parallel processing ideas and put a big ‘approved by google’ stamp on it, calling it mapreduce
This is not the first time that TechCrunch has failed to understand what MapReduce actually is before telling the world how awesome it is.
The core of Hadoop is a File System, or am I reading this wrong: http://en.wikip...org/wiki/Hadoop
Yeah, you’re reading it wrong. Hadoop is a processing engine that can sit on top of any one of several file systems. One of these file systems is HDFS, which is not the same as Hadoop core.
gotcha, added a clarification to the post. Headline should be read as a metaphor.
What Notevenclose said.
It’s amazing that Amazon is doing all of this stuff. Weren’t they just an on-line retailer a couple of years ago? Now Google and Microsoft are following them into cloud computing.
Not a file system at all. Not a database. It’s a distributed parallel data processing engine.
yup, point taken. But it provides access to file systems and databases. In effect, it creates a massive file system from all the individual file systems that it accesses.
Erick,
Stop now please.
Can someone that understands technology be assigned to the articles about Cloud Computing?
oo maybe we can get sarah lacy on it!
How about you start your own tech blog – and write your own articles. Let’s see if your blog gets any visitors, period.
Eric, no.
Erick, consider a career in killing yourself. Good God. Just stop.
Joe Bauers, good god – this is the internet. Cut the reporter some slack, and if you don’t like TechCrunch, then get the f$%k out of there.
The main point of MapReduce is to easily parallelize a repetitive task without having to worry about data management concerns.
More simply put, it gets you out of the worries of managing individual files and computers, and instead lets you treat your cluster of computers as a single entity.
This is a huge boon to developers, because it lets them focus on the actual processing without worrying about infrastructure.
The distributed filesystem is a different entity entirely, by the way. It just happens to play nicely with MapReduce.
By dfs in your statement do you mean HDFS specifically or distributed file systems in general? Map/Reduce basically moves the calculation to the physical location of the data rather than the other way round, as in most other parallel programming paradigms. So, even though it may run on local or non-distributed file systems, its parallel processing power can only be properly utilized if used over HDFS or similar such distributed file systems.
You don’t necessarily need to have a distributed filesystem for MapReduce to be effective. If you have processes that are very computationally expensive but don’t require a whole lot of data, then you can even feed your MapReduce job with a single fileserver / database server.
For the vast majority of use cases, however, you are correct – MapReduce will attempt to minimize the amount of time wasted with data transfer by attempting to execute where the data is. This was Google’s main use case when they developed MapReduce – they needed a way to index all the data they had spidered in a reasonable amount of time. Indexing is a computationally cheap task, and is heavily I/O-bound. Thus, reducing the amount of I/O needed results in a huge performance gain.
@Tobias You expressed my original argument better than me.
Coming to the case of using a single machine, the number of map tasks depends on the number of splits performed on the data provided. It defaults to the number of input files if each file is below the defined threshold value. A single machine, though, unless you implement your own multi-threaded MapRunnable class, would run one map task at a time. So nothing actually runs in parallel. Yes, as you say, computational expenses are taken care of far better than in other systems.
I’m new to this so my understanding may be wrong but would like to know from you if you have found significant case studies of single machine m/r jobs.
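The relationship between input files and map tasks described in the comment above can be sketched roughly in Python. This is a simplified model under stated assumptions, not Hadoop's actual split logic: `count_map_tasks` is a hypothetical helper, `split_size` stands in for the threshold, and each file is assumed to be cut into chunks of at most that size.

```python
import math


def count_map_tasks(file_sizes, split_size):
    """Simplified estimate: one map task per split, at least one per file."""
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)
```

When every file is smaller than `split_size`, the estimate equals the number of input files, which matches the default behavior the comment describes.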
ok, so would a “Web-Scale Data Processing Engine” be more accurate?
Yeah, for sure.
Yeah, I think that’s a fair use of terminology. It has a number of inherent benefits to developers, but the main advantage is that it lets you scale out easily. Scaling is a notoriously difficult problem, so any sort of infrastructure that helps that is a huge help.
All right, I changed the headline and fixed the post. My file system metaphor was off mark. Thanks for the quick feedback everyone.
Where exactly is this ‘Cloud Computer’ you speak of?
I have decided not to blog about this based on the fact that I simply have no idea what it is. It would have to read, “Hadoop: Amazon’s New Parallel Data Thingy”. Actually, that’s kind of entertaining. I might run with it.
I don’t get it either… Is this Hadoop comparable to the Google App Engine concept?
What??? A file system? This is a major screw-up on TC’s part. Writers should make sure that if they don’t understand the technology they are reporting on, they either don’t write about it or at least read the Wikipedia entry first. I guess sometimes it is more important to be first with the news than to be correct.
Oh stop it what??
The fact that people have jumped in to correct the mistakes already makes this a valuable thread for those trying to understand hadoop.
When I want to deep dive into hadoop I’d go to the source not TC.
What I gained from this post is the fact that Amazon has also started offering hadoop…
With you on that one K. How many times does it need to be said that Erick’s initial impression was wrong? The man has admitted it and changed things accordingly. As you said, if one wants to understand Hadoop or Map/Reduce, TC is not the place. In fact that piece of initial misconception does not take anything away from the original intent of the article.
Erick,
I love reading your finance/business pieces where you really add critical insight and value.
But no offense, your pieces on MapReduce (and technology in general) are only getting worse:
http://www.tech...mbraces-hadoop/
That’s not the only example of where your technology focused articles are wholly inaccurate.
“wholly inaccurate”? Comparing MapReduce to a file system was only partially inaccurate, but ok. I fixed it. What else is wrong?
You try writing up this stuff in 15 minutes.
map-reduce is not even partially a file system. that’s like saying a BBQ is a screwdriver.
if you can’t take the heat, spend more than 15 minutes researching.
sigh
“You try writing up this stuff in 15 minutes.”
That’s the point. You can’t write this stuff up in 15 minutes and expect to be remotely accurate, especially given that you don’t fully understand the subject.
Wish I had more than 15 minutes. You get what you pay for, Raj. Considering, I think, I was more than remotely accurate.
Please correct all references to MapReduce being a file system. It is simply not. The Hadoop project has a file system included, but Amazon is using S3 as the file system instead. I would describe MapReduce as a new programming paradigm for processing huge quantities of data in parallel over many machines.
Yeah Techcrunch, I love you regularly but this is really sub-calibre reporting. mistakes still present in the article:
+ map-reduce is not a file system, it’s an algorithm.
+ hadoop is not “the open source version of MapReduce”. it is a framework that some people use for map-reduce. it can also be used for other things.
+ the explanation of map-reduce is pretty hokey. (split onto different servers? not necessarily…) suggest the author take a few mins to read the wikipedia article and provide a more accurate description.
anyway, unfortunate credibility hit here.
Yahoo has been using, endorsing and co-developing Hadoop for quite a while – much longer than a year. It runs probably the biggest existing Hadoop clusters. Also, Google didn’t invent map/reduce.
Is the last part a factual distinction or just plain semantics? Sanjay Ghemawat and Jeffrey Dean’s original paper on Map/Reduce came out of their work inside Google. Google was also the first company to actually implement this programming paradigm. So for all intents and purposes Google invented it, unless you want to make a strict distinction between Google and its employees.
The concept of map and reduce came from functional programming languages where data are immutable. Google didn’t invent map/reduce. They applied and popularized the concept to large scale distributed computing.
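As a small illustration of the functional-programming roots mentioned above, Python's built-in `map` and `functools.reduce` apply the same two ideas to an immutable tuple. This shows the language-level concept only, not the distributed framework, and the sample words are made up for the example.

```python
from functools import reduce

# A tuple cannot be modified in place, so the input stays immutable.
words = ("map", "reduce", "hadoop")

# map: apply one function to every element independently.
lengths = tuple(map(len, words))

# reduce: fold the mapped values into a single result.
total = reduce(lambda acc, n: acc + n, lengths, 0)
```

Because each `len` call depends only on its own element, the map step can in principle be run in any order or on any machine, which is the property the large-scale systems exploit.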
You are referring to the concepts of mapping and reducing in functional programming languages. On the other hand this blog post and I are referring to Map/Reduce the framework/paradigm and that definitely was invented at Google. Martin might have meant the same as you but his reference to map/reduce in that manner seemed to point to the framework and not the concepts of mapping and reducing. Hence my reply.
Now, I suggest that you go through this paper: http://www.cs.v...educe/paper.pdf
to verify that while Lisp and Haskell might have inspired m/r, the mappers and reducers in Google’s version (and by extension in Hadoop’s) do not correspond to the exact same roles in traditional functional programming languages. That in part is what lets m/r scale so extensively. In my view Google’s application is groundbreaking enough to be treated as a significantly new idea and not a mere application of old ones.
Pretty cool
just great!
Slowly? They be coming out with stuff a lot. Or maybe I just notice it. Dunno
The pricing is the most interesting aspect of the announcement. Apparently the hadoop nodes are shared between many users.
This is significant as building large scale crawlers/classifiers just got a lot cheaper ($0.4 to $0.06 per hour on large instances) on this platform.
Do not be too hasty to jump on Eric for calling Hadoop a file system, after all it contains a distributed file system as well as a Map-Reduce engine.
A much more serious error was made when database luminaries David DeWitt and Michael Stonebraker, people who should definitely know better, attacked Map-Reduce as a bad database system. See:
http://www.data...-step-back.html
The prices sound very attractive to me.
I laugh at the cyclic nature of technology. Apollo Computers did this same thing back in the 80’s.
http://en.wikip...Apollo_Computer
Nice to see good ideas stick around and get implemented in newer ‘clouds’.
That is really funny; I didn’t realize Apollo Computers was involved in batch computing.
I am somewhat familiar with their name because of their DSEE version control system which eventually became (or inspired?) IBM’s ClearCase. Not that ClearCase is universally loved, but it still is pretty common to see it in enterprises today.
Following the genealogy of technology can be an interesting thing.
I’m surprised it costs *more* for this rather than less since it is trivial to use the hadoop tools to do this yourself and this way is far less flexible. I was hoping it was going to be offered at a discount to full EC2.
I think they do offer this at a discount to full EC2; for example, an hour of Amazon Elastic MapReduce time on a small instance is $0.015 versus $0.10 for a ‘regular’ small EC2 instance.