With Hadoop, Amazon Adds A Web-Scale Data Processing Engine To Its Cloud Computer
by Erick Schonfeld on April 2, 2009


Slowly but surely, Amazon keeps adding capabilities to its cloud computing services. What started out as pay-by-the-drink storage (S3) and computational processing (EC2) now includes a simple database (SimpleDB), a content delivery network (CloudFront), and computer-to-computer messaging (SQS). And today Amazon added a web-scale data processing engine with Amazon Elastic MapReduce. (It is a framework for processing data stored in file systems and databases.)

This is actually a big deal because it allows developers to better take advantage of the massive computing power Amazon has to offer and create applications which process huge reservoirs of data (conveniently stored in Amazon S3) in parallel. MapReduce is the name of the data processing framework Google created to index and search the Web. It literally breaks up huge computational tasks and spreads them to different servers. This is called mapping the data. Once each processor is done with its portion of the math problem, it sends the result back so that all the different partial answers can be combined and then “reduced” into one final answer.
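
To make the map and reduce steps a little more concrete, here is a minimal sketch of the classic word-count example in plain Python. It runs in a single process and does not use Hadoop or Amazon Elastic MapReduce at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text are assumptions made up for illustration. In a real cluster, each chunk would be handled by a different machine and the framework would do the grouping between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, which a framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all partial counts for one word into a total."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently, so this loop is the part
    # that could be spread across many servers.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already split into three chunks.
chunks = ["the cloud is big", "the web is bigger", "the cloud grows"]
counts = word_count(chunks)
print(counts)
```

Because every chunk is mapped without reference to any other chunk, adding more machines mostly means handing out more chunks, which is roughly why the approach scales to very large data sets.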

Amazon is using Hadoop, which is an open-source implementation of MapReduce. Yahoo also started using Hadoop last year. While Google and Yahoo use this technique for searching the Web, it can be used for any data-intensive computational problem. Amazon lists the following examples: “web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.” Indeed, Hadoop is also the underlying technology used by IBM in its Blue Cloud initiative.

There is even a startup called Cloudera, which offers its own Hadoop computational services on top of Amazon’s EC2. They just got a huge competitor. But more startups can now create Web-scale applications at a fraction of the cost they could before.


Responses


  • Amazon gets more and more interesting every day. Most people think of it as an e-commerce play, but its recent transformation into an infrastructure play has been truly remarkable and a great example of a company thinking very much outside of the box.

    Anjali Sen

  • Usually your writeups are excellent. But this one is quite inaccurate. Map Reduce is not a file system. I’d start by reading the wiki entry here:

    http://en.wikip.../wiki/MapReduce

  • What Notevenclose said.

  • It’s amazing that Amazon is doing all of this stuff. Weren’t they just an on-line retailer a couple of years ago? Now Google and Microsoft are following them into cloud computing.

  • Not a file system at all. Not a database. It’s a distributed parallel data processing engine.

  • The main point of MapReduce is to easily parallelize a repetitive task without having to worry about data management concerns.

    More simply put, it gets you out of the worries of managing individual files and computers, and instead lets you treat your cluster of computers as a single entity.

    This is a huge boon to developers, because it lets them focus on the actual processing without worrying about infrastructure.

    • The distributed filesystem is a different entity entirely, by the way. It just happens to play nicely with MapReduce.

      • By dfs in your statement do you mean HDFS specifically or distributed file systems in general? Map/Reduce basically moves the calculation to the physical location of the data rather than the other way round, as in most other parallel programming paradigms. So, even though it may run on local or non-distributed file systems, its parallel processing power can only be properly utilized if used over HDFS or similar such distributed file systems.

        • You don’t necessarily need to have a distributed filesystem for MapReduce to be effective. If you have processes that are very computationally expensive but don’t require a whole lot of data, then you can even feed your MapReduce job with a single fileserver / database server.

          For the vast majority of use cases, however, you are correct – MapReduce will attempt to minimize the amount of time wasted with data transfer by attempting to execute where the data is. This was Google’s main use case when they developed MapReduce – they needed a way to index all the data they had spidered in a reasonable amount of time. Indexing is a computationally cheap task, and is heavily I/O-bound. Thus, reducing the amount of I/O needed results in a huge performance gain.

        • @Tobias You expressed my original argument better than me. :D
          Coming to the case of using a single machine, the number of map jobs depends on the number of splits performed on the data provided. It defaults to the number of input files if each file is below the defined threshold value. A single machine, though, unless you implement your own multi-threaded MapRunnable class, would run one map job at a time. So there is no actual parallelism. Yes, as you say, computational expenses are taken care of far better than in other systems.
          I’m new to this so my understanding may be wrong, but I would like to know from you whether you have found any significant case studies of single-machine m/r jobs.

    • ok, so would a “Web-Scale Data Processing Engine” be more accurate?

  • All right, I changed the headline and fixed the post. My file system metaphor was off mark. Thanks for the quick feedback everyone.

  • I have decided not to blog about this based on the fact that I simply have no idea what it is. It would have to read, “Hadoop: Amazon’s New Parallel Data Thingy”. Actually, that’s kind of entertaining. I might run with it.

  • What??? a file system? This is a major screw up on TC’s part. Writers should make sure that if they don’t understand the technology they are reporting on, then either don’t write or just read the wikipedia first. I guess sometimes it is more important to be first with the news than to be correct.

    • Oh stop it what??
      The fact that people have jumped in to correct the mistakes already makes this a valuable thread for those trying to understand hadoop.
      When I want to deep dive into hadoop I’d go to the source not TC.
      What I gained from this post is the fact that Amazon has also started offering hadoop…

      • With you on that one K. How many times does it need to be said that Erick’s initial impression was wrong? The man has admitted it and changed things accordingly. As you said, if one wants to understand Hadoop or Map/Reduce, TC is not the place. In fact that piece of initial misconception does not take anything away from the original intent of the article.

  • Erick,

    I love reading your finance/business pieces where you really add critical insight and value.

    But no offense, your pieces on MapReduce (and technology in general) are only getting worse:

    http://www.tech...mbraces-hadoop/

    That’s not the only example of where your technology focused articles are wholly inaccurate.

  • Please correct all references to MapReduce being a file system. It is simply not. The Hadoop project has a file system included, but Amazon is using S3 as the file system instead. I would describe MapReduce as a new programming paradigm for processing huge quantities of data in parallel over many machines.

  • Yeah Techcrunch, I love you regularly but this is really sub-calibre reporting. mistakes still present in the article:

    + map-reduce is not a file system, it’s an algorithm.

    + hadoop is not “the open source version of MapReduce”. it is a framework that some people use for map-reduce. it can also be used for other things.

    + the explanation of map-reduce is pretty hokey. (split onto different servers? not necessarily…) suggest the author take a few mins to read the wikipedia article and provide a more accurate description.

    anyway, unfortunate credibility hit here.

  • Yahoo has been using, endorsing and co-developing Hadoop for quite a while – much longer than a year. It runs probably the biggest existing Hadoop clusters. Also, Google didn’t invent map/reduce.

    • Is the last part a factual distinction or just plain semantics? Sanjay Ghemawat and Jeffrey Dean’s original paper on Map/Reduce came out of their work inside Google. Google was also the first company to actually implement this programming paradigm. So for all intents and purposes Google invented it unless you want to make a strict distinction between Google and its employees.

      • The concept of map and reduce came from functional programming languages where data are immutable. Google didn’t invent map/reduce. They applied and popularized the concept to large scale distributed computing.

        • You are referring to the concepts of mapping and reducing in functional programming languages. On the other hand this blog post and I are referring to Map/Reduce the framework/paradigm and that definitely was invented at Google. Martin might have meant the same as you but his reference to map/reduce in that manner seemed to point to the framework and not the concepts of mapping and reducing. Hence my reply.
          Now, I suggest that you go through this paper: http://www.cs.v...educe/paper.pdf
          to verify that while Lisp and Haskell might have inspired m/r, the mappers and reducers in Google’s version (and by extension in Hadoop’s) do not correspond to the exact same roles in traditional functional programming languages. That, in part, is what lets m/r scale so extensively. In my view Google’s application is ground breaking enough to be treated as a significantly new idea and not just a mere application of old ones.

  • Slowly? They be coming out with stuff a lot. Or maybe I just notice it. Dunno

  • The pricing is the most interesting aspect of the announcement. Apparently the hadoop nodes are shared between many users.

    This is significant as building large scale crawlers/classifiers just got a lot cheaper ($0.4 to $0.06 per hour on large instances) on this platform.

  • Do not be too hasty to jump on Erick for calling Hadoop a file system; after all, it contains a distributed file system as well as a Map-Reduce engine.

    A much more serious error was made when database luminaries David DeWitt and Michael Stonebraker, people who should definitely know better, attacked Map-Reduce as a bad database system. See:
    http://www.data...-step-back.html

  • The prices sound very attractive to me.

  • I laugh at the cyclic nature of technology. Apollo Computers did this same thing back in the 80’s.

    http://en.wikip...Apollo_Computer

    Nice to see good ideas stick around and get implemented in newer ‘clouds’.

    • That is really funny; I didn’t realize Apollo Computers was involved in batch computing.

      I am somewhat familiar with their name because of their DSEE version control system which eventually became (or inspired?) IBM’s ClearCase. Not that ClearCase is universally loved, but it still is pretty common to see it in enterprises today.

      Following the genealogy of technology can be an interesting thing.

  • I’m surprised it costs *more* for this rather than less since it is trivial to use the hadoop tools to do this yourself and this way is far less flexible. I was hoping it was going to be offered at a discount to full EC2.

    • I think they do offer this at a discount to full EC2; for example, an hour of Amazon Elastic MapReduce time on a small CPU instance is $0.015, versus $0.10 per hour for a ‘regular’ small CPU EC2 instance.
