Amazon Web Services Gets Another Hiccup
by Erick Schonfeld on April 7, 2008

Amazon’s Web Services experienced another hiccup today. Early this morning, its Elastic Compute Cloud (EC2) went down for about an hour for at least some customers in the U.S. This follows a major outage of its S3 storage service in February. Companies big and small use EC2 as a virtual data center to run jobs on Amazon’s computers. Customers began reporting problems on the EC2 developer forum at 1:51 AM PT. The problem seemed to be resolved about an hour and a half later.

Amazon does not guarantee 100 percent uptime for its Web Services, although it does strive to achieve that. And data centers go down all the time, no matter who is hosting your data. But more and more companies are relying on Amazon to be able to scale their computing resources on demand and do it cheaply by paying only for what they need. Many Web startups are building their entire businesses on top of Amazon’s Web Services, and even an hour of unavailability is unacceptable. At least this one happened during the middle of the night.

The outage is a reminder that, as Amazon CTO Werner Vogels said last week after a speech he gave about uncertainty, “Everything fails all the time.” And it comes on the eve of what could be Google’s entry into the on-demand computing infrastructure business with the expected announcement of its BigTable cloud database service tonight. As big tech companies such as Amazon, Google, IBM, and others start to compete in web services, reliability will be one of the main features they compete on. (The other will be price.)

Comments

  • People do live outside of California, in different time zones, and even on the other side of the planet, where it’s daytime. For some people this did not happen in the middle of the night. It appears the Valley Bubble and egocentrism of its residents has slipped out of even you! :-)

  • I trust and like amazon, good to see they responded quickly.

    -Check out my website for ways to make money online
    http://mikesmon...b.blogspot.com/

  • I was fairly impressed at the speed with which they responded to the problem. They posted in the AWS forums every 15 minutes, which certainly helped my peace of mind.

  • The EC2 model is very promising – and it’s not just used to “run jobs”, but rather can be used as a web server(s) on an ongoing basis.

    We are planning to use it, but will retain one “real-life” server just in case. This hybrid EC2/traditional hosting model should deliver a lot of advantages over putting all eggs in the one basket. We’ll see!

  • Yes, it bit us also early this morning – Lookery.

    Rex
    Publisher Relations, Lookery

  • Doofus, you are right. There is never a good time for your Website to go down because business is global. But from what I could gather it appeared that the outage was affecting customers in the U.S. only.

    Did any EC2 customers overseas who are TC readers have downtime?

  • I want to know if the people whose instances could not be reached got charged. These things happen and can’t be helped, but let’s not forget this is a fairly young service.

  • “Did any EC2 customers overseas who are TC readers have downtime?”

    Yes, we are from Portugal and some of our instances were unreachable. AFAIK, every instance running in the us-east-1a datacenter was down.

  • With Internet-based business going global, many website owners would not tolerate a downtime of even a few minutes. However, we have to be realistic: we can never take things such as electricity and network connections for granted. Every component in the system has a finite life and can break down at any time. All we can hope for is a good backup plan from a company like Amazon.

  • Vogels made a promise at TNW that they do their utmost to keep AWS running because it’s the same technology that Amazon itself runs on.

    Be that as it may, is the Amazon.com main site ever affected when an AWS goes down? I get the impression that it isn’t.

  • @Steve: I’d have to disagree with you.

    EC2 should NOT be used for general purpose web hosting. It was designed to be a compute service that runs distributed time-INsensitive compute jobs with no guarantees of uptime. The fact that Amazon doesn’t offer an SLA on EC2 is proof of this.

    I suspect that most people using EC2 for general purpose hosting could get a better service/value elsewhere, especially if they don’t really need the compute power of EC2 or the storage of S3.

    Building a complex hybrid system that automatically detects EC2 outages and fails over to a mirrored backup environment is possible, but it adds a lot of cost and complexity — probably overkill for most people needing web hosting.

    My advice to anyone looking for hosting for a business critical site is this: get a service with an SLA commensurate to the financial impact to you if the service went down. If your site is down for 12 hours, what does that cost you? Does your provider’s SLA reflect that cost? In the case of EC2, I’d say no.

  • @Alper

    Considering that this EC2 downtime affected only 1 out of 3 datacenters, it is easy to avoid any downtime at all on a specific website if you have it running on multiple instances across those 3 datacenters.

  • The Techcrunch site was taking a dump last week. How bout a story on how MT has outages?

  • @6

    our EC2 server went down, we are based in australia

  • “At least this one happened during the middle of the night.”

    Good job no-one in Europe uses EC2 then.

  • We have some sites on Slicehost and some on EC2. It frustrates me that we never get a clear explanation from Amazon. We’re considering moving everything to Slicehost, but it’s always good to keep your stuff spread out (to avoid problems like this).

  • We have a large number of EC2 and S3 operations and the majority of them were affected. Of course we did what we had to do: implement our contingency measures and see what we could do during the downtime. The outage lasted an hour and a half, and 5 minutes after everything came back, we were back in business.

    PS. We’re based in the Philippines

  • The other question here is: will we still use EC2 and the rest of the AWS offering? Yes we do. The platform itself is compelling for us to build an offering that a traditional setting like a colo or a DC can’t offer (elasticity) in a very quick and timely manner, so we’re sticking to it. Now if Google or others offer something similar, then we will harness those services as well to complement existing AWS offerings.

  • @CBass: EC2 was designed as a completely general compute utility. It is great for hosting high volume reliable web sites. We have many dozens of customers running web sites on EC2, and the reliability, despite the few publicized outages, has been much better than anything I’ve seen in 8+ years of managing datacenters worldwide. The feature set is now also starting to exceed what’s available in most colo facilities. Regarding SLA, I’ve never seen an SLA having a damages clause that comes even within orders of magnitude of the lost business (unless your business is really small). For me the best SLA is track record and demonstrated commitment and ability to fix issues promptly and to eradicate root causes. So far Amazon has done a very good job at that.

    @Jake: Up to now Amazon has always posted a follow up to issues explaining what happened. It’s just too early for them to get through all the internal root cause analysis to post it. In the past they have posted post-mortems within a day.

    What we can tell so far is that what happened is a network issue disrupting connectivity to the outside world. No servers were terminated due to this and internal connectivity seems to have been fine. Also, more than one failure zone was affected, which is not good. I’m looking forward to hearing the explanation of what happened from Amazon and how they’re going to prevent recurrence…

  • “It is great for hosting high volume reliable web sites.”
    “For me the best SLA is track record and demonstrated commitment and ability to fix issues promptly and to eradicate root causes. So far Amazon has done a very good job at that.”

    Couldn’t agree more.

  • I was trying to purchase a gift for a friend . . . to no avail. So, when will it be up and running again?

  • SoftOtNot.com gets most of its visitors in the middle of the night. They would have screwed me big time … sort of ;)

  • It worries me that more than one zone went down at the same time.

    This really worries me as I thought “zones” was the answer to all our “one down and we’re all down” worries.

  • Thanks for the great article! A quick clarification in Amazon’s defense, though.

    I wanted to point out that although this post is correct, EC2 customers would not have experienced complete failure if they were properly utilizing the tools which Amazon makes available to ensure a fail-safe hosting setup.

    As of a couple weeks ago, Amazon began allowing their EC2 customers to select the zone in which their servers are located. In Amazon’s vernacular, the word “zone” is interchangeable with “physical location”.

    There is a great article on how to achieve fault tolerant hosting using AWS at the link below…
    http://blog.rig...lability-zones/

    If you set up your cloud computing the right way, it’s very possible to achieve nearly 100% reliable hosting. In my opinion, that is why Amazon doesn’t provide an SLA. They count it the responsibility of the customer to be well-versed enough in networking to know how to protect themselves.

    In my opinion, AWS is not a plug-and-play solution for a company which is casual about its hosting. I think AWS makes that pretty clear by calling their product Elastic Compute Cloud and not something a little less academic.

  • Brendan. I agree.

    The benefit of separate “zones” is that you run your main app/database in “zone 1” and have a live backup app/database in “zone 2” (your backup zone).

    If “zone 1” goes down, you simply swap to “zone 2”.

    However, this morning more than one zone went down, which pulled down customers’ main apps/databases and their backup ones.
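
    The zone swap described in this comment can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `pick_active_zone`, the zone labels, and the health map are hypothetical, and nothing here calls a real AWS API. How zone health is actually detected is left out entirely.

```python
from typing import Dict, List, Optional


def pick_active_zone(preferred: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first zone, in preference order, that is reported healthy.

    `preferred` and `healthy` are hypothetical inputs for illustration;
    a zone missing from `healthy` is treated as down.
    """
    for zone in preferred:
        if healthy.get(zone, False):
            return zone
    # Every zone is down: there is nothing left to fail over to.
    return None


zones = ["zone 1", "zone 2"]  # main zone first, backup zone second

# Normal operation: traffic stays on the main zone.
print(pick_active_zone(zones, {"zone 1": True, "zone 2": True}))

# Main zone down: swap to the backup zone.
print(pick_active_zone(zones, {"zone 1": False, "zone 2": True}))

# Both zones down at once, as reported this morning: no zone is available.
print(pick_active_zone(zones, {"zone 1": False, "zone 2": False}))
```

    The last case is the one the commenters are worried about: a two-zone setup only helps while at least one of the zones stays reachable.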

  • Whatever, it’s still better than what companies who used to run all this stuff internally are used to. It used to be they had to suffer the downtime (it is inevitable…even if God was your IT guy), lose the revenue, AND pay for the tech team to put it back up.

  • FYI, the Amazon folks just posted a pretty detailed analysis of what happened. See http://develope...=0&start=75
    It did affect multiple availability zones independently. How humiliating. I’ll repeat what I’ve said elsewhere: We still hear comments about the lack of an SLA. All I can say is that for me the best SLA is a track record and demonstrated commitment and ability to fix issues promptly and to eradicate root causes. So far Amazon has done a very good job at that.
    Now we just want this event not to repeat…

  • wow what an uproar over inability to shop!

  • I’m still amazed that people tie their business so closely to EC2. Especially when it’s so expensive compared to the alternatives, and you can achieve the elasticity by being set up to use EC2 if/when you ever get that massive, sudden surge that’s big enough to make EC2 cost effective.

    Last time I looked at prices, you’d need to use your instances on average 6 hours or less a day for EC2 to be able to compete with equivalent capacity at a managed hosting facility. Even less if you have enough servers to justify managing your own colo. If you use EC2 for your basic capacity, you’re paying 2-4 times more than you need just in case of a spike or sudden growth that you could still handle with EC2 if you hosted your main setup somewhere cheaper. Doesn’t make any sense to me.

    Dependence on EC2 is a big red flag – it shows people haven’t looked closely at the cost.
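
    The break-even argument in this comment is simple arithmetic, sketched below. The prices are hypothetical placeholders chosen for illustration, not quotes from Amazon or any hosting provider, and the function names are made up for this example. The comment does not give its own figures.

```python
def break_even_hours_per_day(flat_monthly_cost: float,
                             hourly_rate: float,
                             days_per_month: int = 30) -> float:
    """Daily usage at which pay-per-hour cost equals a flat monthly fee."""
    return flat_monthly_cost / (hourly_rate * days_per_month)


def always_on_cost_ratio(flat_monthly_cost: float,
                         hourly_rate: float,
                         days_per_month: int = 30) -> float:
    """How many times the flat fee an instance costs if it runs 24 hours a day."""
    return (hourly_rate * 24 * days_per_month) / flat_monthly_cost


# Hypothetical placeholder prices (assumptions, not real quotes).
HOURLY_RATE = 0.10    # dollars per instance-hour, pay as you go
FLAT_MONTHLY = 18.00  # dollars per month for equivalent fixed capacity

print(break_even_hours_per_day(FLAT_MONTHLY, HOURLY_RATE))
print(always_on_cost_ratio(FLAT_MONTHLY, HOURLY_RATE))
```

    With these placeholder numbers the break-even is about 6 hours a day, and an always-on instance costs about 4 times the flat fee. Different prices shift both figures, so the conclusion depends entirely on the rates plugged in.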

  • Will these reliability issues give more traction to AppEngine?

    Matt / Kusiri
