When a cloud-computing service like Amazon’s S3 goes down, as it did again this weekend, it raises the question of whether such services are ready for prime time. If they keep going down, can startups rely on them enough to build their businesses on top of them? At Fortune’s Brainstorm conference in Half Moon Bay today, in response to a question from the audience, Amazon CEO Jeff Bezos said that the company takes reliability and uptime very seriously, and knows that they will be the basis of competition in this budding industry. He said:
I think we are at the dawn of what will be an important industry. Important industries are rarely built by one company. Companies that can demonstrate high-availability track records will have bigger parts of this market. Market mechanisms will push companies to reliability.
When we have an outage like yesterday, we see that as a crucial driver. We won’t be satisfied until we have uptimes and availabilities that are statistically indistinguishable from perfection.
When we have a problem, we know the proximate cause, we analyze from there and find the root cause, we will find the root fix and move forward.
Perfection is not necessary. Amazon’s S3 customers would probably be happy with just more redundancy so that they don’t have to suffer eight-hour outages. Amazon will no doubt get a pass as it learns how to scale these services, but how long will that goodwill last? There are alternatives out there, with more coming online every day, even if they are not as well known.
(On a different subject, when talking about the Kindle, he had a nice quote about the importance of devices becoming invisible:
We had a microwave oven that would beep every minute until I turned it off. I called it a self-important device.
This is a favorite theme of his, and it is a good design principle. He didn’t mention anything about the next-generation Kindle, though.)





Eight of the most painful hours of my life (only semi-sarcastic). It was pretty stressful, and even though I am still going to use S3, I am looking into a backup system ASAP so I can just ‘flip a switch’ when this happens, so half my site doesn’t go down.
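A rough sketch of what such a ‘flip a switch’ read path could look like, assuming each storage backend is wrapped in a plain function that either returns the bytes or raises an error. The names `StoreUnavailable`, `fetch_with_fallback`, `primary` and `backup` are made up for illustration and are not part of any real S3 client library:

```python
class StoreUnavailable(Exception):
    """Raised by a store wrapper when it cannot serve a read (hypothetical)."""


def fetch_with_fallback(key, stores):
    """Try each store in order and return the first successful read."""
    last_error = None
    for store in stores:
        try:
            return store(key)
        except StoreUnavailable as err:
            last_error = err  # remember the failure, move on to the next store
    raise StoreUnavailable(f"all stores failed for {key!r}") from last_error


def primary(key):
    # Stand-in for the main storage service during an outage.
    raise StoreUnavailable("primary is down")


BACKUP_DATA = {"logo.png": b"fake image bytes"}


def backup(key):
    # Stand-in for a secondary copy kept somewhere else.
    try:
        return BACKUP_DATA[key]
    except KeyError:
        raise StoreUnavailable(f"{key!r} not in backup")


print(fetch_with_fallback("logo.png", [primary, backup]))
```

The catch, of course, is that the backup only helps if it is kept in sync before the outage; the sketch covers reads only and says nothing about how the copy gets there.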
Wow. 8 hours of down time? Has anyone tried to monetize that down time for the companies affected?
Google App Engine is worse. It just refuses x% of your queries if it thinks they take up too much compute time, or that your data model is relational (yes, it hates relational models).
Amazon at least is upfront about it. They tell you that it’s DOWN, and down means DOWN. App Engine keeps you in LIMBO.
At least they didn’t LOSE HALF THE FILES!!!
http://www.techcrunch.com/2008.....its-doors/
If Amazon’s design goal is 99.99% uptime, then it failed.
Personally I love App Engine. I can’t say enough good things about it. To the person who mentions a % of your queries being blocked, then are you sure it’s not a quota problem? And if it is then talk with the Google guys about raising your quota. I did a few million requests yesterday and had 1 request error. I call that pretty damn close to perfection. And sure the approach to designing an App Engine data model is different - but I’ll take massive, fluid scalability any day.
365 * 24 = 8760
8/8760 = 9.132*10^-4.
This is still about 99.91% uptime, Dave.
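For anyone who wants to check the arithmetic, here is a small sketch. It assumes the eight-hour outage is the only downtime in a 365-day year; the function names are just for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, as computed above


def uptime_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Uptime as a percentage of the period."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_minutes(target_percent, period_hours=HOURS_PER_YEAR):
    """Downtime budget, in minutes, for a given uptime target."""
    return (1.0 - target_percent / 100.0) * period_hours * 60.0


print(round(uptime_percent(8), 2))                # 99.91
print(round(allowed_downtime_minutes(99.99), 1))  # 52.6
```

So a single eight-hour outage leaves a year at roughly 99.91%, while a 99.99% target allows only about 52.6 minutes of downtime per year.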
Server hardware and bandwidth are so cheap now that many startups simply don’t need S3 or a CDN for content delivery or scalable computation.
TL - http://offur.com/BetterThanTechCrunch
I’m still planning to use S3. Amazon’s one of the few places I trust to actually improve after a problem, and historically their downtime stats have been phenomenal. Still, tough for users at the time.
S3 is not so bad. Amazon has really good services, so I’m confident.
brian
http://www.themostpowerfulcompany.com
Mike, for a guy who is outsourcing to India and doesn’t know any code, what would you suggest for hosting a Facebook app? Right now I plan on using Joyent, even though the price is fairly high and I am not quite sure what makes them better than GoDaddy for hosting a Facebook app. I will admit to knowing nothing about hosting and would love some advice on what to look for from you or your readers.
Hey Eric,
I interpreted “indistinguishable from perfection” differently than you. I don’t think he meant “nothing less than perfection” (i.e. perfection, which is unattainable). I think he meant perfection *for all practical purposes*. Subtle difference, I know, but when I read your headline, I had to read the article, because I didn’t think Bezos would say that, and now I don’t believe he actually did.
Of course, the thrust of his statement (and your article) is that reliability has to be off the charts if people are going to base their businesses on you, and on that we all agree.
Shawn
Hellooo, Amazon, have you ever heard of systems monitoring? GroundSource, anyone? Sheesh, they need some serious systems monitoring love over there.
Knowing about a problem and fixing a problem are two different things.
Oops, make that GroundWork.
8 hours seems like a long time to fix any outage in a datacenter, provided no one has driven a truck through your building of course… but even then, reverse the truck up. I wonder what their internal procedures are like for dealing with stuff like this, or are there just a load of tech guys running around panicking?
Of course, there is another time dimension that runs about four or five times faster than the normal space-time continuum, called the “oh crap, my server’s down” dimension.
They’re doing a pretty good job in my opinion though so far.
Where is the competition compared to Amazon S3? When they launched, they were the first. It will evolve and become better.
I remember a couple of years ago Amazon had a 199 Xbox (199?) and it brought down their servers, and for days you could feel it - they probably have come a long way but nothing is guaranteed.
Cloud computing has its downside, as users are at the provider’s mercy. In that link, they comment on whether Cloud Computing is a benefit or a threat.
“Perfection is not necessary…” Unfortunately, perfection is necessary. Reliability theory tells us that as the mean time between failures approaches infinity, so does the mean time to repair. If it does not fail, you cannot fix it. That is the paradox all cloud computing providers live with.
Thus failures must be allowed to occur, but without your users noticing. To reach that level of redundancy and network control takes time and commitment. An eight-hour user outage means S3 has a long way to go in both.
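To put a number on that trade-off, here is a small sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR). It assumes failures and repairs can be summarized by those two averages, which is a simplification, and the function names are illustrative only:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def required_mtbf_hours(target_availability, mttr_hours):
    """MTBF needed to hit a target availability for a given MTTR."""
    return mttr_hours * target_availability / (1.0 - target_availability)


mtbf = required_mtbf_hours(0.9999, 8)  # eight-hour repairs, 99.99% target
print(round(mtbf))                     # 79992 hours between failures
print(round(mtbf / (365 * 24), 1))     # 9.1 years
```

Under these assumptions, a service that takes eight hours to recover could afford a user-visible failure only about once every nine years to reach 99.99%. Shortening the time to repair, or hiding failures behind redundancy, relaxes that requirement in direct proportion.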
Cloud storage has its place for certain types of content and organizations at a certain level of maturity, but if you need more control over your content and the storage infrastructure, there are alternatives. A little company out of TX called Caringo provides clustered storage software that delivers the type of storage infrastructure that underlies these services at an affordable price. They also claim you can start with a small cluster and grow it seamlessly at your own pace (1TB or 10TB at a time) using standard, commodity x86 server hardware (pick a vendor). It’s interesting, and I’m considering their CAStor product as a secondary/backup site that I manage for my content in case S3 goes down again. I also expect I’ll need to move to my own storage as my user community and capacity requirements grow.
It looks like cool stuff, and they have a free online developer program too. You can integrate and test applications with an online cluster. If you’re interested, you can take a look at it here: http://www.caringo.com/partners_01.html