Amazon Web Services suffered a major outage this morning, affecting the thousands of Websites that rely on its storage (S3) and cloud computing (EC2) services. Startups including Twitter, SmugMug, 37Signals, and AdaptiveBlue, for instance, use Amazon’s S3 storage service to store all the data for their Websites. Reports started coming in across the Web, email, and Twitter about the outage (Twitter only uses S3 for file hosting, not its main messaging application). The major difficulties seem to have been fixed, but some issues persist. The outage started at around 4:30 AM PT.
This could just be growing pains for Amazon Web Services, as more startups and other companies come to rely on it for their Web-scale computing infrastructure. But even if the outage only lasted a couple hours, it is unacceptable. Nobody is going to trust their business to cloud computing unless it is more reliable than the data-center computing that is the current norm. So many Websites now rely on Amazon’s S3 storage service and, increasingly, on its EC2 compute cloud as well, that an outage takes down a lot of sites, or at least takes down some of their functionality. Cloud computing needs to be 99.999 percent reliable if Amazon and others want it to become more widely adopted.
Update: A response from Amazon PR:
For one of our services, the Amazon Simple Storage Service, one of our three geographic locations was unreachable for approximately two hours and was back to operating at over 99% of normal performance before 7 a.m. pst. We’ve been operating this service for two years and we’re proud of our uptime track record. Any amount of downtime is unacceptable and we won’t be satisfied until it’s perfect. We’ve been communicating with our customers all morning via our support forums and will be providing additional information as soon as we have it.





maybe a “truck” drove into a “power thingy” and the “chillers” didn’t “cycle” properly…
…nah, that’d be stupid!
Oh Damn! My start-up company was going to rely totally on Amazon S3. But now we will have to think again.
Thanks for the news, Erick!
For sure. We’d switch in a heartbeat for 99.999 uptime at their current prices. They don’t claim that level of uptime for good reason. Today at least you get what you pay for. Their 99.9 SLA isn’t good enough for critical web apps. Skype vs POTS …
@ghaus. nothing is 100%. this is like their first major outage correct?
I was told recently to consider Amazon Clouds for my startup. I’ll reconsider it now.
Cue Nelson Muntz catch phrase.
“Amazon has a goal of 99.9% uptime with this service. This translates to 45 minutes per month of expected downtime. Achieving 99.9% uptime is a significant challenge—but worth striving towards!”
Twitter says they only saw 2 to 3 minutes of downtime.
Your lame ass poor startups can handle that I think…
don’t switch to Nirvanix, it is extremely buggy. cloud computing is not ready yet.
unfortunate.
99.9% uptime -> 0.1% downtime allowance…
if you have money, don’t use it. if not, what other options do you have? you get what you pay for.
Not sure why the commenters are ready to bail on Amazon’s services. Do you have better options at the prices Amazon charges?
BTW…last I read Smugmug doesn’t rely on S3 as their primary storage service.
They aren’t ready to bail. They are just wankers who want to say they have a startup so they can show their friends their name on the exclusive Techcrunch comment board. They don’t have businesses to bail.
There are no better options to date… until Google readies their GDrive, or until Microsoft completes buying Yahoo! (whichever is sooner)
I’m the CEO & Chief Geek at SmugMug.
We do rely on S3 for our primary storage, but we do maintain our own “hot cache” of data in our datacenters, too, which is less than 10% of our total storage.
Our customers weren’t affected by this morning’s outage.
Don’s got the right approach here with SmugMug’s use of S3. It’s great for near-line storage, archiving, and other activities which require relatively low-cost, flexible capacity, ‘good’ performance, and high integrity.
Never build your architecture to require low-latency, high-availability access to S3 or its competitors, because you won’t get those - that’s not what it’s for, that’s not what it’s optimized for, and you’re never going to be able to peel back those layers of abstraction and long-haul network.
We’re still a long way away from the ‘magic infrastructure cloud’ but by keeping the strengths and weaknesses of these hosted services in mind, you can still get a tremendous amount of value from them.
This is just horrible, what a nightmare.
@ 4 & 7
Yes this is Amazon’s first major outage.
For start-ups which gets a lot of traffic, for them 5 minutes downtime means thousands of dollars of loss.
I agree. It is a shame, because cloud computing as a service is a really good idea and I believe Amazon is probably the best example of a good cloud. But the lack of SLAs and problems like this make it hard to trust for anything beyond backups and archives.
Nothing is perfect. You techies need to get a grip on things and be real.
@ 18
“For start-ups which gets a lot of traffic, for them 5 minutes downtime means thousands of dollars of loss.”
You’re so full of it. examples of your exaggerations please…
[ For start-ups which gets a lot of traffic, for them 5 minutes downtime means thousands of dollars of loss. ]
Thousands of Flooz dollars maybe
@15 That’s a great strategy Don, I think all startups should have some kind of contingency plan for any cloud service, eventually things go wrong.
Thanks Techcrunch. I sent you that tip for a story about 10 minutes after the service went down, when nobody on the internet knew about it, and haven’t had a single thank-you from you. Next time I won’t bother.
sounds like #13 is having the “client work” doldrums….
sure they have startups!! so could you! YOU CAN DO IT!!
quit yer pouting…
@#18
If 5 minutes of downtime can cause a startup to lose thousands of dollars, they’re making as much money as, or more than, Fb, MySpace, and who knows…maybe as much as Exxon/Mobil
You’re lame.
Hey #20-
The techies usually know that nothing is 100% reliable. It’s either groupthink or partially-informed folks that lose touch with the fact that systems fail (sometimes for lame reasons).
With a 99.9% uptime promise, everyone in the business should be prepared for 40-ish minutes of unplanned downtime a month.
CG
PS- #15: flooz! lol
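For anyone who wants to check the uptime numbers being thrown around in this thread, here is a rough sketch of the arithmetic. It assumes a 30-day month, and the function name is made up for illustration. At 99.9% it comes out to roughly 43 minutes a month, in line with the "40-ish" and "45 minutes" figures quoted above. At 99.999% it is under half a minute.

```python
def downtime_minutes_per_month(uptime_percent, days_in_month=30):
    """Minutes of downtime allowed per month at a given uptime percentage.

    Assumes a 30-day month by default; real months vary slightly.
    """
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (100 - uptime_percent) / 100


for target in (99.9, 99.99, 99.999):
    allowed = downtime_minutes_per_month(target)
    print(f"{target}% uptime allows about {allowed:.2f} minutes of downtime a month")
```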
Don’t be quite that hard on them. We don’t know exactly what they were doing (if anything) and this is the first major outage they have experienced. I know there is a cost to downtime, but if you have thousands of dollars on the line for being down for an hour or so, you shouldn’t be depending on 3rd parties to house your data. I know they may be cheap, but doesn’t the old saying hold true? “You get what you pay for.”
Use S3 as backup not a content delivery network!!!
For SmugMug, how did you serve the 90% of photos that weren’t in your hot cache?
It seems that “cloud-based” storage services are receiving lots of attention these days and for very good reasons. Since web businesses depend on them, any amount of downtime is unacceptable.
So the fact that Amazon S3 went down is definitely cause for concern. Today’s event clearly demonstrates that Amazon has a “single point of failure” and that their users need to take precautions.
The Amazon situation should not reflect negatively on all storage services. There are companies (Nirvanix being one) that have put together an architecture of clustered nodes around the world with no single point of failure avoiding the potential for downtime for their customers.
-nick
We weren’t down very long at Shutterpond Photo Contests, and all of our photos are hosted with s3.. It was a small hit.. Nothing to fuss about.
If you rely on a single point of anything for mission critical data you’ve got bigger problems.
By and large, cloud computing is more reliable. But that’s hard to see, because outages in the cloud are going to get reported much more widely, since so many more people use it.
The other important issue is that the rate of adoption for things these days goes far beyond what business planners are used to dealing with. Put simply, we’ve eliminated most of the friction from the markets using the web. That’s only going to make it that much easier to underestimate demand and get into trouble.
We’ve gotten so good at reducing adoption friction, that we’ll see a lot of this kind of thing. It just isn’t possible to plan for it.
More on my blog:
http://smoothspan.wordpress.co.....as-a-cost/
Best,
BW
Would explain the problems with Woot! earlier.
@ #8 Nirvanix developer boards are always open, I haven’t encountered any bugs, and obviously there was a Bigger Bug today
One good thing about more and more sites using AWS is that when one goes down, they all go down. If lots and lots of popular sites have an outage at the same time, then from a user’s perspective it shifts the blame from “their site is down”, to “the internet is down”.
SLAs are a joke anyway. They generally only pay you back for the prorated time of outage.
If my customers can’t get to their data, that costs them a lot of time-money, and makes me look bad. Getting my $5 back for the storage is inconsequential.
I’m CTO of ElephantDrive, one of the first consumer facing applications to use S3. We have been utilizing the platform for nearly 2 years, and have been measuring failures, performance, etc. with high granularity.
Our system is built to automatically detect outages, and in those events failover to our internally managed storage.
ElephantDrive users were not affected by this outage, at all. Further, we’re transferring data to S3 nearly continuously, and did not see a break in the hours mentioned.
Does anyone know if the problem was localized somewhere?
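The detect-and-fail-over idea described above can be sketched roughly as follows. This is not ElephantDrive's actual system, just a minimal illustration. `DictStore`, `StorageUnavailable`, and `get_with_failover` are hypothetical names, and the in-memory store stands in for both S3 and the internally managed storage.

```python
class StorageUnavailable(Exception):
    """Raised by a backend when it cannot serve a request."""


class DictStore:
    """In-memory stand-in for a storage backend (hypothetical, for illustration)."""

    def __init__(self, data, available=True):
        self.data = dict(data)
        self.available = available

    def get(self, key):
        if not self.available:
            raise StorageUnavailable("backend is down")
        return self.data[key]


def get_with_failover(key, primary, fallback):
    """Read from the primary store; if it is unavailable, read from the fallback.

    Returns the value and the name of the store that served it.
    """
    try:
        return primary.get(key), "primary"
    except StorageUnavailable:
        return fallback.get(key), "fallback"


# Simulate the remote store being unreachable while the local copy is fine.
remote = DictStore({"photo.jpg": b"bytes"}, available=False)
local = DictStore({"photo.jpg": b"bytes"})
print(get_with_failover("photo.jpg", remote, local))
```

The catch is that the fallback only helps for data that was also written to it, so a real setup would need some policy for what gets copied where.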
Counting on a cloud provider to meet YOUR SLA is bad business - only works if failure is OK for you. More here: http://www.appistry.com/blogs/.....its-clouds
Losing thousands of dollars in a 5 min period would mean you’re losing at least $1K x 288 (periods of 5 mins in a day) = $288K minimum per day. Over the course of a year that is $105,120,000 Gross Revenue. That’s a base if you only lost $1,000 not “thousands”.
I can’t speak for the general consensus but I don’t consider your company a start up if you’re grossing $100mil a year. Maybe a growth stage company but that would seem beyond “bootstrapping”….
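The back-of-the-envelope numbers above check out. Here is the same arithmetic as a small sketch, assuming revenue arrives at a constant rate around the clock and a 365-day year (the function name is made up for illustration):

```python
def implied_revenue(loss_per_five_minutes):
    """Daily and yearly revenue implied by a given loss per 5 minutes of downtime.

    Assumes revenue is spread evenly over every 5-minute period of the year.
    """
    periods_per_day = 24 * 60 // 5  # 288 five-minute periods in a day
    per_day = loss_per_five_minutes * periods_per_day
    per_year = per_day * 365
    return per_day, per_year


per_day, per_year = implied_revenue(1000)
print(f"${per_day:,} per day, ${per_year:,} per year")
# prints "$288,000 per day, $105,120,000 per year"
```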
Weighing the downtime against paying for the scalability is of course an analysis you have to do, but for most startups the benefits of these services greatly outweigh the minimal downtime. Now, if this starts to happen more and more, that’s another question.
It’s good to have the thing have an outage… everybody and everything is still “green” until he/she/it suffers an adversity… that is a true test. I wouldn’t want to hire anyone or use any service that has not had a run in with some sort of fire-test… and survived and grown stronger. I tip my hat to Amazon and will be watching to see how they grow from the experience. Cheers.
I wrote a similar post on GigaOm yesterday regarding cloud computing. In short here it is:
1) SMB and Startups have NO choice. Well, that is unless they want to raise another one or two million dollars in venture capital for infrastructure and tech ops people.
2) SLAs don’t add up to dog crap. OK, you’re down, now what? Sue Amazon? C’mon. …just for ways to build in redundancy.
3) Do you really believe you can run a data center better than Amazon? Think AGAIN! Don’t be silly!
4) Just make sure you have a good PR Firm on retainer to draft a beautiful and sympathizing (“I feel your pain”) press release.
Yes, it’s easy to point the finger when shit like this happens. But when Microsoft Exchange Server acts up, what is the realistic alternative? Today, none. Tomorrow, enterprise email from Google or Yahoo.
@42
Bingo! You’re spot on!
I’ve been through that dog and pony show a few times. It’s very tough.
It builds character… and makes you ……”sip on gin and juice” (as Dr. Dre or Snoop Doggy Dog once said)
SnapSeeker.com was totally devoid of photos. It was a bummer for us this morning. And there’s nothing that could be done. (sigh)
At the same time, where are you gonna find a service like S3 with the ability to scale and serve content with the reputation of Amazon? I’ll stick with them unless they get flakier in the future.
http://www.snapseeker.com
@29 I wouldn’t rule it out completely.
The trick I use is the following:
- Store static files on my server at some sub-domain; e.g. static.mydomain.com
- Store a copy of these files in S3 in bucket static.mydomain.com
- Create a CNAME record in my DNS so static.mydomain.com points to static.mydomain.com.s3.amazonaws.com
Now, if AWS dies, I can quickly edit my DNS so static.mydomain.com points to my server. It may take some time for the DNS to propagate, but usually it will happen very quickly. This may not be the best solution, but it allows one to recover if AWS happens to die. Anyone have better solutions?
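The decision half of the DNS trick above could be sketched like this. It only picks which target the static.mydomain.com record should point at. Actually updating the record depends on your DNS provider and is left out. `origin.mydomain.com` is a hypothetical name for the commenter's own server, and the `probe` callable stands in for whatever health check you trust (for example, fetching a known file from the bucket).

```python
S3_TARGET = "static.mydomain.com.s3.amazonaws.com"
ORIGIN_TARGET = "origin.mydomain.com"  # hypothetical: your own server


def s3_looks_healthy(probe, attempts=3):
    """Return True if any of a few probe attempts succeeds.

    `probe` is a callable returning True on success. Retrying avoids
    failing over because of a single dropped request.
    """
    return any(probe() for _ in range(attempts))


def choose_static_target(probe):
    """Pick the DNS target for static.mydomain.com based on the health check."""
    return S3_TARGET if s3_looks_healthy(probe) else ORIGIN_TARGET


# Example with a probe that always fails, as during an outage.
print(choose_static_target(lambda: False))
```

How quickly a switch like this takes effect depends on the record's TTL, so the record would need a short TTL set ahead of time.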
@40 - This is assuming that we’re just looking at a constant stream of orders, for example an ecommerce site, that doesn’t have any noticeable loss once the service is fully restored.
For many companies, the impact is less the actual amount of downtime and more the fact that the downtime occurred at all - for example, a high-end hosting company that’s supposed to have fully redundant infrastructure will suffer loss of some current and possibly many future customers when an outage takes their customers offline.
For startups that are trying to convince their customers to trust them, something like this *could* have quite an impact. Of course, for others it won’t have a lasting effect after service is restored - really depends on the specifics of their business model.
Today’s event clearly demonstrates that Amazon has a “single point of failure” and that their users need to take precautions.
We use S3 for all user-generated audio content uploaded to kompoz.com. But we also keep a local copy of active (“hot cache”, as @15 outlined) tracks. Truly, S3 has been awesome. It’s fast, easy to implement, and inexpensive. I have no plans to jump ship just because of this issue. Of course, having a strategy like @15’s “hot cache” solution softens the sting of downtime, something all companies should incorporate into their plan.
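A "hot cache" along the lines described above might look roughly like this. It is a minimal read-through cache with least-recently-used eviction, not kompoz.com's or SmugMug's actual code. `HotCache` and `backing_get` are hypothetical names, and `backing_get` stands in for a fetch from S3.

```python
from collections import OrderedDict


class HotCache:
    """Keep the most recently used items locally; fetch the rest from a backing store."""

    def __init__(self, backing_get, capacity=2):
        self.backing_get = backing_get  # callable: key -> value (e.g. an S3 fetch)
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key in self.items:
            # Cache hit: mark as most recently used; the backing store is not touched.
            self.items.move_to_end(key)
            return self.items[key]
        # Cache miss: this call may raise if the backing store is down.
        value = self.backing_get(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used item
        return value


remote = {"a.mp3": b"A", "b.mp3": b"B", "c.mp3": b"C"}
cache = HotCache(remote.__getitem__, capacity=2)
for name in ("a.mp3", "b.mp3", "a.mp3", "c.mp3"):
    cache.get(name)
print(list(cache.items))  # the two most recently used tracks
```

Anything already in the cache keeps being served during an outage. Anything outside it still fails, which is why the size of the hot set matters.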
I love how people get all upset about a short period of down time….. I doubt that their own systems serve the same uptime as S3…. how lame