Quick, Plug The Internet Back In: Major Rackspace Outage
Michael Arrington
Apparently a traffic accident is to blame for a major datacenter outage this evening. Rackspace’s Dallas datacenter lost electricity shortly after a traffic accident caused damage to a power transformer. Rackspace’s generators kicked in but, as we’ve seen before, lots of other things can then go wrong. In this case, two chillers within the data center failed to start back up, and a number of servers were taken offline to avoid damage from overheating.
Laughing Squid’s Scott Beale, who is affected by the outage, has been posting status updates on his blog. Since Rackspace doesn’t have any blog or status page for outages, Scott is our only direct source of information right now.
We’re tracking who’s offline. 37Signals is down. Who else?
Update: They’re back up. Total outage time was around 3 hours. Rackspace’s official response is here.





Yep, my business got knocked offline. I’m based in NZ and rang up Rackspace, and they told me “a truck has hit our datacentre”. Didn’t turn out to be quite the case. My site has just come back online this moment though.
all our production servers at Rackspace are offline as are some email services we run through there. affects our blog as well.
Tom Fragala
Truston
yep, we’re back up now. We came back on-line about 20 minutes ago.
Total outage was probably about 2 hours.
It’s certainly not a commodity tonight, eh?
Downtime sucks, and what it shows is what we all forget every minute - it’s power that runs the Internet.
Our startup uses Rackspace pretty extensively for one reason: absolutely no downtime. At this moment we’re discussing plans to move from them (very expensive) to a local co-lo (much cheaper) where we can have more control over operations. I understand that this may not be their fault but no downtime is why we pay, literally, 3 times more than anyone else.
Mike it’s good you’re reporting this. It will certainly be interesting to see how Rackspace responds and deals with the situation.
2230 hours EST - 37Signals apparently still off-line; at least my Backpack services are not responding.
BTW I was worried about their data center infrastructure about 4 years ago when I started using Backpack. Love it. But … the reply they gave me then is safely tucked away, just for reference. You guessed it - on a BP page!
I have 3 clients that went down in this outage. Some longer than others, but a very frustrating experience nonetheless. Yes, Rackspace does promise 100% uptime. The question is how will they respond? We are pretty close to recommending moving 25 servers away from them. They are not cheap, but if there’s no reason for paying a premium, why pay it!?!
Total downtime was about 2 hours. That’s why I pay double the going rate, for their service and their downtime history. I guess you can’t help a wayward truck. What will be interesting to see is how fanatical they are when they contact those affected by this.
We run a price comparison and we were down, but are back up. Matt, you’re right on the pricing– it’s still worth it to us for the great support (they did probably a 4 hour custom PHP/cURL compile that perhaps doubled our price lookup capabilities for no charge) but for folks that don’t use the support regularly, it might be hard to justify that premium. We treat it as an expert employee we don’t need to keep on full-time.
It’s still frustrating, when you read all the battery-backed-up and indefinite-diesel-generation capabilities you expect it will take a serious, prolonged outage before anyone is impacted. If Rackspace can’t do it properly, I don’t know who I would trust.
Man, I’d hate to work in support there right now.
My team is trying to finish a web app project and lost access to a lot of files on BaseCamp. I thought that 37 Signals uses Amazon S3?
Yup, we’re pretty upset about this extended downtime. It’s closing in on 2.5 hours now.
Rackspace has been generally wonderful about support and uptime, however. Their facilities are top notch and stable. But shit does happen and this time it hit a bunch of fans.
We’ll certainly be reevaluating our disaster preparedness in response to this situation. We apologize to everyone who was affected by this.
Our startup lost services both Sunday morning and Monday morning, but came back after some time. We received our first Rackspace email, “Notification of Power Issues in DFW”, on Sunday, so it wasn’t caused by that traffic accident. Despite this, I’ve been very impressed by Rackspace’s services, but we’re looking to move to a “pay for what you use”/Rails-focused service provider.
Downtime most definitely sucks. Whenever I’ve seen an SLA that has 100% uptime, I move on to the next vendor. 100% is IMPOSSIBLE to guarantee!!!
It’s simply a marketing ploy. Good luck!
The irony is that the Rackspace banner was flashing at the top of the TechCrunch page. I guess they don’t host their mission-critical marketing servers in their own data centers… I’m kidding, as this is a third-party marketing platform.
a car smashed into their transformer. this comes after two days of outages.
what is SHOCKING is that their generators cannot run the chillers AND servers. what if there was a real power failure?
we are beyond pissed off at MOG.
the worst part is poor communication. my team thought we were hacked saturday night.
Arnold: We use S3 to host uploaded files, not our actual software, services or databases.
@ Jason
You guys should offer some backup service for your products especially for BaseCamp. Companies like mine are totally screwed without access to the project data!
Welcome to Rackspace. My name is Noelia and I am a Live assistant. How may I help you today?
Noelia: Welcome to Rackspace, how may I help you?
you: when will our servers be back online?
Noelia: An incident occurred involving a power transformer outside of our datacenter. The transformer feeding power to the DFW datacenter was damaged causing loss of power and cooling to the entire datacenter. As a result of the temperature increase, your server had to be shut down in order to prevent additional issues. We are currently working to rectify the situation and will provide updates via My.Rack as they become available.
Weird, I thought 37s was hosted in Chicago.
Arnold: You can export your data from Basecamp or Backpack in XML format whenever you’d like (click the Account tab, then export). Highrise allows you to export your contacts as often as you want (”Export” option in the sidebar) . Of course the services need to be available, but we do offer export from these products.
The high priority right now is disaster recovery. We’ll be looking into mirroring our facilities at another data center so we can “flip a switch” if one data center is rendered inoperable. Or at the very least we can maintain a reduced capacity but baseline level of service during a major outage. The “perfect storm” that hit Rackspace is incredibly rare, but we should be better prepared nonetheless.
We’ll be investigating options once we’re back up and settled in.
Given the ethos that 37Signals tends to exude on their blog, they will probably just give out 6 months or maybe even 1 year credits to everyone affected. Gotta earn those customers back.
On second thought, I have yet to find anything that matches Basecamp so maybe no one will dare complain…
Don’t these guys have the slogan “bullet proof” hosting? Obviously not truck proof. I’ve never used them. Insanely expensive, and now I definitely won’t.
Threadless is also affected. And I can just copy and paste what Jason said:
“The high priority right now is disaster recovery. We’ll be looking into mirroring our facilities at another data center so we can “flip a switch” if one data center is rendered inoperable. Or at the very least we can maintain a reduced capacity but baseline level of service during a major outage. The “perfect storm” that hit Rackspace is incredibly rare, but we should be better prepared nonetheless.”
PR trick …. lol
Probably not, but either way it’s PR
I clicked on my first ever ad on TC today, TheItRoom.com, around 8:15 pm PST and their site was down. Our site is on rackspace and we are back up. That advertiser is wasting my precious attention!
I have three client servers at Rackspace, Dallas. None of them went down at all. Even though none of my servers went down, I received numerous emails from Rackspace over the last two days keeping me informed about the status - and then they called me by phone today to make sure I was in the loop.
I guess I am lucky I stayed up - but Rackspace also told me they have three generators, but needed only one to power the site. I think they have things under control.
Rob
I cannot see this as a PR gimmick. Not good for business.
Just had to point out that Rackspace advertises itself as “The Zero-downtime Network”:
http://www.rackspace.com/whyrackspace/network/
Of course, who could have foreseen such a chain of events started by a truck (Ted Stevens perhaps?)
gome see hus
This should have never happened.
They keep blaming it on a truck hitting the transformer outside. What about the UPS system? Do they not have one in place? If they did, the cooling system and servers would never have shut down and needed to be started back up.
Also, don’t they ever test these systems and have redundancy for everything?
JamBase was down for about 2 hours tonight. Hope no one missed any shows.
I honestly love Rackspace for their fanatical support, and know they’ll probably make good on their promises. They were responsive on the phone and did a decent job keeping me updated via a support ticket I filled out. I imagine we’ll be seeing a big update from them about redundancy and fixes.
Mercury must still be in retrograde or something…
Michael, you forgot to add your disclosure you’re also in this market.
I use rackspace and had no down time, though things ran pretty damn slow for a while (and it had nothing to do with bandwidth usage).
http://www.zimbra.com also went down.
Mercury came out of retrograde Nov 1, so that can’t be it.
Our servers are hosted at Rackspace and while we received a support ticket notifying us that there was a critical issue, our website remained online.
37signals and our products are back. We’ve posted an update and summary of the downtime.
FreshBooks was down at the same time, so I’m guessing they were involved as well. I had a mini-stroke thinking that all my financials had just evaporated to the big DeadPool in the sky. Whew!
Man, I was just about to talk to them about moving my customer hosting over to them. I wonder what the sales guy will say. I still feel that everyone goes down at some point. Rackspace still seems like a great place to host. I have been hosting sites for 12 years; small co-los are OK, but not worth the headache.
Is it just me or TC’s page also loads painfully slow, especially comments and stuff on the right hand side…not sure why linkedin has been on and off a LOT lately.
Wow… this was such a useless post.. thanks for nothing
@37 Voices.com CEO
Dude, great url (voices.com). Must be nice knowing you have a 7 figure domain name….
Just got a personal call from the Rackspace CEO. He explained everything that happened and apologized for the mishap. Sounded like a terrible situation (truck accident, EMS, power shutdown cause the guy was still in his truck, etc). They did the best they could given a really unusual (im)perfect-storm situation.
Looks like Ron Paul got knocked out for a little while too. It’s a conspiracy!!!
I’m not a Rackspace apologist, but it’s pretty amazing this downtime is from a freak accident when other hosts I’ve been with have had downtime “because.” If it takes a truck accident to take out my clients’ sites for a few hours, then I’m sticking with Rackspace.
I’m still waiting for more details before I pass judgement. We don’t even know how many servers in the facility went offline. My servers were fine, and they are hosted at Rackspace’s DFW facility, so I know that the *whole* datacenter was *not* impacted. Seems to have only been a partial outage.
Surprising news, but not world ending. Look at every other major provider out there, compare their track records for stability and outages, and you’ll see that they still are worth the money. The first major outage in seven years? That’s still pretty phenomenal. Or rather, as they call it, Fanatical.
they don’t have a mirrored site? Ouch. With their very high prices I just assumed they did….
I want to say “sorry” to our customers who were affected by our downtime. We promised you no downtime and we failed you. We will make it right. We are determined to restore your faith in us.
To Amy Wilsch:
A mirrored site? Perhaps a magical mirror that harnesses the power of the universe to keep several thousand customer servers synced in real-time to exact replicas somewhere across the globe?
Get a clue, and stop spreading uneducated FUD.
Their response is going to be the true measure of the strength of their product. From the post above, it looks like they are taking this quite seriously.
OMG! my favorite porn site is hosted at rackspace. This is a crisis.
Get over it. Rackspace Rocks.
at #50.
A true testament to why paying much more than I “need to” for Rackspace is MORE THAN WORTH IT.
We have had BootsnAll.com hosted at Rackspace for almost 8 years. This, as most have stated is the worst outage we have experienced in that time. Yes, they are more expensive than everyone else, but we keep our most valuable properties hosted at Rackspace cuz they actually are Fanatical. It’s not just the tag line.
I won’t crucify them and leave because of this. If it became a habit, yes I would. But we have had dedicated servers at 5 other competitors to Rackspace and no one has been close to the quality and uptime of Rackspace up to this point.
HireVue.com - still down - 4 hours+
Note to self, don’t park your data center (and generators) right next to the freeway.
Other than tonight, we have been so happy with Rackspace they really are great and I will vouch for them any time any place.
The problem isn’t the outage per se — their record is good — but poor communication about the issue at the beginning. They need to copy salesforce and trust.salesforce.com (heck, we did at trust.echosign.com) and be 100% transparent the moment there is a problem. Things happen, but if you don’t know what’s going on quickly enough, confidence is undermined.
If they informed their customers, then I don’t understand the “poor communication” statement. Perhaps they don’t have a blog for the whole world to browse, some would consider such a practice a security liability.
OMG my interwebs were off for 3 hours! My whole business model is so lean that it cannot withstand an outage of more than 12 minutes! I AM RUINED!!!!1
Darn you rackspace and your non-truck-proof transformers! Why wasn’t I informed
I’ve been following it all night. The company I work for has fits over 10 minute hiccups in performance. This had people nearly apoplectic.
However, they did make it a point to stay in touch with our IT department, and the people we talked to were all pretty straight up with the situation. Can’t really ask for better.
Our office experienced a similar thing when a truck actually crashed directly into a large power box and took out a chunk of our town’s entire power grid. It took at least 3 hours to get power back up and running. As mentioned before, these things DO happen.
Well, our team was engaging key investors this Sunday/Monday and was also in the middle of our biggest outreach program since launch. And Rackspace let us down on Sunday morning for four hours (no server, email, nothing…emails bounced back! We basically didn’t exist!). They said “power failure.” And we were not even notified when it happened!
And then, after 24 hours of me explaining the situation to countless people…and assuring them that it was a rare one-off circumstance that would never happen again…IT HAPPENS AGAIN. Our server is still down.
In all seriousness, this could destroy a business. Rackspace’s whole “zero downtime” guarantee has actually been almost 10 hours of downtime in the past 48 hours (not to mention GREAT costs to the credibility and revenues of many businesses out there including my team).
What corners have they cut with back-up systems, generators, etc!? Truly destructive.
Reminds me of this news clip:
http://www.youtube.com/watch?v=z4vDClhnJjs
I have been hosting at Rackspace for well over 3 years and have never experienced this type of problem ever. Freak accidents do happen, so I can’t get too upset at them. Rackspace has helped me countless number of times with any problems I have in my environment. They do truly seem fanatical and willing to help.
“100% is IMPOSSIBLE to guarantee!!!
It’s simply a marketing ploy. ”
what? no it’s not, it’s a _guarantee_, which inherently means if they can’t maintain that standard, they will compensate you in some way. It doesn’t mean they will never lose power, etc.
It’s not only the truck incident that took power off; we were down for two hours from 4:30 am CST to 6:30 am CST on Nov 11th. No information was posted on their website, and there was not even a single email or phone call. The information was posted around 6:49 am. They said they would inform us when they switched back to utility power, but again we went down for 20 minutes around 5:30 pm CST. Here are the details of the event as posted on Rackspace:
Sunday, Nov. 11, 2007
* 4:19 a.m. CST — A problem in the internal utility power distribution grid caused an outage to cabinets in one section of the DFW data center.
* 6:49 a.m. CST — Power was fully transferred to generator power. Based on building monitoring systems, outage times varied for every customer. DC engineering worked to isolate the internal utility problem and restore the integrity of the internal distribution system.
* 6:32 p.m. CST — A separate incident occurred when a breaker in the generator power grid tripped, causing one of the Power Distribution Units (PDUs) in the same section of the DFW Datacenter to fail, affecting a much smaller group of the customers in this section. All customer devices with dual power supplies in this section of the datacenter remained online and were not affected. Customer devices with single power supplies in this area were affected. Data Center technicians immediately acted to minimize the impact on these customers by moving these devices manually to alternate power supplies - resulting in just a few minutes of downtime.
* 7:40 p.m. CST — The breaker problem was diagnosed and resolved, bringing the down PDU back online.
Monday, Nov. 12, 2007
* 4:00 a.m. CST — The Data Center engineering team had the initial utility distribution grid realigned and resynchronized. All systems reported ready for operation.
* 4:30 a.m. CST — Transfer of power was initiated and affected devices were slowly moved off of generator power and back to internal utility distribution power.
* 5:10 a.m. CST — Transfer of power was completed.
* 5:25 a.m. CST — Unfortunately, the internal distribution grid failed again. Data Center engineering was able to transfer all affected devices back to generator power in under 15 minutes.
* 5:40 a.m. CST — All affected devices were back on generator power. The Data Center environment is stable and is designed to be able to run indefinitely on generator power. Data Center engineering is continuing to diagnose the problem and engaging all vendors onsite.
Luckily we didn’t go down when the truck accident happened.
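For anyone who wants to add up the windows in that timeline, here is a rough Python sketch. The timestamps are copied from the Rackspace notice quoted above; the labels, the `WINDOWS` table, and the function names are made up for illustration. Note that these are incident windows, not per-customer downtime: the notice itself says outage times varied for every customer, so treat the numbers as upper bounds.

```python
from datetime import datetime

# Timestamp format for the entries below (all times CST, per the notice).
FMT = "%Y-%m-%d %I:%M %p"

# Hypothetical labels; each pair is (incident start, stated resolution),
# taken from the Nov. 11-12, 2007 timeline quoted above.
WINDOWS = {
    "utility grid fault -> on generator": ("2007-11-11 04:19 AM", "2007-11-11 06:49 AM"),
    "breaker trip -> PDU restored": ("2007-11-11 06:32 PM", "2007-11-11 07:40 PM"),
    "grid failed again -> on generator": ("2007-11-12 05:25 AM", "2007-11-12 05:40 AM"),
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two timestamps written in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def window_minutes() -> dict:
    """Map each incident label to the length of its window in minutes."""
    return {label: minutes_between(s, e) for label, (s, e) in WINDOWS.items()}


if __name__ == "__main__":
    for label, mins in window_minutes().items():
        print(f"{label}: {mins} min")
    print("total:", sum(window_minutes().values()), "min")
```

The Monday evening truck incident is not in this table because the quoted timeline stops before it.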
This is why Rackspace rocks….
http://www.rackspace.com/infor.....center.php
I colocated with Rackspace for about a year, with a previous project. My experience confirmed what I have long known to be true, companies that spend heavily on advertising do not provide excellent service, as their growth engine is marketing, rather than returns from investment in resources.
Interesting…… Rackspace does not do colocation, they do managed hosting.
Maybe you should figure out what you’re talking about before you post?
how’s this for irony - our servers and the servers of many others aren’t up, we can’t get to our website or check our email but sure enough Rackspace can and their site is running swimmingly! I wonder who their hosting provider is?
After now 7 hours and counting this is getting ridiculous even if we love them.
@69 Mark Newman
RS has 6 datacenters besides DFW. The recent issues did not affect the whole DFW datacenter or any of the others. Perhaps the website is hosted at one of the other DCs or in a portion of DFW that was not affected.
Unfuddle was down last night too. It’s a great night when your version control and tracking system disappears.
As a Rackspace customer and prior employee, I would encourage people to pay attention to all the details before they comment, make assumptions, or provide their opinions. I’ve been very happy hosting at Rackspace over the years, and I know how much time, effort, and care they put into trying to run the best hosting company possible.
To address a couple things on here:
#65 and 66, thank you for the detailed post. Many people seem to think that a single incident with a truck caused all this, and that’s not the case. It seems to be a series of unrelated and unfortunate events.
The truck accident occurred after an internal power failure, which seems to have affected only 1/3rd of the DFW datacenter. This part of the datacenter was already on generator power at the time of the truck crash, and when the crash happened ALL systems transferred to generator normally. There was no direct downtime at the time of the accident. The problems started when the chillers failed to start on generator and had to be brought up manually. The rapid increase in temperature (until the chillers came online) was the reason for Rackspace to start proactively shutting down servers, hoping to save customers’ systems and data by not frying them.
#64, 100% uptime is an SLA guarantee, a goal, something that is strived for, and if not reached they will compensate you for it. I feel Rackspace has held very true to this over the years. Does it mean they should just say 99.5% and not try to meet such high expectations, or compensate their customers? Not likely. Nothing is perfect, though it doesn’t mean that I don’t want my hosting provider to make every attempt to achieve perfection.
#69, Rackspace’s site does appear to be hosted in DFW, and like other customers that posted here it remained up. Rackspace’s update on their site says that only a part of the DFW datacenter was affected.
From the Rackspace Website, http://www.rackspace.com/infor.....center.php
“We cannot promise that hardware won’t break, that software won’t fail or that we will always be perfect. What we can promise is that if something goes wrong we will rise to the occasion, take action, resolve the issue and accept responsibility. If you are a Rackspace customer and don’t think we’ve lived up to this promise at anytime during the outage, please let your Account Manager know.”
My thoughts - The promise (or guarantee) was that there would be 100% uptime, not that they will do their best blah blah blah. We all do our best. The Account Managers should call us and compensate us financially for the downtime because we are paying a very high premium for the promise of a 100% uptime. Asking us to call you because you broke your promise (100% uptime) is not acceptable. Call us Rackspace and compensate us for the downtime!
Shit happens but at least they seem to have been very professional about it. They are too rich for my blood but nice to know there are some alternatives out there that care as much about hosting as my current provider.
Jon
hmm…I thought TechCrunch was hosted by MediaTemple, no?
it is not that simple. When they went to aux power all the cooler systems did not kick back on and a bunch of us had servers shut down after getting too hot. We are still not up.
can somebody inform us what the compensation will be for not meeting the guarantee? will it be like a month free, or a couple bucks, or a free hat, or what? i’m curious to know.
please keep us posted
Well look at the bright side. No one cut a hole in the wall and walked off with a bunch of servers like at C I Host.
Like RS, as a business you have to plan for multiple risks. If you are heading for an investor pitch, you better make sure you have a localized version to step in. If that localized version doesn’t work, you should have other points of discussion printed out. RS won’t cause a business failure, but poor planning will.
Just another small reminder from the tech deities that computing is not a constant, and as was posted earlier, even a utility can fail.
For # 72
Thank you for taking up for Rackspace.
Please also note the only reason why the chillers failed is b/c the Utility company actually shut down all power in order to safely remove the accident victim.
I was also down for 3 hours last night.
They shut down our servers due to the heat at the datacenter. From what we understand:
In the second incident at approximately 6:30 PM CST Monday, a vehicle struck and brought down the transformer feeding power to the DFW data center. It immediately disrupted power to the entire data center and our emergency generators kicked in and operated as intended. When we transferred power to our secondary utility power system, the data center’s chilling units were cycled back up. At this time, however, the utility provider shut down power in order to allow emergency rescue teams safe access to the accident victim. This repeated cycling of the chillers resulted in increasing temperatures within the data center. As a precautionary measure we decided to take some customers’ servers offline. These servers are now back up, as are the chillers.
So it seems the redundant systems worked, power and all, but the chillers failed when they had to cycle them multiple times because of the accident victim.
Although all of our servers and our image suffered, I can’t say enough good things about Rackspace and what they’ve done for us. I mean, with all my experiences with datacenters (esp The Planet), they handled everything as well as I could ask for. They’ve gone above and beyond with any support request my team and I have had, and they are simply… Fanatical, as much as I can expect them to be.
Down for almost 2 hours last night. Longest ever.
I think RS *could* have done much better at communication, but they were pretty good at answering the phone and whatnot.
Although I don’t really like interacting with the sales side of RS, their techs are TOP notch. I sleep very well knowing my servers are at Rackspace. Won’t consider moving them one bit.
I think the problem is that they have been perfect for 8 years….that kind of record breeds complacency! Why weren’t they testing their generators every week? Why not testing various scenarios for failure?
It is certainly a wakeup call for all of us who might have believed the 100% ‘guarantee’ — but, then again, is there anyone who’s better? If not, then we have to do our own mirroring etc systems, and the whole Web2.0 thing just became a lot more expensive to set up properly.
Someone asked an interesting question on another site, and I have yet to see it answered, here or anywhere else. Is the guy driving the truck ok? Bottom line…that’s more important than data any day.
We have sites and servers in both Rackspace Managed and Rackspace Intensive. We moved most of our operations from other hosting facilities to Rackspace specifically because of downtime related to power failures. The problem with other facilities was not the downtime, it was the handling of the event, the communication (or lack of) and the response.
100% SLA is awesome, but we all know that is an ideal and things do break regardless of the amount of testing and planning, even if infrequently. What really counts is the response that we received from Rackspace. Proactively turning off our servers and notifying us is way better than the overheated failing servers we’ve had in the past.
The bottom line is that the response was Fanatical, and I am confident in the RS team to find room for improvement.
We had 9 sites go down for 5 hours total last night. Two were news sites that people expect to be available 24/7. Rackspace was very apologetic and offered a one-month credit for all of our devices. I’m not impressed with their ineffective backup system (and neither are they; they’re going to be looking into improving it as well). But I am happy that they proactively reached out to us and offered a remedy.
#83, I don’t think it breeds complacency at all in this case. If you think they don’t test their generators (under load) weekly, I’d bet you’d be mistaken. I think if you look at the ENTIRE chain of events that happened you’d see that a number of things were truly resilient, and that while this was a significant problem, it only took down about 1/3 of their DFW datacenter. I think I’ve heard of 3-4 different significant and untimely events occurring in series and parallel over the last 3 days, which I’m sure would test anyone’s ability to withstand.
Hi Thomas - I think that if you guarantee 100%, then you should have systems that can withstand 2-3 or 4 or 5 events happening at once.
Absolutely, and it would seem that they have those systems in place. However, nothing is or ever will be perfect. Systems will fail, the backup systems for those systems will fail, accidents will happen. A good thing I see here is that they own up to the situation, no excuses, and I would guess that they will likely re-challenge their already “proven” stable infrastructure. As for the 100% guarantee, like any warranty, sometimes you have to use it. I’d rather they set the bar high and be “Fanatical” under pressure than call it 99.9% and tell me to get back with them in 9 hours when it matters.
In all, with 100% uptime for my systems for 4 years now, I have a hard time complaining about a freak occurrence like this. And beyond that, they’ve had like 7 years total of uptime.
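The “9 hours” figure is roughly what a 99.9% guarantee allows if you measure it over a full year. A quick Python sketch of the arithmetic, assuming a 365-day year and a 30-day month; the function names are made up here, and nothing in this thread says how Rackspace actually measures its SLA period:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at a given availability %."""
    return period_hours * (1 - availability_pct / 100)


def achieved_availability_pct(downtime_hours: float,
                              period_hours: float) -> float:
    """Availability % actually delivered, given downtime within a period."""
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    # Yearly allowances for the two figures mentioned in this thread.
    for pct in (99.9, 99.5):
        print(f"{pct}% over a year allows {allowed_downtime_hours(pct):.2f} h down")
    # A 3-hour outage (the figure in the post's update) in a 30-day month.
    month_hours = 30 * 24
    print(f"3 h down in 30 days = {achieved_availability_pct(3, month_hours):.3f}% up")
```

So 99.9% works out to a bit under 9 hours a year and 99.5% to a bit under 44, while a single 3-hour outage puts a 30-day month at about 99.58%.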
My first thoughts were of the truck driver also. If anyone knows about the health of the accident victim, please post to that effect.
We are still down. They seem to have run into a few more issues than just a power outage. I agree that they have been great to respond and work with so far so if our start up company can withstand the cash flow issues associated with this down-time then I’ll continue to stick with them.
myewellness.com