As many of you know, many of the sites that use Rackspace as their hosting provider were down for about an hour yesterday because Rackspace itself went down. According to an incident report that we’ve obtained, the cause was a power outage at one of its data centers.
While Rackspace has backup systems in place, a series of events apparently caused those backups to fail, resulting in the servers going down. Here’s the key nugget:
The breaker on the primary utility feeder tripped, initiating a sequence of events that ultimately caused a power interruption in Phase I and Phase II of the data center. All systems initially came up on generator power without customer impact. The ‘A’ bank of generators, which support UPS clusters A and B in Phase I and UPS cluster E in Phase II, then experienced excitation failure which escalated to the point where the generators were no longer able to maintain the electrical load. Rackspace then attempted to switch to our secondary utility feeder, but was unable to do so due to an issue in the Pad Mounted Switch (PMS). At approximately 3:15pm CDT, power supply through UPS clusters A, B and E was lost when the batteries in those clusters discharged, and equipment receiving power through those clusters experienced an interruption in service.
The service says only one of its nine data centers was affected by this failure, but many high-profile sites collapsed as a result, including EventBrite, Justin Timberlake’s site and Michelle Malkin’s popular political blog. As Rackspace noted yesterday, “We owe better, and will deliver.”
Below, find the full incident report.

I was ALMOST gonna go with these guys back in 2004 along with their crappy COGENT bandwidth…Ugh
hahaha take care of your own dedicated server man… you have a better chance locking it in your own garage than to deal with their datacenter.
That is just plain stupid.
for many more companies than necessary, it isn’t.
Simon is for real. Anyone who trusts the cloud and outsourced ops is in for what they ask for…
There’s a lot to be said for complete control…and extreme execution speed.
Simply look at Twitter….that piece of shit barely runs…cloud = SLOW.
Yeah, look at Twitter…one of the most successful services of the decade!
You obviously don’t define success to include sustainable businesses.
Bullshit.
Simon, you have no clue..
To some extent I agree but in most cases it’s just foolish NOT to trust a DC.
If (for example) you are someone like Twitter, then yes it is probably worth the cost of setting up your own mini-DC with a couple of racks and a fiber connection or something..
BUT for users with under, say, 5 servers it’s just stupid not to use a decent DC, as a DC *should* have high-speed connectivity with (in most cases) 3-5 separate backbone connections, not to mention the (very expensive) UPS and numerous backup systems, which can’t really be matched from your garage…
Yeah and I’d like to see your SAS70 audit on your garage data center. Please.
One correction to your post – we stated only one of our nine DATA CENTERS was affected, not one of nine servers.
Rob La Gesse
Director of Customer Development
Rackspace
210-845-4440
Ya you guys must have at least 10-12 servers.
Wait, you have more than 9 servers? Ha ha. Thanks Rob, I’ll fix that.
So much for “This document contains Rackspace proprietary information and may not be distributed.”
Was just thinking that…
Yikes. You’d think they would regularly test these systems.
Do you think they don’t?
It could be just Murphy’s law
Murphy’s law is an adage that broadly states: “Anything that can go wrong, will go wrong.”
I don’t run such sites where I make thousands or more hourly. Far from it. So I don’t get the big deal.
But when people forgive Microsoft, Google, Oracle and hardware companies for much larger downtimes and faulty parts, we should probably spare Rackspace for an hour.
(I’m an unrelated “third party”, not a shill.)
But yes, there’s a problem if one machine that is expected to trip trips other machines out of service. That should clearly be redesigned. And every other such component should be looked at again. Now is a good time for a review of their entire design, before anything else fails a few months from now.
I totally agree with that…
While personally I try to hold very high expectations of companies like Google (which is why it annoys me so much when they get things so wrong!), everyone falls; how they deal with it and recover is the important part.
The fact that they had some pretty major failures and were back online in an hour isn’t really too bad.
Seems like in every case of a major host losing power, their backups always fail as well.
To be fair, you wouldn’t hear about the cases where the backup worked.
Still, this happens way too often.
my company has multiple servers hosted at rackspace, all the actual rack units suffered an outage. We have one random box that is a tower unit somewhere else in the building and it continued to function no problem. Which would explain the high profile(rack mounted) sites going down while other sites weren’t impacted.
We will likely be dropping rackspace after this (we use them for hosted Exchange and this is actually our 4th outage in the last couple of weeks).
What bothers me about their “incident reports” is they never say “we’ve done this so it will never happen again”. Read through that report and you’ll see they try to convince you it’s a random occurrence that’s unlikely to repeat, but they mention nothing about adding further protection so it can’t happen again.
That, to me, equals fail.
Tom – the blog will be updated shortly with a message from the CEO that addresses your concerns. Appreciate you as a customer, and would hate to lose you. Know these issues are painful though.
Rob La Gesse
Director of Customer Development
Rackspace
210-845-4440
I’ll keep watching for a solution and it will go a long way if you actually put out a press release that outlines actual steps you’ll take to make sure this never happens again.
Right now it just looks like the same talk that everyone puts out there (e.g. placate people with the release and then they’ll forget it and you never have to do anything)
Tom,
I must say that even though there have been quite a few outages, their support is the best that I have found in the industry. I have not found a service provider which truly has 99.9% uptime. However, I have not found another company that provides support as good as Rackspace’s.
Arnold
We’re hosted at Rackspace. Rackspace delivers an unbelievable level of service. If any of you has had an opportunity dealing with their teams, you would vouch for their sincerity in making sure their hosting customers are taken care of. I’m sure that we’re paying more than just about anywhere else, but their service is fantastic and worth paying for.
I think they plan to address your concerns:
“We don’t have a lot of details on exactly what happened yet. When we have an outage, our first focus is on fixing it and getting customers online as soon as possible. Now that we have the near-term situation stabilized in Dallas, we have some work to do to improve our reliability. We will follow up with more information as we work through our root-cause analysis.”
http://www.rack...com/blog/?p=334
I hope that the root-cause analysis is informative. The outage caused me pain too, and it was not what I expected from Rackspace.
Looks like Rob beat me to the punch. Nice.
The trick is finding a better provider. Over the last 7 or so years I’ve used more than a handful of providers, and Rackspace has been by far the best to work with. Luckily my servers weren’t affected by this one. Over the last 10 months with them I’ve had absolutely no downtime whatsoever, and they’ve always been very upfront about network and power issues.
I’m inclined to drop them too–the DFW data center has had issue after issue this month. I feel like the constant testing they’ve been doing to find out the root cause of the first issue has done nothing but cause several more.
I would have to pay about 1/4 of what I pay to rackspace each month over at softlayer and this is really tempting me to do it. Last time I went with softlayer, they had problems like this each month though. Maybe hiring someone is the answer. It’s impossible to find good sysadmins these days. ughh
Also, have you tried Mosso, the Rackspace cloud? We use them for a lot of other hosting too and they have been great.
http://www.mosso.com
If I were to judge a publication by its readership, the comments here would not compliment TechCrunch.
You people are far too harsh. Shit happens. They dealt with it well.
“Shit happens. They dealt with it well.”
“Shit”, as you put it, may happen, but Rackspace’s customers pay good money (and from what I’ve heard, quite a lot of it) to make sure it *doesn’t* happen.
Since Rackspace apparently can’t do that, their customers are perfectly justified in taking their business elsewhere…
Fanatical support! These guys are something….we were behind on payment but in communication, we promised payment on a certain day, called to make the payment (it had to be that way according to them), called my account rep three times…they shut us off and really messed us up. Account rep’s excuse was she was too busy shutting people off to call me back. We switched out from them and six months later we are still fighting over the final bill, which was thousands more than what was agreed to.
“too busy shutting people off ”
Business must be great for them
This is why I have 2 separate hosts, one is my primary and a separate, independent one is my backup.
Data is backed up to the aux host every 3hrs; if shit goes wrong with the primary, just change the DNS, and the aux is a mirror.
Just gotta plan ahead and not put all the eggs in one basket.
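For anyone wanting to automate the primary/aux setup described above, here is a rough sketch of the decision logic only, not a definitive implementation. The names `FailoverState`, `probe`, `next_state` and the threshold of three failed checks are all made up for illustration. The actual DNS change is left out because it depends entirely on your DNS provider, and the TTL on your records limits how quickly visitors see the switch.

```python
from dataclasses import dataclass
import urllib.error
import urllib.request


@dataclass
class FailoverState:
    """Which host DNS should point at, plus the current failure streak."""
    active: str                    # "primary" or "aux" (hypothetical labels)
    consecutive_failures: int = 0


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts and HTTP error statuses.
        return False


def next_state(state: FailoverState, primary_healthy: bool,
               threshold: int = 3) -> FailoverState:
    """Decide where DNS should point after one health check of the primary.

    Assumption: fail over only after `threshold` failed checks in a row,
    so a single dropped request does not trigger a DNS change.
    """
    if primary_healthy:
        # Fail back as soon as the primary answers again. Anything written
        # to the aux host in the meantime would have to be synced back first.
        return FailoverState(active="primary", consecutive_failures=0)
    failures = state.consecutive_failures + 1
    active = "aux" if failures >= threshold else state.active
    return FailoverState(active=active, consecutive_failures=failures)
```

You would call `probe()` on a schedule (cron, for example), feed the result into `next_state()`, and update DNS only when `active` changes. With a backup every 3 hours, the aux copy can be up to 3 hours stale at the moment of failover.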
What is “excitation failure?” Read like, “Our (untested) primary backup blew up when we threw the switch.”
Just what I was thinking. But ‘excitation failure’ sounds either exotic and unpredictable, or well enough defined that they can fix it with 22.5Nm from your garden variety M11 spanner.
There’s this thing called Google, where in about 15 seconds I found the answer when I entered “generator” and “excitation”.
It’s neither exotic, nor unpredictable. It is in fact a very common term when talking about power generation.
An excitation failure (as already described) is when an alternator is no longer able to sustain a magnetic field and therefore generate power.
An alternator uses a DC current to excite a magnetic field within the coils in the generator, which then in turn allows the generator to generate power. This is usually because a static magnet would be too large, or too unpredictable to work in a larger unit.
So those major sites impacted might want to ask for geographical redundancy for their sites.
If the eight other data centres were not impacted, then replication of their sites would ensure they stay alive.
People should be responsible for their actions. If your hosting provider goes down and you don’t have geographical redundancy, then you have no one else to blame. Otherwise live with it.
Cheers
Shane
We provide geographical redundancy now – but this is a customer requested feature (and a more expensive solution). Some customers choose that solution (and we love it when they do). For most, the cost does not warrant the (statistically) low risk of an outage.
Rob La Gesse
Director of Customer Development
Rackspace
210-845-4440
We had finally launched the new site http://www.pizzatweetup.com only to find out that it wasn’t working after I tweeted it to everyone. Thanks to the rackspace cloud (formerly mosso) I felt like a fool on my first real day!
haha, i’m not gonna order pizza from you!
Yea, you should have just gone with another web host, the plentiful kind that oversells crazily….
Unlimited bandwidth my ass.
Rackspace is still probably the best in the business.
For smaller sites, ZoneEdit (for DNS fail over) and a good monitoring system like AlertFox, AlertSite,… will solve the redundancy issue for a really low cost.
Oh, and once you fail-over in place, you no longer need overpriced Rackspace, a service like Serverbeach will do.
I’m with David – shit happens. Been with various providers, aware of many more – Rackspace have been far and away the best. Customer support rocks, and so far (touch wood) had no outage.
As for the guy with the server locked in his garage – yeah, remind me to host with you!!
I wonder if Rackspace’s outage is an unintended consequence of “greening” that datacenter?
Power factor correction on newer kit installed as updates is different, greener, to that on the kit installed when the datacenter was specified.
Sure it looks the same on the power label but what does it look like to the generator? Very different.
Could be a contributing factor to the generator failing to deliver the load. The UPS manufacturers have been warning about this for a while, with good reason and not just to sell upgrades.
Why did an excitation failure affect multiple gensets, or were the UPS clusters fed from a single generator?
Excitation, for the benefit of those who don’t know, is the process of establishing a magnetic field within the alternator of a large genset. Hence, excitation failure shouldn’t affect multiple sets, and I would have expected a company such as rackspace to have an n+1 synchronised setup.
Perhaps, as mentioned above, the demands of the power factor correction on the hardware meant that the gensets could not deliver the required real power.
Yes, but an excitation failure should only affect a single set (it would be odd for them to share an excitation circuit), and wouldn’t the synchronisation controller be able to drop the faulty set and bring another online?
I think it’s great if Malkin is down.
http://www.thew...siteisdown.com/
It’s the grey one!
So their blaming it on PMS?
I haven’t experienced an outage at The Planet in 4 years and their cheap. I always thought about using them for their “fanatical support”, but have heard lots of anecdotal stories about incidents. At the end of the day, need uptime more than support.
they’re
http://www.data...s-major-outage/
It’s sad that I know this, but that picture of Scotty looks like it’s from the scene in Generations when he learns that Kirk has died.
Frankly, one major blunder in the ENTIRE history of Rackspace doesn’t bother me at all.
MG, I aspire to no longer have a website… just a business started with a hashtag. why?! It’s cuz the twitter search engine is so reliable
We can land a party on the moon, but can’t
- We can’t text to and from an 900# (err 800#)
- We can’t get an accurate average FICO. Fair Isaac claims its 723. The average VC FICO is only 678 http://bit.ly/C0msA
What’s the meaning of “excitation failure”?
Were the generators too excited and anxious and unable to perform?
Larger generators need a DC current to establish their magnetic field, that’s called exciting
I received this message from Rackspace support:
“As part of our ongoing maintenance in the DFW facility, Rackspace will be performing maintenance this evening, from 9pm-6am CDT, on our DFW power infrastructure. Our DFW Engineering Team will be gathering the settings from all UPS (Uninterruptible Power Supply) Clusters and Generators and which will be reviewed with our vendors. Rackspace will also be installing power quality meters between the Automatic Transfer Switch (ATS) and the generators in order to monitor power while performing diagnostic tests. While on utility, a generator signal will be sent to each UPS simulating the exact power filter transitions that occur when a UPS runs on generator power. These tests will assist our investigation of the root cause of past incidents. We do not anticipate any disruption to your service but did want to keep you informed as your configuration is supported by this infrastructure.
We appreciate your patience as we work through this maintenance.
Sincerely,
Rackspace”
… so it seems like they are really digging at the problem. In my experience, all hosts have unplanned downtime which is why we pay rackspace for good support.
Rackspace is a decent company, but I prefer a separation of duties from hardware and administration. This allows an unbiased opinion on the proper architecture setup.
I would recommend the use of Softlayer.com for hardware + a third party company providing administration, such as mnxsolutions.com or rackaid.com.
Did this have anything to do with the Mosso (rackspacecloud.com) control panel outage yesterday?
Take my wife. Please.
Whenever she has PMS I’m left with a helluva case of excitation failure.
Rimshot.
Done and done
I am considering using Rackspace to host my new system, but I keep hearing these stories.
They seem to be very accident prone and all these “six sigma” events seem to happen to them regularly.
Given that they charge top dollar for a premium service, I am shocked that these outages keep happening.
Does anyone know of an alternative?
If Rackspace was a cheap provider then it would be no problem. However Rackspace is expensive and their main selling point is their quality of service.
We keep reading about these RackSpace failures – over and over, and yet how they tout themselves as the top of the pile. Well, as a web-based HelpDesk provider, we need optimum uptime, and we get it from ServInt VPS servers.
We monitor the servers with a 3rd party monitoring service – and uptime is currently 99.9927% since 2007.
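Since uptime percentages get thrown around a lot in this thread, here is a quick back-of-the-envelope conversion between an uptime percentage and actual minutes of downtime. This is plain arithmetic, not anyone’s SLA formula; the function names and the 365-day and 30-day periods are assumptions for illustration.

```python
def downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime implied by an uptime percentage over a period."""
    total_minutes = period_days * 24 * 60
    return (1 - uptime_percent / 100) * total_minutes


def uptime_percent(downtime_min: float, period_days: float) -> float:
    """Uptime percentage implied by a given number of minutes of downtime."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_min / total_minutes)


# 99.9% over an assumed 365-day year allows about 525.6 minutes (~8.8 hours).
three_nines = downtime_minutes(99.9, 365)

# 99.9927% over the same year works out to roughly 38 minutes.
reported = downtime_minutes(99.9927, 365)

# A single 45-minute outage in an assumed 30-day month is about 99.896%.
one_bad_month = uptime_percent(45, 30)
```

In other words, one outage of under an hour is enough to miss 99.9% for that month, even though it fits comfortably inside 99.9% measured over a full year.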
That is the exact reason we went with SherWeb.com. Their system is more redundant than any other we have found and their service is on the spot as well.
Brilliant choice of picture tbh.
We use rackspace as well and we were lucky enough not to be affected.
I have dealt with some DCs in the past and Rackspace has the best customer support by far.
When they say fanatical, they really mean it.
Everyone here is stating that Rackspace is so expensive but the quotes that I have from them compared to other big name hosts say otherwise. Same setup and Rackspace is 1/2 the cost. The pre-sales communication has been great. Obviously concerned about any outages, but this incident has not taken them off our potential host list. But perhaps a different datacenter is needed (Chicago?)
They went down with a power failure again today! 20 minutes!
The power issue only lasted a few minutes, not 20 minutes. If you were down that long, I would say there was something else going on, but it wasn’t because the power was down – maybe your server didn’t power on after the system went to generator or maybe it fsck’d/ran chkdisk on reboot, etc. But according to the RS blog, the power interruption was minimal
http://www.rackspace.com/blog/
Rackspace credited 1/3 of our monthly bill as compensation for the 45 minute outage on June 29. While the outage was painful (our clients phoned continuously), I have to give them credit for backing up their explanation and apology with real money.
If anyone knows of another hosting provider willing to agree that level of financial penalty for missing the SLA, I’d like to hear about them.