(Updated) Downtime At Rackspace Cloud
by Nik Cubrilovic on November 2, 2009

A large number of customers of Rackspace Cloud, including Techcrunch, have been experiencing sporadic downtime for the past hour or so. The status blog reports that the service was degraded, and other reports state that it is due to a power outage at the Dallas network operations center. Customers of both Rackspace Cloud and Slicehost are affected, putting services such as Posterous, Dailybooth, tr.im and others out of commission.

I got the first alert as I was stepping towards the door to leave (it is always like that), and when I got back to my seat I found that half the web seemed to be talking about it. The main Techcrunch site was still serving pages to most, due to our super-aggressive-mega-cache, but it seemed that the entire Dallas NOC was being rebooted.

From the status blog:

As of 12:35AM CST Rackspace Cloud engineers are seeing intermittent connectivity to our WC2 cluster in our Dallas – Fort Worth (DFW) and data center. We are working to resolve the issue as quickly as possible and will update the status post accordingly.

If you have any questions or concerns please contact our support via live chat or at 1-877-934-0407 international +1.210.581.040.

UPDATE: As of 1:15am CST, Rackspace Cloud engineers are still working to address the current connectivity issues. We are making significant progress and we will post another update here shortly.

UPDATE: As of 1:30am CST, service has been restored to the majority of our technology clusters in our WC2 cluster. Some sites may still be having performance issues, We are continuing to monitor and address the situation. Additional updates to follow.

From Slicehost (who actually mention a power outage):

DFW Interruption
November 3rd, 2009 @ 01:14 AM

UPDATE 1:16AM CDT: Power has been restored, however, we’re working to check all our systems and make sure everything comes back up correctly. Slices have not yet been restarted. We’ll try to keep you updated as much as possible.

We are currently experiencing a service interruption in our Dallas data center. Our engineers are currently working to restore connectivity. We will send an update as soon as information becomes available.

And from Scoble, on Twitter:

[Screenshot: Scoble's tweet]

(the list he pointed to is actually a good one to follow if you are a Rackspace customer).

This will likely lead to many cursing the cloud, when in essence there is nothing about this problem that seems unique to being a ‘cloud problem’. What is more concerning is that the NOC seems to have run out of power (almost unimaginable) and then took so long to come back online.

So – how did you all spend the downtime? It seems most admins and devs from Rackspace-hosted companies were just hanging out on Hacker News and IRC bitching about RS :) (first time I noticed that he shares initials with his employer).

As soon as we know what happened, or anything more, we will post updates here.

Update from Rackspace, from their site:

Rackspace has experienced a service interruption during tonight’s scheduled maintenance on UPS Cluster G. We were testing phase rotation on a Power Distribution Unit (PDU) when a short occurred and caused us to lose the PDUs behind this Cluster. The phase rotation allows us to verify synchronization of power between primary and secondary sources.

All power has been restored and devices are being brought back online. The PDUs were down for a total of about 5 minutes. We have aborted the maintenance for the remainder of the evening and will reschedule this for another date.

Service to Cloud sites has been restored and we are continuing to work with Cloud sites customers to bring them online.


Comments

  • spent the whole hour refreshing twitter search until our site was back up.. dunno why. its like crack

  • Thank god it was at 11PM or I would have had some ANGRY clients.. I really hope this isn’t a foreshadowing of downtime to come. Having had dozens of sites on the Mediatemple proto-cloud Grid service I’m pretty used to this.

    Cloud downtime is like an earthquake, you can never quite predict it but the smaller ones always linger as fear of a larger one to come. A “foreshock” if you will.

    Interestingly enough slices I purchased before the RS acquisition were unaffected. I knew the RS buy would corrupt Slicehost just wondering how long it would take…

  • One of my clients’ servers is still down woohoo

    http://setformarriage.com/

    (that was a sarcastic woohoo btw)

  • My sites are back. But yes, it was a major downtime. Something like: Remember backups, just in case!!!

  • There’s irony in the fact that TC couldn’t report the outage (aside from Twitter) because it too was affected by the outage. Glad you’re back up.

  • It looks like there is no such thing as 100% uptime. Although, if a web app’s architecture is good enough, you can often have different web servers running (even with another hosting company / cloud service) and route the traffic to the ones that are still online.

    I am really interested to see what has caused this issue as we’re about to deploy to the Rackspace Cloud and AWS.

  • and this is why all my stuff is in the IAD datacenter in Washington DC :)

  • Does anybody know if there is a comparison study between Amazon, Rackspace and others who provide cloud storage?

  • This sounds all too familiar – not the first time RS has gone down due to a power issue. They need to refocus some of that fanaticism for service toward fanaticism for redundancy.

  • What’s the deal with the Rackspace Dallas center, this must be at least the 4th power issue this year.

    Glad I have all my servers on the EC2, only been with them 1 year but haven’t had any issues at all, never even have had an instance terminate and it is so much cheaper than Crapspace.

      • God that effectively tinurl’d my URL. I so much prefer words in URLS, especially in blog comments where so many people are trying to divert traffic to their own properties!

        That URL is basically an article on Techcrunch on AWS experiencing a few outages, some minor and one major in 2008…

        The cloud is not infallible. But it sure beats running your own servers.

        • I know AWS isn’t perfect, I was lucky and wasn’t affected by that outage, but at least AWS doesn’t market ‘fanatical this and that’ garbage like RS and they seem more economical (although I haven’t seen the Mosso rates since they change to RS Cloud)

  • Im going to catch flak for saying this, but whatever…

    The “very” substantial premiums that rackspace demands to host there just dont add up. I asked during techcrunch50 why we should move away from our current host to rackspace ( almost 6x the cost (Yes, really 6x)) and the response was “uptime, reliability and support”

    • You are darn right it doesn’t add up. There is NO reason at all that a simple metering exercise (per their explanation) should cause an outage. They use cheap power gear, they have cheap support staff, and what you are left with is cheap, unreliable service. I wouldn’t be surprised if the actual situation here involved injury or death; it’s not often a human-induced “short” happens without safety consequences.

  • Well that’s excitingly lacking in detail. Oh well.

  • The website for SYN/ACK PAC, the Political Action Committee for geeks has been down for awhile. Slicehost has been very responsive on both twitter and on their status page. Hoping everything comes back up soon.

  • Their two-bit cloud has problems every single day: http://status.mosso.com. Good thing so many chumps buy into their marketing.

  • Anyone with a site/service that requires 100% uptime SHOULD NOT RELY ON A SINGLE HOST.
    Not even a single host with multiple data centers.

    Don’t have any Single Point Of Failure (SPOF), it’s hardly rocket science. Google, for example, uses numerous colocation hosting providers so that if one goes down you don’t lose the lot.

    If you leave yourself with an SPOF, your service goes down and you lose customers, then that’s your fault for not designing your solution properly.

    If you’re using the cloud, there are multiple cloud providers out there that use the standard APIs for provisioning. Be sure to have up-to-date images running in at least two cloud providers ready to nearly instantly transfer processing to another cloud should one die. That’s the real beauty of the cloud, you’re not chained to physical machines in a single location.

    • Why not? Do you actually suggest everyone who needs up-time to build their own redundant network?

      On a proper network there is no single point of failure in theory; but in real world that’s not always the case.

      Google had down time too so no matter how you slice it it will happen.

      • As far as I can recall all of Google’s recent bits of downtime was caused by other factors (e.g. network configuration mistakes made by them), NOT data center outage.

        I’m not implying people have to build their own redundant network at all, read what I said:

        “That’s the real beauty of the cloud, you’re not chained to physical machines in a single location.”

        Why not be running some servers in (say) Rackspace’s cloud, and some in Amazon’s cloud?

        Voila, redundant network without having to build your own.

  • Time and Again, RS proves to the world why they suck. I haven’t known of such an “expensive” and “premium” company that goes down so often.

  • Certainly not the weather.. it’s absolutely beautiful here in the Dallas/Fort Worth area today!

  • And I was contemplating if I should go to the cloud for my hosting services…..It seems it is not fully matured yet….I think I will stick to traditional hosting for now….

  • Wow, whats wrong at Dallas Center? This is what like 2nd major power issue?

  • you’ll have to excuse me for being perceived as ridiculous, but in rackspace’s defense, unless if they’ve broken an SLA promising 100% uptime, i’m unclear what’s so shocking about this. cloud hosting will inevitably find a way to go down sometimes just like everything else.

    • I have checked the rackspacecloud.com site over and nowhere does it say 100% uptime guarantee or SLA. If you require that, it costs $$$ and typically geo-redundant configs is best since ANYTHING can happen, act of God could bring down a DC and then what can you say?

      Also, this happened during a scheduled maintenance. It sucks that it happened, but it was a hardware failure. At least it happened during a scheduled possible downtime period instead of during the busiest time of the day. And the DC did get it back up and going fairly quickly for the number of machines and customers involved.

  • While it is unacceptable, downtime sometimes happens. Good IT sysadmins must plan for downtime.
    We use cloud computing providers (such as Rackspace) to provide a commercial service to our customers. Downtime is unacceptable in any market, and especially in a commoditized market where customers demand a solid user experience. I wrote a quick blog post on how we use multiple providers to ensure that our service is rock solid:
    http://www.spou...s-unacceptable/

  • Can you say “diversity”?
    Downtime can be engineered out of the equation. How much do you want to pay for bullet-proof?

  • Who hosts their critical infrastructure on cloud?
    Bunch of whiners pay little to nothing and want 100% guarantees…

    same old bunch of whiners who pay the least and complain the most…get a life…

  • I have to change hosting providers as well. I have shared hosting now which is really sloww

  • While I’m no fan of downtime, I really can’t criticize the response of Rackspace. Their support was extremely responsive during and after the incident. The outage caused MySQL replication to break on 50% of my cluster, and I was completely restored within a few hours. I was called by my Account Manager the morning after, and she made sure there was a thorough postmortem.

    All hosting companies are going to have downtime. There is no silver bullet for 100% uptime. Rackspace’s SLA guarantees that, and when they don’t deliver, they make good on it, no questions asked. The day they don’t stand behind it, I won’t work with them.

  • Just to comment on a couple points from the article:
    1) The Rackspace NOC did not lose power, nor did it have to “reboot”
    2) The loss of power in a small section of the DFW facility largely impacted servers (mostly cloud); the Rackspace network in DFW did not go down during this time. Obviously though, if your server isn’t up it doesn’t much matter if the network is

  • how come when twitter is down techcrunch blames twitter.

    but when techcrunch is down they blame rackspace.

    is this typical content and dev boys blaming the ops guys – cause they dont know how this ops ‘black hole’ stuff works?

    or is it business boys not accepting responsibility for having a DR solution that is hosting solution agnostic.

    its probably easier to blame the people who manage where the rubber hits the road. does the same blame get the visibility when a sale is screwed up, or a story is screwed up…poor ops guys
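
Several commenters above suggest running copies of a site with more than one provider and routing traffic to whichever is still online. As a rough illustration only, here is a minimal Python sketch of that idea. The function names (`is_healthy`, `pick_backend`) and the `example.com` URLs are hypothetical, made up for this example, and nothing here reflects how Rackspace, Slicehost, or any site mentioned in this post actually handles failover.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error, or a bad URL.
        return False


def pick_backend(
    backends: Sequence[str],
    check: Callable[[str], bool] = is_healthy,
) -> Optional[str]:
    """Return the first backend that passes the health check, or None."""
    for url in backends:
        if check(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical health-check endpoints at two different providers.
    backends = [
        "https://dfw.example.com/health",
        "https://other-provider.example.com/health",
    ]
    # Simulate the first provider being down instead of hitting the network.
    down = {"https://dfw.example.com/health"}
    chosen = pick_backend(backends, check=lambda u: u not in down)
    print(chosen)  # prints https://other-provider.example.com/health
```

In practice the switch usually happens at the DNS or load balancer layer rather than in application code, and keeping data in sync across providers is the hard part; the sketch only shows the "try the next one" logic.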
