37Signals Down – Looks Like Rackspace Is To Blame Again
by Michael Arrington on January 18, 2008

37Signals is having a bad morning, according to their current home page image above. They’re pointing fingers at their service provider, which was (and we believe still is) Rackspace. Last November they suffered a three-hour outage along with other Rackspace customers.

Update: It’s back up, total outage was about 2 hours. Per the comments, 37Signals doesn’t seem super duper happy with Rackspace these days.


Comments

  • cool to know! I almost signed up for Rackspace hosting service…

  • Our service provider is indeed still Rackspace. And yes, we’re very disappointed in how this has been handled. They have a hardware replacement time guarantee that has long since been broken. We apologize deeply to all our customers for this outage. It comes way too soon off the Truck Drives Into Power Station event.

  • Once again web consulting companies like mine have to pause working, because we have no access to data.

  • If it is Rackspace, they’re in for big trouble. It’s understandable to have one freak outage, but two in close proximity – they’re in trouble if this is their fault.

  • 37 Signals is listed as a client of Tilted, another web hosting provider.

  • Dead in the water here.
    Offline version of Basecamp anyone?

  • It’s predicted that there’s gonna be “the big dotcom [technical] crash of 2008”... some early signs of it seem to be surfacing, with frequent and lengthy downtimes of popular services (including Twitter, Blogger, Joyent...)

  • Also I think it’s more 37 Signals’ fault than anything for not having a hot standby in their setup. When you are offering a service on the level of theirs it’s completely irresponsible to not have hot standbys of such a critical piece of equipment.

  • Stop the presses… 37 signals is down. Also Michael has resorted to pasting homepages on TechCrunch… It likely took all of 3 minutes to put this post together…

  • Ahem … i heard that rackspace has really bad uptime for the last months, i’m not gonna sign with them anymore … altho i was very tempted

  • All of my boxes are at Rackspace and we haven’t had any issues. A failed load balancer isn’t really indicative of Rackspace’s ‘uptime’.

  • I agree with Joshua, I run my website with two load balancers, one in stand-by mode, for this very reason. It is not the service provider’s fault when a piece of hardware fails. Hmmmm, what kind of hardware was it?

  • rails to blame :)

  • All our products are now back. We’re working on getting all the blogs etc back online too. Again, we’re so sorry for this outage.

  • They’re up again.

  • Don, good point. Rackspace is only liable to a certain degree when a hardware failure occurs.

    However, it looks like they didn’t live up to their 1 hour hardware replacement guarantee that every customer is afforded in the RackSpace SLA. I have several customers with RackSpace and this is concerning.

  • It’d be interesting to see what that guarantee actually covers. It seems the hardware was actually installed within an hour but the configuration couldn’t be swapped out since the LB died and took the CF card with it. So although the hardware was replaced quickly, the configuration seems to have taken another 30-45 minutes.

  • Well, our basecamp is back up – noticed the downtime earlier, but our use of basecamp is not *critical*

    Web hosts have been having a bad week: first it was dreamhost inappropriately billing folks and now rackspace

    I’m sure all major hosting companies encounter problems occasionally, so keeping that in mind, it is best to have contingencies in place (we learned this the hard way, but are better prepared for the future)

    We use thePlanet, and they have been great (I really like their service compared to other hosts)

  • I’m trying to see where RackSpace is to blame here. Is it the failure of the hardware? The single point of failure for 37signals at their (single?) load balancer? Or just that they’ve taken longer than 1 hour to get the new hardware up and configured?

    For what it’s worth, I have used RackSpace Intensive hosting, and for the most part they are great to work with. Comparing this one incident with a slow replacement of a single point of failure for 37signals to a truck driving into a pole and triggering a wave of events is kinda… fucked up.

    MGZ

  • I use Rackspace, and I love them. We have a RAID 5 that supposedly can recover from a drive failure by just replacing the one drive– does that mean we don’t make regular backups? No, we don’t assume those things always work.

    If there’s a single point of failure in a group of computers, and you have no backup waiting, that’s your fault. Maybe the time to replacement was too long, but if any time at all is unacceptable to your customer, then working without a backup load balancer ready is unacceptable for your service.

    Not Rackspace’s fault.

  • But how was it Rackspace? No other Rackspace customers went down… 37 Signals had a single point of failure.. and it failed. It was inevitable. It’s not the service provider’s fault by any means.

    I feel compelled to defend Rackspace a bit because it’s just horribly silly to not have a secondary LB on standby. Rackspace did their part, they replaced the hardware, got it back online.

    I think it’s horribly unprofessional for 37 Signals to point the finger at Rackspace.

  • Wow, we were told after the last incident that something like this wouldn’t happen. Interesting turn of events.

  • Joshua, network equipment at Rackspace is managed by them exclusively (clients don’t have access to it directly). Their hardware guarantee is supposed to ensure that if anything goes wrong, there will very quickly be a replacement installed. In our case, it took about two hours from reporting the problem to having it fixed. That’s not good enough.

    But you’re right that there should have been a spare sitting ready to go in our rack. That would have made everything happen a lot faster. I can assure you that one such unit will be there at the end of the day.

    Again, we apologize to all our customers for this down time. We’ve posted the details on http://37signal...happened-t.html

  • We had the same problems with downtime. Rackspace should get their act together and make sure they provide quality. If you can’t stand the heat, get out of this space.

  • singingdancingbear - January 18th, 2008 at 11:04 am PST

    Wow, to say that a hosting provider is too blame for 37signals being down is complete hog wash. A hosting provider can help with uptime but not stupid IT decisions. I don’t use either services; however let me give you some free consultation. If you have single points of failure you are going to go down unless hardware becomes perfect. Let me know when you find that hardware provider so that I can buy stock.

    I say if you use 37signals you evaluate their service since it seems like that make decisions have baked.

  • Sucks, but it’s not Rackspace’s fault. Falling back on a hardware guarantee when you can just set up a hot spare is a no-brainer, especially when you are talking about hardware that has minimal configuration such as a load balancer.

    If 37signals admits they should have had a spare, jumping on Rackspace for getting one up and running from scratch in 2 hours instead of 1 seems lame, hardware guarantees notwithstanding. They failed to meet their obligation contractually, but 37signals is the sole person responsible for the extended downtime, not Rackspace.

  • You pay a very large amount of money for Rackspace compared to other hosts; you can expect something from paying that premium, I would think. And you pay for managed hardware; so yes, it is Rackspace’s fault.
    If you cannot trust them to do this faster, you’d better spend your money in other places where you can have more hardware for less money and, at least in *this* case, the same support times.

  • It’s important to note that this is not a Rackspace problem, it’s a 37 Signals problem. One of the machines that 37 Signals owns or leases failed, as hardware occasionally does. It’s not really 37 Signals’ fault either, although they should probably have a backup load balancer when they’re providing a service as popular as Basecamp (as another poster mentioned). Rackspace is an excellent hosting company that has provided great service to the company I work for. It’s not like they’re asking to be bombarded by flying trucks. :-)

  • This is like not wearing your seatbelt and then blaming the ambulance for the fact that your face got smashed into the windshield in an accident because the ambulance took 10 minutes to get to your accident site instead of 5.

    Sure, it’s better for the ambulance to get there in 5 minutes, but if you had been wearing your seatbelt, your face wouldn’t be smashed into the windshield, and how fast the ambulance got there has nothing to do with whether you could have completely avoided getting your face smashed into the windshield.

  • Michael,

    I’m really surprised that you pass the blame off in your post on Rackspace. Why should a 37signals customer care about who is providing the hosting and any problems they might have? From the customer’s perspective, there’s simply one party to blame here, and that’s 37Signals, not some service provider the customer should have no reason to even be aware of. I hope you follow up on this and focus on the necessity of online companies to have adequate backup/redundancy/etc in place so their customers don’t suffer in situations like this.

  • Complete dependence on one data center = non-mission critical service. If someone wants to operate a mission-critical service then they should be prepared to have a nuclear bomb fall directly on one of their data centers and have no service interruption, or at least be back up in minutes with minimal if any data loss. This is Ops 101 and is done all the time. It’s just a bit more expensive.

  • singingdancingbear, aside from your extremely poor grammar, I agree with what you said, “I say if you use 37signals you evaluate their service since it seems like that make decisions have baked.”

    As someone else mentioned, it is unprofessional of 37 Signals to publicly blame Rackspace for their single point of failure. Is it true that rackspace did in fact have the hardware in place within an hour, but configuration was an issue? If rackspace was responsible for the configuration too, they should obviously live up to their uptime guarantee with compensation. However, they did at least try and had the hardware ready as promised.

    From my perspective of what I’ve read so far:
    Downtime blame up to 1 hour: 100% 37 signals
    Downtime blame beyond 1 hour: 20% rackspace and 80% 37 signals
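
    As a rough worked version of that split, assuming the outage lasted about two hours (the figure given in the post update), the per-hour percentages can be combined into one overall share. The function name `overall_blame` and its parameters are hypothetical, made up for this illustration:

```python
def overall_blame(outage_minutes, first_window_minutes, first_split, later_split):
    """Combine two percentage splits into one time-weighted split.

    first_split applies to the first `first_window_minutes` of the outage,
    later_split to whatever time remains. Splits map party -> percent.
    """
    first = min(outage_minutes, first_window_minutes)
    later = outage_minutes - first
    totals = {}
    for minutes, split in ((first, first_split), (later, later_split)):
        for party, percent in split.items():
            totals[party] = totals.get(party, 0) + minutes * percent
    # Divide by the total duration to get a time-weighted percentage.
    return {party: value / outage_minutes for party, value in totals.items()}


# Assumed inputs: a 120-minute outage and the commenter's two splits.
shares = overall_blame(
    outage_minutes=120,
    first_window_minutes=60,
    first_split={"37signals": 100, "Rackspace": 0},
    later_split={"37signals": 80, "Rackspace": 20},
)
print(shares)
```

    Under those assumptions the split works out to 90% 37signals and 10% Rackspace overall. A longer outage would push more of the weight onto the second split.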

  • Cisco LBs can be deployed in a redundant, HA configuration. Rackspace primarily deploys Cisco networking hardware.

    I agree with many of the previous posts. Pony up the cash for a failover device and eliminate single points of failure.

  • Concentric.com has a unique clustered hosting that prevents these kind of outages.

  • this is what happens when you hire a former baywatch star to work at your software company.

    so what if he made the ruby on rails framework – he also made this terrible video with a cheeseburger:

    http://www.yout...h?v=82-FJyniP7A

    hah classic

  • Glad you have editorial integrity to bash one of your sponsors…but for the record (other than the Dallas power outage) I’ve had great success with them. (knock on wood)…esp their technical support. They are outstanding

  • Before 37Signals points any more fingers of blame, they need this on their homepage:

    “We are aware of and have chosen to accept that if a single piece of equipment fails, your applications will be down for an hour. We choose not to have a spare ready but to rely on our SLA for hardware replacement for you.”

    They obviously knew about the replacement time, and chose to not have a backup ready. They were willing to have their customers down for an hour, and expressly chose to keep that risk. To blame the entire thing on Rackspace is just childish beyond belief.

  • It is our fault that the servers were down. It’ll always be our fault if something is down. The buck stops here. Again, we’re very sorry this happened.

    While we don’t have a formal service-level agreement (SLA), we still want to compensate anyone who felt they were negatively affected in their work because of this outage. Full details here.

  • Nice job. You linked to the post in your admin. Password protected.

  • Been very happy with Rackspace so far. They’re very professional and very responsive.

  • If 37 Signals had a single point of failure as this incident indicates, they should really expect this kind of thing to happen from time to time. If downtime is unacceptable, maybe it’s time for 37 Signals to build a more resilient server platform. There is no reason why an outage at one data center has to take down an entire site. 37 Signals could have planned for this kind of thing by using a number of different technologies.

    But they didn’t. Unless the hosting provider made some guarantee on this configuration being free of single points of failure, it’s hard for me to rationalize blaming the hosting provider for the initial downtime.

    37 Signals is selling customers on their services’ availability so they owe it to their customers to build a more resilient server platform, not blame others. The releases 37 Signals made about this incident indicate that this configuration would have failed anywhere it was hosted. Replacement and reconfiguration in under 1 hour would have been nice, but I have to agree with Scott Mueller’s breakdown of blame.

  • Well.. as my buddy Scott told me this afternoon:

    “ouch, single point of failure… apparently getting real does not involve doing your architecture homework”

    We (threadless) have had our fair share of hardware failures (seems to me to be a problem of the hardware maker, not Rackspace) – however we have always had hot standbys.

    I can’t see how this is Rackspace’s fault – unless 37s has some SLA that includes predicting the future.

    But like Matt says – it is the service provider’s fault that their servers were down. I just don’t see why they insist on blaming Rackspace so publicly. It isn’t like the people who couldn’t access Basecamp (us) cared WHO was responsible – we just wanted it back up. Which is why a simple “We were down. Bummer. We are up now” usually suffices.

  • You know, reading back through this, 37S at least makes some sense, but the headline of this article is ridiculous. Not exactly thoughtful, and to whoever said it, I don’t really think editorially gutsy is the same as just being unthinking and wrong.

  • It’s 4:53pm on the East Coast and we still don’t have access to Basecamp. Unacceptable.

  • In the load balancers that I’ve configured (primarily from F5) it is standard for even the most basic high availability service to have two load balancers daisy chained w/ the second acting as a hot failover. Since the load balancer sits in front of all app servers and is a single point of failure, not having a hot failover (even w/ the most aggressive hardware replacement plan) seems wrong. Maybe RackSpace bears some responsibility, but in this incident (unlike the freak truck hitting the generator) I’d say it also lies on 37signals for not putting in a failover load balancer.

    Of course, this comes with the disclaimer that I know nothing about this other than what I’ve seen here. It’s easy to criticize, maybe it is all on RackSpace, just comes off as a little lopsided and strong, IMHO.
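
    The hot-failover idea described in this comment can be sketched in a few lines. This is only a toy model of the selection logic, assuming a simple boolean health flag; `LoadBalancer` and `pick_active` are hypothetical names and do not correspond to any F5, Cisco, or Rackspace interface:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadBalancer:
    """Toy stand-in for a load balancer with a health flag (hypothetical)."""
    name: str
    healthy: bool = True


def pick_active(primary: LoadBalancer,
                standby: Optional[LoadBalancer]) -> Optional[LoadBalancer]:
    """Return the unit that should serve traffic, or None if nothing can."""
    if primary.healthy:
        return primary
    if standby is not None and standby.healthy:
        # Hot standby takes over immediately, no hardware swap needed.
        return standby
    # No standby available: the site is down until a replacement arrives.
    return None


primary = LoadBalancer("lb-primary")
standby = LoadBalancer("lb-standby")

primary.healthy = False  # simulate the hardware failure
with_standby = pick_active(primary, standby)
without_standby = pick_active(primary, None)
print(with_standby.name if with_standby else "down")
print(without_standby.name if without_standby else "down")
```

    With a standby in place the failure is absorbed by the second unit. Without one, the only path back is a hardware replacement, which is the situation the thread is arguing about.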

  • The quality of journalism on Techcrunch is down. Looks like Mike Arrington is to blame–again.

  • Huh, weird. Nobody has become shrill enough in their Nerd-day Morning Quarterbacking to demand the firing of any 37Sig staff. I’ll give it another hour or two before someone says this would be a firing offense at [link to their site].

  • Wilson, if you continue to have problems accessing the system please write support@37signals.com. The problem is that the DNS entry is being cached erroneously and those caches need to be forcefully cleared out. We’ve added instructions at the top of http://status.37signals.com/, but feel free to write support if you require further assistance.
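
    One way a stale DNS answer can linger is through time-to-live (TTL) caching: a resolver keeps the old address until the entry expires or the cache is flushed. The sketch below is a simplified model of that behaviour, not a description of how 37signals’ DNS was actually set up. `DnsCache`, the hostname, and the address (taken from the 192.0.2.0/24 documentation range) are all hypothetical:

```python
class DnsCache:
    """Minimal TTL cache, loosely modelled on a caching resolver (hypothetical)."""

    def __init__(self, clock):
        self._clock = clock      # callable returning the current time in seconds
        self._entries = {}       # hostname -> (address, expiry time)

    def put(self, hostname, address, ttl_seconds):
        self._entries[hostname] = (address, self._clock() + ttl_seconds)

    def get(self, hostname):
        entry = self._entries.get(hostname)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # Entry has expired; drop it so the next lookup goes upstream.
            del self._entries[hostname]
            return None
        return address

    def flush(self, hostname=None):
        """Forget one hostname, or everything when no hostname is given."""
        if hostname is None:
            self._entries.clear()
        else:
            self._entries.pop(hostname, None)


now = [0]                                   # fake clock for a deterministic demo
cache = DnsCache(clock=lambda: now[0])
cache.put("app.example.test", "192.0.2.10", ttl_seconds=3600)

now[0] = 1800                               # half an hour later
print(cache.get("app.example.test"))        # old address is still served

cache.flush("app.example.test")             # forced clear
print(cache.get("app.example.test"))        # nothing cached, so a fresh lookup is needed
```

    In this model, users behind a cache keep hitting the old address until the TTL runs out, which is why a forced flush can bring them back sooner.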

  • @DHH

    “In our case, it took about two hours from reporting the problem to having it fixed. That’s not good enough.”

    Maybe you guys should house your own racks. You could get Zed to configure the Mongrel install. He loves Rails.

  • Davis, you’re right. Here’s the correct link with info on what happened and how customers can receive compensation: What happened this morning?
