January 18, 2008

37Signals Down - Looks Like Rackspace Is To Blame Again

Michael Arrington

73 comments »

37Signals is having a bad morning, according to their current home page image above. They’re pointing fingers at their service provider, which was (and we believe still is) Rackspace. Last November they suffered a three-hour outage along with other Rackspace customers.

Update: It’s back up, total outage was about 2 hours. Per the comments, 37Signals doesn’t seem super duper happy with Rackspace these days.

Trackbacks/Pings (Trackback URL)

  1. ISV Survival Blog

Comments

  1. MikeT

    cool to know! I almost signed up for Rackspace hosting service…

  2. DHH

    Our service provider is indeed still Rackspace. And yes, we’re very disappointed in how this has been handled. They have a hardware replacement time guarantee that has long since been broken. We apologize deeply to all our customers for this outage. It comes way too soon off the Truck Drives Into Power Station event.

  3. Arnold Leung

    Once again web consulting companies like mine have to pause working, because we have no access to data.

  4. Don Jones

    If it is Rackspace, they’re in for big trouble. It’s understandable to have one freak outage, but two in close proximity - they’re in trouble if this is their fault.

  5. Joshua

    37 Signals is listed as a client of Tilted, another web hosting provider.

  6. Andy

    Dead in the water here.
    Offline version of Basecamp anyone?

  7. Technicle

    It’s predicted that there’s gonna be “the big dotcom [technical] crash of 2008”.. some early signs of it seem to be surfacing.. with frequent and lengthy downtimes of popular services (including twitter.. blogger.. joyent..)

  8. Joshua

    Also I think it’s more 37 Signals fault than anything for not having a hot standby in their setup. When you are offering a service on the level of theirs it’s completely irresponsible to not have hot standbys of such a critical piece of equipment.

  9. Isaac

    Stop the presses… 37 signals is down. Also Michael has resorted to pasting homepages on TechCrunch… It likely took all of 3 minutes to put this post together…

  10. Anonymus

    Ahem … i heard that rackspace has really bad uptime for the last months, i’m not gonna sign with them anymore … altho i was very tempted

  11. Anonymous

    All of my boxes are at Rackspace and we haven’t had any issues. A failed load balancer isn’t really indicative of Rackspace’s ‘uptime’.

  12. Don

    I agree with Joshua, I run my website with two load balancers, one in stand-by mode, for this very reason. It is not the service provider’s fault when a piece of hardware fails. Hmmmm, what kind of hardware was it?

  13. Martin

    rails to blame :)

  14. DHH

    All our products are now back. We’re working on getting all the blogs etc back online too. Again, we’re so sorry for this outage.

  15. It's back...

    They’re up again.

  16. Brian

    Don, good point. Rackspace is only liable to a certain degree when a hardware failure occurs.

    However, it looks like they didn’t live up to their 1 hour hardware replacement guarantee that every customer is afforded in the RackSpace SLA. I have several customers with RackSpace and this is concerning.

  17. Joshua

    It’d be interesting to see what that guarantee actually covers. It seems the hardware was actually installed within an hour but the configuration couldn’t be swapped out since the LB died and took the CF card with it. So although the hardware was replaced quickly, the configuration seems to have taken another 30-45 minutes.

  18. SG

    Well, our basecamp is back up - noticed the downtime earlier, but our use of basecamp is not *critical*

    Web hosts have been having a bad week: first it was dreamhost inappropriately billing folks and now rackspace

    I’m sure all major hosting companies encounter problems occasionally, so keeping that in mind, it is best to have contingencies in place (we learned this the hard way, but are better prepared for the future)

    We use thePlanet, and they have been great (I really like their service compared to other hosts)

  19. MGZ

    I’m trying to see where RackSpace is to blame here. Is it the failure of the hardware? The single point of failure for 37signals at their (single?) load balancer? Or just that they’ve taken longer than 1 hour to get the new hardware up and configured?

    For what it’s worth, I have used RackSpace Intensive hosting, and for the most part they are great to work with. Comparing this one incident with a slow replacement of a single point of failure for 37signals to a truck driving into a pole and triggering a wave of events is kinda… fucked up.

    MGZ

  20. Morgan

    I use Rackspace, and I love them. We have a RAID 5 that supposedly can recover from a drive failure by just replacing the one drive– does that mean we don’t make regular backups? No, we don’t assume those things always work.

    If there’s a single point of failure in a group of computers, and you have no backup waiting, that’s your fault. Maybe the time to replacement was too long, but if any time at all is unacceptable to your customer, then working without a backup load balancer ready is unacceptable for your service.

    Not Rackspace’s fault.

  21. Joshua

    But how was it Rackspace? No other Rackspace customers went down… 37 Signals had a single point of failure.. and it failed. It was inevitable. It’s not the service provider’s fault by any means.

    I feel compelled to defend Rackspace a bit because it’s just horribly silly to not have a secondary LB on standby. Rackspace did their part, they replaced the hardware, got it back online.

    I think it’s horribly unprofessional for 37 Signals to point the finger at Rackspace.

  22. Don Wilson

    Wow, we were told after the last incident that something like this wouldn’t happen. Interesting turn of events.

  23. DHH

    Joshua, network equipment at Rackspace is managed by them exclusively (clients don’t have access to it directly). Their hardware guarantee is supposed to ensure that if anything goes wrong, there will very quickly be a replacement installed. In our case, it took about two hours from reporting the problem to having it fixed. That’s not good enough.

    But you’re right that there should have been sitting a spare ready to go in our rack. That would have made everything happen a lot faster. I can assure you that one such unit will be there at the end of the day.

    Again, we apologize to all our customers for this down time. We’ve posted the details on http://37signals.blogs.com/pro.....ned-t.html

  24. Willem

    We had the same problems with downtime. Rackspace should get their act together and make sure they provide quality. If you can’t stand the heat, get out of this space.

  25. singingdancingbear

    Wow, to say that a hosting provider is too blame for 37signals being down is complete hog wash. A hosting provider can help with uptime but not stupid IT decisions. I don’t use either services; however let me give you some free consultation. If you have single points of failure you are going to go down unless hardware becomes perfect. Let me know when you find that hardware provider so that I can buy stock.

    I say if you use 37signals you evaluate their service since it seems like that make decisions have baked.

  26. Greg

    Sucks, but it’s not Rackspace’s fault. Falling back on a hardware guarantee when you can just set up a hot spare is a no-brainer, especially when you are talking about hardware that has minimal configuration such as a load balancer.

    If 37signals admits they should have had a spare, jumping on Rackspace for getting one up and running from scratch in 2 hours instead of 1 seems lame, hardware guarantees notwithstanding. They failed to meet their obligation contractually, but 37signals is the sole person responsible for the extended downtime, not Rackspace.

  27. frank

    You pay a very large amount of money for Rackspace compared to other hosters; you can expect something from paying that premium I would think. And you pay for managed hardware; so yes Rackspace their fault it is.
    If you cannot trust them to do this faster, you better spend your money in other places where you can have more hardware for less money and at least in *this* case, the same support times.

  28. Scott

    It’s important to note that this is not a Rackspace problem, it’s a 37 Signals problem. One of the machines that 37 Signals owns or leases failed, as hardware occasionally does. It’s not really 37 Signals fault either, although they should probably have a backup load balancer when they’re providing a service as popular as basecamp (as another poster mentioned.) Rackspace is an excellent hosting company that has provided great service to the company I work for. It’s not like they’re asking to be bombarded by flying trucks. :-)

  29. pffft

    This is like not wearing your seatbelt and then blaming the ambulance for the fact that your face got smashed into the windshield in an accident because the ambulance took 10 minutes to get to your accident site instead of 5.

    Sure, it’s better for the ambulance to get there in 5 minutes, but if you had been wearing your seatbelt, your face wouldn’t be smashed into the windshield and how fast the ambulance got there has nothing to do with whether you could have completely avoided getting your face smashed into the windshield.

  30. Deva Hazarika

    Michael,

    I’m really surprised that you pass the blame off in your post on Rackspace. Why should a 37signals customer care about who is providing the hosting and any problems they might have? From the customer’s perspective, there’s simply one party to blame here, and that’s 37Signals, not some service provider the customer should have no reason to even be aware of. I hope you follow up on this and focus on the necessity of online companies to have adequate backup/redundancy/etc in place so their customers don’t suffer in situations like this.

  31. Erik

    Complete dependence on one data center = non-mission critical service. If someone wants to operate a mission-critical service then they should be prepared to have a nuclear bomb fall directly on one of their data centers and have no service interruption, or at least be back up in minutes with minimal if any data loss. This is Ops 101 and is done all the time. It’s just a bit more expensive.

  32. Scott Mueller

    singingdancingbear, aside from your extremely poor grammar, I agree with what you said, “I say if you use 37signals you evaluate their service since it seems like that make decisions have baked.”

    As someone else mentioned, it is unprofessional of 37 Signals to publicly blame Rackspace for their single point of failure. Is it true that rackspace did in fact have the hardware in place within an hour, but configuration was an issue? If rackspace was responsible for the configuration too, they should obviously live up to their uptime guarantee with compensation. However, they did at least try and had the hardware ready as promised.

    From my perspective of what I’ve read so far:
    Downtime blame up to 1 hour: 100% 37 signals
    Downtime blame beyond 1 hour: 20% rackspace and 80% 37 signals
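The split above can be turned into one overall figure. This is only a rough sketch: it assumes the roughly two-hour total outage reported in the post's update and a one-hour replacement window, and the function name and parameters are made up for illustration.

```python
def blame_split(outage_hours, sla_hours=1.0, rackspace_share_after_sla=0.20):
    """Return (rackspace_fraction, signals_fraction) of the total downtime.

    Assumption: downtime inside the replacement window is attributed 100%
    to 37signals; downtime beyond it is split 20/80 as in the comment above.
    """
    if outage_hours <= 0:
        raise ValueError("outage_hours must be positive")
    within = min(outage_hours, sla_hours)
    beyond = max(outage_hours - sla_hours, 0.0)
    rackspace = beyond * rackspace_share_after_sla
    signals = within + beyond * (1.0 - rackspace_share_after_sla)
    return rackspace / outage_hours, signals / outage_hours


rackspace, signals = blame_split(2.0)  # "about 2 hours" per the post update
print(f"Rackspace: {rackspace:.0%}, 37signals: {signals:.0%}")
# prints "Rackspace: 10%, 37signals: 90%"
```

Under those assumptions the two-hour outage works out to about a 10/90 split overall.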

  33. mike

    Cisco LBs can be deployed in a redundant, HA configuration. Rackspace primarily deploys Cisco networking hardware.

    I agree with many of the previous posts. Pony up the cash for a failover device and eliminate single points of failure.
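A back-of-the-envelope way to see what a failover device buys you. The sketch assumes the two units fail independently and that failover is instant and always works, which real deployments only approximate; the 99.9% figure is a hypothetical, not a number from Rackspace or Cisco.

```python
HOURS_PER_YEAR = 24 * 365


def redundant_availability(single, units=2):
    """Availability of `units` devices in parallel, assuming independent failures."""
    return 1.0 - (1.0 - single) ** units


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.999  # hypothetical availability of one load balancer
pair = redundant_availability(single)

print(f"single unit: {downtime_hours_per_year(single):.2f} h/year down")
print(f"redundant pair: {downtime_hours_per_year(pair):.4f} h/year down")
```

With these made-up inputs, one unit is down for hours per year while the pair is down for well under a minute, which is the usual argument for eliminating the single point of failure.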

  34. sodapop

    Concentric.com has a unique clustered hosting that prevents these kinds of outages.

  35. asshat

    this is what happens when you hire a former baywatch star to work at your software company.

    so what if he made the ruby on rails framework - he also made this terrible video with a cheeseburger:

    http://www.youtube.com/watch?v=82-FJyniP7A

    hah classic

  36. Jon F

    Glad you have editorial integrity to bash one of your sponsors…but for the record (other than the Dallas power outage) I’ve had great success with them. (knock on wood)…esp their technical support. They are outstanding

  37. Morgan

    Before 37Signals points anymore fingers of blame they need on their homepage:

    “We are aware of and have chosen to accept that if a single piece of equipment fails, your applications will be down for an hour. We choose not to have a spare ready but to rely on our SLA for hardware replacement for you.”

    They obviously knew about the replacement time, and chose to not have a backup ready. They were willing to have their customers down for an hour, and expressly chose to keep that risk. To blame the entire thing on Rackspace is just childish beyond belief.

  38. Matt @ 37signals

    It is our fault that the servers were down. It’ll always be our fault if something is down. The buck stops here. Again, we’re very sorry this happened.

    While we don’t have a formal service-level agreement (SLA), we still want to compensate anyone who felt they were negatively affected in their work because of this outage. Full details here.

  39. Davis

    Nice job. You linked to the post in your admin. Password protected.

  40. manp

    Been very happy with Rackspace so far. They’re very professional and very responsive.

  41. James

    If 37 Signals had a single point of failure as this incident indicates, they should really expect this kind of thing to happen from time to time. If downtime is unacceptable, maybe it’s time for 37 Signals to build a more resilient server platform. There is no reason why an outage at one data center has to take down an entire site, 37 Signals could have planned for this kind of thing by using a number of different technologies.

    But they didn’t. Unless the hosting provider made some guarantee on this configuration being free of single points of failure, it’s hard for me to rationalize blaming the hosting provider for the initial downtime.

    37 Signals is selling customers on their services’ availability so they owe it to their customers to build a more resilient server platform, not blame others. The releases 37 Signals made about this incident indicate that this configuration would have failed anywhere it was hosted. Replacement and reconfiguration in under 1 hour would have been nice, but I have to agree with Scott Mueller’s breakdown of blame.

  42. harper

    Well.. as my buddy Scott told me this afternoon:

    “ouch, single point of failure… apparently getting real does not involve doing your architecture homework”

    We (threadless) have had our fair share of hardware failures (seems to me to be problem of the hardware maker, not rackspace) - however we have always had hot standbys.

    i can’t see how this is rackspaces fault - unless 37s has some SLA that includes predicting the future.

    But like matt says - It is the service providers fault that their servers were down. I just don’t see why they insist on blaming rackspace so publicly. It isn’t like the people who couldn’t access basecamp (us) cared WHO was responsible - we just wanted it back up. Which is why a simple “We were down. bummer. we are up now” usually suffices.

  43. Morgan

    You know, reading back through this 37S at least makes some sense– but the headline of this article is ridiculous. Not exactly thoughtful, and to whomever said it, I don’t really think editorially gutsy is the same as just being unthinking and wrong.

  44. Wilson Cleveland

    It’s 4:53pm on the East Coast and we still don’t have access to Basecamp. Unacceptable.

  45. JpMaxMan

    In the load balancers that I’ve configured (primarily from F5) it is standard for even the most basic high availability service to have two load balancers daisy chained w/ the second acting as a hot fail over. Since the load balancer sits in front of all app servers and is a single point of failure, not having a hot fail over (even w/ the most aggressive hardware replacement plan) seems wrong. Maybe RackSpace bears some responsibility, but in this incident (unlike the freak truck hitting the generator) I’d say it also lies on 37signals for not putting in a fail over load balancer.

    Of course, this comes with the disclaimer, that I know nothing about this other than what I’ve seen here. It’s easy to criticize, maybe it is all on RackSpace, just comes off as a little lop sided and strong, IMHO.
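The active/standby arrangement described above boils down to a simple selection rule. This is a toy model, not F5 or Cisco configuration; the class and function names are invented for illustration, and real devices detect failure through heartbeats rather than a boolean flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadBalancer:
    name: str
    healthy: bool = True


def pick_active(primary: LoadBalancer,
                standby: Optional[LoadBalancer]) -> Optional[LoadBalancer]:
    """Return the unit that should carry traffic, or None if the site is down."""
    if primary.healthy:
        return primary
    if standby is not None and standby.healthy:
        return standby  # hot failover: traffic moves to the second unit
    return None  # single point of failure: wait for replacement hardware


failed_primary = LoadBalancer("lb-1", healthy=False)

# No standby: the site stays down until the hardware is swapped and reconfigured.
print(pick_active(failed_primary, None))

# Hot standby in the rack: the second unit takes over.
print(pick_active(failed_primary, LoadBalancer("lb-2")))
```

The first call returns `None` and the second returns the standby unit, which is the difference between waiting on a replacement guarantee and failing over.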

  46. 37user

    The quality of journalism on Techcrunch is down. Looks like Mike Arrington is to blame–again.

  47. EH

    Huh, weird. Nobody has become shrill enough in their Nerd-day Morning Quarterbacking to demand the firing of any 37Sig staff. I’ll give it another hour or two before someone says this would be a firing offense at [link to their site].

  48. DHH

    Wilson, if you continue to have problems accessing the system please write support@37signals.com. The problem is that the DNS entry is being cached erroneously and those caches need to be forcefully cleared out. We’ve added instructions at the top of http://status.37signals.com/, but feel free to write support if you require further assistance.

  49. anon

    @DHH

    “In our case, it took about two hours from reporting the problem to having it fixed. That’s not good enough.”

    Maybe you guys should house your own racks. You could get Zed to configure the Mongrel install. He loves Rails.

  50. Matt @ 37signals

    Davis, you’re right. Here’s the correct link with info on what happened and how customers can receive compensation: What happened this morning?

  51. chrisw

    I wonder why they (37signals) don’t just move their entire service over to AWS - they’re already using S3 for file storage, why not go all the way?

  52. PA

    Hey Guys,

    I’m a Lead Tech in the Intensive Team (Linux). And well, to be honest, hardware fails, and as such we stress to our clients, if it’s a core solution, have a HA setup. We have solutions that cater for almost every failure that we can (redundant power, network, storage, servers)! But again, that costs money. We regularly do redundant Firewall and Load Balancer pairs, which are all Cisco kit, with some of the BEST Cisco techs on the market working for us.

    When we write a SLA (which all of our clients have) we regard them as promises, which we keep to the best of our ability, a promise we make, is a promise we keep. It’s unfortunate that the hardware took nearly 2 hours to replace and get back up and running, and for that, I can only but apologize, but as things go, even if we’d met the 1 Hour that was listed in the SLA, would that have been OK to 37 Signal’s clients?

    It’s awesome to see so many customers coming to our rescue in the comments, and I’m sure if 37 signals went anywhere else, they wouldn’t get the attention and support they receive now. I would be more than glad to spend my personal time with 37 Signals to get their faith in Rackspace restored. (as would any other racker, no doubt). DHH, if you would like, you’re more than welcome to email me personally at p eefy net and i’d be more than happy to do my best to help you come to love rackspace.

    P

  53. Strubit

    @ 51
    agree. why isn’t AWS a more suitable environment for 37S?

  54. Ramon

    is this same rackspace hosting that has an ad on right side of this page?

    i think 37signal has very good services

  55. Sven

    I use BaseCamp and I hope to use Rackspace in the future.

    It’s BS for 37signals to point the finger like this. If I was Rackspace I would drop them like a cheap suit.

    If you want to run a business with SaaS at the core, you have to over-invest on the hosting and backups fronts. Not point the finger at someone else.

    1 million customers and you have to scapegoat a vendor?

    Lame, lame, lame.

  56. Eric Wagner

    load balancers aren’t shared at Rackspace (as far as I know) so this most likely only means that 37signals went down. Rackspace does offer redundant load balancers so if you choose to save $ you are taking a risk. I use Rackspace for hosting and excluding the truck incident I have had zero problems.

  57. Fred

    This is a bummer for 37Signals, but as a customer of theirs for several years now I’ve been impressed with their customer service. Stuff like this happens in IT. Say they did have a spare load balancer, then you run into the problems associated with running redundant load balancers.

  58. chrisw

    Wow! I just had a phone call from RackSpace (on my mobile at 10 in the morning- I’m in Sydney Australia!). I’ve been considering a number of hosting providers in the US for the launch of a new online service my company has been developing. After reading about this issue this morning I emailed my contact at RackSpace and thanked them for their proposal but indicated that we would not be proceeding due to the apparent outages they had suffered recently.

    To my surprise, RackSpace was on top of this immediately. I received a 3 way phone call from the account director and from her Vice President, Jairo Romero (he was on his mobile - driving home in his car). They both assured me that the outage was not their fault (very diplomatically of course). They both re-assured me that RackSpace provided the best hosting service there was and absolutely guaranteed me that I would be happy with RackSpace if I hosted my site with them.

    To me, this is pretty incredible. I would never expect an Australian company to go that extra mile - especially for a ‘potential’ customer half way across the world.

    We’ve been seriously considering hosting our entire product with Amazon Web Services due to the low cost of establishment and the ability to scale dynamically as our customer base grows - we’ve been looking at SmugMug’s business model closely since we think it’s very similar to ours (different online service though).

    But since the call this morning from RackSpace, I’m wondering if ‘customer service’ (which I doubt Amazon will provide in huge bucket loads) should be factored in our equation when looking for a hosting provider. With AWS I hadn’t considered it a huge issue since I assumed that “Hey, it’s Amazon - they’re big and can’t afford to have major downtimes or expansion issues - we’d be nuts not to take advantage of AWS” ….. maybe I’m wrong. Maybe having a team of people who go to the trouble of calling you at 10am on a Saturday from half way round the world is important?

    Pretty impressive.

  59. Avinash

    Two outages. I think 37signals should definitely think of moving on to a different hosting provider.

  60. G

    Avinash,

    That’s like saying “I had two flat tires, I better buy a new car”. Good thing you aren’t my financial adviser.

  61. sodapop

    G, I think it’s more like having your engine fail.

  62. ididak

    Most people here seem to blame the ISP. It’s really 37signal’s fault of not having a redundant LB setup. There is simply no excuse not to have at least 2 netscalers (or equivalent) that go through ISP’s redundant switch/routers, if your customers depend on your web services. If you don’t want to pay cash, learn to setup a redundant LVS solution. It’s not that hard.

    I’m actually surprised to see 37signal this clueless about web services, given the hype and sometimes solid advice from the company.

  63. Andrew

    Maybe this is why virtualized scaling setups such as EngineYard’s make a lot of sense - no one single point of failure - scale it with multiple physical locations (yes, I know data replication would be a nightmare) and you would have a pretty redundant system.

  64. HonestMall.com

    somebody just needs to make a good project management, chat and contact system that can be hosted on people’s own servers.

  65. buckerooni

    relying on one host is a single point of failure, but if there’s no SLA between 37signals and their customers, then what are they complaining about?

  66. Bill

    Wow, dozens of messages all for a few hardware failures. I have actually consulted on several projects which leveraged RackSpace for hosting, and have always found them to be one of the best. There are actually several lessons from this:

    1. If your web site is important or creates revenue, invest in redundancy up to the point any outage impact costs less than the architecture to prevent it. Even that may be too risky for consideration (e.g., air traffic control: would you want your crash rate to increase by even 1% because a budget did not allow for a $000’s hardware purchase?).

    2. Consider using some of the ‘grid’ type services available. Even RackSpace has a spinoff group that offers this highly redundant type of hosting LAMP type applications.

    3. Create a backup hosting environment with a different company that maintains a replication of your production environment. (Yes, it will be a pain to setup) However, the advantage is you can very quickly shift users to the alternate site while recovering the primary site.
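The first lesson above is a break-even comparison, sketched here with entirely hypothetical numbers. The outage frequency, duration, and hourly cost are invented, and the 300 per month is just one reading of the "few hundred extra per month" another commenter mentions for failover gear.

```python
def expected_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss from outages (a simple product; no risk weighting)."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(expected_loss, redundancy_cost_per_year):
    """True when the avoided loss exceeds what the redundancy costs."""
    return expected_loss > redundancy_cost_per_year


# Hypothetical inputs: one 2-hour outage a year, 5,000 lost per hour of downtime.
loss = expected_outage_cost(outages_per_year=1, hours_per_outage=2, cost_per_hour=5_000)

# Hypothetical cost of a standby load balancer: 300 per month.
spare_cost = 300 * 12

print(loss, spare_cost, redundancy_pays_off(loss, spare_cost))
# prints "10000 3600 True"
```

The comparison assumes the spare removes the outage entirely, and it ignores harder-to-price costs such as reputation, so treat it as a lower bound on the case for redundancy.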

  67. Anon

    The main issue here is probably 37signals hype. Someone said
    “1 million customers and you have to scapegoat a vendor?”

    Folks, 37signals has one million signups, not customers. It’s a small company with limited resources. That’s probably why they did the cost analysis and decided they didn’t want to pay for a second load balancer.

    At the end of the day, ask yourself three questions:
    1. How much do you pay for your 37signals service?
    2. How much downtime have they had since you’ve started using them?
    3. Would you pay double to halve the downtime?

    I don’t feel too bad for the 37signals guys, though. They got so much free marketing and exposure through their hype. Since when should people care that the product was written in Ruby rather than in Java, .NET or PHP.

    They *are* marketing geniuses. Our product has very few features *by design*. Brilliant.

  68. Kevin

    As far as hosting companies go Rackspace is one of the best I have dealt with. By no means perfect, but they usually have a clue and are very responsive.

    A single point of failure on a load balancer is just asking for it. This could have been avoided by paying a few hundred extra per month for stateful failover for the webmux, F5 or whatever gear you had in there.

    Doesn’t matter what the hardware replacement time is, anything over 30 seconds of downtime will bother anyone.

  69. Paul

    Of course 37s is blaming rackspace because when you use a managed service like rackspace, you are never at fault! Think about it, 37s does not actually own any of that equipment, nor are they responsible for maintaining it (at least the hardware). They probably have never even seen it. I don’t know how any web company can be serious when they are not responsible for their own servers.

  70. Bali

    They have a hardware replacement time guarantee that has long since been broken.

  71. Mark

    Why any company would trust mission critical data to a 3rd-party, off-site server beyond their control is beyond me.

    My company loves Basecamp and uses it frequently, but any sort of technical issue isn’t going to stop us from any critical day-to-day operations.

  72. Todd

    IT Director responsible for hundreds of servers in five data centers here. Clearly, the majority of responsibility and blame rests on 37signals. I see several “single points of failure”: 1) one data center, 2) probably single firewall, 3) obviously single load balancer, 4) probably single switch.

    We run our company on the 3s principle. Three main data centers each carrying 1/3 traffic. At least three load balanced web servers, three app., three sql. Any two systems can carry the full load if an emergency comes up.

    Definitely RS failed with the hour turn around for hardware. If they are managing the FW and LBs they are responsible for a current configuration in another area. They have a ton of these in each data center and it literally takes minutes to reflash a FW or LB.
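The "any two can carry the full load" rule above implies a sizing constraint: with three sites, each must be able to handle half the total traffic, not a third. A small sketch of that check follows; the function names and the request-rate figures are illustrative assumptions.

```python
def survives_single_failure(capacities, total_load):
    """True if the remaining sites can carry total_load after losing any one site."""
    total_capacity = sum(capacities)
    return all(total_capacity - c >= total_load for c in capacities)


def min_capacity_per_site(sites, total_load):
    """Smallest equal per-site capacity that tolerates one site failing."""
    if sites < 2:
        raise ValueError("need at least two sites to tolerate a failure")
    return total_load / (sites - 1)


total_load = 900  # hypothetical peak load, in requests per second

print(min_capacity_per_site(3, total_load))              # 450.0
print(survives_single_failure([450, 450, 450], total_load))  # True
print(survives_single_failure([300, 300, 300], total_load))  # False
```

Sizing each site for exactly its one-third share passes in normal operation but fails the moment one site drops out, which is why the spare headroom matters.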