Why Gmail Failed Today
by Erick Schonfeld on September 1, 2009

When Gmail went down today, it caused more than a minor panic. People, like me, who use Gmail as their primary email couldn’t get much work done. There’s nothing like an outage to make you realize how much you rely on something.

So what happened exactly? Isn’t Gmail supposed to have no single point of failure? Well, yes. Gmail has thousands and thousands of overlapping mail servers that can pick up the slack if any one fails, because the data is replicated and spread all around. But there are also request servers, which do nothing but route each request for email to whichever server (with the right emails on it) happens to be available.

It turns out that Google took down some regular email servers for routine maintenance and, because of some recent changes, that overloaded the request servers. Google engineering VP Ben Treynor explains on the Gmail Blog:

At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.

So much for redundancy.
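
To see how a handful of overloaded routers can take out the whole tier, here is a toy simulation. It is only a sketch of the failure mode Treynor describes, not Google's actual routing logic: the router capacities, the load figures, and the function name `simulate_cascade` are all made up for illustration, and it assumes traffic is split evenly across whichever routers are still accepting it.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of healthy routers after each round.

    Hypothetical model: load is split evenly across healthy routers,
    and any router whose share exceeds its capacity stops taking traffic.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # everyone can carry their share; the system is stable
        healthy = survivors  # overloaded routers shed their traffic
        history.append(len(healthy))
    return history


# Ten imaginary routers with slightly different capacities.
CAPACITIES = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

print(simulate_cascade(CAPACITIES, 1000))  # [10]
print(simulate_cascade(CAPACITIES, 1010))  # [10, 9, 7, 1, 0]
```

In this toy model a 1% increase in load is the difference between a stable system and a total collapse: once the weakest router drops out, every remaining router inherits a bigger share, which pushes the next ones over the edge.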

Gmail, which recently passed AOL to become the third-largest Web mail service in the U.S., is obviously having some growing pains. A few hours of downtime is not the end of the world, although it might seem like it at the time. Google just better not make this a habit.

Comments

  • Now we know why… But it was still hard to function. We love Gmail.

  • Can’t complain, really. You get what you pay for; in this case it’s nothing.

  • I agree, a big Company such as google who has been at it for 10 years should be well aware of their massive traffic, and server resources. I don’t blame Google though, I’m brain washed and love them!

  • Seriously…?! Oh, come on — do we really need half a dozen stories per news site about Gmail being down for the ridiculously long period of 100 minutes…?!?

  • I love gmail and I forgive them. Google is doing a lot of ambitious things and it’s a sign of their competence these events are so very rare. I also appreciate them explaining what went wrong. Having a culture that admits mistakes and learns from them is the mark of a company that provides stable products.

  • I love Gmail, both personally and Google Apps at my company. It’s a wonderful service. That said, there’s a big part that Ben left out of the story. It’s when Ben got together with members of his team and the PR team on what’s the acceptable story to present on what went wrong. What got on the blog was likely ok’ed by Google higher-ups before publishing. It was probably something more like a developer screwup or operations gaffe (a fundamental product/process flaw), but only the insiders will know.

  • So much for the cloud…get a backup at Yahoo :) .

  • I didn’t even realize that it actually went down.
    Anyhow I’d still love gmail :)

  • When possible, redundancy is not only the service provider’s responsibility but also the user’s.

    When I discovered a problem accessing Gmail via the web this afternoon, I quickly switched to my iPhone email client and was instantly good to go.

  • Today’s outage made me re-evaluate my entire belief in cloud computing.

    http://www.stuf...loud-computing/

    • their downtime wasn’t even full downtime; only their web access was affected, while IMAP and POP were fine. they recovered within an hour and a half; see if you can bring your local server back up that quickly after a RAID controller failure… you’re looking at 3+ hours of recovery. don’t let the cloud scare you!

    • Buddy, you don’t have experience in running real services yourself. Get real. If Google got the Gmail web client back in 100 mins, I would salute them.

      You have no clue; even great companies don’t know shit about 24/7 operations.
      In my current role, I have to scream for basic things like exponential backoff to keep the service running when there is an unplanned reduction in capacity.

      If you ever run your own service of some scale, you will know how difficult it is to give anything more than 3-4 nines.

  • Come on people, pick up the phone. Is a few hours of downtime really that big a deal? Most of human history functioned without it.

    • For mission critical applications (some people consider Gmail mission critical) then YES, a few *hours* of downtime in the middle of the work day is a big, big deal.

      Consider: 2 hours of downtime means that for the year, you won’t be able to get higher than 99.9% availability. If you tried to sell me an application or piece of hardware that had less than 99.99% availability, I’d not even take the meeting. I don’t care how cheap it is. If it’s not going to work, randomly, over the course of my business day, I don’t want it.

      2 hours of downtime means that for 2 hours, entire companies were unable to email one another. Consider a company that relies on Gmail to receive trouble tickets about its application. A customer opens a trouble ticket and expects an email or call back … and they wait … and they wait. And 2 hours later they get a call: “Sorry, our email was down.” The customer would think this is unacceptable. For some companies, email is more important than the phone.

      Outages like this prove to companies that run their own email servers that hosting in the cloud is not ready for primetime.

      • Jeepers, it’s not that big a deal. Take a stroll around the block, eat a ham sandwich. Life will go on.

      • 99.99% availability huh? I suppose you (and the rest of the business world who wouldn’t stand for such poorly performing applications) certainly would not be running Windows? Businesses have shown time and again that they have zero tolerance for any poorly performing applications….oh…wait…

      • I wish I could say that my work’s private email servers have been down less than 100 minutes this year, but I can’t. You’d have to have a pretty amazing information structure to be able to compete with Gmail in the cloud.

      • You’re assuming that the average company mail server has less than 2 hours downtime across a year. Not the case in my experience, especially if it’s running Exchange. I’m including time that your mail is delayed (due to serious virus outbreaks choking the server, etc) in this downtime. I think Gmail would come out quite favourably in this comparison. Besides, if an entire company was using Gmail as their email server, surely they’d be using IMAP with a regular mail client for reliability anyway.

      • J,
        Which world are you coming from? You are saying Gmail is used for mission critical things? Really, and what is it that cannot wait 2 hours?

        Is your app a 911 call center?

        I have been on the telephone with customer support of top 10 companies for HOURS without a response. So, which company takes a LOSS when its email does not work for 2 hours?

        Get real….

        • I come from the network world. Anything less than 99.999% is failure.

          For the people writing in about Windows – you’re not thinking about availability in the right context. I don’t care if my Microsoft Exchange server reboots suddenly – I have backups. I don’t care if your Windows machine crashes, I have thousands of other users that are fine (and didn’t install that virus). If I have 100,000 users, and your machine goes down for an entire YEAR, I can still hit 99.999%. You are a statistic.

          And yes, 2 hours of downtime for 100% of users is critically bad.

    • Picking up the phone isn’t always possible. It just so happened that just when Gmail went down I was trying to access my account because I needed to call someone; too bad that they had sent me their telephone number in an email. I spent at least a half hour trying to access it, and the same went for him. 100 minutes can be disastrous, especially to those who rely on it most. Am I saying we should all be dependent on our emails and everything else? Certainly not, but sometimes we have no real alternative.

      In any case, I’m so glad Gmail is back up. Best thing I’ve ever used and most of the time, it’s reliable. Thank goodness I was able to access my email through iGoogle.

  • And why on earth would you take servers offline in the middle of the day? Don’t these kinds of maintenance activities take place in the wee hours of the night when the traffic is not that high?

  • Why do people care *why* it went down? It went down. As a user, that’s all you need to know. It’s not like knowing why it went down makes it any less inconvenient. It doesn’t matter if it’s because someone tripped over a cable extension at the hosting center, or because router XYZ experienced a segfault in its ethernet driver. Seriously, what difference does it make to YOU? None.

    • I care, because I want to know if it’s a problem that can be easily avoided in the future (it is). I’m sure Google is building redundant request servers as we type…

    • I care. It helps me knowing the details. I am in the same business as large scale services.
      It is just unparalleled knowledge knowing what happened.
      But that is selfish of me. Yes, I know…..

  • lucky for me i don’t login much to the site. my blackberry and outlook access held it down.

  • So much for “That 16 email error made me go Google”

  • “stop sending us traffic, we’re too slow!” Does it sound familiar in our corp culture: “I am overloaded, ask someone else, please!”. Still, I don’t understand why the routers got overloaded all of a sudden today.

  • totally annoying for the moment (i did have a little panic), but in the end, nothing worth getting in a lather about :)

  • Time for Google and other Web 2.0 companies (Twitter, Facebook) to become a little bit more like AT&T or Verizon. You don’t take down some of your mission-critical infrastructure at 12:30 PM PDT. Those servers are redundant, yes, but they are redundant for volume and load. They’re not needed at 12:30 *AM* PDT, but they are needed in the middle of the work day. If they had done this during a typical maintenance window, like 12:01 AM – 6 AM, they would have had one of two things happen:

    1. There would have not been as many requests and nothing would have even hiccuped
    …OR…
    2. The service still would have gone down, but few people would have noticed it at 12:01 AM PDT.

    This is not rocket science.

    • Good thing all of Gmail’s users are in the Pacific Time Zone. In that context your suggestion makes perfect sense.

    • J –

      I agree 1,000,000,000% (as mathematically impossible as that is.)

      You’re right on the money and I’m really surprised that the geniuses at Google didn’t figure that out before today. I work at Sirius XM Radio and they always do infrastructure maintenance in the middle of the night – usually on the weekend.

      • As far as I can tell, Google still suffers from the same “cowboy” mentality that screwed up the cable companies for so long.

        “I’m an engineer, I understand the system, I can take it down in the middle of the day because it’s redundant.”

        Wrong, you don’t understand the system as well as you thought, and you cost Google a lot of bad press today.

        The really great news is that they didn’t lose any emails (as far as we know) and it was *just* an outage. This time. Hopefully the bad press is enough for Google to make sure their internal practices prevent this from happening again. I do believe Google is smart enough to put in a process to fix this, they just need to learn the hard way – like everyone else.

    • J, as odd as you may find it, not all GMail users are on the same time zone, some are not even close to your time zone.

      • Amir, you bring up a good point that only amplifies mine.

        According to this blog post from Google, a majority of Gmail users are outside the US (http://gmailblo...oes-global.html). Assuming that’s accurate, let’s call it a 50/50 split for US vs. everyone else on Gmail. 12:30 PM PDT (when Gmail went down) easily affects all US timezones, plus the timezones +1 and -1 outside of the US. Most of the rest of the world won’t be in business hours, so we’ll ignore them. London would be 8:30 PM at this time. Singapore would be 3:30 AM at this time.

        Now, let’s assume that at least SOME of those foreign users are Canadian, Mexican and in any of the countries in the rest of Central and South America. Will you grant me that 10% are Canadian and 10% are Mexican? It doesn’t seem to be that far of a stretch. Let’s ignore the rest of Central and South America. You’re looking at 50% US, 10% Canada, 10% Mexico – 70% of the users. And you decided to take the servers for maintenance during the few hours that we assume they’ll be hitting the service the hardest.

      • That’s right…*SOME*.

        • There are 24 freakin timezones out there, why do people on PDT time think the world should revolve around them? Maybe this was something that HAD to be done then and there because to put it off for another 12 hours may have caused more problems?

          Gmail is still an awesome “free” service (for all those that want to bitch about looking at advertising). Like someone else said pick up the phone, or meet for coffee? We didn’t know what email was 50 years ago but I’m pretty sure people still did business back then!

      • Amir,

        Good point, but to J’s overarching point, you do the upgrade at the time of lowest usage, regardless of the time. That’s what we do at my work. I’ve done server swaps at the weirdest hours. :)

        John

    • Err? You realize that Gmail is used worldwide, and that there is no “middle of the workday”? Such a myopic world view.

    • As hard as it might be for you to believe, the whole world is not situated in your time zone. No matter what time a company takes down its servers for maintenance, a lot of people are going to be affected.

      Here’s a crazy suggestion: if your email isn’t working, pick up the telephone.

      THATS not rocket science.

    • I’d be willing to guess that more people in Western/Central Europe use it than in the Pacific US.

  • today… almost the end of civilization, as we know…

  • gmail uptime is still an order of magnitude better than the Ex$hange server at work… I’ll take it…

  • Isn’t this the second or third time this year?! You can see why Yahoo is the #1 mail provider… No major failures in the last couple of years that I can recall.

  • Just curious…what are the number 1 and 2 email services?

  • At least, we should get the right message – we are slow – connect later –

    What do you think first? – oh, my connection is slow, or my provider does not work.

    Creates confusion, if you don’t get the right message.

    Nobody is perfect, anyhow.

    See you later,

    Dan Gabriel
    twitter.com/gdan

  • Seriously, this is why I tell customers that they’d be insane for moving their mail to gmail. Seriously insane. If not just for the outages, for the retardedly high level of false positives for spam, the lack of security, and the invasion of privacy.

    • And what should they use instead, your handrolled sendmail setup that you serve from your mom’s basement?

    • 1. Outages are much less than your average Exchange server. IMAP was completely unaffected.
      2. Some false positives – for me, one every couple of months. Weekly check of the Spam folder is sufficient. Worth it for the effectiveness of the spam filter. Surely there’s a trade-off?
      3. Explain the security issues you face.
      4. Invasion of privacy – I don’t really mind if a computer program reads my email. What’s to say a sys admin in your company isn’t reading your email right now?

    • Eldon, I could probably walk into your IT guy’s office right now and find a backup of your entire company’s email thrown on a shelf.

      That is assuming the guy is still doing backups after the spindle of DVD’s ran out.

  • I did not get their error page, but this buggy broken html page…
    http://reduce.li/petd (screenshot)

    looks like partially loaded sprites …

  • Don Lewis - Who uses gmail? - September 1st, 2009 at 9:15 pm PDT

    I have been using HOTMAIL (yes I said HOTMAIL) and in past 10 years…. it has never ever failed even once…

    huh… wonder who uses gmail if any…

    bullshit if their routers told… “stop sending us traffic, we’re too slow”

    Yahoo and Hotmail also have routers which have never crashed since their inception…

    o btw…let me check my inbox….

    • that’s so dumb. you’re making this into a my penis is bigger and works better than yours non argument. i’ve been using hotmail since 96-97 and there have been more than a couple of times where hotmail has gone down. gmail has a lot of users…it might not be popular with people who use yahoo, hotmail, aol, or any other email service but i have accounts with other email providers and yet i still gmail, and gmail has increasingly become my primary email account.

  • Imagine if that was googles yet to be introduced windows killing software that went down. Yikes!

  • If you want redundancy and rely on gmail so much, why not spend a few minutes configuring POP or IMAP access?

  • Nonsense, it is all nonsense. In a well engineered system they should have known about the consequences of taking down servers. This is not the first time Gmail has failed, and Techcrunch is trying to show it is not a big deal and that it is all just a maintenance problem (Techcrunch traffic is heavily saturated by Google). We all know there are some serious problems with Gmail. They cannot handle their own size. They have to split the company into separate entities that can afford to provide quality service locally.

  • Never had any issues today on my 2 macs and iPod Touch. But thats cuz I don’t use the web interface.

  • QQ more?

    Seriously, my free service’s web interface went down (Not on my BlackBerry) for an hour today… *cry* OMG POST NEWS

    Besides Tech Crunch guy it gave you your story for today, figure it paid for itself, huh.

    Like the other guy said kicks the hell out of my slow ass 20MB capped exchange server at work.

  • Nothing like a GMail outage to send some traffic to one of TechCrunches 17 articles about it.

  • Use IMAP. It’s better anyhow.

  • just for the record, if you read that, google gears still worked, so hardcore googlers would still be able to access it. also you could set up an account with firebird and/or evolution to get the emails if you needed them. I could go on, but gmail still kicks ass with its uptime compared to anything; I have been using it for over 5 years so I can say such a thing. well, have fun bashing things you know nothing about, and have a nice day

  • Now it looks like Harrison Ford is super pissed about the Gmail outage…

    http://tinyurl.com/ntpjut

    Google is so screwed.

  • Don Lewis - Who uses gmail? Part 2 - September 1st, 2009 at 10:02 pm PDT

    “that frequent gfails made me switch to Hotmail… ”

    LOL

  • So..when hotmail goes down, we get “Microsoft is evil.”

    When gmail goes down, the majority of comments are “We LOVE google”.

    GIVE ME A BREAK. It’s all the same!

    • Are you saying the history, corporate culture and attitude to users is identical at each company? Give me a break!

    • Are you claiming to be surprised that people on the internet love Google? Where have you been the last 10 years man?

    • you should talk to the guy a couple comments above yours who said hotmail has never failed for him…in all of 10 years, which is implausible…that’s more something i would attribute to yahoo, but i hardly used yahoo mail before i got rid of it. i happen to use hotmail all the time and it’s gone down more than once. gmail is reliable but also goes down occasionally. it’s life. it doesn’t mean one or the other is evil. what would suck is if this had become like what happens on twitter, and if people weren’t kept informed about what’s happening with gfail. i saw comments on the many threads yesterday that mentioned iGoogle and by the time i signed in through that, i could access my account from the gmail page. it was really a blip. it’s not like how when they are doing maintenance in the morning you can’t access gmail for hours. i didn’t have access to gmail for about 15-20 minutes.

  • What’s funny is that the borderline luddite “omg cloud sux… PROOF!!!!” crowd doesn’t understand that corporate or other forms of hosted email go down as well. Just because you run it yourself doesn’t mean it’s immune from failure. It’s only likely that GMail reliability will increase as their code and knowledge base matures. Like people said previously… it wasn’t really completely down, it was just the web interface. Also, the vast majority of companies will function fine without email for 100 minutes. It might actually help some of their employees get closer to “inbox zero.”

    The moral of the story is things go down — even things that are “mission critical” — and you should always have a backup plan.

  • I didn’t even notice! I was too busy writing:)

  • I use IMAP as a backup. It’s easy to set up and I highly recommend it for anyone who wants to mitigate outages. Kudos to Google for providing this support for FREE!

  • I still don’t get why people care why it went down. It went down. Knowing why it went down doesn’t make any difference.

    • er. i might not be able to do anything about it, but if a service i use goes down i would like to know the reason. the reason might make me think about switching to another service or not.

      • same. i don’t understand these questions of why do people care. people care because they use the service. it’s okay if you don’t care or feel the need to not be informed because you don’t use the service daily, fine. i don’t tweet, but i want to get a twitter account. i care when twitter goes down because i want to know if this is normal for that service and how the users feel about these downtimes and how they deal with it. when i had a fb, they would do routine checks and people couldn’t access their pages and couldn’t talk to or stalk their friends. i didn’t really understand what the travesty was because i felt people could just use the phone or email…but i am not laughing at those users for being dependent on fb. people who google, use gmail, pay for gmail are not stupid people. i’m sure although for some people they were stuck because they work from their gmail, but i’m sure many others had backup ways to email. i went back to hotmail and realized how much i miss my gmail account. it’s life. it’s not a big whine and dine. gfail wasn’t a big deal. the pr wasn’t that bad either considering they got the service back up and it wasn’t down for the whole day. i’d rather it hadn’t happened but it’s to be expected. kinda like how people are used to at&fail dropping calls all the time and how although apple pretends it hasn’t gone the way of evil…they’re already there. i think jimmy flip had the right sarcastic answer.

  • It’s Beta, duh!

  • Seems like I’m the only one who doesn’t use Gmail (other than the Hotmail guy, of course!). As somebody else said, growing pains can be just that, a pain, but at least Google are constantly striving to improve their offerings.

  • “People, like me, who use Gmail as their primary email couldn’t get much work done.”
    Yeah, let’s see. Be professional, DON’T use gmail as your primary work email. Use a REAL email address!
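
Several commenters above trade availability figures ("nines"), so here is the arithmetic behind them as a minimal sketch. It assumes a 365-day year and that downtime counts equally at any hour; the function names `availability` and `downtime_budget` are hypothetical helpers, and the 100-minute and 2-hour figures are the ones quoted in the thread.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def availability(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Percentage of the period during which the service was up."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def downtime_budget(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed in the period at a given availability."""
    return period_minutes * (1 - availability_percent / 100.0)


# A single outage, with no other downtime for the rest of the year.
print(f"{availability(100):.3f}%")  # 99.981%
print(f"{availability(120):.3f}%")  # 99.977%

# Yearly downtime allowed at three, four, and five nines.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget(target):.1f} minutes per year")

# One user out of 100,000 down for the entire year, measured in user-minutes.
print(f"{availability(MINUTES_PER_YEAR, 100_000 * MINUTES_PER_YEAR):.3f}%")  # 99.999%
```

By this math, one outage of 100 to 120 minutes on its own leaves a service between three and four nines for the year: better than 99.9%, but already past the roughly 53 minutes that 99.99% allows.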
