January 15, 2008

Joyent Suffers Major Downtime Due To ZFS Bug

Duncan Riley

Services provided by cloud computing provider Joyent have been offline for three days following issues with ZFS.

Strongspace and BingoDisk are among the services that have been affected since January 12. Joyent CEO David Young said in a post to the Joyent forums that the service “got bit by a massive ZFS bug…that got onto/into the backups”, preventing easy restoration.

Data Center Knowledge has more, and you can see our review of Joyent’s Slingshot product here.

The problems at Joyent follow ongoing difficulties at online storage provider Omnidrive, which saw its main website go down for over a week following suggestions that the service may be facing financial difficulties. At the time of writing, Omnidrive’s website is back up, and other reports suggest that the core storage service remained unaffected by the website issues; however, there are big questions about the company’s long-term viability.

Comments

  1. Nikc

    nice shoutout to omnidrive duncan

  2. Technicle

    Sun is known to have design issues once in a while for their storage products, software or hardware.

    The other major one was a hilarious timer counter reset on a RAID product a few years ago that would cause total loss of all data on the RAID. Sun’s advisory to users was (amongst other remedies): power cycle the RAID once in a (long) while :-)

    Omnidrive, well, fast becoming irrelevant. Mike should get out of that asap.

  3. Frank54

    ‘Got into the backups’. Yeah right, I bet they just didn’t have any. How in god’s name does a ‘bug’ ‘get into’ backups?

  4. Alex

    Here’s the ZFS forum thread about this issue:

    http://www.opensolaris.org/jiv.....p;tstart=0

    Looks like Joyent was stung running a stale version of OpenSolaris that they felt was otherwise stable. I wouldn’t wish this troubleshooting adventure on anyone.

  5. Steve

    Is this the reason why Twitter has been sluggish/down for the past 24 hours - being customers of Joyent? Or is it more about failing to cope with the unexpected demand (again)?

  6. Frank54

    Twitter left these hot air salesmen weeks ago, though joyent have been carefully deleting any comments or questions on their forums about it. This company is dishonest and unreliable, and their overpriced unreliable solaris junk has been shown up for the trash that it is.

    Twitter’s issues probably have more to do with the macworld keynote.

  7. Alex

    @Frank54

    Would you be kind enough to elaborate on your claims? What is your platform of choice? Why is Solaris junk compared to your platform?

  8. Lars Fischer

    I can highly recommend using JungleDisk / Amazon S3 for remote data storage. Before that I used Joyent Strongspace and I liked the application a lot - but it was slow as hell and unreliable.

  9. Technicle

    Any encrypting service doesn’t have much of a market, or more precisely, won’t have the mass market. It’s as lame as selling an encrypted email service. Yes, it’s something good, technically. But who really cares? If you can trust unencrypted email (including plain-text transfer via smtp), why can people trust unencrypted storage?

    Encrypting for transit might be reasonable. An encrypted login procedure is almost a must (note: almost). But encrypted storage? Who cares.

    Simple (successful) example: S3

  10. Technicle

    >plain-text transfer via smtp), why can people trust unencrypted storage?
    plain-text transfer via smtp), why can’t people trust unencrypted storage?

    and there’s no need for any answer — it’s not a question, rhetorically or not.

  11. s3box

    Nothing beats S3 — not even Google.

  12. Andrew

    is there a good 3rd party comparison for S3 vs Joyent

  13. Jason

    Frank54 I can almost guarantee you’ve never set foot in the Solaris platform, let alone done ANY work in ZFS. Get back to your Windows 2003 server.

  14. till

    The article is slightly “off”. The headline should be, “Joyent Suffers Downtime Due To Running Outdated Software”.

    If you read closely, that was the reason. Not because of something unforeseen. Just because someone didn’t update, etc. Always sucks having to deal with backups and shifting around lots of data. But it’s their business.

  15. Ross

    I find this quite funny after Jason from Joyent bashed me for one of my articles: http://www.hostdisciple.com/20.....b-20-site/

    I’ll let you read it yourself but in it I basically said Joyent appears to have problems with downtimes/customer service which Jason blatantly denied, maybe he’ll post an apology….

  16. Technicle

    I fully support attributing the story to a ZFS bug - because that’s exactly what it was.

  17. Amit

    Alright then, I was just planning on signing up with Joyent. I might as well hold off, but there are some people that have some deep resentment for Joyent. Is this valid or is Joyent overall reliable (excluding the current issue)?

  18. Jason

    It was a ZFS bug that was fixed well before they had the issue. Not updating to the latest version of FREE software is ridiculous and irresponsible.

  19. John

    don’t knock them, it’s a minor thing and they’re on it.

    should be fixed in another 3-4 hours.

    they’re good guys!

  20. doug

    I don’t understand the comments asking for comparisons between joyent and S3. S3 is a _storage_ platform, not a “cloud” computing offering. Perhaps you meant EC2?

    Regardless, I’m not moving anything important to EC2 until SLAs are put in place and they offer some basic features such as … say … a static IP address that I don’t lose if I reboot an instance?

    I’m all for bleeding edge when I’m looking at development, but if my business is relying on it, I’m still hosting with more tried and true dedicated server technology for my production environment.

  21. anonymous

    Joyent’s responsible for a good part of twitter’s down times. Rumor has it twitter is moving off of Joyent as soon as they can. Sometimes marketing hype is just that.

  22. dave

    Wow first Dreamhost, now Joyent with a major screw-up. BTW - they must feel pretty silly right now. Touting Sun for a couple years, pimping Sun hardware like it was the second coming and now they’ve switched to Dell and gotten smacked by ZFS.

  23. Amanda

    I haven’t been a fan of Joyent since my company first started looking for a new hosting company a few months ago, with their horrible communication skills, pricing, amongst other things. Now they’re DEFINITELY off of the prospective hosts list.

  24. Jungle Fire

    Who’s the best RoR host out there, in all of your opinions?

  25. Jason

    Probably EngineYard now.

  26. Kristie Wells

    Disclaimer - I work for Joyent.

    We are keeping our customers up to date on the issues at hand and have posted an update on our corporate blog. http://www.joyeur.com/2008/01/.....isk-update

  27. Dan

    I’ve been using Joyent servers for a while and I’ve had good experiences, save for growing pains on the shared BSD servers a couple of years ago. I think this is a similar occurrence, though it of course stinks for those who are affected. I appreciate the company’s transparency, and would still recommend them to others (once this is sorted out of course). And as for Frank54’s unkind words, I’ve found Solaris to be quite stable and adequate on my medium Joyent accelerator.

  28. David Young

    I am the CEO of Joyent. I wanted to make one clarification, because the headline of this post seems to imply that all of Joyent is suffering an outage. The vast majority of Joyent services are up and running without issue. While we haven’t yet restored service for Strongspace and Bingodisk, the ZFS issues are in no way affecting our Rails, PHP, Java, Python, etc. hosting customers. Joyent Accelerators continue to provide excellent on-demand service for Joyent’s thousands of customers, including one customer driving nearly 20 million page views a day.

  29. Geoff Cheshire

    I’ve been a Joyent (well, TextDrive) customer since January of 2005. While Joyent has had its share of bumps, I feel I’ve received a good product at a good value over the long haul. The community is a great asset, and they provided platforms for doing interesting web stuff.

    One statement above that’s misleading is that Joyent was just running a “stale” version of OpenSolaris and got what they deserved. There may be some room for criticism here (I don’t know, I’m not a guru), but people like Ben Rockwood and Mark Mayo are not amateurs. They held back on upgrading the OS for a reason.

    [D]oing upgrades on things like thumpers is non-trivial and high-risk, and so we watch the commit logs and bug reports and decide what constitutes a critical problem that demands an upgrade, and what doesn’t. The root cause of this problem isn’t yet fully understood, but I can tell you that we’ve hit several apparently “unique” and “impossible” conditions that definitely have not been reported in any existing bug or commit log.

    I really wish it were as simple as “stay on top of known bugs and update accordingly”. Really, I do.

    http://www.joyeur.com/2008/01/.....te#c008480

  30. Slapper

    First, I would never run mission-critical instances/data on open source operating systems.
    We ran into a bug with Red Hat AS4 U1 (and this is not open source) and its lpfc driver: whenever a registered state change notification was sent from the fiber switch (this is sent even when a zone is created), all the LUNs assigned to any RH node on U1 would move into read-only mode. This caused one of our Oracle databases under heavy load to suffer major corruption.
    Openly slamming technology usually highlights shortfalls in the business continuity plan for the affected data/customers.

  31. Slapper

    Sorry, I meant to say open source without commercial support.