
Only two days ago the contact messaging application Twitter suffered another bout of downtime, leaving some users frustrated and others asking why the platform continues to suffer problems.
I have recently spoken to an individual who is familiar with the technical problems at Twitter as well as the challenges that lie ahead for the startup. He reiterated his belief that the problems lie not with Blaine Cook (the former head of engineering who was shown the door), nor with NTT (their host), but with the early lack of understanding of how complex their problems would be.
The issue is that group messaging is very difficult to achieve at a grand scale. Other large sites such as WordPress and Digg are mostly dealing with known problems, such as how to serve a large number of pages or a large number of images. Twitter is unique in that it needs to parse a large number of messages and deliver them to multiple recipients, with each user having unique connections to other users.
Social networks have similar complexity issues, but they usually only need to route a message to a single user (or at most to a defined group). Even so, social networks like Friendster struggled for years with technical and scaling issues. Twitter is specifically dealing with text messages, and for active users those messages are very frequent and go out to hundreds of contacts (or followers, as they are called on Twitter). Every new Twitter user and every new connection results in an exponentially greater computational requirement.
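To make the routing problem concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions rather than anything Twitter actually runs: the follow graph is a toy in-memory list of (follower, author) pairs, and `build_followers` and `count_deliveries` are hypothetical names. It counts how many individual deliveries a batch of posts produces when every message has to reach each follower of its author.

```python
from collections import defaultdict


def build_followers(follow_edges):
    """Map each author to the set of users who follow them."""
    followers = defaultdict(set)
    for follower, author in follow_edges:
        followers[author].add(follower)
    return dict(followers)


def count_deliveries(post_authors, followers):
    """Count (message, recipient) pairs: one per follower of each post's author."""
    return sum(len(followers.get(author, ())) for author in post_authors)


# Toy data (hypothetical): bob and carol follow alice, alice follows bob.
edges = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
followers = build_followers(edges)

# alice posts twice (2 followers each time), bob posts once (1 follower).
print(count_deliveries(["alice", "alice", "bob"], followers))  # prints 5
```

In this toy model the cost of a post is the follower count of its author, so a handful of heavily followed, frequently posting accounts can account for most of the deliveries.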
Some of the best web applications are able to efficiently solve very complex problems to produce simple results for users (e.g. Google). The success of these applications is due to the innovative efforts by developers to solve large technical challenges, where they have often had to break new ground for solutions. For Twitter to reach a similar point of reliability they too will need a very comprehensive, ground-breaking solution.
The source that I spoke to also commented on how ill-prepared the Twitter team were and are for their current and future challenges. The small team contains a handful of engineers, with only a person or two committed to infrastructure and architecture. He went on to point out that at Digg the team for network and systems alone is bigger than the total engineering team at Twitter, and that at Digg they are led by well-known “A-list rockstars”.
The problems at Twitter are often attributed to their use of Ruby on Rails, a web development framework. Twitter is almost certainly the largest site running on Rails, so fans of the framework and its developers have been quick to deflect the criticism and point it back at the engineers at Twitter. Utilizing a framework that has never conquered large-scale territory must certainly add to the risk and work required to find a solution. As an out-of-the-box framework, Rails certainly doesn’t lend itself to large-scale application development, but it was a big part of the reason why Twitter could experiment and release early.
Rails has enabled Twitter to prototype quickly, to launch quickly and then to iterate easily on new features. But the old adage of “Good, Fast, Cheap – pick two” certainly applies, and Rails would do itself no harm by conceding that it isn’t a platform that can compete with Java or C when it comes to intensive tasks. Twitter is at a crossroads as an application and Rails has served its purpose very well to date, but you are unlikely to see a computational cluster built with Ruby at Apache any time soon.
What we see at Twitter today is a very useful and popular service, but one with very complex underlying technical challenges to overcome. Twitter will require not only a new architectural approach and a big injection of the best minds they can find ($15 million can help), but will also need a little patience from users and those of us observing.









said it before, and will say it again…. why cover such a crappy service. credibility issue alert
I agree, this service is trash, yet it continues to receive coverage on TC
Solving Twitter’s Tech problems is actually very easy.
Compare that to the enormous computing needs of Meebo for each user…
“with only a person or two committed to infrastructure and architecture.”
WTF? They’ve been at this for almost a year now. And they’ve only hired one or two people? Hope they’re fairly smart.
FWIW, Scribd is a Ruby on Rails site that (according to Alexa) is quite a bit bigger than Twitter in terms of web traffic. Of course, the API might tell a different story.
Nonsense.
This is not that hard or unique a scaling issue, especially given how long they’ve been plagued with these issues. Similar Instant Messaging architectures are well documented and have scaled well beyond the scale of Twitter.
Patience? Twitter’s users are the most patient I’ve seen. These problems have been going on for over a year now. You can see their downtime and avg response times at Pingdom: http://www.ping...ame=Twitter.com
Mod me down (oh yeah, you can’t!) but I do not understand this notion of ‘delivering’ the messages to people. What is meant here? If I post a message to twitter, and I have 400 followers, are there 400 copies made and ‘delivered’ to each follower? Is the term delivery another way of saying SMS alerts?
All external services which access the API cause quite a load on the Twitter servers. We saw that with one of our side projects… There are many external sites checking the feeds and calling functions…. This is one of Twitter’s strengths and also its Achilles heel….
7. I think you’re onto something, there’s something fundamentally wrong with what they’re doing.
PHP with Zend optimizations. That’s what they need.
Why not decouple the computationally intensive pieces? Use C or something else for the high performance parts and keep it more agile with Rails on the front end. Seems simple, so maybe I’m missing something.
With regard to Scribd, does anyone know if they use Rails for transcoding docs to iPaper or just for the UI?
Woops, comment was trunc’d
7. I think you’re onto something, there’s something fundamentally wrong with what they’re doing.
That’s what they need. Though it must be a problem with _how_ they’re doing it.
PHP with Zend optimizations. I’m not impressed with Ruby, I’ve seen cases where it’s derailing before it even hits the track.
Methinks you spelled Blaine Cook’s name wrong.
But the old adage of “Good, Fast, Cheap – pick two” certainly applies and Rails would do itself no harm by conceding that it isn’t a platform that can compete with Java or C when it comes to intensive tasks.
or .net
I suspect the problem is rails and mysql. Maybe they should be looking at bigtable or a bespoke system.
The biggest problem is will they have enough time to get it right before people get annoyed and walk away for other solutions?
my take on an open twitter system http://darrenst...m/post/35530215
@10 – Digg and Facebook use C extensions for heavy parts. I think they’re relying completely on Ruby from what it sounds. You can only scale hardware enough where you need a fundamental software overhaul.
agreed. this scaling issue is easier (much easier) than that of systems like phone switches. twitter management is at fault for not hiring the right dev talent all this time. even when they’re up their sys barely can keep up with message flow. clearly their initial sw architecting was incompetent. it’s embarrassing.
I don’t know what’s all the fuss about. Twitter is a playground that keeps most of us distracted from our actual work. So I am actually thankful when twitter is down for a few minutes so we can get stuff done…
Remember it’s no rocket-surgery, as DHH says, and Rails enabled twitter to come into existence in the first place. They would have never had the time, patience and knowledge to get it up in C++ and we would have never gotten that wonderful toy to distract us from work.
Now they might have to move on to build a rock-solid backbone infrastructure for a world-conquering system. By the way I’ve read that for a few years eBay had to re-write their back end systems on a yearly basis to keep up with the demand.
Are investors getting a bit nervous about the lousy reliability – probably yes. Should they make sure that the next round goes into a hardcore engineering team – yes they should. Is this all common sense… that’s for you to answer.
By the way, maybe now would be a good time over at twitter HQ to have a strategy meeting over how to monetize the service (as in drafting out a business model). The implementation of such might dampen the growth a little – which appears to be just what they need right now. In turn if they had a way to monetize the growth scaling problems would actually be fun to have because they could hire more ‘rock-star’ engineers with all the incoming cash flow…
Back to our complaints: No-one (hopefully) tried to call the doctor via twitter to come to an emergency yet. And the 911 operators don’t have a twitter account yet either. So while the twitter guys are fixing their infrastructure problems we could just remember that there is a phone in case we really need to connect with a friend right now, that there is IM if we want to blabber with a few more at the same time and there always is email to spam everybody.
My humble advice to twitter:
- turn it into a business now, or it won’t survive no matter how big and reliable
- bite the bullet and re-write if you have to without complaining about the technology that enabled your success story.
- Don’t hire rock-stars, solid engineers sans ego-trip will do fine
Tip to the rest of us:
- Why not all of us go over to pownce for a week and see how they scale on a dime (would be a fun exercise)
- The thing buzzing in your pocket: it’s a friend that wants to talk to you
Peter
do you follow me @ http://twitter.com/peterurban -in case it’s up
Facebook was dealing with a very similar issue when they were designing their chat system. The difference is that Facebook started their implementation with a clear goal to scale the system up to 70 million users early on.
Link to the article if anybody is interested in details: http://highscal...rs-using-erlang
wah.
This is why Twitter needs to be more of a decentralized platform to really hit up larger scale. I ranted about this a couple days ago http://www.soci...BlogPost%3A1501 Sorry if the rant isn’t terribly coherent, but the idea is to turn the micro-blog into a standard and use the internet as the computational network via multiple servers capable of hosting the standard APIs and allowing for tweaks, then just adjusting incoming and outgoing feeds.
No idea if that would work, but it would be neat if it did. Think Wordpress extremely light with social network capabilities.
Interestingly enough Twitter picked “fast and cheap”. In business you want to build customers as fast as possible and fix the bugs along the way. Check, they’ve done that and as loyal users we have stayed relatively loyal. The unfortunate problem for them is that they picked a platform that has a Drop Ceiling!!! If they can’t get more funding for more developers can they monetize what eyeballs they have left to pay for more developers? They have to act quickly before someone capitalizes on this apparent weakness…
Twitter’s problem is how it has attacked the problem. Twitter basically has the computational model of a stock exchange such as the NASDAQ not a web site. At a large bank you have millions of messages being passed around every second and actions needing to be taken on those messages. That is the same model as Twitter.
I should have clarified this further in the post, but the issue certainly isn’t with creating new messages, the issue is when you check your own messages
For example, if I am following 200 people, the query needs to check the new messages from all those people and sort them into date order. That query would be really intense
Now take that query and multiply it out for each user, and on top of that you have all the Twitter clients requesting the same query every x seconds
Caching only helps to an extent, since that query changes so often (i.e. as soon as one of those 200 posts a new message, it’s a new result, hence a new (big) DB hit). It wouldn’t surprise me if the DB is falling over on those requests.
Also judging by message IDs and what I have seen so far, their data is in no way segmented – so you have one BFT (big fucking table)
I agree that the delivery part can be scaled.
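The read-side query described above can be sketched as a merge of per-user message lists. This is a toy in-memory version with hypothetical names (`timeline`, `messages_by_user`) and made-up data, assuming each user's messages are already stored newest first; a real system would run the equivalent against its database.

```python
import heapq
from itertools import islice


def _tagged(user, messages):
    """Yield (timestamp, user, text) for one user's messages, newest first."""
    for timestamp, text in messages:
        yield timestamp, user, text


def timeline(following, messages_by_user, limit=20):
    """Return the newest `limit` messages across everyone in `following`."""
    # One stream per followed user: the work grows with the size of `following`.
    streams = [_tagged(u, messages_by_user.get(u, [])) for u in following]
    # Merge the already-sorted streams by timestamp, newest first.
    merged = heapq.merge(*streams, key=lambda m: m[0], reverse=True)
    return list(islice(merged, limit))


# Made-up data: user -> [(timestamp, text), ...] sorted newest first.
messages_by_user = {
    "a": [(5, "a2"), (1, "a1")],
    "b": [(4, "b2"), (2, "b1")],
    "c": [(9, "c1")],
}
print(timeline(["a", "b"], messages_by_user, limit=3))
```

Every call touches one stream per followed user, and a new post from any one of them changes the result, which is why a cached copy of this answer goes stale so quickly.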
Performance DOES NOT equal scalability and FUD articles do no service to anyone. Java is faster than Ruby, and C is faster than Java, and direct binary machine code dominates them all, but none of them are any more scalable than the other.
Also, very little if any of twitter’s messaging system has anything at all to do with Rails. Rails is a web framework and has no relation to messaging queues.
http://skwpspac...odgates-of-fud/
There’s no excuse for terrible planning and management. Compassion doesn’t seem like the appropriate response when a product represents itself as public-ready, and very frequently fails to operate. Twitter down has become the joke of the office here.
It would seem that any savvy investor would have already realized that the underlying technologies were faulty and have rapidly undertaken re-engineering. I don’t think we’re dealing with impossible obstacles.
Aloha, Jeff
Thanks for the explanation about Twitter’s problem in a simple language. I am a Twitter fan but do not know much about technology. Henceforth, i will be more patient. I am sure, Twitter will solve this issue soon.
Actually, it’s much more complex than an instant messaging system, more akin to a group chat system, and those are notoriously hard to manage. A big part of the problem is exemplified by Robert Scoble. The amount of processing required every time Scoble posts something, or anyone he follows posts something, is staggering compared to the average user. All of these super-followers are taking down the system – ideally Twitter should charge you if you follow over 200 people to discourage this, otherwise they’ll never manage the scale.
Big-O notation – ya heard of it?
Nik, I would like to correct this sentence: ‘nor with Joyent (their hosting company)’.
Joyent is NOT their hosting company. Twitter moved to NTT in January. The post reads as ‘current’ so I would like that noted. Please.
Seriously? Another twitter post? Good God.
I am flabbergasted at the amount of comments from people who are utterly clueless about computer science, scaling applications or development as a whole but that still are able to point fingers. You guys do have the right to state your opinion but keep in mind that just because you throw in a couple of buzzwords to the mix, it doesn’t make you right. This is true for most articles on the blogosphere that mention Twitter’s scaling issues.
#9: PHP with Zend optimizations doesn’t mean shit to Twitter because their growth problems are with their architecture. Actually, they’ve said so themselves, admitting their problems were with their database, not their frontend or API.
It’s pathetic to keep reading articles about Twitter and Ruby on Rails. If Twitter was implemented with a similar architecture and a different framework things would be no different. The problem they have right now is that in order to change their architecture to something that scales effectively in terms of message delivery, they’d be redesigning (from a scientific perspective) the application all over again.
(PS: Nik: this is naturally not a rant about you or your article, although I believe you’re wrong about Rails being the issue here. Hope you’re doing well!)
Not all of Twitter is Ruby: http://twitter....tuses/801530348
I recommend you read this blog post and a couple related posts on this site.
http://theabstr...itter-problems/
Great post Nik. Seems to me that most people underestimate the complexity of what Twitter is delivering. It is NOT as simple as an Instant Messaging platform so a comparison to that is to compare apples and oranges. The messaging on an IM platform is mostly real-time (you might need to collect some IM’s while someone is offline) versus Twitter where each user has a history of tweets from their friends, their own message history, etc. that has to be served up to Twitter.com and via the API throughout the day. Also, think about the difference in scale. Most IM message traffic is 1:1 or you might have group chat with 10-20 people. When Robert Scoble or Leo Laporte send out a tweet it’s going to 20,000+ people at one time. And it’s not unusual for “average” users to have several hundred followers. Throw in the API support and the SMS layer and it gets even more complex.
Why doesn’t twitter talk to the guys at Google App Engine?
– Google’s infrastructure could handle the traffic w/o thinking twice
– “We made twitter scale” would be incredible marketing for App Engine
– Google engineers would probably pitch in to help them make the transition
– It would get twitter closer to a potential acquirer
I’m sure that Rails isn’t the problem. Scribd is an example of this. They probably started Twitter as quick development with no specs and now they’re finding many issues related to how to scale the db or keeping all the things together.
This is an excellent article, and hits the nail on the head.
I was the SVP of Engineering at m-Qube back in the day when we were hooking up carrier-to-carrier text and data transmission. Our traffic growth was 23% every week…the traffic numbers dwarfed anything that Twitter has seen to date, but it took a lot of work, 24/7 vigilance and an engineering staff of 60 and Ops staff of 9 to keep up: Constant changes to databases, network infrastructure, SAN arrays, big iron in co-lo’s, changes to software infrastructure to support big traffic and rapid database I/O. We were making changes to the system and the architecture while we were in flight. By the time I left, we were handling north of 2B transactions a day (up from 17,000/day when we turned it on).
The system? Java. The appserver? JBoss.
Ruby one day might be the new Java (but I have first hand experience with Ruby – and personally, I doubt it), but that is a long way away. Ruby is – at best – a nascent technology. It is roughly in the same state of scalability and large scale system support as Java was in the very early 90’s. It makes great demo, but that’s about it.
Twitter is doing itself a serious disservice by not using that $15M-$20M to hire a new team of, as the article itself states, “rock stars” who understand large scale transaction systems and scalability, and also by not switching languages.
It sounds like in Twitter’s case, the growth of the business results in exponential growth of traffic. That’s a losing game folks! More and more dollars chasing an ever-more demanding capacity/scaling challenge. Is that a smart investment?
Fred: you’re right that even if it was mod_php using a php database connector into MySQL, the query for latest posts would still be heavy (you might save ~5%). I’m not sure what the reflection part of RoR adds as a cost overhead
In terms of Big O, the query for latest posts would be x^n, which is the problem as with twitter you have a large number of followers on average and also a large number of messages
Also interesting point that the top 0.1% of users probably do account for a huge % of load. It only gets worse with every extra follower they get and with an increased frequency of messages
Replacing MySQL with a distributed purpose-built DB (à la BigTable, or sets of sqlite) is a solution
Also, all corrections made – thanks
@21: The problem isn’t just “millions of messages…” It has to do with the specific pattern of messages. Traffic in a system like Twitter (and many other pubsub systems) is exceptionally bursty.
For example: TechCrunch has almost 18,000 followers on Twitter. That means that every time TechCrunch tweets, you need to
* Determine which of 18K users are following TechCrunch on XMPP and if they are online, send them the update.
* Determine which of the 18K users want SMS delivery and queue those deliveries.
* Update the “tweets received” list of all 18K users
* Update Techcrunch’s history, feeds, etc.
* Mark as “stale” any cached RSS/Atom feeds that might exist for the 18K users.
* Match the message against any of the “tracking” subscriptions that might exist (Some popular words could result in thousands of updates…). Then, determine which subscribers are online and send messages as needed.
* various other things…
Now, do the above at the rate of many, many times per second for over a million users who each have anywhere from 0 to many thousands of followers… The system is exceptionally bursty and very unlike what you see in most other messaging systems that you might be familiar with.
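The per-tweet steps listed above can be sketched roughly as follows. This is an illustration only: the `Follower` class, the `publish` function and the in-memory queues are hypothetical stand-ins, and the tracking-subscription and author-history steps are left out for brevity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Follower:
    name: str
    online_xmpp: bool = False
    wants_sms: bool = False
    received: list = field(default_factory=list)  # "tweets received" list
    feed_stale: bool = False                      # cached RSS/Atom feed flag


def publish(author, text, followers, xmpp_out, sms_queue):
    """Fan one tweet out to every follower; return how many followers were touched."""
    touched = 0
    for f in followers:
        if f.online_xmpp:
            xmpp_out.append((f.name, author, text))   # push to online XMPP users
        if f.wants_sms:
            sms_queue.append((f.name, author, text))  # queue an SMS delivery
        f.received.append((author, text))             # update the received list
        f.feed_stale = True                           # mark the cached feed stale
        touched += 1
    return touched


followers = [
    Follower("ann", online_xmpp=True),
    Follower("ben", wants_sms=True),
    Follower("cat"),
]
xmpp_out, sms_queue = deque(), deque()
print(publish("techcrunch", "hello", followers, xmpp_out, sms_queue))  # prints 3
```

With three followers the loop body runs three times; with 18K followers it runs 18K times for a single tweet, all triggered at once, which is where the burstiness comes from.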
There are not a great many folk out there who have experience with large scale publish/subscribe systems. Don’t be too quick to trivialize what Twitter is doing. There are methods to handle this stuff — but not too many people have ever even been close to needing to use them… (Note: But, I agree, they should be able to do much better than they are.)
bob wyman
Erlang. Enough said.
Two things that come to mind here:
1) Email
2) User Patience.
The fact that Twitter can’t handle complex algorithmic calculations and handoffs to worker bots, etc is not an excuse. Email is just as complex from a top-down perspective. Add in mail server configs of a rainbow variety and it may be more complex. The answer is distribution. Winer likes to claim he was the first one clamoring for this but I was long before him. If you treat Twitter as a standard protocol instead of an application, you a) alleviate concerns of downtime, b) distribute the cost (who needs $15M anyway), c) mitigate user data loss should the service ever cease to exist – which is inevitable at the rate they are going.
User patience. Look, we’ve come to rely on Twitter. “You get what you pay for” doesn’t cut it anymore. I would happily pay $20, $20, $40/mo to have a reliable service. The fact that people are pissed is not a reflection on user patience, but a reflection of the fact that the Twitter team is completely incapable of solving these problems. The fact that they can’t hire competent people to scale this thing with the funding they already have is a reflection of this. We had our share of problems at b5, but the company is in its first round of funding with an initial investment of $2M and they (formerly, we) have hired smart people who have solved some of the scaling problems we had. You don’t need MIT PhDs to do this stuff - you just have to swallow your pride and hire people smarter than you.
To date, this article included, I have no confidence in Twitter. Yes I use it. Yes I get pissed when it’s down. But more so, I’m frustrated watching how they work.
Nik – you just hit on the problem, except wouldn’t this be a factorial algorithm, rather than exponential? If I go from 40 to 41 isn’t a 41! process now required?
In either case, every time Twitter adds a small percentage of users, they could experience a doubling of infrastructure demand. We ran a group chat-like environment on an app and experienced a jump from 10m/s in traffic to 20-30m/s when our user count went from ~150 concurrent users to ~200. We found a way to strip out the bandwidth per transaction, but it’s still exponential for us. Imagining our problems on Twitter’s scale gives me plenty of sympathy for their downtime.
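As a rough check on those numbers, assume every user posts at the same rate and each message is relayed to all other concurrent users (both are assumptions, not details given in the comment). Traffic then scales with the number of sender-recipient pairs:

```latex
T(n) \propto n(n-1), \qquad
\frac{T(200)}{T(150)} = \frac{200 \cdot 199}{150 \cdot 149} = \frac{39800}{22350} \approx 1.78
```

Under that model, 10m/s at 150 users would become roughly 18m/s at 200, close to the lower end of the jump reported above.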
@#5 two things wrong with your statement:
1.) “According to Alexa”
2.) “might tell a different story”… might???
The time to defend Rails (or any “Framework”) as a production tool is past. For prototyping? Sure, why not… prove that concept, raise some initial capital, and churn out a production level application with the moneys raised… (rather than giving executive raises in exchange for shorter work days and a big-ass party with free booze and expensive hookers*… Getting an investment should mean it’s time to work twice as hard, not half as much.)
* not saying twitter did this. but it’s the vibe of the valley in general. what they did not do was take their initial investments and turn them into a production level application. (they should not be on rails at this point, they simply should not. it’s got nothing to do with their bitchy users, it has everything to do with their gracious investors.)
#43, why is this even a Rails issue? If the web site is the root of their problems, then they are really lost.
These scalability issues are really relatively easy to fix if you’ve been there, done that…
Sure it would be easy to scale Twitter with a 100-node Oracle 11g RAC Cluster, but it could be done using MySQL (it’d be harder, but it can be done). The problem is not the tools (RoR, MySQL, etc.), it’s the architecture.
I’ll quote Jack Nicholson…
“I’m an artist, you give me a f%$@ing tuba and I’ll get you somethin’ out of it.”
I use twitter, but I am certainly not reliant on it. It is interesting, and learning exactly how it works from using the site, 3rd party apps, and Pidgin (my IM client) has been really interesting. 98% uptime is more than enough for me. and I hate whiners.
But, what I REALLY hate is idiots that think they know anything about the technical challenges that face twitter. IT IS NOT SOME LANGUAGE OR FRAMEWORK ISSUE!
Real-time, cross-protocol, archiving, access-control (my twitter feed is not public), follower lists and followed lists. Just not quite as simple as you all think.
I’m going to write a greasemonkey script that hides every techcrunch article tagged with twitter.
Seriously, this is getting out of hand. Not only is twitter WAY over hyped, the amount of coverage it gets here is just staggering…
Please, quit blaming rails
As you know….wait, as you obviously don’t know (from the post, your comments show otherwise for some reason), the way you scale any web app (php, python, asp.net, rails, anything) is to separate the application from the database. At that point you can have as many app servers (ruby on rails) as you need, and you have to deal with the database.
It is true rails as is isn’t scalable out of the box, but that hardly means “rails isn’t scalable”, an hour’s work could get the majority of an application following a custom schema, whether that means partitioned, or slaved.
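As a rough illustration of the “partitioned” option mentioned above, here is a minimal sketch that routes each user's data to one of several stores by hashing the user id. The names `shard_for` and `ShardedStore` are hypothetical, and plain dictionaries stand in for separate database servers.

```python
import zlib


def shard_for(user_id, shard_count):
    """Pick a shard deterministically from a stable hash of the user id."""
    return zlib.crc32(user_id.encode("utf-8")) % shard_count


class ShardedStore:
    """Toy message store: each dict stands in for one database partition."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def _shard(self, user_id):
        return self.shards[shard_for(user_id, len(self.shards))]

    def append(self, user_id, message):
        self._shard(user_id).setdefault(user_id, []).append(message)

    def messages(self, user_id):
        return self._shard(user_id).get(user_id, [])


store = ShardedStore(shard_count=4)
store.append("alice", "first post")
store.append("alice", "second post")
store.append("bob", "hello")
print(store.messages("alice"))
```

The simple modulo scheme is an assumption made for brevity: changing the shard count moves most users to a different shard, so a real deployment would need a migration plan or a different mapping.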
I believe the SMS plug could be causing problems. Also, this is not a 1-1 chat but a one-to-friggen-many-many. that’s what causes problems every time arrington or scoble send out power-tweets. the IM backbone, built primarily for one-on-one chat takes a hit. so my take is that they need to rebuild the backbone. it’s just like the cell networks going down during a tsunami or terrorist attack as everyone starts using them. *sigh*. probably need to rebuild the architecture ground up again, taking into account it’s a 1-2-friggen-many-many.
Nathan: you might be right that it is x!
For those not familiar with it, it is the traveling salesman problem:
http://en.wikip...alesman_problem
So every additional user/follower adds more than just a proportional level of load. Twitter should publish their numbers so that users understand the problem more.
Going from pull to push is a good solution as well. I remember reading somewhere that a huge % of the traffic is API traffic.. so push makes sense. For web apps you would just need a callback URL, for desktop apps NAT traversal
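The callback-URL idea could look roughly like the sketch below: clients register a URL once, and the service posts new messages to it instead of being polled. Everything here is hypothetical (the `PushHub` class, the JSON payload shape, the `http_post` helper); it only shows the shape of the approach, with the sender made replaceable so it can be tried without a network.

```python
import json
import urllib.request


def http_post(url, payload):
    """Default sender: POST the payload as JSON to a subscriber's callback URL."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


class PushHub:
    """Keeps callback URLs per author and pushes each new message to them."""

    def __init__(self, send=http_post):
        self.callbacks = {}  # author -> set of callback URLs
        self.send = send

    def subscribe(self, author, callback_url):
        self.callbacks.setdefault(author, set()).add(callback_url)

    def publish(self, author, text):
        payload = {"author": author, "text": text}
        urls = sorted(self.callbacks.get(author, ()))
        for url in urls:
            self.send(url, payload)  # one outbound request per subscriber
        return len(urls)
```

A client that would otherwise poll every few seconds makes one `subscribe` call instead, and the service only does work when there is actually something new to send.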
I think it will end up being decentralized, either Twitter will come out and specify the standards or somebody else will do it and they will eventually build in support. It just feels like they are trying to do the equivalent of implementing all of email in a single central service
Also the emphasis of this post is about the core problem, not so much about the tools they used. I am saying that the solution they find for Twitter at this scale and beyond is likely not to involve RoR (or MySQL for that matter)