With more than 15 billion photos (and 60 billion image files with replication for different sizes), Facebook eats up a lot of storage with its photo application alone. Members are adding 220 million new photos every week. Facebook currently has more than 1.5 petabytes of storage for its photos, and that is growing at a rate of 25 terabytes a week. Last year, Facebook spent an estimated $30 million on NetApp storage appliances alone just to keep up with the growth of photos and other uploaded content. To reduce some of these costs, Facebook decided to engineer its own storage architecture called Haystack.
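As a rough sanity check, the figures above can be run through some back-of-envelope arithmetic. This is only a sketch: it assumes decimal units (1 PB = 10^15 bytes) and that the 1.5 petabyte figure covers all 60 billion image files, neither of which is stated in the numbers themselves.

```python
# Figures quoted in the article; the unit conventions are assumptions.
PHOTOS = 15_000_000_000          # unique photos
FILES = 60_000_000_000           # image files, counting every stored size
STORAGE_BYTES = 1.5e15           # 1.5 PB, assuming decimal units
WEEKLY_GROWTH_BYTES = 25e12      # 25 TB per week, assuming decimal units
WEEKLY_NEW_PHOTOS = 220_000_000  # new photos per week

# How many files are kept for each photo.
files_per_photo = FILES / PHOTOS

# Average size of one stored file, in kilobytes.
avg_file_kb = STORAGE_BYTES / FILES / 1000

# Storage added per newly uploaded photo (all sizes together), in kilobytes.
growth_per_photo_kb = WEEKLY_GROWTH_BYTES / WEEKLY_NEW_PHOTOS / 1000

# The same growth spread across the files kept per photo.
growth_per_file_kb = growth_per_photo_kb / files_per_photo

print(f"files per photo:      {files_per_photo:.0f}")
print(f"average file size:    {avg_file_kb:.0f} KB")
print(f"growth per new photo: {growth_per_photo_kb:.0f} KB")
print(f"growth per new file:  {growth_per_file_kb:.0f} KB")
```

Under those assumptions the numbers roughly agree with each other: about four files per photo, and an average file size in the mid-20s of kilobytes whether it is derived from the total or from the weekly growth.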
Now more details have emerged about how that system actually looks and works. In a nutshell, Haystack will allow Facebook to switch from expensive, commercial storage appliances to commodity off-the-shelf hardware. It is going from a traditional network file system to something more akin to a stripped-down network application that does only what it needs to do. Not only will Facebook get the cost savings of going commodity, but it will also get a 3X improvement in storage capacity. In other words, what used to take 30 discs to store, now will take only 10.
With so many billions of images, serving the right one is like finding the proverbial needle in the haystack. With a traditional network file system, a lot of metadata goes flying around detailing when files were last modified, what directories they are listed in, and so on. All of this metadata creates a bottleneck as it is passed back and forth. So the three Facebook engineers who built Haystack (Doug Beaver, Peter Vajgel, and Jason Sobel) decided to get rid of much of this metadata. As they explain on Facebook’s engineering blog:
The new photo infrastructure merges the photo serving tier and storage tier into one physical tier. It implements a HTTP based photo server which stores photos in a generic object store called Haystack. The main requirement for the new tier was to eliminate any unnecessary metadata overhead for photo read operations, so that each read I/O operation was only reading actual photo data (instead of filesystem metadata)
All of that metadata is stored in what Facebook is calling a “needle.” Each needle pulls together the metadata for hundreds of thousands of images. The needles are paired with an index to make up the Haystack object store. You can read all the technical details on the Facebook engineering blog. The company will keep its existing network file system for the 15 billion photos already uploaded (after all, those NetApp boxes are sunk costs). But going forward, all new photo uploads will be handled by Haystack. And in the future, Facebook may even open-source the architecture so other companies can benefit from it. Not bad, for something that was built by three engineers.
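To make the general idea concrete, here is a toy sketch of an object store that pairs one large append-only file with an in-memory index, so that a read needs no filesystem metadata lookups. It is not Facebook's actual design or on-disk format; the class name `TinyHaystack`, its methods, and the key format are all hypothetical.

```python
import io


class TinyHaystack:
    """Toy object store: one append-only blob plus an in-memory index."""

    def __init__(self, store=None):
        # Any seekable binary file object works; BytesIO keeps the sketch
        # self-contained. A real store would use a large file on disk.
        self.store = store if store is not None else io.BytesIO()
        self.index = {}  # photo key -> (offset, size) within the store

    def put(self, key, data):
        """Append the photo bytes and remember where they landed."""
        self.store.seek(0, io.SEEK_END)
        offset = self.store.tell()
        self.store.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key):
        """Look up the location in memory, then do one seek and one read."""
        offset, size = self.index[key]
        self.store.seek(offset)
        return self.store.read(size)


# Hypothetical usage: two sizes of the same photo stored back to back.
haystack = TinyHaystack()
haystack.put("photo-1/thumb", b"small image bytes")
haystack.put("photo-1/large", b"much larger image bytes")
print(haystack.get("photo-1/thumb"))
```

The point of the sketch is the read path: because the offset and size come from an in-memory dictionary, the only I/O against the store is the read of the photo data itself, which is the property the engineers describe above.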
(Photo credit: Flickr/Vitor Antunes)

The real question is, when is Facebook going to try to monetize these photos?
here comes a monetization f***er.
Note to Techcrunch, your audience is not stupid so no need for dumbed down journalism: “3X improvement in storage capacity. In other words, what used to take 30 discs to store, now will take only 10.”
What if they are?
I don’t understand what you’re trying to say…
mmmmm….. aaaaaa….. what?
What if they used to use 172345 discs?
“So the two Facebook engineers who built Haystack (Doug Beaver, Peter Vajgel, and Jason Sobel)”
Um, that’s 3 names.
A proofreader wouldn’t hurt either… “Members are adding members add 220 million new photos every week.”
Back on topic, that’s pretty amazing that three engineers could churn that out. With Facebook’s booming popularity and, by extension, the amount of media its users will be uploading, they probably needed to do something to increase efficiency ASAP.
Facebook seriously needs to consider a premium subscription service like Flickr. I’d happily pay $9/month for an ad-free Facebook with high-resolution photo upload/downloads. They’d easily sign up ten million users.
For me, comments like these reveal the silliness of the first comment on this post, about how Facebook will monetize their photos. They have a million different options to make money. Fortunately, they’re smart enough to not let it inhibit their growth and integration with the web at large.
@ Rick: Pls comment here: http://oonwoye....k-i-am-serious/
I have been screaming since!
” In other words, what used to take 30 discs to store, now will take only 10.”
‘Disc’ is different from ‘disk’ and it’s the latter you’re referring to.
If they wanted to improve it by a factor of three, they could’ve just painted the system red.
There’s a privacy hole here I haven’t seen discussed. Facebook assigns random and static URLs to each photo, which *always* deliver the photo regardless of the owner’s privacy settings or requesting person.
Put another way, if I post a private photo, Facebook will deliver that photo to anyone on the internet who knows the URL.
Is this simply the industry standard for privacy controls? It seems to me quite weak: essentially security through obscurity.
I’d also be curious to know if this is a new function of the “Haystack” design.
@Darrell: yes, unfortunately “security through obscurity” is pretty prevalent
I don’t think it’s reasonable to be satisfied with security through obscurity.
I just spent a couple minutes looking through how Basecamp does this, and sure enough, they’re doing much better than Facebook. Using S3, they generate a random URL like FB, but also use a timeout that makes the image publicly available for only a couple minutes or so.
It looks to me that Facebook has taken a shortcut here with privacy settings, which is, um, kind of their whole shtick.
I’d be more than happy to learn I’ve got some crucial detail wrong here…
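For readers unfamiliar with the expiring-URL approach mentioned above, here is a minimal sketch of the general pattern: the server signs the path together with an expiry time, and refuses requests once that time has passed or if the signature does not match. This is not Basecamp's, S3's, or Facebook's actual scheme; the secret, the function names, and the URL layout are all assumptions made for illustration.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical server-side key, never sent to clients


def sign(path, expires_at, secret=SECRET):
    """Return an HMAC-SHA256 signature over the path and its expiry time."""
    message = f"{path}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def make_url(path, ttl_seconds=120, now=None, secret=SECRET):
    """Build a URL that is only valid for ttl_seconds from now."""
    now = int(time.time()) if now is None else int(now)
    expires_at = now + ttl_seconds
    return f"{path}?expires={expires_at}&sig={sign(path, expires_at, secret)}"


def is_valid(path, expires_at, sig, now=None, secret=SECRET):
    """Check that the link has not expired and the signature matches."""
    now = int(time.time()) if now is None else int(now)
    if now > expires_at:
        return False
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sig, sign(path, expires_at, secret))


# Hypothetical usage: a link to a private photo that works for two minutes.
print(make_url("/photos/abc123.jpg"))
```

With a scheme like this, a leaked URL stops working after the timeout, and changing the path or the expiry time invalidates the signature, which is a stronger guarantee than relying on a random but permanent URL.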
I think it would be of interest for the engineers at Facebook to explore using wavelet algorithms for digital image compression, archival & retrieval. The JPEG 2000 compression standard is wavelet-based.
Los Alamos scientists developed the wavelet-based system (compression, archival & retrieval) that the FBI’s current fingerprint image database uses.
One of the top experts/researchers in wavelets today is Prof. Dave Donoho. His group has developed an open source wavelet package called WaveLab (written in MATLAB), which is freely available for download from his research group’s website, so perhaps FB engineers could tap David and his group’s expertise, either by having them consult for Facebook or by funding the WaveLab team to develop a wavelet-based archival/retrieval system for Facebook. FB has millions of dollars of funding, and I am sure that the expertise the WaveLab team from Stanford can offer FB is worth paying for, because Facebook’s image repository will keep growing over time.
I found this story to be interesting, not only because of the statistical detail but because of the new questions that arise as a result.
How long is FB keeping all of this data? As long as the user doesn’t delete it? Heck, I have no plans to delete anything off my FB profile, yet at least 50% of it is worthless to me. I don’t delete it because I don’t think to delete it. So they risk becoming a data dump, and have to pay to store the data that has no value to anyone.
Ten years from now, will all this useless data still be sitting on spinning disks? Out in the ocean in one of Google’s wave-powered data centers?
Glad FB is investing in new technologies to keep up, but seems to me the approach is a lot like adding lanes to the freeway. At what point will we have to change the behavior, not just the infrastructure? I *know* FB doesn’t want to change the behavior quite yet, but storage isn’t free. This story is far from over. Will be interesting to watch.
a future president is going to have some damaging photos on facebook a few years from now that will surface and make it all worth while…
They can try at the eleventh hour to become more efficient but I still have a feeling they’re up poop’s creek.
Facebook could be worth 40 billion. Web 2.0 is not a bubble. http://iamned.com/blog/ people need to stop complaining about recession and keep spending.
The engineering blog entry is actually a good read.