Amazon released its previously announced Public Data Sets web service this evening. The project encourages developers, researchers, universities and businesses to upload large (non-confidential) data sets to Amazon – things like census data, genomes, etc. – and then let others integrate that data into their own AWS applications.
Amazon says that large data sets like the Human Genome or U.S. Census data previously required “many hours to locate, download and customize,” but that developers can now access and start computing on this data within minutes. The data is hosted for free.
Data sets available today include an Annotated Human Genome, a public database of chemical structures, various census data and labor statistics.
Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.
Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.
There are other for-profit projects that are trying to help people tap into large public data sets. San Francisco-based Swivel is one that launched in late 2006.
Perhaps someone can now upload all those now-public iFund applications to Amazon.

I’m surprised Google doesn’t have something like this… Makes data gathering much easier…
very good
My fiance works in a genetics lab. They are trying to maintain a public database on their own. This service would be perfect for them.
Whatever happened to Freebase? I thought they were going to lock this market up, but it seems to have stalled.
Cool idea indeed. But in many cases of interesting data (biological, demographic, macroeconomic), the hard part isn’t making the data available but making it accessible, e.g. in a format that analysts can easily use.
NIH has been maintaining GenBank, the world’s largest database of DNA sequences, since 1982. An entire ecosystem of bioinformatics software tools has been built around this, and I’m hopeful the same will soon occur in other data spaces.
I’ve been thinking of making at least some of the vehicle reliability data collected by TrueDelta available to anyone willing to crunch through it and share the results. This might be a way to do it.
Michael, I’m at Tableau Software. We do data visualization software and can simplify large and complex data sets, and we might be interested in analysis of your data. Drop me a note at efields at tableausoftware dot com if you want to follow up.
Chances are there are people who work with these data sets but have not been using Amazon’s compute cloud because they felt it would be too tedious (or expensive) to have to upload and pay for storage of gigabytes of data (some of which is frequently updated). So this seems like a smart move!
“and then let others integrated that data into their own AWS applications”
Can you spell?
Google has GoogleBase which is more generic than this, I believe.
For the last two months I have been searching everywhere for public access logs, such as the ones produced by Proxy Servers. But I have not been able to find any. I understand that access logs include lots of private information, but it is possible to anonymize them for the purpose of research. If you know of anywhere or anybody that might be able to help me get access to something like this, please leave a note. It will help me a lot.
many thanks
A PhD Student from Australia