Clusters are the way to go. Google and Yahoo run their websites on distributed databases spread across vast clusters of servers. Now Aster Data Systems, a startup that is coming out of stealth mode today, is offering a clustered database for Web analytics to any large website. One of its first big customers is MySpace, which is running the database on a cluster of 100 server nodes to analyze what songs and videos are going viral, what features are becoming popular, and what content is being consumed on its service. That comes to more than one terabyte of new data every day that needs to be analyzed. CEO and co-founder Mayank Bawa explains:
Google and Yahoo had to build this infrastructure for themselves. Others don’t have this. So we will give them a very scalable database, and keep costs low by running it on commodity hardware.
Bawa and his co-founders were Ph.D students at Stanford when they founded the company in July, 2005. They raised an angel round of about $1 million in November, 2005 from Stanford computer science professor David Cheriton, Josh Kopelman at First Round Capital, Anand Rajaraman (founder of Junglee and Kosmix), and uber-angel Ron Conway. Cheriton was also one of the early angel investors in Google. Another Google investor, Sequoia Capital, took the entire A round in May, 2007. The company is not disclosing the size of that round, but it is believed to be around $5 million.
Kopelman explains why he invested:
AsterData give companies deep insights on massive data by transforming off-the-shelf, commodity hardware into a powerful, self-managing, and scalable analytic database. Data analysis that previously took days to run (or were impossible to run) now routinely finish in minutes/hours. They already have paying customers today and Aster is in production managing billions of events per day.
It’s a data driven world, and the old-style databases just can’t keep up.

It’s both a data driven world and a data driven web… re MySpace, its international access has been becoming problematic lately… sometimes I can’t even log in, or after login, some functions render blank pages, or the browser says this or that was not found… not something great (lately)…
Feels good to be back on TechCrunch!
Good for them on scoring MySpace as a customer. And great example of how it will be used to spot viral trends.
However, I’m not clear on what makes them different. Is it their analytics that make them different? Because clusters, grids and distributed computing aren’t new and there are plenty of companies offering these solutions such as DataSynapse, Digipede and others. Kopelman’s description, “Data analysis that previously took days to run (or were impossible to run) now routinely finish in minutes/hours.” That’s exactly what clusters, grid and distributed computing do.
It would have been nice to hear more about the analytics part and how they are going about getting the “deep insight” on the data. There’s nothing on their site either. I would like to know more about how their Web analytics differs from other offerings or if it’s just that they’ve combined the two (cluster database with Web analytics) as one package.
It’s interesting that these guys don’t actually help you manage or scale data. They just analyze it! How useful that is is up for discussion and depends on the business. You will also need to throw in some resources for experts who will help you analyze it all.
Data analytics? Good ol’ PVM-MPI works just fine for me.
Yes, you insensitive clods, I’m that old... you MySpace generation might not have heard about these parallel processing libs.
Now, if Twitter applied this to brand and product monitoring, and provided a public / paid API to access mentions and metrics, the issue of ‘monetization’ would be done with.
“There is more money to be made in mining Myspace than all the Google search ads for all time.” (me)
Hi Alan Wilensky,
“There is more money to be made in mining Myspace than all the Google search ads for all time.”
How? Can you please explain that in more detail?
These guys, while they have a nice looking offering and clearly some strong technical chops, have an uphill road ahead of them. Ignoring the pure grid players that another commenter mentioned, such as DataSynapse, Platform Computing, etc., there’s already a slew of folks shilling highly scalable data warehousing platforms.
The two or three “newcomers” to that space are DATAllegro, Netezza, and Greenplum - DATAllegro has been around a long time already, and Netezza is a publicly traded company doing $125m/year in revenue. Greenplum, meanwhile, is a data warehousing version of PostgreSQL and has been around since ’02 or ’03.
These guys compete with the “old school” data warehouse offerings from Oracle, IBM, Sybase, and Teradata, all of whom have felt the competition from the newer players on large, lucrative data warehousing deals at large enterprises (you should have heard the screams at Oracle when Amazon moved a DW from Oracle to Netezza), and have responded by building targeted data warehouse offerings to compete.
My point is, while there’s always an option to rise up and seize market share, this is not only a traditional space (the data warehouse has been around forever) with many large legacy products, it has also already gone at least partway through the “innovation from the outside” phase with the notion of data warehouse appliances, or dedicated data warehouse engines, and there’s a lot of competition as a result. It can be hard to do that again, no matter how nice the technology.
@8, The data warehouse has been around “forever”, but is normally a horizontal technology offering (yes, there are lots of product offerings in the market). These guys look like they are offering a more vertically targeted, out-of-the-box experience. Analyzing different data sets with different goals, better presentation, and simpler interface could always be of interest to the right customers. You are being very short-sighted.
@9, I couldn’t disagree more. While there are general purpose databases that get adapted for data warehousing (Oracle, et al.), even those platforms have specific technologies internally that are built for the kinds of queries that have traditionally data-intensive roots. Even among the legacy database players, you have Teradata, which is expensive, a little overblown, and very much a declining offering in my opinion, but which has proven more than up to the challenge of meeting any customer’s needs from a *functional* perspective.
Everything new about Aster, Netezza, DATAllegro, Greenplum, etc. is about cost and speed - because time is literally money for a lot of businesses. But if we look at the list of targeted verticals in the Aster solutions section, we have basically communications, financial services/insurance, and online/web 2.0. If we look at the solutions for each of those verticals, the answer is always speed, scale, and ease of administration.
Now if we go to Netezza’s site, look at their offerings, we see the same categories basically, and if we look at their customer list, we see e-tailers, advertising, content, financial services, telecom, etc.
Believe me, there’s nothing vertically targeted about this aside from the marketing speak. They’re saying, “We’ve built a data warehouse offering that is faster and cheaper than the competition”.
Also, there’s no “presentation” here - it simply spits out SQL. It’s still up to your traditional reporting engine to present the data. They’re simply saying they’ll produce data sets to be reported on by some other BI tool.
I don’t mean to sound like there’s no value here - perhaps they really have built something that is significantly faster or significantly cheaper or significantly easier to manage than the existing offerings. But I think the idea that this concept is somehow something new is the real short-sighted opinion.
Matthew, It seems you have some knowledge in this area, which is good (thanks for taking the time). I think the news was as much about the VCs that invested (being used as an indicator of possible success), as it was about the product offering.
Because of the early stage nature of the company, in an obviously competitive market (as you’ve indicated), I doubt they are showing all their “cards”. My understanding is that they are working on great data visualization tools to provide actionable results, once the data has been analyzed. If what you are saying is true, and they wish to compete by being the price “leader” alone, then you may have a point.
By your definition, nothing is “new”, and everyone should just stop trying to innovate and disrupt. However, I never once saw anything in this article claiming that there was anything “new” here (the words: new or innovative never appear once). Nice straw man.
Sounds great, but am I missing something on the analytics side? How is this data being captured? It is great to have an endlessly scalable cloud raining terabytes, but it is only as good as the algorithm collecting it.
Is MySpace feeding this through other data capture tools? I can’t figure it out from the Aster literature.
@11, thanks for the feedback, yes I do have a fairly deep background in the area (though I can freely admit I’ve never run one of these massive data warehouses), and I doubt not at all that they will have new features and that this is the first step they’re taking on a long journey.
You seem to have an advantage over me in terms of insider information, so certainly I can’t disagree with your clearly considered statements. But I do object that my argument was a straw man on a few levels. One is that this is a blog that’s supposed to be about new Internet companies:
“TechCrunch, founded on June 11, 2005, is a weblog dedicated to obsessively profiling and reviewing new Internet products and companies.”
So by definition, I think any new company that gets profiled on here gets consideration as to whether what they’re offering is truly new or innovative. Even more to the point, in the article itself:
“Clusters are the way to go. ”
and yet better:
“It’s a data driven world, and the old-style databases just can’t keep up.”
Which makes a clear statement about the author’s feelings that this is indeed something new and innovative or truly useful. My overriding objection is that none of the challenges that we’ve discussed in these comments come up in the blog post at all, giving the article an impression even more so that this is something truly new and innovative. I would rewrite this as, “Google Backers Back Aster Data Systems: Yet another commodity clustered data warehouse, stiff competition from legacy and emerging players, smart people, good VCs, looking forward to something game changing”. Would you agree that that’s a little more balanced?
Your last point could not be more incorrect. Suffice it to say I’m a firm believer in the idea of new technologies to disrupt and innovate. I am a huge fan of all of the topics that come into play here, and their abilities to transform the way people do business.
But when I can run down a feature list, and it’s the same as the other players in the space, and when I can read a whitepaper and basically substitute out their marketing lingo for the other companies, and when the value prop for the product is identical to the competition’s…..it’s kind of hard to argue that this is innovative on the face of it. Might there be new and impressive features coming out down the road that are truly new, different, unique? I’m sure there are. But they become innovative *then*, not now.
@Matthew and Frank - Based on my experience (and I do have a little), I have to agree with Matthew.
First, “It’s a data driven world, and the old-style databases just can’t keep up.” What is an old-style database? Do you mean a relational database? Codd came up with the concept in the 70s but RDBMSs are by no means antiquated. They, like every piece of technology, have their place and their time.
Second, terabytes? Come on, there are only a few dozen sites that produce that much data. Can you say in-house development?
Because there are so few companies with a need for this product, they better have a damned good sales team. Besides the companies Matthew mentioned, you have to look at Microsoft, Sun (MySQL), IBM, etc. They may be generic, “horizontal” solutions but that also makes them extremely adaptable to a given business model or vertical. Oh, and they have tons of cash and huge sales teams.
I could go on and on, but Matthew has touched on the most important points re: market dynamics and existing competitors.
PS - lose the lame “massively parallel processing” and “queens/workers/loaders” marketing-speak.
What sets Aster apart from the Netezzas and Teradatas is a couple of things. They are hardware independent, so you can scale them out on any hardware, allowing rapid growth in both the size of the data store and the computing power. Isn’t that what grid computing is all about anyway? So from the hardware perspective this is a differentiator.
The article didn’t mention it, but they have also written their own analytics that target areas not normally targeted by “legacy” data warehouses. This allows for some unique and powerful analysis of the data.
As one person put it, Netezza and Teradata are “just a faster Oracle”. And it’s true, they are really just like Oracle, only vastly faster.
Aster goes beyond that, giving true scalability and advanced analytics.
@Sean: I caught on to the hardware independent piece, but I suspect the devil is in the details. For example, in their whitepaper they talk about how you just put new nodes that are PXE-enabled, hook them up, and turn them on, and something, presumably a manager or coordinator node (or maybe a “queen” node) does the OS prov, configures it to be the correct role for its task, and then starts handing it workloads. This scenario assumes, though, that the OS that’s being provisioned is supported on that target system. There’s a whole host of potential issues with this - slightly different driver revisions required on different hardware, differences in physical layout, interface ordering, OS security patches, etc.
Netezza and DATAllegro are the best comparisons here: they also use commodity hardware, but they use their own hardware so that they can control the OS configuration, drivers, and architecture, and so that even the disk layout can be defined and optimized for what works best for them. With Netezza, you buy by the TB, and you can just keep stacking on new “bricks” (not their term, but it escapes me), and it’ll start using those machines for data. Exactly the same architecture, just with a little Netezza logo on the boxes instead of Dell/HP/IBM, etc.
Greenplum takes the opposite approach, where you can use any hardware, they’ll make recommendations as far as configuration, but you’re responsible for managing the OS and Greenplum will just use whatever storage is presented to it.
As far as advanced analytics - again, I can’t vouch for anything beyond what’s on their site, but I don’t see anything on there that looks different than what the other guys offer - smart partitioning, intelligent layout, etc. Typically, in DW environments, the analysis itself is provided by the people who are asking the questions, while the DW itself provides tools to make those questions easier to write. So, perhaps the SQL language provides some additional verbs or query types to make some of these operations easier, but I couldn’t find anything on their site to indicate that.
@Taylor - thanks for the atta-boy btw, but there are a lot of companies that do hardcore analytics in the terabyte range. I have a customer who asked my advice a few months back on a 20TB data warehouse that was currently running on a giant Sun server with three EMC Symmetrix arrays hooked up on the back end to handle the I/O. I mentioned Netezza, DATAllegro, etc. - this was just a random company that you would never picture as having anything that big.
Even more interesting, there’s a lot of companies that don’t have the horsepower to keep terabytes of data around, so they collapse down into summaries and aggregate historical data to keep the data set smaller. Doing so loses information, obviously, and reduces the amount of fine-grained analysis.
So, one of the big pitches the appliance vendors are using is going to shops that would like to have multi-terabyte data warehouses but need something faster and cheaper, and showing them how they can get their multi-terabyte DW and better analytics at a price point comparable to their current summary DW.
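To make the point about summaries losing information concrete, here is a minimal Python sketch. The event log, the column names, and both helper functions are made up for illustration and have nothing to do with any vendor's product. It shows that a per-day play-count summary can still answer "how many plays?", while a question like "how many distinct listeners?" needs the raw rows.

```python
from collections import Counter, defaultdict

# Hypothetical raw event log: (day, user, song). All values are invented.
events = [
    ("mon", "ann", "song_a"),
    ("mon", "bob", "song_a"),
    ("mon", "ann", "song_b"),
    ("tue", "ann", "song_a"),
    ("tue", "cat", "song_a"),
    ("tue", "cat", "song_a"),
]

def daily_play_counts(rows):
    """Collapse raw events into a per-day, per-song summary table."""
    summary = Counter()
    for day, _user, song in rows:
        summary[(day, song)] += 1
    return dict(summary)

def distinct_listeners(rows):
    """Count unique users per song; this needs the raw rows."""
    users = defaultdict(set)
    for _day, user, song in rows:
        users[song].add(user)
    return {song: len(names) for song, names in users.items()}

summary = daily_play_counts(events)
listeners = distinct_listeners(events)

# The summary keeps play totals but has dropped the user column,
# so distinct listeners can no longer be derived from it.
total_plays_song_a = sum(n for (_day, song), n in summary.items() if song == "song_a")
```

Once only `summary` is kept, `distinct_listeners` cannot be recomputed, which is the kind of fine-grained analysis that gets lost.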
It’s great to see so much dialogue forming since our first announcement to the market. We realize there are a lot of vendors in this market (both traditional and new) and sometimes what is different among them is not clear until you peel away the marketing messages. I just wanted to emphasize the focus Aster has on scalability – both from a performance (linear scalability) and a management (one-click scale-up of the cluster) perspective.
As far as I know, no other company today properly addresses the biggest bottleneck that MPP systems face: the interconnect between the nodes. Certain traditional DB vendors “solve” this problem by using their own proprietary (and expensive!) interconnect, but we solve it using software algorithms across our entire architecture stack. This gives us a significant advantage for all queries that cannot be trivially parallelized, which are usually the ones that bring such systems to their knees. We’ll discuss these principles and our technology in more detail on our blog (http://www.asterdata.com/blog) – or feel free to email me if you’d like to learn more (tasso at asterdata.com)!
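For readers wondering why the interconnect matters for queries that cannot be trivially parallelized, the following Python sketch may help. It is a toy model and not Aster's algorithm: the table, the `partition` and `rows_to_shuffle` helpers, and the use of a simple modulo in place of a real hash function are all assumptions made for illustration. Rows are spread across nodes by one column; grouping by that column needs no data movement, while grouping by a different column forces rows across the network.

```python
def partition(rows, key, n_nodes):
    """Place each row on a node by taking one integer column modulo the
    node count (a stand-in for a real hash-partitioning function)."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[row[key] % n_nodes].append(row)
    return nodes

def rows_to_shuffle(nodes, key):
    """Count rows that would have to cross the interconnect so that all
    rows sharing the same `key` value end up on the same node."""
    n_nodes = len(nodes)
    moved = 0
    for node_id, rows in enumerate(nodes):
        for row in rows:
            if row[key] % n_nodes != node_id:
                moved += 1
    return moved

# Hypothetical table: 1,000 rows with integer user and song ids.
rows = [{"user": i % 50, "song": i % 7} for i in range(1000)]
nodes = partition(rows, "user", n_nodes=4)

local = rows_to_shuffle(nodes, "user")   # group by the partition key
remote = rows_to_shuffle(nodes, "song")  # group by a different column
```

In this model `local` is zero because every node already holds all rows for its users, while `remote` is not, and that network traffic is what grows as data and node counts grow.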
Interesting that they give so much play to running on generic x86. Not sure that customers care and not sure that x86 will have the cost advantage going forward, especially considering that the chassis is the biggest electricity waster and the inevitability of cost efficient scale up systems. Seems a little Web 2.0 groupieish to be religious about x86. Kind of like insisting on scaling RoR for RoR’s sake.
So: “Bawa and his co-founders were Ph.D students at Stanford when they founded the company in July, 2005”. I wonder if they’ve already looked into Stanford’s lead in using PS3s to build a low-cost, highly powered (i.e. record-breakingly so) supercomputer.
Context: I’m not a techie, I just saw the connection.
From a business perspective this could be a real breakthrough. The analytics piece will be very valuable but the ability to load data fast and not be tied to very expensive hardware is very exciting. Having MySpace gives them a lot of credibility in my mind - my business partners really need to get the TCO for data driven down and I don’t see the traditional players in this space with anything new to offer. Good for them and hope this works out.
Welcome to the party! The idea of running high performance analytical database software on commodity hardware was introduced by Kognitio Ltd, in 2003, two years before Asterdata was founded. Kognitio is on its 6th release of the solution and has 20 customers and they have patented algorithms that address the interconnects between the nodes. AsterData’s announcement validates the “MPP software on commodity hardware” approach - I wish them good luck.
PS how come all Bay area start-ups are like Google, approved by Google or have someone on board that knows someone at Google?
To follow up on what Charles, Matthew and some others have said here…Kognitio has been present for a number of years in this area that Aster is just now entering. If you take Kopelman’s quote:
“AsterData give companies deep insights on massive data by transforming off-the-shelf, commodity hardware into a powerful, self-managing, and scalable analytic database. Data analysis that previously took days to run (or were impossible to run) now routinely finish in minutes/hours. They already have paying customers today and Aster is in production managing billions of events per day.”
…and substitute “Kognitio,” you’ll still fall short; Kognitio’s WX2 is designed to produce sub-second responses. Like Aster, Kognitio runs analytic databases on commodity hardware; unlike Aster, it also licenses the software and additionally offers it as a DaaS (Data warehousing as a Service) option.
Sorry if this comes off sounding too much like a marketing pitch, but there are several key points that needed to be addressed.