Who knew statistical computing competitions could be so cutthroat? Since we reported on the contest last night, two teams in the Netflix Prize have spent the last few hours jumping back and forth on the Netflix leaderboard as the three-year-long competition ticked into its final moments, with last-minute sniping submissions coming from both sides. Finally, the results are in: The Ensemble has managed to come from behind to upset BellKor’s Pragmatic Chaos with a top submission of 10.10%, an improvement of 0.01 percentage points, only 4 minutes before the contest closed.
It’s been a long road to get here. Over the last three years, computer science teams around the world have been vying for the Netflix Prize, a competition that invited teams to try to improve on Netflix’s movie recommendation algorithm by 10%, with a reward of $1 million for the best submission. Teams got progressively closer to the magical 10% mark, but it wasn’t until last month, when a number of top teams joined forces to form BellKor’s Pragmatic Chaos, that the barrier was finally broken, with a score of 10.08%. However, their announcement kicked off a 30-day last-call period in which other teams were invited to make their final submissions.
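For readers wondering how a figure like 10.08% is computed, here is a minimal sketch. It assumes the contest metric is root-mean-square error (RMSE) between predicted and actual ratings, and that "improvement" is the relative reduction in RMSE against a baseline predictor. The function names and the toy ratings are made up for illustration and are not contest data.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def improvement_pct(baseline_rmse, candidate_rmse):
    """Relative RMSE reduction versus the baseline, in percent."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse


# Toy data (hypothetical): true ratings, a flat baseline, and a candidate model.
actual = [4, 3, 5, 2, 4]
baseline_predictions = [3.5, 3.5, 3.5, 3.5, 3.5]
candidate_predictions = [3.9, 3.2, 4.4, 2.6, 3.8]

baseline_score = rmse(baseline_predictions, actual)
candidate_score = rmse(candidate_predictions, actual)
print(f"improvement: {improvement_pct(baseline_score, candidate_score):.2f}%")
```

Under this definition, a team crosses the 10% mark when its RMSE is at most 90% of the baseline's RMSE on the same ratings.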
Last night, a team called The Ensemble managed to one-up team BellKor with a score of 10.09%, less than 24 hours before the close of the competition. Things were looking bad for team BellKor (some suggested that they might be out trying to drown their sorrows). But the team was clearly still hard at work: it managed to tie The Ensemble with a score of 10.09% with only around 24 minutes remaining in the competition. But they were to be foiled once more: with only four minutes left, The Ensemble struck back with a score of 10.10% to regain the top spot on the Netflix leaderboard. Soon thereafter a notice went up on the Netflix homepage stating that there were to be no more submissions.
This is all very exciting, but there’s a reason that Netflix has not yet announced the winner, even though the leaderboard is quite clear. That’s because the Netflix Prize is actually based on two sets of data. The first is called the Quiz set, which is used to show publicly how a team is faring and to determine when the contest would kick into its 30-day last-call mode. But the set that really matters, the Test set, is still a secret, and nobody knows whether The Ensemble or team BellKor performs better on it. Netflix will make the final contest announcement in the next few weeks.
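To see why the public leaderboard might not settle the matter, here is a small sketch of two hypothetical teams scored on two separate held-out subsets. It assumes both subsets are scored with RMSE (lower is better). "Team A", "Team B" and all ratings are invented for illustration and say nothing about the real submissions.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def winner(scores):
    """Return the team with the lowest RMSE."""
    return min(scores, key=scores.get)


# Hypothetical true ratings for the public (quiz) and hidden (test) subsets.
quiz_actual = [4, 3, 5, 2]
test_actual = [1, 5, 3, 4]

# Hypothetical predictions from two teams on each subset.
predictions = {
    "Team A": {"quiz": [4.0, 3.0, 4.5, 2.5], "test": [2.0, 4.0, 3.0, 4.0]},
    "Team B": {"quiz": [3.5, 3.5, 4.5, 2.5], "test": [1.5, 4.5, 3.0, 4.0]},
}

quiz_scores = {team: rmse(p["quiz"], quiz_actual) for team, p in predictions.items()}
test_scores = {team: rmse(p["test"], test_actual) for team, p in predictions.items()}

print("quiz leader:", winner(quiz_scores))
print("test leader:", winner(test_scores))
```

In this toy setup the team that leads on the public subset is not the one that leads on the hidden subset, which is why a narrow leaderboard margin need not decide the prize.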
Update: Yehuda Koren of BellKor’s Pragmatic Chaos has posted on the contest’s forums that BellKor came out with the lowest Test score, though it appears that Netflix has yet to confirm this.
Thanks to Almir Karic for the tip.

Man, I’ll feel so bad for “BellKor’s Pragmatic Chaos” if they lose the $1m
WHY???? Be happy for The Ensemble for KICKING THEIR ASS!
Intense stuff. I feel sorry for the BellKor team, they held the lead for so long only to be denied it here. Hopefully they perform better on the Test set.
I wish more companies behaved like Netflix, I think it’s really cool that they gave a substantial prize to anyone who could improve the algorithm.
It’s sad that this contest didn’t result in anything really useful. Countless hours of work by very smart teams only led to a 10% incremental improvement. No breakthroughs in figuring out what someone will like based on what they’ve liked in the past. Little new learning that can be applied outside movie recommendations.
Hi Dan,
That’s not correct. I’m not sure if you’ve read the papers that came out of it but the advances in my view have been very significant.
The SVD-based models have shown surprising accuracy even as their complexity has grown, and the blending of thousands of predictions has, surprisingly, been responsible for the ten percent improvement.
The KNN article by BellKor is already a classic paper in my view.
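As a rough illustration of what "blending" predictions means, here is a minimal sketch that combines two predictors with a single least-squares weight. This is an assumed, simplified setup and not the teams' actual method: the predictors, ratings and function names are all hypothetical, and the weight is fitted and scored on the same toy data, whereas a real blend would be fitted on held-out ratings.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def best_blend_weight(p1, p2, actual):
    """Weight w minimizing squared error of w*p1 + (1 - w)*p2 (closed form)."""
    diffs = [a - b for a, b in zip(p1, p2)]
    denominator = sum(d * d for d in diffs)
    if denominator == 0.0:
        return 0.5  # the predictors are identical, so any weight works
    numerator = sum((y - b) * d for y, b, d in zip(actual, p2, diffs))
    return numerator / denominator


def blend(p1, p2, w):
    """Weighted combination of two prediction lists."""
    return [w * a + (1.0 - w) * b for a, b in zip(p1, p2)]


# Toy data (hypothetical): one cautious predictor and one that overshoots.
actual = [4, 3, 5, 2, 4, 1]
model_one = [3.5, 3.5, 4.0, 3.0, 3.5, 2.0]
model_two = [4.5, 2.0, 5.5, 1.5, 4.5, 1.5]

w = best_blend_weight(model_one, model_two, actual)
blended = blend(model_one, model_two, w)
rmse_one, rmse_two = rmse(model_one, actual), rmse(model_two, actual)
rmse_blend = rmse(blended, actual)
print(f"w={w:.3f}  model one={rmse_one:.3f}  model two={rmse_two:.3f}  blend={rmse_blend:.3f}")
```

Because w = 0 and w = 1 reproduce the individual predictors, the fitted blend can never score worse than either of them on the data it was fitted on. Gains come from the predictors making different kinds of errors.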
Yes, Dan Grossman is clearly a moron who hasn’t taken the time to look at the actual results that have been inspired by the prize.
Dan, do you have any idea how hard it is to improve upon any benchmark in computer science by 10%? That’s not an incremental improvement. Go back to developing your stupid web apps and leave the real CS work to people who know better.
A 10% improvement in this algorithm is the equivalent of tripling gas mileage. It’s fucking huge.
That’s a substantial overstatement, but it is an impressive increase. Perhaps a company should try this type of competition for gas mileage. In my view, that is a more important objective.
Dan said…
Little new learning that can be applied outside movie recommendations.
Dan, the algorithm is data ignorant. It doesn’t care where the data comes from, be it a movie database, music ratings, user buying history, word-by-document counts, images, etc. As long as the data is available in numeric form, the algorithm can work pretty well. I’ll give an example: the SVD (singular value decomposition) has been applied in applications as diverse as text search engines, movie recommendation systems, image similarity retrieval and more. The SVD doesn’t know where the data comes from, as long as the data has been put into a numerical matrix. All it knows is that it has a numerical matrix to crunch so as to find/extract useful features, regardless of whether the matrix was built from documents, images, historical video rental data, etc.
In fact, Netflix’s algorithm can be applied to a wide variety of data analytics, even stock recommendation.
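The point about the SVD not caring where its matrix came from can be sketched in a few lines. The code below is an illustrative stand-in and not a full SVD: it uses power iteration to find only the top singular value and its pair of singular vectors, and it assumes a non-empty matrix with non-negative entries so that the all-ones starting vector is safe. Both input matrices are invented toy data.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its vectors by power iteration."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # unit-length starting vector
    for _ in range(iterations):
        # One step of power iteration on A^T A: v <- normalize(A^T (A v)).
        av = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        atav = [sum(matrix[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        if norm == 0.0:
            break  # zero matrix: nothing to extract
        v = [x / norm for x in atav]
    av = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma else av
    return sigma, u, v


def rank_one_approximation(matrix):
    """Best rank-1 summary of the matrix: sigma * u * v^T."""
    sigma, u, v = top_singular_triplet(matrix)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Two toy matrices from different "domains"; the function only sees numbers.
user_by_movie_ratings = [[5, 4, 1], [4, 5, 1], [1, 1, 5]]
term_by_document_counts = [[2, 0, 1], [0, 3, 1], [1, 1, 0]]

for name, data in [("ratings", user_by_movie_ratings), ("text", term_by_document_counts)]:
    sigma, u, v = top_singular_triplet(data)
    print(name, "top singular value:", round(sigma, 3))
```

The same function runs unchanged on the ratings matrix and on the term-by-document matrix, which is the sense in which the technique is data agnostic. In practice one would use a library SVD routine instead of this hand-rolled iteration.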
Why is it sad? It’s business doing what it does best, making improvements in the areas it knows, with a profit objective (they are not non-profit). A 10% improvement is significant and the better Netflix is at predicting what people will like, the stickier they will become because even though Hulu and others might get the content, they won’t have the recommendations based on people telling the algorithm what they like. Netflix will have paid $1M for a competitive advantage worth much much more.
Most of these teams did it for fun in their spare time. They did not plan on using this money to live off of…. I still use Blockbuster over Netflix because they have video games, and sometimes when I can’t find a game to entertain me I opt for a movie. I would not be surprised if Netflix buys GameFly in the future.
Accuracy is only one metric, and a 10% improvement is most likely going to have only a minimal effect on the user experience, if any at all. In fact, accuracy can be detrimental to the user experience. For instance, if someone rates a Chuck Norris movie highly, he/she is more likely to rate another Chuck Norris movie highly. So you can imagine that a movie recommendation system might just keep spitting out Chuck Norris movies. This is “accurate,” but it’s not allowing for discovery/serendipity.
I think you misunderstand what is going on here. Why would n=n recommendations need – or even be subject to – improvements?
And if that algorithm led to less recommendations being chosen for rental, it would not be a 10% improvement over the current one.
We’ve been doing a whole series on the assumptions of the Netflix Prize and whether it will actually lead to a better user experience. See starting here: http://blog.med...ound.com/?p=419
Cool! And just to clarify, I’m in no way saying that a 10% improvement isn’t a remarkable achievement. It clearly is.
My god .. the ignorance of TechCrunch readers is astounding. There’s no evidence to suggest the recommendation algorithms suppress serendipity. If you have such evidence, let’s hear it.. until then, STFU. And, the algorithm won’t lead to less recommendations being chosen for rental, you dipshit, because future recommendations being chosen is precisely the evaluation criterion used. Try reading the contest rules before posting your ignorant replies please.
Try reading what you’re replying to before posting your ignorant replies please. I was countering that very argument.
How did you fare in the contest, Dan? Did you place? Did you do better than “incremental”?
I believe Netflix probably knows a shitload about recommender systems and whether or not 10% is substantial or worth a million dollars to them.
Dan, I think that you ought to read about the subject before making an uninformed comment. Start with the following:
Incremental Singular Value Decomposition (SVD) Algorithms for Highly Scalable Recommender Systems
The authors of the paper have also developed Java-based recommender system software based on their publication, which you can download and play with.
Learn the subject first and then comment.
@Falafulu Fisi, if i may say, You are a real dick.
Thank you for listening
Some of the teams (e.g. Pragmatic Theory, Dace) did it for fun but some were doing it as part of their jobs in private research (e.g. BellKor) or at a university (Gravity) or to promote consulting practices (Big Chaos).
The 10% improvement is more important than it appears because the theoretical limit on this data set was probably not much more than that.
We do not know if there has been a breakthrough because we do not know what the teams did to finally win the contest.
this is really cool.
An update on the winner:
“BellKor’s Pragmatic Chaos” is currently the top contender for winning the competition.
Even though the leaderboard shows “The Ensemble” in first place, that is only on the Quiz set, which approximates the important one: the Test set. On the Test set, however, BellKor’s won.
ha, told you things weren’t done yet.
From a marketing perspective, it would be interesting to see how much free advertising this contest has yielded. I would guess the amount of exposure (Wired, TechCrunch, etc.), and positive exposure at that, would far surpass the $1 million that was spent on improving the current system. Heck, they may have even had an insurance policy on the $1 million. I’m sure others have written on this; anyone know of a good write-up?
The real winner here seems to be Netflix.
The numbers reported here (10.10% vs 10.09%) are means. Even for the Quiz set, are they significantly different (in a statistical sense) from one another?
No matter who wins, this is a great contest! It’s amazing that it took so long for the 10% improvement.
And they say comp sci is boring. Just see how interesting it can be. And I am sure we’ll see a Numb3rs episode based on this very soon.
Hats off to Netflix for a brilliant contest idea that was well worth the prize money in positive press coverage and marketing — not to mention the direct benefits to their system.
The plot thickens…a post on the Netflix forums by a member of BellKor claims that the final numbers are actually showing them in the lead:
http://www.netf...?pid=9237#p9237
BellKor is saying they won based on the test dataset.
So, at 9:49 PDT is it still up in the air?
I wonder if anyone chose to film the competition for a documentary. Seems like it could have been the web 2.0 version of “King of Kong.”
Very cool contest, and I love the discussion around how significant 10% is. Too bad more companies, and the media for that matter, can’t make CS competitions like this more prominent in our society. At the end of the day what is more important to our advancement, a competition like this, or highlights from a baseball game?
A lot can happen in a week. One week prior to close, the NY Times covered this (see http://it.toolb...t-locally-33093) and at that time it was an AT&T group that was winning.
Since the Test set was a lottery draw wherein any team could have come out on top, isn’t it a little awkward for BPC to justify why their results would be better, since the other team (The Ensemble) is visibly better than them on the leaderboard?
Arvind,
Please note that the Test set is not a lottery draw, but the ultimate goal of this competition. Competitors put much thought into how to optimize performance on that Test set. This is unlike the Quiz set, or leaderboard, which should have been taken just as a proxy for the Test set. For more explanation, please refer to the competition FAQ.
Congratulations to both teams for exceeding the 10% improvement mark, but I can’t help but wonder why the improvement was so close to the bare minimum required to succeed. Was there really ONLY 10% improvement left in the algorithm? What if they needed 15% to get the prize?
That is like asking ‘Why are my keys always in the last place I look for them?’ The answer is they are always in the last place you look because once you find them you quit looking!
The improvement was close to 10% because that was the threshold that needed to be reached to end the contest.