Update: Digg Recommendation Engine Confirmed For This Week
by Michael Arrington on June 30, 2008

Digg has released some materials about its new Recommendation Engine, which we wrote about last night, and says it will launch this week. Two overview videos are below, including an interview with Digg Lead Scientist Anton Kast. We’ve also included the text of a white paper on the Recommendation Engine.


Digg Recommendation Engine from Kevin Rose on Vimeo.


Anton Talks About The Digg Recommendation Engine from Kevin Rose on Vimeo.


The Digg Recommendation Engine
People love Digg because it’s a place to discover and share great content from around
the Web. The Digg homepage always has the most popular stories, but many Digg
users find their content in the Upcoming section, which gets over 15,000 new stories a
day. To help users filter this enormous amount of content, we have created a new
feature: The Digg Recommendation Engine.

When you Digg a story, you tell the Recommendation Engine two things: that you
recommend the story to other users and, less obviously, that the users who Dugg the
story before you are good at finding content. The Recommendation Engine keeps track
of users who Dugg particular stories before you did, and it recommends you the stories
they Dugg. The more content you Digg, the smarter the Recommendation Engine
becomes.

Finding Diggers Like You
The Digg Recommendation Engine uses your Digg history over the last thirty days to
make Recommendations. (You can see the number of items you have Dugg over the
last month on the right-hand side of the Recommended view.) Every time you Digg a
story, the Engine matches you with other Diggers who Dugg the same story, and keeps
track of all your Diggs in common with them.
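
As a rough illustration of the matching step, the sketch below builds a pool of matched Diggers from a log of Diggs. The log format, the function name `matched_diggers`, and the `window_days` parameter are assumptions made for this example, not details Digg has published. It also ignores who Dugg a story first.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def matched_diggers(diggs, user, now, window_days=30):
    """Return {other_user: set of stories Dugg in common with `user`}.

    `diggs` is an iterable of (user, story, timestamp) tuples
    (a hypothetical log format). Only Diggs inside the window count.
    """
    since = now - timedelta(days=window_days)

    # Index recent Diggs by story so co-Diggers are easy to look up.
    by_story = defaultdict(set)
    for who, story, when in diggs:
        if when >= since:
            by_story[story].add(who)

    # Every other user on a story that `user` Dugg becomes a match.
    common = defaultdict(set)
    for story, users in by_story.items():
        if user in users:
            for other in users - {user}:
                common[other].add(story)
    return dict(common)


now = datetime(2008, 6, 30)
log = [
    ("you", "a", now - timedelta(days=1)),
    ("you", "b", now - timedelta(days=3)),
    ("bob", "a", now - timedelta(days=2)),
    ("bob", "b", now - timedelta(days=5)),
    ("eve", "a", now - timedelta(days=40)),  # outside the 30-day window
]
pool = matched_diggers(log, "you", now)
```

In this toy log, `bob` is matched on both stories, while `eve` is dropped because her only shared Digg is older than thirty days.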

When it’s time to calculate your Recommendations, the Engine draws from this pool of
matched Diggers. For each matched Digger, it computes a correlation coefficient
between you and them. It then picks a cutoff for this correlation coefficient, and the
Diggers who make the cut are called “Diggers Like You.”

It’s easy to understand how the correlations are calculated. For each user with whom
you Dugg something in common, the Engine determines how many stories the two of
you Dugg in common, and divides that number by the total number of stories you or they
Dugg. The ratio is a correlation coefficient, a number between zero and one (zero if you
and the other user never agreed; one if you always did). Such a ratio is sometimes
called a “Jaccard coefficient.”
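
The ratio described above is the size of the intersection of two Digg histories divided by the size of their union. A minimal sketch, assuming each history is held as a set of story identifiers (the function name and the sample data are illustrative only):

```python
def jaccard(a, b):
    """Stories Dugg by both users divided by stories Dugg by either."""
    union = a | b
    if not union:
        return 0.0  # two empty histories: treat as no agreement
    return len(a & b) / len(union)


you = {"s1", "s2", "s3", "s4"}
other = {"s3", "s4", "s5"}

# 2 stories in common, 5 stories Dugg by either user.
print(jaccard(you, other))  # 0.4
```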

This scheme automatically accounts for the overall level of Digging activity. If another
user Diggs a lot, they have to agree with you on many stories to become a Digger Like
You. If another user Diggs rarely, then a small amount of agreement can suffice.
From Diggers Like You to Recommendations
Once the Engine has determined your Diggers Like You, your Recommendations consist
of stories that your Diggers Like You have already Dugg, minus the stories you already
Dugg or Buried. There are some extra steps, like the diversity rules and the
promotability constraint described below, but this is the basic idea.
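
The basic idea can be sketched in a few lines. This is only an illustration under simple assumptions: Digg histories are sets of story identifiers, the names are hypothetical, and the diversity rules and promotability constraint are left out.

```python
def recommend(diggers_like_you, diggs_by_user, seen):
    """Stories Dugg by your Diggers Like You, minus stories you already saw.

    `diggers_like_you` is a list of user names, `diggs_by_user` maps a
    user name to the set of stories they Dugg, and `seen` holds the
    stories you already Dugg or Buried.
    """
    candidates = set()
    for other in diggers_like_you:
        candidates |= diggs_by_user.get(other, set())
    # Sorted only to make the output order deterministic.
    return sorted(candidates - seen)


diggs_by_user = {
    "ann": {"a", "b"},
    "bob": {"b", "c"},
    "zed": {"z"},  # not a Digger Like You, so never consulted
}
stories = recommend(["ann", "bob"], diggs_by_user, seen={"a"})
print(stories)  # ['b', 'c']
```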

Recommendations are always displayed together with your Diggers Like You and their
compatibility percentages. These percentages are just the correlation coefficients. You may
notice that a highly compatible user sometimes supplies fewer Recommendations than a
less compatible one. This is because, although you agree more closely with the more
compatible user, that user has not Dugg as much.

The Recommendations you get from any particular user will come from topics (such as
Technology or World News) where you have a shared Digging history. We figure that
two users may have similar interests in a subject like ‘playable web games’, but one
person might be into politics while the other follows celebrity gossip. So we actually
compute correlations, Diggers Like You, and Recommendations independently in several
collections of topics.
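
To see why the per-topic split matters, here is a small sketch that computes the same ratio separately for each topic. The `topic_of` mapping and the function names are assumptions for illustration. How Digg actually groups topics into collections is not described here.

```python
from collections import defaultdict


def jaccard(a, b):
    """Stories Dugg by both users divided by stories Dugg by either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def per_topic_similarity(my_diggs, their_diggs, topic_of):
    """Compute the coefficient independently within each topic.

    Topics in which only one of the two users has Dugg anything are
    skipped, since there is no shared history to compare.
    """
    mine, theirs = defaultdict(set), defaultdict(set)
    for story in my_diggs:
        mine[topic_of[story]].add(story)
    for story in their_diggs:
        theirs[topic_of[story]].add(story)
    shared_topics = mine.keys() & theirs.keys()
    return {t: jaccard(mine[t], theirs[t]) for t in shared_topics}


topic_of = {"g1": "gaming", "g2": "gaming", "p1": "politics", "c1": "celebrity"}
me = {"g1", "g2", "p1"}
them = {"g1", "g2", "c1"}

overall = jaccard(me, them)                         # 2 of 4 stories
by_topic = per_topic_similarity(me, them, topic_of)  # perfect match on gaming
```

Here the two users look only half compatible overall, yet they agree completely within gaming, which is the case the paragraph above describes.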

Promotable Stories
Since the Recommendation Engine works only with Upcoming stories, all the stories you
get from the Recommendation Engine are “promotable”, meaning that they are recent
enough to be eligible for the Digg homepage but haven’t appeared there yet. This
means that whenever you Digg one of your Recommendations, you are helping select
stories for the front page of Digg!

Diversity
Just like stories on the homepage, we want your Recommendations to be diverse: a
balanced number of stories, not all on the same topic, and not all Dugg by the same
people.

To make sure that your Recommendations are diverse, the Engine imposes limits that
keep things from getting too focused. It makes sure that no one Digger Like You
determines too many of your stories. It attempts to make your Recommendations reflect
the spectrum of topics that you’ve Dugg in the past, and it adjusts the compatibility cutoff
for Diggers Like You so you don’t get too many or too few stories.
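
One simple way to picture the cutoff adjustment is to admit Diggers in order of decreasing compatibility until enough new stories are available. The sketch below is a guess at the general shape, not Digg's actual rule. The `target` parameter and all names are hypothetical.

```python
def choose_diggers_like_you(scores, diggs_by_user, seen, target):
    """Lower the cutoff one Digger at a time until `target` stories exist.

    `scores` maps user name to compatibility. Returns the cutoff that
    was reached, the chosen users, and the stories they contribute.
    """
    stories, chosen, cutoff = set(), [], None
    # Highest compatibility first; ties broken by name for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    for user, score in ranked:
        if len(stories) >= target:
            break  # enough stories: stop before the pool grows too large
        chosen.append(user)
        cutoff = score
        stories |= diggs_by_user.get(user, set()) - seen
    return cutoff, chosen, stories


scores = {"ann": 0.5, "bob": 0.3, "cat": 0.1}
diggs_by_user = {"ann": {"a", "b"}, "bob": {"b", "c", "d"}, "cat": {"e"}}
cutoff, chosen, stories = choose_diggers_like_you(
    scores, diggs_by_user, seen={"a"}, target=3
)
```

With a target of three stories, `ann` alone is not enough, so the cutoff drops to `bob`'s score and `cat` is never admitted.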

The Engine also limits the influence of any single one of your Diggs. For instance, if you
are Digg number 1,000 on a popular story, you will have 999 similar users from that one
Digg alone, and those users are not necessarily more compatible with you than the two
or three who may have Dugg a less popular story you also liked. The Engine limits the
total pool of users you can get from a single Digg to balance things out.
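
The effect of such a limit is easy to demonstrate. In this sketch, each story keeps an ordered list of who Dugg it, and at most `cap` earlier Diggers per story enter your pool. Keeping the earliest Diggers is an arbitrary choice made for the example, and the names and the value of `cap` are hypothetical.

```python
def capped_pool(story_diggers, my_stories, me, cap):
    """Collect matched users, taking at most `cap` from any one story.

    `story_diggers` maps a story to the list of users in Digg order.
    Only users who Dugg the story before `me` are considered.
    """
    pool = set()
    for story in my_stories:
        diggers = story_diggers.get(story, [])
        earlier = diggers[:diggers.index(me)] if me in diggers else diggers
        pool.update(earlier[:cap])  # assumption: keep the earliest Diggers
    return pool


story_diggers = {
    "popular": [f"u{i}" for i in range(1, 1000)] + ["me"],  # me is Digg 1,000
    "niche": ["x", "y", "me"],
}
pool = capped_pool(story_diggers, ["popular", "niche"], "me", cap=50)
print(len(pool))  # 52
```

Without the cap, the popular story alone would contribute 999 users and swamp the two from the niche story.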

We hope you enjoy using the Recommendation Engine and look forward to helping you
uncover even more great stories on Digg!
Digg on!
Anton Kast – Lead Scientist Digg


Comments

This sounds like a great update and a good opportunity to become more active in the Digg community.

 

wow, talk about playing right into the hands of digg spammers. it sounds like those people actively gaming digg are going to have the reputation of their diggs boosted.

 

Hopefully will live up to all the hype.

 

This can both be a good step as well as a bad one ….

Good coz, i might actually have to search less to find my type of stories and can check the recommendation to check the stories that relate to me.

Bad coz, the people who are actively digging just for the purpose of getting more visitors to their websites or their friends website, it will be easier for them.

 

One more for the Digg fanatics. I am sure they will lap it up, but really how useful is it going to be otherwise. Will need to watch out for it.

 

Peter Urban must have an auto refresh setup so he can comment first on every post - making him look like a lame ceo.

Overheard:
Arrington: Rose, want to be on my leadership panel at tc50?
Rose: Yes!
Arrington: Perfect, Heather will send you the contract - the terms are simple, we require 24 hours prior notice of any Digg announcement for the next 5 years. You will always be listed as “source”
Rose: Yes!

 
silicon valley dropout - June 30th, 2008 at 11:46 am PDT

took them 4 years to implement this

 

From Dr. Kast’s description, it looks like they don’t do text-mining at all. The downside is that the content of stories is not analysed, so relevance is tied to how other users rank items rather than to the content itself.

 

Hurry Chris, license your recommendation script to Digg.

 

I don’t see what is newsworthy about this; this is the standard technique used for any recommendation engine… in fact it’s the oldest technique!

 

The question becomes… how do you monetize this recommendation engine? If you can find a way to monetize this, and place the technology into a licensable format .. this could spell trouble for Google ad search revenues. Furthermore, if a search engine were to exclusively partner with this type of technology, it could throw the current search engine ad sales ecosystem into a mess… Look… http://www.readtheanswer.com/index.php?RTA=web2

 

Reddit tried to do this and failed miserably.

 

@Joe Bowers

facebook provides a similar recommendation engine with its “People You May Know” feature. probably uses a similar algorithm, and it has been successful in getting people to “friend” me.

how did Reddit’s feature fail?

 

Digg is only analysing the 2D ranking user data (ie, a matrix of user-by-story ranking weights) but ignoring the content. If they want to analyse the content & rank at once, then they should use a 3D algorithm such as Tensor SVD (Singular Value Decomposition). In this way, they can collect the rankings of every user plus the contents of the story (word-frequency). The data matrix is 3D , which is user by ranking by word-frequency (ie, a cube of data, row by column by depth) that the tensor algorithm can compute at one go.

 

Looking forward to this. Hopefully the little guys will now have a chance at getting to the top of the pages now too.

 

Digg can add the bells and whistles it wants, but at the end of the day it’s just trying to be an everyman’s Slashdot with limited success. They can add all the features in the world, it still doesn’t justify the wild valuations for this company. Digg is quite possibly the most overhyped web startup in existence today.

 

With all the collective intelligence that digg has, digg looks like a good acquisition target for Google.

 

I get kind of sick of seeing lots of similar stories on the front page that come from different tech blogs. Let’s hope this will make Digg a better social site.

 

They’ve been working on this for a *long* time, I recall having a conversation with Kevin Rose about it way back in August 2006!

It sounds like a fairly standard user-based collaborative filter. One problem with these is that they require a dense dataset to be effective, and if they are limiting it to 30 days of user activity, it is hard to see how the dataset will be dense enough. Put simply, user-based collaborative filters require a *lot* of data before they will work effectively. Unless you are Amazon or Netflix, you probably don’t have enough data for it to work well.

I have an interest to declare: I’ve been working on a collaborative filtering technology called SenseArray (http://sensearray.com/) for over a year now which is specifically designed to deal with a scarcity of data. It does this using a type of SVD collaborative filter, combined with the ability to use metadata about users (eg. browsers, operating system, geographic location, etc), and items being recommended (eg. keywords, categories, author, etc).

SenseArray already powers Newsgator’s website (http://newsgator.com/), and we’ll soon be rolling it out elsewhere for applications as diverse as selection of dating partners, to targeted advertising.

 

Digg is anything but a collective intelligence site.

1. They implement wisdom of the crowds phenomenon wrong - users must vote independently of each other, meaning they should not see what’s popular and what’s not. That kind of puts Digg in a catch 22 - in order to have a truly clever voting process, ALL or MAJORITY of users should vote on ALL or MAJORITY of content (up and/or down). That’s next to impossible to achieve and also defeats the purpose of having a page with most popular stories b/c people tend to vote for those and ignore the rest. On top of that, it has been shown that out of Digg’s 2.7mil users, about 35% are spammers, of the rest 65% about 95% are lurkers, of the remaining few percent another 95% are just blind voters, and only a fraction of percentage actually submits something. That turns Digg into a niche site at best and a self-promotion competition at worst. No wonder they couldn’t get the asking price of $300mil.

2. The number of news sources on Digg is VERY limited. I have personally used their API to run some stats, and my results show a perfect Pareto result - 21.7% of the sources accounted for 83% of the news. And that analysis covered the entire Digg dataset from Jan 2006 to April 2008. So much for the long tail. If you don’t believe me do it yourself. Just write a Ruby/Python/Perl whatever script to pull the popular stories from Digg going back to their beginning, put them in a DB and run some queries. You’d be shocked at how much hot air the Digg long tail is.

3. What Digg is doing with their latest feature is simply a somewhat primitive collaborative filtering. Showing me what other people vote for that overlaps with mine is OK but far from valuable. That’s not much different from showing me the front page. Similar voting patterns are far from similar tastes. What truly should be done is to build a model for EVERY user of what that user likes and then for upcoming news select the news based on CONTENT that would appeal most to every user individually. That’s where value is - finding what’s interesting to YOU. That’s what a recent MIT study called the future of the Web - on the fly personalization of the site that adjusts to you and to every other user’s preferences.

Some of my fellow grad students and I are building an engine that implements all of the above mentioned features, so hopefully we’ll manage to expose the sham that Digg is. I can’t believe that someone would be willing to pay $300mil for a niche site with a largely immature user base, which on top of that is dominated by spammers. Until the technology industry realizes that building a vaporware business and selling it for a lot of money (only to collapse later) is NOT success, the bubbles will keep on coming, bursting and wasting a lot of money that would be better used somewhere else.

Good luck Digg, see you on the field soon!

 

man I could sure go for a cheeseburger!

jt
http://www.FireMe.To/udi

 

Hm… wonder if diggrecommendationengine.com has already been regged?

 

This could be a great marketing tool for advertisers. Get ads catered to your specific taste.

 

good digg application. it combines search engine and human intelligence. it will let better be better.

 
