Since its inception, one of the biggest problems with Digg has been that users often submit the same content over and over again. This makes it harder for cool content to become popular because some users digg one submitted story, while some digg another. Today, Digg is releasing “several major updates” to its duplicate (known as a “dupe”) detection system.
The solution sounds fairly intensive. “To better understand the nature of the problem, we analyzed the types of duplicate stories being submitted. Most common are the same stories from the same site, but with different URLs. Our R&D team came up with a solution that identifies these types of duplicates by using a document similarity algorithm,” Digg’s Director of Product Chris Howard writes in a blog post. He goes on to say that a more technical follow-up post will explain how this actually works, but says that it has proven to be a reliable system so far.
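Digg has not said which document similarity algorithm it uses, so the following is only a rough sketch of one common approach: break each page's text into overlapping word "shingles" and compare the two sets with Jaccard similarity. The function names, the shingle size of 3, and the 0.8 threshold are all assumptions made for illustration, not details from Digg.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)


def is_probable_duplicate(a, b, threshold=0.8):
    """Flag two texts as likely duplicates (threshold is an assumed value)."""
    return jaccard_similarity(a, b) >= threshold
```

Two copies of the same story fetched from different URLs on the same site would score at or near 1.0 under this measure, while unrelated stories would score near 0.0. A production system would likely need something faster than pairwise comparison, but the idea is the same.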
But the really tricky stuff comes when people submit the same story from a different site. This is a gray area because of course some sites have different takes on the same topic, and who’s to say which is more Digg-worthy than another? Digg now says it will scan for descriptive information such as the story’s title to see if something very similar is already in the system. But still, it’s a gray area.
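To give a sense of what a title check could look like, here is a small sketch using Python's standard `difflib` module. It is not Digg's method; the helper names and the 0.85 cutoff are hypothetical, chosen only to show the idea of ranking existing titles by how closely they match a new one.

```python
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def title_similarity(a, b):
    """Similarity ratio between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def find_similar_titles(new_title, existing_titles, threshold=0.85):
    """Return (score, title) pairs at or above the threshold, best match first.

    The threshold is an assumed value; a real system would tune it.
    """
    scored = [(title_similarity(new_title, t), t) for t in existing_titles]
    return sorted([(s, t) for s, t in scored if s >= threshold], reverse=True)
```

A check like this only catches near-identical headlines. Two sites covering the same news under differently worded titles would slip through, which is exactly the gray area described above.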
At least the submission process should be faster now. Digg will run these dupe checks after you enter the URL but before you enter the description, which saves a step in the process. It claims this dupe detection will take only “a few seconds.”
And if you ignore the dupe algorithms and submit dupe stories anyway, Digg is watching: “We’ll also be monitoring when certain Diggers choose to bypass high-confidence duplicates and will use this data to continue to improve the process going forward.”

[photo: flickr/yogi]

first
Digg should put a RT @diggusers in front of any potential dupes and become TC media darlings.
That is the big problem with digg, it takes a lot more steps to tag stories.. tweeting is a 1000 times easier.
That is the big problem with digg, it takes a lot more steps to tag stories.. sneezing is a 1000 times easier.
Seems like the algorithm needs to change a lot.
That is what I find to be a major problem with Twitter as well – links from different tweets (typically re-Tweets) to the same article or blog posting. Since tiny URLs hide the actual destination of the links, users have to click through to see what the article is, only to find that it’s something they have already read. Someone needs to create an algorithm that reads the actual destination URL and marks any duplicates as “Already viewed”. Now if only I knew how to write code…
hey jeff…
don’t worry… i’m convinced that 95-99% of the people on here think css is code!
but what you state is what i’ve said is a major, although as yet, unexploited issue involving tiny urls.
tiny urls, obscure the underlying url, and is contrary to the entire issue of the DNS system.
if i give you a tiny url, you need to “trust” that the issue is valid, and by then, you’ve already linked on the link. unless you have a well formed sandbox browser.. who knows what the heck can happen.
and given the fact that an evil doer could in theory, spawn 100s of tiny urls with a number of levels between the tiny url you get, and the final parent url that’s the actual malformed page… this could wreak havoc….
have fun!
This is one of the reasons I left digg. That and the community has gone downhill since the early days of digg. It’s unfortunate but I can see this issue developing with an increased popularity.
Maybe I caught it on the switchover around noon, but there are multiple dupes of the NY Times Al Franken story, including my submission, with only minor variations in the URL. I was not offered the standard “Is this a dupe?” interruption screen during submission.
I sincerely believe that DIGG is DEAD!
“Our R&D team came up with a solution that identifies these types of duplicates by using a document similarity algorithm”
Now you can create such sentences too: http://dack.com...b/bullshit.html
Except that the sentence they used is completely legitimate… just because you don’t understand it doesn’t mean it’s BS.
Discriminating between those two (BS and legitimate things one doesn’t understand) seems like a big part of the challenge of the modern world. Too far to one side and you’re a sucker; too far to the other and you can’t learn.
True…
“Our boys wrote some code yesterday to see if multiple URLs point to the same page”
VS
“Our R&D team came up with a solution that identifies these types of duplicates by using a document similarity algorithm”
I miss those days as an engineer where we used to mess with clueless business dudes. Glad to hear the old trick is still in use.
The bigger dupe problem is stories being submitted from the same site with slightly different urls that point to the same page. This should be easily solved since the resulting pages are identical.
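One simple way to approach the "slightly different urls" case is to reduce each URL to a canonical form before comparing. The sketch below is an illustration only: the `canonicalize_url` helper and the list of parameters to ignore are assumptions, and stripping "www." or a trailing slash is a heuristic that will not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed examples of query parameters that do not change the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "partner"}


def canonicalize_url(url):
    """Return a normalized form of url for duplicate comparison."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop tracking parameters and sort the rest so ordering does not matter.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    # The fragment (#...) is discarded because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))
```

With this, `http://WWW.Example.com/story/?utm_source=rss&id=7#comments` and `http://example.com/story?id=7&ref=home` both reduce to `http://example.com/story?id=7`, so a second submission could be matched against the first.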
For stories on different sites, if the text content is 95% the same then there is a dupe. The problem is if someone submits a rip-off of the original article before the original is submitted. This could be solved by linking the different pages together under the same submission. You could then have some sort of flag people can set to say which is the original.
techies have a new crush, twitter
seems like it might work
ever since they stopped allowing shouts, even my coolest content doesn’t get diggs..despite 400+ friends.
Seriously, does anyone but 1000 or so crazy hardcore Digg users care about this?
What Digg should do is transfer all the diggs from the duplicate article to the original article, that way all the duplication can be eliminated and there will be no bias.
This post, totally photoshopped.
“…and whose to say which is more Digg-worthy…”
Who’s. Not whose. Who is.
I think this is a very good thing. However, I think that if it detects duplicate content from a different URL, it should add that to the main entry.
That way, when you are viewing the Digg entry, you can see all the places that the content is published. It would be sort of like how Google News shows you a story but allows you to read it from different publishers.
Hold up a minute. TC author discusses “users often submit[ting] the same content over and over again” ?
rofl so true. repeating and/or having no or half a story is great business.
also hilarious:
http://tinypic....=htv3nr&s=5
Good luck trying to accomplish this. Sounds like a daunting task
Specify your canonical!
Digg is a useless site full of junk links on top. I am sure investors (VCs) are trying very hard to get rid of it. Maybe AOL would buy them for free to put them out of their misery.
that’s absolutely right.