November 4, 2007

Attributor Launches Service to Track Copyright Infringement Across the Web

Erick Schonfeld

Every media company on the planet knows that its articles, songs, photos, and videos are being copied and spread willy-nilly across the Web, but they don’t have a clue what to do about it. They are not even sure what to do about all of their stuff that is just on YouTube (should they let Google monitor itself or create some vague industry guidelines and hope that every site follows them?). A startup called Attributor in Redwood City, Calif. says it can monitor the Web for copied content no matter where it may be, help publishers and media companies track it all, and help them decide what to do about it.

Attributor was founded in 2005 and has raised $10 million from Sigma Partners, Selby Ventures, Draper Richards, First Round Capital and Amicus. The enterprise version of its service launches today, although the company has been testing it with Reuters and AP for about six months. The enterprise service will cost anywhere from tens of thousands to hundreds of thousands of dollars per year (a more limited self-serve version for bloggers and smaller publishers could cost as little as $6 or $7 per month, and will launch in 2008). CEO Jim Brock gave me a demo of Attributor last week in the lobby of the Waldorf Astoria.

Attributor is already indexing 100 million Web pages a day (15 billion total so far), but it is not a keyword index. It looks for bigger blocks of content. Right now, it can handle only text. Images are in beta. And video matching will go into beta early next year. If you are a publisher that is a customer of Attributor, it ingests all your content and comes up with matches. Attributor splits up the world between sites that exhibit extensive copying (more than half of an article, for instance) and just some copying. It shows which sites have linked back to the original source and which have not. “Often, that’s all they want—a link,” says Brock. Below is a typical dashboard view of what a customer would see. In this case, the content from People.com is being analyzed (based on its feed). Of the 265,000 matches, 103,000 don’t link back to People.com.

attributordashboard.png
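
To make the "extensive copying" versus "some copying" split concrete, here is a minimal sketch in Python. Attributor has not described its matching method, so everything here is an assumption for illustration: the word-shingle comparison, the function names, and the 5-word block size are hypothetical. Only the "more than half of an article" cutoff and the link-back check come from the description above.

```python
import re
from urllib.parse import urlparse


def shingles(text, size=5):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one block: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_fraction(original, candidate, size=5):
    """Fraction of the original's shingles that also appear in the candidate."""
    source = shingles(original, size)
    if not source:
        return 0.0
    return len(source & shingles(candidate, size)) / len(source)


def links_back(candidate_links, source_domain):
    """True if any link on the candidate page points at the source domain."""
    for link in candidate_links:
        host = urlparse(link).netloc.lower()
        if host == source_domain or host.endswith("." + source_domain):
            return True
    return False


def classify(original, candidate_text, candidate_links, source_domain, size=5):
    """Label a candidate page by how much it copies and whether it links back."""
    fraction = copied_fraction(original, candidate_text, size)
    if fraction > 0.5:  # "more than half of an article", per the article
        level = "extensive"
    elif fraction > 0:
        level = "some"
    else:
        level = "none"
    return {
        "fraction": fraction,
        "level": level,
        "links_back": links_back(candidate_links, source_domain),
    }
```

Comparing blocks of several words rather than single keywords is what lets a check like this ignore pages that merely share vocabulary with the original. A real service would also have to handle near-duplicates, reordered text, and an index of billions of pages, none of which this sketch attempts.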

Attributor also shows which sites generate the most traffic, which are supported by ads, and which ad networks are making the most money off of your content across the Web. Of the sites that copy People.com extensively, for instance, 55,000 are supported by ads. “This becomes a billing engine at some level,” says Brock. But rather than go after each offending site, he thinks that Attributor’s data will give media companies leverage against Google and other ad networks. “If I am a big content producer,” reasons Brock, “and I can identify all the pages with Google AdSense, my conversation at that point is with Google.” Media companies could ask Google to ban the offending sites from AdSense or, better yet, to cut them in on some of the advertising revenues associated with their content.

attributor-lyrics.png

Ultimately, though, it is all about the links. Links are the currency of the Web. They are the way attributions are made. In most cases, media companies would be better off getting everyone who is copying their stuff to link back to them than trying to extract licensing fees from them or suing them. There is a lot less friction in asking for a link, and it doesn’t cost anything to give one out. Yet all of those links can turn into traffic, both directly and by imbuing the original source with higher search karma (i.e. a higher ranking on search engines).

A case in point is what is going on with music lyrics on the Web. The term “song lyrics” is one of the most popular searches online. In a study just released today (PDF here), Attributor scoured the Web for the lyrics of 14 of the songs at the top of the Billboard charts. It found 1,524 copies, mostly on lyrics sites, social networks, and blogs. The only site that has actually bothered to cut licensing deals with the record labels for these lyrics is Yahoo Music, yet in all Google searches (and even 81 percent of Yahoo searches) other sites outrank Yahoo Music when it comes to finding the lyrics for these 14 songs. Of those sites, 57 percent were supported by ads (mostly AdSense) for ring tones, concert tickets, and the like. A Google search for the lyrics to the Rihanna song Umbrella (pictured above) shows how much AdSense is powering the lyrics Websites.

It’s not just lyrics. In another study evaluating 215 recipes on Epicurious, Attributor found 3,959 copies, 65 percent of which did not link back to Epicurious, and 56 percent of which were on ad-supported sites. More than half of the copycat sites ranked higher in searches than Epicurious itself. I asked Attributor to run a search on some of my TechCrunch posts. One reporting some early details of Google’s OpenSocial project (codenamed Maka-Maka) was the 15th most copied post on TechCrunch since June, when Attributor started monitoring our feeds. (This Hulu post was the most copied overall, being copied 572 times.)

For the Maka-Maka post, Attributor found 243 copies, with 200 of those taking more than 80 percent of the text. Fewer than 40 percent actually linked back to the original post (you swine!) and 79 percent had ads on the pages. And this is just for one post. I won’t actually link to the offending sites (you know who you are, so cough up those links), but here are some screen shots. Highlighted portions are copied verbatim from TechCrunch; at least one site takes our entire feed, reposts it with AdSense ads, strips out the names of the authors, and does not link back to TechCrunch:

just-a-random-blog-maka-maka.png
human-capital-maka-maka.png
webuy-maka-maka.png

Comments

Come on - Y are blogs sooooooooooooooooooo long . PLease make it short and sweeet ( as it used to be ) [:)] - at the first sight of me seeing the blog , I hav a gud mind to read only the heading , skip it and go for the next one !

 

Great… after a long time I saw a startup which has the potential of becoming big like Google. Awesome work… I know this will hit very big or will be bought by Google or Microsoft very soon…

 

Copyscape already does this.

 

How effective can those quasi-gibberish ad sites be? Any human can tell that they are just clumsy screen-scrapes.

Do they really get enough random click-throughs to make money? Even worse, are there actually people who intentionally return to them?

 

I think people would cheer…
Techcrunch could be next… :)

Sometimes TechCrunch never quotes people’s works.

 

Excellent post!…but I agree it’s a little long. May be I just have ADD.

This company has lots of potential. Also, I have to hand it to First Round Capital as they are getting into some really good deals.

 

@3 - I think Copyscape is able to find some of your content online, but I don’t think it can manage such an index and monitor it for you.

This startup seems to pick up the ball and answer a need that many bigger sites, and maybe bloggers, have. Most of them won’t resort to getting a lawyer and suing the copycat, but if Attributor can pull the email address off these sites and automatically mail these offenders requesting a link back, that could be a nice option.

I won’t find myself emailing each one of the 231507923 sites that copied my content.

On another hand, I think we should open a cyber police already and declare a web of war. This is getting ridiculous ;)

Update - It seems copysentry.com should do that, but I am not sure since it requires registration and I would just like to have a sneak peek before registering.

http://www.octabox.com

 

Having posted THOUSANDS of threads in Forums and Blogs over a number of years. EVERY single quote was linked to the originator of the piece.

Even if it was found as a recap on another site - the original article was always sought out to provide a backlink.

This was being done even before backlinks and Google became popular.

It was a matter of ethics.

 

Well now that sounds like a good idea. Pay a web police force just like they do regular cops. Make them Cyber cops:) so they get paid to run down this kind of thing all day every day.

Hmmm, they’d probably need a mass emailer that can scrape email addresses from sites where things are copied without a link. Then they can send an ominous email to cease and desist all at one go.

I know myself, I try to use a paragraph or so and paste that link in there with the author and the whatever name(Paper, journal). I don’t think I’ve used anything without a link in years and years.

I prefer people to go back to the original site because those folks know more than I do about whatever they’re writing about, LOL.

And just for giggles and grins I’ll let you know I’m planning on using a paragraph or two of this article on my journal as a hook to grab interest and I WILL link it back to you consider it free advertising if you will, LOL. I’m even gonna put your name on it;) Well if you don’t mind too much that is. I think folks should know about this. And everyone doesn’t know about your page here.

 

They’ll have to have a nerve to copy this article without crediting it!

 

@Liz

One of my professors runs a blog about cybercrime and she actually tackled the issue of a cyber police force a couple weeks ago. It’s a pretty interesting read, and perhaps of interest to the general TechCrunch readership.

http://cyb3rcrim3.blogspot.com.....space.html

 

Here is a simple test for skeptics. Go look up any random topic on Google. Then take a part of the bio, story, whatever from a blog & stick it in Google. It will pop up on a bunch of other sites. Pretty sad, actually.

 

Google’s 700 dollar share built on a house of cards? NO FREAKIN’ WAY! You mean to tell me SEO and AdSense exploitation has made a complete mockery of PageRank and Google Search? NO FREAKIN’ WAY!

 

You call it copyright infringement; computer scientists call it redundancy.

Some day - years after today - big sites we know now will eventually fade out and disappear, with whole domain names leaving only a domain not found error after themselves (or a domain parking site).

Redundancy is the real currency of information, not links: whether something is important to mankind can be tracked back to how redundant that information is stored, how much it is worth the space and effort it takes.

So, unless we want to go back to the ‘www-404’ era, I guess this is how it should be.

 

hmmmm,
Hey, I can use this to see if Jobs is copying my speeches!!

http://fakesteveballmer.blogspot.com

 

Techcrunch, please quit your whining about others copying your blog posts. It’s simply economics at work. You should be happy that this is happening.

Marginal production costs are zero: Like software, it doesn’t cost anything to produce another digital copy that is just as good as the original as soon as the first copy exists, and anyone can create those copies (meaning there is perfect competition and zero barriers to entry). Unless effective legal (copyright), technical (DRM) or other artificial impediments to production can be created, simple economic theory dictates that the price of blog posts, like its marginal cost, must also fall to zero as more “competitors” (in this case, blog writers who copy) enter the market.

 

This is going to be a big winner!

 

Copyscape uses only google lookup and provides you with a UI. There is nothing behind it. These guys sound like a real deal. 15B index is not a cakewalk people….

 

I for one, welcome this. I am sick of getting google alerts that indicate someone in some far off corner of the globe has put out a “bid to clone” my website. I mean, seriously, good lawyers or not, this software is welcome.

 

Does this also work with PDF files or only HTMLized text??

 

I never understood why people care about quoting song lyrics. I do lyrics searches all the time. I just read them to see what the song is actually saying. Most of the time they’re not linked to the song to play, or where the music is, the lyrics are not there. Recipes and other content I can see, but lyrics? I don’t see why musicians would care if their lyrics were posted out there. People are not turning them into poetry books and selling them.

 

Yes, exactly as #3 said, copyscape have been around for ages, and they haven’t made much of a dent.

Anyway, this is like a typical GoDaddy product, something you can scare consumers into signing up for, but which is basically useless.

 

“The only site that has actually bothered to cut licensing deals with the record labels for these lyrics is Yahoo Music”

…this is completely untrue. There are a number of sites other than Yahoo that are legally displaying lyrics, including Rhapsody (RealNetworks), nuTsie (Melodeo), SoundClick.com, and others.

(Disclosure: We power the LEGAL lyrics services on those sites).

There have also been a number of other announcements from us and our competition regarding new legal services over the past few weeks. Yahoo is not the only source!

 

There have been previous startups doing this type of stuff, and almost all of them disappeared again. What makes this different from those previous ones? It seems the technology for detecting duplicated text is well understood, or do they have something that is really different? Or have the economics changed significantly since the last time, due to adsense and blogging, opening up a new opportunity?

Anyway, I am very underwhelmed by this, at least the way it is presented here. Seems like this could be built within a few months by 2-3 good people. Maybe there is an interesting backstory, but techcrunch’s breathless “look, someone has a tool that can do x” coverage doesn’t help me.

 

Great startup, with an important product.

This article is a nice way to see how Google monetizes and sponsors copy/paste content.

 

TC - imitation sincerest form of flattery ..or is it? :-)

 

This doesn’t help the ‘little’ blogger, whose articles are being swiped the instant they’re posted online. I’ve tried contacting Google when a site including my material uses AdSense. The response I get is that I have to file a DMCA complaint to get them to take action.

So merely knowing isn’t enough; there has to be a way to do something about it.

 

Very clever, #16: A verbatim copy of Arrington on October 4:

http://www.techcrunch.com/2007.....ards-free/

And I didn’t even need Attributor.

 

Hey, where’s the pingback to this morning’s insightful post, in which I argued that the “humble blogger” version of Attributor should be free?

 

i’m eric. joining a couple boards and looking
forward to participating. hehe unless i get
too distracted!

eric

 

Links are the currency of the Web.

Great article that inspired this post.

Since the currency of the web is links and links are proportional to authority then I propose G = 1 Google as the universal constant for the web. Does this make sense to you?

 

Oh come on.

You know we would only read from TechCrunch. Let them be. If you’re bummed about their website, just sue them. Probably is not worth the hassle.

 

There’s already a company that tells you if your images are being used elsewhere on the Internet… it’s http://www.copystalker.com

 
