August 8, 2007

Google News Hypocrisy: Walled Off Content

Michael Arrington


TechMeme founder Gabe Rivera makes an interesting observation on the Google News story all over the blogosphere today.

One thing that bugs me: they’re now hosting original news content, yet they prohibit other aggregators from crawling it (per robots.txt restrictions and TOS). Of course Google News relies on the openness of other organizations with original news content.

Google crawls news sites and grabs their content for republishing on Google News. They rely on the willingness of those news sites to get distribution on Google. But Google restricts others from crawling Google News itself via their robots.txt file and terms of use, which state that “you may not…use any robot, spider, other device or manual process to monitor or copy any content from the [Google News] Service.”

That policy wasn’t a problem when Google was simply aggregating news from around the web. But now they are hosting original news content, written by people that are involved in the story. And they are telling the world that no one else can crawl that content and display it. Yahoo News, TechMeme and every other non-Google owned news service on the web is restricted from using that content.

The restrictive policy hasn’t changed with the new feature launch, and this may just be an oversight. We’ll find out soon enough if Google intends to build a wall around this news content, or share it with the rest of the web.


Comments

Whatever happened to Google making everything “open” - i.e. information, wireless networks, news, etc.? I guess those days are long gone.

 

whatever happened to Google’s mantra “Don’t be evil”?-i.e. Not sharing, hypocrisy, ‘walled content’? I guess those days are long gone.

 

I hope Techmeme is ablaze with Google’s response tomorrow.

 

I imagine it is just an oversight. If it isn’t, they’ll have to fold to the pressure or be viewed as a company that has begun to “do evil”

 

I think Google has special agreements with some of the sites they crawl; they simply pay for that content.

My guess is they are not allowed to open that content and offer it to anybody.

 

Honestly, I like Google’s services.

 

Google is definitely the new Microsoft. They’ll bully others when they can.
They’re not as evil as Microsoft (no one is), but they’ll get there eventually.
Also, I think they have no idea what they’re doing anymore. They’re going in 6000 different new directions and don’t have enough manpower to keep up. Meanwhile others may very well create a newer, better, more advanced search.

They’ve got a lot of bright people. But like MS, they’ll flee if they’re not making millions after a few years. No point in working 12-hour days if you’re just a normal peon paying the mortgage.

 

Holy fuck, how could we have been so blind. “Don’t Be Evil” my ass.
Today it was also discovered that they are filtering torrent sites/files from their search index because of some DMCA takedown notices.

Google obviously doesn’t understand that the entire world needs just one reason to start justifiably hating on Google.

 

It is weird for Google to be this closed off. Crawling their site would make things too easy for the competition, I guess?

 

In a poll of evil organizations worldwide, including the Nazis, Aryan Brotherhood, and Janjaweed, the number one “evil company” was Google.

The Nazi said it best: With propaganda (advertising), Google has deluded the world into believing it is good, which gives it an enormous capacity to do evil behind the scenes. Just like the Nazis.

 

TechMeme is going to get killed for this.

 

It will be interesting to see if Google changes their stance on this

 

If any news site wants to stop Google from crawling and grabbing their content for republishing on Google News, all they need do is enter a few lines in robots.txt.
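
As a rough illustration of those "few lines", here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `example.com` URL are hypothetical, and it assumes the publisher wants to block the crawler that identifies itself as `Googlebot` from the whole site while leaving other crawlers alone.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a news site that opts out of Google's crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Hypothetical article URL, used only to query the parsed rules.
url = "https://example.com/2007/08/story.html"

print("Googlebot allowed:   ", rp.can_fetch("Googlebot", url))
print("SomeOtherBot allowed:", rp.can_fetch("SomeOtherBot", url))
```

Note that robots.txt is only a request: it works because well-behaved crawlers choose to honor it.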

 

As soon as Google puts their news product in the organic search results, many publishers will see Google as a competitor and back out of that syndication deal.

 

News flash people. Google has been evil for a loooong time. They’re out-microsofting microsoft and the battle’s already been won and lost. :)

 

Google is not the same company we all loved in 2002-2005. They mask their behaviors in a don’t be evil motto…wake up

 

Yeah. It’s very hypocritical of them.

I think it’s because they feel their work on text summarizers and document classifiers doesn’t deserve to be open (for whatever reason) but they’re really just relying on the hard work of others.

Granted they provide value (just like Tailrank, Techmeme, or Spinn3r) but they need to eat their own dogfood.

http://feedblog.org/2007/08/08.....-crawlers/

 

While looking at how hypocritical Google has been of late, it is worth noting that Google was recently caught selling PageRank again:
http://www.webmetricsguru.com/.....stiny.html http://www.seobook.com/archives/002403.shtml

 

duh, of course they’re evil, it’s a business, not a daisy farm - can we all move on now and lose a little naivete in the process?

 
 

Frankly, I don’t think this matters very much. First of all, is it really that necessary to scan Google’s already-scanned content? What kind of killer app were you thinking would come out of this? A comment aggregator? Come on, write one that aggregates comments from *across* the web and combines it with quotes from the sites of newsmakers (that’s the open model, right? everybody leaves comments on the “leaves”), it’ll be even better!

And second, is Google News that important of a service to begin with? What’s Yahoo’s policy in this regard - do we even care? Obviously, judging something based on one’s personal preferences is not a very sound practice, but thinking about myself, while I occasionally use Google News, I wouldn’t say they are my first choice for news (aggregated or original) - I much prefer places where humans *were* involved (and quite possibly, hurt) when laying out the front page. For example, I’d check this blog to see what’s new before I’d go on Google and look under the “technology” section. Call me old-fashioned…

 

Everyone needs to calm down. :-)

 

This (added to the interesting habits of Facebook et al) reminds me of the enclosure movement of commons land in England a few hundred years ago…hmmm…may just blog on that methinks.

 

I noticed three key lines in their robots.txt file:

Disallow: /news?output=xhtml&
Allow: /news?output=xhtml
Disallow: /news

A bit tricky isn’t it? If they want to disallow crawlers why post the second line in the robots.txt file?
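
One possible reading of those three lines can be checked with a small sketch. The helper names below are hypothetical, and this is not Google's parser: it only evaluates the three quoted rules as plain path prefixes under two common interpretations, "first matching rule wins" and "longest matching rule wins". Under both, the exact `/news?output=xhtml` page stays crawlable while everything else under `/news` is blocked, which may be why the Allow line is there.

```python
# The three rules quoted above, in file order, as (kind, path prefix) pairs.
RULES = [
    ("disallow", "/news?output=xhtml&"),
    ("allow", "/news?output=xhtml"),
    ("disallow", "/news"),
]


def first_match(path, rules=RULES):
    """Return True if `path` is allowed when the first matching rule wins."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True  # no rule matched, so crawling is allowed by default


def longest_match(path, rules=RULES):
    """Return True if `path` is allowed when the longest matching rule wins."""
    matches = [
        (len(prefix), kind == "allow")
        for kind, prefix in rules
        if path.startswith(prefix)
    ]
    if not matches:
        return True
    # Longest prefix wins; on equal length, allow (True) beats disallow (False).
    return max(matches)[1]


for path in ["/news", "/news?output=xhtml", "/news?output=xhtml&q=google", "/search"]:
    print(f"{path:32} first={first_match(path)} longest={longest_match(path)}")
```

The query string `q=google` and the `/search` path are made up for the demonstration; only the three rules come from the file quoted above.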

 

Google didn’t actually “rely on the willingness of those news sites to get distribution on Google.” Google didn’t give a damn about whether those news sites wanted their stuff used or not. I recall Agence France-Presse had to sue Google because they, in fact, didn’t want Google republishing their stories for free.

Not that I belong in the anti-Google camp. Unlike Microsoft, Google are good at what they do. Google has done wonders for me and my business. Research which used to take days takes hours, and of course the revenue from Adsense has changed our bottom line dramatically. I love you Google. Mwah! Keep up the good work!

 

Well, *this* is mighty interesting! (although I have to agree with Dave W….) but does anyone actually read Google News’ original content? Going to the site, all I find are the aggregated articles. As I said previously, the whole thing seems a ploy to get some free content–and now it looks like it’s also a ploy to get people to read their original content. Which, I might add, is nowhere to be found on their main news page. Where might they be hiding it? Does anyone care?

 

The only thing I don’t agree with is their use of “Do no evil”.
If you can’t stand behind what you preach, stop preaching it!
I say, the users will be the judges.

 
 

This seems like a step in the wrong direction.
C’mon Google, what are you up to.

I have recently seen other restrictions coming from Google, such as certain searches returning:

“We’re sorry…

… but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can’t process your request right now. ”

- I’m not a fan of filters!

 

I’m one of the very, very few traditional book publishers who’s always supported what Google has been doing with books - until I realised they were locking all that data away.

At least with Search and News the content is still out there elsewhere on the web for others to index; with Google Books they’re building a massive databank of previously print-only information, which no one else can search or index.

 

Google’s apparent hypocrisy is a bigger nuisance for Science, Technical and Medical publishers, who can provide links for their readers to go directly into searching Google Scholar, but are prevented from scraping Scholar’s citation results and displaying them on their site - which would provide scientists with useful information all on one screen, and would not even be commercialised.

Perhaps even worse is that Google often DOESN’T EVEN REPLY to publisher requests to crawl it. It’s not exactly evil, but it strikes me as arrogance and opportunism - and will last until everyone wakes up to what is going on.

 

One could say Google is innovating here - it is creating a new product using

a) Indexed content from other sites
b) User-generated content
c) Business “know-how”

Then, it simply moves to protect this product from being used as item a) in somebody else’s “content recipe”.

The issue that some people see here is that this kind of innovation has little to do with Google’s original raison d’être - i.e. being a damn good search engine. One could even argue that they’re trying to build another business on the backs of other people’s content.

But a bigger question is - why can’t newspapers create a joint organization responsible for collecting comments on the member newspapers’ articles? Is it that difficult?

 

I think it is just a matter of time before Google opens their news to others.

Probably they just have to define the format of the outgoing RSS. If they “Googlize” it like they do with everything they touch, I suspect this content will have a better presentation than it had in Google’s sources. I think they just don’t want to do their work for others who just aggregate it without the elegance of Google’s approach to everything.

 

It appears that everyone is focused on the robots.txt part of this post, but the TOS portion says that Google owns the rights to the comment. So, somebody could respond to a NYTimes story which Google scraped from their site, but the NYTimes would be forbidden from printing the response or possibly even quoting it in a story.

 

#6 “I think Google has special agreements with some of the sites they crawl; they simply pay for that content.

My guess is they are not allowed to open that content and offer it to anybody.”

This is not the case. They do not pay the sites they crawl news from.

 

The aggregation of content on the web is the key to successfully monetizing the content and optimizing it for users. The duplicate content issue is a big thing and I praise Google for keeping this walled off…

Option #1: Just scrape and then spin the content for maximum performance

Option #2: Forget the rules and just scrape the content, since Google is just hypocritical sometimes

Option #3: Report on the content and try to become the Reader’s Digest of Google News.

Option #4: I really just can’t think of any other ideas, anyone have anything to add?

 

11 posts to Godwin’s law. Bravo!

 

Hypocrisy? Of course it is, and it’s just one more example in a long list of examples. Refunding only 60% of money spent on “purchased” google videos is the most recent one. Refunding that via “google checkout” credits is just a slap in the user’s face.

This isn’t an issue, Dave Winer, of telling people to “calm down” and ending it with a cute smiley face. This is an issue of calling a spade a spade.

Me, I’m more old-school with my complaints: I still can’t get over why no one is on a rant that GoogleBot can’t even read an XML document and the web is stuck publishing in HTML. Seriously, being constrained by a search engine’s ability to parse is limiting all of us.

Letting browsers do transforms makes development much faster and easier, but we have to make workarounds for the self-proclaimed organizer of the world’s information because 4000 PhDs can’t even follow links in an XML page. (??!) Maybe Powerset will step up.

 

I do not understand why you need a crawler when you can subscribe to the RSS feed and grab the content from there!

 

Google is the only one for getting maximum traffic and the sales advantage that comes through search engines… I prefer others too…!!

 
