Google Takes Steps Towards A More Structured Web
by Jason Kincaid on May 12, 2009

Earlier today Google announced that it was going to begin limited support of RDFa, a framework that allows web developers to incorporate structured metadata into their sites. To most people, this probably doesn’t sound particularly exciting, but it’s an important step that may indicate that the search giant is going to embrace structured data on the web – something that it has long shied away from.

I’m not going to get into the specifics of the RDFa standard (if you’d like a more thorough explanation you can find one here and here). But the benefits of using such semantic tagging can be seen in a few basic examples. If I were to write a post that mentioned “The President” without naming him, Google probably wouldn’t realize that I was talking about President Obama – it might think I was referring to another US president, or perhaps the leader of a company. But using RDFa I could tag the words “The President” with “Barack Obama”. That tag would be visible to machines spidering the page for indexing (resulting in smarter search results), but wouldn’t be shown to users reading the post. In effect, it’s a way to tell search engines about your content without exposing your visitors to extraneous text.
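
To make the “The President” example a little more concrete, here is a rough sketch of what such a tag might look like and how a crawler could read it. The RDFa attributes `property` and `content` are real, but the `ex:` vocabulary, the `ex:refersTo` property name, and the `extract` helper are made up for illustration; this is not how Google's crawler actually works.

```python
from html.parser import HTMLParser

# Hypothetical markup: the "ex" vocabulary and "refersTo" property are
# illustrative assumptions, not part of any published vocabulary.
SNIPPET = """
<p xmlns:ex="http://example.org/vocab#">
  <span property="ex:refersTo" content="Barack Obama">The President</span>
  spoke today.
</p>
"""


class RDFaExtractor(HTMLParser):
    """Collects (property, content) pairs and the text a reader would see."""

    def __init__(self):
        super().__init__()
        self.pairs = []    # machine-readable annotations
        self.visible = []  # text nodes shown to human readers

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Only handle the simple case where the value is given in @content.
        if "property" in attributes and "content" in attributes:
            self.pairs.append((attributes["property"], attributes["content"]))

    def handle_data(self, data):
        self.visible.append(data)


def extract(html):
    """Return (annotations, visible_text) for a fragment of HTML."""
    parser = RDFaExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the visible text reads as one sentence.
    visible_text = " ".join(" ".join(parser.visible).split())
    return parser.pairs, visible_text
```

Calling `extract(SNIPPET)` gives the annotation `("ex:refersTo", "Barack Obama")` alongside the visible text `"The President spoke today."`, so the name is available to the machine without ever appearing in what the reader sees.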

RDFa tags will also allow search engines to identify structured data on a web page and present it in search results (Google is using it to generate its rich snippets). And browsers could potentially read the data and use it to present maps or other elements outside of the web page.
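
As a rough illustration of how structured data on a page could turn into something like a rich snippet, the sketch below groups the properties found inside a `typeof` element into a single record. The `typeof` and `property` attributes come from RDFa; the `ex:` vocabulary, the review markup, and the `collect_items` helper are assumptions for the sake of the example, and the parser only handles flat markup like this.

```python
from html.parser import HTMLParser

# Hypothetical review markup; the "ex" vocabulary is an illustrative assumption.
REVIEW = """
<div xmlns:ex="http://example.org/vocab#" typeof="ex:Review">
  <span property="ex:itemReviewed">Example Pizzeria</span>
  rated <span property="ex:rating">4.5</span> by
  <span property="ex:reviewer">A. Reader</span>
</div>
"""


class ItemCollector(HTMLParser):
    """Builds one dict per typeof element from its text-valued properties."""

    def __init__(self):
        super().__init__()
        self.items = []    # one record per typed item
        self._prop = None  # property whose text is currently being read

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "typeof" in attributes:
            self.items.append({"type": attributes["typeof"]})
        if "property" in attributes and self.items:
            self._prop = attributes["property"]
            self.items[-1][self._prop] = ""

    def handle_data(self, data):
        # Text only counts while we are inside a property element.
        if self._prop is not None:
            self.items[-1][self._prop] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        if self._prop is not None:
            self.items[-1][self._prop] = self.items[-1][self._prop].strip()
            self._prop = None


def collect_items(html):
    """Return the list of typed records found in a fragment of HTML."""
    collector = ItemCollector()
    collector.feed(html)
    collector.close()
    return collector.items
```

Here `collect_items(REVIEW)` yields a single record with the type `ex:Review` plus the reviewed item, rating, and reviewer, which is roughly the kind of data a search engine would need to show a rating next to a result.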



Mark Birbeck, who first proposed the standard and will be speaking at a conference on the semantic web this June, says that this is a big step for Google. He explains that Google has always tried to use its algorithms to derive context from the content on webpages. This usually works pretty well, but as we’ve noted before, there are some things that algorithms just can’t identify properly (at least, not yet).

Now, it may be some time before we start seeing any real benefits from Google’s implementation of RDFa. For starters, the search engine is only using it in a limited fashion, and it isn’t clear how long it will take for Google to begin incorporating it in other ways. But the standard is already spreading without Google’s help – Yahoo supports RDFa, and many sites, including the UK government’s, are implementing it too. Of course, with its dominant market share, Google’s stamp of approval is huge for RDFa’s acceptance, and we’ll probably begin to see more services follow suit (Drupal 7, for one, will include it by default).

That said, not everyone is happy with the way Google is using the standard. There are complaints that Google is using a hobbled implementation of RDFa, ignoring some of the established conventions that many webpages have already used to tag their data. Birbeck acknowledges that Google could have implemented RDFa better, but says that “the only reason they can even raise the question of whether Google used the right vocabulary is because they are using RDFa now. And that is huge.”




Comments

  • I’m torn. Half love the idea of RDFa tagging, yet I can just imagine the fun the blackhat brigade will have with it. Hidden tagging that visitors can’t see – woohoo. Not cynical – just been online a long time. :)

    • Yeah, that’s my concern too. If I was a spammer I’d just tag my content with totally unrelated (but lucrative) attributes. Hopefully some combination of PageRank + RDFa would cut back on that.

      • My thought on this is that Google will never replace their existing algorithms with RDFa but will rather add RDFa support as a dessert after the (algorithm) supper.

        So, most blackhat techniques will remain on radar.

      • Search engines can read dates. And h1 h2 titles. And probably Names too. And PageRank is powerful to organise related contents.

        Yahoo searchmonkey is far ahead in structured search, but yhoo still shows duplicate wiki redirects as 2 separate articles :(

      • Jason said…
        Hopefully some combination of PageRank + RDFa would cut back on that.

        I suspect that Google will make PageRank able to crunch higher-dimensional data (also known as tensors), i.e., a tensorised PageRank that includes other RDF objects. Variants of this technique have appeared in recent years from the IR research community.

  • I think for the moment Google is not much interested in semantics. This is a move to satisfy the Google community.

  • Structured website is the future?

  • OK, let’s take more content from those unsuspecting webmasters and sell it to them by saying their click-through rate will go up, while all Google is worried about is increasing its user engagement and revenue on the SERP pages of Google.

  • Another can of worms. Unrelated tags, it sounds like more work. More pandering to Google’s robots.

  • Google is focused on increasing its revenue, so let’s see what changes RDFa will bring.

  • Sounds like they are trying to improve relevant keyword searches by getting other people to insert more relevant tags (RDFa) to help Google’s bots sort things out… lol

  • Indexing content explicitly rather than parsing it with generic technology is the best.

    No technology Google could have come up with could have mapped EVERY page’s content perfectly.

    What you now need is an AJAX tool to let users hover over page elements on their site and produce these tags automatically.

    But I only thought of that because I’m a genius. Somebody will no doubt steal this idea and make at least 50,000 off of it.

    Fortunately for me Akamai is not a 50k business but a 50B business. I’m shooting higher than this. I don’t care about the nerdiness factor anymore.

    • I use Drupal sometimes. Every CMS from Drupal to Joomla is going to emit RDFa if they know it’s going to be indexed that way for “very” deep linked search results.

      Every blogging software package like WordPress is going to use these tags. Eventually Microsoft will build it into Blend and VS, and Adobe into DW.

      There is only a slight opportunity for an AJAX service that loads pages from a person’s website via cURL and embeds the code in the DOM with jQuery or something.

      Other than that, the only people that are going to see the windfall from this is Google themselves, when they show off their pretty new search results, with super deep anchor like content linking.

      If you were going to invest in this for a single site, it’s not even worth it, because your CMS, CRM, whatever is going to do it for you.

      It’s a shame but Google is so big that nobody else could possibly make a dime off of this. Akamai, Brightcove, etc., on the other hand, are big fat sitting ducks and I am hedging my small cap investment bets against them.

      • repeating the comment because you dont get the drift first time:

        you are a loser, and also a spammer. dont forget to renew the domain name of your soeet.com for next year. LOL
        have put you on the list of TC shameless spammers – the locatorguy, you and the smelly indian chick

  • Just another option for black hatters. Only people involved with SEO will use it. Therefore it will be completely unreliable for Google to trust for indexing.

    • I bet it will be as hard to cheat as pagerank was when it initially was used.
      Google has a relatively large development team. Expect them to pull this off well.
      They will also no doubt release a suite of webmaster tools to automate the tagging of your page content.

      I wouldn’t underestimate Google.

      I am starting to find that many people will invest in software development, but not many will buy large, heavy, truckable hardware to build large systems with.

      A. You don’t get a full tax deduction when you buy such hardware. It’s not rentable at reasonable rates. Try to rent SANs from HP for example or worse from SUN with those new solid state drives.

      B. They don’t want to get involved with large physical goods. They want it to be easy to clear out if and when it goes insolvent.

      I think a safe bet, the bet I took with our investment is to zig to hardware, when others zag to make the most out of virtualization. You just can’t beat Google at their game. Too many devs, too much talent, too many resources, too much information in their hard drives.

      • In the USA, you can only deduct the depreciation value and only after one year. So it puts a lot of people that build on credit lines off.

        Since the whole of the cali web 2.0 is built on borrowed money, that’s a good bet.

      • yes, we all know your bets against the biggies – and you are on the right path since you have registered the domain at least. like you always say – well begun is half done. woo hoo LOL

        make sure that you update your address with the govt for the unemployment checks. LOL

        • I get employment offers all the time and have to turn them down because they don’t offer the right benefits. Only sh1tty startups like the ones on TC offer jobs with no benefits.

          You probably work at one.

    • You are right. Semantic metadata is for SEOs only. The average Joe wouldn’t want to get involved in that.

  • I think that this has got lots of potential. Nobody knows in advance how much potential a thing has – that was the case with the search industry too.

  • Looking five years back, I remember all the grand expectations for the semantic web and RSS. The semantic web is still not there and RSS is still a tool of the nerds.

  • This is good information. Thanks techcrunch!

  • http://search.y...b-top&sao=1

    Check out the topic suggestions

  • It will be interesting to see how much effect this has on rankings and how long it will stick around. Just when you think you have a handle on something, it goes and changes!

  • Maybe with RDFa’s impact on SEO practice we will see more pages with semantic metadata.
