Is Keyword Search About To Hit Its Breaking Point?
by Erick Schonfeld on April 25, 2008

[Image: keyword-search-slide.png]

As the Web swells with more and more data, the predominant way of sifting through all of that data—keyword search—will one day break down in its ability to deliver the exact information we want at our fingertips. In fact, some argue that keyword search is already delivering diminishing returns—as the slide above by Nova Spivack implies. Spivack is the CEO and founder of semantic Web startup Radar Networks and is pushing his view that semantic search will help solve these problems. But anyone frustrated by the sense that it takes longer to find something on Google today than it did even a year ago knows there is some truth to his argument.

“Keyword search is okay,” he says, “but if the information explosion continues we need something better.” Today, there are about 1.3 billion people on the Web, and more than 100 million active Websites. As more people pile on, the amount of information on the Web keeps growing exponentially to accommodate all those seekers, and they themselves feel compelled to put their own personal and social information onto the Web as well.

At a certain point, with billions and billions of Web pages to sift through, keyword search just won’t cut it anymore. It’s a needle-in-the-haystack problem, with the haystacks just getting bigger and bigger every second.

Spivack explains:

Keyword search engines return haystacks, but what we really are looking for are the needles. The problem with keyword search such as Google’s approach is that only highly cited pages make it into the top results. You get a huge pile of results, but the page you want, the “needle” you are looking for, may not be highly cited by other pages and so it does not appear on the first page. This is because keyword search engines don’t understand your question; they just find pages that match the words in your question.
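To make the haystack problem concrete, here is a minimal Python sketch of the behavior Spivack describes. It is a toy under stated assumptions, not how Google or PageRank actually works: the three pages, their text, and their link counts are invented, and a raw inbound-link count stands in for a real citation-based ranking.

```python
# Toy corpus: every URL, text snippet, and link count is made up for illustration.
PAGES = [
    {"url": "popular-portal.example",
     "text": "run a script at startup on windows xp tips forum",
     "inbound_links": 9500},
    {"url": "big-blog.example",
     "text": "windows xp startup script roundup and opinions",
     "inbound_links": 4200},
    {"url": "exact-howto.example",
     "text": "how to run a script at boot time on windows xp",
     "inbound_links": 3},
]


def keyword_search(query, pages):
    """Return pages containing every query word, most-cited first.

    The engine never interprets the question; it only matches words
    and then orders the matches by how often each page is cited.
    """
    words = query.lower().split()
    matches = [p for p in pages if all(w in p["text"].split() for w in words)]
    return sorted(matches, key=lambda p: p["inbound_links"], reverse=True)


results = keyword_search("xp script", PAGES)
for rank, page in enumerate(results, start=1):
    print(rank, page["url"], page["inbound_links"])
```

All three pages match the words, so all three come back, but the page that answers the question directly has almost no citations and lands at the bottom of the pile. Scale the corpus up by a few orders of magnitude and that page is nowhere near the first screen of results.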

So how do we get beyond keyword search and Google’s PageRank? There are many approaches being tried: social search, tagging, guided search, natural-language search, statistical methods, open search, semantic search, and (way out there) artificial intelligence. They all have their problems. Tags are too messy and inconsistent. Natural-language search requires too much computing power, is difficult to scale, and doesn’t deal with structured data well. Semantic search is perhaps the most promising, but it essentially requires every single Web page to be re-written.

Spivack covered these issues during a presentation earlier this month at the Next Web conference in Amsterdam. It was one of the clearest explanations of the semantic Web I’ve heard so far (I’ve embedded his full slide show below). The semantic Web is nothing more than a set of standards that, if broadly adopted, would help computers extract meaning from the flood of data on the Web. But instead of a brute software approach, it puts intelligence into the data. “All you need to use that data is carried by the data itself,” says Spivack. Dumb software, smart data. That is an approach that scales no matter how many billions of Web pages are created.

The point, says Spivack, is:

To do for data what the Web did for documents.

You are turning the Web into a database, and your data becomes a part of it. Your data becomes part of the worldwide database. The semantic Web will let you move from data record to data record, just like you go from Web page to Web page.
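A rough way to picture “dumb software, smart data” is to store facts as subject, predicate, object statements and let a very simple program hop between them. The Python sketch below is only an illustration under assumptions: the identifiers, predicate names, and the `follow` and `walk` helpers are hypothetical, and real semantic Web standards define much richer formats than plain tuples.

```python
# Hypothetical statements in (subject, predicate, object) form.
# The identifiers and predicate names are illustrative, not a real vocabulary.
TRIPLES = [
    ("person:nova_spivack", "founderOf", "company:radar_networks"),
    ("company:radar_networks", "makes", "product:twine"),
    ("product:twine", "category", "topic:semantic_web"),
]


def follow(subject, triples):
    """Return {predicate: object} for every statement about `subject`."""
    return {p: o for s, p, o in triples if s == subject}


def walk(start, path, triples):
    """Hop from record to record along a list of predicates."""
    node = start
    for predicate in path:
        node = follow(node, triples)[predicate]
    return node


destination = walk(
    "person:nova_spivack", ["founderOf", "makes", "category"], TRIPLES
)
print(destination)
```

The code knows nothing about people, companies, or products. Everything it needs to move from one record to the next is carried by the data itself, which is the same way a browser follows links from page to page without understanding the pages.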

There are many obstacles to the adoption of the semantic Web, but its goals are something worth striving for. What is certain is that search needs to evolve, and Google and Yahoo and Microsoft with it. Of course, they can adopt whichever approach or combination proves most effective.

The question is: Will they, or are they too wedded to keyword search to move beyond it?

Responses

  • Gmail.com has been down for a long time now. Even Google Talk doesn’t seem to be working anywhere in the USA.

  • Of course, that’s the decades old classic AI problem. Forget about the web, you can’t even extract meaning out of a random unstructured text document. It’s obvious though that both Microsoft and Google have been thinking about this every single day, just like any student of computer science. In other words, there’d always be skepticism about the term ‘semantic web’ unless there’s some sort of a breakthrough. Currently, we don’t have anything better than a database (manually classify and index) approach, and one doubts anything more intelligent would come along soon.

  • I have also noticed I am finding difficulty getting good search results for what should be dead simple searches. For example, I was looking for a quick howto article on running a script on XP at boot time–I had to change keywords in 12 different ways and wade through a morass of invalid results, bad articles, and third party “suggestions” before I found the straightforward way it should be done.

  • Juggle.com may be working on this solution right now.

  • It’s good to know that web 2.0 will end in 2010. And why does tagging appear under web 3.0 like it’s some sort of awesome future technology that we should be looking forward to?

  • I think the semantic web makes sense, but the whole idea of natural language search seems unnecessary. Keyword search generally works pretty well, and search engines don’t often return haystacks instead of needles unless you’ve failed to use related keywords and other parameters such as limiting to a specific website or timeframe.

  • That graph is ridiculous. I’m going to go out on a limb and guess that presentation to the user will likely take the form of words for a while yet. And somewhere, some algorithm will decide what ads make the most sense to match with what is being presented on screen. If everyone feels better calling it a new name, go for it.

  • I am an idiot and my last comment made no sense.

    However, I do think it’s going to be a lot longer than that graph that I’ll be searching by typing word queries. I’m sure someone will blow my mind with an interface that suddenly rings in the promise of sentient computing, but for now I’ll keep on with words.

  • I never have, and probably never will, understand why natural language search is “cutting edge”. It seems like a lot of work for nothing, keywords are just as good by my count.

  • What a great chart….

    - my curve goes up, see

    - the competitor’s curve goes down, see

    ipso facto, google sucks and I don’t

  • The first graph is crap.

    I’m using wikipedia about as often as I use google…. 3rd most used is delicious.

    Plus through RSS & “web 2.0” I’m getting a lot of stuff recommended/pushed to me, which cuts down my searching time.

  • This underscores the role that Q&A sites will play as time goes on.

    We’ve got natural languages searches working right now, and by relying on actual humans we can solve those tough AI problems that may or may not ever be solvable.

    I believe the long term challenge with search is the very assumption that search itself is the primary web tool of the future. As social networks widen and become more flexible and usable, I think a significant amount of data acquisition will move to these networks, and the role of search will diminish.

    Ben Finkel
    http://www.fluther.com

  • Also, I still think search will utilise collective human-powered efforts. (No, Mahalo won’t scale, and no Facebook won’t have anything useful to add)

    I am yet to be convinced that the “semantic” web won’t be overtaken by spammers. Much like meta-tags were spammed early on in web 1.0

  • I like the Google Docs PPT widget. tits.

  • That first graph is horrible. It takes a whole lot of made up assumptions and predictions, and packages it to look as authoritative as possible, using a form normally used to present actual real world data.

    “See? This stuff I pulled out of my ass is real!”

  • Loudacris,

    Tags are nothing new, that is for sure. But what if you could tag an object, or entity, with another object. So instead of tagging objects with strings, which falls back on a simple full-text search, you could tag something with an actual representation?

  • Keyword search isn’t going away anytime soon. Keyword searching is ingrained into the internet habits of those 1.3 bn internet users. Too much cognitive dissonance involved with changing to another way of interacting with the web for the majority of them.

  • The top slide is conceptual, boys. But it does feel right to me. I’d love to see some real data to back it up. If anyone has that, please add to comments or send me a note.

  • i use del.icio.us more often these days. zoominfo, wikipedia, krugle, technorati etc etc help me with my search. just google is not enough anymore. some times yahoo works better than Google. only 20% of keyword search returns what i am looking for. even those results are just more popular, they may not be the best. some times i never find what i want even though i know that it exists somewhere on the web. custom search engines like google co-op must become more widespread

  • Erick, I sent you an email tomorrow with actual data from 2030.

  • I believe that Keywords search is not mutually exclusive with semantic search, but rather is part of it. People will still likely use keywords when searching for content. What will change though, is the ability to collect and organize relevant data based on your natural interests, your social groups and your location. Where semantics helps is when the service being used understands the contextual interactions users have on the web and proposes the proper actions based on the preferences and context. Many services such as Twine, Faves or Yokway are going in the right direction in making this new web a reality.

  • Keyword search successfully fulfills all my requests. What will be tomorrow, that is the question.

  • Proper communication between human & computer is a key factor in achieving “semantic search”.
    I see the computer as a blind and deaf person with a very low IQ. It’s pretty hard to communicate in such a situation.
    I think “semantic” search can be a reality only if we could find better ways & ideas to communicate and express our thoughts to today’s computers.

  • Hey Radar, Toys R Us just called, and they want the backwards ‘R’ on your logo back. Hahahahaha. Ahh, I just slay myself… :-)

  • AI in 2020 is like Kubrick expecting HAL by 2001.

  • one thing that is absolutely not going to work is relying on the content owners to put the semantic smart data into their pages. that would be no better, because it relies on people a) knowing what they’re doing, and more importantly b) not lying about it to boost their search rankings. effectively it’d just be pushing the keywords down into the metadata instead of the page content itself, unless i’m missing something.

    i’m totally with you though that keyword search sucks, especially for tech stuff. i search for some obscure thing and have to wade through 500 shitty blogs that don’t have anything to do with my problem but mention the same words.

  • I’ve been participating in the Twine beta, and I don’t yet see the value. Presumably they are depending on scaling up long before 2030.

    But I note the glaring absence of adversarial information retrieval in their discussion. Until and unless someone can reinvent web information retrieval to not be a no-holds-barred competition among publishers (particularly spammers) for the attention of users, mitigated only by the secret relevancy computations of search engines, we’re not going to advance far beyond the present state.

    More discussion at http://thenoisychannel.blogspo.....nghal.html.

  • Keyword search works effectively for a large fraction of searches and in these cases it is probably the most efficient way for the user to get what they want. As Charles Parker pointed out, any technology that slows down these easy searches by requiring extra user effort or new user behavior is doomed to failure.

    The approach we are taking at Surf Canyon is to augment the keyword search with an intelligent agent that observes the user’s interaction with the search results page to figure out the true meaning, in context, of the user’s query. By simply observing how the user interacts with the search results you can often figure out which results are relevant to that user at that time and which are noise.

    The ultimate search engine of the future will adapt to the user’s real-time needs (is the user looking for a specific web site? is this an open ended research query? Is this a keyword search or a natural language query?) and respond appropriately.

  • gilltots - I agree, I’m a bit sceptical about the semantic web in general for the reasons you gave. I think better search is going to be about personalization and the interface rather than anything fundamentally changing how keyword search is done.

    On your comment about finding tech stuff - you should try out WebMynd. I am biased since I helped create it, but I use it myself for collecting good tech resources while doing development.

    WebMynd doesn’t help you find stuff in the first place, but it saves and records all the pages you go to as you find them, so when you find something useful, it stays found. When I type ‘MochiKit’ or ‘python’ into Google I get all the resources that are actually useful in the WebMynd results rather than the Google results which are pretty general.

  • Wait, wait… This Semantic Web, which is projected after the Social Web… Wasn’t that the talk of the town right BEFORE all this web 2.0 crap? Yes sir, it was. So either the Semantic Web is at 1.5 or something, or “they” decided to put it on hold and concentrate on 2.0 first.

    *sigh*

    This web stuff, it’s confusing.

  • This is the best article on TechCrunch so far. Simply awesome! Well written. Michael Arrogant if you are reading this then learn how to write a great and useful article.

  • Just now I was searching for info on Flash classes - but the programming version of “classes.” All the Google search results were for sit-down courses on Flash. Not what I was looking for at all.

  • nice article Erick. keyword search will be supplanted, or heavily supplemented, how quickly is a tough call… and would we even notice it happening?

  • You would be surprised how good Google is at making sense of keywords, especially when you consider all the new tech terms, slang and jargon emerging every day. Just a few days ago nobody knew what a truthbox was, but already Google knows “truthbox” is a Myspace app. It amazes me Google makes these deductions without human intervention.

    Also, it’s a fact search queries are much longer today and I often see queries now that are complete sentences, often in the form of a question.

  • @webomatica, try “flash class files”. Wouldn’t “actionscript class files” be more correct though?

  • “There are many approaches being tried: social search, tagging, guided search, natural-language search, statistical methods, open search, semantic search, and (way out there) artificial intelligence. They all have their problems.”
    Except he doesn’t mention what the problem is with guided search or statistical methods. Maybe because there aren’t problems, it just sounded good to say everything has problems.

    This whole premise of semantic web improving search is BS. Saying that the semantic web is going to save search is like saying that meta tags help. The boost they gave to spammers helped open the door for Google and PageRank. The initial point of PageRank is that you can’t trust the data, so the semantic web approach only improves what you can’t trust. The semantic web crowd are just a bunch of XML jockeys / failed librarians who love telling other people how to organize their own stuff. Things like del.icio.us tags are vastly more effective, when there is a community of sufficient size to outweigh the spam.

    What exactly are people having trouble finding on the web? Guided navigation, even at the level that Google already provides, certainly helps you get the right kind of thing back (video, image, book, journal article, etc.) even if you don’t specify it up front. It’s really the attempts to categorize all the world’s information that end up failing: very few people try to browse the web now via interfaces like the old Yahoo hierarchy. It can be done at smaller scales, but not for everything.

    I always think people like commenter #3 sound whiney. I just did a google search for-> xp script startup <- (which was the first thing I tried) and got great results. Maybe you need some search training?

  • All of those things but applied in vertical, niche search engines. The specialized search engines can do a lot better job than the general search engines. For example, http://krugle.org/ for code and http://markmail.org/ for email.

  • Keyword search is the NUMBER ONE way to search and always will be.

    There is no needle in a haystack problem, there will always be a #1 site, and a #2 site etc on Google.

    The problem webmasters face is increasing competition to rank #1 and that’s a GREAT thing for everyone.

    A semantic web isn’t the answer - proper SEO (that so many sites ignore) is.

    Did you know that you could cut the size of the blogosphere IN HALF right now if you got webmasters on board? Seriously, take a look at your average blog. Look at its source code. Notice that the head section is on average 250-300 lines long? Then look at an SEO’s blog like mine, the head section is under 20 lines long. Actually my blog contains more data, uses more tools and is STILL shorter than just the head section of most blogs. That’s part of SEO. That’s helping Google not drown in lines of code. That’s the answer.

    Bloated sites belong on page two or beyond… well coded sites that help google deserve a shot at #1. IMO of course.

  • I’m in the camp of the two not being mutually exclusive. Users will continue to seek relevant content using keywords. What will improve is the search engines’ ability to disambiguate their queries and present more intelligent organization and segmentation of results using semantic-web type processing. It’s highly unlikely, however, that content producers will be able to generate proper semantic tags on their own. A big growth area on the web will be technologies that can process unstructured content and create intelligent meta data which can be purposed in the future for new semantic signatures that arise. These won’t appear top down for a long time, as the coordination costs are too high. Early signs of how this might happen are Yahoo’s standards for meta data ingestion within Yahoo search. There will be many flavors of the semantic web for some time to come (and a host of Semantic Web agencies much in the way there are SEO agencies today).

  • When we talk semantics, don’t we also talk about keywords? After all, words are semantics, so Keywords ad will not disappear so quickly, I guess

  • I have to agree with the comment made by gilltots, too much will rely on the human aspect of providing “valuable data”.

    As it is, using keywords is mainly now and will continue to be the means to “search” the internet, it is not totally reliable. Being in the IT support field, I use Google constantly.

    1 out of every 5 searches results in me having to play with the keywords being used. Not every search do I get what I need, or sometimes I get data that is years old.

    Trial and error… we’ll see!

  • @Tony, yep, your suggestions worked for what I was looking for. But I think you get my point: I want Google to be able to read my mind :|

  • A good concept we could extract from some comments here:

    What do you think of a mashup between google, wikipedia and delicious results with a proper algorithm?

    We can start with a simple weighting algorithm and statistically refine it.

    Too many projects in my hands and no time to play with it but let me know if someone gave it a try.

  • Djilali,

    I can tweak my results at Isayhello.com to do that pretty easily. But the problem is that if you use UGC to filter your results then they are too easy to manipulate. I can give something 500 delicious book marks easily enough. I don’t publish how I filter my results for this very reason. I looked at using Digg, and I do use them if a search is detected as “Trendy” meaning based on recent events. But that can be fooled by a game show like Are you Smarter than a 5th grader which often creates “news” like trends for dead people, like Which president was never married….

    Tweaking the results is not as simple as doing a mash-up. It is about refining the search before you process it, and returning results that are based on the type of search. I have started to address the results issue, and am using suggested searches to help users find the right questions for what they are seeking.

    As I grow I hope to handle both, but I started with one side of the equation, looking to add the other later.

  • Great post — the problem is real — the web’s rapidly filling up with speech, and speech as data is infinitely more tricky to analyze than the printed page/post (which is where web pages got started).

    Just getting context out of tweets, feed and status messages, comments etc is a massive headache unto itself, let alone relevance, rank, influence, etc etc. But that is where it’s headed.

  • The semantic web is a great idea but turning it into a widely used practice would take years of retraining the population. Look at CSS: an obviously superior technology but ten years after someone figured out how to hack a design using html code for spreadsheets, half of the internet is still built on tables.

    Anyone wanting to build a better search engine can’t begin by whining that the input data (the entire WWW) is built wrong. It’s the search engine’s job to embrace the chaos and find ways of making sense of the data. The folksonomic approach tries to build out on context to understand relationships and meanings that aren’t necessarily apparent. The most interesting websites now are combining multiple services and different types of data to achieve what I’ve been calling “folksonomic density.” On a daily basis I’m interacting with Techcrunch via RSS/Netvibes, Arrington’s Twitter feed, references I’m seeing on other tech blogs and even sometimes by visiting the site itself. Searches that can figure out these relationships in a sort of fuzzy logic will give better data. Some of it will be semantic, either by design or convention (e.g., Google would be smart to index Twitter’s emerging hash-tags). (Shameless self promotion alert: I wrote about folksonomic search and density in my recently published O’Reilly shortcut Web 2.0 Mashups and Niche Aggregators.)

    The talk in the comments about SEO brings up an interesting point: wasn’t metadata the first attempt at a semantic web? Professional scammers figured out how to abuse it and now search engines ignore the data (except for description). It’s much easier to game a system built on pure logic than it is to fake folksonomic context. The semantic web has a lot to offer but it’s never going to give the best unbiased data.
