Live Web, Real Time . . . Call It What You Will, It’s Gonna Take A While To Get It
by Guest Author on June 30, 2009

This guest post is written by Mary Hodder, the founder of Dabble. Prior to Dabble, Hodder consulted for a number of startups, did research at Technorati and wrote her master’s thesis at Berkeley on live web search using blog data.

Real time search is nothing new. It is a problem we’ve been working on for at least ten years, and we likely will still be trying to solve it ten years from now. It’s a really hard problem that we used to call “live web search,” a term coined by Allen Searls (Doc’s son) that refers to the web that is alive, with time as an element in all factors, including search.

The name change to “real time search” seems a way to refocus attention toward the issue of time as an important element of filters. We are still presented with the same set of problems we’ve had for at least the past ten years. None of the companies that Erick Schonfeld pointed to the other day seem to be doing anything differently from the live web search / discovery companies that came before. The new ones all seem to be fumbling around at the beginning of the problem, and in fact seem to be doing “recent search,” not really real time search. While I’m sure they’ve worked really hard on their systems, they are no closer to solving the problem than the older live web search systems got. All the new ones give a reverse chron view, with most mixing Twitter with something else (blog data, other microblog data, photos) to create some kind of top list of recent trends. Some have context, like a count of activity over a period of time, or how long a trend has gone on, or a histogram (Crowdeye), which both Technorati and Sphere experimented with in the early years. Or they show how many links there are to something or the number of tweets. All seem susceptible to spam and other activities that degrade the user experience, and none seem to really provide the context and quality filters that one would like to see if this were to really work. All seem to suffer from needing to learn the lessons we already learned in blog search and topic discovery.

Publicly available publishing systems starting in 1999 took the value of time and incorporated it into what was being published (think Pyra which is now Blogger, Movable Type, WordPress and Flickr, among the many), as did search and discovery systems for those published bits like Technorati, Sphere, Rojo, Blogpulse, Feedster, Pubsub and others, to walk down memory lane . . . (btw, for disclosure purposes I should state that I worked for Technorati in 2004 for 10 months, and consulted for or advised almost all the others in one form or another).

I started working on this problem in 1999, at UC Berkeley, and eventually did my master’s thesis on live web data search and topic discovery at SIMS (or the iSchool as it’s now known). From 2000 to 2004, people at SIMS would say to me, “What are you doing with blogs and data, it’s just weird. Why does it matter?” But the element of time was the captivating piece that was missing for me from regular search. It’s the element that makes something news, as well as the element that can group items together in a short period to show a focus of attention and activity that often legacy news outlets miss (until more recently when they decided that live web activity was interesting).

Barney said, you have my explicit permission to flickr me, so get your camera..

At Burning Man in 2005, under a shade structure during a hot, quiet afternoon, I remember having a four or five hour conversation with Barney Pell (who would later found Powerset) about the Live Web and Live Web Search, how to do it, what it meant, how to understand and present time to the user, how much was discovery and how much was search, how structured was the data you could get and how reliant on the time could you be with the data, what meaning you could make from that data, etc. Sergey Brin was sitting and listening, and finally, after a couple of hours, he asked me, “What is the live web and what is live web search?” Since Barney and I had already been doing a deep dive, I assumed Sergey knew what we were talking about, so it surprised me, but I explained why I thought time was a huge missing element of regular search, and that this was the type of search I worked on. Barney and I continued for a couple more hours. And it got cooler so it was time to go admire the art and that was the end of that. But I have wondered over the years where Google is with the live web and when they might do something with time. Twitter seems to be prodding them.

In 2006, the Newsweek cover story “The Living Web” by Steven Levy and Brad Stone poked at this issue for the first time in a national forum.

When I look at the latest crop of search startups, I think: Why are we doing it all the same way again? Reinventing the wheel? Is anyone doing anything original either with data or interface? Is anyone building on what we’ve learned before about the backend or UIs?

Frankly, our filters suck, and I suppose that if a name change gets us to think anew about better filters, well, I should rejoice. I’m partly to blame for the bad filters we have to date because, in having worked on this problem, I’ve contributed to some of the various live web or real time (or whatever the word of the moment is) attempts to solve it. We are very good at publishing our thoughts and visions, with time stamps, but not very good at the filtering side of things. The old method of information search and discovery was to open the paper or magazine, turn the pages with editorially filtered and placed information, and when you were finished, you said, “Okay, I’m informed” (whether you really were or not). But the media got complacent and missed stories, and with the ease of blog publishing and sites like Flickr for photos, we could replace paper and supplement our information needs with the whole web. The only problem is, it’s the whole freaking web. An avalanche. We feel anxiety on the web from the lack of the filter and editorial grace that one or two printed news sources used to give us.

I did a study in 2002, which I repeated in 2004 and again last year in 2008. I asked users to track their online information intake for one week. There were only 30 people in each study, chosen randomly from Craigslist ads, but what I found across each group of 30 was that the average time spent online with news and information sites was 1.25 hours in 2002, 1.85 hours in 2004 and 2.45 hours in 2008. These people are not in Silicon Valley, but they do all have broadband at home and live in the US. Every one of them reported some level of anxiety over the amount of data they felt they needed to take in in order to feel informed. They often dealt with it by increasing the time they took to stay informed. They didn’t know that better filters might actually reduce their anxiety.

As Erick noted, the tension to solve this problem is between memory and consciousness; or as Bob Wyman and Salim Ismail called it at Pubsub: retrospective versus prospective search. And it is part of the issue. But there is more.
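A minimal sketch of that distinction, in Python. Retrospective search runs a query against items that are already indexed; prospective search stores the query and tests each new item against it as it arrives. The names (`LiveIndex`, `subscribe`, `publish`) are hypothetical, and the matching is naive word overlap, not how Pubsub or any real system worked.

```python
from dataclasses import dataclass, field


@dataclass
class LiveIndex:
    """Toy index supporting both search styles (hypothetical names)."""
    items: list = field(default_factory=list)          # (timestamp, text)
    subscriptions: dict = field(default_factory=dict)  # query -> matched items

    @staticmethod
    def _matches(query, text):
        # Naive matching: every query word must appear in the text.
        words = set(text.lower().split())
        return all(w in words for w in query.lower().split())

    def search(self, query):
        """Retrospective: scan what has already been published, newest first."""
        hits = [(ts, t) for ts, t in self.items if self._matches(query, t)]
        return sorted(hits, reverse=True)

    def subscribe(self, query):
        """Prospective: store the query and wait for future items."""
        self.subscriptions.setdefault(query, [])

    def publish(self, timestamp, text):
        """Index a new item and test it against every standing query."""
        self.items.append((timestamp, text))
        for query, matched in self.subscriptions.items():
            if self._matches(query, text):
                matched.append((timestamp, text))


index = LiveIndex()
index.publish(1, "massive squid found in australia")
index.subscribe("squid")
index.publish(2, "scientists examine the squid")
index.publish(3, "broadband prices fall")

retrospective_hits = index.search("squid")        # looks backward over everything
prospective_hits = index.subscriptions["squid"]   # only items seen after subscribing
```

The retrospective query sees both squid items, while the standing query only catches the one published after it was registered, which is the "memory versus consciousness" split in miniature.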

Discovery does mean you have to introduce time as an element. The user cannot be expected to know what is bubbling up, or the specific phrases that will name the latest thing.

Some people will say “michael jackson” and some will say “MJ” and some will say “king of pop.” And Michael Jackson as a topic is actually pretty easy. I remember once doing usability tests for a live web search and discovery system in 2003, where we asked users to search on Google News and various live web systems for an incident in Australia where a “giant sea creature” was found. But since all the media covering it originated in Australia, and they’d all called it a “massive squid,” and all the follow-on American sources including bloggers had copied the Aussie language, there were no recent hits for “massive sea creature.” Testers had to think creatively about how to get to the info they knew was there, and yet it was a semantic leap. One search tester actually cried as she refused to give up; she was so determined to find the result in any of the live web systems we were testing. We begged her to stop; it was painful. Good discovery could have helped.
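One hedged sketch of how a discovery system might surface what is bubbling up without the user supplying the phrase: compare each term's count in a recent window against its count in the window before it, and report the terms that jumped. This is a simple ratio heuristic, not any particular product's algorithm; the function name `bursting_terms`, the window size and the thresholds are all assumptions for illustration.

```python
from collections import Counter


def bursting_terms(posts, now, window=60, min_count=3, ratio=3.0):
    """Return terms whose recent frequency far exceeds their baseline.

    posts: iterable of (timestamp_seconds, text) pairs.
    Recent window is (now - window, now]; the baseline is the window before it.
    """
    recent, baseline = Counter(), Counter()
    for ts, text in posts:
        age = now - ts
        if 0 <= age < window:
            recent.update(set(text.lower().split()))
        elif window <= age < 2 * window:
            baseline.update(set(text.lower().split()))

    bursts = []
    for term, count in recent.items():
        # Add 1 to the baseline so unseen terms do not divide by zero.
        score = count / (baseline[term] + 1)
        if count >= min_count and score >= ratio:
            bursts.append((term, score))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(bursts, key=lambda pair: (-pair[1], pair[0]))


posts = [
    (10, "the weather is nice"),
    (20, "the news is slow"),
    (70, "massive squid found"),
    (80, "the massive squid is huge"),
    (90, "squid photos from australia"),
    (100, "the squid again"),
]
trending = bursting_terms(posts, now=110)
```

In this made-up stream, "squid" is flagged because it appears in four recent posts and none earlier, while a common word like "the" is not, because it was already frequent in the baseline. A tester who typed "sea creature" could then at least be shown the term that is actually rising.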

Another key element of discovery and live web search is getting structured data, because spidering, which Google uses to get data from the web for its regular retrospective web search, makes understanding time with a published work more difficult. It’s hard to work with time if you only know for sure when you spidered the page. Twitter, on the other hand, has structured data because everything is published in their silo, so the sites that receive their complete stream get it in a structured format. They know the time of each tweet. Not to mention the data is available through APIs. This is the most efficient way to draw out meaning for search because you know for sure about the context of each piece of data, with time as one of the pivots, for search and discovery.
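To make the difference concrete, here is a small sketch with one hypothetical record type covering both cases: a page found by a spider, where only the crawl time is certain, and an item delivered through a structured feed that carries its own publication time. The field names are assumptions, not any API's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    text: str
    crawled_at: int                      # when our system fetched it
    published_at: Optional[int] = None   # known only for structured sources

    def time_bounds(self):
        """Return (earliest, latest) possible publication time."""
        if self.published_at is not None:
            return (self.published_at, self.published_at)
        # A spidered page could have been written any time before the crawl.
        return (None, self.crawled_at)


def reliable_timeline(items):
    """Order only the items whose publication time is actually known."""
    known = [i for i in items if i.published_at is not None]
    return sorted(known, key=lambda i: i.published_at)


items = [
    Item("spidered blog post", crawled_at=500),
    Item("tweet B", crawled_at=510, published_at=120),
    Item("tweet A", crawled_at=505, published_at=100),
]
timeline = [i.text for i in reliable_timeline(items)]
```

The spidered post cannot be placed on the timeline at all; the most the system can say is that it existed by the time of the crawl.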

You also need to get the data model right for the backend search database, in order to get meaning and link metrics. And you need to understand the different corpuses of data to know what things mean to users (not engineers), and figure out the spam and bad actor problems. There is the original context the data had and there is the UI, which is so difficult when trying to make time understandable for many users. In fact some think that communicating the time element to regular users is so hard that making time focused search is really an “advanced search” problem.

If designed poorly, the system can contribute to the unnatural production of skewed data by users. If the system involves some sort of filter for authority or popularity, it is subject to power law effects (Technorati calls their metric “authority” but inbound link counts from blogs are not authority, they’re just a measure of popularity). What’s a power law effect? It’s when a system drives activity to reinforce unnaturally the behavior that caused something to be there in the first place. For example, if one of the metrics of a filter counts the number of people clicking on a top search, then the more clicks, the longer the item will stay at the top of the list of searches, even if naturally it would have fallen off the list earlier. Conversely, if a metric for a filter involves a spontaneous act, driven by imagination, like writing a tweet, then exposing those items at the top of the filter might be less likely to drive up activity. However, if you show the results to the users, upon seeing a popular topic, they might begin tweeting about that topic without having thought of it before seeing the popular topic. In other words, by revealing the metrics you focus on, you can push users to change their behavior. By driving behavior, power-law distributions keep things with some power at the top because they are at the top, or can drive them higher. It becomes a loop. And because no distinction is made between the quality or strength of a unit or what that unit might mean to a group of users in a topic area, straight number counts just aren’t very smart.
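The loop can be illustrated with a small simulation, which is a sketch under stated assumptions rather than a model of any real service. Two items start out equal. In the feedback run, each new click goes to an item with probability proportional to its current count (as if users clicked on whatever the list already shows on top); in the independent run, each click ignores the counts. The function names, parameters and seeds are hypothetical.

```python
import random


def simulate_clicks(n_clicks, feedback, seed):
    """Return final click counts for two items that start with one click each."""
    rng = random.Random(seed)
    counts = [1, 1]
    for _ in range(n_clicks):
        if feedback:
            # Probability of a click is proportional to existing clicks.
            chosen = rng.choices([0, 1], weights=counts)[0]
        else:
            # Clicks are independent of what is already popular.
            chosen = rng.choice([0, 1])
        counts[chosen] += 1
    return counts


def average_gap(feedback, runs=200, n_clicks=500):
    """Average share gap between the two items across many seeded runs."""
    total = 0.0
    for seed in range(runs):
        a, b = simulate_clicks(n_clicks, feedback, seed)
        total += abs(a - b) / (a + b)
    return total / runs


gap_with_feedback = average_gap(feedback=True)
gap_without_feedback = average_gap(feedback=False)
```

Neither item is intrinsically better, yet with feedback one of them typically ends up far ahead, simply because early clicks attract later ones. Without feedback the two stay close to even.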

For example, if we made a system that counted Om Malik’s inbound links and called it authority, no matter the topic, I think Om would agree that even he wouldn’t have great authority and insight on the subjects of say, modern dance or metal working, if he happened to mention those words in a blog post. But on broadband issues, he is most definitely an authority. But Technorati, OneRiot, and other services that take a metric count and apply it for all topics, all circumstances, all search result matches, without context, randomize the quality of the information the user sees. They may provide a filter across the whole web, but they don’t give us any real help in judging what is useful or not. It’s why topic communities are helpful, and once you find a good editorial filter, driven by the human touch, you glom onto it for dear life because it’s such a time and energy saver.
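A sketch of the difference, using made-up data: a global count sums every inbound link to an author, while a topic-scoped count only credits links earned by posts on the topic being searched. The link records, author labels, topic labels and function names are hypothetical, and a real system would have to infer topics rather than read them from a label.

```python
from collections import defaultdict

# Each record: (linked_author, topic_of_the_linked_post). Made-up data.
inbound_links = [
    ("broadband_blogger", "broadband"), ("broadband_blogger", "broadband"),
    ("broadband_blogger", "broadband"), ("broadband_blogger", "broadband"),
    ("broadband_blogger", "modern dance"),
    ("dance_critic", "modern dance"), ("dance_critic", "modern dance"),
]


def global_popularity(links):
    """Count all inbound links per author, ignoring topic."""
    counts = defaultdict(int)
    for author, _topic in links:
        counts[author] += 1
    return dict(counts)


def topic_authority(links, topic):
    """Count only the inbound links earned by posts on the given topic."""
    counts = defaultdict(int)
    for author, link_topic in links:
        if link_topic == topic:
            counts[author] += 1
    return dict(counts)


def top_author(scores):
    """Author with the highest score; ties broken alphabetically."""
    return min(scores, key=lambda a: (-scores[a], a))


overall = top_author(global_popularity(inbound_links))
on_dance = top_author(topic_authority(inbound_links, "modern dance"))
on_broadband = top_author(topic_authority(inbound_links, "broadband"))
```

With the global count, the broadband blogger would top a search about modern dance because of links earned elsewhere; scoping the count to the topic puts the dance critic first for that query and leaves the broadband ranking unchanged.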

I’m under no illusions that we’re remotely close to solving Live Web or Real Time search or even recent search. We are not. Nor are we near solving discovery. But I hope we will. Sooner rather than later. Because I need it now. The opportunity is huge. It means really building algorithmically the editorial filters we have today in the form of people, while balancing the mobs’ activities. Solve that and the prize will be big.

Comments

  • Excellent article; thank you.

    I remember hearing about the call to action at Google where the bosses would say that the data is never fresh enough. A few hours old? It should be minutes old. A few minutes old? Get it down to seconds.

    Having a search engine that captures the real time trends without being gamed for publicity will still be a challenge in 3 or 5 years just as it was in 1999. Anytime there is traffic (profit) to be had, more people will be working on how to game the system than there are working on how to sanitize and filter it.

    • That’s a great example of the problem with the guys trying to game the system. Just like blogging took ages (3-5 years) for its potential to be realized, along with search engines vs directories, real time search will be in there as well.

      • I personally think that a consolidated search approach like http://www.yauba.com that has real time results but within the context of other results makes the most sense.

        • I think Yauba and Oneriot are the most innovative of the new real time searches, as they are adding a layer of intelligence and semantic analysis on top of the results.

          As for the others, my guess is that the best they can hope for is to be bought out by a larger player.

    • very interesting article

      i personally think real time/live web is basically real time/live news. the latter focusing on authoritative, important, recent activity. twitter provides a more blog like (and faster) view of the fresh world. the synthesis of these two worlds is what i find most interesting and is highlighted in tweetnews

      http://tweetnews.me

      but it’s still scratching the surface. more needs to be done in this space. we need a pagerank algorithm to make the results more relevant. we need personalization that focuses on results based on your network or previous activity (tweetnews allows re-ranking based on location). can’t wait to see how this industry develops.

  • Cool way to explain about things! Even though Real-Time Search has the potential to be the next-big-hit, I seriously doubt it is going to be anytime soon..

  • Great piece Mary. It’s good to read something with that much passion, and to see arguments laid out with such precision. Interested readers might consider this blog post too, which delves a little deeper into our (OneRiot’s) realtime indexing and ranking approach: http://blog.one...e-search-engine
    Good stuff. Tobias @ OneRiot

  • What is real/live search? search based on articles/news/blogs/tweets written at current time or results changing real time based on dynamic analytics collected/processed/analyzed and presented similar to google results but continuously changing?

  • I completely understand why Sergey Brin, after two hours, would still ask the question, “so what is live web search?”

    • The search is live. What is not live is the Google link-structure adjacency matrix computation. This cannot be live, as it is physically impossible to do it live. The computation takes time (it is not instant), and by the time it is completed, millions of new documents have already appeared on the internet, perhaps every millisecond, so the matrix computation starts again, because the whole dataset has to be used again, unless Google is using an online version of PageRank that only updates for the recently arrived documents and does not redo the computation from the beginning.

      Note that the reason Twitter seems real-time is that they don’t do these types of heavy computations that Google is doing (I am confident that they don’t).

  • Great article, thank you for your insight. A big help for some things I’m working on. :)

  • Doesn’t this interesting piece sum up as
    “we should find a way to clean/filter out the daily information overload”?

    I wish some people would do that for me and deliver their daily selection every morning, so I could read it while I’m having my breakfast… A man can dream ;)

  • Wow, I had to stop reading this when the author dropped Sergey’s name and wrote all facetiously / arrogantly. Get your head out of your ass.

  • anyone else confused by the $1495 tickets?

  • It seems to me that the tech sector could probably learn a lot from the financial sector on this one. Real time price feeds and low latency data transfer have been an obsession for exchanges and traders for a while now.

    As I understand it, “real time search”, wouldn’t actually be a search, it would just be a filter that you send out, you dip it into the internet stream, and when you take it out again you see what you’ve caught. Alternatively, you keep the filter in the stream, and watch things get caught in it in real time, as the stream flows. Not sure how useful that would be though.

    In order to set something like that up, just like the financial exchanges, you’d need to have the right network architecture and everyone would need to play along.

  • Let me summarize. “Real Time Search” is hard, very important, nobody does it right, everything you’ve done sucks, and if somebody does it right, it’ll be HUGE.

    OK, but after reading through pages of text all over the map, I have no clue how “Real Time Search” is different than regular search ordered by time. And what does the unstructured data/semantic web issue have to do with it?

    I’d suggest in the future edit your articles down by a factor of 2. Perhaps they would be clearer.

  • Twitter itself should really create the real time search engine they are talking about. That is how this company could be huge. There’s nothing special with a real time Facebook search or anything else for that matter. We are talking about Twitter. This company could be worth billions if it wanted to be. Search with advertising is one key way to get there. They need to focus on it more and make it more prominent on their site and with APIs.

  • Do not understand what ?????

  • Twitter maybe it’s a good solution

  • It simultaneously amazes and appalls me how little is understood about this topic given how much money is thrown at it.

    The reason “real-time” search seems useless to people commenting here is that the current incarnations really are useless. You cannot build an interesting real-time search platform because there are a number of open computer science problems that have to be solved before that is possible as a practical matter. Period. Full stop.

    Until we can do things like dynamic constraint indexing and fast graph queries at scale, real-time search will largely be a toy. To the extent it is successful in its current form, it will be eclipsed when someone *does* solve the fundamental theoretical problems.

    In short, real-time search is an incredibly vast opportunity, all the people hating the current implementations notwithstanding. That said, no one is going to produce an interesting implementation without some radically new contributions to computer science that none of the current ventures are claiming to have made. The idea has substance, but virtually all of the companies in the space are hype.

  • I don’t get it: why is everybody so hyped about real time search? Does it really matter if you have the news in a few seconds or in a few minutes or even hours? Also, I think the search results must be very bad when they haven’t been “rated” or anything over time.

    This is a typical hype, like semantic websearch last year.

    • Mark, I get my real-time news from radio. I have a portable head-phone radio that I listen to, while working on my laptop or my desktop machine. I don’t waste my time searching for live news on the internet. I do search for old archived news on internet but not live news.

  • Any company that can break down the “page rank” and “inbound link” metric to judge popularity/authority and make a search engine that makes discovery of the deep web possible will make billions

  • I think some of the tools and ideas piloted at Digg Labs are opening new frameworks for display of real time news. Sure it’s based exclusively on Digg’s data structure and data, but it’s different.

    (if you haven’t seen this go to digg.com and look down in the footer).

    • I agree with you. Essentially, a combination of dynamic analytics with filters (manual, automated) is real-time search. Now making it computationally fast enough is an algorithmic problem, hopefully, will be solved soon.

  • this excellent post is a first for tc–well written, typo-free, and written by someone who actually knows what she’s talking about.

  • To be correct, the Power Law describes the nature of the data, not the process which generates the data. It is the result, not the cause.

    The Power Law describes a distribution where big events are infrequent and small events are quite common.

    The phenomenon Mary alludes to is often referred to as the Success Breeds Success Principle or a Polya urn NBD.

    This simply states that success is rewarded by increased chance of further success, i.e., you have 50 black balls and 50 white balls in an urn, choose a ball and replace the one ball selected with two balls of that same color, thus increasing your odds of picking a ball of that color again.

    This process forms a Power Law distribution, as do so many other things: income, words in texts, earthquake magnitudes, etc.

  • The skewing of the power-law distribution you mention is not because of the real-time nature of the data, but because the data is massively crowdsourced.

    In essence, if google can now provide quality results despite the fact that millions of people constantly try to game the system, i do not see why it’s impossible to build something similar for the “real time” web

  • very good article – thanks!!

  • Mary said…
    The user cannot be expected to know what is bubbling up, or the specific phrases that will name the latest thing.

    Burst detection can be used to detect (as a time series) what’s bubbling up (i.e., a topic) and then fading away, while another one bubbles up and then fades away (perhaps with a longer decay time before it completely disappears), and the cycle goes on. There are some papers on this, but I found Jon Kleinberg’s publication very interesting.

    Abstract:
    ——–
    In large scale online systems like Search, eCommerce, or social network applications, user queries represent an important dimension of activities that can be used to study the impact on the system, and even the business. In this paper, we describe how to detect, characterize and classify bursts in user queries in a large scale eCommerce system. We build upon the approaches discussed in KDD 2002 “Bursty and Hierarchical Structure in Streams” [3] and apply them to a high volume industrial context. We describe how to identify bursts on a near real-time basis, classify them, and apply them to build interesting merchandizing applications.

    To use the time stamp as a dimension, tensor calculus is applicable. There are already papers being published where the time stamp is added as an extra dimension in LSI search, such as a [time x word x document] 3D matrix.

  • Interesting thoughts..

    I think we need to go back to the basics..

    When search started in the late 1990s, many companies were tuning their crawlers and trying to extract/provide better snippets alongside the results, etc. Google came in and did what was considered crazy and stupid: they simply stored a complete version of every page they indexed, even though it was very expensive. But it simply provided a better user experience and they won. Everyone followed them afterward.

    Someone has to make another crazy/expensive move now (I was expecting Microsoft) by running unlimited simultaneous crawlers to index the 50 million public sites out there. (I know it’s not easy, but massive computing power and bandwidth will just do it.)

    Or..

    The other, longer approach (while we are talking about 10 years): someone has to develop and market/push tools that can be installed on “Web Servers” to ping back the crawlers and update them when there is new data.

    I know none of the solutions above is easy, and I’m quite sure many have thought of or even tried them, but if someone has the will, money, and craziness, it may work, and I’m positive it’s worth it.

    Cheers

  • The question I extract from all of this is exactly what problem would real time search be solving? I’m having a hard time wrapping my head around the idea that this real time “live” search will help to fill some kind of enormous void that we currently have in computing. It seems to me that everyone is seeking a solution to a problem that hasn’t quite been defined yet?

  • Not sure if you have seen Google Wave – this to me looks like the “live web” but it will take some time to adopt it to many blogs, like WordPress. I’m sure the plugin community will amaze us.

  • really good article, I like this theme.
