Live Web, Real Time . . . Call It What You Will, It's Gonna Take A While To Get It

This guest post is written by Mary Hodder, the founder of Dabble. Prior to Dabble, Hodder consulted for a number of startups, did research at Technorati and wrote her master’s thesis at Berkeley on live web search, looking at blog data.

Real time search is nothing new. It is a problem we’ve been working on for at least ten years, and we likely will still be trying to solve it ten years from now. It’s a really hard problem that we used to call “live web search,” a term coined by Allen Searls (Doc’s son) that refers to the web that is alive, with time as an element in all factors, including search.

The name change to “real time search” seems a way to refocus attention on time as an important element of filters. We are still presented with the same set of problems we’ve had for at least the past ten years. None of the companies that Erick Schonfeld pointed to the other day seem to be doing anything differently from the live web search / discovery companies that came before. The new ones all seem to be fumbling around at the beginning of the problem, and in fact seem to be doing “recent search,” not really real time search. While I’m sure they’ve worked really hard on their systems, they are no closer to solving the problem than the older live web search systems got. All the new ones give a reverse chron view, with most mixing Twitter with something else (blog data, other microblog data, photos) and creating some kind of top list of recent trends. Some have context, like a count of activity over a period of time, how long a trend has gone on, or a histogram (Crowdeye), which both Technorati and Sphere experimented with in the early years. Or they show how many links there are to something or the number of tweets. All seem susceptible to spam and other activities that degrade the user experience, and none seem to really provide the context and quality filters that one would like to see if this were to really work. All seem to suffer from needing to learn the lessons we already learned in blog search and topic discovery.

Publicly available publishing systems starting in 1999 took the value of time and incorporated it into what was being published (think Pyra which is now Blogger, Movable Type, WordPress and Flickr, among the many), as did the search and discovery systems for those published bits, like Technorati, Sphere, Rojo, Blogpulse, Feedster, Pubsub and others, to walk down memory lane . . . (btw, for disclosure purposes I should state that I worked for Technorati in 2004 for 10 months, and consulted or advised most all the others in one form or another).

I started working on this problem in 1999, at UC Berkeley, and eventually did my master’s thesis on live web data search and topic discovery at SIMS (or the iSchool as it’s now known). From 2000 to 2004, people at SIMS would say to me, “What are you doing with blogs and data, it’s just weird. Why does it matter?” But the element of time was the captivating piece that was missing for me from regular search. It’s the element that makes something news, as well as the element that can group items together in a short period to show a focus of attention and activity that often legacy news outlets miss (until more recently when they decided that live web activity was interesting).

At Burning Man in 2005, under a shade structure during a hot, quiet afternoon, I remember having a four or five hour conversation with Barney Pell (who would later found Powerset) about the Live Web and Live Web Search, how to do it, what it meant, how to understand and present time to the user, how much was discovery and how much was search, how structured was the data you could get and how reliant on the time could you be with the data, what meaning you could make from that data, etc. Sergey Brin was sitting and listening, and finally, after a couple of hours, he asked me, “What is the live web and what is live web search?” Since Barney and I had already been doing a deep dive, I assumed Sergey knew what we were talking about, so it surprised me, but I explained why I thought time was a huge missing element of regular search, and that this was the type of search I worked on. Barney and I continued for a couple more hours. And it got cooler so it was time to go admire the art and that was the end of that. But I have wondered over the years where Google is with the live web and when they might do something with time. Twitter seems to be prodding them.

In 2006, “The Living Web” Newsweek cover story by Steven Levy and Brad Stone poked at this issue for the first time in a national forum.

When I look at the latest crop of search startups, I think: Why are we doing it all the same way again? Reinventing the wheel? Is anyone doing anything original either with data or interface? Is anyone building on what we’ve learned before about the backend or UIs?

Frankly, our filters suck, and I suppose that if a name change gets us to think anew about better filters, well, I should rejoice. I’m partly to blame for the bad filters we have to date because, in having worked on this problem, I’ve contributed to some of the various live web or real time systems (or whatever the word of the moment is for trying to solve this problem). We are very good at publishing our thoughts and visions, with time stamps, but not very good at the filtering side of things. The old method of information search and discovery was to open the paper or magazine, turn the pages with editorially filtered and placed information, and when you were finished, you said, “Okay, I’m informed” (whether you really were or not). But the media got complacent and missed stories, and with the ease of blog publishing and sites like Flickr for photos, we could replace paper and supplement our information needs with the whole web. The only problem is, it’s the whole freaking web. An avalanche. We feel anxiety on the web from the lack of the filter and editorial grace that one or two printed news sources used to give us.

I did a study in 2002, which I repeated in 2004 and again last year in 2008. I asked users to track their online information intake for one week. There were only 30 people in each study, chosen randomly from Craigslist ads, but what I found across each group of 30 was that the average time spent online with news and information sites was 1.25 hours in 2002, 1.85 hours in 2004 and 2.45 hours in 2008. These people are not in Silicon Valley, but they do all have broadband at home and live in the US. Every one of them reported some level of anxiety over the amount of data they felt they needed to take in in order to feel informed. They often dealt with it by increasing the time they took to stay informed. They didn’t know that better filters might actually reduce their anxiety.

As Erick noted, the tension to solve this problem is between memory and consciousness; or as Bob Wyman and Salim Ismail called it at Pubsub: retrospective versus prospective search. And it is part of the issue. But there is more.

Discovery does mean you have to introduce time as an element. The user cannot be expected to know what is bubbling up, or the specific phrases that will name the latest thing.

Some people will say “michael jackson” and some will say “MJ” and some will say “king of pop.” And Michael Jackson as a topic is actually pretty easy. I remember once doing usability tests for a live web search and discovery system in 2003, where we asked users to search on Google News and various live web systems for an incident in Australia where a “giant sea creature” was found. But since all the media covering it originated in Australia, and they’d all called it a “massive squid,” and all the follow-on American sources including bloggers had copied the Aussie language, there were no recent hits for “massive sea creature.” Testers had to think creatively about how to get to the info they knew was there, and yet it was a semantic leap. One search tester actually cried as she refused to give up; she was so determined to find the result in any of the live web systems we were testing. We begged her to stop; it was painful. Good discovery could have helped.
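One way to picture that kind of discovery is a rough sketch in Python that surfaces terms appearing far more often in a recent window than in an older baseline, so the user does not have to guess the phrase “massive squid” in advance. Everything here is an assumption for illustration: the function name `bursting_terms`, the whitespace tokenizer, the add-one smoothing and the sample documents. It is not how any of the systems named above worked.

```python
from collections import Counter


def bursting_terms(recent_docs, baseline_docs, min_count=2, top_n=5):
    """Rank terms by how much more often they appear in recent documents
    than in baseline documents (document frequency, not raw word counts)."""
    if not recent_docs:
        return []
    # Count each term once per document so one long post cannot dominate.
    recent = Counter(w for d in recent_docs for w in set(d.lower().split()))
    base = Counter(w for d in baseline_docs for w in set(d.lower().split()))

    scores = {}
    for term, count in recent.items():
        if count < min_count:
            continue  # ignore one-off words
        rate_recent = count / len(recent_docs)
        # Add-one smoothing so terms unseen in the baseline do not divide by zero.
        rate_base = (base[term] + 1) / (len(baseline_docs) + 1)
        scores[term] = rate_recent / rate_base
    # Highest ratio first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Made-up sample data, loosely modeled on the squid story.
recent = [
    "massive squid found in australia",
    "massive squid washes ashore",
    "squid photos from australia",
]
baseline = [
    "election news today",
    "weather in australia today",
    "new phone released today",
    "sports news",
]
print(bursting_terms(recent, baseline))
```

A list like this, shown next to the search box, would have handed the tester the vocabulary the sources were actually using. A real system would still need phrase detection, stop-word handling and spam filtering, none of which this sketch attempts.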

Another key element of discovery and live web search is getting structured data, because spidering, which Google uses to get data from the web for its regular retrospective web search, makes understanding the time of a published work more difficult. It’s hard to work with time if you only know for sure when you spidered the page. Twitter, on the other hand, has structured data because everything is published in its silo, so the sites that receive its complete stream get the data in a structured format. They know the time of each tweet. Not to mention the data is available through APIs. This is the most efficient way to draw out meaning for search because you know for sure about the context of each piece of data, with time as one of the pivots, for search and discovery.
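To make the difference concrete, here is a minimal Python sketch, assuming a hypothetical `Item` record where structured feeds supply a publish time and spidered pages only carry the time they were crawled. The field names and the hourly bucketing are illustrative choices, not any real service’s data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Item:
    text: str
    crawled_at: datetime                      # always known: when we fetched it
    published_at: Optional[datetime] = None   # known only for structured feeds


def hourly_histogram(items, term):
    """Count items mentioning `term` per hour of publication.

    Items without a publish time are tallied separately, because the crawl
    time only tells us the item existed by then, not when it appeared.
    """
    buckets = Counter()
    undated = 0
    for item in items:
        if term.lower() not in item.text.lower():
            continue
        if item.published_at is None:
            undated += 1
            continue
        hour = item.published_at.replace(minute=0, second=0, microsecond=0)
        buckets[hour] += 1
    return buckets, undated


# Made-up sample: two structured items and one spidered page.
utc = timezone.utc
items = [
    Item("Squid found", datetime(2009, 7, 1, 12, 0, tzinfo=utc),
         published_at=datetime(2009, 7, 1, 9, 15, tzinfo=utc)),
    Item("More squid news", datetime(2009, 7, 1, 12, 0, tzinfo=utc),
         published_at=datetime(2009, 7, 1, 9, 40, tzinfo=utc)),
    Item("Old page about squid", datetime(2009, 7, 1, 12, 0, tzinfo=utc)),
]
buckets, undated = hourly_histogram(items, "squid")
print(dict(buckets), undated)
```

The spidered page cannot be placed on the timeline at all, which is the point: without a trustworthy publish time, time cannot serve as a pivot for that item.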

You also need to get the data model right for the backend search database, in order to get meaning and link metrics. And you need to understand the different corpuses of data to know what things mean to users (not engineers), and figure out the spam and bad actor problems. There is the original context the data had, and there is the UI, which is so difficult when trying to make time understandable for many users. In fact, some think that communicating the time element to regular users is so hard that making time-focused search is really an “advanced search” problem.

If designed poorly, the system can contribute to the unnatural production of skewed data by users. If the system involves some sort of filter for authority or popularity, it is subject to power law effects (Technorati calls their metric “authority” but inbound link counts from blogs are not authority, they’re just a measure of popularity). What’s a power law effect? It’s when a system drives activity to reinforce unnaturally the behavior that caused something to be there in the first place. For example, if one of the metrics of a filter counts the number of people clicking on a top search, then the more clicks, the longer the item will stay at the top of the list of searches, even if naturally it would have fallen off the list earlier. Conversely, if a metric for a filter involves a spontaneous act, driven by imagination, like writing a tweet, then exposing those items at the top of the filter might be less likely to drive up activity. However, if you show the results to the users, upon seeing a popular topic, they might begin tweeting about that topic without having thought of it before seeing the popular topic. In other words, by revealing the metrics you focus on, you can push users to change their behavior. By driving behavior, power-law distributions keep things with some power at the top because they are at the top, or can drive them higher. It becomes a loop. And because no distinction is made between the quality or strength of a unit or what that unit might mean to a group of users in a topic area, straight number counts just aren’t very smart.
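The loop can be shown with a toy simulation in Python. This is only a sketch under invented assumptions: two topics with made-up organic click counts, a decayed click score as the ranking metric, and a fixed `boost` of extra clicks that whichever topic is displayed at the top receives simply for being there.

```python
def simulate(organic, boost, decay=0.5):
    """Return the top-ranked topic at each step.

    organic: dict mapping topic -> list of organic clicks per step.
    boost:   extra clicks the currently displayed leader gets each step.
    decay:   fraction of the previous score carried into the next step.
    """
    topics = list(organic)
    steps = len(next(iter(organic.values())))
    scores = {t: 0.0 for t in topics}
    leader = None  # nothing is displayed before the first step
    leaders = []
    for step in range(steps):
        for t in topics:
            # The leader from the previous step earns exposure-driven clicks.
            clicks = organic[t][step] + (boost if t == leader else 0)
            scores[t] = decay * scores[t] + clicks
        leader = max(topics, key=lambda t: scores[t])
        leaders.append(leader)
    return leaders


# Made-up data: interest in A fades while interest in B grows.
organic = {"A": [10, 8, 6, 4, 2, 1], "B": [2, 4, 6, 8, 10, 12]}
print(simulate(organic, boost=0))  # ranking follows organic interest
print(simulate(organic, boost=5))  # exposure keeps A on top longer
```

With no boost, B overtakes A at the fourth step; with the boost, A holds the top spot for two more steps even though organic interest in it has already collapsed. The numbers are arbitrary, but the shape of the effect is the one described above.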

For example, if we made a system that counted Om Malik’s inbound links and called it authority, no matter the topic, I think Om would agree that even he wouldn’t have great authority and insight on the subjects of say, modern dance or metal working, if he happened to mention those words in a blog post. But on broadband issues, he is most definitely an authority. But Technorati, OneRiot, and other services that take a metric count and apply it for all topics, all circumstances, all search result matches, without context, randomize the quality of the information the user sees. They may provide a filter across the whole web, but they don’t give us any real help in judging what is useful or not. It’s why topic communities are helpful, and once you find a good editorial filter, driven by the human touch, you glom onto it for dear life because it’s such a time and energy saver.
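A small Python sketch of the distinction, assuming each inbound link can be labeled with the topic of the post it came from (itself a hard problem this sketch ignores). The link tuples, the `dance_critic` blog and the function names are hypothetical.

```python
from collections import Counter


def global_popularity(links):
    """Count every inbound link, regardless of topic."""
    return Counter(target for _source, target, _topic in links)


def topic_authority(links, topic):
    """Count only inbound links made in posts about the given topic."""
    return Counter(target for _source, target, t in links if t == topic)


# Made-up links: (linking blog, linked blog, topic of the linking post).
links = [
    ("blog1", "om", "broadband"),
    ("blog2", "om", "broadband"),
    ("blog3", "om", "broadband"),
    ("blog4", "om", "modern dance"),
    ("blog5", "dance_critic", "modern dance"),
    ("blog6", "dance_critic", "modern dance"),
]

print(global_popularity(links).most_common())
print(topic_authority(links, "modern dance").most_common())
```

On the global count, the broadband blogger outranks the dance critic for every query; scoped to modern dance, the order flips. Counting within a topic is still a popularity measure rather than true authority, but it at least stops one subject’s link count from deciding results in another.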

I’m under no illusions that we’re remotely close to solving Live Web or Real Time search or even recent search. We are not. Nor are we near solving discovery. But I hope we will. Sooner rather than later. Because I need it now. The opportunity is huge. It means really building algorithmically the editorial filters we have today in the form of people, while balancing the mobs’ activities. Solve that and the prize will be big.
