Stephen E. Robertson
Forward to the past: notes towards a pre-history of web search
To me, an awareness of history is a fundamental requirement for progress; and I believe that we in the field of information retrieval are currently ill-served in this domain, or at least not as aware as we should be. While it is true that a researcher in IR is expected to acquire some knowledge of what has gone before, this knowledge is typically fairly narrow in scope. There are some IR researchers who regard IR as a branch of computer science, for example, despite the fact that the field has a long and venerable history entirely outside the domain of computers, as well as a considerable current presence in academic departments well away from computer science departments. Outside of our immediate community, there is a widespread belief that web search engines arose out of nothing, a totally new invention for the totally new world of the web.
This is unfortunate. It is true that there were huge changes in the information retrieval world in the second half of the twentieth century. The ideas, methods and systems that were around in 2000 (particularly the early web search engines, up to and including Google) seem at first glance to be completely different from the ideas, methods and systems that were around in 1950. But these developments involved in large measure the usual process of evolution rather than revolution, and the usual mix of steps large or small, forwards or backwards or sideways. Some of these steps were taken in the world of IR research, and some in the domain of practical commercial systems.
One of the real achievements of the web search engine world was to marry the basic technologies of large-scale inverted files and free-text indexing (which had been developed commercially over the previous two or three decades) with the ideas of natural language queries and search output ranking. These last had, up to that point, been largely confined to research systems. However, it is important to be clear that the search engines did not invent (nor even re-invent) search output ranking: they adapted and developed it, and added new evidence. I believe this fact is very clear to the IR research community, but not at all clear outside it. But the histories of all three (free-text indexing, natural language queries, and ranking) are worth some further attention.
Another real advance was the discovery of web pages by crawling. Although spiders were not invented exclusively for web search, that was their major raison d’être and the major stimulus to their development. This represents a necessary function that simply did not exist in most pre-web search environments. A third was the extraction of anchor text from incoming links, to improve the representation of the pointed-to page; this, by contrast, represents an opportunity that did not really exist in other environments. (A fourth major advance, deriving from the exploitation of searcher behaviour as reflected in search logs, lies outside the scope of my pre-history.)
This talk is not a history: that would require a lot more gathering and reconciling of evidence than I am able to do. But I would like to put on record my understanding of this period of IR history (which I have lived through, been part of, and very much enjoyed), and provide some anecdotal evidence that may help a future historian. I will start by briefly reviewing two books which cover web search but which (in my opinion) have a too-limited view of the history.
Stephen Robertson has been researching in the field of information retrieval since 1968, and has been publishing since 1969. He is currently retired from full-time work, but remains Professor Emeritus at City University of London, Visiting Professor at University College London, and a Life Fellow at Girton College Cambridge. His previous employment includes five years at University College, twenty at City University, and fifteen at Microsoft Research Cambridge. In 1981 he received a Fulbright Award and spent some months at the University of California Berkeley. He won the Gerard Salton Award in 2000 and the Strix Award in 1998; he is an ACM Fellow. His main areas of research have been the evaluation of IR systems, particularly evaluation metrics, and probabilistic models. The latter led him in the early 1990s to invent the BM25 ranking function, which remains a benchmark for effective ranking of search results.