August 8 (Tuesday) 9:30-10:30, 5F Eminence Hall
Chair: Tetsuya Sakai (Waseda University)
Forward to the past: notes towards a pre-history of web search
To me, an awareness of history is a fundamental requirement for progress; and I believe that we in the field of information retrieval are currently ill-served in this domain, or at least not as aware as we should be. While it is true that a researcher in IR is expected to acquire some knowledge of what has gone before, this knowledge is typically fairly narrow in scope. There are some IR researchers who regard IR as a branch of computer science, for example, despite the fact that the field has a long and venerable history entirely outside the domain of computers, as well as a considerable current presence in academic departments well away from computer science. Outside of our immediate community, there is a widespread belief that web search engines arose out of nothing, a totally new invention for the totally new world of the web.
This is unfortunate. It is true that there were huge changes in the information retrieval world in the second half of the twentieth century. The ideas, methods and systems that were around in 2000 (particularly the early web search engines, up to and including Google) seem at first glance to be completely different from the ideas, methods and systems that were around in 1950. But these developments involved in large measure the usual process of evolution rather than revolution, and the usual mix of steps large or small, forwards or backwards or sideways. Some of these steps were taken in the world of IR research, and some in the domain of practical commercial systems.
One of the real achievements of the web search engine world was to marry the basic technologies of large-scale inverted files and free-text indexing (which had been developed commercially over the previous two or three decades) with the ideas of natural language queries and search output ranking. These last had up to that point been largely confined to research systems. However, it is important to be clear that the search engines did not invent (nor even re-invent) search output ranking; they adapted and developed it, and added new evidence. I believe this fact is very clear to the IR research community, but not at all clear outside it. But the histories of all three (free-text indexing, natural language queries, and ranking) are worth some further attention.
Another real advance was the discovery of web pages by crawling — although spiders were not invented exclusively for web search, that was their major raison d’être and the major stimulus to their development. This represents a necessary function that simply did not exist in most pre-web search environments. A third was the extraction of anchor text from incoming links, to improve the representation of the pointed-to page — this, by contrast, represents an opportunity that did not really exist in other environments. (A fourth major advance, deriving from the exploitation of searcher behaviour as reflected in search logs, lies outside the scope of my pre-history.)
This talk is not a history: that would require a lot more gathering and reconciling of evidence than I am able to do. But I would like to put on record my understanding of this period of IR history (which I have lived through, been part of, and very much enjoyed), and provide some anecdotal evidence that may help a future historian. I will start by briefly reviewing two books which cover web search but which (in my opinion) have a too-limited view of the history.
Stephen Robertson has been researching in the field of information retrieval since 1968, and has been publishing since 1969. He is currently retired from full-time work, but remains Professor Emeritus at City University of London, Visiting Professor at University College London, and a Life Fellow at Girton College Cambridge. His previous employment includes five years at University College, twenty at City University, and fifteen at Microsoft Research Cambridge. In 1981 he received a Fulbright Award and spent some months at the University of California Berkeley. He won the Gerard Salton Award in 2000 and the Strix Award in 1998; he is an ACM Fellow. His main areas of research have been the evaluation of IR systems, particularly evaluation metrics, and probabilistic models. The latter led him in the early 1990s to invent the BM25 ranking function, which remains a benchmark for effective ranking of search results.
August 9 (Wednesday) 9:00-10:00, 5F Eminence Hall
Chair: Hideo Joho (University of Tsukuba)
Mail Search: It’s Getting Personal!
Web Mail has significantly changed in the last decade. It keeps growing, with 90% of its traffic being generated by automated scripts or “machines”. At the same time, major mail services offer more and more free storage, ranging from 15GB for Gmail and Outlook.com to 1TB for Yahoo Mail. As a result, we keep accumulating messages in our inbox, rarely deleting (and sometimes not even opening) many. Our inbox has become a big store mixing important information, such as e-tickets or bills, together with newsletters or promotions from which we forgot to unsubscribe. Search is therefore a critical mechanism for retrieving the specific messages we need. Unfortunately, search in mail is far from being as trusted (and used) as search on the Web is today. Everything is personal and often private, from the content of the mailbox to the search strategies, users’ needs and queries, thus making traditional Web search techniques inapplicable “as is”. Failure is evident when we can’t find a message that we remember having read, and this increases our frustration. Most mail search services return results sorted by time, so that we can scan them chronologically, which increases our perception of perfect recall. At the same time, the ranking mechanism drops less relevant results, in order to prevent them from being ranked first if recent. So in order to increase a (false) perception of recall, these systems actually hurt recall! Ranking results by mail-specific relevance would actually increase search success, yet it is not widely adopted, with the exceptions of Inbox by Gmail and Yahoo Mail, which show a few relevant results on top of traditional ranked-by-time results. In addition, many of us still struggle with expressing our needs, typically issuing very short queries, as in the early days of the Web.
In this talk, I will first highlight the key characteristics of mail search and how they differ from Web search, in terms of searchers’ needs and behavior. I will then present recent progress in mail ranking as well as in query assistance tools. Finally, I will discuss directions for future research, and the need to revisit mail search and invent search mechanisms specifically tailored to the personal data store that our inbox has become.
Yoelle Maarek is Vice-President of Research at Yahoo. She holds an engineering degree from the Ecole des Ponts et Chaussees, a DEA in Computer Science from Paris VI University and a PhD in Computer Science from the Technion. Upon graduating, she joined IBM Research, first in the US and then in Israel, eventually becoming an IBM Distinguished Engineer. She later joined Google as its first engineer in Israel and opened the Haifa engineering center. Her team there developed several notable features including Suggest, the Google query completion service. She returned to research upon joining Yahoo, where she now serves as site manager of the Haifa office and guides the research organization of Yahoo worldwide. She has served in various senior roles at most of the recent SIGIR, WWW and WSDM conferences. She is a member of the Technion Board of Governors, and was inducted as an ACM Fellow in 2013.