Keynotes

Generative Information Retrieval

Marc Najork

Distinguished Research Scientist, Google DeepMind

Abstract:

Historically, information retrieval systems have all followed the same paradigm: information seekers frame their needs in the form of a short query; the system selects a small set of relevant results from a corpus of available documents, rank-orders them by decreasing relevance, possibly excerpts a responsive passage for each result, and returns a list of references and excerpts to the user. Retrieval systems typically did not attempt to fuse information from multiple documents into an answer and display that answer directly. This was largely due to the available technology: at the core of each retrieval system is an index that maps lexical tokens or semantic embeddings to document identifiers. Indices are designed for retrieving responsive documents; they do not support integrating those documents into a holistic answer.

More recently, the coming-of-age of deep neural networks has dramatically improved the capabilities of large language models (LLMs). Trained on a large corpus of documents, these models not only learn the vocabulary, morphology and syntax of human languages, but have been shown to memorize facts and relations [2]. Generative language models, when provided with a prompt, will extend the prompt with likely completions, an ability that can be used to extract answers to questions from the model. Two years ago, Metzler et al. argued that this ability of LLMs will allow us to rethink the search paradigm: to answer information needs directly rather than directing users to responsive primary sources [1]. Their vision was not without controversy; the following year, Shaw and Bender argued that such a system is neither feasible nor desirable [3]. Nonetheless, the past year has seen the emergence of such systems, with offerings from established search engines and multiple new entrants to the industry.

The keynote will summarize the short history of these generative information retrieval systems and focus on the many open challenges in this emerging field: ensuring that answers are grounded, attributing answer passages to a primary source, providing nuanced answers to non-factoid-seeking questions, avoiding bias, and going beyond simple regurgitation of memorized facts. It will also touch on the changing nature of the content ecosystem. LLMs are starting to be used to generate web content. Should search engines treat such derived content as equal to human-authored content? Is it possible to distinguish generated from original content? How should we view hybrid authorship, where humans contribute ideas and LLMs shape those ideas into prose? And how will this parallel technical evolution of search engines and content ecosystems affect their respective business models?
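To make the architectural point concrete, here is a minimal, purely illustrative sketch in Python of the classical index described above: a structure that maps tokens to document identifiers. The toy two-document corpus is our own invention, not any production system.

    from collections import defaultdict

    # Toy corpus; a real engine would index billions of documents.
    docs = {
        1: "neural networks memorize facts and relations",
        2: "retrieval systems rank documents by relevance",
    }

    # The inverted index: token -> set of document identifiers.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)

    def retrieve(query: str) -> set[int]:
        """Return ids of documents containing every query token."""
        result: set[int] | None = None
        for token in query.split():
            postings = index.get(token, set())
            result = postings if result is None else result & postings
        return result or set()

    # The index yields references to documents, not a synthesized answer.
    print(retrieve("memorize facts"))  # -> {1}

Everything downstream of such an index returns pointers into the corpus; synthesizing an answer across documents requires a different kind of model, which is where LLMs enter the picture.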

———–

Marc Najork is a Distinguished Research Scientist at Google DeepMind, working on new techniques to make it easier for people to obtain relevant and useful information when and where they need it. Marc is interested in using generative language models to answer questions directly, rather than referring users to relevant sources. Direct answers represent a major paradigm shift in information retrieval, affecting the user experience, the fundamental architecture of the retrieval system, and the economic foundation of commercial web search and the entire web content ecosystem. Prior to joining Google, Marc was a Principal Researcher at Microsoft Research and a Research Scientist at Digital Equipment Corporation. He is an ACM Fellow, an IEEE Fellow, and a SIGIR Academy member.

On the Rough Use of Machine Learning Techniques

Chih-Jen Lin

Distinguished Professor, 

Department of Computer Science, National Taiwan University 

Department of Machine Learning, MBZUAI

Abstract: 

Machine learning is everywhere, but unfortunately, we are not experts in every method. Sometimes we “inappropriately” use machine learning techniques. Examples include reporting training instead of test performance, and comparing two methods without suitable hyper-parameter searches. However, the reality is that there are more sophisticated or more subtle examples, which we broadly call the “rough use” of machine learning techniques: the setting may look roughly fine but, strictly speaking, is inappropriate. We briefly discuss two intriguing examples.

  • In graph representation learning, the multi-label problem of node classification is often used to evaluate the quality of the learned representations. An unrealistic setting was used in almost the entire area: the number of labels of each test instance is assumed to be known at prediction time. In practice, such ground-truth information is rarely available (see the first sketch below). Details of this interesting story are in Lin et al. [1].
  • In training deep neural networks, the optimization process often relies on validation performance to terminate training or to select the best epoch. Thus, in many public repositories, training, validation, and test sets are explicitly provided. Many think this three-way split is mandatory in applying any machine learning technique. However, beyond the requirement that the test set remain completely independent, users are free to use all the available labeled data (i.e., the training and validation sets combined) in whatever way yields the best model (see the second sketch below). Through real stories, we show that many did not clearly see the relation between training, validation, and test sets.

The rough use of machine learning methods is common and sometimes unavoidable, because nothing qualifies as a perfect use of a machine learning method. Further, it is not easy to assess the seriousness of the situation. We argue that having high-quality and easy-to-use software is an important way to improve the practical use of machine learning techniques.
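To make the two pitfalls concrete, here are two minimal sketches in Python with scikit-learn; the data, models, and thresholds are toy assumptions for illustration, not the setups examined in the talk. The first contrasts the unrealistic multi-label evaluation, in which the ground-truth label count of each test instance is assumed known, with realistic thresholding:

    import numpy as np
    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier

    # Toy multi-label data standing in for node-classification labels.
    X, Y = make_multilabel_classification(n_samples=2000, n_classes=10, random_state=0)
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
    scores = clf.predict_proba(X_te)  # one score per (instance, label) pair

    # Unrealistic: keep exactly as many labels as the ground truth says each
    # test instance has -- information unavailable at prediction time.
    Y_cheat = np.zeros_like(Y_te)
    for i, k in enumerate(Y_te.sum(axis=1)):
        if k:
            Y_cheat[i, np.argsort(scores[i])[-k:]] = 1

    # Realistic: threshold the scores (the threshold would be tuned on
    # validation data, never on the test set).
    Y_real = (scores >= 0.5).astype(int)

    print("micro-F1, ground-truth label counts:", f1_score(Y_te, Y_cheat, average="micro"))
    print("micro-F1, fixed 0.5 threshold:      ", f1_score(Y_te, Y_real, average="micro"))

The second sketch shows that, once the validation set has served its purpose of selecting a hyper-parameter, nothing forbids retraining the final model on the training and validation sets combined; only the test set must stay untouched:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, random_state=0)
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

    # Use the validation set only to pick the hyper-parameter...
    best_C, best_acc = None, -np.inf
    for C in (0.01, 0.1, 1.0, 10.0):
        acc = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr).score(X_val, y_val)
        if acc > best_acc:
            best_acc, best_C = acc, C

    # ...then retrain on training + validation combined before the final,
    # one-time measurement on the independent test set.
    final = LogisticRegression(C=best_C, max_iter=1000).fit(
        np.vstack([X_tr, X_val]), np.concatenate([y_tr, y_val]))
    print("test accuracy:", final.score(X_test, y_test))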

———–

Chih-Jen Lin is currently a distinguished professor at the Department of Computer Science, National Taiwan University, and an affiliated professor at the Department of Machine Learning, MBZUAI. He obtained his B.S. degree from National Taiwan University in 1993 and his Ph.D. degree from the University of Michigan in 1998. His major research areas include machine learning, data mining, and numerical optimization. He is best known for his work on support vector machines (SVM) for data classification. His software LIBSVM is one of the most widely used and cited SVM packages. For his research work he has received many awards, including best paper awards at some top computer science conferences. He is an IEEE Fellow, an AAAI Fellow, and an ACM Fellow for his contributions to machine learning algorithms and software design. More information about him can be found at http://www.csie.ntu.edu.tw/~cjlin.

Bridging Quantitative and Qualitative Digital Experience Testing

Ranjitha Kumar 

Associate Professor, University of Illinois at Urbana-Champaign

Chief Research Scientist, UserTesting, Inc.

Abstract:

Digital user experiences are a mainstay of modern communication and commerce; multi-billion-dollar industries have arisen around optimizing digital design. Usage analytics and A/B testing solutions allow growth hackers to quantitatively compute conversion over key user journeys, while user experience (UX) testing platforms enable UX researchers to qualitatively analyze usability and brand perception. Although these workflows pursue the same objective (producing better UX), the gulf between quantitative and qualitative testing is wide: they involve different stakeholders, and rely on disparate methodologies, budgets, data streams, and software tools.

This gap belies the opportunity to create one platform that optimizes digital experiences holistically: using quantitative methods to uncover what and how much, and qualitative analysis to understand why. Such a platform could monitor conversion funnels, identify anomalous behaviors, intercept live users exhibiting those behaviors, and solicit explicit feedback in situ. This feedback could take many forms: survey responses, screen recordings of participants performing tasks, think-aloud audio, and more. By combining data from multiple users and correlating across feedback types, the platform could surface not just insights that a particular conversion funnel had been affected, but hypotheses about what had caused the change in user behavior. The platform could then rank these insights by how often the observed behavior occurred in the wild, using large-scale analytics to contextualize the results from small-scale UX tests.

To this end, a decade of research has focused on interaction mining: a set of techniques for capturing interaction and design data from digital artifacts, and aggregating these multi-modal data streams into structured representations that bridge quantitative and qualitative experience testing [1–5]. During user sessions, interaction mining systems capture user interactions (e.g., clicks, scrolls, text input), screen captures, and render-time data structures (e.g., website DOMs, native app view hierarchies). Once captured, these data streams are aligned and combined into user traces: sequences of user interactions contextualized by the design data of their UI targets. The structure of these traces affords new workflows for composing quantitative and qualitative methods, building toward a unified platform for optimizing digital experiences.
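As a concrete, entirely hypothetical illustration of what such a user trace might look like, the sketch below defines a minimal schema in Python; the field names and structure are our own assumptions for exposition, not the representations used in the cited work [1–5].

    from dataclasses import dataclass, field
    from typing import Any

    @dataclass
    class InteractionEvent:
        timestamp_ms: int
        kind: str                       # e.g., "click", "scroll", "text_input"
        target_selector: str            # the UI element the interaction targeted
        design_context: dict[str, Any]  # e.g., DOM subtree or view-hierarchy node
        screenshot_path: str | None = None

    @dataclass
    class UserTrace:
        session_id: str
        events: list[InteractionEvent] = field(default_factory=list)

        def add(self, event: InteractionEvent) -> None:
            # Keep events time-ordered so the trace reads as a user journey.
            self.events.append(event)
            self.events.sort(key=lambda e: e.timestamp_ms)

    trace = UserTrace(session_id="session-001")
    trace.add(InteractionEvent(
        timestamp_ms=1200, kind="click", target_selector="button#checkout",
        design_context={"tag": "button", "text": "Checkout"}))
    print(len(trace.events))

Pairing each interaction with the design data of its UI target is what lets quantitative roll-ups (how often a behavior occurred) and qualitative artifacts (what the user saw and did) be queried through a single structure.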

———–

Ranjitha Kumar is an Associate Professor of Computer Science at the University of Illinois at Urbana-Champaign and the Chief Scientist at UserTesting, Inc. Her research has won best paper awards and nominations at premier conferences in human-computer interaction, and is supported by grants from the NSF, Google, Amazon, and Adobe. She received her BS and PhD from the Computer Science Department at Stanford University, and was previously a co-founder and the Chief Scientist of Apropose, Inc., a data-driven design startup.

Tasks, Copilots, and the Future of Search

Ryen W. White

General Manager and Deputy Lab Director, Microsoft Research

Abstract:

Tasks are central to information retrieval (IR) and drive interactions with search systems. Understanding and modeling tasks helps these systems better support user needs. This keynote focuses on search tasks, the emergence of generative artificial intelligence (AI), and the implications of recent work at their intersection for the future of search. Recent estimates suggest that half of Web search queries go unanswered, many of them connected to complex search tasks that are ill-defined or multi-step and span several queries. AI copilots, e.g., ChatGPT and Bing Chat, are emerging to address complex search tasks and many other challenges. These copilots are built on large foundation models such as GPT-4 and are being extended with skills and plugins. Copilots can broaden the surface of tasks achievable via search, moving toward creation, not just finding (e.g., interview preparation, email composition), and can make searchers more efficient and more successful.

Users currently engage with AI copilots via natural language queries and dialog, and the copilots generate answers with source attribution. However, in delegating responsibility for answer generation, searchers also lose some control over aspects of the search process, such as directly manipulating queries and examining lists of search results. The efficiency gains from auto-generating a single, synthesized answer may also reduce opportunities for user learning and serendipity. A wholesale move to copilots for all search tasks is neither practical nor necessary: model inference is expensive, conversational interfaces are unfamiliar to many users in a search context, and traditional search already excels at many types of task. Instead, experiences that unite search and chat are becoming more common, enabling users to adjust the modality and other aspects (e.g., answer tone) based on the task.
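As a purely illustrative sketch of the routing idea in this paragraph, the toy heuristic below sends short navigational or factoid queries to traditional search and longer, task-like queries to a copilot. The heuristic, names, and thresholds are our own assumptions; a real system would use learned classifiers and cost models rather than string matching.

    def looks_complex(query: str) -> bool:
        # Toy heuristic: long queries, or ones phrased as open-ended tasks,
        # suggest a complex, multi-step information need.
        triggers = ("how ", "why ", "plan ", "write ", "compare ")
        return len(query.split()) > 8 or query.lower().startswith(triggers)

    def route(query: str) -> str:
        # Route each query to the cheaper tool that can handle it well.
        return "copilot" if looks_complex(query) else "traditional_search"

    print(route("weather seattle"))                          # traditional_search
    print(route("write an email to schedule an interview"))  # copilot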

The rise of AI copilots creates many opportunities for IR, including aligning generated answers with user intent, tasks, and applications via human feedback, using context and data to tailor responses to people and situations (e.g., grounding, personalization), new search experiences (e.g., unifying search and chat), reliability and safety (e.g., accuracy, bias), understanding impacts on user learning and agency, and evaluation (e.g., model-based feedback, searcher simulations, repeatability). Research in these and related areas will enable search systems to more effectively utilize new copilot technologies together with traditional search to help searchers better tackle a wider variety of tasks.

———–

Ryen White is General Manager and Deputy Lab Director of Microsoft Research in Redmond. His research takes a user- and task-centric view on AI, focused on search and assistance. Ryen led applied science for the Microsoft Cortana digital assistant, and he was chief scientist at Microsoft Health, establishing a science culture and infusing AI in both products. Technology derived from his and his team’s research has shipped and significantly improved key business metrics in many Microsoft products, including Bing (e.g., using search context to improve result relevance), Windows, Office, and Azure. Ryen is a Fellow of the ACM and of the British Computer Society. He has published over 300 articles on search and related areas, including significant work on mining and modeling search activity at scale. Ryen was named “Center of the SIGIR Universe” (most central author in the co-authorship graph) in the 40 years of ACM SIGIR. He has received over 20 awards for his technical contributions, including three SIGIR best paper awards and a SIGIR test of time award. Ryen has received the Karen Spärck Jones Award (2015) and the Tony Kent Strix Award (2022) for outstanding contributions to information retrieval. He is editor-in-chief of ACM Transactions on the Web and Vice Chair of SIGIR.