ICTIR ’23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval
This talk builds on my SWAN (Schematised Weighted Average Nugget) paper published in May 2023, which discusses a generic framework for auditing a given textual conversational system. The framework assumes that conversation sessions have already been sampled through either human-in-the-loop experiments or user simulation, and is designed to handle task-oriented and non-task-oriented conversations seamlessly. The arXiv paper also discussed a schema containing twenty (+1) criteria for scoring nuggets (i.e., factual statements and dialogue acts within each turn of the conversations) either manually or (semi)automatically.
By "parrots," I am referring to the stochastic parrots of Professor Emily M. Bender et al., i.e., large language models. By "sociopathic liars," I am referring to the same thing, as Professor Shannon Bowen of the University of South Carolina describes them as follows: "Sociopathic liars are the most damaging types of liars because they lie on a routine basis without conscience and often without reason. Whereas pathetic liars lie to get along, and narcissistic liars prevaricate to cover their inaction, drama, or ineptitude, sociopaths lie simply because they feel like it. Lying is easy for them, and they lie with no conscience or remorse."
I would like to primarily discuss how researchers might be able to prevent conversational systems from doing harm to users, to labellers, and to society, rather than how we might evaluate good things that the systems might bring just to privileged people. Furthermore, I would like to argue that ICTIR is a perfect place for such a discussion.
SESSION: Technical Presentations
Instant search has emerged as the dominant search paradigm in entity-focused search applications, including search on Apple Music, Kayak, LinkedIn, and Spotify. Unlike the traditional search paradigm, in which users fully issue their query and then the system performs a retrieval round, instant search delivers a new result page with every keystroke. Despite the increasing prevalence of instant search, evaluation methodologies for instant search have not been fully developed and validated. As a result, we have no established evaluation metrics to measure improvements to instant search, and instant search systems still share offline evaluation metrics with traditional search systems. In this work, we first highlight critical differences between traditional search and instant search from an evaluation perspective. We then consider the difficulties of employing offline evaluation metrics designed for the traditional search paradigm to assess the effectiveness of instant search. Finally, we propose a new offline evaluation metric based on the unique characteristics of instant search. To demonstrate the utility of our metric, we conduct experiments across two very different platforms employing instant search: a commercial audio streaming platform and Wikipedia.
Recent research has proposed retrieval approaches based on sparse representations and inverted indexes, with terms produced by neural language models, leveraging the advantages of both neural retrieval and lexical matching. This paper proposes KALE, a new lightweight method of this family that uses a small model with a k-sparse projector to convert dense representations into a sparse set of entries from a latent vocabulary. The KALE vocabulary captures semantic concepts that perform well when used in isolation, and perform even better when extending the original lexical vocabulary, thereby improving first-stage retrieval accuracy. Experiments with the MSMARCOv1 passage retrieval dataset, the TREC Deep Learning dataset, and BEIR datasets examined the effectiveness of KALE under varying conditions. Results show that the KALE terms can replace the original lexical vocabulary, with gains in accuracy and efficiency. Combining KALE with the original lexical vocabulary, or with other learned terms, can further improve retrieval accuracy with only a modest increase in computational cost.
Ranking algorithms, as an essential component of retrieval systems, have been constantly improved in previous studies, especially regarding relevance-based utilities. In recent years, more and more research attempts have addressed fairness in rankings due to increasing concerns about potential discrimination and echo chambers. These attempts include traditional score-based methods that allocate exposure resources to different groups using pre-defined scoring functions or selection strategies, and learning-based methods that learn the scoring functions from data samples. Learning-based models are more flexible and achieve better performance than traditional methods. However, most learning-based models were trained and tested on outdated datasets where fairness labels are barely available. State-of-the-art models utilize relevance-based utility scores as a substitute for the fairness labels to train their fairness-aware loss, where plugging in the substitution does not guarantee the minimum loss. This inconsistency challenges the model’s accuracy and performance, especially when learning is achieved by gradient descent. Hence, we propose a distribution-based fair learning framework (DLF) that does not require fairness labels, replacing them with target fairness exposure distributions. Experimental studies on the TREC Fair Ranking track dataset confirm that our proposed framework achieves better fairness performance while maintaining better control over the fairness-relevance trade-off than state-of-the-art fair ranking frameworks.
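As an illustration of the label-free idea above (a hedged sketch, not the authors' exact DLF objective), a ranking's realized group exposure under a standard position-bias model can be compared directly against a target exposure distribution; `group_exposure` and `fairness_loss` are hypothetical names introduced here:

```python
import math

def group_exposure(ranking_groups, n_groups):
    """Normalized exposure each group receives under a ranking, using the
    common position-bias model exposure(rank) = 1 / log2(rank + 1)."""
    exp = [0.0] * n_groups
    for rank, g in enumerate(ranking_groups, start=1):
        exp[g] += 1.0 / math.log2(rank + 1)
    total = sum(exp)
    return [e / total for e in exp]

def fairness_loss(ranking_groups, target_dist):
    """Squared distance between realized and target exposure distributions:
    the kind of label-free objective the abstract describes, with the target
    distribution playing the role of the unavailable fairness labels."""
    realized = group_exposure(ranking_groups, len(target_dist))
    return sum((r - t) ** 2 for r, t in zip(realized, target_dist))
```

In a learning-based setup, a differentiable relaxation of this loss would be combined with a relevance objective to control the fairness-relevance trade-off.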
Understanding why a model makes certain predictions is crucial when adapting it for real-world decision making. LIME is a popular model-agnostic feature attribution method for the tasks of classification and regression. However, the task of learning to rank in information retrieval is more complex than either classification or regression. In this work, we extend LIME to propose Rank-LIME, a model-agnostic, local, post-hoc linear feature attribution method for the task of learning to rank that generates explanations for ranked lists. We employ novel correlation-based perturbations and differentiable ranking loss functions, and introduce new metrics to evaluate ranking-based additive feature attribution models. We compare Rank-LIME with a variety of competing systems, with models trained on the MS MARCO datasets, and observe that Rank-LIME outperforms existing explanation algorithms in terms of Model Fidelity and Explain-NDCG. With this, we propose one of the first algorithms to generate additive feature attributions for explaining ranked lists.
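The LIME-style local surrogate idea can be sketched as follows. This is a simplified illustration with Gaussian perturbations and a pooled list representation, not the authors' Rank-LIME algorithm (which uses correlation-based perturbations and ranking losses); `rank_lime_sketch` and its pooling scheme are assumptions made for the example:

```python
import numpy as np

def rank_lime_sketch(black_box_scores, x, n_samples=500, sigma=0.5, seed=0):
    """Fit a local weighted linear surrogate to a black-box ranking scorer.

    black_box_scores maps a (docs x features) matrix to one score per doc;
    x is the feature matrix of the ranked list being explained.
    Returns one weight per feature (an additive attribution).
    """
    rng = np.random.default_rng(seed)
    X, y, w = [], [], []
    for _ in range(n_samples):
        z = x + rng.normal(scale=sigma, size=x.shape)   # perturb features
        X.append(z.mean(axis=0))                        # pooled list representation
        y.append(black_box_scores(z).mean())            # pooled black-box response
        # Proximity kernel: perturbations closer to x get more weight.
        w.append(np.exp(-np.linalg.norm(z - x) ** 2 / (2 * sigma ** 2 * x.size)))
    X, y, w = np.array(X), np.array(y), np.array(w)
    # Weighted least squares: solve (X^T W X) beta = X^T W y.
    A = X.T @ (w[:, None] * X)
    b = X.T @ (w * y)
    return np.linalg.solve(A, b)
```

When the black box is itself linear, the surrogate recovers its weights exactly; for a neural ranker it yields a local additive approximation.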
When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments, along with concerns and issues that arise. We devise a human–machine collaboration spectrum that allows us to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of ‘fully automated judgments’, we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.
In this work, we propose a novel framework to devise features that can be used by Query Performance Prediction (QPP) models for Neural Information Retrieval (NIR). Using the proposed framework as a periodic table of QPP components, practitioners can devise new predictors better suited for NIR. Through the framework, we detail what challenges and opportunities arise for QPPs at different stages of the NIR pipeline. We show the potential of the proposed framework by using it to devise two types of novel predictors. The first, named MEMory-based QPP (MEM-QPP), exploits the similarity between test and train queries to measure how much a NIR system can memorize. The second adapts traditional QPPs into NIR-oriented ones by computing the query-corpus semantic similarity. By exploiting the inherent nature of NIR systems, the proposed predictors outperform the current state of the art under various setups, while also highlighting the versatility of the framework in describing different types of QPPs.
Lexical exact-match systems perform text retrieval efficiently with sparse matching signals and fast retrieval through inverted lists, but naturally suffer from the mismatch between lexical surface form and implicit term semantics. This paper proposes to directly bridge the surface form space and the term semantics space in lexical exact-match retrieval via contextualized surface forms (CSF). Each CSF pairs a lexical surface form with a context source, and is represented by a lexical form weight and a contextualized semantic vector representation. This framework is able to perform sparse lexicon-based retrieval by learning to represent each query and document as a “bag-of-CSFs”, simultaneously addressing two key factors in sparse retrieval: vocabulary expansion of surface form and semantic representation of term meaning. At retrieval time, it efficiently matches CSFs through exact-match of learned surface forms, and effectively scores each CSF pair via contextual semantic representations, leading to joint improvement in both term match and term scoring. Multiple experiments show that this approach successfully resolves the main mismatch issues in lexical exact-match retrieval and outperforms state-of-the-art lexical exact-match systems, reaching accuracy comparable to lexical all-to-all soft-match systems while remaining an efficient exact-match-based system.
The entities that emerge during a conversation can be used to model topics, but not all entities are equally useful for this task. Modeling the conversation with entity graphs and predicting each entity’s centrality in the conversation provides additional information that improves the retrieval of answer passages for the current question. Experiments show that using random walks to estimate entity centrality on conversation entity graphs improves top precision answer passage ranking over competitive transformer-based baselines.
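The random-walk centrality estimate above can be illustrated with a PageRank-style power iteration over an entity adjacency matrix. This is a minimal sketch under standard assumptions (damped walk, column-normalized transitions); the paper's exact walk formulation may differ:

```python
import numpy as np

def entity_centrality(adj, damping=0.85, iters=100):
    """Estimate entity centrality on a conversation entity graph via a
    damped random walk (power iteration), a la PageRank.

    adj: (n x n) non-negative adjacency matrix of the entity graph.
    Returns a probability vector: the walk's stationary mass per entity.
    """
    n = adj.shape[0]
    # Column-normalize so column j is the transition distribution out of node j.
    out = adj.sum(axis=0)
    P = adj / np.where(out == 0, 1, out)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        # With prob. `damping` follow an edge, else teleport uniformly.
        r = (1 - damping) / n + damping * (P @ r)
    return r / r.sum()
```

Entities with high centrality can then be prioritized when expanding or reranking candidate answer passages.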
Variational autoencoders (VAEs) are the state-of-the-art model for recommendation with implicit feedback signals. Unfortunately, implicit feedback suffers from selection bias, e.g., popularity bias, position bias, etc., and as a result, training from such signals produces biased recommendation models. Existing methods for debiasing the learning process have not been applied in a generative setting. We address this gap by introducing an inverse propensity scoring (IPS) based method for training VAEs from implicit feedback data in an unbiased way. Our IPS-based estimator for the VAE training objective, VAE-IPS, is provably unbiased w.r.t. selection bias. Our experimental results show that the proposed VAE-IPS model reaches significantly higher performance than existing baselines. Our contributions enable practitioners to combine state-of-the-art VAE recommendation techniques with the advantages of bias mitigation for implicit feedback.
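The core IPS idea can be sketched in one function: each observed interaction is re-weighted by the inverse of its observation propensity, so that the expectation of the loss over the exposure process matches the loss on fully observed data. This is an illustrative reconstruction term only, not the full VAE-IPS estimator from the paper:

```python
import numpy as np

def ips_reconstruction_loss(log_likelihood, clicks, propensities):
    """IPS-weighted negative log-likelihood for implicit feedback.

    log_likelihood: per-item model log-likelihoods.
    clicks: binary observed interactions (subject to selection bias).
    propensities: probability each item was exposed to the user.
    Clicks on rarely exposed items are up-weighted by 1/propensity,
    which makes the estimator unbiased w.r.t. the exposure process.
    """
    return -np.sum((clicks / propensities) * log_likelihood)
```

For example, a click on an item with exposure propensity 0.5 contributes twice the loss of a click on an always-exposed item.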
In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labeled data in the target domain. There also exists research on improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.
It is often difficult for users to form keywords to express their information needs, especially when they are not familiar with the domain of the articles of interest. Moreover, in some search scenarios, there is no explicit query for the search engine to work with. Query-By-Multiple-Documents (QBMD), in which the information needs are implicitly represented by a set of relevant documents, addresses these retrieval scenarios. Unlike the keyword-based retrieval task, the query documents are treated as exemplars of a hidden query topic, but it is often the case that they can be relevant to multiple topics.
In this paper, we present a Hierarchical Interaction-based (HINT) bi-encoder retrieval architecture that encodes a set of query documents and retrieval documents separately for the QBMD task. We design a hierarchical attention mechanism that allows the model to 1) encode long sequences efficiently and 2) learn the interactions at low-level and high-level semantics (e.g., tokens and paragraphs) across multiple documents. With contextualized representations, the final scoring is calculated based on a stratified late interaction, which ensures each query document contributes equally to the matching against the candidate document. We build a large-scale, weakly supervised QBMD retrieval dataset based on Wikipedia for model training. We evaluate the proposed model on both Query-By-Single-Document (QBSD) and QBMD tasks. For QBSD, we use a benchmark dataset for legal case retrieval. For QBMD, we transform standard keyword-based retrieval datasets into the QBMD setting. Our experimental results show that HINT significantly outperforms all competitive baselines.
At times when answers to user questions are readily and easily available (at essentially zero cost), it is important for humans to maintain their knowledge and strong reasoning capabilities. We believe that in many cases providing hints rather than final answers should be sufficient and beneficial for users, as it requires thinking and stimulates learning as well as remembering processes. We propose in this paper a novel task of automatic hint generation that supports users in finding the correct answers to their questions without needing to look the answers up. As a first attempt at this new task, we design and implement an approach that uses Wikipedia to automatically provide hints for any input question-answer pair. We then evaluate our approach with a group of 10 users and demonstrate that the generated hints help users successfully answer more questions than baseline hints do.
Current methods of evaluating search strategies and automated citation screening for systematic literature reviews typically rely on counting the number of relevant publications (i.e., those to be included in the review) and non-relevant publications (i.e., those to be excluded). Great importance is placed on promoting the retrieval of all relevant publications through attention to recall-oriented measures, and on demoting the retrieval of non-relevant publications through precision-oriented or cost metrics. This established practice, however, does not accurately reflect the reality of conducting a systematic review, because not all included publications have the same influence on its final outcome. More specifically, if an important publication is excluded or included, this might significantly change the overall review outcome, while not including or excluding less influential studies may have only a limited impact. In terms of evaluation measures, however, all inclusion and exclusion decisions are treated equally; therefore, failing to retrieve publications with little to no impact on the review outcome leads to the same decrease in recall as failing to retrieve crucial publications.
We propose a new evaluation framework that takes into account the impact of the reported study on the overall systematic review outcome. We demonstrate the framework by extracting review meta-analysis data and estimating outcome effects using predictions from ranking runs on systematic reviews of interventions from the CLEF TAR 2019 shared task. We further measure how close the obtained outcomes are to the outcomes of the original review if arbitrary rankings were used. We evaluate 74 runs using the proposed framework and compare the results with those obtained using standard IR measures. We find that accounting for the difference in review outcomes leads to a different assessment of the quality of a system than if traditional evaluation measures were used. Our analysis provides new insights into the evaluation of retrieval results in the context of systematic review automation, emphasising the importance of assessing the usefulness of each document beyond binary relevance.
Mainstream bias, where some users receive poor recommendations because their preferences are uncommon or simply because they are less active, is an important aspect to consider regarding fairness in recommender systems. Existing methods to mitigate mainstream bias do not explicitly model the importance of these non-mainstream users or, when they do, it is in a way that is not necessarily compatible with the data and recommendation model at hand. In contrast, we use the recommendation utility as a more generic and implicit proxy to quantify mainstreamness, and propose a simple user-weighting approach to incorporate it into the training process while taking the cost of potential recommendation errors into account. We provide extensive experimental results showing that quantifying mainstreamness via utility is better at identifying non-mainstream users, and that they are indeed better served when training the model in a cost-sensitive way. This is achieved with negligible or no loss in overall recommendation accuracy, meaning that the models learn a better balance across users. In addition, we show that research of this kind, which evaluates recommendation quality at the individual user level, may not be reliable unless enough interactions are used when assessing model performance.
The SPLADE (SParse Lexical AnD Expansion) model is a highly effective approach to learned sparse retrieval, where documents are represented by term impact scores derived from large language models. During training, SPLADE applies regularization to ensure postings lists are kept sparse — with the aim of mimicking the properties of natural term distributions — allowing efficient and effective lexical matching and ranking. However, we hypothesize that SPLADE may encode additional signals into common postings lists to further improve effectiveness. To explore this idea, we perform a number of empirical analyses where we re-train SPLADE with different, controlled vocabularies and measure how effective it is at ranking passages. Our findings suggest that SPLADE can effectively encode useful ranking signals in documents even when the vocabulary is constrained to terms that are not traditionally useful for ranking, such as stopwords or even random words.
One of the challenges of math information retrieval is the inherent ambiguity of mathematical notation. The use of various notations, symbols, and conventions can lead to ambiguities in math search queries, potentially causing confusion and errors. Therefore, asking clarifying questions in math search can help remove these ambiguities. Despite advances in incorporating clarifying questions for search, little is currently understood about the characteristics of these questions in math. This paper investigates math clarifying questions asked on the MathStackExchange community question answering platform, analyzing a total of 495,431 clarifying questions and their usefulness. The results of the analysis uncover specific patterns in useful clarifying questions that provide insight into the design considerations for future conversational math search systems. The formulae used in clarifying questions are closely related to those in the initial queries and are accompanied by common phrases seeking the missing information related to the formulae. Additionally, experiments utilizing clarifying questions for math search demonstrate the potential benefits of incorporating them alongside the original query.
Online discussions are a ubiquitous aspect of everyday life. An Internet user who interacts with an online discussion may benefit from seeing hyperlinks to webpages relevant to the discussion because the relevant webpages can provide added context, act as citations for background sources, or condense information so that conversations can proceed seamlessly at a high level. In this paper, we propose and study a new task of retrieving relevant webpages given an online discussion. We frame the task as a novel retrieval problem where we treat a sequence of comments in an online discussion as a query and use such a query to retrieve relevant webpages. We construct a new data set using Reddit, an online discussion forum, to study this new problem. We explore and evaluate multiple representative retrieval methods to examine their effectiveness for solving this new problem. We also propose to leverage the comments that contain hyperlinks as training data to enable supervised learning and further improve retrieval performance. We find that results using modern retrieval methods are promising and that leveraging comments with hyperlinks as training data can further improve performance. We release our data set and code to enable additional research in this direction.
This paper studies a category of visual question answering tasks, in which accessing external knowledge is necessary for answering the questions. This category is called outside-knowledge visual question answering (OK-VQA). A major step in developing OK-VQA systems is to retrieve relevant documents for the given multi-modal query. The current state-of-the-art asymmetric dense retrieval model for this task uses an architecture with a multi-modal query encoder and a uni-modal document encoder. Such an architecture requires a large amount of training data for effective performance. We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks. The proposed approach leads to a 26.9% improvement in Precision@5 compared to the current state-of-the-art asymmetric architecture. Additionally, the proposed pre-training approach performs well in zero-shot retrieval scenarios.
There is a long history of work on using relevance feedback for ad hoc document retrieval. The main types of relevance feedback studied thus far are for documents, passages and terms. We explore the merits of using relevance feedback provided for entities in an entity repository. We devise retrieval methods that can utilize relevance feedback provided for tokens, whether entities or terms. Empirical evaluation shows that using entity relevance feedback falls short of utilizing term feedback on average, but is much more effective for difficult queries. Furthermore, integrating term and entity relevance feedback is of clear merit; e.g., for augmenting minimal document feedback. We also contrast approaches to presenting entities and terms for soliciting relevance feedback.
Learning Aligned Cross-Modal and Cross-Product Embeddings for Generating the Topics of Shopping Needs
The paper addresses the issue of generating keywords to describe the topic of a shopping need based on the titles and photos of products being browsed or compared. We learn aligned cross-modal and cross-product embeddings to capture the relationships between textual and visual semantics and the shared features between comparable products. Experiments conducted on three real-world datasets show that the keywords decoded from such embeddings yield significant improvements over state-of-the-art cross-modal embeddings.
The fusion task is to aggregate ranked document lists retrieved for a query. The Condorcet voting criterion served as inspiration for a commonly used fusion method proposed by Montague and Aslam (2002). The method is stochastic as it is based on the QuickSort sorting algorithm. We empirically show that the performance of the method can substantially vary due to this stochastic aspect. We propose approaches that improve the performance robustness of this fusion method with respect to its stochastic nature. The resultant performance is on par with the state-of-the-art.
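The Condorcet-fuse idea can be sketched as a QuickSort whose comparator is a pairwise majority vote over the input rankings; the randomly chosen pivot is the stochastic element the abstract refers to. This is a minimal illustration of the method's structure, not the paper's exact implementation:

```python
import random

def condorcet_prefers(d1, d2, rankings):
    """Majority vote: does d1 beat d2 across the input rankings?
    Documents missing from a ranking are treated as ranked last."""
    votes = 0
    for r in rankings:
        p1 = r.index(d1) if d1 in r else len(r)
        p2 = r.index(d2) if d2 in r else len(r)
        votes += 1 if p1 < p2 else -1
    return votes > 0

def condorcet_fuse(docs, rankings, rng=random):
    """QuickSort with the pairwise majority vote as comparator. Because the
    majority relation need not be transitive, different random pivots can
    yield different fused orders -- the source of performance variance."""
    if len(docs) <= 1:
        return list(docs)
    pivot = rng.choice(docs)
    rest = [d for d in docs if d != pivot]
    wins = [d for d in rest if condorcet_prefers(d, pivot, rankings)]
    loses = [d for d in rest if not condorcet_prefers(d, pivot, rankings)]
    return condorcet_fuse(wins, rankings, rng) + [pivot] + condorcet_fuse(loses, rankings, rng)
```

When the majority relation is transitive the output is pivot-independent; robustness approaches amount to controlling what happens when it is not.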
Content-Based Relevance Estimation in Retrieval Settings with Ranking-Incentivized Document Manipulations
In retrieval settings such as the Web, many document authors are ranking incentivized: they opt to have their documents highly ranked for queries of interest. Consequently, they often respond to rankings by modifying their documents. These modifications can hurt retrieval effectiveness even if the resultant documents are of high quality. We present novel content-based relevance estimates which are “ranking-incentives aware”; that is, the underlying assumption is that content can be the result of ranking incentives rather than of pure authorship considerations. The suggested estimates are based on inducing information from past dynamics of the document corpus. Empirical evaluation attests to the clear merits of our most effective methods. For example, they substantially outperform state-of-the-art approaches that were not designed to address ranking-incentivized document manipulations.
An Analysis of Untargeted Poisoning Attack and Defense Methods for Federated Online Learning to Rank Systems
Federated online learning to rank (FOLTR) aims to preserve user privacy by not sharing their searchable data and search interactions, while guaranteeing high search effectiveness, especially in contexts where individual users have scarce training data and interactions. For this, FOLTR trains learning to rank models in an online manner — i.e. by exploiting users’ interactions with the search systems (queries, clicks), rather than labels — and federatively — i.e. by not aggregating interaction data in a central server for training purposes, but by training instances of a model on each user device on their own private data, and then sharing the model updates, not the data, across a set of users that have formed the federation. Existing FOLTR methods build upon advances in federated learning.
While federated learning methods have been shown effective at training machine learning models in a distributed way without the need of data sharing, they can be susceptible to attacks that target either the system’s security or its overall effectiveness.
In this paper, we consider attacks on FOLTR systems that aim to compromise their search effectiveness. Within this scope, we experiment with and analyse data and model poisoning attack methods to showcase their impact on FOLTR search effectiveness. We also explore the effectiveness of defense methods designed to counteract attacks on FOLTR systems. We contribute an understanding of the effect of attack and defense methods for FOLTR systems, as well as identifying the key factors influencing their effectiveness.
Graph Convolutional Networks (GCNs) are effective in providing more relevant items at higher rankings in recommender systems. However, in real-world scenarios, it is important to provide recommended items with diversity and novelty as well as relevance to each user’s preference. Additionally, users often desire a wide range of recommendations, not just ones based on their past search behaviors and histories. To enhance each user’s satisfaction, it is important to develop a recommender system that provides items that are both more relevant and more diverse. LightGCN, a GCN-based recommender system that learns latent vectors of users and items using multiple layers of aggregation functions and an adjacency matrix, can achieve this. However, LightGCN often provides recommendations without diversity when the number of layers is insufficient; on the other hand, when the number is excessive, accuracy declines, which is known as the over-smoothing problem. To overcome this, we propose a novel approach using a continuous-time quantum walk model derived from a quantum algorithm to reconstruct the user-item adjacency matrix of LightGCN, improving the relevance and diversity of recommendations.
Many of the traditional recommendation algorithms are designed based on the fundamental idea of mining or learning correlative patterns from data to estimate the user-item correlative preference. However, pure correlative learning may lead to Simpson’s paradox in predictions, and thus sacrifice recommendation performance. Simpson’s paradox is a well-known statistical phenomenon that causes confusion in statistical conclusions, and ignoring the paradox may result in inaccurate decisions. Fortunately, causal and counterfactual modeling can help us think outside of the observational data for user modeling and personalization so as to tackle such issues. In this paper, we propose Causal Collaborative Filtering (CCF) — a general framework for modeling causality in collaborative filtering and recommendation. We provide a unified causal view of CF and mathematically show that many of the traditional CF algorithms are actually special cases of CCF under simplified causal graphs. We then propose a conditional intervention approach for do-operations so that we can estimate the user-item causal preference based on the observational data. Finally, we further propose a general counterfactual constrained learning framework for estimating the user-item preferences. Experiments are conducted on two types of real-world datasets — traditional and randomized trial data — and results show that our framework can improve the recommendation performance and reduce the Simpson’s paradox problem of many CF algorithms.
Knowledge distillation is commonly used in training a neural document ranking model by employing a teacher to guide model refinement. As a teacher may not be correct in all cases, over-calibration between the student and teacher models can make training less effective. This paper focuses on the KL divergence loss used for knowledge distillation in document re-ranking, and revisits the balancing of knowledge distillation with explicit contrastive learning. The proposed loss function takes a conservative approach to imitating the teacher’s behavior, and allows the student to deviate from the teacher at times during training. This paper presents analytic results with an evaluation on MS MARCO passages to validate the usefulness of the proposed loss for transformer-based ColBERT re-ranking.
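A hedged sketch of the kind of blended objective described above (not the paper's exact loss): a KL term pulling the student's score distribution toward the teacher's, mixed with a contrastive cross-entropy on the labeled positive, so the student can overrule the teacher when the supervision signal disagrees:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def distill_contrastive_loss(student_scores, teacher_scores, pos_idx, alpha=0.5):
    """Blend of (1) KL divergence from the teacher's to the student's score
    distribution over a positive and its negatives, and (2) a contrastive
    cross-entropy pushing the known positive up. alpha < 1 leaves room for
    the student to deviate from an imperfect teacher."""
    s = softmax(student_scores)
    t = softmax(teacher_scores)
    kl = sum(ti * math.log(ti / si) for ti, si in zip(t, s))
    ce = -math.log(s[pos_idx])
    return alpha * kl + (1 - alpha) * ce
```

When the student matches the teacher exactly, the KL term vanishes and only the contrastive term drives further training.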
As the popularity of voice assistants continues to surge, conversational search has gained increased attention in Information Retrieval. However, data sparsity issues in conversational search significantly hinder the progress of supervised conversational search methods. Consequently, researchers are focusing more on zero-shot conversational search approaches. Nevertheless, existing zero-shot methods face three primary limitations: they are not universally applicable to all retrievers, their effectiveness lacks sufficient explainability, and they struggle to resolve common conversational ambiguities caused by omission. To address these limitations, we introduce a novel Zero-shot Query Reformulation (ZeQR) framework that reformulates queries based on previous dialogue contexts without requiring supervision from conversational search data. Specifically, our framework utilizes language models designed for machine reading comprehension tasks to explicitly resolve two common ambiguities: coreference and omission, in raw queries. In comparison to existing zero-shot methods, our approach is universally applicable to any retriever without additional adaptation or indexing. It also provides greater explainability and effectively enhances query intent understanding because ambiguities are explicitly and proactively resolved. Through extensive experiments on four TREC conversational datasets, we demonstrate the effectiveness of our method, which consistently outperforms state-of-the-art baselines.
Topic modelling is an approach to generating descriptions of document collections as a set of topics, where each topic has a distinct theme and each document is a blend of topics. It has been applied to retrieval in a range of ways, but there has been little prior work on measuring whether the topics are descriptive in this context. Moreover, existing methods for assessing topic quality do not consider how well individual documents are described. To address this issue, we propose a new measure of topic quality, which we call specificity; the basis of this measure is the extent to which individual documents are described by a limited number of topics. We also propose a new experimental protocol for validating topic-quality measures, a ‘noise dial’ that quantifies the extent to which the measure’s scores are altered as the topics are degraded by the addition of noise. The principle of the mechanism is that a meaningful measure should produce low scores if the ‘topics’ are essentially random. We show that specificity is at least as effective as existing measures of topic quality and does not require external resources. While other measures relate only to topics, not to documents, we further show that specificity correlates with the extent to which topic models are informative in the retrieval process.
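One way to operationalise these two ideas (an illustrative proxy, not necessarily the paper's definition) is to score each document by the topic mass its top-k topics capture, and to turn the noise dial by blending each document-topic distribution with a random one:

```python
import numpy as np

def specificity(doc_topic, k=3):
    """Illustrative concentration proxy: average share of each
    document's topic mass captured by its k largest topics.

    doc_topic: (n_docs, n_topics) array, rows summing to 1.
    """
    top_k = np.sort(doc_topic, axis=1)[:, -k:]
    return top_k.sum(axis=1).mean()

def noise_dial(doc_topic, levels=(0.0, 0.5, 1.0), seed=0):
    """'Noise dial' sketch: blend each row with a random distribution
    and report the measure at each noise level.  A meaningful measure
    should fall toward a low baseline as the rows become random."""
    rng = np.random.default_rng(seed)
    noise = rng.dirichlet(np.ones(doc_topic.shape[1]),
                          size=doc_topic.shape[0])
    return [specificity((1 - e) * doc_topic + e * noise) for e in levels]
```

On a sharply peaked model the dial should report a high score at level 0.0 that decays as noise is mixed in, which is the validation behaviour the protocol looks for.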
The ability to detect out-of-distribution (OOD) inputs is essential for safely deploying machine learning models in an open world. Most existing research on OOD detection, and more generally uncertainty quantification, has focused on multi-class classification. However, for many information retrieval (IR) applications, the classification of documents or images is by nature not multi-class but multi-label. This paper presents a purely theoretical analysis of the under-explored problem of OOD detection in multi-label classification using deep neural networks. First, we examine the main existing approaches, such as MSP (proposed in ICLR-2017) and MaxLogit (proposed in ICML-2022), and summarize them as different combinations of label-wise scoring and aggregation functions. Some existing methods are shown to be equivalent. Then, we prove that JointEnergy (proposed in NeurIPS-2021) is indeed the optimal probabilistic solution when the class labels are conditionally independent of each other for any given data sample. This provides a more rigorous explanation for the effectiveness of JointEnergy than the original joint-likelihood interpretation, and also reveals its reliance upon the assumption of label independence rather than the exploitation of label relationships, as previously thought. Finally, we discuss potential future research directions in this area.
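The label-wise-scoring-plus-aggregation view can be sketched as below. The decompositions follow the cited papers as commonly presented (sigmoid probability with max, raw logit with max, per-label free energy with sum); treat the exact forms as a reading of those papers rather than this paper's notation. Higher scores indicate in-distribution inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each multi-label OOD score = (label-wise scoring fn, aggregation fn).
# `logits`: (n_labels,) array of per-label logits for one input.

def msp(logits):
    # label-wise: sigmoid probability; aggregation: max
    return sigmoid(logits).max()

def max_logit(logits):
    # label-wise: raw logit; aggregation: max
    return logits.max()

def joint_energy(logits):
    # label-wise: negative free energy log(1 + e^z), computed stably
    # as logaddexp(0, z); aggregation: sum over labels
    return np.logaddexp(0.0, logits).sum()
```

The sum aggregation is what lets JointEnergy pool evidence across all labels, whereas the max-based scores depend on a single label.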
As in other fields of artificial intelligence, the information retrieval community has grown interested in investigating the power consumption associated with neural models, particularly models of search. This interest has become particularly relevant as the energy consumption of information retrieval models has risen with new neural models based on large language models, leading to an associated increase in CO2 emissions, albeit relatively low compared to fields such as natural language processing. Consequently, researchers have started exploring the development of a green agenda for sustainable information retrieval research and operation. Previous work, however, has primarily considered energy consumption and associated CO2 emissions alone. In this paper, we seek to draw the information retrieval community’s attention to the overlooked aspect of water consumption related to these powerful models. We supplement previous energy consumption estimates with corresponding water consumption estimates, considering both off-site water consumption (required for operating and cooling energy production systems such as fossil-fuel and nuclear power plants) and on-site consumption (for cooling the data centres where models are trained and operated). By incorporating water consumption alongside energy consumption and CO2 emissions, we offer a more comprehensive understanding of the environmental impact of information retrieval research and operation.
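A back-of-envelope sketch of how energy figures translate into the two water components is given below. The water-intensity factors are illustrative placeholders, not values from the paper; real estimates depend on the data centre's cooling design and the regional electricity mix.

```python
def water_footprint_litres(energy_kwh, wue_onsite=0.55, wif_offsite=3.1):
    """Toy on-site + off-site water estimate for a training or
    inference workload.  Placeholder factors:
      wue_onsite  : litres of cooling water per kWh at the data centre
      wif_offsite : litres withdrawn per kWh at the generation plant
    """
    onsite = energy_kwh * wue_onsite    # data-centre cooling
    offsite = energy_kwh * wif_offsite  # power-plant operation/cooling
    return {"onsite_l": onsite, "offsite_l": offsite,
            "total_l": onsite + offsite}
```

Under these placeholder factors, a 1,000 kWh workload would imply roughly 3,650 litres of water, the bulk of it off-site, which is why energy-only accounting understates the environmental footprint.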