SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
SESSION: Keynote Talks
If we could design the ideal IR "effectiveness" experiment (as distinct from an IR "efficiency" experiment), what would it look like? It would probably be a lab-based observational study involving multiple search systems masked behind a uniform interface, and with hundreds (or thousands) of users each progressing some "real" search activity they were interested in. And we'd plan to (non-intrusively, somehow) capture per-snippet, per-document, per-SERP, and per-session annotations and satisfaction responses. The collected data could then be compared against a range of measured "task completion quality" indicators, and also against search effectiveness metric scores computed from the elements contained in the SERPs that were served by the systems. That's a tremendously big ask! So we often use offline evaluation techniques instead, employing test collections, static qrels sets, and effectiveness metrics. We abstract the user into a deterministic evaluation script, supposing for pragmatic reasons that we know what query they would issue, and at the same time assuming that we can apply an effectiveness metric to calculate how much usefulness (or satisfaction) they will derive from any given SERP. The great advantage of this approach is that aside from the process of collecting the qrels, it is free of the need for users, meaning that it is repeatable. Indeed, we often do repeat, iterating to set parameters (and to rectify programming errors). Then, once metric scores have been computed, we carry out one or more paired statistical tests and draw conclusions as to relative system effectiveness.
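As a rough illustration of the offline workflow described above (qrels, an effectiveness metric, then a paired test), the following minimal sketch scores two runs per topic and computes a paired t statistic. The metric choice (reciprocal rank), the toy qrels and runs, and all function names are assumptions made for this example; the talk does not prescribe them.

```python
from math import sqrt
from statistics import mean, stdev


def reciprocal_rank(ranking, relevant):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def per_topic_scores(run, qrels):
    """Score every topic in the qrels; topics missing from the run score 0.0."""
    return [reciprocal_rank(run.get(topic, []), qrels[topic]) for topic in sorted(qrels)]


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over per-topic differences (n - 1 degrees of freedom).

    Assumes at least two topics and differences that are not all identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical toy data: three topics, two systems.
qrels = {"t1": {"d1"}, "t2": {"d5"}, "t3": {"d9"}}
run_a = {"t1": ["d1", "d2"], "t2": ["d4", "d5"], "t3": ["d9", "d8"]}
run_b = {"t1": ["d2", "d1"], "t2": ["d4", "d6", "d5"], "t3": ["d8", "d7"]}

scores_a = per_topic_scores(run_a, qrels)
scores_b = per_topic_scores(run_b, qrels)
t_stat = paired_t_statistic(scores_a, scores_b)
print(scores_a, scores_b, round(t_stat, 3))
```

Because the whole procedure is a deterministic script over fixed qrels, rerunning it after a parameter change yields directly comparable scores, which is the repeatability the paragraph emphasizes.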
Deep Learning has made tremendous progress in Natural Language Processing (NLP), where large pre-trained language models (PLM) fine-tuned on the target task have become the predominant tool. More recently, in a process called prompting, NLP tasks are rephrased as natural language text, allowing us to better exploit linguistic knowledge learned by PLMs and resulting in significant improvements. Still, PLMs have limited inference ability. In the Textual Entailment task, systems need to output whether the truth of a certain textual hypothesis follows from the given premise text. Manually annotated entailment datasets covering multiple inference phenomena have been used to infuse inference capabilities to PLMs.
This talk will review these recent developments, and will present an approach that combines prompts and PLMs fine-tuned for textual entailment that yields state-of-the-art results on Information Extraction (IE) using only a small fraction of the annotations. The approach has additional benefits, like the ability to learn from different schemas and inference datasets. These developments enable a new paradigm for IE where the expert can define the domain-specific schema using natural language and directly run those specifications, annotating a handful of examples in the process. A user interface based on this new paradigm will also be presented. Beyond IE, inference capabilities could be extended, acquired and applied from other tasks, opening a new research avenue where entailment and downstream task performance improve in tandem.
Search engines were one of the first intelligent cloud-based applications that people used to get things done, and they have since become an extremely important productivity tool. This is in part because much of what a person is doing when they search is thinking. Search engines do not merely support that thinking, however, but can also actually shape it. For example, some search results are more likely to spur learning than others and influence a person's future queries. This means that as information retrieval researchers the new approaches that we develop can actively shape the future of work.
The world is now in the middle of the most significant change to work practices in a generation, and it is one that will make search technology even more central to work in the years to come. For the past several millennia, space was the primary technology that people used to get things done. The coming Hybrid Work Era, however, will be shaped by digital technology. The recent rapid shift to remote work significantly accelerated the digital transformation already underway at many companies, and new types of work-related data are now being generated at an unprecedented rate. For example, the use of meeting recordings in Microsoft Stream has more than doubled from March 2020 to February 2022.
Knowledge exists in this newly captured data, but figuring out how to make sense of it is overwhelming. The information retrieval community knows how to process large amounts of data, build usable intelligent systems, and learn from behavioral data, and we have a unique opportunity right now to apply this expertise to new corpora and in new scenarios in a meaningful way. In this talk I will give an overview of what research tells us about emerging work practices, and explore how the SIGIR community can build on these findings to help create a new - and better - future of work.
We are in the midst of an AI revolution. Three primary disruptive changes set off this revolution: (1) an increase in compute power, (2) mobile internet, and (3) advances in deep learning. The next decade is expected to be about the proliferation of Internet-of-Things (IoT) devices and sensors, which will generate exponentially larger amounts of data to reason over and pave the way for ambient computing. This will also give rise to new forms of interaction patterns with these systems. Users will have to interact with these systems under increasingly richer context and in real-time. Conversational AI has a critical role to play in this revolution, but only if it delivers on its promise of enabling natural, frictionless, and personalized interactions in any context the user is in, while hiding the complexity of these systems through ambient intelligence. However, current commercial conversational AI systems are trained primarily with a supervised learning paradigm, which is difficult, if not impossible, to scale by manually annotating data for increasingly complex sets of contextual conditions. Inherent ambiguity in natural language further complicates the problem. We need to devise new forms of learning paradigms and frameworks that will scale to this complexity. In this talk, we present some early steps we are taking with Alexa, Amazon's Conversational AI system, to move from supervised learning to self-learning methods, where the AI relies on customer interactions for supervision in our journey to ambient intelligence.
SESSION: Topic 1: Bias in IR
Can Clicks Be Both Labels and Features?: Unbiased Behavior Feature Collection and Uncertainty-aware Learning to Rank
Using implicit feedback collected from user clicks as training labels for learning-to-rank algorithms is a well-developed paradigm that has been extensively studied and used in modern IR systems. Using user clicks as ranking features, on the other hand, has not been fully explored in existing literature. Despite its potential in improving short-term system performance, whether the incorporation of user clicks as ranking features is beneficial for learning-to-rank systems in the long term is still questionable. Two of the most important problems are (1) the explicit bias introduced by noisy user behavior, and (2) the implicit bias, which we refer to as the exploitation bias, introduced by the dynamic training and serving of learning-to-rank systems with behavior features. In this paper, we explore the possibility of incorporating user clicks as both training labels and ranking features for learning to rank. We formally investigate the problems in feature collection and model training, and propose a counterfactual feature projection function and a novel uncertainty-aware learning to rank framework. Experiments on public datasets show that ranking models learned with the proposed framework can significantly outperform models built with raw click features and algorithms that rank items without considering model uncertainty.
In this paper we study how to effectively exploit implicit feedback in Dense Retrievers (DRs). We consider the specific case in which click data from a historic click log is available as implicit feedback. We then exploit such historic implicit interactions to improve the effectiveness of a DR. A key challenge that we study is the effect that biases in the click signal, such as position bias, have on the DRs. To overcome the problems associated with the presence of such bias, we propose the Counterfactual Rocchio (CoRocchio) algorithm for exploiting implicit feedback in Dense Retrievers. We demonstrate both theoretically and empirically that dense query representations learnt with CoRocchio are unbiased with respect to position bias and lead to higher retrieval effectiveness. We make available the implementations of the proposed methods and the experimental framework, along with all results at https://github.com/ielab/Counterfactual-DR.
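The abstract above does not state the CoRocchio update rule, so the following is only a sketch of the general idea it names: a Rocchio-style query update in which each clicked document is weighted by the inverse of its examination propensity, so that clicks at rarely examined positions count for more. The function name `ips_rocchio`, the `alpha` and `beta` weights, and the normalization by the number of clicks are assumptions of this example, not the authors' formulation; see the linked repository for the actual method.

```python
def ips_rocchio(query_vec, clicked, alpha=1.0, beta=1.0):
    """Rocchio-style update with inverse-propensity-weighted click feedback.

    query_vec: dense query embedding as a list of floats.
    clicked:   list of (doc_vec, propensity) pairs for clicked documents,
               where propensity is the assumed probability that the
               document's rank position was examined (0 < propensity <= 1).
    """
    if not clicked:
        return [alpha * q for q in query_vec]

    feedback = [0.0] * len(query_vec)
    for doc_vec, propensity in clicked:
        weight = 1.0 / propensity  # rarely examined positions get larger weight
        for i, value in enumerate(doc_vec):
            feedback[i] += weight * value

    n = len(clicked)  # assumption: average over the number of clicks
    return [alpha * q + beta * f / n for q, f in zip(query_vec, feedback)]


# Hypothetical example: one click at a position examined half the time,
# one click at a position that is always examined.
new_query = ips_rocchio([1.0, 0.0], [([0.0, 1.0], 0.5), ([1.0, 1.0], 1.0)])
print(new_query)
```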
Implicit feedback has been widely used to build commercial recommender systems. Because observed feedback represents users' click logs, there is a semantic gap between true relevance and observed feedback. More importantly, observed feedback is usually biased towards popular items, thereby overestimating the actual relevance of popular items. Although existing studies have developed unbiased learning methods using inverse propensity weighting (IPW) or causal reasoning, they solely focus on eliminating the popularity bias of items. In this paper, we propose a novel unbiased recommender learning model, namely BIlateral SElf-unbiased Recommender (BISER), to eliminate the exposure bias of items caused by recommender models. Specifically, BISER consists of two key components: (i) self-inverse propensity weighting (SIPW) to gradually mitigate the bias of items without incurring high computational costs; and (ii) bilateral unbiased learning (BU) to bridge the gap between two complementary models in model predictions, i.e., user- and item-based autoencoders, alleviating the high variance of SIPW. Extensive experiments show that BISER consistently outperforms state-of-the-art unbiased recommender models over several datasets, including Coat, Yahoo! R3, MovieLens, and CiteULike.
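For readers unfamiliar with inverse propensity weighting, the sketch below shows one commonly used form of an IPW-corrected pointwise loss for implicit feedback. It illustrates the generic baseline idea only, not BISER's self-inverse propensity weighting or its bilateral learning; the function name and the toy numbers are assumptions of this example.

```python
from math import log


def ipw_loss(clicks, predictions, propensities, eps=1e-7):
    """Propensity-corrected binary cross-entropy for implicit feedback.

    clicks:       observed feedback, 1 for clicked and 0 otherwise.
    predictions:  model-predicted relevance probabilities.
    propensities: assumed exposure probabilities (0 < theta <= 1).
    """
    total = 0.0
    for y, p, theta in zip(clicks, predictions, propensities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        w = y / theta                     # up-weight clicks on rarely exposed items
        total += -(w * log(p) + (1.0 - w) * log(1.0 - p))
    return total / len(clicks)


# With all propensities equal to 1 the loss reduces to ordinary cross-entropy.
plain = ipw_loss([1, 0], [0.8, 0.2], [1.0, 1.0])
# A click on an item that is exposed only half the time is penalized more
# heavily when the model assigns it a low score.
rare = ipw_loss([1], [0.2], [0.5])
common = ipw_loss([1], [0.2], [1.0])
print(plain, rare, common)
```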
Most recommender systems evaluate model performance offline through either: 1) a normal biased test on factual interactions; or 2) a debiased test with records from a randomized controlled trial. In fact, both tests only reflect part of the whole picture: factual interactions are collected from the recommendation policy, so fitting them better implies benefiting the platform with a higher click or conversion rate; in contrast, the debiased test eliminates system-induced biases and thus is more reflective of users' true preferences. Nevertheless, we find that existing models exhibit a trade-off between the two tests, and methods that perform well on both are lacking.
In this work, we aim to develop a win-win recommendation method that is strong on both tests. This is non-trivial, since it requires learning a model that can make accurate predictions in both the factual environment (i.e., the normal biased test) and the counterfactual environment (i.e., the debiased test). Towards this goal, we perform environment-aware recommendation modeling by considering both environments. In particular, we propose an Interpolative Distillation (InterD) framework, which interpolates the biased and debiased models at the user-item pair level by distilling a student model. We conduct experiments on three real-world datasets with both tests. Empirical results justify the rationality and effectiveness of InterD, which stands out on both tests and demonstrates especially remarkable gains on less popular items.
Recent years have witnessed the great accuracy performance of graph-based Collaborative Filtering (CF) models for recommender systems. By taking the user-item interaction behavior as a graph, these graph-based CF models borrow the success of Graph Neural Networks (GNN), and iteratively perform neighborhood aggregation to propagate the collaborative signals. While conventional CF models are known for facing the challenges of the popularity bias that favors popular items, one may wonder "Whether the existing graph-based CF models alleviate or exacerbate the popularity bias of recommender systems?" To answer this question, we first investigate the two-fold performances w.r.t. accuracy and novelty for existing graph-based CF methods. The empirical results show that symmetric neighborhood aggregation adopted by most existing graph-based CF models exacerbates the popularity bias and this phenomenon becomes more serious as the depth of graph propagation increases. Further, we theoretically analyze the cause of popularity bias for graph-based CF. Then, we propose a simple yet effective plugin, namely r-AdjNorm, to achieve an accuracy-novelty trade-off by controlling the normalization strength in the neighborhood aggregation process. Meanwhile, r-AdjNorm can be smoothly applied to the existing graph-based CF backbones without additional computation. Finally, experimental results on three benchmark datasets show that our proposed method can improve novelty without sacrificing accuracy under various graph-based CF backbones.
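To make "controlling the normalization strength in the neighborhood aggregation process" concrete, the sketch below normalizes an adjacency matrix with a tunable exponent `r`, so that entry (i, j) becomes A[i][j] / (d_i^r * d_j^(1 - r)). The abstract does not give the formula, so this parameterization is an assumption of the example rather than a statement of the authors' definition; with `r = 0.5` it reduces to the familiar symmetric normalization.

```python
def r_adj_norm(adj, r=0.5):
    """Normalize a dense adjacency matrix with strength r.

    adj: square list-of-lists adjacency matrix with non-negative weights.
    r:   exponent on the degree of the row node; the column node's degree
         receives exponent 1 - r (assumed parameterization).
    """
    degrees = [sum(row) for row in adj]
    size = len(adj)
    normalized = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if adj[i][j]:
                normalized[i][j] = adj[i][j] / (degrees[i] ** r * degrees[j] ** (1.0 - r))
    return normalized


# Hypothetical star graph: node 0 is a popular hub linked to four leaf nodes.
star = [
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
symmetric = r_adj_norm(star, r=0.5)
row_only = r_adj_norm(star, r=1.0)
print(symmetric[0][1], symmetric[1][0], row_only[0][1], row_only[1][0])
```

In this toy graph, moving `r` from 0.5 to 1.0 doubles the weight a leaf receives from the hub (0.5 to 1.0) and halves what the hub receives from each leaf (0.5 to 0.25), which shows how a single exponent shifts influence between popular and unpopular nodes.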
Recommender systems usually face popularity bias. From the popularity distribution shift perspective, the normal paradigm trained on exposed items (most are hot items) identifies that recommending popular items more frequently can achieve lower loss, thus injecting popularity information into item property embedding, e.g., id embedding. From the long-tail distribution shift perspective, the sparse interactions of long-tail items lead to insufficient learning of them. The resultant distribution discrepancy between hot and long-tail items would not only inherit the bias, but also amplify the bias. Existing work addresses this issue with inverse propensity scoring (IPS) or causal embeddings. However, we argue that not all popularity biases mean bad effects, i.e., some items show higher popularity due to better quality or because they conform to current trends, and so deserve more recommendations. Blindly seeking unbiased learning may inhibit high-quality or fashionable items. To make better use of the popularity bias, we propose a co-training disentangled domain adaptation network (CD$^2$AN), which can co-train both biased and unbiased models. Specifically, for popularity distribution shift, CD$^2$AN disentangles item property representation and popularity representation from item property embedding. For long-tail distribution shift, we introduce additional unexposed items (most are long-tail items) to align the distribution of hot and long-tail item property representations. Further, from the instances perspective, we carefully design the item similarity regularization to learn comprehensive item representation, which encourages item pairs with more effective co-occurrence patterns to have more similar item property representations. Based on offline evaluations and online A/B tests, we show that CD$^2$AN outperforms the existing debiased solutions. Currently, CD$^2$AN has been successfully deployed in the Mobile Taobao App, where it handles major online traffic.
SESSION: Topic 2: Collaborative Filtering
Collaborative Filtering (CF) has emerged as a fundamental paradigm for parameterizing users and items into latent representation space, with their correlative patterns from interaction data. Among various CF techniques, the development of GNN-based recommender systems, e.g., PinSage and LightGCN, has offered the state-of-the-art performance. However, two key challenges have not been well explored in existing solutions: i) The over-smoothing effect in deeper graph-based CF architectures may cause indistinguishable user representations and degraded recommendation results. ii) The supervision signals (i.e., user-item interactions) are usually scarce and skewed in distribution in reality, which limits the representation power of CF paradigms. To tackle these challenges, we propose a new self-supervised recommendation framework Hypergraph Contrastive Collaborative Filtering (HCCF) to jointly capture local and global collaborative relations with a hypergraph-enhanced cross-view contrastive learning architecture. In particular, the designed hypergraph structure learning enhances the discrimination ability of the GNN-based CF paradigm, in comprehensively capturing the complex high-order dependencies among users. Additionally, our HCCF model effectively integrates the hypergraph structure encoding with self-supervised learning to reinforce the representation quality of recommender systems, based on the hypergraph self-discrimination. Extensive experiments on three benchmark datasets demonstrate the superiority of our model over various state-of-the-art recommendation methods, and the robustness against sparse user interaction data. The implementation code is available at https://github.com/akaxlh/HCCF.
Learning informative representations of users and items from the historical interactions is crucial to collaborative filtering (CF). Existing CF approaches usually model interactions solely within the Euclidean space. However, the sophisticated user-item interactions inherently present highly non-Euclidean anatomy with various types of geometric patterns (i.e., tree-likeness and cyclic structures). The Euclidean-based models may be inadequate to fully uncover the intent factors beneath such hybrid-geometry interactions. To remedy this deficiency, in this paper, we study the novel problem of Geometric Disentangled Collaborative Filtering (GDCF), which aims to reveal and disentangle the latent intent factors across multiple geometric spaces. A novel generative GDCF model is proposed to learn geometric disentangled representations by inferring the high-level concepts associated with user intentions and various geometries. Empirically, our proposal is extensively evaluated over five real-world datasets, and the experimental results demonstrate the superiority of GDCF.
Collaborative filtering is one of the most common scenarios and popular research topics in recommender systems. Among existing methods, latent factor models, i.e., learning a specific embedding for each user/item by reconstructing the observed interaction matrix, have shown excellent performances. However, such user-specific and item-specific embeddings are intrinsically transductive, making it difficult for them to deal with new users and new items unseen during training. Besides, the number of model parameters heavily depends on the number of all users and items, restricting their scalability to real-world applications. To solve the above challenges, in this paper, we propose a novel model-agnostic and scalable Inductive Embedding Module for collaborative filtering, namely INMO. INMO generates the inductive embeddings for users (items) by characterizing their interactions with some template items (template users), instead of employing an embedding lookup table. Under the theoretical analysis, we further propose an effective indicator for the selection of template users and template items. Our proposed INMO can be attached to existing latent factor models as a pre-module, inheriting the expressiveness of backbone models, while bringing the inductive ability and reducing model parameters. We validate the generality of INMO by attaching it to Matrix Factorization (MF) and LightGCN, which are two representative latent factor models for collaborative filtering. Extensive experiments on three public benchmarks demonstrate the effectiveness and efficiency of INMO in both transductive and inductive recommendation scenarios.
Due to the widespread presence of implicit feedback, recommendation based on it has been a long-standing research problem in academia and industry. However, it suffers from extreme sparsity, since each user only interacts with a few items. One well-known and well-performing method is to treat all of each user's uninteracted items as negative with low confidence. The method intrinsically imposes an implicit regularization to penalize large deviation of each user's preferences for uninteracted items from a constant. However, such methods have to assume a constant-rating prior for uninteracted items, which may be questionable. In this paper, we propose a novel ring-based regularization to penalize significant differences of each user's preferences between each item and some other items. The ring structure, described by an item graph, determines which other items are selected for each item in the regularization. The regularization not only averts the introduction of the prior ratings but also implicitly penalizes the remarkable preference differences for all items according to theoretical analysis. However, optimizing the recommenders with the regularization still suffers from computational challenges, so we develop a scalable alternating least square algorithm by carefully designing gradient computation. Therefore, as long as each item is connected with a sublinear/constant number of other items in the item graph, the overall learning algorithm could be comparably efficient to the existing algorithms. The proposed regularization is extensively evaluated with several public recommendation datasets, where the results show that the regularization could lead to considerable improvements in recommendation performance.
Recommender systems aim to provide personalized services to users and are playing an increasingly important role in our daily lives. The key of recommender systems is to predict how likely users will interact with items based on their historical online behaviors, e.g., clicks, add-to-cart, purchases, etc. To exploit these user-item interactions, there are increasing efforts on considering the user-item interactions as a user-item bipartite graph and then performing information propagation in the graph via Graph Neural Networks (GNNs). Given the power of GNNs in graph representation learning, these GNNs-based recommendation methods have remarkably boosted the recommendation performance. Despite their success, most existing GNNs-based recommender systems overlook the existence of interactions caused by unreliable behaviors (e.g., random/bait clicks) and uniformly treat all the interactions, which can lead to sub-optimal and unstable performance. In this paper, we investigate the drawbacks (e.g., non-adaptive propagation and non-robustness) of existing GNN-based recommendation methods. To address these drawbacks, we introduce a principled graph trend collaborative filtering method and propose the Graph Trend Filtering Networks for recommendations (GTN) that can capture the adaptive reliability of the interactions. Comprehensive experiments and ablation studies are presented to verify and understand the effectiveness of the proposed framework. Our implementation based on PyTorch is available: https://github.com/wenqifan03/GTN-SIGIR2022.
Recently, graph neural networks (GNN) have been successfully applied to recommender systems as an effective collaborative filtering (CF) approach. However, existing GNN-based CF models suffer from noisy user-item interaction data, which seriously affects the effectiveness and robustness in real-world applications. Although there have been several studies on data denoising in recommender systems, they either neglect direct intervention of noisy interaction in the message-propagation of GNN, or fail to preserve the diversity of recommendation when denoising.
To tackle the above issues, this paper presents a novel GNN-based CF model, named Robust Graph Collaborative Filtering (RGCF), to denoise unreliable interactions for recommendation. Specifically, RGCF consists of a graph denoising module and a diversity preserving module. The graph denoising module is designed for reducing the impact of noisy interactions on the representation learning of GNN, by adopting both a hard denoising strategy (i.e., discarding interactions that are confidently estimated as noise) and a soft denoising strategy (i.e., assigning reliability weights for each remaining interaction). In the diversity preserving module, we build up a diversity augmented graph and propose an auxiliary self-supervised task based on mutual information maximization (MIM) for enhancing the denoised representation and preserving the diversity of recommendation. These two modules are integrated in a multi-task learning manner that jointly improves the recommendation performance. We conduct extensive experiments on three real-world datasets and three synthesized datasets. Experiment results show that RGCF is more robust against noisy interactions and achieves significant improvement compared with baseline models.
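The hard and soft denoising strategies described above can be pictured with the following minimal sketch: edges whose estimated reliability falls below a threshold are discarded (hard), and each surviving edge keeps its reliability as a weight (soft). Estimating reliability as a rescaled cosine similarity between user and item embeddings, as well as the names `denoise_edges` and `threshold`, are assumptions made for this example and are not taken from the abstract.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def denoise_edges(edges, user_emb, item_emb, threshold=0.25):
    """Return {(user, item): weight} for the edges that survive denoising.

    Hard denoising: drop edges whose reliability is below the threshold.
    Soft denoising: keep the reliability of each remaining edge as its weight.
    """
    kept = {}
    for user, item in edges:
        # Assumed reliability estimate: cosine similarity rescaled to [0, 1].
        reliability = (cosine(user_emb[user], item_emb[item]) + 1.0) / 2.0
        if reliability < threshold:
            continue
        kept[(user, item)] = reliability
    return kept


# Hypothetical embeddings: i1 matches the user, i2 points the opposite way,
# and i3 is orthogonal.
user_emb = {"u1": [1.0, 0.0]}
item_emb = {"i1": [1.0, 0.0], "i2": [-1.0, 0.0], "i3": [0.0, 1.0]}
edges = [("u1", "i1"), ("u1", "i2"), ("u1", "i3")]
weights = denoise_edges(edges, user_emb, item_emb)
print(weights)
```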
SESSION: Topic 3: Conversational IR
User simulation has been a cost-effective technique for evaluating conversational recommender systems. However, building a human-like simulator is still an open challenge. In this work, we focus on how users reformulate their utterances when a conversational agent fails to understand them. First, we perform a user study, involving five conversational agents across different domains, to identify common reformulation types and their transition relationships. A common pattern that emerges is that persistent users would first try to rephrase, then simplify, before giving up. Next, to incorporate the observed reformulation behavior in a user simulator, we introduce the task of reformulation sequence generation: to generate a sequence of reformulated utterances with a given intent (rephrase or simplify). We develop methods by extending transformer models guided by the reformulation type and perform further filtering based on estimated reading difficulty. We demonstrate the effectiveness of our approach using both automatic and human evaluation.
Conversational question answering (ConvQA) tackles sequential information needs where contexts in follow-up questions are left implicit. Current ConvQA systems operate over homogeneous sources of information: either a knowledge base (KB), or a text corpus, or a collection of tables. This paper addresses the novel issue of jointly tapping into all of these together, this way boosting answer coverage and confidence. We present CONVINSE, an end-to-end pipeline for ConvQA over heterogeneous sources, operating in three stages: i) learning an explicit structured representation of an incoming question and its conversational context, ii) harnessing this frame-like representation to uniformly capture relevant evidences from KB, text, and tables, and iii) running a fusion-in-decoder model to generate the answer. We construct and release the first benchmark, ConvMix, for ConvQA over heterogeneous sources, comprising 3000 real-user conversations with 16000 questions, along with entity annotations, completed question utterances, and question paraphrases. Experiments demonstrate the viability and advantages of our method, compared to state-of-the-art baselines.
Generating fluent and informative natural responses while maintaining representative internal states for search optimization is critical for conversational search systems. Existing approaches either 1) predict structured dialog acts first and then generate natural response; or 2) map conversation context to natural responses directly in an end-to-end manner. Both kinds of approaches have shortcomings. The former suffers from error accumulation while the semantic associations between structured acts and natural responses are confined in single direction. The latter emphasizes generating natural responses but fails to predict structured acts. Therefore, we propose a neural co-generation model that generates the two concurrently. The key lies in a shared latent space shaped by two informed priors. Specifically, we design structured dialog acts and natural response auto-encoding as two auxiliary tasks in an interconnected network architecture. It allows for the concurrent generation and bidirectional semantic associations. The shared latent space also enables asynchronous reinforcement learning for further joint optimization. Experiments show that our model achieves significant performance improvements.
Conversational recommender systems (CRSs) provide recommendations through interactive conversations. CRSs typically provide recommendations through relatively straightforward interactions, where the system continuously inquires about a user's explicit attribute-aware preferences and then decides which items to recommend. In addition, topic tracking is often used to provide naturally sounding responses. However, merely tracking topics is not enough to recognize a user's real preferences in a dialogue.
In this paper, we address the problem of accurately recognizing and maintaining user preferences in CRSs. Three challenges come with this problem: (1) An ongoing dialogue only provides the user's short-term feedback; (2) Annotations of user preferences are not available; and (3) There may be complex semantic correlations among items that feature in a dialogue. We tackle these challenges by proposing an end-to-end variational reasoning approach to the task of conversational recommendation. We model both long-term preferences and short-term preferences as latent variables with topical priors for explicit long-term and short-term preference exploration, respectively. We use an efficient stochastic gradient variational Bayesian (SGVB) estimator for optimizing the derived evidence lower bound. A policy network is then used to predict topics for a clarification utterance or items for a recommendation response. The use of explicit sequences of preferences with multi-hop reasoning in a heterogeneous knowledge graph helps to provide more accurate conversational recommendation results.
Extensive experiments conducted on two benchmark datasets show that our proposed method outperforms state-of-the-art baselines in terms of both objective and subjective evaluation metrics.
Conversational search is a crucial and promising branch in information retrieval. In this paper, we reveal that not all historical conversational turns are necessary for understanding the intent of the current query. The redundant noisy turns in the context largely hinder the improvement of search performance. However, enhancing the context denoising ability for conversational search is quite challenging due to data scarcity and the steep difficulty for simultaneously learning conversational query encoding and context denoising. To address these issues, in this paper, we present a novel Curriculum cOntrastive conTExt Denoising framework, COTED, towards few-shot conversational dense retrieval. Under a curriculum training order, we progressively endow the model with the capability of context denoising via contrastive learning between noised samples and denoised samples generated by a new conversation data augmentation strategy. Three curriculums tailored to conversational search are exploited in our framework. Extensive experiments on two few-shot conversational search datasets, i.e., CAsT-19 and CAsT-20, validate the effectiveness and superiority of our method compared with the state-of-the-art baselines.
Recently, pre-training methods have shown remarkable success in task-oriented dialog (TOD) systems. However, most existing pre-trained models for TOD focus on either dialog understanding or dialog generation, but not both. In this paper, we propose SPACE, a novel unified pre-trained dialog model learning from large-scale dialog corpora with limited annotations, which can be effectively fine-tuned on a wide range of downstream dialog tasks. Specifically, SPACE consists of four successive components in a single transformer to maintain a task-flow in TOD systems: (i) a dialog encoding module to encode dialog history, (ii) a dialog understanding module to extract semantic vectors from either user queries or system responses, (iii) a dialog policy module to generate a policy vector that contains high-level semantics of the response, and (iv) a dialog generation module to produce appropriate responses. We design a dedicated pre-training objective for each component. Concretely, we pre-train the dialog encoding module with span mask language modeling to learn contextualized dialog information. To capture the structured dialog semantics, we pre-train the dialog understanding module via a novel tree-induced semi-supervised contrastive learning objective with the help of extra dialog annotations. In addition, we pre-train the dialog policy module by minimizing the ℒ2 distance between its output policy vector and the semantic vector of the response for policy optimization. Finally, the dialog generation model is pre-trained by language modeling. Results show that SPACE achieves state-of-the-art performance on eight downstream dialog benchmarks, including intent prediction, dialog state tracking, and end-to-end dialog modeling. We also show that SPACE has a stronger few-shot ability than existing models under the low-resource setting.
Maintaining a consistent persona is essential for building a human-like conversational model. However, the lack of attention to the partner makes such models more egocentric: they tend to show their persona by all means, such as stiffly twisting the topic, pulling the conversation toward their own interests regardless of the partner, and rambling about their persona with little curiosity about the partner. In this work, we propose COSPLAY (COncept Set guided PersonaLized dialogue generation Across both partY personas), which considers both parties as a "team": expressing self-persona while keeping curiosity toward the partner, leading responses around mutual personas, and finding common ground. Specifically, we first represent the self-persona, the partner persona, and the mutual dialogue as concept sets. Then, we propose the Concept Set framework with a suite of knowledge-enhanced operations to process them, such as set algebras, set expansion, and set distance. Using these operations as a medium, we train the model by utilizing 1) concepts of both parties' personas, 2) the concept relationships between them, and 3) their relationship to the future dialogue. Extensive experiments on a large public dataset, Persona-Chat, demonstrate that our model outperforms state-of-the-art baselines at generating less egocentric, more human-like, and higher-quality responses in both automatic and human evaluations.
A proactive dialogue system is able to lead the conversation to a goal topic and has promising potential in bargaining, persuasion, and negotiation. The current corpus-based learning manner limits its practical application in real-world scenarios. To this end, we contribute to advancing the study of the proactive dialogue policy to a more natural and challenging setting, i.e., interacting dynamically with users. Further, we call attention to non-cooperative user behavior: the user talks about off-path topics when he/she is not satisfied with the previous topics introduced by the agent. We argue that the targets of reaching the goal topic quickly and maintaining high user satisfaction do not always converge, because the topics close to the goal and the topics the user prefers may not be the same. To address this issue, we propose a new solution named I-Pro that can learn a Proactive policy in the Interactive setting. Specifically, we learn the trade-off via a learned goal weight, which consists of four factors (dialogue turn, goal completion difficulty, user satisfaction estimation, and cooperative degree). The experimental results demonstrate that I-Pro significantly outperforms baselines in terms of effectiveness and interpretability.
Conversational recommender systems (CRS) aim to provide high-quality recommendations in conversations. However, most conventional CRS models mainly focus on the dialogue understanding of the current session, ignoring other rich multi-aspect information of the central subjects (i.e., users) in recommendation. In this work, we highlight that the user's historical dialogue sessions and look-alike users are essential sources of user preferences besides the current dialogue session in CRS. To systematically model the multi-aspect information, we propose a User-Centric Conversational Recommendation (UCCR) model, which returns to the essence of user preference learning in CRS tasks. Specifically, we propose a historical session learner to capture users' multi-view preferences from knowledge, semantic, and consuming views as supplements to the current preference signals. A multi-view preference mapper is used to learn the intrinsic correlations among different views in current and historical sessions via self-supervised objectives. We also design a temporal look-alike user selector to understand users via their similar users. The learned multi-aspect multi-view user preferences are then used for recommendation and dialogue generation. In experiments, we conduct comprehensive evaluations on both Chinese and English CRS datasets. The significant improvements over competitive models in both recommendation and dialogue generation verify the superiority of UCCR.
Asking clarifying questions is an interactive way to effectively clarify user intent. When a user submits a query, the search engine will return a clarifying question with several clickable items of sub-intents for clarification. According to the existing definition, the key to asking high-quality questions is to generate good descriptions for submitted queries and provided items. However, existing methods, which are mainly based on static knowledge bases, struggle to find descriptions for many queries because of the lack of entities within these queries and their corresponding items. For such a query, they are unable to generate an informative question. To alleviate this problem, we propose leveraging the top search results of the query to help generate better descriptions, because we deem that the top retrieved documents contain rich and relevant contexts of the query. Specifically, we first design a rule-based algorithm to extract description candidates from search results and rank them by various human-designed features. Then, we apply a learning-to-rank model and a generative model for generalization, further improving the quality of clarifying questions. Experimental results show that our proposed methods can generate more readable and informative questions compared with existing methods. The results demonstrate that search results can be utilized to improve users' search experience for search clarification in conversational search systems.
ADPL: Adversarial Prompt-based Domain Adaptation for Dialogue Summarization with Knowledge Disentanglement
Traditional dialogue summarization models rely on a large-scale manually-labeled corpus, lacking generalization ability to new domains, and domain adaptation from a labeled source domain to an unlabeled target domain is important in practical summarization scenarios. However, existing domain adaptation works in dialogue summarization generally require large-scale pre-training using extensive external data. To explore the lightweight fine-tuning methods, in this paper, we propose an efficient Adversarial Disentangled Prompt Learning (ADPL) model for domain adaptation in dialogue summarization. We introduce three kinds of prompts including domain-invariant prompt (DIP), domain-specific prompt (DSP), and task-oriented prompt (TOP). DIP aims to disentangle and transfer the shared knowledge from the source domain and target domain in an adversarial way, which improves the accuracy of prediction about domain-invariant information and enhances the ability for generalization to new domains. DSP is designed to guide our model to focus on domain-specific knowledge using domain-related features. TOP is to capture task-oriented knowledge to generate high-quality summaries. Instead of fine-tuning the whole pre-trained language model (PLM), we only update the prompt networks but keep PLM fixed. Experimental results on the zero-shot setting show that the novel design of prompts can yield more coherent, faithful, and relevant summaries than baselines using the prefix-tuning, and perform at par with fine-tuning while being more efficient. Overall, our work introduces a prompt-based perspective to the zero-shot learning for dialogue summarization task and provides valuable findings and insights for future research.
Conversational recommender systems (CRS) enable traditional recommender systems to interact with users by asking questions about attributes and recommending items. The attribute-level and item-level feedback of users can be utilized to estimate users' preferences. However, existing works do not fully exploit the advantage of explicit item feedback: they only use the item feedback in rather implicit ways such as updating the latent user and item representation. Since CRS has multiple chances to interact with users, leveraging the context in the conversation may help infer users' implicit feedback (e.g., some specific attributes) when recommendations get rejected. To address the limitations of existing methods, we propose a new CRS framework called Conversational Recommender with Implicit Feedback (CRIF). CRIF formulates the conversational recommendation scheme as a four-phase process consisting of offline representation learning, tracking, decision, and inference. In the inference module, by fully utilizing the relation between users' attribute-level and item-level feedback, our method can explicitly deduce users' implicit preferences. Therefore, CRIF is able to achieve more accurate user preference estimation. Besides, in the decision module, to better utilize the attribute-level and item-level feedback, we adopt inverse reinforcement learning to learn a flexible decision strategy that selects the suitable action at each conversation turn. Through extensive experiments on four benchmark CRS datasets, we validate the effectiveness of our approach, which significantly outperforms the state-of-the-art CRS methods.
SESSION: Topic 4: Cross Domain IR
Data sparsity is a long-standing problem in recommender systems. To alleviate it, Cross-Domain Recommendation (CDR), which utilizes the rich user-item interaction information from a related source domain to improve performance on the sparse target domain, has attracted a surge of interest. Recent CDR approaches pay attention to aggregating the source domain information to generate better user representations for the target domain. However, they focus on designing more powerful interaction encoders to learn both domains simultaneously, but fail to model the different user preferences of different domains. In particular, domain-specific preferences of the source domain usually provide useless information for enhancing performance in the target domain, and directly aggregating the domain-shared and domain-specific information together may hurt target domain performance. This work considers a key challenge of CDR: How do we transfer shared information across domains? Grounded in information theory, we propose DisenCDR, a novel model to disentangle the domain-shared and domain-specific information. To reach our goal, we propose two mutual-information-based disentanglement regularizers. Specifically, an exclusive regularizer aims to enforce that the user domain-shared representations and domain-specific representations encode exclusive information. An information regularizer encourages the user domain-shared representations to encode predictive information for both domains. Based on them, we further derive a tractable bound of our disentanglement objective to learn desirable disentangled representations. Extensive experiments show that DisenCDR achieves significant improvements over state-of-the-art baselines on four real-world datasets.
The goal of cross-domain retrieval (CDR) is to search for instances of the same category in one domain by using a query from another domain. Existing CDR approaches mainly consider the standard scenario that the cross-domain data for both training and testing come from the same categories and underlying distributions. However, these methods cannot be well extended to the newly emerging task of universal cross-domain retrieval (UCDR), where the testing data belong to the domain and categories not present during training. Compared to CDR, the UCDR task is more challenging due to (1) visually diverse data from multi-source domains, (2) the domain shift between seen and unseen domains, and (3) the semantic shift across seen and unseen categories. To tackle these problems, we propose a novel model termed Structure-Aware Semantic-Aligned Network (SASA) to align the heterogeneous representations of multi-source domains without loss of generalizability for the UCDR task. Specifically, we leverage the advanced Vision Transformer (ViT) as the backbone and devise a distillation-alignment ViT (DAViT) with a novel token-based strategy, which incorporates two complementary distillation and alignment tokens into the ViT architecture. In addition, the distillation token is devised to improve the generalizability of our model by structure information preservation and the alignment token is used to improve discriminativeness with trainable categorical prototypes. Extensive experiments on three large-scale benchmarks, i.e., Sketchy, TU-Berlin, and DomainNet, demonstrate the superiority of our SASA method over the state-of-the-art UCDR and ZS-SBIR methods.
Interactive recommender systems (IRS) have received wide attention in recent years. To capture users' dynamic preferences and maximize their long-term engagement, IRS are usually formulated as reinforcement learning (RL) problems. Despite the promise to solve complex decision-making problems, RL-based methods generally require a large amount of online interaction, restricting their applications due to economic considerations. One possible direction to alleviate this issue is cross-domain recommendation that aims to leverage abundant logged interaction data from a source domain (e.g., adventure genre in movie recommendation) to improve the recommendation quality in the target domain (e.g., crime genre). Nevertheless, prior studies mostly focus on adapting the static representations of users/items. Few have explored how the temporally dynamic user-item interaction patterns transform across domains.
Motivated by the above consideration, we propose DACIR, a novel Doubly-Adaptive deep RL-based framework for Cross-domain Interactive Recommendation. We first pinpoint how users behave differently in two domains and highlight the potential to leverage the shared user dynamics to boost IRS. To transfer static user preferences across domains, DACIR enforces consistency of item representation by aligning embeddings into a shared latent space. In addition, given the user dynamics in IRS, DACIR calibrates the dynamic interaction patterns in two domains via reward correlation. Once the double adaptation narrows the cross-domain gap, we are able to learn a transferable policy for the target recommender by leveraging logged data. Experiments on real-world datasets validate the superiority of our approach, which consistently achieves significant improvements over the baselines.
Cross-domain Named Entity Recognition (NER) aims to transfer knowledge from the source domain to the target, alleviating expensive labeling costs in the target domain. Most prior studies acquire domain-invariant features under the end-to-end sequence-labeling framework where each token is assigned a compositional label (e.g., B-LOC). However, this complicated labeling scheme may increase the complexity of cross-domain transfer, which leads to sub-optimal results, especially when there are significantly distinct entity categories across domains. In this paper, we aim to explore task decomposition in cross-domain NER. Concretely, we suggest a modular learning approach in which two sub-tasks (entity span detection and type classification) are learned by separate functional modules that perform their respective cross-domain transfer with corresponding strategies. Compared with the compositional labeling scheme, the label spaces are smaller and closer across domains, especially in entity span detection, leading to easier transfer in each sub-task. We then combine the two sub-tasks to obtain the final result with a modular interaction mechanism, and deploy adversarial regularization for generalized and robust learning in low-resource target domains. Extensive experiments over 10 diverse domain pairs demonstrate that the proposed method is superior to state-of-the-art cross-domain NER methods in an end-to-end fashion (an average absolute F1 score increase of about 6.4%). Further analyses show the effectiveness of modular task decomposition and its great potential in cross-domain NER.
Exploiting Variational Domain-Invariant User Embedding for Partially Overlapped Cross Domain Recommendation
Cross-Domain Recommendation (CDR) has been widely studied as a way to utilize knowledge from different domains to solve the cold-start problem in recommender systems. Most of the existing CDR models assume that both the source and target domains share the same overlapped user set for knowledge transfer. However, only a small proportion of users are simultaneously active in both the source and target domains in practical CDR tasks. In this paper, we focus on the Partially Overlapped Cross-Domain Recommendation (POCDR) problem, that is, how to leverage the information of both the overlapped and non-overlapped users to improve recommendation performance. Existing approaches cannot fully utilize the useful knowledge behind the non-overlapped users across domains, which limits the model performance when the majority of users turn out to be non-overlapped. To address this issue, we propose an end-to-end Dual-autoencoder with Variational Domain-invariant Embedding Alignment (VDEA) model, a cross-domain recommendation framework for the POCDR problem, which utilizes dual variational autoencoders with both local and global embedding alignment for exploiting domain-invariant user embedding. VDEA first adopts variational inference to capture collaborative user preferences, and then utilizes Gromov-Wasserstein distribution co-clustering optimal transport to cluster the users with similar rating interaction behaviors. Our empirical studies on Douban and Amazon datasets demonstrate that VDEA significantly outperforms the state-of-the-art models, especially under the POCDR setting.
SESSION: Topic 5: CTR and Conversion Rate Prediction
Click-through rate (CTR) prediction plays an important role in online advertising and recommendation systems, which aims at estimating the probability of a user clicking on a specific item. Feature interaction modeling and user interest modeling methods are two popular domains in CTR prediction, and they have been studied extensively in recent years. However, these methods still suffer from two limitations. First, traditional methods regard item attributes as ID features, while neglecting structure information and relation dependencies among attributes. Second, when mining user interests from user-item interactions, current models ignore user intents and item intents for different attributes, which lacks interpretability. Based on this observation, in this paper, we propose a novel approach Hierarchical Intention Embedding Network (HIEN), which considers dependencies of attributes based on bottom-up tree aggregation in the constructed attribute graph. HIEN also captures user intents for different item attributes as well as item intents based on our proposed hierarchical attention mechanism. Extensive experiments on both public and production datasets show that the proposed model significantly outperforms the state-of-the-art methods. In addition, HIEN can be applied as an input module to state-of-the-art CTR prediction methods, bringing further performance lift for these existing models that might already be intensively used in real systems.
Click-Through Rate (CTR) prediction has been widely used in many machine learning tasks such as online advertising and personalization recommendation. Unfortunately, given a domain-specific dataset, searching effective feature interaction operations and combinations from a huge candidate space requires significant expert experience and computational costs. Recently, Neural Architecture Search (NAS) has achieved great success in discovering high-quality network architectures automatically. However, due to the diversity of feature interaction operations and combinations, the existing NAS-based work that treats the architecture search as a black-box optimization problem over a discrete search space suffers from low efficiency. Therefore, it is essential to explore a more efficient architecture search method. To achieve this goal, we propose NAS-CTR, a differentiable neural architecture search approach for CTR prediction. First, we design a novel and expressive architecture search space and a continuous relaxation scheme to make the search space differentiable. Second, we formulate the architecture search for CTR prediction as a joint optimization problem with discrete constraints on architectures and leverage proximal iteration to solve the constrained optimization problem. Additionally, a straightforward yet effective method is proposed to eliminate the aggregation of skip connections. Extensive experimental results reveal that NAS-CTR can outperform the SOTA human-crafted architectures and other NAS-based methods in both test accuracy and search efficiency.
CTR prediction has been widely used in the real world. Many methods model feature interaction to improve their performance. However, most methods only learn a fixed representation for each feature without considering the varying importance of each feature under different contexts, resulting in inferior performance. Recently, several methods tried to learn vector-level weights for feature representations to address the fixed representation issue. However, they only produce linear transformations to refine the fixed feature representations, which are still not flexible enough to capture the varying importance of each feature under different contexts. In this paper, we propose a novel module named Feature Refinement Network (FRNet), which learns context-aware feature representations at bit-level for each feature in different contexts. FRNet consists of two key components: 1) Information Extraction Unit (IEU), which captures contextual information and cross-feature relationships to guide context-aware feature refinement; and 2) Complementary Selection Gate (CSGate), which adaptively integrates the original and complementary feature representations learned in IEU with bit-level weights. Notably, FRNet is orthogonal to existing CTR methods and thus can be applied in many existing methods to boost their performance. Comprehensive experiments are conducted to verify the effectiveness, efficiency, and compatibility of FRNet.
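The gating idea described above can be illustrated with a minimal sketch. It assumes that the Complementary Selection Gate blends the original and complementary representations with per-dimension sigmoid weights, so that each embedding dimension ("bit") gets its own mixing weight; the function names and the exact blending formula are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def selection_gate(original, complementary, gate_logits):
    """Blend two feature vectors with per-dimension ("bit-level") weights.

    Assumed form (hypothetical, for illustration only):
        out[i] = g[i] * original[i] + (1 - g[i]) * complementary[i]
    where g[i] = sigmoid(gate_logits[i]).
    """
    if not (len(original) == len(complementary) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    out = []
    for e, c, z in zip(original, complementary, gate_logits):
        g = sigmoid(z)
        out.append(g * e + (1.0 - g) * c)
    return out
```

A vector-level gate would share one weight across all dimensions of a feature; the per-dimension weights here are what makes the refinement bit-level in this sketch.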
Click-Through Rate (CTR) prediction, which aims to estimate the probability that a user will click an item, is an essential component of online advertising. Existing methods mainly attempt to mine user interests from users' historical behaviours, which contain users' directly interacted items. Although these methods have made great progress, they are often limited by the recommender system's direct exposure and inactive interactions, and thus fail to mine all potential user interests. To tackle these problems, we propose Neighbor-Interaction based CTR prediction (NI-CTR), which considers this task under a Heterogeneous Information Network (HIN) setting. In short, Neighbor-Interaction based CTR prediction involves the local neighborhood of the target user-item pair in the HIN to predict their linkage. In order to guide the representation learning of the local neighbourhood, we further consider different kinds of interactions among the local neighborhood nodes from both explicit and implicit perspectives, and propose a novel Graph-Masked Transformer (GMT) to effectively incorporate these kinds of interactions to produce highly representative embeddings for the target user-item pair. Moreover, in order to improve model robustness against neighbour sampling, we enforce a consistency regularization loss over the neighbourhood embedding. We conduct extensive experiments on two real-world datasets with millions of instances and the experimental results show that our proposed method outperforms state-of-the-art CTR models significantly. Meanwhile, the comprehensive ablation studies verify the effectiveness of every component of our model. Furthermore, we have deployed this framework on the WeChat Official Account Platform with billions of users. The online A/B tests demonstrate an average CTR improvement of 21.9% against all online baselines.
Accurate estimation of post-click conversion rate (CVR) is critical for building recommender systems, but it has long been confronted with sample selection bias and data sparsity issues. Methods in the Entire Space Multi-task Model (ESMM) family leverage the sequential pattern of user actions, i.e., impression → click → conversion, to address the data sparsity issue. However, they still fail to ensure the unbiasedness of CVR estimates. In this paper, we theoretically demonstrate that ESMM suffers from the following two problems: (1) Inherent Estimation Bias (IEB) for CVR estimation, where the CVR estimate is inherently higher than the ground truth; (2) Potential Independence Priority (PIP) for CTCVR estimation, where ESMM might overlook the causality from click to conversion. To this end, we devise a principled approach named Entire Space Counterfactual Multi-task Modelling (ESCM²), which employs a counterfactual risk minimizer as a regularizer in ESMM to address both IEB and PIP issues simultaneously. Extensive experiments on offline datasets and online environments demonstrate that our proposed ESCM² can largely mitigate the inherent IEB and PIP issues and achieve better performance than baseline models.
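The entire-space decomposition that ESMM-style models rely on can be illustrated with a short sketch. It assumes the standard factorization pCTCVR = pCTR × pCVR along the impression → click → conversion path. The function names, and the choice of an inverse-propensity-weighted log loss as the counterfactual term, are illustrative assumptions rather than the authors' implementation.

```python
import math


def ctcvr(p_ctr: float, p_cvr: float) -> float:
    """Probability of click-and-convert along impression -> click -> conversion."""
    return p_ctr * p_cvr


def ips_cvr_loss(samples, eps=1e-6):
    """Inverse-propensity-weighted log loss for CVR (illustrative assumption).

    Each sample is a tuple (clicked, converted, p_ctr, p_cvr). A conversion
    label is only observed for clicked samples, so each clicked sample is
    reweighted by 1 / p_ctr and the sum is averaged over all impressions.
    """
    if not samples:
        return 0.0
    total = 0.0
    for clicked, converted, p_ctr, p_cvr in samples:
        if not clicked:
            continue  # conversion label is unobserved without a click
        p = min(max(p_cvr, eps), 1.0 - eps)  # clamp to keep log() finite
        log_loss = -(converted * math.log(p) + (1 - converted) * math.log(1.0 - p))
        total += log_loss / max(p_ctr, eps)
    return total / len(samples)
```

In this sketch the reweighting by the click propensity is what lets a loss computed only on clicked samples stand in for a loss over the whole impression space; a small predicted click probability makes the weight large, which is why the propensity is clamped.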
SESSION: Topic 6: Domain-Specific IR
The related work section is an important component of a scientific paper, which highlights the contribution of the target paper in the context of the reference papers. Authors can save their time and effort by using the automatically generated related work section as a draft to complete the final related work. Most of the existing related work section generation methods rely on extracting off-the-shelf sentences to make a comparative discussion about the target work and the reference papers. However, such sentences need to be written in advance and are hard to obtain in practice. Hence, in this paper, we propose an abstractive target-aware related work generator (TAG), which can generate related work sections consisting of new sentences. Concretely, we first propose a target-aware graph encoder, which models the relationships between reference papers and the target paper with target-centered attention mechanisms. In the decoding process, we propose a hierarchical decoder that attends to the nodes of different levels in the graph with keyphrases as semantic indicators. Finally, to generate a more informative related work, we propose multi-level contrastive optimization objectives, which aim to maximize the mutual information between the generated related work with the references and minimize that with non-references. Extensive experiments on two public scholar datasets show that the proposed model brings substantial improvements over several strong baselines in terms of automatic and tailored human evaluations.
Information seeking in an academic digital library is complex in nature, often spanning multiple search sessions. Resuming academic search tasks requires significant cognitive effort as searchers must re-acquaint themselves with previous search session activities and previously discovered documents before resuming their search. Further, some academic searchers may find it convenient to initiate such searches on their mobile devices during short gaps in time (e.g., between classes), and resume them later in a desktop environment when they can use the extra screen space and more convenient document storage capabilities of their computers. To support such searching, we have developed an academic digital library search interface that assists searchers in managing cross-session search tasks even when moving between mobile and desktop environments. Using a controlled laboratory study we compared our approach (Dilex) to a standard academic digital library search interface. We found increased user engagement in both the initial (mobile) and resumed (desktop) search activities, and that participants spent more time on the search results pages and had an increased degree of interaction with information and personalization features during the resumed tasks. These results provide evidence that the participants were able to make effective use of the visualization features in Dilex, which enabled them to readily resume their search tasks and stay engaged in the search activities. This work represents an example of how semi-automatic search task/session management and visualization features can support cross-session search, and how designing for both mobile and desktop use can support cross-device search.
With the ever-increasing popularity of microservice architecture, a considerable number of enterprises and organizations have encapsulated their complex business services into various lightweight functions and published them as accessible APIs (Application Programming Interfaces). Through keyword search, a software developer can select a set of APIs from a massive number of candidates to implement the functions of a complex mashup, which reduces the development cost significantly. However, traditional keyword search methods for APIs often suffer from several critical issues such as functional incompatibility and limited diversity in search results, which may lead to mashup creation failures and lower development productivity. To deal with these challenges, this paper designs DAWAR, a diversity-aware Web APIs recommendation approach that finds diversified and compatible APIs for mashup creation. Specifically, the API recommendation problem for mashup creation is modelled as a graph search problem that aims to find the minimal group Steiner trees in a correlation graph of APIs. DAWAR innovatively employs determinantal point processes to diversify the recommended results. Empirical evaluation is performed on commonly used real-world datasets, and the statistical results show that DAWAR is able to achieve significant improvements in terms of recommendation diversity, accuracy, and compatibility.
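How a determinantal point process favours diverse result sets can be sketched with a small greedy selection routine. This is a generic illustration of DPP-style diversification under stated assumptions, not DAWAR's algorithm: the kernel matrix is assumed to hold pairwise similarities between candidate items, and the helper names `determinant` and `greedy_dpp` are hypothetical.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular (or numerically singular) matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_dpp(kernel, k):
    """Greedily pick up to k items maximizing the kernel submatrix determinant.

    Similar items make the submatrix nearly singular, so its determinant
    shrinks and the greedy step prefers items unlike those already chosen.
    """
    selected = []
    candidates = set(range(len(kernel)))
    while len(selected) < k and candidates:
        def score(i):
            idx = selected + [i]
            return determinant([[kernel[r][c] for c in idx] for r in idx])

        best = max(sorted(candidates), key=score)  # ties go to the lowest index
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a kernel in which items 0 and 1 are near-duplicates and item 2 is dissimilar to both, the routine keeps item 0 and then prefers item 2 over item 1. Recomputing determinants from scratch is only practical for small candidate sets.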
Knowledge tracing (KT), which aims at predicting learners' knowledge mastery, plays an important role in computer-aided educational systems. The goal of KT is to provide personalized learning paths for learners by diagnosing their mastery of each piece of knowledge, thus improving learning efficiency. In recent years, many deep learning models have been applied to tackle the KT task and have shown promising results. However, most existing methods simplify the exercising records as knowledge sequences, failing to explore the rich information contained in the exercises. Besides, the existing diagnosis results of knowledge tracing are not convincing enough since they neglect hierarchical relations between exercises. To solve the above problems, we propose a hierarchical graph knowledge tracing model called HGKT to explore the latent complex relations between exercises. Specifically, we introduce the concept of problem schema to construct a hierarchical exercise graph that can model the exercise learning dependencies. Moreover, we employ two attention mechanisms to highlight important historical states of learners. In the testing stage, we present a knowledge&schema diagnosis matrix that can trace the transition of mastery of knowledge and problem schema, and that can be more easily applied to different applications. Extensive experiments show the effectiveness and interpretability of our proposed model.
Computerized Adaptive Testing (CAT) is a promising testing mode in personalized online education (e.g., GRE), which aims at measuring a student's proficiency accurately while reducing test length. The "adaptive" aspect is reflected in its selection algorithm, which retrieves the best-suited questions for a student based on his/her estimated proficiency at each test step. Although there are many sophisticated selection algorithms for improving CAT's effectiveness, they are restricted and perturbed by the accuracy of the current proficiency estimate, and thus lack robustness. To this end, we investigate a general method to enhance the robustness of existing algorithms by leveraging the student's "multi-facet" nature during tests. Specifically, we present a generic optimization criterion, Robust Adaptive Testing (RAT), for proficiency estimation via fusing multiple estimates at each step, which maintains a multi-facet description of the student's potential proficiency. We further provide theoretical analyses of the estimator's desirable statistical properties: asymptotic unbiasedness, efficiency, and consistency. Extensive experiments on perturbed synthetic data and three real-world datasets show that selection algorithms in our RAT framework are robust and yield substantial improvements.
Knowledge Tracing (KT), which aims to assess students' dynamic knowledge states as they practice on various questions, is a fundamental research task for offering intelligent services in online learning systems. Researchers have devoted significant efforts to developing KT models with impressive performance. However, in existing KT methods, the related question difficulty level, which directly affects students' knowledge state in learning, has not been effectively explored and employed. In this paper, we focus on exploring the effect of question difficulty on learning to improve the assessment of students' knowledge state, and propose the DIfficulty Matching Knowledge Tracing (DIMKT) model. Specifically, we first explicitly incorporate the difficulty level into the question representation. Then, to establish the relation between students' knowledge state and the question difficulty level during the practice process, we design an adaptive sequential neural network in three stages: (1) measuring students' subjective feelings of the question difficulty before practice; (2) estimating students' personalized knowledge acquisition while answering questions of different difficulty levels; (3) updating students' knowledge state to varying degrees to match the question difficulty level after practice. Finally, we conduct extensive experiments on real-world datasets, and the results demonstrate that DIMKT outperforms state-of-the-art KT models. Moreover, DIMKT shows superior interpretability by exploring the question difficulty effect when making predictions. Our code is available at https://github.com/shshen-closer/DIMKT.
The truncation of ranking lists predicted by retrieval models is vital to ensure users' search experience. Particularly, in specific vertical domains where documents are usually complicated and extensive (e.g., legal cases), the cost of browsing results is much higher than traditional IR tasks (e.g., Web search) and setting a reasonable cut-off position is quite necessary. While it is straightforward to apply existing result list truncation approaches to legal case retrieval, the effectiveness of these methods is limited because they only focus on simple document statistics and usually fail to capture the context information of documents in the ranking list. These existing efforts also treat result list truncation as an isolated task instead of a component in the entire ranking process, limiting the usage of truncation in practical systems. To tackle these limitations, we propose LeCut, a ranking list truncation model for legal case retrieval. LeCut utilizes contextual features of the retrieval task to capture the semantic-level similarity between documents and decides the best cut-off position with attention mechanisms. We further propose a Joint Optimization of Truncation and Reranking (JOTR) framework based on LeCut to improve the performance of truncation and retrieval tasks simultaneously. Comparison against competitive baselines on public benchmark datasets demonstrates the effectiveness of LeCut and JOTR. A case study is conducted to visualize the cut-off positions of LeCut and the process of how JOTR improves both retrieval and truncation tasks.
MetaCare++: Meta-Learning with Hierarchical Subtyping for Cold-Start Diagnosis Prediction in Healthcare Data
Cold-start diagnosis prediction is a challenging task for AI in healthcare, where often only a few visits per patient and a few observations per disease can be exploited. Although meta-learning is widely adopted to address the data sparsity problem in general domains, directly applying it to healthcare data is less effective, since it is unclear how to capture both the temporal relations in clinical visits and the complicated relations among syndromic diseases for precise personalized diagnosis. To this end, we first propose a novel Meta-learning framework for cold-start diagnosis prediction in healthCare data (MetaCare). By explicitly encoding the effects of disease progress over time as a generalization prior, MetaCare dynamically predicts future diagnosis and timestamp for infrequent patients. Then, to model complicated relations among rare diseases, we propose to utilize domain knowledge of hierarchical relations among diseases, and further perform diagnosis subtyping to mine the latent syndromic relations among diseases. Finally, to tailor the generic meta-learning framework with personalized parameters, we design a hierarchical patient subtyping mechanism and bridge the modeling of both infrequent patients and rare diseases. We term the joint model as MetaCare++. Extensive experiments on two real-world benchmark datasets show significant performance gains brought by MetaCare++, yielding average improvements of 7.71% for diagnosis prediction and 13.94% for diagnosis time prediction over the state-of-the-art baselines.
Biomedical natural language processing often involves the interpretation of patient descriptions, for instance for diagnosis or for recommending treatments. Current methods, based on biomedical language models, have been found to struggle with such tasks. Moreover, retrieval augmented strategies have only had limited success, as it is rare to find sentences which express the exact type of knowledge that is needed for interpreting a given patient description. For this reason, rather than attempting to retrieve explicit medical knowledge, we instead propose to rely on a nearest neighbour strategy. First, we retrieve text passages that are similar to the given patient description, and are thus likely to describe patients in similar situations, while also mentioning some hypothesis (e.g., a possible diagnosis of the patient). We then judge the likelihood of the hypothesis based on the similarity of the retrieved passages. Identifying similar cases is challenging, however, as descriptions of similar patients may superficially look rather different, among other reasons because they often contain an abundance of irrelevant details. To address this challenge, we propose a strategy that relies on a distantly supervised cross-encoder. Despite its conceptual simplicity, we find this strategy to be effective in practice.
Attributed networks, as a manifestation of data in non-Euclidean domains, have a wide range of applications in the real world, such as molecular property prediction, social network analysis and anomaly detection. Node classification, as a fundamental research problem in attributed networks, has attracted increasing attention among research communities. However, most existing models cannot be directly applied to data with limited labeled instances (i.e., the few-shot scenario). Few-shot node classification on attributed networks is gradually becoming a research hotspot. Although several methods aim to integrate meta-learning with graph neural networks to address this problem, some limitations remain. First, they all assume node representation learning using graph neural networks in homophilic graphs. Hence, suboptimal performance is obtained when these models are applied to heterophilic graphs. Second, existing models based on meta-learning entirely depend on instance-based statistics, which in few-shot settings are unavoidably degraded by data noise or outliers. Third, most previous models treat all sampled tasks equally and fail to adapt to their uniqueness, which has a significant impact on the overall performance of the model. To solve the above three limitations, we propose a novel graph Meta-learning framework called Graph learning based on Prototype and Scaling & shifting transformation (Meta-GPS). More specifically, we introduce an efficient method for learning expressive node representations even on heterophilic graphs and propose utilizing a prototype-based approach to initialize parameters in meta-learning. Moreover, we also leverage S$^2$ (scaling & shifting) transformation to learn effective transferable knowledge from diverse tasks. 
Extensive experimental results on six real-world datasets demonstrate the superiority of our proposed framework, which outperforms other state-of-the-art baselines by up to 13% absolute improvement in terms of related metrics.
Fashion Compatibility Modeling (FCM) is a new yet challenging task, which aims to automatically assess the matching degree among a set of complementary items. Most existing methods evaluate fashion compatibility from a common perspective, but overlook the user's personal preference. Inspired by this, a few pioneers have studied Personalized Fashion Compatibility Modeling (PFCM). Despite their significance, these PFCM methods mainly concentrate on the user and item entities, as well as their interactions, but ignore the attribute entities, which contain rich semantics. To address this problem, we propose to fully explore the related entities and their relations involved in PFCM to boost PFCM performance. This is, however, non-trivial due to the heterogeneous contents of different entities, embeddings for new users, and various high-order relations. Towards these ends, we present a novel metapath-guided personalized fashion compatibility modeling scheme, dubbed MG-PFCM. In particular, we build a heterogeneous graph to unify the three types of entities (i.e., users, items, and attributes) and their relations (i.e., user-item interactions, item-item matching relations, and item-attribute association relations). Thereafter, we design a multi-modal content-oriented user embedding module to learn user representations by inheriting the contents of their interacted items. Meanwhile, we define user-oriented and item-oriented metapaths, and perform metapath-guided heterogeneous graph learning to enhance the user and item embeddings. In addition, we introduce contrastive regularization to improve the model performance. We conduct extensive experiments on a real-world benchmark dataset, which verify the superiority of our proposed scheme over several cutting-edge baselines. As a byproduct, we have released our source code to benefit other researchers.
Health thread recommendation methods aim to suggest the most relevant existing threads for a user. Most of the existing methods tend to rely on modeling the post contents to retrieve relevant answers. However, posts written by users with different clinical conditions can be lexically similar, as unrelated diseases (e.g., Angina and Osteoporosis) may have the same symptoms (e.g., back pain), yet the corresponding threads are irrelevant to the user. Therefore, it is critical to consider not only the connections between users and threads, but also the descriptions of users' symptoms and clinical conditions. In this paper, to address this problem of thread recommendation in online healthcare forums, we propose a knowledge graph enhanced Threads Recommendation (KETCH) model, which leverages graph neural networks to model the interactions among users and threads, and learn their representations. In our model, the users, threads and posts are three types of nodes in a graph, linked through their associations. KETCH uses a message passing strategy, aggregating information along the network. In addition, we introduce a knowledge-enhanced attention mechanism to capture the latent conditions and symptoms. We also apply the method to the task of predicting the side effects of drugs, to show that KETCH has the potential to complement the medical knowledge graph. Compared with the best results of seven competing methods, in terms of MRR, KETCH outperforms all methods by at least 0.125 on the MedHelp dataset, 0.048 on the Patient dataset and 0.092 on the HealthBoards dataset, respectively. We release the source code of KETCH at: https://github.com/cuilimeng/KETCH.
Online healthcare services can provide unlimited and timely medical information to users, which promotes social good and breaks down the barriers of location. However, understanding the user intents behind medical queries is a challenging problem. Medical search queries are usually short and noisy, lack strict syntactic structure, and require a professional background to understand the medical terms. The medical intents are fine-grained, making them hard to recognize. In addition, many intents have only a few labeled examples. To handle these problems, we propose a few-shot learning method for medical search query intent recognition called MEDIC. We extract co-click queries from user search logs as weak supervision to compensate for the lack of labeled data. We also design a new query encoder which learns to represent queries as a combination of semantic knowledge recorded in an external medical knowledge graph, syntactic knowledge which marks the grammatical role of each word in the query, and generic knowledge which is captured by language models pretrained on a large-scale text corpus. Experimental results on a real medical search query intent recognition dataset validate the effectiveness of MEDIC.
SESSION: Topic 7: Efficiency
As a crucial component of most modern deep recommender systems, feature embedding maps high-dimensional sparse user/item features into low-dimensional dense embeddings. However, these embeddings are usually assigned a unified dimension, which suffers from the following issues: (1) high memory usage and computation cost; (2) sub-optimal performance due to inferior dimension assignments. In order to alleviate these issues, some works focus on automated embedding dimension search by formulating it as a hyper-parameter optimization or embedding pruning problem. However, they either require a well-designed search space for hyperparameters or need time-consuming optimization procedures. In this paper, we propose a Single-Shot Embedding Dimension Search method, called SSEDS, which can efficiently assign dimensions for each feature field via a single-shot embedding pruning operation while maintaining the recommendation accuracy of the model. Specifically, it introduces a criterion for identifying the importance of each embedding dimension for each feature field. As a result, SSEDS can automatically obtain mixed-dimensional embeddings by explicitly reducing redundant embedding dimensions based on the corresponding dimension importance ranking and the predefined parameter budget. Furthermore, the proposed SSEDS is model-agnostic, meaning that it can be integrated into different base recommendation models. Extensive offline experiments are conducted on two widely used public datasets for the CTR (Click Through Rate) prediction task, and the results demonstrate that SSEDS can still achieve strong recommendation performance even after removing 90% of the parameters. Moreover, SSEDS has also been deployed on the WeChat Subscription platform for practical recommendation services. The 7-day online A/B test results show that SSEDS can significantly improve the performance of the online recommendation model while reducing resource consumption.
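The budgeted pruning step described above (rank embedding dimensions by importance, keep as many as the parameter budget allows) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the SSEDS importance criterion, so the scores below are made-up numbers, and the function name `prune_by_budget` and the field names are hypothetical.

```python
def prune_by_budget(importance, budget):
    """Keep the `budget` highest-scoring (field, dimension) pairs.

    importance: dict mapping a feature field to a list of per-dimension
                importance scores (hypothetical values, not SSEDS's criterion).
    budget:     total number of embedding dimensions to keep across all fields.
    Returns a dict mapping each field to the sorted indices of kept dimensions.
    """
    # Flatten to (score, field, dim) triples and rank globally by score.
    scored = [
        (score, field, dim)
        for field, scores in importance.items()
        for dim, score in enumerate(scores)
    ]
    scored.sort(key=lambda triple: triple[0], reverse=True)

    kept = {field: [] for field in importance}
    for _, field, dim in scored[:budget]:
        kept[field].append(dim)
    return {field: sorted(dims) for field, dims in kept.items()}


# Two fields with 4 dimensions each; the budget keeps 4 of the 8 dimensions.
importance = {
    "user_id": [0.9, 0.1, 0.5, 0.05],
    "item_id": [0.8, 0.7, 0.02, 0.3],
}
kept = prune_by_budget(importance, budget=4)
print(kept)
```

Because the ranking is global rather than per field, different fields can end up with different numbers of dimensions, which is the sense in which the resulting embeddings are mixed-dimensional.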
With the development of deep learning techniques, deep recommendation models also achieve remarkable improvements in terms of recommendation accuracy. However, due to the large number of candidate items in practice and the high cost of preference computation, these methods also suffer from low efficiency of recommendation. The recently proposed tree-based deep recommendation models alleviate the problem by directly learning tree structure and representations under the guidance of recommendation objectives. However, such models have two shortcomings. First, the max-heap assumption in the hierarchical tree, in which the preference for a parent node should be the maximum between the preferences for its children, is difficult to satisfy in their binary classification objectives. Second, the learned index only includes a single tree, which is different from the widely-used multiple trees index, providing an opportunity to improve the accuracy of recommendation.
To this end, we propose a Deep Forest-based Recommender (DeFoRec for short) for efficient recommendation. In DeFoRec, all the trees generated during the training process are retained to form the forest. When learning the node representations of each tree, we have to satisfy the max-heap assumption as much as possible and mimic beam search behavior over the tree in the training stage. DeFoRec achieves this by treating the training task as multi-class classification over tree nodes at the same level. However, the number of tree nodes grows exponentially with the level, so we train the preference model with the sampled-softmax technique. The experiments are conducted on real-world datasets, validating the effectiveness of the proposed preference model learning method and tree learning method.
Deep neural networks (DNNs) demonstrate significant advantages in improving ranking performance in retrieval tasks. Driven by the recent developments in optimization and generalization of DNNs, learning a neural ranking model online from its interactions with users becomes possible. However, the required exploration for model learning has to be performed in the entire neural network parameter space, which is prohibitively expensive and limits the application of such online solutions in practice.
In this work, we propose an efficient exploration strategy for online interactive neural ranker learning based on bootstrapping. Our solution is based on an ensemble of ranking models trained with perturbed user click feedback. The proposed method eliminates explicit confidence set construction and the associated computational overhead, which enables the online neural rankers training to be efficiently executed in practice with theoretical guarantees. Extensive comparisons with an array of state-of-the-art OL2R algorithms on two public learning to rank benchmark datasets demonstrate the effectiveness and computational efficiency of our proposed neural OL2R solution.
Session-based recommender systems (SBR) are becoming increasingly popular because they can predict user interests without relying on long-term user profiles and can support login-free recommendation. Modern recommender systems operate in a fully server-based fashion. To cater to millions of users, frequent model maintenance and high-speed processing of concurrent user requests are required, which comes at the cost of a huge carbon footprint. Meanwhile, users need to upload their behavior data, even including the immediate environmental context, to the server, raising public concern about privacy. On-device recommender systems circumvent these two issues with cost-conscious settings and local inference. However, due to limited memory and computing resources, on-device recommender systems are confronted with two fundamental challenges: (1) how to reduce the size of regular models to fit edge devices? (2) how to retain the original capacity?
Previous research mostly adopts tensor decomposition techniques to compress regular recommendation models with low compression rates so as to avoid drastic performance degradation. In this paper, we explore ultra-compact models for next-item recommendation by loosening the constraint of dimensionality consistency in tensor decomposition. To compensate for the capacity loss caused by compression, we develop a self-supervised knowledge distillation framework which enables the compressed model (student) to distill the essential information lying in the raw data, and improves long-tail item recommendation through an embedding-recombination strategy with the original model (teacher). Extensive experiments on two benchmarks demonstrate that, with a 30x size reduction, the compressed model incurs almost no accuracy loss, and even outperforms its uncompressed counterpart. The code is released at https://github.com/xiaxin1998/OD-Rec.
SESSION: Topic 8: Evaluation and User Studies
Many IR collections contain forbidden documents (F-docs), i.e. documents that should not be retrieved to the searcher. In an ideal scenario F-docs are clearly flagged, hence the ranker can filter them out, guaranteeing that no F-doc will be exposed. However, in real-world scenarios, filtering algorithms are prone to errors. Therefore, an IR evaluation system should also measure filtering quality in addition to ranking quality. Typically, filtering is considered as a classification task and is evaluated independently of the ranking quality. However, due to the mutual affinity between the two, it is desirable to evaluate ranking quality while filtering decisions are being made. In this work we propose nDCGf, a novel extension of the nDCGmin metric, which measures both ranking and filtering quality of the search results. We show both theoretically and empirically that while nDCGmin is not suitable for the simultaneous ranking and filtering task, nDCGf is a reliable metric in this case.
We experiment with three datasets for which ranking and filtering are both required. In the PR dataset our task is to rank product reviews while filtering those marked as spam. Similarly, in the CQA dataset our task is to rank a list of human answers per question while filtering bad answers. We also experiment with the TREC web-track datasets, where F-docs are explicitly labeled, sorting participant runs according to their ranking and filtering quality, demonstrating the stability, sensitivity, and reliability of nDCGf for this task. We propose a learning to rank and filter (LTRF) framework that is specifically designed to optimize nDCGf, by learning a ranking model and optimizing a filtering threshold used for discarding documents with lower scores. We experiment with several loss functions demonstrating their success in learning an effective LTRF model for the simultaneous learning and filtering task.
The dramatic improvements in core information retrieval tasks engendered by neural rankers create a need for novel evaluation methods. If every ranker returns highly relevant items in the top ranks, it becomes difficult to recognize meaningful differences between them and to build reusable test collections. Several recent papers explore pairwise preference judgments as an alternative to traditional graded relevance assessments. Rather than viewing items one at a time, assessors view items side-by-side and indicate the one that provides the better response to a query, allowing fine-grained distinctions. If we employ preference judgments to identify the probably best items for each query, we can measure rankers by their ability to place these items as high as possible. We frame the problem of finding best items as a dueling bandits problem. While many papers explore dueling bandits for online ranker evaluation via interleaving, they have not been considered as a framework for offline evaluation via human preference judgments. We review the literature for possible solutions. For human preference judgments, any usable algorithm must tolerate ties, since two items may appear nearly equal to assessors, and it must minimize the number of judgments required for any specific pair, since each such comparison requires an independent assessor. Since the theoretical guarantees provided by most algorithms depend on assumptions that are not satisfied by human preference judgments, we simulate selected algorithms on representative test cases to provide insight into their practical utility. Based on these simulations, one algorithm stands out for its potential. Our simulations suggest modifications to further improve its performance. Using the modified algorithm, we collect over 10,000 preference judgments for pools derived from submissions to the TREC 2021 Deep Learning Track, confirming its suitability. 
We test the idea of best-item evaluation and suggest ideas for further theoretical and practical progress.
The use of offline effectiveness metrics is one of the cornerstones of evaluation in information retrieval. Static resources that include test collections and sets of topics, the corresponding relevance judgments connecting them, and metrics that map document rankings from a retrieval system to numeric scores have been used for multiple decades as an important way of comparing systems. The basis behind this experimental structure is that the metric score for a system can serve as a surrogate measurement for user satisfaction.
Here we introduce a user behavior framework that extends the C/W/L family. The essence of the new framework - which we call C/W/L/A - is that the user actions that are undertaken while reading the ranking can be considered separately from the benefit that each user will have derived as they exit the ranking. This split structure allows the great majority of current effectiveness metrics to be systematically categorized, and thus their relative properties and relationships to be better understood; and at the same time permits a wide range of novel combinations to be considered.
We then carry out experiments using relevance judgments, document rankings, and user satisfaction data from two distinct sources, comparing the patterns of metric scores generated, and showing that those metrics vary quite markedly in terms of their ability to predict user satisfaction.
Most information retrieval effectiveness evaluation metrics assume that systems appending irrelevant documents at the bottom of the ranking are as effective as (or not worse than) systems that have a stopping criterion to 'truncate' the ranking at the right position to avoid retrieving those irrelevant documents at the end. It can be argued, however, that such truncated rankings are more useful to the end user. It is thus important to understand how to measure retrieval effectiveness in this scenario. In this paper we provide both theoretical and experimental contributions. We first define formal properties to analyze how effectiveness metrics behave when evaluating truncated rankings. Our theoretical analysis shows that de-facto standard metrics do not satisfy desirable properties to evaluate truncated rankings: only Observational Information Effectiveness (OIE) -- a metric based on Shannon's information theory -- satisfies them all. We then perform experiments to compare several metrics on nine TREC datasets. According to our experimental results, the most appropriate metrics for truncated rankings are OIE and a novel extension of Rank-Biased Precision that adds a user effort factor penalizing the retrieval of irrelevant documents.
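The opening claim can be checked on standard Rank-Biased Precision, which scores a ranking as (1 - p) times the sum of r_i * p^(i-1) over ranks i, where r_i is the relevance at rank i and p is the persistence parameter. The sketch below uses made-up relevance labels and p = 0.8 as assumptions; it implements plain RBP only, not the effort-penalized extension proposed in the paper, whose definition is not given in the abstract.

```python
def rbp(relevance, p=0.8):
    """Standard Rank-Biased Precision for a list of relevance values in [0, 1].

    relevance[i] is the relevance of the document at rank i + 1;
    p is the user persistence parameter.
    """
    return (1 - p) * sum(r * p**i for i, r in enumerate(relevance))


# A ranking truncated after four results, and the same ranking padded
# with six irrelevant documents at the bottom (illustrative labels).
truncated = [1, 1, 0, 1]
padded = truncated + [0] * 6

print(rbp(truncated), rbp(padded))
```

Both calls return the same score, because irrelevant documents contribute zero to the sum; plain RBP therefore gives no credit to the system that stopped at the right position.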
Offline evaluation of information retrieval and recommendation has traditionally focused on distilling the quality of a ranking into a scalar metric such as average precision or normalized discounted cumulative gain. We can use this metric to compare the performance of multiple systems for the same request. Although evaluation metrics provide a convenient summary of system performance, they also collapse subtle differences across users into a single number and can carry assumptions about user behavior and utility not supported across retrieval scenarios. We propose recall-paired preference (RPP), a metric-free evaluation method based on directly computing a preference between ranked lists. RPP simulates multiple user subpopulations per query and compares systems across these pseudo-populations. Our results across multiple search and recommendation tasks demonstrate that RPP substantially improves discriminative power while correlating well with existing metrics and being equally robust to incomplete data.
A fundamental goal of Information Retrieval (IR) is to satisfy searchers' information need (IN). Advances in neuroimaging technologies have allowed for interdisciplinary research to investigate the brain activity associated with the realisation of IN. While these studies have been informative, they were not able to capture the cognitive processes underlying the realisation of IN and the interplay between them with a high temporal resolution. This paper aims to investigate this research question by inferring the variability of brain activity based on the contrast of a state of IN with the two other (no-IN) scenarios. To do so, we employed Electroencephalography (EEG) and constructed an Event-Related Potential (ERP) analysis of the brain signals captured while the participants were experiencing the realisation of IN. In particular, the brain signals of 24 healthy participants were captured while performing a Question-Answering (Q/A) Task. Our results show a link between the early stages of processing, corresponding to awareness and the late activity, meaning memory control mechanisms. Our findings also show that participants exhibited early N1-P2 complex indexing awareness processes and indicate, thus, that the realisation of IN is manifested in the brain before it reaches the user's consciousness. This research contributes novel insights into a better understanding of IN and informs the design of IR systems to better satisfy it.
Search engines and recommendation systems attempt to continually improve the quality of the experience they afford to their users. Refining the ranker that produces the lists displayed in response to user requests is an important component of this process. A common practice is for the service providers to make changes (e.g. new ranking features, different ranking models) and A/B test them on a fraction of their users to establish the value of the change. An alternative approach estimates the effectiveness of the proposed changes offline, utilising previously collected clickthrough data on the old ranker to posit what the user behaviour on ranked lists produced by the new ranker would have been. A majority of offline evaluation approaches invoke the well studied inverse propensity weighting to adjust for biases inherent in logged data. In this paper, we propose the use of parametric estimates for these propensities. Specifically, by leveraging well known learning-to-rank methods as subroutines, we show how accurate offline evaluation can be achieved when the new rankings to be evaluated differ from the logged ones.
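For readers unfamiliar with inverse propensity weighting, a textbook position-based version can be sketched as below. Everything here is an illustrative assumption: the examination propensities are fixed made-up constants (the paper instead proposes parametric estimates for them), the 1/rank utility weight is arbitrary, and the names `ipw_estimate`, `propensity`, and `reciprocal_rank` are hypothetical.

```python
def ipw_estimate(logs, new_rank, propensity, weight):
    """Inverse-propensity-weighted estimate of a new ranker's click utility.

    logs:       list of (shown_docs, clicked_docs) pairs logged under the
                old ranker; shown_docs is ordered by logged position.
    new_rank:   dict mapping doc -> 1-based rank under the new ranker.
    propensity: dict mapping logged position -> assumed probability that
                the user examined that position.
    weight:     function mapping a rank under the new ranker to a utility.
    """
    total = 0.0
    for shown, clicked in logs:
        for position, doc in enumerate(shown, start=1):
            if doc in clicked:
                # Up-weight clicks seen at rarely examined logged positions.
                total += weight(new_rank[doc]) / propensity[position]
    return total / len(logs)


def reciprocal_rank(rank):
    return 1.0 / rank


propensity = {1: 1.0, 2: 0.5, 3: 0.25}   # assumed examination probabilities
logs = [
    (["a", "b", "c"], {"b"}),            # click at logged position 2
    (["a", "b", "c"], {"a"}),            # click at logged position 1
]
new_rank = {"b": 1, "a": 2, "c": 3}      # the new ranker promotes "b"

estimate = ipw_estimate(logs, new_rank, propensity, reciprocal_rank)
print(estimate)
```

The division by the propensity is what corrects for position bias in the logged clicks: a click at a low logged position counts for more, since fewer users would have examined that position in the first place.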
Web search heavily relies on click-through behavior as an essential feedback signal for performance evaluation and improvement. Traditionally, click is usually treated as a positive implicit feedback signal of relevance or usefulness, while non-click is regarded as a signal of irrelevance or uselessness. However, there are many cases where users satisfy their information need with the contents shown on the Search Engine Result Page (SERP). This raises the problem of measuring the usefulness of non-click results and modeling user satisfaction in such circumstances.
Understanding non-click results has long been challenging owing to the lack of user interactions. In recent years, the rapid development of neuroimaging technologies has constituted a paradigm shift in various industries, e.g., search, entertainment, and education. We therefore draw on these technologies to bridge the gap between the human mind and the external search system in non-click situations. To this end, we analyze the differences in brain signals during the examination of non-click search results at different usefulness levels. Inspired by these findings, we conduct supervised learning tasks to estimate the usefulness of non-click results with brain signals and conventional information (i.e., content and context factors). Furthermore, we devise two re-ranking methods, i.e., a Personalized Method (PM) and a Generalized Intent modeling Method (GIM), for search result re-ranking with the estimated usefulness. Results show that it is feasible to utilize brain signals to improve usefulness estimation performance and enhance human-computer interactions by search result re-ranking.
SESSION: Topic 9: Explainable Search and Recommendation
Post Processing Recommender Systems with Knowledge Graphs for Recency, Popularity, and Diversity of Explanations
Existing explainable recommender systems have mainly modeled relationships between recommended and already experienced products, and shaped explanation types accordingly (e.g., movie "x" starred by actress "y" recommended to a user because that user watched other movies with "y" as an actress). However, none of these systems has investigated the extent to which properties of a single explanation (e.g., the recency of interaction with that actress) and of a group of explanations for a recommended list (e.g., the diversity of the explanation types) can influence the perceived explanation quality. In this paper, we conceptualized three novel properties that model the quality of the explanations (linking interaction recency, shared entity popularity, and explanation type diversity) and proposed re-ranking approaches able to optimize for these properties. Experiments on two public data sets showed that our approaches can increase explanation quality according to the proposed properties, fairly across demographic groups, while preserving recommendation utility. The source code and data are available at https://github.com/giacoballoccu/explanation-quality-recsys.
As an essential operation of legal retrieval, legal case matching plays a central role in intelligent legal systems. This task places high demands on the explainability of matching results because of its critical impact on downstream applications: the matched legal cases may provide supportive evidence for the judgments of target cases and thus influence the fairness and justice of legal decisions. Focusing on this challenging task, we propose a novel and explainable method, namely IOT-Match, with the help of computational optimal transport, which formulates the legal case matching problem as an inverse optimal transport (IOT) problem. Different from most existing methods, which merely focus on the sentence-level semantic similarity between legal cases, our IOT-Match learns to extract rationales from paired legal cases based on both semantics and legal characteristics of their sentences. The extracted rationales are further applied to generate faithful explanations and conduct matching. Moreover, the proposed IOT-Match is robust to the alignment label insufficiency issue common in practical legal case matching tasks, making it suitable for both supervised and semi-supervised learning paradigms. To demonstrate the superiority of our IOT-Match method and construct a benchmark for the explainable legal case matching task, we not only extend the well-known Challenge of AI in Law (CAIL) dataset but also build a new Explainable Legal cAse Matching (ELAM) dataset, which contains many legal cases with detailed and explainable annotations. Experiments on these two datasets show that our IOT-Match outperforms state-of-the-art methods consistently on matching prediction, rationale extraction, and explanation generation.
It has been shown that the interpretability of search results is enhanced when query aspects covered by documents are explicitly provided. However, existing work on aspect-oriented explanation of search results explains each document independently. These explanations thus cannot describe the differences between documents. This issue is also true for existing models on query aspect generation. Furthermore, these models provide a single query aspect for each document, even though documents often cover multiple query aspects. To overcome these limitations, we propose LiEGe, an approach that jointly explains all documents in a search result list. LiEGe provides semantic representations at two levels of granularity -- documents and their tokens -- using different interaction signals including cross-document interactions. These allow listwise modeling of a search result list as well as the generation of coherent explanations for documents. To appropriately explain documents that cover multiple query aspects, we introduce two settings for search result explanation: comprehensive and novelty explanation generation. LiEGe is trained and evaluated for both settings. We evaluate LiEGe on datasets built from Wikipedia and real query logs of the Bing search engine. Our experimental results demonstrate that LiEGe outperforms all baselines, with improvements that are substantial and statistically significant.
Existing research on fairness-aware recommendation has mainly focused on the quantification of fairness and the development of fair recommendation models, neither of which studies a more substantial problem: identifying the underlying reason for model disparity in recommendation. This information is critical for recommender system designers to understand the intrinsic recommendation mechanism, and it provides decision makers with insights on how to improve model fairness. Fortunately, with the rapid development of Explainable AI, we can use model explainability to gain insights into model (un)fairness. In this paper, we study the problem of explainable fairness, which helps to gain insights about why a system is fair or unfair, and guides the design of fair recommender systems with a more informed and unified methodology. Particularly, we focus on a common setting with feature-aware recommendation and exposure unfairness, but the proposed explainable fairness framework is general and can be applied to other recommendation settings and fairness definitions. We propose a Counterfactual Explainable Fairness framework, called CEF, which generates explanations about model fairness that can improve the fairness without significantly hurting the performance. The CEF framework formulates an optimization problem to learn the "minimal" change of the input features that changes the recommendation results to a certain level of fairness. Based on the counterfactual recommendation result of each feature, we calculate an explainability score in terms of the fairness-utility trade-off to rank all the feature-based explanations, and select the top ones as fairness explanations. Experimental results on several real-world datasets validate that our method is able to effectively explain model disparities, and that using these explanations for recommendation achieves a better fairness-utility trade-off than all the baselines.
Variational autoencoders (VAEs) have been widely applied in recommendation. One reason is that their amortized inference helps to overcome data sparsity. However, they are still rarely explored in explainable recommendation that generates natural language explanations. Thus, we aim to extend VAE to explainable recommendation. In this task, we find that VAE can generate acceptable explanations for users with few relevant training samples; however, it tends to generate less personalized explanations than autoencoders (AEs) for users with relatively sufficient samples. We conjecture that information shared by different users in VAE disturbs the information for a specific user. To deal with this problem, we present PErsonalized VAE (PEVAE), which generates personalized natural language explanations for explainable recommendation. Moreover, we propose two novel mechanisms to aid our model in generating more personalized explanations: 1) Self-Adaption Fusion (SAF), which manipulates the latent space in a self-adaption manner to control the influence of shared information, so that our model can enjoy the advantage of overcoming the sparsity of data while generating more personalized explanations for a user with relatively sufficient training samples; and 2) DEpendence Maximization (DEM), which strengthens dependence between recommendations and explanations by maximizing the mutual information, making the explanation more specific to the input user-item pair and thus improving the personalization of the generated explanations. Extensive experiments show that PEVAE can generate more personalized explanations, and further analyses demonstrate the practical effect of our proposed methods.
SESSION: Topic 10: Fairness in IR
Prior research on exposure fairness in the context of recommender systems has focused mostly on disparities in the exposure of individual or groups of items to individual users of the system. The problem of how individual or groups of items may be systemically under- or over-exposed to groups of users, or even all users, has received relatively less attention. However, such systemic disparities in information exposure can result in observable social harms, such as withholding economic opportunities from historically marginalized groups (allocative harm) or amplifying gendered and racialized stereotypes (representational harm). Previously, Diaz et al. developed the expected exposure metric, which incorporates existing user browsing models previously developed for information retrieval, to study fairness of content exposure to individual users. We extend their proposed framework to formalize a family of exposure fairness metrics that model the problem jointly from the perspective of both the consumers and producers. Specifically, we consider group attributes for both types of stakeholders to identify and mitigate fairness concerns that go beyond individual users and items towards more systemic biases in recommendation. Furthermore, we study and discuss the relationships between the different exposure fairness dimensions proposed in this paper, as well as demonstrate how stochastic ranking policies can be optimized towards said fairness goals.
There are several measures for fairness in ranking, based on different underlying assumptions and perspectives. Plackett-Luce (PL) optimization with the REINFORCE algorithm can be used for optimizing black-box objective functions over permutations. In particular, it can be used for optimizing fairness measures. However, though effective for queries with a moderate number of repeating sessions, PL optimization has room for improvement for queries with a small number of repeating sessions.
In this paper, we present a novel way of representing permutation distributions, based on the notion of permutation graphs. Similar to PL, our distribution representation, called probabilistic permutation graph (PPG), can be used for black-box optimization of fairness. Different from PL, where pointwise logits are used as the distribution parameters, in PPG pairwise inversion probabilities together with a reference permutation construct the distribution. As such, the reference permutation can be set to the best sampled permutation regarding the objective function, making PPG suitable for both deterministic and stochastic rankings. Our experiments show that PPG, while comparable to PL for larger session repetitions (i.e., stochastic ranking), improves over PL for optimizing fairness metrics for queries with one session (i.e., deterministic ranking). Additionally, when accurate utility estimations are available, e.g., in tabular models, the performance of PPG in fairness optimization is significantly boosted compared to lower quality utility estimations from a learning to rank model, leading to a large performance gap with PL. Finally, the pairwise probabilities make it possible to impose pairwise constraints such as "item $d_1$ should always be ranked higher than item $d_2$." Such constraints can be used to simultaneously optimize the fairness metric and control another objective such as ranking performance.
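For readers unfamiliar with the Plackett-Luce (PL) distribution that the abstract above uses as its point of comparison, the following is a minimal sketch of how pointwise logits define a distribution over permutations. It illustrates only the PL baseline, not the proposed permutation-graph representation, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def pl_sample(logits: Sequence[float], rng: random.Random) -> List[int]:
    """Sample a ranking by repeatedly drawing one of the remaining items
    with probability proportional to exp(logit)."""
    items = list(range(len(logits)))
    ranking: List[int] = []
    while items:
        weights = [math.exp(logits[i]) for i in items]
        chosen = rng.choices(items, weights=weights, k=1)[0]
        items.remove(chosen)
        ranking.append(chosen)
    return ranking


def pl_log_prob(logits: Sequence[float], ranking: Sequence[int]) -> float:
    """Log-probability of a full ranking under the Plackett-Luce model."""
    log_prob = 0.0
    for position, item in enumerate(ranking):
        # Normalise over the items that have not been placed yet.
        denominator = sum(math.exp(logits[j]) for j in ranking[position:])
        log_prob += logits[item] - math.log(denominator)
    return log_prob


rng = random.Random(0)
sampled = pl_sample([2.0, 1.0, 0.0], rng)
sampled_log_prob = pl_log_prob([2.0, 1.0, 0.0], sampled)
```

A REINFORCE-style optimizer would weight the gradient of `pl_log_prob` by the black-box objective evaluated on each sampled ranking.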
Information access systems, such as search and recommender systems, often use ranked lists to present results believed to be relevant to the user's information need. Evaluating these lists for their fairness along with other traditional metrics provides a more complete understanding of an information access system's behavior beyond accuracy or utility constructs. To measure the (un)fairness of rankings, particularly with respect to the protected group(s) of producers or providers, several metrics have been proposed in the last several years. However, an empirical and comparative analysis of these metrics, showing their applicability to specific scenarios or real data as well as their conceptual similarities and differences, is still lacking.
We aim to bridge the gap between the theory and the practical application of these metrics. In this paper we describe several fair ranking metrics from the existing literature in a common notation, enabling direct comparison of their approaches and assumptions, and empirically compare them on the same experimental setup and data sets in the context of three information access tasks. We also provide a sensitivity analysis to assess the impact of the design choices and parameter settings that go into these metrics and point to additional work needed to improve fairness measurement.
There is growing interest in designing recommender systems that aim at being fair towards item producers or their least satisfied users. Inspired by the domain of inequality measurement in economics, this paper explores the use of generalized Gini welfare functions (GGFs) as a means to specify the normative criterion that recommender systems should optimize for. GGFs weight individuals depending on their ranks in the population, giving more weight to worse-off individuals to promote equality. Depending on these weights, GGFs minimize the Gini index of item exposure to promote equality between items, or focus on the performance on specific quantiles of least satisfied users. GGFs for ranking are challenging to optimize because they are non-differentiable. We resolve this challenge by leveraging tools from non-smooth optimization and projection operators used in differentiable sorting. We present experiments using real datasets with up to 15k users and items, which show that our approach obtains better trade-offs than the baselines on a variety of recommendation tasks and fairness criteria.
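The rank-dependent weighting described in the abstract above can be made concrete with a short sketch. It uses the standard definition of a generalized Gini welfare function, which is a weighted sum of utilities sorted from worst-off to best-off with non-increasing weights. The function name `ggf` and the example numbers are illustrative assumptions, and the optimization method of the paper is not shown.

```python
from typing import Sequence


def ggf(utilities: Sequence[float], weights: Sequence[float]) -> float:
    """Generalized Gini welfare: sum_i w_i * u_(i), where u_(1) <= ... <= u_(n)
    and w_1 >= ... >= w_n, so worse-off individuals receive larger weights."""
    if len(utilities) != len(weights):
        raise ValueError("utilities and weights must have the same length")
    if any(earlier < later for earlier, later in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")
    return sum(w * u for w, u in zip(weights, sorted(utilities)))


# Two allocations with the same total utility: the equal one scores higher.
unequal = ggf([1.0, 3.0], [1.0, 0.5])
equal = ggf([2.0, 2.0], [1.0, 0.5])
```

The sort inside the function is what makes the objective non-differentiable, which is the difficulty the abstract refers to.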
In recent years, it has become clear that rankings delivered in many areas need not only be useful to the users but also respect fairness of exposure for the item producers. We consider the problem of finding ranking policies that achieve a Pareto-optimal tradeoff between these two aspects. Several methods were proposed to solve it; for instance a popular one is to use linear programming with a Birkhoff-von Neumann decomposition. These methods, however, are based on a classical Position Based exposure Model (PBM), which assumes independence between the items (hence the exposure only depends on the rank). In many applications, this assumption is unrealistic and the community increasingly moves towards considering other models that include dependences, such as the Dynamic Bayesian Network (DBN) exposure model. For such models, computing (exact) optimal fair ranking policies remains an open question. In this paper, we answer this question by leveraging a new geometrical method based on the so-called expohedron proposed recently for the PBM (Kletti et al., WSDM'22). We lay out the structure of a new geometrical object (the DBN-expohedron), and propose for it a Carathéodory decomposition algorithm of complexity $O(n^3)$, where n is the number of documents to rank. Such an algorithm enables expressing any feasible expected exposure vector as a distribution over at most n rankings; furthermore we show that we can compute the whole set of Pareto-optimal expected exposure vectors with the same complexity $O(n^3)$. Our work constitutes the first exact algorithm able to efficiently find a Pareto-optimal distribution of rankings. It is applicable to a broad range of fairness notions, including classical notions of meritocratic and demographic fairness. We empirically evaluate our method on the TREC2020 and MSLR datasets and compare it to several baselines in terms of Pareto-optimality and speed.
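As an illustration of the position-based model that the abstract above contrasts with the DBN model, the sketch below computes the expected exposure of each item under a stochastic policy, represented as a list of (probability, ranking) pairs. The logarithmic rank discount is an assumption chosen for the example, and the function names are hypothetical. The expohedron construction and the Carathéodory decomposition are not shown.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def pbm_exposure(rank: int) -> float:
    """Exposure at a 0-based rank; under the PBM it depends on the rank only.
    Assumption: a DCG-style logarithmic discount."""
    return 1.0 / math.log2(rank + 2)


def expected_exposure(
    policy: List[Tuple[float, Sequence[Hashable]]]
) -> Dict[Hashable, float]:
    """Expected exposure per item for a distribution over rankings."""
    exposure: Dict[Hashable, float] = {}
    for probability, ranking in policy:
        for rank, item in enumerate(ranking):
            exposure[item] = exposure.get(item, 0.0) + probability * pbm_exposure(rank)
    return exposure


# A deterministic policy favours "a"; mixing both orders equalises exposure.
deterministic = expected_exposure([(1.0, ["a", "b"])])
mixed = expected_exposure([(0.5, ["a", "b"]), (0.5, ["b", "a"])])
```

Under a model with inter-item dependencies, such as the DBN model, the exposure of an item would also depend on which items are ranked above it, so this per-rank lookup would no longer apply.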
Fairness of exposure is a commonly used notion of fairness for ranking systems. It is based on the idea that all items or item groups should get exposure proportional to the merit of the item or the collective merit of the items in the group. Often, stochastic ranking policies are used to ensure fairness of exposure. Previous work unrealistically assumes that we can reliably estimate the expected exposure for all items in each ranking produced by the stochastic policy. In this work, we discuss how to approach fairness of exposure in cases where the policy contains rankings of which, due to inter-item dependencies, we cannot reliably estimate the exposure distribution. In such cases, we cannot determine whether the policy can be considered fair. Our contributions in this paper are twofold. First, we define a method for finding stochastic policies that avoid showing rankings with unknown exposure distribution to the user without having to compromise user utility or item fairness. Second, we extend the study of fairness of exposure to the top-k setting and also assess our method in this setting. We find that our method can significantly reduce the number of rankings with unknown exposure distribution without a drop in user utility or fairness compared to existing fair ranking methods, both for full-length and top-k rankings. This is an important first step in developing fair ranking methods for cases where we have incomplete knowledge about the user's behaviour.
Recently, there has been a rising awareness that when machine learning (ML) algorithms are used to automate choices, they may treat/affect individuals unfairly, with legal, ethical, or economic consequences. Recommender systems are prominent examples of such ML systems that assist users in making high-stakes judgments.
A common trend in previous research on fairness in recommender systems is that the majority of works treat user and item fairness concerns separately, ignoring the fact that recommender systems operate in a two-sided marketplace. In this work, we present an optimization-based re-ranking approach that seamlessly integrates fairness constraints from both the consumer and producer side in a joint objective framework. We demonstrate through large-scale experiments on 8 datasets that our proposed method is capable of improving both consumer and producer fairness without reducing overall recommendation quality, demonstrating the role algorithms may play in minimizing data biases.
SESSION: Topic 11: IR Models
Retrieving relevant documents from a corpus is typically based on the semantic similarity between the document content and query text. The inclusion of structural relationship between documents can benefit the retrieval mechanism by addressing semantic gaps. However, incorporating these relationships requires tractable mechanisms that balance structure with semantics and take advantage of the prevalent pre-train/fine-tune paradigm. We propose here a holistic approach to learning document representations by integrating intra-document content with inter-document relations. Our deep metric learning solution analyzes the complex neighborhood structure in the relationship network to efficiently sample similar/dissimilar document pairs and defines a novel quintuplet loss function that simultaneously encourages document pairs that are semantically relevant to be closer and structurally unrelated to be far apart in the representation space. Furthermore, the separation margins between the documents are varied flexibly to encode the heterogeneity in relationship strengths. The model is fully fine-tunable and natively supports query projection during inference. We demonstrate that it outperforms competing methods on multiple datasets for document retrieval tasks.
Conventional methods for query autocompletion aim to predict which completed query a user will select from a list. A shortcoming of this approach is that users often do not know which query will provide the best retrieval performance on the current information retrieval system, meaning that any query autocompletion methods trained to mimic user behavior can lead to suboptimal query suggestions. To overcome this limitation, we propose a new approach that explicitly optimizes the query suggestions for downstream retrieval performance. We formulate this as a problem of ranking a set of rankings, where each query suggestion is represented by the downstream item ranking it produces. We then present a learning method that ranks query suggestions by the quality of their item rankings. The algorithm is based on a counterfactual learning approach that is able to leverage feedback on the items (e.g., clicks, purchases) to evaluate query suggestions through an unbiased estimator, thus avoiding the assumption that users write or select optimal queries. We establish theoretical support for the proposed approach and provide learning-theoretic guarantees. We also present empirical results on publicly available datasets, and demonstrate real-world applicability using data from an online shopping store.
Learning to Rank (L2R) is the core task of many Information Retrieval systems. Recently, great effort has been put into exploring Deep Neural Networks (DNNs) for L2R, with significant results. However, risk-sensitiveness, an important and recent advance in the L2R arena that reduces variability and increases trust, has not been incorporated into Deep Neural L2R yet. Risk-sensitive measures are important to assess the risk of an IR system performing worse than a set of baseline IR systems for several queries. However, the risk-sensitive measures described in the literature have a non-smooth behavior, making them difficult, if not impossible, to optimize with DNNs. In this work we solve this difficult problem by proposing a family of new loss functions, RiskLoss, that support a smooth risk-sensitive optimization. RiskLoss introduces two important contributions: (i) the substitution of the traditional NDCG or MAP metrics in risk-sensitive measures with smooth loss functions that evaluate the correlation between the predicted and the true relevance order of documents for a given query and (ii) the use of distinct versions of the same DNN architecture as baselines by means of a multi-dropout technique during the smooth risk-sensitive optimization, avoiding the inconvenience of assessing multiple IR systems as part of DNN training. We empirically demonstrate significant achievements of the proposed RiskLoss functions when used with recent DNN methods in the context of well-known web-search datasets such as WEB10K, YAHOO, and MQ2007. Our solutions reach improvements of 8% in effectiveness (NDCG) while improving risk-sensitiveness (GeoRisk measure) by around 5% when applied together with a state-of-the-art Self-Attention DNN-L2R architecture.
Furthermore, RiskLoss is capable of reducing by 28% the losses over the best evaluated baselines and significantly improving over the risk-sensitive state-of-the-art non-DNN method (by up to 13.3%) while keeping (or even increasing) overall effectiveness. All these results ultimately establish a new level for the state-of-the-art on risk-sensitiveness and DNN-L2R research.
Building a multi-stage cascade ranking system is a commonly used solution to balance efficiency and effectiveness in modern information retrieval (IR) applications, such as recommendation and web search. Despite its popularity in practice, the literature specific to multi-stage cascade ranking systems is relatively scarce. The common practice is to train the rankers of each stage independently using the same user feedback data (a.k.a. impression data), disregarding the data flow and the possible interactions between stages. This straightforward solution could lead to a sub-optimal system because of the sample selection bias (SSB) issue, which is especially damaging for cascade rankers due to the negative effect accumulated over the multiple stages. Worse still, the interactions between the rankers of each stage are not fully exploited. This paper provides an elaborate analysis of this commonly used solution to reveal its limitations. By studying the essence of cascade ranking, we propose a joint training framework named RankFlow to alleviate the SSB issue and exploit the interactions between the cascade rankers, which is the first systematic solution for this topic. We propose a paradigm of training cascade rankers that emphasizes the importance of fitting rankers on stage-specific data distributions instead of the unified user feedback distribution. We design the RankFlow framework based on this paradigm: the training data of each stage is generated by its preceding stages, while the guidance signals come not only from the logs but also from its successors. Extensive experiments are conducted on various IR scenarios, including recommendation, web search and advertisement. The results verify the efficacy and superiority of RankFlow.
LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback
Pseudo-relevance feedback (PRF) has proven to be an effective query reformulation technique to improve retrieval accuracy. It aims to alleviate the mismatch of linguistic expressions between a query and its potential relevant documents. Existing PRF methods independently treat revised queries that originate from the same query but use different numbers of feedback documents, resulting in severe query drift. Without comparing the effects of two different revisions of the same query, a PRF model may incorrectly focus on the additional irrelevant information introduced by the larger feedback set, and thus reformulate a query that is less effective than the revision using less feedback. Ideally, if a PRF model can distinguish between irrelevant and relevant information in the feedback, the more feedback documents there are, the better the revised query will be. To bridge this gap, we propose the Loss-over-Loss (LoL) framework to compare the reformulation losses between different revisions of the same query during training. Concretely, we revise an original query multiple times in parallel using different amounts of feedback and compute their reformulation losses. Then, we introduce an additional regularization loss on these reformulation losses to penalize revisions that use more feedback but incur larger losses. With such comparative regularization, the PRF model is expected to learn to suppress the additional irrelevant information by comparing the effects of different revised queries. Further, we present a differentiable query reformulation method to implement this framework. This method revises queries in the vector space and directly optimizes the retrieval performance of query vectors, and is applicable to both sparse and dense retrieval models. Empirical evaluation demonstrates the effectiveness and robustness of our method for two typical sparse and dense retrieval models.
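The comparative regularization described in the abstract above can be sketched as follows. The abstract only states that revisions using more feedback but gaining larger losses are penalized. The pairwise hinge form used here is an assumption made for illustration and may differ from the paper's formulation, and the function name is hypothetical.

```python
from typing import Sequence


def comparative_regularizer(reformulation_losses: Sequence[float]) -> float:
    """Penalty over the reformulation losses of one query's revisions.

    Assumption: losses are ordered by increasing number of feedback documents,
    and each pair is penalized by a hinge when the revision with more feedback
    has the larger loss.
    """
    penalty = 0.0
    count = len(reformulation_losses)
    for fewer in range(count):
        for more in range(fewer + 1, count):
            penalty += max(0.0, reformulation_losses[more] - reformulation_losses[fewer])
    return penalty


# Losses that shrink as feedback grows incur no penalty; violations do.
no_violation = comparative_regularizer([0.9, 0.7, 0.5])
with_violation = comparative_regularizer([0.5, 0.7, 0.6])
```

In training, such a term would be added to the sum of the individual reformulation losses, so that the model is rewarded for using extra feedback documents productively.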
Stance detection aims to identify whether the author of a text is in favor of, against, or neutral to a given target. The main challenge of this task is two-fold: few-shot learning resulting from the varying targets, and the lack of contextual information about the targets. Existing works mainly focus on solving the second issue by designing attention-based models or introducing noisy external knowledge, while the first issue remains under-explored. In this paper, inspired by the potential capability of pre-trained language models (PLMs) to serve as knowledge bases and few-shot learners, we propose to introduce prompt-based fine-tuning for stance detection. PLMs can provide essential contextual information for the targets and enable few-shot learning via prompts. Considering the crucial role of the target in the stance detection task, we design target-aware prompts and propose a novel verbalizer. Instead of mapping each label to a concrete word, our verbalizer maps each label to a vector and picks the label that best captures the correlation between the stance and the target. Moreover, to alleviate the possible defect of dealing with varying targets with a single hand-crafted prompt, we propose to distill the information learned from multiple prompts. Experimental results show the superior performance of our proposed model in both full-data and few-shot scenarios.
Dense retrieval has shown promising results in many information retrieval (IR) related tasks, whose foundation is high-quality text representation learning for effective search. Some recent studies have shown that autoencoder-based language models are able to boost the dense retrieval performance using a weak decoder. However, we argue that 1) it is not discriminative to decode all the input texts and, 2) even a weak decoder has the bypass effect on the encoder. Therefore, in this work, we introduce a novel contrastive span prediction task to pre-train the encoder alone, but still retain the bottleneck ability of the autoencoder. In this way, we can 1) learn discriminative text representations efficiently with the group-wise contrastive learning over spans and, 2) avoid the bypass effect of the decoder thoroughly. Comprehensive experiments over publicly available retrieval benchmark datasets show that our approach can outperform existing pre-training methods for dense retrieval significantly.
With the rapid growth of interaction data, many clustering methods have been proposed to discover interaction patterns as prior knowledge beneficial to downstream tasks. Considering that an interaction can be seen as an action occurring among multiple objects, most existing methods model the objects and their pair-wise relations as nodes and links in graphs. However, they only model and leverage part of the information in real entire interactions, i.e., either decompose the entire interaction into several pair-wise sub-interactions for simplification, or only focus on clustering some specific types of objects, which limits the performance and explainability of clustering. To tackle this issue, we propose to Co-cluster the Interactions via Attentive Hypergraph neural network (CIAH). Particularly, with more comprehensive modeling of interactions by hypergraph, we propose an attentive hypergraph neural network to encode the entire interactions, where an attention mechanism is utilized to select important attributes for explanations. Then, we introduce a salient method to guide the attention to be more consistent with real importance of attributes, namely saliency-based consistency. Moreover, we propose a novel co-clustering method to perform a joint clustering for the representations of interactions and the corresponding distributions of attribute selection, namely cluster-based consistency. Extensive experiments demonstrate that our CIAH significantly outperforms state-of-the-art clustering methods on both public datasets and real industrial datasets.
Neural text matching models have been used in a range of applications such as question answering and natural language inference, and have yielded good performance. However, these neural models have limited adaptability, resulting in a decline in performance when encountering test examples from a different dataset or even a different task. Adaptability is particularly important in the few-shot setting: in many cases, there is only a limited amount of labeled data available for a target dataset or task, while we may have access to a richly labeled source dataset or task. However, adapting a model trained on the abundant source data to a few-shot target dataset or task is challenging. To tackle this challenge, we propose a Meta-Weight Regulator (MWR), which is a meta-learning approach that learns to assign weights to the source examples based on their relevance to the target loss. Specifically, MWR first trains the model on the uniformly weighted source examples, and measures the efficacy of the model on the target examples via a loss function. By iteratively performing a (meta) gradient descent, high-order gradients are propagated to the source examples. These gradients are then used to update the weights of source examples, in a way that is relevant to the target performance. As MWR is model-agnostic, it can be applied to any backbone neural model. Extensive experiments are conducted with various backbone text matching models, on four widely used datasets and two tasks. The results demonstrate that our proposed approach significantly outperforms a number of existing adaptation methods and effectively improves the cross-dataset and cross-task adaptability of the neural text matching models in the few-shot setting.
SESSION: Topic 12: Knowledge Graphs
Relation prediction on knowledge graphs (KGs) aims to infer missing valid triples from observed ones. Although this task has been deeply studied, most previous studies are limited to the transductive setting and cannot handle emerging entities. The inductive setting is closer to real-life scenarios because it allows entities in the testing phase to be unseen during training. However, it is challenging to conduct inductive relation prediction precisely, as it requires both entity-independent relation modeling and discrete logical reasoning for interoperability. To this end, we propose a novel model, ConGLR, to incorporate context graph with logical reasoning. Firstly, the enclosing subgraph w.r.t. the target head and tail entities is extracted and initialized by the double radius labeling. Then the context graph involving relational paths, relations and entities is introduced. Secondly, two graph convolutional networks (GCNs) with the information interaction of entities and relations are carried out to process the subgraph and context graph respectively. Considering the influence of different edges and target relations, we introduce edge-aware and relation-aware attention mechanisms for the subgraph GCN. Finally, by treating the relational path as rule body and target relation as rule head, we integrate neural calculating and logical reasoning to obtain inductive scores. To focus on the specific modeling goals of each module, the stop-gradient is utilized in the information interaction between context graph and subgraph GCNs in the training process. In this way, ConGLR satisfies two inductive requirements at the same time. Extensive experiments demonstrate that ConGLR obtains outstanding performance against state-of-the-art baselines on twelve inductive dataset versions of three common KGs.
Multimodal Knowledge Graphs (MKGs), which organize visual-text factual knowledge, have recently been successfully applied to tasks such as information retrieval, question answering, and recommendation systems. Since most MKGs are far from complete, extensive knowledge graph completion studies have been proposed focusing on the multimodal entity, relation extraction and link prediction. However, different tasks and modalities require changes to the model architecture, and not all images/objects are relevant to text input, which hinders the applicability to diverse real-world scenarios. In this paper, we propose a hybrid transformer with multi-level fusion to address those issues. Specifically, we leverage a hybrid transformer architecture with unified input-output for diverse multimodal knowledge graph completion tasks. Moreover, we propose multi-level fusion, which integrates visual and text representation via coarse-grained prefix-guided interaction and fine-grained correlation-aware fusion modules. We conduct extensive experiments to validate that our MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER. Code: https://github.com/zjunlp/MKGformer.
Knowledge graph completion (KGC) aims to infer missing knowledge triples based on known facts in a knowledge graph. Current KGC research mostly follows an entity ranking protocol, wherein the effectiveness is measured by the predicted rank of a masked entity in a test triple. The overall performance is then given by a micro(-average) metric over all individual answer entities. Due to the incomplete nature of the large-scale knowledge bases, such an entity ranking setting is likely affected by unlabelled top-ranked positive examples, raising questions on whether the current evaluation protocol is sufficient to guarantee a fair comparison of KGC systems. To this end, this paper presents a systematic study on whether and how the label sparsity affects the current KGC evaluation with the popular micro metrics. Specifically, inspired by the TREC paradigm for large-scale information retrieval (IR) experimentation, we create a relatively "complete" judgment set based on a sample from the popular FB15k-237 dataset following the TREC pooling method. According to our analysis, it comes as a surprise that switching from the original labels to our "complete" labels results in a drastic change of system ranking of a variety of 13 popular KGC models in terms of micro metrics. Further investigation indicates that the IR-like macro(-average) metrics are more stable and discriminative under different settings, meanwhile, less affected by label sparsity. Thus, for KGC evaluation, we recommend conducting TREC-style pooling to balance between human efforts and label completeness, and reporting also the IR-like macro metrics to reflect the ranking nature of the KGC task.
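The micro/macro distinction above can be made concrete with a small sketch. It uses reciprocal rank as the per-answer score and assumes that "macro" means averaging within each query before averaging across queries; the function names and the toy data are hypothetical and are not taken from the paper's evaluation code.

```python
from collections import defaultdict

def micro_mrr(ranks):
    """Average reciprocal rank over every (query, answer) pair equally."""
    return sum(1.0 / r for _, r in ranks) / len(ranks)

def macro_mrr(ranks):
    """Average reciprocal rank within each query first, then across queries."""
    per_query = defaultdict(list)
    for query, r in ranks:
        per_query[query].append(1.0 / r)
    return sum(sum(v) / len(v) for v in per_query.values()) / len(per_query)

# Hypothetical data: (query id, predicted rank of one answer entity).
ranks = [("q1", 1), ("q1", 1), ("q1", 1), ("q2", 4)]
print(micro_mrr(ranks))  # 0.8125
print(macro_mrr(ranks))  # 0.625
```

In this toy case the query with many answers dominates the micro score, while the macro score gives both queries equal weight.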
Knowledge graphs (KGs) consisting of a large number of triples have become widespread recently, and many knowledge graph embedding (KGE) methods have been proposed to embed entities and relations of a KG into continuous vector spaces. Such embedding methods simplify the operations of conducting various in-KG tasks (e.g., link prediction) and out-of-KG tasks (e.g., question answering). They can be viewed as general solutions for representing KGs. However, existing KGE methods are not applicable to inductive settings, where a model trained on source KGs will be tested on target KGs with entities unseen during model training. Existing works focusing on KGs in inductive settings can only solve the inductive relation prediction task. They cannot handle other out-of-KG tasks as generally as KGE methods since they do not produce embeddings for entities. In this paper, to achieve inductive knowledge graph embedding, we propose a model, MorsE, which does not learn embeddings for entities but learns transferable meta-knowledge that can be used to produce entity embeddings. Such meta-knowledge is modeled by entity-independent modules and learned by meta-learning. Experimental results show that our model significantly outperforms corresponding baselines for in-KG and out-of-KG tasks in inductive settings.
Previous entity linking methods in knowledge graphs (KGs) mostly link the textual mentions to corresponding entities. However, they have deficiencies in processing numerous multimodal data, when the text is too short to provide enough context. Consequently, we conceive the idea of introducing valuable information of other modalities, and propose a novel multimodal entity linking method with gated hierarchical multimodal fusion and contrastive training (GHMFC). Firstly, in order to discover the fine-grained inter-modal correlations, GHMFC extracts the hierarchical features of text and visual co-attention through the multi-modal co-attention mechanism: textual-guided visual attention and visual-guided textual attention. The former attention obtains weighted visual features under the guidance of textual information. In contrast, the latter attention produces weighted textual features under the guidance of visual information. Afterwards, gated fusion is used to evaluate the importance of hierarchical features of different modalities and integrate them into the final multimodal representations of mentions. Subsequently, contrastive training with two types of contrastive losses is designed to learn more generic multimodal features and reduce noise. Finally, the linking entities are selected by calculating the cosine similarity between representations of mentions and entities in KGs. To evaluate the proposed method, this paper releases two new open multimodal entity linking datasets: WikiMEL and Richpedia-MEL. Experimental results demonstrate that GHMFC can learn meaningful multimodal representation and significantly outperforms most of the baseline methods.
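The final selection step described above, choosing the entity whose representation has the highest cosine similarity to the mention, can be sketched as follows. The two-dimensional vectors and entity ids are hypothetical placeholders; in GHMFC the representations would come from the fused multimodal encoder, which is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def link_mention(mention_vec, entity_vecs):
    """Return the id of the entity most similar to the mention."""
    return max(entity_vecs, key=lambda e: cosine(mention_vec, entity_vecs[e]))

# Hypothetical embeddings for two candidate entities.
entities = {"Q1": [1.0, 0.0], "Q2": [0.6, 0.8]}
best = link_mention([0.5, 0.9], entities)
print(best)  # Q2
```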
SESSION: Topic 13: Multi- and Cross-modal IR
Given a text query, the text-to-video retrieval task aims to find the relevant videos in the database. Recently, model-based (MDB) methods have demonstrated higher accuracy than embedding-based (EDB) methods due to their excellent capacity for modeling local video/text correspondences, especially when equipped with large-scale pre-training schemes like ClipBERT. Generally speaking, MDB methods take a text-video pair as input and harness deep models to predict the mutual similarity, while EDB methods first utilize modality-specific encoders to extract embeddings for text and video, then evaluate the distance based on the extracted embeddings. Notably, MDB methods cannot produce explicit representations for text and video; instead, they have to exhaustively pair the query with every database item to predict their mutual similarities in the inference stage, which results in significant inefficiency in practical applications.
In this work, we propose a novel EDB method CRET (Cross-modal REtrieval Transformer), which not only demonstrates promising efficiency in retrieval tasks, but also achieves better accuracy than existing MDB methods. The credit is mainly attributed to our proposed Cross-modal Correspondence Modeling (CCM) module and Gaussian Estimation of Embedding Space (GEES) loss. Specifically, the CCM module is composed of transformer decoders and a set of decoder centers. With the help of the learned decoder centers, the text/video embeddings can be efficiently aligned, without suffering from pairwise model-based inference. Moreover, to balance the information loss and computational overhead when sampling frames from a given video, we present a novel GEES loss, which implicitly conducts dense sampling in the video embedding space, without suffering from heavy computational cost. Extensive experiments show that without pre-training on extra datasets, our proposed CRET outperforms the state-of-the-art MDB methods that were pre-trained on additional datasets, while still showing promising efficiency in retrieval tasks.
Zero-Shot Cross-Modal Retrieval (ZS-CMR) has recently drawn increasing attention as it focuses on a practical retrieval scenario, i.e., the multimodal test set consists of unseen classes that are disjoint with seen classes in the training set. The recently proposed methods typically adopt the generative model as the main framework to learn a joint latent embedding space to alleviate the modality gap. Generally, these methods largely rely on auxiliary semantic embeddings for knowledge transfer across classes and unconsciously neglect the effect of the data reconstruction manner in the adopted generative model. To address this issue, we propose a novel ZS-CMR model termed Multimodal Disentanglement Variational AutoEncoders (MDVAE), which consists of two coupled disentanglement variational autoencoders (DVAEs) and a fusion-exchange VAE (FVAE). Specifically, DVAE is developed to disentangle the original representations of each modality into modality-invariant and modality-specific features. FVAE is designed to fuse and exchange information of multimodal data by the reconstruction and alignment process without pre-extracted semantic embeddings. Moreover, an advanced counter-intuitive cross-reconstruction scheme is further proposed to enhance the informativeness and generalizability of the modality-invariant features for more effective knowledge transfer. The comprehensive experiments on four image-text retrieval and two image-sketch retrieval datasets consistently demonstrate that our method establishes the new state-of-the-art performance.
Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in the vision transformer of CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive and similar frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones. As the frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are later concatenated into a new sequence, while their original spatial-temporal relations are well maintained. We instantiate two clustering algorithms to efficiently find deterministic medoids and iteratively partition groups in high dimensional space. Through this token clustering and center selection procedure, we successfully reduce computation costs by removing redundant visual tokens. This method further enhances segment-level semantic alignment between video and text representations, enforcing the spatio-temporal interactions of tokens from within-segment frames. Our method, coined CenterCLIP, surpasses the existing state of the art by a large margin on typical text-video benchmarks, while reducing the training memory cost by 35% and accelerating the inference speed by 14% in the best case. The code is available at https://github.com/mzhaoshuai/CenterCLIP.
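As a rough illustration of segment-level center selection, the sketch below splits a token sequence into contiguous segments and keeps one medoid per segment, preserving segment order. This is a deliberate simplification with a single cluster per segment; the two clustering algorithms instantiated in CenterCLIP are not reproduced, and the function names and toy tokens are hypothetical.

```python
import math

def medoid(tokens):
    """Return the token with the smallest total distance to the others."""
    return min(tokens, key=lambda t: sum(math.dist(t, o) for o in tokens))

def segment_medoids(tokens, num_segments):
    """Split tokens into contiguous segments and keep one medoid per segment."""
    size = math.ceil(len(tokens) / num_segments)
    segments = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    # Concatenate centers in segment order so temporal order is preserved.
    return [medoid(seg) for seg in segments]

# Hypothetical 2-D tokens from two groups of near-identical frames.
tokens = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
          (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
centers = segment_medoids(tokens, num_segments=2)
print(centers)  # [(0.1, 0.0), (5.1, 5.0)]
```

Six tokens are reduced to two, which is the source of the computational saving the abstract describes.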
Multi-modal hashing learns binary hash codes with extremely low storage cost and high retrieval speed. It can support efficient multi-modal retrieval well. However, most existing methods still suffer from three important problems: 1) Limited semantic representation capability with shallow learning. 2) Mandatory feature-level multi-modal fusion ignores heterogeneous multi-modal semantic gaps. 3) Direct coarse pairwise semantic preserving cannot effectively capture the fine-grained semantic correlations. For solving these problems, in this paper, we propose a Bit-aware Semantic Transformer Hashing (BSTH) framework to excavate bit-wise semantic concepts and simultaneously align the heterogeneous modalities for multi-modal hash learning on the concept-level. Specifically, the bit-wise implicit semantic concepts are learned with the transformer in a self-attention manner, which can achieve implicit semantic alignment on the fine-grained concept-level and reduce the heterogeneous modality gaps. Then, the concept-level multi-modal fusion is performed to enhance the semantic representation capability of each implicit concept and the fused concept representations are further encoded to the corresponding hash bits via bit-wise hash functions. Further, to supervise the bit-aware transformer module, a label prototype learning module is developed to learn prototype embeddings for all categories that capture the explicit semantic correlations on the category-level by considering the co-occurrence priors. Experiments on three widely tested multi-modal retrieval datasets demonstrate the superiority of the proposed method from various aspects.
Multi-modal Product Summary Generation is a new yet challenging task, which aims to generate a concise and readable summary for a product given its multi-modal content, e.g., its long text description and image. Although existing methods have achieved great success, they still suffer from three key limitations: 1) overlook the benefit of pre-training, 2) lack the representation-level supervision, and 3) ignore the diversity of the seller-generated data. To address these limitations, in this work, we propose a Vision-to-Prompt based multi-modal product summary generation framework, dubbed as V2P, where a Generative Pre-trained Language Model (GPLM) is adopted as the backbone. In particular, to maintain the original text capability of the GPLM and fully utilize the high-level concepts contained in the product image, we design V2P with two key components: vision-based prominent attribute prediction, and attribute prompt-guided summary generation. The first component works on obtaining the vital semantic attributes of the product from its image by the Swin Transformer, while the second component aims to generate the summary based on the product's long text description and the attribute prompts yielded by the first component with a GPLM. Towards comprehensive supervision over the second component, apart from the conventional output-level supervision, we introduce the representation-level regularization. Meanwhile, we design the data augmentation-based robustness regularization to handle the diverse inputs and improve the robustness of the second component. Extensive experiments on a large-scale Chinese dataset verify the superiority of our model over cutting-edge methods.
SESSION: Topic 14: Multimedia IR
Near-duplicate video retrieval (NDVR) aims to find the copies or transformations of the query video from a massive video database. It plays an important role in many video-related applications, including copyright protection, tracing, and filtering. Video representation and similarity search are crucial to any video retrieval system. To derive effective video representation, most video retrieval systems require a large amount of manually annotated data for training, making them costly and inefficient. In addition, most retrieval systems are based on frame-level features for video similarity searching, making them expensive in terms of both storage and search. To address these issues, we propose a video representation learning (VRL) approach. It first learns video representation from unlabeled videos via contrastive learning to avoid the expensive cost of manual annotation. Then, it exploits a transformer structure to aggregate frame-level features into clip-level features to reduce both storage space and search complexity. It can learn the complementary and discriminative information from the interactions among clip frames, as well as acquire invariance to frame permutation and missing frames to support more flexible retrieval manners. Comprehensive experiments on two challenging near-duplicate video retrieval datasets, namely FIVR-200K and SVD, verify the effectiveness of our proposed VRL approach, which achieves the best performance of video retrieval on accuracy and efficiency.
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format, involving both vision and text modalities. For example, a target product image is searched using a reference product image along with text about changing certain attributes of the reference image as the query. It is a more challenging image retrieval task that requires both semantic space learning and cross-modal fusion. Previous approaches that attempt to deal with both aspects achieve unsatisfactory performance. In this paper, we decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries. We first leverage the semantic embedding space for open-domain image-text retrieval, and then transfer the learned knowledge to the fashion-domain with fashion-related pre-training tasks. Finally, we enhance the pre-trained model from single-query to hybrid-modality query for the CTI-IR task. Furthermore, as the contribution of individual modality in the hybrid-modality query varies for different retrieval scenarios, we propose a self-supervised adaptive weighting strategy to dynamically determine the importance of image and text in the hybrid-modality query for better retrieval. Extensive experiments show that our proposed model significantly outperforms state-of-the-art methods in the mean of Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
Moment retrieval in videos is a challenging task that aims to retrieve the most relevant video moment in an untrimmed video given a sentence description. Previous methods tend to perform self-modal learning and cross-modal interaction in a coarse manner, which neglects fine-grained clues contained in video content, query context, and their alignment. To this end, we propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level. Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework. A coarse-grained feature encoder and a co-attention mechanism are utilized to obtain a preliminary perception of intra-modality and inter-modality information. Then a fine-grained feature encoder and a conditioned interaction module are introduced to enhance the initial perception, inspired by how humans address reading comprehension problems. Moreover, to alleviate the huge computation burden of some existing methods, we further design an efficient choice comparison module and reduce the hidden size with imperceptible quality loss. Extensive experiments on Charades-STA, TACoS, and ActivityNet Captions datasets demonstrate that our solution outperforms existing state-of-the-art methods.
Video moment retrieval aims at finding the start and end timestamps of a moment (part of a video) described by a given natural language query. Fully supervised methods need complete temporal boundary annotations to achieve promising results, which is costly since the annotator needs to watch the whole moment. Weakly supervised methods only rely on the paired video and query, but the performance is relatively poor. In this paper, we look closer into the annotation process and propose a new paradigm called "glance annotation". This paradigm requires the timestamp of only one single random frame, which we refer to as a "glance", within the temporal boundary of the fully supervised counterpart. We argue this is beneficial because, compared to weak supervision, it adds trivial cost yet offers more potential in performance. Under the glance annotation setting, we propose a method named Video moment retrieval via Glance Annotation (ViGA) based on contrastive learning. ViGA cuts the input video into clips and contrasts between clips and queries, in which glance-guided Gaussian-distributed weights are assigned to all clips. Our extensive experiments indicate that ViGA achieves better results than the state-of-the-art weakly supervised methods by a large margin, and is even comparable to fully supervised methods in some cases.
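The glance-guided weighting can be pictured with a minimal sketch: each clip receives a Gaussian weight that peaks at the clip containing the glance and decays with temporal distance. The unnormalized form, the `sigma` parameter, and the function name are assumptions made for illustration; the abstract does not specify how ViGA parameterizes the distribution.

```python
import math

def glance_weights(num_clips, glance_idx, sigma):
    """Weight each clip by a Gaussian centred on the clip holding the glance."""
    return [
        math.exp(-((i - glance_idx) ** 2) / (2 * sigma ** 2))
        for i in range(num_clips)
    ]

# Hypothetical video cut into 7 clips, with the glance falling in clip 3.
weights = glance_weights(num_clips=7, glance_idx=3, sigma=1.5)
print([round(w, 3) for w in weights])
```

Clips near the glance are treated as likely positives for the query, while distant clips contribute little, which is how a single timestamp can stand in for a full boundary annotation.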
SESSION: Topic 15: NLP and Semantics
Keyphrases can concisely describe the high-level topics discussed in a document that usually possesses hierarchical topic structures. Thus, it is crucial to understand the hierarchical topic structures and employ them to guide the keyphrase identification. However, integrating the hierarchical topic information into a deep keyphrase generation model is unexplored. In this paper, we focus on how to effectively exploit the hierarchical topic to improve the keyphrase generation performance (HTKG). Specifically, we propose a novel hierarchical topic-guided variational neural sequence generation method for keyphrase generation, which consists of two major modules: a neural hierarchical topic model that learns the latent topic tree across the whole corpus of documents, and a variational neural keyphrase generation model to generate keyphrases under hierarchical topic guidance. Finally, these two modules are jointly trained to help them learn complementary information from each other. To the best of our knowledge, this is the first attempt to leverage the neural hierarchical topic to guide keyphrase generation. The experimental results demonstrate that our method significantly outperforms the existing state-of-the-art methods across five benchmark datasets.
Machine reading comprehension has attracted wide attention, since it explores the potential of models for text understanding. To further equip the machine with the reasoning capability, the challenging task of logical reasoning has been proposed. Previous works on logical reasoning have proposed some strategies to extract the logical units from different aspects. However, it remains a challenge to model the long-distance dependency among the logical units. It is also demanding to uncover the logical structures of the text and further fuse the discrete logic into the continuous text embedding. To tackle the above issues, we propose an end-to-end model, Logiformer, which utilizes a two-branch graph transformer network for logical reasoning of text. Firstly, we introduce different extraction strategies to split the text into two sets of logical units, and construct the logical graph and the syntax graph respectively. The logical graph models the causal relations for the logical branch while the syntax graph captures the co-occurrence relations for the syntax branch. Secondly, to model the long-distance dependency, the node sequence from each graph is fed into the fully connected graph transformer structures. The two adjacency matrices are viewed as the attention biases for the graph transformer layers, which map the discrete logical structures to the continuous text embedding space. Thirdly, a dynamic gate mechanism and a question-aware self-attention module are introduced before the answer prediction to update the features. The reasoning process provides interpretability by employing the logical units, which are consistent with human cognition. The experimental results show the superiority of our model, which outperforms the state-of-the-art single model on two logical reasoning benchmarks.
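One way to read "adjacency matrices as attention biases" is as an additive term on the attention scores before the softmax, so that connected node pairs receive more attention while every pair remains reachable. The sketch below assumes that additive form with a hypothetical `bias_scale` parameter; the exact formulation used in Logiformer is not given in the abstract.

```python
import math

def biased_attention(scores, adjacency, bias_scale=1.0):
    """Row-wise softmax over attention scores plus an adjacency-based bias."""
    out = []
    for row_s, row_a in zip(scores, adjacency):
        biased = [s + bias_scale * a for s, a in zip(row_s, row_a)]
        m = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(b - m) for b in biased]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# Hypothetical 2-node graph: node 0 links only to itself, node 1 to both.
scores = [[0.0, 0.0], [0.0, 0.0]]
adjacency = [[1, 0], [1, 1]]
attn = biased_attention(scores, adjacency)
print(attn)
```

With uniform raw scores, node 0 attends more strongly to itself because only that pair is connected, while node 1 splits its attention evenly.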
An opinion tag is a sequence of words on a specific aspect of a product or service. Opinion tags reflect key characteristics of product reviews and help users quickly understand their content in e-commerce portals. The task of abstractive opinion tagging has previously been proposed to automatically generate a ranked list of opinion tags for a given review. However, current models for opinion tagging are not personalized, even though personalization is an essential ingredient of engaging user interactions, especially in e-commerce. In this paper, we focus on the task of personalized abstractive opinion tagging. There are two main challenges when developing models for the end-to-end generation of personalized opinion tags: the sparseness of reviews and the difficulty of integrating multi-type signals, i.e., explicit review signals and implicit behavioral signals. To address these challenges, we propose an end-to-end model, named POT, that consists of three main components: (1) a review-based explicit preference tracker component based on a hierarchical heterogeneous review graph to track user preferences from reviews; (2) a behavior-based implicit preference tracker component using a heterogeneous behavior graph to track the user preferences from implicit behaviors; and (3) a personalized rank-aware tagging component to generate a ranked sequence of personalized opinion tags. In our experiments, we evaluate POT on a real-world dataset collected from e-commerce platforms and the results demonstrate that it significantly outperforms strong baselines.
Entity Set Expansion (ESE) is a promising task which aims to expand entities of the target semantic class described by a small seed entity set. Various NLP and IR applications will benefit from ESE due to its ability to discover knowledge. Although previous ESE methods have achieved great progress, most of them still lack the ability to handle hard negative entities (i.e., entities that are difficult to distinguish from the target entities), since two entities may or may not belong to the same semantic class depending on the granularity level at which they are analyzed. To address this challenge, we devise an entity-level masked language model with contrastive learning to refine the representation of entities. In addition, we propose ProbExpan, a novel probabilistic ESE framework utilizing the entity representation obtained by the aforementioned language model to expand entities. Extensive experiments and detailed analyses on three datasets show that our method outperforms previous state-of-the-art methods.
Cross-Lingual Summarization (CLS) is a task that extracts important information from a source document and summarizes it into a summary in another language. It is a challenging task that requires a system to understand, summarize, and translate at the same time, making it highly related to Monolingual Summarization (MS) and Machine Translation (MT). In practice, the training resources for Machine Translation are far more than that for cross-lingual and monolingual summarization. Thus incorporating the Machine Translation corpus into CLS would be beneficial for its performance. However, the present work only leverages a simple multi-task framework to bring Machine Translation in, lacking deeper exploration.
In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit Cross-Lingual Summarization by large-scale Machine Translation corpus. Through introducing compression rate, the information ratio between the source and the target text, we regard the MT task as a special CLS task with a compression rate of 100%. Hence they can be trained as a unified task, sharing knowledge more effectively. However, a huge gap exists between the MT task and the CLS task, where samples with compression rates between 30% and 90% are extremely rare. Hence, to bridge these two tasks smoothly, we propose an effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries in desired lengths. Experiments demonstrate that our method outperforms various strong baselines in three cross-lingual summarization datasets. We released our code and data at https://github.com/ybai-nlp/CLS_CR.
What Makes the Story Forward?: Inferring Commonsense Explanations as Prompts for Future Event Generation
Prediction over event sequences is critical for many real-world applications in Information Retrieval and Natural Language Processing. Future Event Generation (FEG) is a challenging task in event sequence prediction because it requires not only fluent text generation but also commonsense reasoning to maintain the logical coherence of the entire event story. In this paper, we propose a novel explainable FEG framework, Coep. It highlights and integrates two types of event knowledge, sequential knowledge of direct event-event relations and inferential knowledge that reflects the intermediate character psychology between events, such as intents, causes, reactions, which intrinsically pushes the story forward. To alleviate the knowledge forgetting issue, we design two modules, IM and GM, for each type of knowledge, which are combined via prompt tuning. First, IM focuses on understanding inferential knowledge to generate commonsense explanations and provide a soft prompt vector for GM. We also design a contrastive discriminator for better generalization ability. Second, GM generates future events by modeling direct sequential knowledge with the guidance of IM. Automatic and human evaluation demonstrate that our approach can generate more coherent, specific, and logical future events.
Event argument extraction (EAE) is an important information extraction task, which aims to identify the arguments of an event described in a given text and classify the roles played by them. A key characteristic of realistic EAE data is that the instance numbers of different roles follow an obvious long-tail distribution. However, the training and evaluation paradigms of existing EAE models are either prone to neglect the performance on "tail roles", or change the role instance distribution for model training to an unrealistic uniform distribution. Though some generic methods can alleviate the class imbalance in long-tail datasets, they usually sacrifice the performance of "head classes" as a trade-off. To address the above issues, we propose to train our model on realistic long-tail EAE datasets, and evaluate the average performance over all roles. Inspired by the Mixture of Experts (MOE), we propose a Routing-Balanced Dual Expert Framework (RBDEF), which divides all roles into two scopes, "head" and "tail", and assigns the classifications of head and tail roles to two separate experts. In inference, each encoded instance will be allocated to one of the two experts by a routing mechanism. To reduce routing errors caused by the imbalance of role instances, we design a Balanced Routing Mechanism (BRM), which transfers several head roles to the tail expert to balance the load of routing, and employs a tri-filter routing strategy to reduce the misallocation of the tail expert's instances. To enable an effective learning of tail roles with scarce instances, we devise Target-Specialized Meta Learning (TSML) to train the tail expert. Different from other meta learning algorithms that only search a generic parameter initialization equally applying to infinite tasks, TSML can adaptively adjust its search path to obtain a specialized initialization for the tail expert, thereby expanding the benefits to the learning of tail roles.
In experiments, RBDEF significantly outperforms the state-of-the-art EAE models and advanced methods for long-tail data.
Event detection (ED) is a pivotal task for information retrieval, which aims at identifying event triggers and classifying them into pre-defined event types. In real-world applications, events are usually annotated with numerous fine-grained types, which often gives rise to a long-tail distribution of types and to co-occurring events. Existing studies explore event correlations without fully utilizing them, which may limit the capability of event detection. This paper simultaneously incorporates both the type-level and instance-level event correlations, and proposes a novel framework, termed CorED. Specifically, we devise an adaptive graph-based type encoder to capture type-level correlations, learning type representations not only from their training data but also from their relevant types, thus leading to more informative type representations, especially for the low-resource types. Besides, we devise an instance interactive decoder to capture instance-level correlations, which predicts event instance types conditioned on the contextual typed event instances, leveraging co-occurrence events as remarkable evidence in prediction. We conduct experiments on two public benchmarks, the MAVEN and ACE-2005 datasets. Empirical results demonstrate the unity of both type-level and instance-level correlations, and the model achieves effective performance on both benchmarks.
SESSION: Topic 16: POI and News Recommendations
Learning which Point-of-Interest (POI) a user will visit next is a challenging task for personalized recommender systems due to the large search space of possible POIs in the region. A recurring problem among existing works that makes it difficult to learn and perform well is the sparsity of the User-POI matrix. In this paper, we propose our Hierarchical Multi-Task Graph Recurrent Network (HMT-GRN) approach, which alleviates the data sparsity problem by learning different User-Region matrices of lower sparsities in a multi-task setting. We then perform a Hierarchical Beam Search (HBS) on the different region and POI distributions to hierarchically reduce the search space with increasing spatial granularity and predict the next POI. Our HBS provides efficiency gains by reducing the search space, resulting in speedups of 5 to 7 times over an exhaustive approach. In addition, we also propose a novel selectivity layer to predict if the next POI has been visited before by the user to balance between personalization and exploration. Experimental results on two real-world Location-Based Social Network (LBSN) datasets show that our model significantly outperforms baseline and the state-of-the-art methods.
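The hierarchical search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-level hierarchy (region, then POI), the probability tables, and the function name `hierarchical_beam_search` are assumptions made for illustration, and all probabilities are assumed to be strictly positive.

```python
import math
from typing import Dict, List, Tuple


def hierarchical_beam_search(
    region_probs: Dict[str, float],
    poi_probs: Dict[str, float],
    region_to_pois: Dict[str, List[str]],
    beam_width: int,
) -> List[Tuple[str, float]]:
    """Rank POIs by joint log-probability, expanding only the top regions."""
    # Step 1: keep only the `beam_width` most probable regions.
    top_regions = sorted(region_probs, key=region_probs.get, reverse=True)[:beam_width]
    # Step 2: expand each surviving region into its POIs and score them jointly.
    candidates = []
    for region in top_regions:
        for poi in region_to_pois[region]:
            score = math.log(region_probs[region]) + math.log(poi_probs[poi])
            candidates.append((poi, score))
    # Step 3: keep the `beam_width` best POI hypotheses.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:beam_width]


# Hypothetical toy distributions (not data from the paper).
region_probs = {"north": 0.6, "south": 0.3, "east": 0.1}
poi_probs = {"cafe": 0.5, "museum": 0.2, "park": 0.2, "mall": 0.1}
region_to_pois = {"north": ["cafe", "park"], "south": ["museum"], "east": ["mall"]}
ranked = hierarchical_beam_search(region_probs, poi_probs, region_to_pois, beam_width=2)
```

In this toy setting the POI in the pruned region ("mall") is never scored, which is where the reduction of the search space comes from.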
Next POI recommendation intends to forecast users' immediate future movements given their current status and historical information, yielding great value for both users and service providers. However, this problem is perceptibly complex because various data trends need to be considered together, including spatial locations, temporal contexts, and users' preferences. Most existing studies view next POI recommendation as a sequence prediction problem while omitting the collaborative signals from other users. Instead, we propose a user-agnostic global trajectory flow map and a novel Graph Enhanced Transformer model (GETNext) to better exploit the extensive collaborative signals for a more accurate next POI prediction, and to alleviate the cold start problem in the meantime. GETNext incorporates the global transition patterns, users' general preferences, spatio-temporal context, and time-aware category embeddings together into a transformer model to predict users' future moves. With this design, our model outperforms the state-of-the-art methods by a large margin and also sheds light on the cold start challenges within spatio-temporal recommendation problems.
Next Point-of-Interest (POI) recommendation plays a critical role in many location-based applications as it provides personalized suggestions on attractive destinations for users. Since users' next movement is highly related to the historical visits, sequential methods such as recurrent neural networks are widely used in this task for modeling check-in behaviors. However, existing methods mainly focus on modeling the sequential regularity of check-in sequences but pay little attention to the intrinsic characteristics of POIs, neglecting the entanglement of the diverse influence stemming from different aspects of POIs. In this paper, we propose a novel Disentangled Representation-enhanced Attention Network (DRAN) for next POI recommendation, which leverages the disentangled representations to explicitly model different aspects and corresponding influence for representing a POI more precisely. Specifically, we first design a propagation rule to learn graph-based disentangled representations by refining two types of POI relation graphs, making full use of the distance-based and transition-based influence for representation learning. Then, we extend the attention architecture to aggregate personalized spatio-temporal information for modeling dynamic user preferences on the next timestamp, while maintaining the different components of disentangled representations independent. Extensive experiments on two real-world datasets demonstrate the superior performance of our model to state-of-the-art approaches. Further studies confirm the effectiveness of DRAN in representation disentanglement.
News recommendation aims to help online news platform users find their preferred news articles. Existing news recommendation methods usually learn models from historical user behaviors on news. However, these behaviors are usually biased on news providers. Models trained on biased user data may capture and even amplify the biases on news providers, and are unfair for some minority news providers. In this paper, we propose a provider fairness-aware news recommendation framework (named ProFairRec), which can learn news recommendation models fair for different news providers from biased user data. The core idea of ProFairRec is to learn provider-fair news representations and provider-fair user representations to achieve provider fairness. To learn provider-fair representations from biased data, we employ provider-biased representations to inherit provider bias from data. Provider-fair and -biased news representations are learned from news content and provider IDs respectively, which are further aggregated to build fair and biased user representations based on user click history. All of these representations are used in model training while only fair representations are used for user-news matching to achieve fair news recommendation. Besides, we propose an adversarial learning task on news provider discrimination to prevent provider-fair news representation from encoding provider bias. We also propose an orthogonal regularization on provider-fair and -biased representations to better reduce provider bias in provider-fair representations. Moreover, ProFairRec is a general framework and can be applied to different news recommendation methods. Extensive experiments on a public dataset verify that our ProFairRec approach can effectively improve the provider fairness of many existing methods and meanwhile maintain their recommendation accuracy.
Pre-travel out-of-town recommendation aims to recommend Points-of-Interest (POIs) to users who plan to travel out of their hometown in the near future yet have not decided where to go, i.e., their destination regions and POIs both remain unknown. It is a non-trivial task since the search space is vast, which may lead to distinct travel experiences in different out-of-town regions and eventually confuse decision-making. Besides, users' out-of-town travel behaviors are affected not only by their personalized preferences but also heavily by others' travel behaviors. To this end, we propose a Crowd-Aware Pre-Travel Out-of-town Recommendation framework (CAPTOR) consisting of two major modules: spatial-affined conditional random field (SA-CRF) and crowd behavior memory network (CBMN). Specifically, SA-CRF captures the spatial affinity among POIs while preserving the inherent information of POIs. Then, CBMN is proposed to maintain the crowd travel behaviors w.r.t. each region through three affiliated blocks reading and writing the memory adaptively. We devise an elaborated metric space with a dynamic mapping mechanism, where the users and POIs are distinguishable both inherently and geographically. Extensive experiments on two real-world nationwide datasets validate the effectiveness of CAPTOR on the pre-travel out-of-town recommendation task.
News recommendation for anonymous readers is a useful but challenging task for many news portals, where interactions between readers and articles are limited within a temporary login session. Previous works tend to formulate session-based recommendation as a next item prediction task, while they neglect the implicit feedback from user behaviors, which indicates what users really like or dislike. Hence, we propose a comprehensive framework to model user behaviors through positive feedback (i.e., the articles they spend more time on) and negative feedback (i.e., the articles they choose to skip without clicking in). Moreover, the framework implicitly models the user using their session start time, and the article using its initial publishing time, in what we call neutral feedback. Empirical evaluation on three real-world news datasets shows that the framework delivers more accurate, diverse, and even unexpected recommendations than other state-of-the-art session-based recommendation approaches.
SESSION: Topic 17: Question Answering
Non-factoid question answering (NFQA) is a challenging and under-researched task that requires constructing long-form answers, such as explanations or opinions, to open-ended non-factoid questions - NFQs. There is still little understanding of the categories of NFQs that people tend to ask, what form of answers they expect to see in return, and what the key research challenges of each category are.
This work presents the first comprehensive taxonomy of NFQ categories and the expected structure of answers. The taxonomy was constructed with a transparent methodology and extensively evaluated via crowdsourcing. The most challenging categories were identified through an editorial user study. We also release a dataset of categorised NFQs and a question category classifier.
Finally, we conduct a quantitative analysis of the distribution of question categories using major NFQA datasets, showing that the NFQ categories that are the most challenging for current NFQA systems are poorly represented in these datasets. This imbalance may lead to insufficient system performance for challenging categories. The new taxonomy, along with the category classifier, will aid research in the area, helping to create more balanced benchmarks and to focus models on addressing specific categories.
Designing natural language processing (NLP) models that produce predictions by first extracting a set of relevant input sentences, i.e., rationales, is gaining importance for improving model interpretability and producing supporting evidence for users. Current unsupervised approaches are designed to extract rationales that maximize prediction accuracy, which is invariably obtained by exploiting spurious correlations in datasets, and leads to unconvincing rationales. In this paper, we introduce unsupervised generative models to extract dual-purpose rationales, which must not only be able to support a subsequent answer prediction, but also support a reproduction of the input query. We show that such models can produce more meaningful rationales, that are less influenced by dataset artifacts, and as a result, also achieve the state-of-the-art on rationale extraction metrics on four datasets from the ERASER benchmark, significantly improving upon previous unsupervised methods. Our multi-task model is scalable and enables using state-of-the-art pretrained language models to design explainable question answering systems.
Current question answering systems are insufficient when confronting real-life scenarios, as they can hardly be aware of whether a question is answerable given its context. Hence, there is a recent pursuit of unanswerability of a question and its attribution. Attribution of unanswerability requires the system to choose an appropriate cause for an unanswerable question. As the task is difficult even for human beings, it is expensive to acquire labeled data, which makes it a low-data regime problem. Moreover, the causes themselves are semantically abstract and complex, and the process of attribution is heavily question- and context-dependent. Thus, a capable model has to carefully appreciate the causes, and then judiciously contrast the question with its context, in order to cast it into the right cause. In response to the challenges, we present PTAU, which refers to and implements a high-level human reading strategy such that one reads with anticipation. Specifically, PTAU leverages the recent prompt-tuning paradigm, and is further enhanced with two innovatively conceived modules: 1) a cause-oriented template module that constructs continuous templates towards a certain attributing class in high dimensional vector space; and 2) a semantics-aware label module that exploits label semantics through contrastive learning to render the classes distinguishable. Extensive experiments demonstrate that the proposed design better enlightens not only the attribution model, but also current question answering models, leading to superior performance.
Community question answering (CQA) has become increasingly prevalent in recent years, providing platforms for users with various backgrounds to obtain information and share knowledge. However, the redundancy and lengthiness of crowd-sourced answers limit the performance of answer selection, thus leading to difficulties in reading or even misunderstandings for community users. To solve these problems, we propose the dual graph question-answer attention networks (DGQAN) for the answer selection task. Aiming to fully understand the internal structure of the question and the corresponding answer, we first construct a dual-CQA concept graph with graph convolution networks using the original question and answer text. Specifically, our CQA concept graph exploits the correlation information between question-answer pairs to construct two sub-graphs (QSubject-Answer and QBody-Answer), respectively. Further, a novel dual attention mechanism is incorporated to model both the internal and external semantic relations among questions and answers. More importantly, we conduct experiments to investigate the impact of each layer in the BERT model. The experimental results show that the DGQAN model achieves state-of-the-art performance on three datasets (SemEval-2015, 2016, and 2017), outperforming all the baseline models.
SESSION: Topic 18: Recommender Systems
ReCANet: A Repeat Consumption-Aware Neural Network for Next Basket Recommendation in Grocery Shopping
Retailers such as grocery stores or e-marketplaces often have vast selections of items for users to choose from. Predicting a user's next purchases has gained attention recently, in the form of next basket recommendation (NBR), as it facilitates navigating extensive assortments for users. Neural network-based models that focus on learning basket representations are the dominant approach in the recent literature. However, these methods do not consider the specific characteristics of the grocery shopping scenario, where users shop for grocery items on a regular basis, and grocery items are repurchased frequently by the same user.
In this paper, we first gain a data-driven understanding of users' repeat consumption behavior through an empirical study on six public and proprietary grocery shopping transaction datasets. We discover that, averaged over all datasets, over 54% of NBR performance in terms of recall comes from repeat items: items that users have already purchased in their history, which constitute only 1% of the total collection of items on average. An NBR model with a strong focus on previously purchased items can potentially achieve high performance. We introduce ReCANet, a repeat consumption-aware neural network that explicitly models the repeat consumption behavior of users in order to predict their next basket. ReCANet significantly outperforms state-of-the-art models for the NBR task, in terms of recall and nDCG. We perform an ablation study and show that all of the components of ReCANet contribute to its performance, and demonstrate that a user's repetition ratio has a direct influence on the treatment effect of ReCANet.
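One way to read the statement that a share of recall "comes from repeat items" is to split each user's recall into hits on previously purchased items and hits on new items. The sketch below illustrates that reading only; the function name `split_recall` and the toy baskets are hypothetical, and the paper's exact protocol may differ.

```python
from typing import Set, Tuple


def split_recall(history: Set[str], predicted: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Return (recall due to repeat items, recall due to explore items)."""
    if not truth:
        return 0.0, 0.0
    hits = predicted & truth
    repeat_hits = hits & history   # correct predictions the user bought before
    explore_hits = hits - history  # correct predictions new to the user
    return len(repeat_hits) / len(truth), len(explore_hits) / len(truth)


# Hypothetical toy user (not data from the paper).
history = {"milk", "bread", "eggs"}
predicted = {"milk", "bread", "tofu"}
truth = {"milk", "bread", "tofu", "rice"}
repeat_recall, explore_recall = split_recall(history, predicted, truth)
```

The two parts add up to the ordinary recall of the predicted basket, so their ratio gives the share attributable to repeat items for that user.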
Recommender systems usually face the issue of filter bubbles: over-recommending homogeneous items based on user features and historical interactions. Filter bubbles will grow along the feedback loop and inadvertently narrow user interests. Existing work usually mitigates filter bubbles by incorporating objectives apart from accuracy such as diversity and fairness. However, they typically sacrifice accuracy, hurting model fidelity and user experience. Worse still, users have to passively accept the recommendation strategy and influence the system in an inefficient manner with high latency, e.g., keeping providing feedback (e.g., like and dislike) until the system recognizes the user intention.
This work proposes a new recommender prototype called User-Controllable Recommender System (UCRS), which enables users to actively control the mitigation of filter bubbles. Functionally, 1) UCRS can alert users if they are deeply stuck in filter bubbles. 2) UCRS supports four kinds of control commands for users to mitigate the bubbles at different granularities. 3) UCRS can respond to the controls and adjust the recommendations on the fly. The key to adjusting lies in blocking the effect of out-of-date user representations on recommendations, which contains historical information inconsistent with the control commands. As such, we develop a causality-enhanced User-Controllable Inference (UCI) framework, which can quickly revise the recommendations based on user controls in the inference stage and utilize counterfactual inference to mitigate the effect of out-of-date user representations. Experiments on three datasets validate that the UCI framework can effectively recommend more desired items based on user controls, showing promising performance w.r.t. both accuracy and diversity.
Knowledge graph (KG), integrating complex information and containing rich semantics, is widely considered as side information to enhance the recommendation systems. However, most of the existing KG-based methods concentrate on encoding the structural information in the graph, without utilizing the collaborative signals in user-item interaction data, which are important for understanding user preferences. Therefore, the representations learned by these models are insufficient for representing semantic information of users and items in the recommendation environment. The combination of both kinds of data provides a good chance to solve this problem, but it faces the following challenges: i) the inner correlations in user-item interaction data are difficult to capture from one side of the user or item; ii) capturing the knowledge associations on the whole KG would introduce noises and variously influence the recommendation results; iii) the semantic gap between both kinds of data is hard to alleviate.
To tackle this research gap, we propose a novel duet representation learning framework named KADM to fuse local information (user-item interaction data) and global information (external knowledge graph) for the top-N recommendation, which is composed of two separate sub-models. One learns the local representations by discovering the inner correlations in local information with a knowledge-aware co-attention mechanism, and another learns the global representations by encoding the knowledge associations in global information with a relation-aware attention network. The two sub-models are jointly trained as part of the semantic fusion network to compute the user preferences, which discriminates the contribution of the two sub-models under the special context. We conduct experiments on two real-world datasets, and the evaluations show that KADM significantly outperforms state-of-the-art methods. Further ablation studies confirm that the duet architecture performs significantly better than either sub-model on the recommendation tasks.
As much as Graph Convolutional Networks (GCNs) have shown tremendous success in recommender systems and collaborative filtering (CF), the mechanism of how they, especially the core components (i.e., neighborhood aggregation), contribute to recommendation has not been well studied. To unveil the effectiveness of GCNs for recommendation, we first analyze them in a spectral perspective and discover two important findings: (1) only a small portion of spectral graph features that emphasize the neighborhood smoothness and difference contribute to the recommendation accuracy, whereas most graph information can be considered as noise that even reduces the performance, and (2) repetition of the neighborhood aggregation emphasizes smoothed features and filters out noise information in an ineffective way. Based on the two findings above, we propose a new GCN learning scheme for recommendation by replacing neighborhood aggregation with a simple yet effective Graph Denoising Encoder (GDE), which acts as a band pass filter to capture important graph features. We show that our proposed method alleviates the over-smoothing and is comparable to an indefinite-layer GCN that can take any-hop neighborhood into consideration. Finally, we dynamically adjust the gradients over the negative samples to expedite model training without introducing additional complexity. Extensive experiments on five real-world datasets show that our proposed method not only outperforms state-of-the-art methods but also achieves 12x speedup over LightGCN.
Most modern recommender systems predict users' preferences with two components: user and item embedding learning, followed by the user-item interaction modeling. By utilizing the auxiliary review information accompanied with user ratings, many of the existing review-based recommendation models enriched user/item embedding learning ability with historical reviews or better modeled user-item interactions with the help of available user-item target reviews. Though significant progress has been made, we argue that current solutions for review-based recommendation suffer from two drawbacks. First, as review-based recommendation can be naturally formed as a user-item bipartite graph with edge features from corresponding user-item reviews, how to better exploit this unique graph structure for recommendation? Second, while most current models suffer from limited user behaviors, can we exploit the unique self-supervised signals in the review-aware graph to guide two recommendation components better? To this end, in this paper, we propose a novel Review-aware Graph Contrastive Learning (RGCL) framework for review-based recommendation. Specifically, we first construct a review-aware user-item graph with feature-enhanced edges from reviews, where each edge feature is composed of both the user-item rating and the corresponding review semantics. This graph with feature-enhanced edges can help attentively learn each neighbor node weight for user and item representation learning. After that, we design two additional contrastive learning tasks (i.e., Node Discrimination and Edge Discrimination) to provide self-supervised signals for the two components in recommendation process. Finally, extensive experiments over five benchmark datasets demonstrate the superiority of our proposed RGCL compared to the state-of-the-art baselines.
Contrastive learning (CL) recently has spurred a fruitful line of research in the field of recommendation, since its ability to extract self-supervised signals from the raw data is well-aligned with recommender systems' needs for tackling the data sparsity issue. A typical pipeline of CL-based recommendation models is first augmenting the user-item bipartite graph with structure perturbations, and then maximizing the node representation consistency between different graph augmentations. Although this paradigm turns out to be effective, what underlies the performance gains is still a mystery. In this paper, we first experimentally disclose that, in CL-based recommendation models, CL operates by learning more uniform user/item representations that can implicitly mitigate the popularity bias. Meanwhile, we reveal that the graph augmentations, which used to be considered necessary, just play a trivial role. Based on this finding, we propose a simple CL method which discards the graph augmentations and instead adds uniform noises to the embedding space for creating contrastive views. A comprehensive experimental study on three benchmark datasets demonstrates that, though it appears strikingly simple, the proposed method can smoothly adjust the uniformity of learned representations and has distinct advantages over its graph augmentation-based counterparts in terms of recommendation accuracy and training efficiency. The code is released at https://github.com/Coder-Yu/QRec.
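The idea of creating contrastive views by perturbing embeddings instead of the graph can be sketched as follows. This is an illustration under assumptions rather than the released implementation: scaling the noise to a fixed L2 norm `eps`, the InfoNCE form of the loss, the cosine similarity, and all function names are choices made here for the example.

```python
import math
import random
from typing import List

Vector = List[float]


def perturb(embedding: Vector, eps: float, rng: random.Random) -> Vector:
    """Add uniformly drawn noise, rescaled to L2 norm `eps`, to an embedding."""
    noise = [rng.uniform(-1.0, 1.0) for _ in embedding]
    norm = math.sqrt(sum(x * x for x in noise)) or 1.0
    return [e + eps * n / norm for e, n in zip(embedding, noise)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def info_nce(view_a: List[Vector], view_b: List[Vector], temperature: float) -> float:
    """Average InfoNCE loss; row i of each view is a positive pair."""
    total = 0.0
    for i, a in enumerate(view_a):
        logits = [cosine(a, b) / temperature for b in view_b]
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


# Hypothetical toy embeddings for three nodes.
rng = random.Random(0)
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
view_a = [perturb(e, 0.1, rng) for e in embeddings]
view_b = [perturb(e, 0.1, rng) for e in embeddings]
loss = info_nce(view_a, view_b, temperature=0.2)
```

Because the two views differ only by small noise vectors, no edge or node dropout is needed to obtain a positive pair for each node.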
In recommendation systems, the choice of loss function is critical since a good loss may significantly improve the model performance. However, manually designing a good loss is a big challenge due to the complexity of the problem. A large fraction of previous work focuses on handcrafted loss functions, which need significant expertise and human effort. In this paper, inspired by the recent development of automated machine learning, we propose an automatic loss function generation framework, AutoLossGen, which is able to generate loss functions directly constructed from basic mathematical operators without prior knowledge on loss structure. More specifically, we develop a controller model driven by reinforcement learning to generate loss functions, and develop an iterative and alternating optimization schedule to update the parameters of both the controller model and the recommender model. One challenge for automatic loss generation in recommender systems is the extreme sparsity of recommendation datasets, which leads to the sparse reward problem for loss generation and search. To solve the problem, we further develop a reward filtering mechanism for efficient and effective loss generation. Experimental results show that our framework manages to create tailored loss functions for different recommendation models and datasets, and the generated loss gives better recommendation performance than commonly used baseline losses. Besides, most of the generated losses are transferable, i.e., the loss generated based on one model and dataset also works well for another model or dataset. Source code of the work is available at https://github.com/rutgerswiselab/AutoLossGen.
Locality-Sensitive State-Guided Experience Replay Optimization for Sparse Rewards in Online Recommendation
Online recommendation requires handling rapidly changing user preferences. Deep reinforcement learning (DRL) is an effective means of capturing users' dynamic interest during interactions with recommender systems. Generally, it is challenging to train a DRL agent in online recommender systems because of the sparse rewards caused by the large action space (e.g., candidate item space) and comparatively fewer user interactions. Leveraging experience replay (ER) has been extensively studied to conquer the issue of sparse rewards. However, existing ER methods adapt poorly to the complex environment of online recommender systems and are inefficient in learning an optimal strategy from past experience. As a step to filling this gap, we propose a novel state-aware experience replay model, in which the agent selectively discovers the most relevant and salient experiences and is guided to find the optimal policy for online recommendations. In particular, a locality-sensitive hashing method is proposed to selectively retain the most meaningful experience at scale and a prioritized reward-driven strategy is designed to replay more valuable experiences with higher chance. We formally show that the proposed method guarantees the upper and lower bound on experience replay and optimizes the space complexity, and we empirically demonstrate our model's superiority to several existing experience replay methods over three benchmark simulation platforms.
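A rough sketch of hashing-based retention combined with reward-weighted replay is given below. It swaps in a generic random-hyperplane hash and a simple "keep the highest-reward experience per bucket" rule; both, along with every function name, are assumptions for illustration and are not taken from the paper.

```python
import random
from typing import Dict, List, Tuple

State = List[float]
Experience = Tuple[State, float]  # (state vector, reward)
Key = Tuple[int, ...]


def make_hyperplanes(dim: int, n_bits: int, rng: random.Random) -> List[State]:
    """Draw random hyperplanes; each contributes one bit of the hash key."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(state: State, hyperplanes: List[State]) -> Key:
    """Similar directions tend to fall on the same side of each hyperplane."""
    return tuple(
        1 if sum(s * h for s, h in zip(state, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def retain(experiences: List[Experience], hyperplanes: List[State]) -> Dict[Key, Experience]:
    """Keep one experience per bucket: the one with the highest reward."""
    buffer: Dict[Key, Experience] = {}
    for state, reward in experiences:
        key = lsh_key(state, hyperplanes)
        if key not in buffer or reward > buffer[key][1]:
            buffer[key] = (state, reward)
    return buffer


def sample(buffer: Dict[Key, Experience], k: int, rng: random.Random) -> List[Experience]:
    """Replay experiences with probability increasing in their reward."""
    items = list(buffer.values())
    weights = [max(reward, 0.0) + 1e-6 for _, reward in items]
    return rng.choices(items, weights=weights, k=k)


# Hypothetical toy experiences.
rng = random.Random(7)
planes = make_hyperplanes(dim=2, n_bits=4, rng=rng)
experiences = [([1.0, 2.0], 0.1), ([2.0, 4.0], 0.9), ([-1.0, -2.0], 0.5)]
buffer = retain(experiences, planes)
replayed = sample(buffer, k=5, rng=rng)
```

The buffer size is bounded by the number of hash buckets (at most 2 to the power of `n_bits` here), which is one simple way to cap memory use.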
Recommender systems have become a fundamental service in most E-Commerce platforms, in which the matching stage aims to retrieve potentially relevant candidate items to users for further ranking. Recently, some efforts on extracting multi-interests from user's historical behaviors have demonstrated superior performance. However, the historical behaviors are not noise-free due to the possible misclicks or disturbances. Existing works mainly overlook the fact that the interests of a user are not only reflected by the historical behaviors, but also inherently regulated by the profile information. Hence, we are interested in exploiting the benefit of user profile in multi-interest learning to enhance candidate matching performance. To this end, a user-aware multi-interest learning framework (named UMI) is proposed in this paper to exploit both user profile and behavior information for candidate matching. Specifically, UMI consists of two main components: dual-attention routing and interest refinement. In the dual-attention routing, we firstly introduce a user-guided attention network to identify the important historical items with respect to the user profile. Then, the resultant importance weights are leveraged via the dual-attentive capsule network to extract the user's multi-interests. Afterwards, the extracted interests are utilized to highlight the corresponding user profile features for interest refinement, such that different user profiles can be incorporated into interest learning for diverse user preference understanding. Besides, to improve the model's discriminative capacity, we further devise a harder-negatives strategy to support model optimization. Extensive experiments show that UMI significantly outperforms state-of-the-art multi-interest modeling alternatives. Currently, UMI has been successfully deployed at Taobao App in Alibaba, serving hundreds of millions of users.
As the final stage of the multi-stage recommender system (MRS), reranking directly affects users' experience and satisfaction, thus playing a critical role in MRS. Despite the improvement achieved in the existing work, three issues are yet to be solved. First, users' historical behaviors contain rich preference information, such as users' long and short-term interests, but are not fully exploited in reranking. Previous work typically treats items in history equally important, neglecting the dynamic interaction between the history and candidate items. Second, existing reranking models focus on learning interactions at the item level while ignoring the fine-grained feature-level interactions. Lastly, estimating the reranking score on the ordered initial list before reranking may lead to the early scoring problem, thereby yielding suboptimal reranking performance. To address the above issues, we propose a framework named Multi-level Interaction Reranking (MIR). MIR combines low-level cross-item interaction and high-level set-to-list interaction, where we view the candidate items to be reranked as a set and the users' behavior history in chronological order as a list. We design a novel SLAttention structure for modeling the set-to-list interactions with personalized long-short term interests. Moreover, feature-level interactions are incorporated to capture the fine-grained influence among items. We design MIR in such a way that any permutation of the input items would not change the output ranking, and we theoretically prove it. Extensive experiments on three public and proprietary datasets show that MIR significantly outperforms the state-of-the-art models using various ranking and utility metrics.
Modern recommender systems aim to improve user experience. As reinforcement learning (RL) naturally fits this objective (maximizing a user's reward per session), it has become an emerging topic in recommender systems. Developing RL-based recommendation methods, however, is not trivial due to the offline training challenge. Specifically, the keystone of traditional RL is to train an agent with large amounts of online exploration, making many 'errors' in the process. In the recommendation setting, though, we cannot afford the price of making 'errors' online. As a result, the agent needs to be trained through offline historical implicit feedback, collected under different recommendation policies; traditional RL algorithms may lead to sub-optimal policies under these offline training settings.
Here we propose a new learning paradigm, namely Prompt-Based Reinforcement Learning (PRL), for the offline training of RL-based recommendation agents. While traditional RL algorithms attempt to map state-action input pairs to their expected rewards (e.g., Q-values), PRL directly infers actions (i.e., recommended items) from state-reward inputs. In short, the agents are trained to predict a recommended item given the prior interactions and an observed reward value, using simple supervised learning. At deployment time, this historical (training) data acts as a knowledge base, while the state-reward pairs are used as a prompt. The agents are thus used to answer the question: Which item should be recommended given the prior interactions and the prompted reward value? We implement PRL with four notable recommendation models and conduct experiments on two real-world e-commerce datasets. Experimental results demonstrate the superior performance of our proposed methods.
Knowledge graphs (KGs) play an increasingly important role in recommender systems. Recently, graph neural network (GNN) based models have gradually become the main theme of knowledge-aware recommendation (KGR). However, GNN-based KGR models have a natural deficiency, namely the sparse supervised signal problem, which may cause their actual performance to drop to some extent. Inspired by the recent success of contrastive learning in mining supervised signals from the data itself, in this paper we focus on exploring contrastive learning in KG-aware recommendation and propose a novel multi-level cross-view contrastive learning mechanism, named MCCLK. Different from traditional contrastive learning methods, which generate two graph views by uniform data augmentation schemes such as corruption or dropping, we comprehensively consider three different graph views for KG-aware recommendation: a global-level structural view, and local-level collaborative and semantic views. Specifically, we consider the user-item graph as a collaborative view, the item-entity graph as a semantic view, and the user-item-entity graph as a structural view. MCCLK hence performs contrastive learning across three views on both local and global levels, mining comprehensive graph feature and structure information in a self-supervised manner. Besides, in the semantic view, a k-Nearest-Neighbor (kNN) item-item semantic graph construction module is proposed to capture the important item-item semantic relation, which is usually ignored by previous work. Extensive experiments conducted on three benchmark datasets show the superior performance of our proposed method over state-of-the-art methods. The implementations are available at: https://github.com/CCIIPLab/MCCLK.
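To illustrate the kNN item-item graph construction mentioned in the abstract above, here is a minimal sketch. The abstract does not say how similarity is measured; cosine similarity over item embeddings is an assumption here, and the names `knn_graph`, `cosine`, and the toy `items` dictionary are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def knn_graph(embeddings, k):
    """Map each item to its k most similar other items.

    Similarity is cosine similarity (an assumption, not taken from the paper).
    Ties are broken by item id so the result is deterministic.
    """
    graph = {}
    for i, emb_i in embeddings.items():
        sims = [(cosine(emb_i, emb_j), j)
                for j, emb_j in embeddings.items() if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[i] = [j for _, j in sims[:k]]
    return graph


# Hypothetical toy item embeddings forming two obvious clusters.
items = {
    "i1": [1.0, 0.0],
    "i2": [0.9, 0.1],
    "i3": [0.0, 1.0],
    "i4": [0.1, 0.9],
}
graph = knn_graph(items, k=1)
```

With `k=1`, each item is linked to the single nearest item in its own cluster. A real system would likely use an approximate nearest-neighbor index rather than this quadratic loop.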
Off-policy learning has drawn huge attention in recommender systems (RS), which provides an opportunity for reinforcement learning to abandon the expensive online training. However, off-policy learning from logged data suffers biases caused by the policy shift between the target policy and the logging policy. Consequently, most off-policy learning resorts to inverse propensity scoring (IPS) which however tends to be over-fitted over exposed (or recommended) items and thus fails to explore unexposed items.
In this paper, we propose meta graph enhanced off-policy learning (MGPolicy), which is the first recommendation model for correcting the off-policy bias via contextual information. In particular, we explicitly leverage rich semantics in meta graphs for user state representation, and then train the candidate generation model to promote an efficient search in the action space. Moreover, MGPolicy is designed with counterfactual risk minimization, which can correct policy learning bias and ultimately yield an effective target policy to maximize the long-run rewards for the recommendation. We extensively evaluate our method through a series of simulations and large-scale real-world datasets, achieving favorable results compared with state-of-the-art methods. Our code is currently available online.
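For readers unfamiliar with inverse propensity scoring (IPS), which the abstract above names as the common remedy for policy shift, the following is a minimal sketch of the standard IPS value estimator. It is not MGPolicy itself. The log format, the function name `ips_estimate`, and the toy data are assumptions made for illustration.

```python
def ips_estimate(logs, target_policy):
    """Standard IPS estimate of a target policy's expected reward.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave to the
          logged action (must be > 0).
    target_policy: function (context, action) -> probability of that action
          under the policy being evaluated.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Re-weight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


# Hypothetical toy log: the logging policy exposed item "a" 80% of the time.
logs = [
    ("u1", "a", 1.0, 0.8),
    ("u2", "a", 0.0, 0.8),
    ("u3", "b", 1.0, 0.2),
    ("u4", "a", 1.0, 0.8),
]


def uniform_policy(context, action):
    """Hypothetical target policy choosing between two items uniformly."""
    return 0.5


value = ips_estimate(logs, uniform_policy)
```

In this toy log the rarely exposed item "b" receives a weight of 2.5 while the frequently exposed item "a" receives 0.625, so a single logged interaction dominates the estimate. Estimates therefore become unstable for items the logging policy rarely showed.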
Recommendation systems make predictions chiefly based on users' historical interaction data (e.g., items previously clicked or purchased). There is a risk of privacy leakage when collecting the users' behavior data for building the recommendation model. However, existing privacy-preserving solutions are designed for tackling the privacy issue only during the model training  and results collection  phases. The problem of privacy leakage still exists when directly sharing the private user interaction data with organizations or releasing them to the public. To address this problem, in this paper, we present a User Privacy Controllable Synthetic Data Generation model (short for UPC-SDG), which generates synthetic interaction data for users based on their privacy preferences. The generation model aims to provide certain privacy guarantees while maximizing the utility of the generated synthetic data at both data level and item level. Specifically, at the data level, we design a selection module that selects those items that contribute less to a user's preferences from the user's interaction data. At the item level, a synthetic data generation module is proposed to generate a synthetic item corresponding to the selected item based on the user's preferences. Furthermore, we also present a privacy-utility trade-off strategy to balance the privacy and utility of the synthetic data. Extensive experiments and ablation studies have been conducted on three publicly accessible datasets to justify our method, demonstrating its effectiveness in generating synthetic data under users' privacy preferences.
Knowledge graph (KG) plays an increasingly important role to improve the recommendation performance and interpretability. A recent technical trend is to design end-to-end models based on the information propagation schemes. However, existing propagation-based methods fail to (1) model the underlying hierarchical structures and relations, and (2) capture the high-order collaborative signals of items for learning high-quality user and item representations.
In this paper, we propose a new model, called Hierarchy-Aware Knowledge Gated Network (HAKG), to tackle the aforementioned problems. Technically, we model users and items (that are captured by a user-item graph), as well as entities and relations (that are captured in a KG) in hyperbolic space, and design a new hyperbolic aggregation scheme to gather relational contexts over KG. Meanwhile, we introduce a novel angle constraint to preserve characteristics of items in the embedding space. Furthermore, we propose the dual item embeddings design to represent and propagate collaborative signals and knowledge associations separately, and leverage the gated aggregation to distill discriminative information for better capturing user behavior patterns. Experimental results on three benchmark datasets show that, HAKG achieves significant improvement over the state-of-the-art methods like CKAN, Hyper-Know, and KGIN. Further analyses on the learned hyperbolic embeddings confirm that HAKG can offer meaningful insights into the hierarchies of data.
Alleviating Spurious Correlations in Knowledge-aware Recommendations through Counterfactual Generator
Limited by the statistical-based machine learning framework, a spurious correlation is likely to appear in existing knowledge-aware recommendation methods. It refers to a knowledge fact that appears causal to the user behaviors (inferred by the recommender) but is not in fact. For tackling this issue, we present a novel approach to discovering and alleviating the potential spurious correlations from a counterfactual perspective. To be specific, our approach consists of two counterfactual generators and a recommender. The counterfactual generators are designed to generate counterfactual interactions via reinforcement learning, while the recommender is implemented with two different graph neural networks to aggregate the information from KG and user-item interactions respectively. The counterfactual generators and recommender are integrated in a mutually collaborative way. With this approach, the recommender helps the counterfactual generators better identify potential spurious correlations and generate high-quality counterfactual interactions, while the counterfactual generators help the recommender weaken the influence of the potential spurious correlations simultaneously. Extensive experiments on three real-world datasets have shown the effectiveness of the proposed approach by comparing it with a number of competitive baselines. Our implementation code is available at: https://github.com/RUCAIBox/CGKR.
The ubiquity of implicit feedback makes it the default choice for building modern recommender systems. Generally speaking, observed interactions are considered positive samples, while unobserved interactions are considered negative ones. However, implicit feedback is inherently noisy because of the ubiquitous presence of noisy-positive and noisy-negative interactions. Recently, some studies have noticed the importance of denoising implicit feedback for recommendations, and enhanced the robustness of recommendation models to some extent. Nonetheless, they typically fail to (1) capture the hard yet clean interactions for learning comprehensive user preference, and (2) provide a universal denoising solution that can be applied to various kinds of recommendation models.
In this paper, we thoroughly investigate the memorization effect of recommendation models, and propose a new denoising paradigm, i.e., Self-Guided Denoising Learning (SGDL), which is able to collect memorized interactions at the early stage of the training (i.e., the "noise-resistant" period), and leverage those data as denoising signals to guide the following training (i.e., the "noise-sensitive" period) of the model in a meta-learning manner. Besides, our method can automatically switch its learning phase at the memorization point from memorization to self-guided learning, and select clean and informative memorized data via a novel adaptive denoising scheduler to improve the robustness. We incorporate SGDL with four representative recommendation models (i.e., NeuMF, CDAE, NGCF and LightGCN) and different loss functions (i.e., binary cross-entropy and BPR loss). The experimental results on three benchmark datasets demonstrate the effectiveness of SGDL over the state-of-the-art denoising methods like T-CE, IR, DeCA, and even state-of-the-art robust graph-based methods like SGCN and SGL.
Deployable and Continuable Meta-learning-Based Recommender System with Fast User-Incremental Updates
User cold-start is a major challenge in building personalized recommender systems. Due to the lack of sufficient interactions, it is difficult to effectively model new users. One of the main solutions is to obtain an initial model through meta-learning (mainly gradient-based methods) and adapt it to new users with a few steps of gradient descent. Although these methods have achieved remarkable performance, they are still far from being usable in real-world applications due to their high-demand data processing, heavy computational burden, and inability to perform effective user-incremental updates. In this paper, we propose a deployable and continuable meta-learning-based recommendation (DCMR) approach, which can achieve fast user-incremental updating with task replay and first-order gradient descent. Specifically, we introduce a dual-constrained task sampler, distillation-based loss functions, and an adaptive controller in this framework to balance the trade-off between stability and plasticity in updating. In summary, DCMR can be updated while serving new users; in other words, it learns continuously and rapidly from a sequential user stream and is able to make recommendations at any time. The extensive experiments conducted on three benchmark datasets illustrate the superiority of our model.
Knowledge Graphs (KGs) have been utilized as useful side information to improve recommendation quality. In those recommender systems, knowledge graph information often contains fruitful facts and inherent semantic relatedness among items. However, the success of such methods relies on the high quality knowledge graphs, and may not learn quality representations with two challenges: i) The long-tail distribution of entities results in sparse supervision signals for KG-enhanced item representation; ii) Real-world knowledge graphs are often noisy and contain topic-irrelevant connections between items and entities. Such KG sparsity and noise make the item-entity dependent relations deviate from reflecting their true characteristics, which significantly amplifies the noise effect and hinders the accurate representation of user's preference.
To fill this research gap, we design a general Knowledge Graph Contrastive Learning framework (KGCL) that alleviates the information noise for knowledge graph-enhanced recommender systems. Specifically, we propose a knowledge graph augmentation schema to suppress KG noise in information aggregation, and derive more robust knowledge-aware representations for items. In addition, we exploit additional supervision signals from the KG augmentation process to guide a cross-view contrastive learning paradigm, giving a greater role to unbiased user-item interactions in gradient descent and further suppressing the noise. Extensive experiments on three public datasets demonstrate the consistent superiority of our KGCL over state-of-the-art techniques. KGCL also achieves strong performance in recommendation scenarios with sparse user-item interactions, long-tail and noisy KG entities. Our implementation codes are available at https://github.com/yuh-yang/KGCL-SIGIR22.
SESSION: Topic 19: Search and Ranking
CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos
Current dense retrievers are not robust to out-of-domain and outlier queries, i.e. their effectiveness on these queries is much poorer than what one would expect. In this paper, we consider a specific instance of such queries: queries that contain typos. We show that a small character level perturbation in queries (as caused by typos) highly impacts the effectiveness of dense retrievers. We then demonstrate that the root cause of this resides in the input tokenization strategy employed by BERT. In BERT, tokenization is performed using the BERT's WordPiece tokenizer and we show that a token with a typo will significantly change the token distributions obtained after tokenization. This distribution change translates to changes in the input embeddings passed to the BERT-based query encoder of dense retrievers. We then turn our attention to devising dense retriever methods that are robust to such queries with typos, while still being as performant as previous methods on queries without typos. For this, we use CharacterBERT as the backbone encoder and an efficient yet effective training method, called Self-Teaching (ST), that distills knowledge from queries without typos into the queries with typos. Experimental results show that CharacterBERT in combination with ST achieves significantly higher effectiveness on queries with typos compared to previous methods. Along with these results and the open-sourced implementation of the methods, we also provide a new passage retrieval dataset consisting of real-world queries with typos and associated relevance assessments on the MS MARCO corpus, thus supporting the research community in the investigation of effective and robust dense retrievers. Code, experimental results and dataset are made available at https://github.com/ielab/CharacterBERT-DR.
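The tokenization effect described in the abstract above can be reproduced with a small sketch of WordPiece's greedy longest-match-first procedure. The vocabulary below is a hypothetical toy, not BERT's real vocabulary, and `wordpiece` is an illustrative re-implementation rather than the authors' code.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of one word.

    Non-initial pieces carry the "##" continuation prefix. If no piece
    matches at some position, the whole word maps to the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen only to make the effect visible.
vocab = {"information", "inf", "in", "##r", "##o", "##ma", "##tion"}

clean = wordpiece("information", vocab)   # ['information']
typo = wordpiece("infromation", vocab)    # ['inf', '##r', '##o', '##ma', '##tion']
```

A two-character transposition turns one token into five, so the query encoder receives a very different sequence of input embeddings. This is the distribution change the abstract identifies as the root cause.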
Pre-trained language models such as BERT have been a key ingredient to achieve state-of-the-art results on a variety of tasks in natural language processing and, more recently, also in information retrieval. Recent research even claims that BERT is able to capture factual knowledge about entity relations and properties, the information that is commonly obtained from knowledge graphs. This paper investigates the following question: Do BERT-based entity retrieval models benefit from additional entity information stored in knowledge graphs? To address this research question, we map entity embeddings into the same input space as a pre-trained BERT model and inject these entity embeddings into the BERT model. This entity-enriched language model is then employed on the entity retrieval task. We show that the entity-enriched BERT model improves effectiveness on entity-oriented queries over a regular BERT model, establishing a new state-of-the-art result for the entity retrieval task, with substantial improvements for complex natural language queries and queries requesting a list of entities with a certain property. Additionally, we show that the entity information provided by our entity-enriched model particularly helps queries related to less popular entities. Last, we observe empirically that the entity-enriched BERT models enable fine-tuning on limited training data, which otherwise would not be feasible due to the known instabilities of BERT in few-sample fine-tuning, thereby contributing to data-efficient training of BERT for entity search.
Entity-oriented search systems often learn vector representations of entities via the introductory paragraph from the Wikipedia page of the entity. As such representations are the same for every query, our hypothesis is that the representations are not ideal for IR tasks. In this work, we present BERT Entity Representations (BERT-ER) which are query-specific vector representations of entities obtained from text that describes how an entity is relevant for a query. Using BERT-ER in a downstream entity ranking system, we achieve a performance improvement of 13-42% (Mean Average Precision) over a system that uses the BERT embedding of the introductory paragraph from Wikipedia on two large-scale test collections. Our approach also outperforms entity ranking systems using entity embeddings from Wikipedia2Vec, ERNIE, and E-BERT. We show that our entity ranking system using BERT-ER can increase precision at the top of the ranking by promoting relevant entities to the top. With this work, we release our BERT models and query-specific entity embeddings fine-tuned for the entity ranking task.
Pre-trained language models (PLMs), such as BERT and ERNIE, have achieved outstanding performance in many natural language understanding tasks. Recently, PLM-based information retrieval models, e.g., MORES, PROP and ColBERT, have also been investigated and have shown state-of-the-art effectiveness. However, most PLM-based rankers focus only on a single level of relevance matching (e.g., character-level) while ignoring information at other granularities (e.g., words and phrases), which easily leads to ambiguous query understanding and inaccurate matching in web search.
In this paper, we aim to improve the state-of-the-art PLMs ERNIE for web search, by modeling multi-granularity context information with the awareness of word importance in queries and documents. In particular, we propose a novel H-ERNIE framework, which includes a query-document analysis component and a hierarchical ranking component. The query-document analysis component has several individual modules which generate the necessary variables, such as word segmentation, word importance analysis, and word tightness analysis. Based on these variables, the importance-aware multiple-level correspondences are sent to the ranking model. The hierarchical ranking model includes a multi-layer transformer module to learn the character-level representations, a word-level matching module, and a phrase-level matching module with word importance. Each of these modules models the query and the document matching from a different perspective. Also, these levels are inherently communicated to achieve the overall accurate matching. We discuss the time complexity of the proposed framework, and show that it can be efficiently implemented in real applications. The offline and online experiments on both public data sets and a commercial search engine illustrate the effectiveness of the proposed H-ERNIE framework.
Passage re-ranking aims to obtain a permutation over the candidate passage set from the retrieval stage. Re-rankers have been greatly advanced by Pre-trained Language Models (PLMs) due to their overwhelming advantages in natural language understanding. However, existing PLM-based re-rankers may easily suffer from vocabulary mismatch and a lack of domain-specific knowledge. To alleviate these problems, our work carefully introduces explicit knowledge contained in a knowledge graph. Specifically, we employ an existing knowledge graph, which is incomplete and noisy, and are the first to apply it to the passage re-ranking task. To leverage reliable knowledge, we propose a novel knowledge graph distillation method and obtain a knowledge meta graph as the bridge between query and passage. To align both kinds of embedding in the latent space, we employ a PLM as the text encoder and a graph neural network over the knowledge meta graph as the knowledge encoder. Besides, a novel knowledge injector is designed for the dynamic interaction between the text and knowledge encoders. Experimental results demonstrate the effectiveness of our method, especially on queries requiring in-depth domain knowledge.
Pre-trained language models (PLMs) have achieved great success in the area of Information Retrieval. Studies show that applying these models to ad-hoc document ranking can achieve better retrieval effectiveness. However, on the Web, most information is organized in the form of HTML web pages. In addition to the pure text content, the structure of the content organized by HTML tags is also an important part of the information delivered on a web page. Currently, such structured information is totally ignored by pre-trained models which are trained solely based on text content. In this paper, we propose to leverage large-scale web pages and their DOM (Document Object Model) tree structures to pre-train models for information retrieval. We argue that using the hierarchical structure contained in web pages, we can get richer contextual information for training better language models. To exploit this kind of information, we devise four pre-training objectives based on the structure of web pages, then pre-train a Transformer model towards these tasks jointly with traditional masked language model objective. Experimental results on two authoritative ad-hoc retrieval datasets prove that our model can significantly improve ranking performance compared to existing pre-trained models.
Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings
Vector quantization (VQ) based ANN indexes, such as Inverted File System (IVF) and Product Quantization (PQ), have been widely applied to embedding based document retrieval thanks to their competitive time and memory efficiency. Originally, VQ is learned to minimize the reconstruction loss, i.e., the distortions between the original dense embeddings and the reconstructed embeddings after quantization. Unfortunately, such an objective is inconsistent with the goal of selecting ground-truth documents for the input query, which may cause severe loss of retrieval quality. Recent works identify such a defect, and propose to minimize the retrieval loss through contrastive learning. However, these methods rely heavily on queries with ground-truth documents, so their performance is limited by the insufficiency of labeled data. In this paper, we propose Distill-VQ, which unifies the learning of IVF and PQ within a knowledge distillation framework. In Distill-VQ, the dense embeddings are leveraged as "teachers", which predict the query's relevance to the sampled documents. The VQ modules are treated as the "students", which are learned to reproduce the predicted relevance, such that the reconstructed embeddings may fully preserve the retrieval result of the dense embeddings. By doing so, Distill-VQ is able to derive substantial training signals from the massive unlabeled data, which significantly contributes to the retrieval quality. We perform comprehensive explorations for the optimal conduct of knowledge distillation, which may provide useful insights for the learning of VQ based ANN index. We also experimentally show that the labeled data is no longer a necessity for high-quality vector quantization, which indicates Distill-VQ's strong applicability in practice. The evaluations are performed on MS MARCO and Natural Questions benchmarks, where Distill-VQ notably outperforms the SOTA VQ methods in Recall and MRR. Our code is available at https://github.com/staoxiao/LibVQ.
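The teacher-student idea in the abstract above can be sketched as follows. The abstract does not specify the loss, so the KL divergence between softmax-normalized inner-product scores used here is an assumption, as are the function names and toy vectors. It only illustrates how "reproducing the predicted relevance" can be turned into a training signal that needs no relevance labels.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def distill_loss(query, dense_docs, quantized_docs):
    """KL(teacher || student) for one query over sampled documents.

    Teacher: relevance distribution from the original dense embeddings.
    Student: relevance distribution from the reconstructed (quantized)
    embeddings. KL divergence is an assumed choice of similarity function.
    """
    teacher = softmax([dot(query, d) for d in dense_docs])
    student = softmax([dot(query, d) for d in quantized_docs])
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))


# Hypothetical toy data: one query, two sampled documents.
query = [1.0, 0.0]
dense = [[2.0, 0.0], [0.0, 1.0]]
lossy = [[1.0, 0.0], [0.0, 1.0]]   # quantization distorted the first doc

loss_perfect = distill_loss(query, dense, dense)
loss_lossy = distill_loss(query, dense, lossy)
```

The loss is zero when the quantized embeddings reproduce the teacher's relevance distribution exactly, and positive when quantization changes the relative scores.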
Recently, pre-training methods tailored for IR tasks have achieved great success. However, as the mechanisms behind the performance improvement remain under-investigated, the interpretability and robustness of these pre-trained models still need to be improved. Axiomatic IR aims to identify a set of desirable properties expressed mathematically as formal constraints to guide the design of ranking models. Existing studies have already shown that considering certain axioms may help improve the effectiveness and interpretability of IR models. However, there is still a lack of effort to incorporate these IR axioms into pre-training methodologies. To shed light on this research question, we propose a novel pre-training method with Axiomatic Regularization for ad hoc Search (ARES). In the ARES framework, a number of existing IR axioms are re-organized to generate training samples to be fitted in the pre-training process. These training samples then guide neural rankers to learn the desirable ranking properties. Compared to existing pre-training approaches, ARES is more intuitive and explainable. Experimental results on multiple publicly available benchmark datasets have shown the effectiveness of ARES in both full-resource and low-resource (e.g., zero-shot and few-shot) settings. An intuitive case study also indicates that ARES has learned useful knowledge that existing pre-trained models (e.g., BERT and PROP) fail to possess. This work provides insights into improving the interpretability of pre-trained models and guidance on incorporating IR axioms or human heuristics into pre-training methods.
Multi-scenario learning (MSL) enables a service provider to cater for users' fine-grained demands by separating services for different user sectors, e.g., by user's geographical region. Under each scenario there is a need to optimize multiple task-specific targets e.g., click through rate and conversion rate, known as multi-task learning (MTL). Recent solutions for MSL and MTL are mostly based on the multi-gate mixture-of-experts (MMoE) architecture. MMoE structure is typically static and its design requires domain-specific knowledge, making it less effective in handling both MSL and MTL. In this paper, we propose a novel Automatic Expert Selection framework for Multi-scenario and Multi-task search, named AESM2. AESM2 integrates both MSL and MTL into a unified framework with an automatic structure learning. Specifically, AESM2 stacks multi-task layers over multi-scenario layers. This hierarchical design enables us to flexibly establish intrinsic connections between different scenarios, and at the same time also supports high-level feature extraction for different tasks. At each multi-scenario/multi-task layer, a novel expert selection algorithm is proposed to automatically identify scenario-/task-specific and shared experts for each input. Experiments over two real-world large-scale datasets demonstrate the effectiveness of AESM2 over a battery of strong baselines. Online A/B test also shows substantial performance gain on multiple metrics. Currently, AESM2 has been deployed online for serving major traffic.
SESSION: Topic 20: Sentiment Analysis and Classification
Multimodal sentiment analysis has been studied under the assumption that all modalities are available. However, such a strong assumption does not always hold in practice, and most of multimodal fusion models may fail when partial modalities are missing. Several works have addressed the missing modality problem; but most of them only considered the single modality missing case, and ignored the practically more general cases of multiple modalities missing. To this end, in this paper, we propose a Tag-Assisted Transformer Encoder (TATE) network to handle the problem of missing uncertain modalities. Specifically, we design a tag encoding module to cover both the single modality and multiple modalities missing cases, so as to guide the network's attention to those missing modalities. Besides, we adopt a new space projection pattern to align common vectors. Then, a Transformer encoder-decoder network is utilized to learn the missing modality features. At last, the outputs of the Transformer encoder are used for the final sentiment classification. Extensive experiments are conducted on CMU-MOSI and IEMOCAP datasets, showing that our method can achieve significant improvements compared with several baselines.
Mutual Disentanglement Learning for Joint Fine-Grained Sentiment Classification and Controllable Text Generation
The fine-grained sentiment classification (FGSC) task and the fine-grained controllable text generation (FGSG) task are two representative applications of sentiment analysis, which together form an inverse task pair: the former aims to infer the fine-grained sentiment polarities given a text piece, while the latter generates text content that describes the input fine-grained opinions. Most of the existing work solves the FGSC and the FGSG tasks in isolation, while ignoring the complementary benefits between them. This paper combines FGSC and FGSG as a joint dual learning system, encouraging them to learn the advantages from each other. Based on the dual learning framework, we further propose decoupling the feature representations in the two tasks into fine-grained aspect-oriented opinion variables and content variables respectively, by performing mutual disentanglement learning upon them. We also propose to transform the difficult "data-to-text" generation fashion widely used in FGSG into an easier text-to-text generation fashion by creating surrogate natural language text as the model inputs. Experimental results on 7 sentiment analysis benchmarks including both the document-level and sentence-level datasets show that our method significantly outperforms the current strong-performing baselines on both the FGSC and FGSG tasks. Automatic and human evaluations demonstrate that our FGSG model successfully generates fluent, diverse and rich content conditioned on fine-grained sentiments.
Cross-domain sentiment classification (CDSC) aims to use the transferable semantics learned from the source domain to predict the sentiment of reviews in the unlabeled target domain. Existing studies in this task attach more attention to the sequence modeling of sentences while largely ignoring the rich domain-invariant semantics embedded in graph structures (i.e., the part-of-speech tags and dependency relations). As an important aspect of exploring characteristics of language comprehension, adaptive graph representations have played an essential role in recent years. To this end, in the paper, we aim to explore the possibility of learning invariant semantic features from graph-like structures in CDSC. Specifically, we present Graph Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding method that is able to learn domain-invariant semantics from both word sequences and syntactic graphs. More specifically, we first raise a POS-Transformer module to extract sequential semantic features from the word sequences as well as the part-of-speech tags. Then, we design a Hybrid Graph Attention (HGAT) module to generate syntax-based semantic features by considering the transferable dependency relations. Finally, we devise an Integrated aDaptive Strategy (IDS) to guide the joint learning process of both modules. Extensive experiments on four public datasets indicate that GAST achieves comparable effectiveness to a range of state-of-the-art models.
Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task designed to identify the polarity of a target aspect. Some works introduce various attention mechanisms to fully mine the relevant context words of different aspects, and use the traditional cross-entropy loss to fine-tune the models for the ABSA task. However, the attention mechanism paying partial attention to aspect-unrelated words inevitably introduces irrelevant noise. Moreover, the cross-entropy loss lacks discriminative learning of features, which makes it difficult to exploit the implicit information of intra-class compactness and inter-class separability. To overcome these challenges, we propose an Aspect Feature Distillation and Enhancement Network (AFDEN) for the ABSA task. We first propose a dual-feature extraction module to extract aspect-related and aspect-unrelated features through the attention mechanisms and graph convolutional networks. Then, to eliminate the interference of aspect-unrelated words, we design a novel aspect-feature distillation module containing a gradient reverse layer that learns aspect-unrelated contextual features through adversarial training, and an aspect-specific orthogonal projection layer to further project aspect-related features into the orthogonal space of aspect-unrelated features. Finally, we propose an aspect-feature enhancement module that leverages supervised contrastive learning to capture the implicit information between the same sentiment labels and between different sentiment labels. Experimental results on three public datasets demonstrate that our AFDEN model achieves state-of-the-art performance and verify the effectiveness and robustness of our model.
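The orthogonal projection step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a single aspect-related vector and a single aspect-unrelated vector and removes from the former its component along the latter; the function names are hypothetical, and the learned layers of AFDEN are not reproduced here.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonal_projection(related, unrelated):
    """Project `related` onto the space orthogonal to `unrelated`.

    Assumption: one vector per feature type; the result keeps only the
    part of `related` that has no component along `unrelated`.
    """
    norm_sq = dot(unrelated, unrelated)
    if norm_sq == 0.0:
        # Nothing to remove when the unrelated feature is the zero vector.
        return list(related)
    scale = dot(related, unrelated) / norm_sq
    return [r - scale * u for r, u in zip(related, unrelated)]


purified = orthogonal_projection([3.0, 4.0], [1.0, 0.0])
```

After the projection, the dot product between the purified feature and the aspect-unrelated feature is zero, which is the sense in which the unrelated signal has been removed.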
Recently, the aspect-opinion term pairs (AOTP) extraction task has gained substantial importance in the domain of aspect-based sentiment analysis. It intends to extract the potential pair of each aspect term with its corresponding opinion term present in a user review. Some existing studies heavily relied on the annotated aspect terms and/or opinion terms, or adopted external knowledge/resources to figure out the task. Therefore, in this study, we propose a novel end-to-end solution, called an Interactive AOTP (IAOTP) model, for exploring AOTP. The IAOTP model first tracks the boundary of each token in given aspect-specific and opinion-specific representations through a span-based operation. Next, it generates the candidate AOTP by formulating the dyadic relations between tokens through the Biaffine transformation. Then, it computes the positioning information to capture the significant distance relationship that each candidate pair holds. And finally, it jointly models collaborative interactions and prediction of AOTP through a 2D self-attention. Besides the IAOTP model, this study also proposes an independent aspect/opinion encoding model (a RS model) that formulates relational semantics to obtain aspect-specific and opinion-specific representations that can effectively perform the extraction of aspect and opinion terms. Detailed experiments conducted on the publicly available benchmark datasets for AOTP, aspect terms, and opinion terms extraction tasks, clearly demonstrate the significantly improved performance of our models relative to other competitive state-of-the-art baselines.
SESSION: Topic 21: Sequential Recommendations
A large-scale recommender system usually consists of recall and ranking modules. The goal of ranking modules (aka rankers) is to elaborately discriminate users' preference on item candidates proposed by recall modules. With the success of deep learning techniques in various domains, we have witnessed the mainstream rankers evolve from traditional models to deep neural models. However, the way that we design and use rankers remains unchanged: offline training the model, freezing the parameters, and deploying it for online serving. Actually, the candidate items are determined by specific user requests, in which underlying distributions (e.g., the proportion of items for different categories, the proportion of popular or new items) are highly different from one another in a production environment. The classical parameter-frozen inference manner cannot adapt to dynamic serving circumstances, making rankers' performance compromised.
In this paper, we propose a new training and inference paradigm, termed as Ada-Ranker, to address the challenges of dynamic online serving. Instead of using parameter-frozen models for universal serving, Ada-Ranker can adaptively modulate parameters of a ranker according to the data distribution of the current group of item candidates. We first extract distribution patterns from the item candidates. Then, we modulate the ranker by the patterns to make the ranker adapt to the current data distribution. Finally, we use the revised ranker to score the candidate list. In this way, we empower the ranker with the capacity of adapting from a global model to a local model which better handles the current task. As a first study, we examine our Ada-Ranker paradigm in the sequential recommendation scenario. Experiments on three datasets demonstrate that Ada-Ranker can effectively enhance various base sequential models and also outperform a comprehensive set of competitive baselines.
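The three steps described above (extract a distribution pattern, modulate the ranker, score the candidates) can be sketched as follows. This is only an illustration under strong assumptions: the pattern is the mean of the candidate embeddings, the ranker is a dot product with a user vector, and the modulation is a per-dimension rescaling with a hypothetical `strength` parameter. None of these choices is taken from the paper.

```python
def distribution_pattern(candidates):
    """Assumed pattern extractor: the mean embedding of the candidate list."""
    n = len(candidates)
    dim = len(candidates[0])
    return [sum(c[d] for c in candidates) / n for d in range(dim)]


def modulate(user_vec, pattern, strength=0.5):
    """Hypothetical modulation: rescale each ranker dimension by the pattern."""
    return [u * (1.0 + strength * p) for u, p in zip(user_vec, pattern)]


def adaptive_scores(user_vec, candidates):
    """Score candidates with a ranker adapted to this candidate list."""
    pattern = distribution_pattern(candidates)
    adapted = modulate(user_vec, pattern)
    return [sum(a * c for a, c in zip(adapted, cand)) for cand in candidates]


user = [1.0, 1.0]
scores_a = adaptive_scores(user, [[1.0, 0.0], [1.0, 0.0]])
scores_b = adaptive_scores(user, [[1.0, 0.0], [0.0, 1.0]])
```

The same item `[1.0, 0.0]` receives a different score in the two candidate lists, which is the behaviour a parameter-frozen ranker cannot produce.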
Side information fusion for sequential recommendation (SR) aims to effectively leverage various side information to enhance the performance of next-item prediction. Most state-of-the-art methods build on self-attention networks and focus on exploring various solutions to integrate the item embedding and side information embeddings before the attention layer. However, our analysis shows that the early integration of various types of embeddings limits the expressiveness of attention matrices due to a rank bottleneck and constrains the flexibility of gradients. Also, it involves mixed correlations among the different heterogeneous information resources, which brings extra disturbance to attention calculation. Motivated by this, we propose Decoupled Side Information Fusion for Sequential Recommendation (DIF-SR), which moves the side information from the input to the attention layer and decouples the attention calculation of various side information and item representation. We theoretically and empirically show that the proposed solution allows higher-rank attention matrices and flexible gradients to enhance the modeling capacity of side information fusion. Also, auxiliary attribute predictors are proposed to further activate the beneficial interaction between side information and item representation learning. Extensive experiments on four real-world datasets demonstrate that our proposed solution stably outperforms state-of-the-art SR models. Further studies show that our proposed solution can be readily incorporated into current attention-based SR models and significantly boost performance. Our source code is available at https://github.com/AIM-SE/DIF-SR.
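To make the contrast between early fusion and decoupled attention concrete, the sketch below compares the raw attention scores of the two schemes. It assumes additive fusion, identity query/key projections, one side-information type, and no scaling or softmax; it is a simplified illustration rather than the DIF-SR implementation, and all function names are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fused_scores(items, sides):
    """Early fusion: add the embeddings first, then compute one score matrix."""
    mixed = [[a + b for a, b in zip(e, s)] for e, s in zip(items, sides)]
    return [[dot(q, k) for k in mixed] for q in mixed]


def decoupled_scores(items, sides):
    """Decoupled: score each source on its own and sum the score matrices."""
    n = len(items)
    return [[dot(items[i], items[j]) + dot(sides[i], sides[j])
             for j in range(n)] for i in range(n)]


def cross_terms(items, sides):
    """Item-to-side and side-to-item terms that only early fusion contains."""
    n = len(items)
    return [[dot(items[i], sides[j]) + dot(sides[i], items[j])
             for j in range(n)] for i in range(n)]


items = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
sides = [[0.5, 0.5], [1.0, 0.0], [0.0, 3.0]]
fused = fused_scores(items, sides)
decoupled = decoupled_scores(items, sides)
cross = cross_terms(items, sides)
```

Under these assumptions the fused scores equal the decoupled scores plus the cross terms, so the cross terms are exactly the mixed correlations between heterogeneous sources that the decoupled scheme leaves out.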
For sequential recommenders, the coarse-grained yet sparse sequential signals mined from massive user-item interactions have become the bottleneck to further improving recommendation performance. To alleviate the sparseness problem, exploiting auxiliary semantic features (e.g., textual descriptions, visual images, and knowledge graphs) to enrich contextual information has become a mainstream methodology. Though effective, we argue that these heterogeneous features certainly include much noise, which may overwhelm the valuable sequential signals and therefore easily lead to negative collaboration (i.e., 1 + 1 < 2). How to design a flexible strategy to select proper auxiliary information and alleviate negative collaboration towards better recommendation is still an interesting and open question. Unfortunately, few works have addressed this challenge in sequential recommendation.
In this paper, we introduce a Multi-Agent RL-based Information Selection Model (named MARIS) to explore an effective collaboration between different kinds of auxiliary information and sequential signals in an automatic way. Specifically, MARIS formalizes the auxiliary feature selection as a cooperative Multi-agent Markov Decision Process. For each auxiliary feature type, MARIS uses an agent to determine whether that kind of auxiliary feature should be imported to achieve a positive collaboration. A QMIX network is utilized to coordinate the agents' joint selection actions and produce an episode corresponding to an effective combination of different auxiliary features for the whole historical sequence. Considering the lack of supervised selection signals, we further devise a novel reward-guided sampling strategy that balances exploitation and exploration during episode sampling. By preserving the sampled episodes in a replay buffer, MARIS learns the action-value function and the reward alternately for optimization. Extensive experiments on four real-world datasets demonstrate that our model obtains significant performance improvement over up-to-date state-of-the-art recommendation models.
Sequential recommendation aims at identifying the next item that is preferred by a user based on their behavioral history. Compared to conventional sequential models that leverage attention mechanisms and RNNs, recent efforts mainly follow two directions for improvement: multi-interest learning and graph convolutional aggregation. Specifically, multi-interest methods such as ComiRec and MIMN focus on extracting different interests for a user by performing historical item clustering, while graph convolution methods including TGSRec and SURGE elect to refine user preferences based on multilevel correlations between historical items. Unfortunately, neither of them realizes that these two types of solutions can mutually complement each other, by aggregating multi-level user preference to achieve more precise multi-interest extraction for a better recommendation. To this end, in this paper, we propose a unified multi-grained neural model (named MGNM) via a combination of multi-interest learning and graph convolutional aggregation. Concretely, MGNM first learns the graph structure and information aggregation paths of the historical items for a user. It then performs graph convolution to derive item representations in an iterative fashion, in which the complex preferences at different levels can be well captured. Afterwards, a novel sequential capsule network is proposed to inject the sequential patterns into the multi-interest extraction process, leading to a more precise interest learning in a multi-grained manner. Experiments on three real-world datasets from different scenarios demonstrate the superiority of MGNM against several state-of-the-art baselines. The performance gain over the best baseline is up to 27.10% and 25.17% in terms of NDCG@5 and HIT@5 respectively, which is one of the largest gains in recent development of sequential recommendation. Further analysis also demonstrates that MGNM is robust and effective at user preference understanding at multi-grained levels.
In most real-world recommender systems, users interact with items in a sequential and multi-behavioral manner. Exploring the fine-grained relationship of items behind the users' multi-behavior interactions is critical in improving the performance of recommender systems. Despite the great successes, existing methods seem to have limitations on modelling heterogeneous item-level multi-behavior dependencies, capturing diverse multi-behavior sequential dynamics, or alleviating data sparsity problems. In this paper, we show it is possible to derive a framework to address all the above three limitations. The proposed framework MB-STR, a Multi-Behavior Sequential Transformer Recommender, is equipped with the multi-behavior transformer layer (MB-Trans), the multi-behavior sequential pattern generator (MB-SPG) and the behavior-aware prediction module (BA-Pred). Compared with a typical transformer, we design MB-Trans to capture multi-behavior heterogeneous dependencies as well as behavior-specific semantics, propose MB-SPG to encode the diverse sequential patterns among multiple behaviors, and incorporate BA-Pred to better leverage multi-behavior supervision. Comprehensive experiments on three real-world datasets show the effectiveness of MB-STR by significantly boosting the recommendation performance compared with various competitive baselines. Further ablation studies demonstrate the superiority of different modules of MB-STR.
Sequential recommendation is a popular task in academic research and close to real-world application scenarios, where the goal is to predict the next action(s) of the user based on his/her previous sequence of actions. In the training process of recommender systems, the loss function plays an essential role in guiding the optimization of recommendation models to generate accurate suggestions for users. However, most existing sequential recommendation techniques focus on designing algorithms or neural network architectures, and few efforts have been made to tailor loss functions that fit naturally into the practical application scenario of sequential recommender systems.
Ranking-based losses, such as cross-entropy and Bayesian Personalized Ranking (BPR), are widely used in the sequential recommendation area. We argue that such objective functions suffer from two inherent drawbacks: i) the dependencies among elements of a sequence are overlooked in these loss formulations; ii) instead of balancing accuracy (quality) and diversity, generating accurate results alone has been overemphasized. We therefore propose two new loss functions based on the Determinantal Point Process (DPP) likelihood that can be adaptively applied to estimate the subsequent item or items. The DPP-distributed item set captures natural dependencies among temporal actions, and a quality vs. diversity decomposition of the DPP kernel pushes us to go beyond accuracy-oriented loss functions. Experimental results using the proposed loss functions on three real-world datasets show marked improvements over state-of-the-art sequential recommendation methods in both quality and diversity metrics.
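As a rough illustration of the DPP likelihood and its quality vs. diversity decomposition, the sketch below uses the standard L-ensemble form: the kernel is L[i][j] = q_i * S[i][j] * q_j, and the log-likelihood of a subset Y is log det(L_Y) - log det(L + I). This is the textbook formulation, not the two loss functions proposed in the paper; the quality scores and similarity matrix are made-up toy values.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
    return result


def dpp_kernel(quality, similarity):
    """Quality vs. diversity decomposition: L[i][j] = q_i * S[i][j] * q_j."""
    n = len(quality)
    return [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
            for i in range(n)]


def dpp_log_likelihood(kernel, subset):
    """log P(Y) = log det(L_Y) - log det(L + I); needs det(L_Y) > 0."""
    n = len(kernel)
    sub = [[kernel[i][j] for j in subset] for i in subset]
    plus_identity = [[kernel[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                     for i in range(n)]
    return math.log(det(sub)) - math.log(det(plus_identity))


# Toy data: items 0 and 1 are near-duplicates, item 2 is different from both.
similarity = [[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
kernel = dpp_kernel([1.0, 1.0, 1.0], similarity)
diverse = dpp_log_likelihood(kernel, [0, 2])
redundant = dpp_log_likelihood(kernel, [0, 1])
```

With equal quality scores, the diverse pair (items 0 and 2) gets a higher log-likelihood than the redundant pair (items 0 and 1), so a loss defined as the negative log-likelihood rewards diversity as well as quality.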
SESSION: Topic 22: Session-based and Group Recommendation
As a step beyond traditional personalized recommendation, group recommendation is the task of suggesting items that can satisfy a group of users. In group recommendation, the core is to design preference aggregation functions to obtain a quality summary of all group members' preferences. Such user and group preferences are commonly represented as points in the vector space (i.e., embeddings), where multiple user embeddings are compressed into one to facilitate ranking for group-item pairs. However, the resulting group representations, as points, lack adequate flexibility and capacity to account for the multi-faceted user preferences. Also, the point embedding-based preference aggregation is a less faithful reflection of a group's decision-making process, where all users have to agree on a certain value in each embedding dimension instead of a negotiable interval. In this paper, we propose a novel representation of groups via the notion of hypercubes, which are subspaces containing innumerable points in the vector space. Specifically, we design the hypercube recommender (CubeRec) to adaptively learn group hypercubes from user embeddings with minimal information loss during preference aggregation, and to leverage a revamped distance metric to measure the affinity between group hypercubes and item points. Moreover, to counteract the long-standing issue of data sparsity in group recommendation, we make full use of the geometric expressiveness of hypercubes and innovatively incorporate self-supervision by intersecting two groups. Experiments on four real-world datasets have validated the superiority of CubeRec over state-of-the-art baselines.
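The geometric idea of representing a group as a hypercube rather than a point can be sketched with axis-aligned boxes. The sketch assumes the box is simply the per-dimension bounding interval of the member embeddings and that affinity is the L1 distance from an item point to the box; CubeRec learns its hypercubes adaptively and uses its own distance metric, so the functions below are hypothetical stand-ins.

```python
def group_hypercube(member_embeddings):
    """Assumed aggregation: the bounding interval of members per dimension."""
    dims = range(len(member_embeddings[0]))
    lower = [min(e[d] for e in member_embeddings) for d in dims]
    upper = [max(e[d] for e in member_embeddings) for d in dims]
    return lower, upper


def outside_distance(point, cube):
    """L1 distance from an item point to the box; 0 if the point is inside."""
    lower, upper = cube
    return sum(max(lo - x, 0.0) + max(x - hi, 0.0)
               for x, lo, hi in zip(point, lower, upper))


def intersect(cube_a, cube_b):
    """Overlap of two group boxes, or None when they do not overlap."""
    lower = [max(a, b) for a, b in zip(cube_a[0], cube_b[0])]
    upper = [min(a, b) for a, b in zip(cube_a[1], cube_b[1])]
    if any(lo > hi for lo, hi in zip(lower, upper)):
        return None
    return lower, upper


cube = group_hypercube([[0.0, 0.0], [2.0, 1.0]])
inside = outside_distance([1.0, 0.5], cube)
outside = outside_distance([3.0, 2.0], cube)
overlap = intersect(cube, ([1.0, 0.5], [3.0, 3.0]))
```

An item inside the box satisfies every member's interval in each dimension, which matches the "negotiable interval" intuition in the abstract, and the intersection of two boxes is one simple way to picture the shared preferences of two groups.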
Session-based recommendation (SBR) aims to predict a user's next clicked item based on an anonymous yet short interaction sequence. Previous SBR models, which rely only on the limited short-term transition information without utilizing extra valuable knowledge, have suffered a lot from the problem of data sparsity. This paper proposes a novel mirror graph enhanced neural model for session-based recommendation (MGS), to exploit item attribute information over item embeddings for more accurate preference estimation.
Specifically, MGS utilizes two kinds of graphs to learn item representations. One is a session graph generated from the user interaction sequence describing users' preference based on transition patterns. Another is a mirror graph built by an attribute-aware module that selects the most attribute-representative information for each session item by integrating items' attribute information. We applied an iterative dual refinement mechanism to propagate information between the session and mirror graphs. To further guide the training process of the attribute-aware module, we also introduce a contrastive learning strategy that compares two mirror graphs generated for the same session by randomly sampling the attribute-same neighbors. Experiments on three real-world datasets exhibit that the performance of MGS surpasses many state-of-the-art models.
Session-based recommendation aims to predict items that an anonymous user would like to purchase based on her short behavior sequence. The current approaches towards session-based recommendation only focus on modeling users' interest preferences, while they all ignore a key attribute of an item, i.e., the price. Many marketing studies have shown that the price factor significantly influences users' behaviors and the purchase decisions of users are determined by both price and interest preferences simultaneously. However, it is nontrivial to incorporate price preferences for session-based recommendation. Firstly, it is hard to handle heterogeneous information from various features of items to capture users' price preferences. Secondly, it is difficult to model the complex relations between price and interest preferences in determining user choices.
To address the above challenges, we propose a novel method Co-guided Heterogeneous Hypergraph Network (CoHHN) for session-based recommendation. Towards the first challenge, we devise a heterogeneous hypergraph to represent heterogeneous information and rich relations among them. A dual-channel aggregating mechanism is then designed to aggregate various information in the heterogeneous hypergraph. After that, we extract users' price preferences and interest preferences via attention layers. As to the second challenge, a co-guided learning scheme is designed to model the relations between price and interest preferences and enhance the learning of each other. Finally, we predict user actions based on item features and users' price and interest preferences. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed CoHHN. Further analysis reveals the significance of price for session-based recommendation.
Session-based recommendation aims to predict the next click action (e.g., item) of anonymous users based on a fixed number of previous actions. Recently, Graph Neural Networks (GNNs) have shown superior performance in various applications. Inspired by the success of GNNs, tremendous endeavors have been devoted to introducing GNNs into session-based recommendation and have achieved significant results. Nevertheless, due to the highly diverse types of potential information in sessions, existing GNN-based methods perform differently on different session datasets, leading to the need for efficient design of neural networks adapted to various session recommendation scenarios. To address this problem, we propose Automated neural architecture search for Graph-based Session Recommendation, namely AutoGSR, a framework that provides a practical and general solution to automatically find the optimal GNN-based session recommendation model. In AutoGSR, we propose two novel GNN operations to build an expressive and compact search space. Building upon the search space, we employ a differentiable search algorithm to search for the optimal graph neural architecture. Furthermore, to consider all types of session information together, we propose to learn the item meta knowledge, which acts as a priori knowledge for guiding the optimization of final session representations. Comprehensive experiments on three real-world datasets demonstrate that AutoGSR is able to find effective neural architectures and achieve state-of-the-art results. To the best of our knowledge, we are the first to study neural architecture search for session-based recommendation.
As an emerging paradigm, session-based recommendation is aimed at recommending the next item based on a set of anonymous sessions. Effectively representing a session that is normally a short interaction sequence renders a major technical challenge. In view of the limitations of pioneering studies that explore collaborative information from other sessions, in this paper we propose a new direction to enhance session representations by learning multi-faceted session-independent global item relations. In particular, we identify three types of advantageous global item relations, including negative relations that have not been studied before, and propose different graph construction methods to capture such relations. We then devise a novel multi-faceted global item relation (MGIR) model to encode different relations using different aggregation layers and generate enhanced session representations by fusing positive and negative relations. Our solution is flexible to accommodate new item relations and can easily integrate existing session representation learning methods to generate better representations from global relation enhanced session information. Extensive experiments on three benchmark datasets demonstrate the superiority of our model over a large number of state-of-the-art methods. Specifically, we show that learning negative relations is critical for session-based recommendation.
SESSION: Topic 23: Social Aspects
Social media enable users to share their feelings and emotional struggles. They also offer an opportunity to provide community support to suicidal users. Recent studies on suicide risk assessment have explored the user's historic timeline and information from their social network to analyze their emotional state. However, such methods often require a large amount of user-centric data. A less intrusive alternative is to only use conversation trees arising from online community responses. Modeling such online conversations between the community and a person in distress provides important context for understanding that person's mental state. However, it is not trivial to model the vast number of conversation trees on social media, since each comment has a diverse influence on a user in distress. Typically, a handful of comments/posts receive a significantly high number of replies, which results in scale-free dynamics in the conversation tree. Moreover, psychological studies suggest that it is important to capture the fine-grained temporal irregularities in the release of vast volumes of comments, since suicidal users react quickly to online community support. Motivated by these limitations and psychological findings, we propose HCN, a Hyperbolic Conversation Network, which is a less user-intrusive method for suicide ideation detection. HCN leverages the hyperbolic space to represent the scale-free dynamics of online conversations. Through extensive quantitative, qualitative, and ablative experiments on real-world Twitter data, we find that HCN outperforms state-of-the-art methods, while using 98% less user-specific data, and while maintaining a 74% lower carbon footprint and a 94% smaller model size. We also find that the comments within the first half an hour are most important to identify at-risk users.
Unsupervised Belief Representation Learning with Information-Theoretic Variational Graph Auto-Encoders
This paper develops a novel unsupervised algorithm for belief representation learning in polarized networks that (i) uncovers the latent dimensions of the underlying belief space and (ii) jointly embeds users and content items (that they interact with) into that space in a manner that facilitates a number of downstream tasks, such as stance detection, stance prediction, and ideology mapping. Inspired by total correlation in information theory, we propose the Information-Theoretic Variational Graph Auto-Encoder (InfoVGAE) that learns to project both users and content items (e.g., posts that represent user views) into an appropriate disentangled latent space. To better disentangle latent variables in that space, we develop a total correlation regularization module, a Proportional-Integral (PI) control module, and adopt rectified Gaussian distribution to ensure the orthogonality. The latent representation of users and content can then be used to quantify their ideological leaning and detect/predict their stances on issues. We evaluate the performance of the proposed InfoVGAE on three real-world datasets, of which two are collected from Twitter and one from U.S. Congress voting records. The evaluation results show that our model outperforms state-of-the-art unsupervised models by reducing 10.5% user clustering errors and achieving 12.1% higher F1 scores for stance separation of content items. In addition, InfoVGAE produces a comparable result with supervised models. We also discuss its performance on stance prediction and user ranking within ideological groups.
A Multitask Framework for Sentiment, Emotion and Sarcasm aware Cyberbullying Detection from Multi-modal Code-Mixed Memes
Detecting cyberbullying from memes is highly challenging, because of the presence of the implicit affective content which is also often sarcastic, and multi-modality (image + text). The current work is the first attempt, to the best of our knowledge, in investigating the role of sentiment, emotion and sarcasm in identifying cyberbullying from multi-modal memes in a code-mixed language setting. As a contribution, we have created a benchmark multi-modal meme dataset called MultiBully annotated with bully, sentiment, emotion and sarcasm labels collected from open-source Twitter and Reddit platforms. Moreover, the severity of the cyberbullying posts is also investigated by adding a harmfulness score to each of the memes. The created dataset consists of two modalities, text and image. Most of the texts in our dataset are in code-mixed form, which captures the seamless transitions between languages for multilingual users. Two different multimodal multitask frameworks (BERT+ResNET-Feedback and CLIP-CentralNet) have been proposed for cyberbullying detection (CD), the three auxiliary tasks being sentiment analysis (SA), emotion recognition (ER) and sarcasm detection (SAR). Experimental results indicate that compared to uni-modal and single-task variants, the proposed frameworks improve the performance of the main task, i.e., CD, by 3.18% and 3.10% in terms of accuracy and F1 score, respectively.
Increased social media use has contributed to the greater prevalence of abusive, rude, and offensive textual comments. Machine learning models have been developed to detect toxic comments online, yet these models tend to show biases against users with marginalized or minority identities (e.g., females and African Americans). Established research in debiasing toxicity classifiers often (1) takes a static or batch approach, assuming that all information is available and then making a one-time decision; and (2) uses a generic strategy to mitigate different biases (e.g., gender and racial biases) that assumes the biases are independent of one another. However, in real scenarios, the input typically arrives as a sequence of comments/words over time instead of all at once. Thus, decisions based on partial information must be made while additional input is arriving. Moreover, social bias is complex by nature. Each type of bias is defined within its unique context, which, consistent with intersectionality theory within the social sciences, might be correlated with the contexts of other forms of bias. In this work, we consider debiasing toxicity detection as a sequential decision-making process where different biases can be interdependent. In particular, we study debiasing toxicity detection with two aims: (1) to examine whether different biases tend to correlate with each other; and (2) to investigate how to jointly mitigate these correlated biases in an interactive manner to minimize the total amount of bias. At the core of our approach is a framework built upon theories of sequential Markov Decision Processes that seeks to maximize the prediction accuracy and minimize the bias measures tailored to individual biases. Evaluations on two benchmark datasets empirically validate the hypothesis that biases tend to be correlated and corroborate the effectiveness of the proposed sequential debiasing strategy.
A Weakly Supervised Propagation Model for Rumor Verification and Stance Detection with Multiple Instance Learning
The diffusion of rumors on social media generally follows a propagation tree structure, which provides valuable clues on how an original message is transmitted and responded by users over time. Recent studies reveal that rumor verification and stance detection are two relevant tasks that can jointly enhance each other despite their differences. For example, rumors can be debunked by cross-checking the stances conveyed by their relevant posts, and stances are also conditioned on the nature of the rumor. However, stance detection typically requires a large training set of labeled stances at post level, which are rare and costly to annotate. Enlightened by Multiple Instance Learning (MIL) scheme, we propose a novel weakly supervised joint learning framework for rumor verification and stance detection which only requires bag-level class labels concerning the rumor's veracity. Specifically, based on the propagation trees of source posts, we convert the two multi-class problems into multiple MIL-based binary classification problems where each binary model is focused on differentiating a target class (of rumor or stance) from the remaining classes. Then, we propose a hierarchical attention mechanism to aggregate the binary predictions, including (1) a bottom-up/top-down tree attention layer to aggregate binary stances into binary veracity; and (2) a discriminative attention layer to aggregate the binary class into finer-grained classes. Extensive experiments conducted on three Twitter-based datasets demonstrate promising performance of our model on both claim-level rumor detection and post-level stance classification compared with state-of-the-art methods.
SESSION: Short Research Papers
Many recent Natural Language Processing (NLP) task formulations, such as question answering and fact verification, are implemented as a two-stage cascading architecture. In the first stage an IR system retrieves "relevant'' documents containing the knowledge, and in the second stage an NLP system performs reasoning to solve the task. Optimizing the IR system for retrieving relevant documents ensures that the NLP system has sufficient information to operate over. These recent NLP task formulations raise interesting and exciting challenges for IR, where the end-user of an IR system is not a human with an information need, but another system exploiting the documents retrieved by the IR system to perform reasoning and address the user information need. Among these challenges, as we will show, is that noise from the IR system, such as retrieving spurious or irrelevant documents, can negatively impact the accuracy of the downstream reasoning module. Hence, there is the need to balance maximizing relevance while minimizing noise in the IR system. This paper presents experimental results on two NLP tasks implemented as a two-stage cascading architecture. We show how spurious or irrelevant retrieved results from the first stage can induce errors in the second stage. We use these results to ground our discussion of the research challenges that the IR community should address in the context of these knowledge-intensive NLP tasks.
Cross-lingual information retrieval (CLIR) aims to provide access to information across languages. Recent pre-trained multilingual language models have brought large improvements to natural language tasks, including cross-lingual ad hoc retrieval. However, pseudo-relevance feedback (PRF), a family of techniques for improving ranking using the contents of top initially retrieved items, has not been explored with neural CLIR retrieval models. Two of the challenges are incorporating feedback from long documents, and cross-language knowledge transfer. To address these challenges, we propose a novel neural CLIR architecture, NCLPRF, capable of incorporating PRF feedback from multiple potentially long documents, which enables improvements to query representation in the shared semantic space between query and document languages. The additional information that the feedback documents provide in a target language can enrich the query representation, bringing it closer to relevant documents in the embedding space. Evaluated on three CLIR test collections in Chinese, Russian, and Persian, the proposed model exhibits significant improvements over traditional and SOTA neural CLIR baselines across all three collections.
Session-based Recommendation (SBR) refers to the task of predicting the next item based on short-term user behaviors within an anonymous session. However, session embedding learned by a non-linear encoder is usually not in the same representation space as item embeddings, resulting in the inconsistent prediction issue while recommending items. To address this issue, we propose a simple and effective framework named CORE, which can unify the representation space for both the encoding and decoding processes. Firstly, we design a representation-consistent encoder that takes the linear combination of input item embeddings as session embedding, guaranteeing that sessions and items are in the same representation space. Besides, we propose a robust distance measuring method to prevent overfitting of embeddings in the consistent representation space. Extensive experiments conducted on five public real-world datasets demonstrate the effectiveness and efficiency of the proposed method. The code is available at: https://github.com/RUCAIBox/CORE.
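The representation-consistent idea described above can be illustrated with a minimal sketch: the session embedding is a weighted sum of the embeddings of the items in the session, so it lies in the same space as the item embeddings and can be scored against them directly. The uniform weights, the use of cosine similarity as the distance, and the names `session_embedding`, `cosine`, and `rank_items` are assumptions made for illustration; the abstract does not specify how CORE computes the weights or its robust distance.

```python
import math

def session_embedding(item_embs, weights):
    """Weighted sum of item embeddings; the result stays in the item space."""
    dim = len(item_embs[0])
    return [sum(w * e[d] for w, e in zip(weights, item_embs)) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity, used here as an assumed stand-in for the distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_items(session_items, weights, catalog):
    """Rank every catalog item by similarity to the session embedding."""
    s = session_embedding([catalog[i] for i in session_items], weights)
    return sorted(catalog, key=lambda i: cosine(s, catalog[i]), reverse=True)

# Toy catalog with hypothetical 2-dimensional item embeddings.
catalog = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_items(["a", "b"], [0.5, 0.5], catalog)
```

In this toy case the session made of items "a" and "b" points in the same direction as item "c", so "c" is ranked first.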
Learning Disentangled Representations for Counterfactual Regression via Mutual Information Minimization
Learning individual-level treatment effect is a fundamental problem in causal inference and has received increasing attention in many areas, especially in the user growth area which concerns many internet companies. Recently, disentangled representation learning methods that decompose covariates into three latent factors, including instrumental, confounding and adjustment factors, have witnessed great success in treatment effect estimation. However, it remains an open problem how to learn the underlying disentangled factors precisely. Specifically, previous methods fail to obtain independent disentangled factors, which is a necessary condition for identifying treatment effect. In this paper, we propose Disentangled Representations for Counterfactual Regression via Mutual Information Minimization (MIM-DRCFR), which uses a multi-task learning framework to share information when learning the latent factors and incorporates MI minimization learning criteria to ensure the independence of these factors. Extensive experiments including public benchmarks and real-world industrial user growth datasets demonstrate that our method performs much better than state-of-the-art methods.
Recently, micro-videos have become more popular on social media platforms such as TikTok and Instagram. Engagements on these platforms are facilitated by multi-modal recommendation systems. Indeed, such multimedia content can involve diverse modalities, often represented as visual, acoustic, and textual features to the recommender model. Existing works in micro-video recommendation tend to unify the multi-modal channels, thereby treating each modality with equal importance. However, we argue that these approaches are not sufficient to encode item representations with multiple modalities, since the methods used cannot fully disentangle the users' tastes on different modalities. To tackle this problem, we propose a novel learning method named Multi-Modal Graph Contrastive Learning (MMGCL), which aims to explicitly enhance multi-modal representation learning in a self-supervised learning manner. In particular, we devise two augmentation techniques to generate the multiple views of a user/item: modality edge dropout and modality masking. Furthermore, we introduce a novel negative sampling technique that allows the model to learn the correlation between modalities and ensures the effective contribution of each modality. Extensive experiments conducted on two micro-video datasets demonstrate the superiority of our proposed MMGCL method over existing state-of-the-art approaches in terms of both recommendation performance and training convergence speed.
RESETBERT4Rec: A Pre-training Model Integrating Time And User Historical Behavior for Sequential Recommendation
Sequential recommendation methods are very important in modern recommender systems because they can capture users' dynamic interests from their interaction history and make accurate recommendations for users, thereby helping enterprises succeed in business. However, despite the great success of existing sequential recommendation-based methods, they focus too much on item-level modeling of users' click history and lack information about the user's entire click history (such as click order, click time, etc.). To tackle this problem, inspired by recent advances in pre-training techniques in the field of natural language processing, we build a new pre-training task based on the original BERT pre-training framework and incorporate temporal information. Specifically, we propose a new model called the REarrange Sequence prE-training and Time embedding model via BERT for sequential Recommendation (RESETBERT4Rec). (This work was completed during a JD internship.) It further captures the information of the user's whole click history by adding a rearrange sequence prediction task to the original BERT pre-training framework, while it integrates different views of time information. Comprehensive experiments on two public datasets as well as one e-commerce dataset demonstrate that RESETBERT4Rec achieves state-of-the-art performance over existing baselines.
Sequential recommender systems (SRSs) have become a research hotspot recently due to their powerful ability in capturing users' dynamic preferences. The key idea behind SRSs is to model the sequential dependencies over the user-item interactions. However, we argue that users' preferences are not only determined by the items they view or purchase but also affected by the item-providers with which users have interacted. For instance, in a short-video scenario, a user may click on a video because he/she is attracted to either the video content or simply the video-providers, as the vloggers are his/her idols. Motivated by the above observations, in this paper, we propose IPSRec, a novel Item-Provider co-learning framework for Sequential Recommendation. Specifically, we propose two representation learning methods (single-stream and cross-stream) to learn comprehensive item and user representations based on the user's historical item sequence and provider sequence. Then, contrastive learning is employed to further enhance the user embeddings in a self-supervised manner, which treats the representations of a specific user learned from the item side as well as the item-provider side as the positive pair and treats the representations of different users in the batch as the negative samples. Extensive experiments on three real-world SRS datasets demonstrate that IPSRec achieves substantially better results than the strong competitors. For reproducibility, our code and data are available at https://github.com/siat-nlp/IPSRec.
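The contrastive step described above (the two views of the same user as the positive pair, other users in the batch as negatives) can be sketched with a generic in-batch InfoNCE loss. This is an illustration only: the function name `info_nce`, the dot-product similarity, and the temperature value are assumptions, and the abstract does not state the exact loss IPSRec uses.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(item_side, provider_side, temperature=0.5):
    """In-batch contrastive loss: row i of each view belongs to user i.

    The positive for user i's item-side embedding is the same user's
    provider-side embedding; the other users in the batch act as negatives.
    """
    losses = []
    for i, anchor in enumerate(item_side):
        logits = [dot(anchor, other) / temperature for other in provider_side]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Hypothetical 2-user batch: aligned views should give a lower loss.
item_view = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(item_view, [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce(item_view, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when each user's item-side and provider-side embeddings agree than when they are swapped between users, which is the behavior the positive/negative pairing is meant to encourage.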
Recommender Systems (RS), as an efficient tool to discover items of interest to users from a very large corpus, have attracted more and more attention from academia and industry. As the initial stage of RS, large-scale matching is fundamental yet challenging. A typical recipe is to learn user and item representations with a two-tower architecture and then calculate the similarity score between both representation vectors, which however still struggles with how to properly deal with negative samples. In this paper, we find that the common practice of randomly sampling negative samples from the entire space and treating them equally is not an optimal choice, since the negative samples from different sub-spaces at different stages have different importance to a matching model. To address this issue, we propose a novel method named Unbiased Model-Agnostic Matching Approach (UMA²). It consists of two basic modules: 1) a General Matching Model (GMM), which is model-agnostic and can be implemented as any embedding-based two-tower model; and 2) a Negative Samples Debias Network (NSDN), which discriminates negative samples by borrowing the idea of Inverse Propensity Weighting (IPW) and re-weighs the loss in GMM. UMA² seamlessly integrates these two modules in an end-to-end multi-task learning framework. Extensive experiments on both a real-world offline dataset and an online A/B test demonstrate its superiority over state-of-the-art methods.
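The idea of re-weighting negatives by inverse propensity can be sketched as follows. This is a generic IPW-weighted binary cross-entropy, not the NSDN module itself: the function name `ipw_weighted_loss`, the clipping threshold, and the choice to leave positives unweighted are assumptions for illustration, and the propensities are taken as given rather than learned.

```python
import math

def ipw_weighted_loss(scores, labels, propensities, clip=0.05):
    """Binary cross-entropy where each negative is weighted by 1 / propensity.

    scores:       raw matching scores (logits) for user-item pairs
    labels:       1 for a positive pair, 0 for a sampled negative
    propensities: assumed probability that each negative was sampled
    clip:         lower bound on propensities to keep weights bounded
    """
    total = 0.0
    for s, y, p in zip(scores, labels, propensities):
        prob = 1.0 / (1.0 + math.exp(-s))
        bce = -(y * math.log(prob) + (1 - y) * math.log(1.0 - prob))
        weight = 1.0 if y == 1 else 1.0 / max(p, clip)
        total += weight * bce
    return total / len(scores)

# A rarely sampled negative (propensity 0.1) counts more than a common one (0.5).
rare = ipw_weighted_loss([0.0], [0], [0.1])
common = ipw_weighted_loss([0.0], [0], [0.5])
```

Clipping the propensity is a common safeguard against very large weights; whether the original method uses it is not stated in the abstract.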
Existing methods usually identify causal relations between events at the mention level, taking each event mention pair as a separate input. As a result, they either suffer from conflicts among causal relations predicted separately or require a set of additional constraints to resolve such conflicts. We propose to study this task in a more realistic setting, where event-level causality identification can be made. The advantages are twofold: 1) by modeling different mentions of an event as a single unit, conflicts among predicted results no longer arise, and no extra constraints are needed; 2) by using diverse knowledge sources (e.g., co-occurrence and coreference relations), a rich graph-based event structure can be induced from the document to support event-level causal inference. A graph convolutional network is used to encode such structural information, capturing the local and non-local dependencies among nodes. Results show that our model achieves the best performance under both mention- and event-level settings, outperforming a number of strong baselines by at least 2.8% on F1 score.
A data lake is a repository for massive raw and heterogeneous data, which includes multiple data models with different data schemas and query interfaces. Keyword search can extract valuable information for users without the knowledge of underlying schemas and query languages. However, conventional keyword searches are restricted to a certain data model and cannot easily adapt to a data lake. In this paper, we study a novel keyword search. To achieve high accuracy and efficiency, we introduce canonical graphs and then integrate semantically related vertices based on vertex representations. A matching entity based keyword search algorithm is presented to find answers across multiple data sources. Finally, extensive experimental study shows the effectiveness and efficiency of our solution.
Engaging all content providers, including newcomers or minority demographic groups, is crucial for online platforms to keep growing and working. Hence, while building recommendation services, the interests of those providers should be valued. In this paper, we consider providers as grouped based on a common characteristic in settings in which certain provider groups have low representation of items in the catalog and, thus, in the user interactions. Then, we envision a scenario wherein platform owners seek to control the degree of exposure to such groups in the recommendation process. To support this scenario, we rely on disparate exposure measures that characterize the gap between the share of recommendations given to groups and the target level of exposure pursued by the platform owners. We then propose a re-ranking procedure that ensures desired levels of exposure are met. Experiments show that, while supporting certain groups of providers by rendering them with the target exposure, beyond-accuracy objectives experience significant gains with negligible impact in recommendation utility.
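The exposure gap described above can be illustrated with a minimal sketch that compares a group's share of recommended slots with a target share. The function names, the toy data, and the unweighted (position-independent) notion of exposure are assumptions for illustration; the paper's measures may discount exposure by rank position.

```python
def exposure_share(ranked_lists, provider_group, group):
    """Fraction of all recommended slots occupied by items of `group`."""
    total = sum(len(items) for items in ranked_lists)
    hits = sum(
        1 for items in ranked_lists for item in items
        if provider_group[item] == group
    )
    return hits / total if total else 0.0

def disparate_exposure(ranked_lists, provider_group, group, target):
    """Gap between the target share and the share actually given to `group`.

    Positive values mean the group is under-exposed relative to the target.
    """
    return target - exposure_share(ranked_lists, provider_group, group)

# Hypothetical data: two users' top-2 lists and a provider group per item.
provider_group = {"i1": "majority", "i2": "minority", "i3": "majority"}
lists = [["i1", "i2"], ["i1", "i3"]]
gap = disparate_exposure(lists, provider_group, "minority", target=0.5)
```

Here the minority group holds one of four slots, so with a target share of 0.5 the gap is 0.25; a re-ranking procedure would aim to drive this gap toward zero.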
Brain-inspired hyperdimensional computing (HDC) has been introduced as an alternative computing paradigm to achieve efficient and robust learning. HDC simulates cognitive tasks by mapping all data points to patterns of neural activity in a high-dimensional space, and has demonstrated promising performance in a wide range of applications such as robotics, biomedical signal processing, and genome sequencing. Language tasks, generally solved using machine learning methods, are widely deployed on low-power embedded devices. However, existing HDC solutions suffer from major challenges that impede deployment on low-power embedded devices: the storage and computation overhead of HDC models grows dramatically with (i) the number of dimensions and (ii) the complex similarity metric used during inference.
In this paper, we propose a novel ensemble framework for the language task, termed L3E-HD, which enables efficient HDC on low-power edge devices. L3E-HD accelerates inference by mapping data points to a high-dimensional binary space to simplify similarity search, the dominant costly and frequent operation in HDC. By marrying HDC with the ensemble technique, L3E-HD also addresses the severe accuracy degradation induced by compressing the dimension and precision of the model. Our experiments show that the ensemble technique is a natural fit for boosting HDC. We find that L3E-HD, which is faster, more efficient, and more accurate than conventional machine learning methods, can even surpass the accuracy of the full-precision model at a smaller model size. Code is released at: https://github.com/MXHX7199/SIGIR22-EnsembleHDC.
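To illustrate why a binary space simplifies similarity search, the sketch below stores each hypervector as a bit-packed Python integer, so that similarity reduces to an XOR followed by a bit count (Hamming distance). Combining several small models by majority vote is an assumption made for illustration; the abstract only says that an ensemble technique is used, and the names `hamming`, `classify`, and `ensemble_classify` as well as the toy 8-bit vectors are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Hamming distance between two bit-packed binary hypervectors."""
    return bin(a ^ b).count("1")

def classify(query, class_vectors):
    """Return the label whose class hypervector is nearest to the query."""
    return min(class_vectors, key=lambda label: hamming(query, class_vectors[label]))

def ensemble_classify(queries, models):
    """Majority vote over models; queries[i] is the input encoded for models[i]."""
    votes = Counter(classify(q, m) for q, m in zip(queries, models))
    return votes.most_common(1)[0][0]

# Three hypothetical 8-bit models for a two-language identification task.
models = [
    {"en": 0b11110000, "de": 0b00001111},
    {"en": 0b10101010, "de": 0b01010101},
    {"en": 0b11001100, "de": 0b00110011},
]
queries = [0b11100000, 0b01010111, 0b11001000]
prediction = ensemble_classify(queries, models)
```

In this toy example two of the three models vote for "en", so the ensemble overrides the single dissenting model. Real hypervectors would have thousands of dimensions rather than eight.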
With the success of deep learning, click-through rate (CTR) predictions are transitioning from shallow approaches to deep architectures. Current deep CTR prediction usually follows the Embedding & MLP paradigm, where the model embeds categorical features into latent semantic space. This paper introduces a novel embedding technique called neural statistics that instead learns explicit semantics of categorical features by incorporating feature engineering as an innate prior into the deep architecture in an end-to-end manner. Besides, since the statistical information changes over time, we study how to adapt to the distribution shift in the MLP module efficiently. Offline experiments on two public datasets validate the effectiveness of neural statistics against state-of-the-art models. We also apply it to a large-scale recommender system via online A/B tests, where the user's satisfaction is significantly improved.
Graph Neural Networks (GNNs) provide a class of powerful architectures that are effective for graph-based collaborative filtering. Nevertheless, GNNs are known to be vulnerable to adversarial perturbations. Adversarial training is a simple yet effective way to improve the robustness of neural models. For example, many prior studies inject adversarial perturbations into either node features or hidden layers of GNNs. However, perturbing graph structures has been far less studied in recommendations.
To bridge this gap, we propose AdvGraph to model adversarial graph perturbations during the training of GNNs. Our AdvGraph is mainly based on min-max robust optimization, where a universal graph perturbation is obtained through an inner maximization while the outer optimization aims to compute the model parameters of GNNs. However, directly optimizing the inner problem is challenging due to the discrete nature of the graph perturbations. To address this issue, an unbiased gradient estimator is further proposed to compute the gradients of discrete variables. Extensive experiments demonstrate that our AdvGraph is able to enhance the generalization performance of GNN-based recommenders.
While Graph Convolutional Networks (GCNs) have been extended to various fields of artificial intelligence with their powerful representation capabilities, recent studies have revealed that their ability to capture the part-whole structure of the graph is limited. Furthermore, though many GCN variants have been proposed and obtained state-of-the-art results, much early information may be lost during the graph convolution step. To this end, we present a Graph Capsule Network with a Dual Adaptive Mechanism (DA-GCN) to tackle the above challenges. Specifically, the dual-adaptive mechanism captures the part-whole structure of the graph through two components. One is an adaptive node interaction module to explore the potential relationship between interactive nodes. The other is an adaptive attention-based graph dynamic routing to select appropriate graph capsules, so that only favorable graph capsules are gathered and redundant graph capsules are restrained for better capturing the whole structure between graphs. Experiments demonstrate that our proposed algorithm achieves state-of-the-art or competitive results on all datasets.
Hierarchical Text Classification (HTC) is a challenging task where a document can be assigned to multiple hierarchically structured categories within a taxonomy. The majority of prior studies consider HTC as a flat multi-label classification problem, which inevitably leads to the "label inconsistency" problem. In this paper, we formulate HTC as a sequence generation task and introduce a sequence-to-tree framework (Seq2Tree) for modeling the hierarchical label structure. Moreover, we design a constrained decoding strategy with dynamic vocabulary to secure the label consistency of the results. Compared with previous works, the proposed approach achieves significant and consistent improvements on three benchmark datasets.
In the era of big data, eXtreme Multi-label Classification (XMC) has become one of the most essential research tasks for dealing with enormous label spaces in machine learning applications. Instead of assessing every individual label, most XMC methods rely on label trees or filters to derive short ranked label lists as predictions, thereby reducing computational overhead. Specifically, existing studies obtain ranked label lists with a fixed length for prediction and evaluation. However, these predictions are unreasonable since data points have varied numbers of relevant labels. Very small or very large list lengths in evaluation, such as Precision@5 and Recall@100, can also lead to other relevant labels being ignored or many irrelevant labels being tolerated. In this paper, we aim to provide reasonable predictions for extreme multi-label classification with dynamic numbers of predicted labels. In particular, we propose a novel framework, Model-Agnostic List Truncation with Ordinal Regression (MALTOR), to leverage the ranking properties and truncate long ranked label lists for better accuracy. Extensive experiments conducted on six large-scale real-world benchmark datasets demonstrate that MALTOR significantly outperforms statistical baseline methods and conventional ranked list truncation methods in ad-hoc retrieval with both linear and deep XMC models. The results of an ablation study also show the effectiveness of each individual component in our proposed MALTOR.
Target-oriented opinion words extraction (TOWE) is a subtask of aspect-based sentiment analysis (ABSA). Given a sentence and an aspect term occurring in the sentence, TOWE extracts the corresponding opinion words for the aspect term. TOWE has two types of instance. In the first type, aspect terms are associated with at least one opinion word, while in the second type, aspect terms do not have corresponding opinion words. However, previous studies trained and evaluated their models with only the first type of instance, resulting in a sample selection bias problem. Specifically, TOWE models were trained with only the first type of instance, while these models would be used to make inferences on the entire space, which contains both types of instance. Thus, the generalization performance is hurt. Moreover, the performance of these models on the first type of instance cannot reflect their performance on the entire space. To validate the sample selection bias problem, four popular TOWE datasets containing only aspect terms associated with at least one opinion word are extended to additionally include aspect terms without corresponding opinion words. Experimental results on these datasets show that training TOWE models on the entire space significantly improves model performance and that evaluating TOWE models only on the first type of instance overestimates model performance.
Current conversational passage retrieval systems cast conversational search into ad-hoc search by using an intermediate query resolution step that places the user's question in the context of the conversation. While the proposed methods have proven effective, they still assume the availability of large-scale question resolution and conversational search datasets. To waive the dependency on the availability of such data, we adapt a pre-trained token-level dense retriever on ad-hoc search data to perform conversational search with no additional fine-tuning. The proposed method contextualizes the user question within the conversation history, but restricts the matching to the question and the potential answer. Our experiments demonstrate the effectiveness of the proposed approach. We also perform an analysis that provides insights into how contextualization works in the latent space, in essence introducing a bias towards salient terms from the conversation.
Graph Convolutional Neural Networks (GNN) based recommender systems are state-of-the-art since they can capture the high order collaborative signals between users and items. However, they suffer from the feature leakage problem since label information determined by edges can be leaked into node embeddings through the GNN aggregation procedure guided by the same set of edges, leading to poor generalization. We propose the accurate removal algorithm to generate the final embedding. For each edge, the embeddings of the two end nodes are evaluated on a graph with that edge removed. We devise an algebraic trick to efficiently compute this procedure without explicitly constructing separate graphs for the LightGCN model. Experiments on four datasets demonstrate that our algorithm can perform better on datasets with sparse interactions, while the training time is significantly reduced.
The exposure sequence is being actively studied for user interest modeling in Click-Through Rate (CTR) prediction. However, the existing methods for exposure sequence modeling bring an extensive computational burden and neglect noise problems, resulting in excessive latency and limited performance in online recommenders. In this paper, we propose to address the high latency and noise problems via Gating-adapted wavelet multiresolution analysis (Gama), which can effectively denoise the extremely long exposure sequence and adaptively capture the implied multi-dimension user interest with linear computational complexity. This is the first attempt to integrate a non-parametric multiresolution analysis technique into a deep neural network to model the user exposure sequence. Extensive experiments on a large-scale benchmark dataset and a real production dataset confirm the effectiveness of Gama for exposure sequence modeling, especially in cold-start scenarios. Benefiting from its low latency and high effectiveness, Gama has been deployed in our real large-scale industrial recommender, successfully serving hundreds of millions of users.
Deep neural networks (DNN) based recommender models often require numerous parameters to achieve remarkable performance. However, this inevitably brings redundant neurons, a phenomenon referred to as over-parameterization. In this paper, we plan to exploit such redundancy phenomena for recommender systems (RS), and propose a top-N item recommendation framework called PCRec that leverages collaborative training of two recommender models of the same network structure, termed peer collaboration. We first introduce two criteria to identify the importance of parameters of a given recommender model. Then, we rejuvenate the unimportant parameters by copying parameters from its peer network. After such an operation and retraining, the original recommender model is endowed with more representation capacity by possessing more functional model parameters. To show its generality, we instantiate PCRec by using three well-known recommender models. We conduct extensive experiments on two real-world datasets, and show that PCRec yields significantly better performance than its counterpart with the same model (parameter) size.
Neural information retrieval architectures based on transformers such as BERT are able to significantly improve system effectiveness over traditional sparse models such as BM25. Though highly effective, these neural approaches are very expensive to run, making them difficult to deploy under strict latency constraints. To address this limitation, recent studies have proposed new families of learned sparse models that try to match the effectiveness of learned dense models, while leveraging the traditional inverted index data structure for efficiency.
Current learned sparse models learn the weights of terms in documents and, sometimes, queries; however, they exploit different vocabulary structures, document expansion techniques, and query expansion strategies, which can make them slower than traditional sparse models such as BM25. In this work, we propose a novel indexing and query processing technique that exploits a traditional sparse model's "guidance" to efficiently traverse the index, allowing the more effective learned model to execute fewer scoring operations. Our experiments show that our guided processing heuristic is able to boost the efficiency of the underlying learned sparse model by a factor of four without any measurable loss of effectiveness.
Recent works show the possibility of transferring the CLIP (Contrastive Language-Image Pretraining) model for video-text retrieval with promising performance. However, due to the domain gap between static images and videos, CLIP-based video-text retrieval models with interaction-based matching perform far worse than models with representation-based matching. In this paper, we propose a novel image animation strategy to transfer the image-text CLIP model to video-text retrieval effectively. By imitating the video shooting components, we convert widely used image-language corpus to synthesized video-text data for pretraining. To reduce the time complexity of interaction matching, we further propose a coarse to fine framework which consists of dual encoders for fast candidates searching and a cross-modality interaction module for fine-grained re-ranking. The coarse to fine framework with the synthesized video-text pretraining provides significant gains in retrieval accuracy while preserving efficiency. Comprehensive experiments conducted on MSR-VTT, MSVD, and VATEX datasets demonstrate the effectiveness of our approach.
Explicit feedback, that is, user input regarding their interest in an item, is the most helpful information for recommendation as it comes directly from the user and shows their direct interest in the item. Most approaches either treat the recommendation given such feedback as a typical regression problem or regard such data as implicit and then directly adopt approaches for implicit feedback; both methods, however, tend to yield unsatisfactory performance in top-k recommendation. In this paper, we propose interaction-level preference ranking (IPR), a novel pairwise ranking embedding learning approach to better utilize explicit feedback for recommendation. Experiments conducted on three real-world datasets show that IPR yields the best results compared to six strong baselines.
News recommendation aims to match news with personalized user interest. Existing methods for news recommendation usually model user interest from historical clicked news without considering the candidate news. However, each user usually has multiple interests, and it is difficult for these methods to accurately match a candidate news item with a specific user interest. In this paper, we present a candidate-aware user modeling method for personalized news recommendation, which can incorporate candidate news into user modeling for better matching between candidate news and user interest. We propose a candidate-aware self-attention network that uses candidate news as a clue to model candidate-aware global user interest. In addition, we propose a candidate-aware CNN network to incorporate candidate news into local behavior context modeling and learn candidate-aware short-term user interest. Besides, we use a candidate-aware attention network to aggregate previously clicked news weighted by their relevance to the candidate news to build a candidate-aware user representation. Experiments on real-world datasets show the effectiveness of our method in improving news recommendation performance.
Emoji recommendation is an important task to help users find appropriate emojis from thousands of candidates based on a short tweet text. Traditional emoji recommendation methods lack personalization and ignore user historical information in selecting emojis. In this paper, we propose personalized emoji recommendation with dynamic user preference (PERD), which contains a text encoder and a personalized attention mechanism. The text encoder uses a BERT model to learn dense and low-dimensional representations of tweets. In the personalized attention mechanism, dynamic user preferences are learned according to the semantic and sentimental similarity between historical tweets and the tweet awaiting emoji recommendation, so that informative historical tweets are selected and highlighted. Experiments are carried out on two real-world datasets from Sina Weibo and Twitter. Experimental results validate the superiority of our approach on personalized emoji recommendation.
Social recommendation with Graph Neural Networks (GNNs) learns to represent cold users by fusing user-user social relations with user-item interactions, thereby alleviating the cold-start problem associated with recommender systems. Despite being well adapted to social relations and user-item interactions, these supervised models are still susceptible to popularity bias. Contrastive learning helps resolve this dilemma by identifying the properties that distinguish positive from negative samples. However, previous combinations of contrastive learning with recommender systems do not consider social relationships and cold-start cases. Also, they primarily focus on collaborative features between users and items, leaving the similarity between items under-utilized. In this work, we propose socially-aware dual contrastive learning for cold-start recommendation, where cold users can be modeled in the same way as warm users. To take full advantage of social relations, we create dynamic node embeddings for each user by aggregating information from different neighbors according to each different query item, in the form of user-item pairs. We further design a dual-branch self-supervised contrastive objective to account for user-item collaborative features and item-item mutual information, respectively. On the one hand, our framework eliminates popularity bias with proper negative sampling in contrastive learning, without extra ground-truth supervision. On the other hand, we extend previous contrastive learning methods to provide a solution to the cold-start problem with social relations included. Extensive experiments on two real-world social recommendation datasets demonstrate its effectiveness.
Neural Multi-task Learning is gaining popularity as a way to learn multiple tasks jointly within a single model. While related research continues to break new ground, two major limitations still remain, including (i) poor generalization to scenarios where tasks are loosely correlated; and (ii) under-investigation on global commonality and local characteristics of tasks. Our aim is to bridge these gaps by presenting a neural multi-task learning model coined Hierarchical Task-aware Multi-headed Attention Network (HTMN). HTMN explicitly distinguishes task-specific features from task-shared features to reduce the impact caused by weak correlation between tasks. The proposed method highlights two parts: Multi-level Task-aware Experts Network that identifies task-shared global features and task-specific local features, and Hierarchical Multi-Head Attention Network that hybridizes global and local features to profile more robust and adaptive representations for each task. Afterwards, each task tower receives its hybrid task-adaptive representation to perform task-specific predictions. Extensive experiments on two real datasets show that HTMN consistently outperforms the compared methods on a variety of prediction tasks.
Image-Text Retrieval via Contrastive Learning with Auxiliary Generative Features and Support-set Regularization
In this paper, we bridge the heterogeneity gap between different modalities and improve image-text retrieval by taking advantage of auxiliary image-to-text and text-to-image generative features with contrastive learning. Concretely, contrastive learning is devised to narrow the distance between aligned image-text pairs and to push apart unaligned pairs from both inter- and intra-modality perspectives, with the help of cross-modal retrieval features and auxiliary generative features. In addition, we devise a support-set regularization term to further improve contrastive learning by constraining the distance between each image/text and its corresponding cross-modal support-set information contained in the same semantic category. To evaluate the effectiveness of the proposed method, we conduct experiments on three benchmark datasets (i.e., MIRFLICKR-25K, NUS-WIDE, MS COCO). Experimental results show that our model significantly outperforms strong baselines for cross-modal image-text retrieval. For reproducibility, we release the code and data publicly at: https://github.com/Hambaobao/CRCGS.
Previous studies of event-level sentiment analysis (SA) usually model the event as a topic, a category, or target terms, while the structured arguments (e.g., subject, object, time, and location) that may affect the sentiment are not well studied. In this paper, we redefine the task as structured event-level SA and propose an End-to-End Event-level Sentiment Analysis (E3SA) approach to solve this issue. Specifically, we explicitly extract and model the event structure information to enhance event-level SA. Extensive experiments demonstrate the advantages of our proposed approach over state-of-the-art methods. Noting the lack of a suitable dataset, we also release a large-scale real-world dataset with event arguments and sentiment labels to promote further research.
Recently, modeling temporal patterns of user-item interactions has attracted much attention in recommender systems. We argue that existing methods ignore the variety of temporal patterns in user behaviors. We define the subset of user behaviors that are irrelevant to the target item as noise, which limits the performance of target-related time cycle modeling and degrades recommendation performance. In this paper, we propose Denoising Time Cycle Modeling (DiCycle), a novel approach to denoise user behaviors and select the subset of user behaviors that are highly related to the target item. DiCycle is able to explicitly model diverse time cycle patterns for recommendation. Extensive experiments are conducted on both public benchmarks and a real-world dataset, demonstrating the superior performance of DiCycle over state-of-the-art recommendation methods.
P3 Ranker: Mitigating the Gaps between Pre-training and Ranking Fine-tuning with Prompt-based Learning and Pre-finetuning
Compared to other language tasks, applying pre-trained language models (PLMs) for search ranking often requires more nuances and training signals. In this paper, we identify and study the two mismatches between pre-training and ranking fine-tuning: the training schema gap regarding the differences in training objectives and model architectures, and the task knowledge gap considering the discrepancy between the knowledge needed in ranking and that learned during pre-training. To mitigate these gaps, we propose Pre-trained, Prompt-learned and Pre-finetuned Neural Ranker (P3 Ranker). P3 Ranker leverages prompt-based learning to convert the ranking task into a pre-training like schema and uses pre-finetuning to initialize the model on intermediate supervised tasks. Experiments on MS MARCO and Robust04 show the superior performances of P3 Ranker in few-shot ranking. Analyses reveal that P3 Ranker is able to better accustom to the ranking task through prompt-based learning and retrieve necessary ranking-oriented knowledge gleaned in pre-finetuning, resulting in data-efficient PLM adaptation. Our code is available at https://github.com/NEUIR/P3Ranker.
The main focus of our work is the problem of multiple objectives optimization (MOO) while providing a final list of recommendations to the user. Currently, system designers can tune MOO by setting importance of individual objectives, usually in some kind of weighted average setting. However, this does not have to translate into the presence of such objectives in the final results. In contrast, in our work we would like to allow system designers or end-users to directly quantify the required relative ratios of individual objectives in the resulting recommendations, e.g., the final results should have 60% relevance, 30% diversity and 10% novelty. If individual objectives are transformed to represent quality on the same scale, these result conditioning expressions may greatly contribute towards recommendations tuneability and explainability as well as user's control over recommendations.
To achieve this task, we propose an iterative algorithm inspired by the mandates allocation problem in public elections. The algorithm is applicable as long as per-item marginal gains of individual objectives can be calculated. Effectiveness of the algorithm is evaluated on several settings of relevance-novelty-diversity optimization problem. Furthermore, we also outline several options to scale individual objectives to represent similar value for the user.
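The idea of steering a result list toward target ratios can be pictured with a small greedy sketch. It is only loosely inspired by mandate allocation and is not the authors' algorithm: the scoring rule, the objective names, the target shares, and the per-item marginal gains below are all hypothetical, and the gains are assumed to be already scaled to a common range.

```python
def select_proportional(items, targets, k):
    """Greedily pick k items so accumulated per-objective gains track `targets`.

    items:   dict mapping item name -> dict of per-objective marginal gains
    targets: dict mapping objective -> desired share (shares sum to 1)
    """
    selected = []
    totals = {objective: 0.0 for objective in targets}
    remaining = dict(items)

    def score(gains):
        # Total gain the list would hold if this item were added.
        new_total = sum(totals.values()) + sum(gains.values())
        # Credit each objective only up to its entitled share of that total,
        # so over-represented objectives stop attracting new items.
        return sum(
            max(0.0, min(gains[o], targets[o] * new_total - totals[o]))
            for o in targets
        )

    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda name: score(remaining[name]))
        for o in targets:
            totals[o] += remaining[best][o]
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical candidates with gains for relevance, diversity, and novelty.
candidates = {
    "a": {"rel": 1.0, "div": 0.0, "nov": 0.0},
    "b": {"rel": 0.0, "div": 1.0, "nov": 0.0},
    "c": {"rel": 0.0, "div": 0.0, "nov": 1.0},
    "d": {"rel": 0.9, "div": 0.0, "nov": 0.0},
}
shares = {"rel": 0.6, "div": 0.3, "nov": 0.1}
print(select_proportional(candidates, shares, 3))
```

In this toy run a relevance-heavy item is taken first, a diversity item second, and relevance again third, because relevance is still below its 60% entitlement at that point.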
Adversarial Filtering Modeling on Long-term User Behavior Sequences for Click-Through Rate Prediction
Rich user behavior information is of great importance for capturing and understanding user interest in click-through rate (CTR) prediction. To improve the richness, collecting long-term behaviors has become a typical approach in academia and industry, but at the cost of increased online storage and latency. Recently, researchers have proposed several approaches to shorten long-term behavior sequences and then model user interests. These approaches reduce online cost efficiently but do not handle the noisy information in long-term user behavior well, which may deteriorate the performance of CTR prediction significantly. To obtain a better cost/performance trade-off, we propose a novel Adversarial Filtering Model (ADFM) to model long-term user behavior. ADFM uses a hierarchical aggregation representation to compress the raw behavior sequence and then learns to remove useless behavior information with an adversarial filtering mechanism. The selected user behaviors are fed into an interest extraction module for CTR prediction. Experimental results on public datasets and an industrial dataset demonstrate that our method achieves significant improvements over state-of-the-art models.
User modeling is important for news recommendation. Existing methods usually first encode user's clicked news into news embeddings independently and then aggregate them into user embedding. However, the word-level interactions across different clicked news from the same user, which contain rich detailed clues to infer user interest, are ignored by these methods. In this paper, we propose a fine-grained and fast user modeling framework (FUM) to model user interest from fine-grained behavior interactions for news recommendation. The core idea of FUM is to concatenate the clicked news into a long document and transform user modeling into a document modeling task with both intra-news and inter-news word-level interactions. Since vanilla transformer cannot efficiently handle long document, we apply an efficient transformer named Fastformer to model fine-grained behavior interactions. Extensive experiments on two real-world datasets verify that FUM can effectively and efficiently model user interest for news recommendation.
Recent work has shown that more effective dense retrieval models can be obtained by distilling ranking knowledge from an existing base re-ranking model. In this paper, we propose a generic curriculum learning based optimization framework called CL-DRD that controls the difficulty level of training data produced by the re-ranking (teacher) model. CL-DRD iteratively optimizes the dense retrieval (student) model by increasing the difficulty of the knowledge distillation data made available to it. In more detail, we initially provide the student model coarse-grained preference pairs between documents in the teacher's ranking, and progressively move towards finer-grained pairwise document ordering requirements. In our experiments, we apply a simple implementation of the CL-DRD framework to enhance two state-of-the-art dense retrieval models. Experiments on three public passage retrieval datasets demonstrate the effectiveness of our proposed framework.
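The coarse-to-fine progression described above can be illustrated with a small sketch. It is an assumption-laden toy, not the CL-DRD implementation: the grouping of the teacher's ranking into fixed-size rank bands, the function names, and the schedule of group sizes are all hypothetical.

```python
from itertools import combinations


def preference_pairs(teacher_ranking, group_size):
    """Return (preferred, other) pairs for documents in different rank groups.

    Documents within the same group of `group_size` consecutive ranks are
    treated as tied, so a large group size yields only coarse preferences.
    """
    pairs = []
    for (i, better), (j, worse) in combinations(enumerate(teacher_ranking), 2):
        if i // group_size != j // group_size:
            pairs.append((better, worse))
    return pairs


def curriculum(teacher_ranking, group_sizes):
    """Yield training pairs stage by stage, from coarse to fine."""
    for size in group_sizes:
        yield size, preference_pairs(teacher_ranking, size)


# Hypothetical teacher ranking for one query, best document first.
ranking = ["d1", "d2", "d3", "d4"]
for size, pairs in curriculum(ranking, group_sizes=[2, 1]):
    print(size, pairs)
```

With a group size of 2 the student only has to separate the top half from the bottom half; with a group size of 1 it must reproduce the full pairwise order.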
Concept drift in stream data has been well studied in machine learning applications. In the field of recommender systems, this issue is also widely observed, where it is known as temporal dynamics in user behavior. Furthermore, in the context of COVID-19 pandemic related contingencies, people shift their behavior patterns drastically and tend to imitate others' opinions. These changes in user behavior may not always be rational, and irrational behavior may impair the knowledge learned by the algorithm. It can cause herd effects and aggravate the popularity bias in recommender systems. However, related research usually pays attention to the concept drift of individuals and overlooks the synergistic effect among users in the same social group. We conduct a study on user behavior to detect collaborative concept drifts among users. We also empirically study whether increased individual experience can weaken herding effects. Our results suggest that collaborative filtering (CF) models are highly impacted by herd behavior, and our findings could provide useful implications for the design of future recommender algorithms.
There is essential information in the underlying structure of words and phrases in natural language questions, and this structure has been extensively studied. In this paper, we study one particular structure, referred to as frozen phrases, that is highly expected to transfer as a whole from questions to answer passages. Frozen phrases, if detected, can be helpful in open-domain Question Answering (QA), where identifying the localized context of a given input question is crucial. An interesting question is whether frozen phrases can be accurately detected. We cast the problem as a sequence-labeling task and create synthetic data from existing QA datasets to train a model. We further plug this model into a sparse retriever that is made aware of the detected phrases. Our experiments reveal that detecting frozen phrases whose presence in answer documents is highly plausible yields significant improvements in retrieval as well as in the end-to-end accuracy of open-domain QA models.
Session-based recommendation (SBR) aims at the next-item prediction with a short behavior session. Existing solutions fail to address two main challenges: 1) user interests are shown as dynamically coupled intents, and 2) sessions always contain noisy signals. To address them, in this paper, we propose a hypergraph-based solution, HIDE. Specifically, HIDE first constructs a hypergraph for each session to model the possible interest transitions from distinct perspectives. HIDE then disentangles the intents under each item click in micro and macro manners. In the micro-disentanglement, we perform intent-aware embedding propagation on session hypergraph to adaptively activate disentangled intents from noisy data. In the macro-disentanglement, we introduce an auxiliary intent-classification task to encourage the independence of different intents. Finally, we generate the intent-specific representations for the given session to make the final recommendation. Benchmark evaluations demonstrate the significant performance gain of our HIDE over the state-of-the-art methods.
The task of temporal language grounding (TLG) aims to locate a video moment in an untrimmed video that matches a given textual query, and it has attracted considerable research attention. Typical retrieval-based TLG methods are inefficient due to pre-segmented candidate moments, while localization-based TLG solutions adopt reinforcement learning, resulting in unstable convergence. Therefore, performing the TLG task efficiently and stably is non-trivial.
Toward this end, we contribute a solution, Point Prompt Tuning (PPT), which formulates this task as a prompt-based multi-modal problem and integrates multiple sub-tasks to tune performance. Specifically, a flexible prompt strategy first rewrites the query so that it contains the query, a start point, and an end point. Thereafter, a multi-modal Transformer is adopted to fully learn the multi-modal context. Meanwhile, we design various sub-tasks to constrain the framework, namely a matching task and a localization task. Finally, the start and end points of the matched video moment are predicted directly, simply yet stably. Extensive experiments on two real-world datasets verify the effectiveness of our proposed solution.
Scaling reinforcement learning (RL) to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents meets the objective of RS, i.e., improving customers' long-term satisfaction. A key approach to this goal is offline RL, which aims to learn policies from logged data rather than expensive online interactions. In this paper, we propose Value Penalized Q-learning (VPQ), a novel uncertainty-based offline RL algorithm that penalizes the unstable Q-values in the regression target using uncertainty-aware weights, achieving the conservative Q-function without the need of estimating the behavior policy, suitable for RS with a large number of items. Experiments on two real-world datasets show the proposed method serves as a gain plug-in for existing RS models.
Recommendation for cold-start users who have very limited data is a canonical challenge in recommender systems. Existing deep recommender systems utilize user content features and behaviors to produce personalized recommendations, yet often face significant performance degradation on cold-start users compared to existing ones due to the following challenges: (1) Cold-start users may have a quite different distribution of features from existing users. (2) The few behaviors of cold-start users are hard to be exploited. In this paper, we propose a recommender system called Cold-Transformer to alleviate these problems. Specifically, we design context-based Embedding Adaption to offset the differences in feature distribution. It transforms the embedding of cold-start users into a warm state that is more like existing ones to represent corresponding user preferences. Furthermore, to exploit the few behaviors of cold-start users and characterize the user context, we propose Label Encoding that models Fused Behaviors of positive and negative feedback simultaneously, which are relatively more sufficient. Last, to perform large-scale industrial recommendations, we keep the two-tower architecture that de-couples user and target item. Extensive experiments on public and industrial datasets show that Cold-Transformer significantly outperforms state-of-the-art methods, including those that are deep coupled and less scalable.
Dialogue systems (DSs) are evaluated depending on their type and purpose. Two categories are often distinguished: (1) task-oriented dialogue systems (TDSs), which are typically evaluated on utility, i.e., their ability to complete a specified task, and (2) open-domain chat-bots, which are evaluated on the user experience, i.e., based on their ability to engage a person. What is the influence of user experience on the user satisfaction rating of TDSs as opposed to, or in addition to, utility? We collect data by providing an additional annotation layer for dialogues sampled from the ReDial dataset, a widely used conversational recommendation dataset. Unlike prior work, we annotate the sampled dialogues at both the turn and dialogue level on six dialogue aspects: relevance, interestingness, understanding, task completion, efficiency, and interest arousal. The annotations allow us to study how different dialogue aspects influence user satisfaction. We introduce a comprehensive set of user experience aspects derived from the annotators' open comments that can influence users' overall impression. We find that the concept of satisfaction varies across annotators and dialogues, and show that a relevant turn is significant for some annotators, while for others, an interesting turn is all they need. Our analysis indicates that the proposed user experience aspects provide a fine-grained analysis of user satisfaction that is not captured by a monolithic overall human rating.
The popularization of social media generates a large amount of user-oriented data, where text data especially attracts researchers and speculators to infer user attributes (e.g., age, gender) for fulfilling their intents. Generally, this line of work casts attribute inference as a text classification problem, and starts to leverage graph neural networks for higher-level text representations. However, these text graphs are constructed on words, suffering from high memory consumption and ineffectiveness on few labeled texts. To address this challenge, we design a text-graph-based few-shot learning model for social media attribute inferences. Our model builds a text graph with texts as nodes and edges learned from current text representations via manifold learning and message passing. To further use unlabeled texts to improve few-shot performance, a knowledge distillation is devised to optimize the problem. This offers a trade-off between expressiveness and complexity. Experiments on social media datasets demonstrate the state-of-the-art performance of our model on attribute inferences with considerably fewer labeled texts.
Progressive Self-Attention Network with Unsymmetrical Positional Encoding for Sequential Recommendation
In real-world recommendation systems, the preferences of users are often affected by long-term constant interests and short-term temporal needs. The recently proposed Transformer-based models have proved superior in the sequential recommendation, modeling temporal dynamics globally via the remarkable self-attention mechanism. However, all equivalent item-item interactions in original self-attention are cumbersome, failing to capture the drifting of users' local preferences, which contain abundant short-term patterns. In this paper, we propose a novel interpretable convolutional self-attention, which efficiently captures both short- and long-term patterns with a progressive attention distribution. Specifically, a down-sampling convolution module is proposed to segment the overall long behavior sequence into a series of local subsequences. Accordingly, the segments are interacted with each item in the self-attention layer to produce locality-aware contextual representations, during which the quadratic complexity in original self-attention is reduced to nearly linear complexity. Moreover, to further enhance the robust feature learning in the context of Transformers, an unsymmetrical positional encoding strategy is carefully designed. Extensive experiments are carried out on real-world datasets, e.g., ML-1M, Amazon Books, and Yelp, indicating that the proposed method outperforms the state-of-the-art methods w.r.t. both effectiveness and efficiency.
With the wide adoption of mobile devices and web applications, location-based social networks (LBSNs) offer large-scale individual-level location-related activities and experiences. Next point-of-interest (POI) recommendation is one of the most important tasks in LBSNs, aiming to make personalized recommendations of next suitable locations to users by discovering preferences from users' historical activities. Noticeably, LBSNs have offered unparalleled access to abundant heterogeneous relational information about users and POIs (including user-user social relations, such as families or colleagues; and user-POI visiting relations). Such relational information holds great potential to facilitate the next POI recommendation. However, most existing methods either focus on merely the user-POI visits, or handle different relations based on over-simplified assumptions while neglecting relational heterogeneities. To fill these critical voids, we propose a novel framework, MEMO, which effectively utilizes the heterogeneous relations with a multi-network representation learning module, and explicitly incorporates the inter-temporal user-POI mutual influence with the coupled recurrent neural networks. Extensive experiments on real-world LBSN data validate the superiority of our framework over the state-of-the-art next POI recommendation methods.
Abstractive summarization of podcasts is motivated by the growing popularity of podcasts and the needs of their listeners. Podcasting is a markedly different domain from news and other media that are commonly studied in the context of automatic summarization. As such, the qualities of a good podcast summary are yet unknown. Using a collection of podcast summaries produced by different algorithms alongside human judgments of summary quality obtained from the TREC 2020 Podcasts Track, we study the correlations between various automatic evaluation metrics and human judgments, as well as the linguistic aspects of summaries that result in strong evaluations.
A Simple Meta-learning Paradigm for Zero-shot Intent Classification with Mixture Attention Mechanism
Zero-shot intent classification is a vital and challenging task in dialogue systems, which aims to deal with numerous fast-emerging unacquainted intents without annotated training data. To obtain more satisfactory performance, the crucial points lie in two aspects: extracting better utterance features and strengthening the model generalization ability. In this paper, we propose a simple yet effective meta-learning paradigm for zero-shot intent classification. To learn better semantic representations for utterances, we introduce a new mixture attention mechanism, which encodes the pertinent word occurrence patterns by leveraging the distributional signature attention and multi-layer perceptron attention simultaneously. To strengthen the transfer ability of the model from seen classes to unseen classes, we reformulate zero-shot intent classification with a meta-learning strategy, which trains the model by simulating multiple zero-shot classification tasks on seen categories, and promotes the model generalization ability with a meta-adapting procedure on mimic unseen categories. Extensive experiments on two real-world dialogue datasets in different languages show that our model outperforms other strong baselines on both standard and generalized zero-shot intent classification tasks.
Given the ubiquitous existence of graph-structured data, learning the representations of nodes for the downstream tasks ranging from node classification, link prediction to graph classification is of crucial importance. Regarding missing link inference of diverse networks, we revisit the link prediction techniques and identify the importance of both the structural and attribute information. However, the available techniques either heavily count on the network topology which is spurious in practice, or cannot integrate graph topology and features properly. To bridge the gap, we propose a bicomponent structural and attribute learning framework (BSAL) that is designed to adaptively leverage information from topology and feature spaces. Specifically, BSAL constructs a semantic topology via the node attributes and then gets the embeddings regarding the semantic view, which provides a flexible and easy-to-implement solution to adaptively incorporate the information carried by the node attributes. Then the semantic embedding together with topology embedding are fused together using attention mechanism for the final prediction. Extensive experiments show the superior performance of our proposal and it significantly outperforms baselines on diverse research benchmarks.
Useful tips extracted from product reviews help customers make a more informed purchase decision, as well as make better, easier, and safer use of the product. In this work we argue that extracted tips should be examined based on the amount of support and opposition they receive from all product reviews. A classifier, developed for this purpose, determines the degree to which a tip is supported or contradicted by a single review sentence. These support-levels are then aggregated over all review sentences, providing a global support score and a global contradiction score, which reflect the support-level of all reviews for the given tip and thus improve customer confidence in the tip's validity. By analyzing a large set of tips extracted from product reviews, we propose a novel taxonomy for categorizing tips as highly-supported, highly-contradicted, controversial (supported and contradicted), and anecdotal (neither supported nor contradicted).
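The four-way taxonomy follows directly from two global scores, as the following minimal sketch shows. The aggregation by summation, the score range, and the threshold value are assumptions made for illustration; the abstract does not state how the authors aggregate or where they place the cut-offs.

```python
def categorize_tip(sentence_scores, threshold=2.0):
    """Assign a tip to one of the four taxonomy categories.

    sentence_scores: list of (support, contradiction) pairs, one per review
        sentence, each value assumed to lie in [0, 1] (hypothetical
        classifier output).
    threshold: hypothetical cut-off applied to both global scores.
    """
    # Aggregate per-sentence levels into two global scores (here: plain sums).
    global_support = sum(s for s, _ in sentence_scores)
    global_contradiction = sum(c for _, c in sentence_scores)

    supported = global_support >= threshold
    contradicted = global_contradiction >= threshold

    if supported and contradicted:
        return "controversial"
    if supported:
        return "highly-supported"
    if contradicted:
        return "highly-contradicted"
    return "anecdotal"


print(categorize_tip([(0.9, 0.0), (0.8, 0.1), (0.7, 0.0)]))
print(categorize_tip([(0.1, 0.2)]))
```

Keeping support and contradiction as separate scores, rather than a single signed score, is what allows the controversial and anecdotal categories to be told apart.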
Session-based recommendation has recently attracted increasing research effort. Most existing approaches aim to discover users' potential preferences or interests from anonymous session data. This ignores the fact that such sequential behavior data usually reflect the session user's underlying demand, i.e., a semantic-level factor, so estimating the underlying demands from a session becomes a challenging task. To tackle this issue, this paper proposes a novel demand-aware graph neural network model. In particular, a demand modeling component is designed to extract the multiple underlying demands of each session. Then, the demand-aware graph neural network first constructs session demand graphs and then learns demand-aware item embeddings to make the recommendation. A mutual information loss is further designed to enhance the quality of the learnt embeddings. Extensive experiments on two real-world datasets show that the proposed model achieves state-of-the-art performance.
Stance detection aims to identify the stance of a text towards a target. Unlike conventional stance detection, Zero-Shot Stance Detection (ZSSD) needs to predict stances towards unseen targets during the inference stage. Humans generally tend to reason about the stance of a new target by linking it with related knowledge learned from known ones. Therefore, in this paper, to better generalize the target-related stance features learned from the known targets to the unseen ones, we incorporate targeted background knowledge from Wikipedia into the model. The background knowledge can be considered a bridge connecting the meanings of known targets and unseen ones, which improves the generalization and reasoning ability of the model in dealing with ZSSD. Extensive experimental results demonstrate that our model outperforms the state-of-the-art methods on the ZSSD task.
Translation-Based Implicit Annotation Projection for Zero-Shot Cross-Lingual Event Argument Extraction
Zero-shot cross-lingual event argument extraction (EAE) is a challenging yet practical problem in Information Extraction. Most previous works rely heavily on external structured linguistic features, which are not easily accessible in real-world scenarios. This paper investigates a translation-based method to implicitly project annotations from the source language to the target language. With the use of translation-based parallel corpora, no additional linguistic features are required during training and inference. As a result, the proposed approach is more cost-effective than previous works on zero-shot cross-lingual EAE. Moreover, our implicit annotation projection approach introduces less noise and hence is more effective and robust than explicit ones. Experimental results show that our model achieves the best performance, outperforming a number of competitive baselines. The thorough analysis further demonstrates the effectiveness of our model compared to explicit annotation projection approaches.
Sequential recommendation aims to model dynamic user behavior from historical interactions. Self-attentive methods have proven effective at capturing short-term dynamics and long-term preferences. Despite their success, these approaches still struggle with sparse data, on which they fail to learn high-quality item representations. We propose to model user dynamics from shopping intents and interacted items simultaneously. The learned intents are coarse-grained and work as prior knowledge for item recommendation. To this end, we present a coarse-to-fine self-attention framework, namely CaFe, which explicitly learns coarse-grained and fine-grained sequential dynamics. Specifically, CaFe first learns intents from coarse-grained sequences, which are dense and hence provide high-quality user intent representations. Then, CaFe fuses intent representations into item encoder outputs to obtain improved item representations. Finally, we infer recommended items based on representations of items and corresponding intents. Experiments on sparse datasets show that CaFe outperforms state-of-the-art self-attentive recommenders by 44.03% in NDCG@5 on average.
User modeling is critical for personalization. Existing methods usually train user models from task-specific labeled data, which may be insufficient. In fact, there are usually abundant unlabeled user behavior data that encode rich universal user information, and pre-training user models on them can empower user modeling in many downstream tasks. In this paper, we propose a user model pre-training method named UserBERT to learn universal user models on unlabeled user behavior data with two contrastive self-supervision tasks. The first one is masked behavior prediction and discrimination, aiming to model the contexts of user behaviors. The second one is behavior sequence matching, aiming to capture user interest stable in different periods. Besides, we propose a medium-hard negative sampling framework to select informative negative samples for better contrastive pre-training. Extensive experiments validate the effectiveness of UserBERT in user model pre-training.
Programming-based Pre-trained Language Models (PPLMs) such as CodeBERT have achieved great success in many downstream code-related tasks. Since the memory and computational complexity of self-attention in the Transformer grow quadratically with the sequence length, PPLMs typically limit the code length to 512. However, code in real-world applications, such as code search, is generally long and cannot be processed efficiently by existing PPLMs. To solve this problem, in this paper, we present SASA, a Structure-Aware Sparse Attention mechanism, which reduces the complexity and improves performance for long code understanding tasks. The key components in SASA are top-k sparse attention and Abstract Syntax Tree (AST)-based structure-aware attention. With top-k sparse attention, the most crucial attention relations can be obtained with a lower computational cost. As the code structure represents the logic of the code statements, which is a complement to the code sequence characteristics, we further introduce AST structures into attention. Extensive experiments on CodeXGLUE tasks show that SASA achieves better performance than the competing baselines.
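The general notion of top-k sparse attention can be sketched in a few lines of plain Python. This is a generic illustration, not SASA itself: the abstract does not say how the top-k relations are chosen, and for clarity the sketch scores every key before discarding all but the k best, so it shows the sparsity pattern rather than the computational saving. The function name and inputs are hypothetical.

```python
import math


def topk_sparse_attention(queries, keys, values, k):
    """Scaled dot-product attention where each query attends to its k best keys.

    queries, keys, values: lists of equal-length float vectors
    (keys and values must have the same number of rows).
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(dim)
            for key in keys
        ]
        # Keep only the indices of the k highest-scoring keys.
        top = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Numerically stable softmax restricted to the kept keys.
        peak = max(scores[j] for j in top)
        weights = {j: math.exp(scores[j] - peak) for j in top}
        norm = sum(weights.values())
        # Weighted sum of the corresponding value vectors.
        outputs.append([
            sum(weights[j] / norm * values[j][t] for j in top)
            for t in range(len(values[0]))
        ])
    return outputs


q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(topk_sparse_attention(q, K, V, k=1))
```

With `k=1` each query simply copies the value of its best-matching key; with `k` equal to the number of keys the function reduces to ordinary dense attention.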
Learning Trustworthy Web Sources to Derive Correct Answers and Reduce Health Misinformation in Search
When searching the web for answers to health questions, people can make incorrect decisions that have a negative effect on their lives if the search results contain misinformation. To reduce health misinformation in search results, we need to be able to detect documents with correct answers and promote them over documents containing misinformation. Determining the correct answer has been a difficult hurdle to overcome for participants in the TREC Health Misinformation Track. In the 2021 track, automatic runs were not allowed to use the known answer to a topic's health question, and as a result, the top automatic run had a compatibility-difference score of 0.043 while the top manual run, which used the known answer, had a score of 0.259. The compatibility-difference measures the ability of methods to rank correct and credible documents before incorrect and non-credible documents. By using an existing set of health questions and their known answers, we show it is possible to learn which web hosts are trustworthy, from which we can predict the correct answers to the 2021 health questions with an accuracy of 76%. Using our predicted answers, we can promote documents that we predict contain this answer and achieve a compatibility-difference score of 0.129, which is a three-fold increase in performance over the best previous automatic method.
Binary pointwise labels (aka implicit feedback) are heavily leveraged by deep learning based recommendation algorithms nowadays. In this paper we argue that the limited expressiveness of these labels may fail to accommodate varying degrees of user preference, and thus lead to conflicts during model training, which we call annotation bias. To solve this issue, we find the soft-labeling property of pairwise labels could be utilized to alleviate the bias of pointwise labels. To this end, we propose a momentum contrast framework that combines pointwise and pairwise learning for recommendation. The framework has a three-tower network structure: one user network and two item networks. The two item networks are used for computing pointwise and pairwise loss respectively. To alleviate the influence of the annotation bias, we perform a momentum update to ensure a consistent item representation. Extensive experiments on real-world datasets demonstrate the superiority of our method against state-of-the-art recommendation algorithms.
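The momentum update mentioned in the abstract above is not specified in detail there. A common form, used in momentum contrast methods generally, is an exponential moving average of one network's parameters toward another's. The sketch below assumes that form; the function name, the flat parameter lists, and the default coefficient are hypothetical, and which of the two item networks plays which role is not stated in the abstract.

```python
def momentum_update(online_params, target_params, m=0.99):
    """Move each target parameter a fraction (1 - m) toward the online one.

    Assumed update rule: target <- m * target + (1 - m) * online.
    """
    return [m * t + (1.0 - m) * o for o, t in zip(online_params, target_params)]
```

A coefficient close to 1 makes the target network change slowly, which is what keeps its item representations consistent between training steps.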
Different from large-scale platforms such as Taobao and Amazon, CVR modeling in small-scale recommendation scenarios is more challenging due to the severe Data Distribution Fluctuation (DDF) issue. DDF prevents existing CVR models from being effective since 1) several months of data are needed to train CVR models sufficiently in small scenarios, leading to considerable distribution discrepancy between training and online serving; and 2) e-commerce promotions have significant impacts on small scenarios, leading to distribution uncertainty of the upcoming time period. In this work, we propose a novel CVR method named MetaCVR from a perspective of meta learning to address the DDF issue. Firstly, a base CVR model which consists of a Feature Representation Network (FRN) and output layers is designed and trained sufficiently with samples across months. Then we treat time periods with different data distributions as different occasions and obtain positive and negative prototypes for each occasion using the corresponding samples and the pre-trained FRN. Subsequently, a Distance Metric Network (DMN) is devised to calculate the distance metrics between each sample and all prototypes to facilitate mitigating the distribution uncertainty. At last, we develop an Ensemble Prediction Network (EPN) which incorporates the output of FRN and DMN to make the final CVR prediction. In this stage, we freeze the FRN and train the DMN and EPN with samples from the recent time period, therefore effectively easing the distribution discrepancy. To the best of our knowledge, this is the first study of CVR prediction targeting the DDF issue in small-scale recommendation scenarios. Experimental results on real-world datasets validate the superiority of our MetaCVR, and an online A/B test also shows our model achieves impressive gains of 11.92% on PCVR and 8.64% on GMV.
A human-like user simulator that anticipates users' satisfaction scores, actions, and utterances can help goal-oriented dialogue systems in evaluating the conversation and refining their dialogue strategies. However, little work has experimented with user simulators which can generate users' utterances. In this paper, we propose a deep learning-based user simulator that predicts users' satisfaction scores and actions while also jointly generating users' utterances in a multi-task manner. In particular, we show that 1) the proposed deep text-to-text multi-task neural model achieves state-of-the-art performance in the users' satisfaction scores and actions prediction tasks, and 2) in an ablation analysis, user satisfaction score prediction, action prediction, and utterance generation tasks can boost the performance with each other via positive transfers across the tasks. The source code and model checkpoints used for the experiments run in this paper are available at the following weblink: https://github.com/kimdanny/user-simulation-t5.
The wide dissemination of fake news is increasingly threatening both individuals and society. Fake news detection aims to train a model on the past news and detect fake news of the future. Though great efforts have been made, existing fake news detection methods overlooked the unintended entity bias in the real-world data, which seriously influences models' generalization ability to future data. For example, 97% of news pieces in 2010-2017 containing the entity 'Donald Trump' are real in our data, but the percentage falls down to merely 33% in 2018. This would lead the model trained on the former set to hardly generalize to the latter, as it tends to predict news pieces about 'Donald Trump' as real for lower training loss. In this paper, we propose an entity debiasing framework (ENDEF) which generalizes fake news detection models to the future data by mitigating entity bias from a cause-effect perspective. Based on the causal graph among entities, news contents, and news veracity, we separately model the contribution of each cause (entities and contents) during training. In the inference stage, we remove the direct effect of the entities to mitigate entity bias. Extensive offline experiments on the English and Chinese datasets demonstrate that the proposed framework can largely improve the performance of base fake news detectors, and online tests verify its superiority in practice. To the best of our knowledge, this is the first work to explicitly improve the generalization ability of fake news detection models to the future data. The code has been released at https://github.com/ICTMCG/ENDEF-SIGIR2022.
Dialogue topic segmentation is a challenging task in which dialogues are split into segments with pre-defined topics. Existing works on topic segmentation adopt a two-stage paradigm, including text segmentation and segment labeling. However, such methods tend to focus on the local context in segmentation, and the inter-segment dependency is not well captured. Besides, the ambiguity and labeling noise in dialogue segment bounds bring further challenges to existing models. In this work, we propose the Parallel Extraction Network with Neighbor Smoothing (PEN-NS) to address the above issues. Specifically, we propose the parallel extraction network to perform segment extractions, optimizing the bipartite matching cost of segments to capture inter-segment dependency. Furthermore, we propose neighbor smoothing to handle the segment-bound noise and ambiguity. Experiments on a dialogue-based and a document-based topic segmentation dataset show that PEN-NS outperforms state-of-the-art models significantly.
Dense retrieval is becoming one of the standard approaches for document and passage ranking. The dual-encoder architecture is widely adopted for scoring question-passage pairs due to its efficiency and high performance. Typically, dense retrieval models are evaluated on clean and curated datasets. However, when deployed in real-life applications, these models encounter noisy user-generated text. That said, the performance of state-of-the-art dense retrievers can substantially deteriorate when exposed to noisy text. In this work, we study the robustness of dense retrievers against typos in the user question. We observe a significant drop in the performance of the dual-encoder model when encountering typos and explore ways to improve its robustness by combining data augmentation with contrastive learning. Our experiments on two large-scale passage ranking and open-domain question answering datasets show that our proposed approach outperforms competing approaches. Additionally, we perform a thorough analysis on robustness. Finally, we provide insights on how different typos affect the robustness of embeddings differently and how our method alleviates the effect of some typos but not of others.
The common approach of using clusters of similar documents for ad hoc document retrieval is to rank the clusters in response to the query; then, the cluster ranking is transformed to document ranking. We present a novel supervised approach to transform cluster ranking to document ranking. The approach allows different clusterings and the resultant cluster rankings to be utilized simultaneously; this helps to improve the modeling of the document similarity space. Empirical evaluation shows that using our approach results in performance that substantially transcends the state-of-the-art in cluster-based document retrieval.
Collaborative filtering algorithms capture underlying consumption patterns, including the ones specific to particular demographics or protected information of users, e.g., gender, race, and location. These encoded biases can influence the decision of a recommendation system (RS) towards further separation of the contents provided to various demographic subgroups, and raise privacy concerns regarding the disclosure of users' protected attributes. In this work, we investigate the possibility and challenges of removing specific protected information of users from the learned interaction representations of a RS algorithm, while maintaining its effectiveness. Specifically, we incorporate adversarial training into the state-of-the-art MultVAE architecture, resulting in a novel model, Adversarial Variational Auto-Encoder with Multinomial Likelihood (Adv-MultVAE), which aims at removing the implicit information of protected attributes while preserving recommendation performance. We conduct experiments on the MovieLens-1M and LFM-2b-DemoBias datasets, and evaluate the effectiveness of the bias mitigation method based on the inability of external attackers to reveal the users' gender information from the model. Compared with the baseline MultVAE, the results show that Adv-MultVAE, with marginal deterioration in performance (w.r.t. NDCG and recall), largely mitigates inherent biases in the model on both datasets.
The task of Query Performance Prediction (QPP) in Information Retrieval (IR) involves predicting the relative effectiveness of a search system for a given input query. Supervised approaches for QPP, such as NeuralQPP, are often trained on pairs of queries to capture their relative retrieval performance. However, pointwise approaches, such as the recently proposed BERT-QPP, are generally preferable for efficiency reasons. In this paper, we propose a novel end-to-end neural cross-encoder-based approach that is trained pointwise on individual queries, but listwise over the top ranked documents (split into chunks). In contrast to prior work, the network is then trained to predict the number of relevant documents in each chunk for a given query. Our method is thus a split-n-merge technique that, instead of predicting the likely number of relevant documents in the top-k, predicts the number of relevant documents for each chunk of fixed size p (p < k) and then aggregates them for QPP on top-k. Experiments demonstrate that our method is significantly more effective than other supervised and unsupervised QPP approaches, yielding improvements of up to 30% on the TREC-DL'20 dataset and of nearly 9% on the MS MARCO Dev set.
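The split-and-merge structure described in the abstract above can be sketched as follows. This is only an outline under stated assumptions: `predict_chunk` stands in for the trained cross-encoder and is hypothetical, as are the function names, and the abstract does not say how per-chunk predictions are aggregated, so the plain sum used here is an assumption.

```python
def split_into_chunks(ranked_docs, p):
    """Split a ranked list into consecutive chunks of at most p documents."""
    return [ranked_docs[i:i + p] for i in range(0, len(ranked_docs), p)]


def estimate_relevant_in_top_k(ranked_docs, k, p, predict_chunk):
    """Estimate the number of relevant documents in the top-k.

    predict_chunk is a hypothetical callable that returns the predicted
    number of relevant documents in one chunk; summing is an assumed merge step.
    """
    chunks = split_into_chunks(ranked_docs[:k], p)
    return sum(predict_chunk(chunk) for chunk in chunks)
```

For example, plugging in an oracle that counts documents from a known relevant set recovers the true number of relevant documents in the top-k, which is the quantity the per-chunk predictions are meant to approximate.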
How Does Feedback Signal Quality Impact Effectiveness of Pseudo Relevance Feedback for Passage Retrieval
Pseudo-Relevance Feedback (PRF) assumes that the top results retrieved by a first-stage ranker are relevant to the original query and uses them to improve the query representation for a second round of retrieval. This assumption however is often not correct: some or even all of the feedback documents may be irrelevant. Indeed, the effectiveness of PRF methods may well depend on the quality of the feedback signal and thus on the effectiveness of the first-stage ranker. This aspect however has received little attention before.
In this paper we control the quality of the feedback signal and measure its impact on a range of PRF methods, including traditional bag-of-words methods (Rocchio), and dense vector-based methods (learnt and not learnt). Our results show the important role the quality of the feedback signal plays on the effectiveness of PRF methods. Importantly, and surprisingly, our analysis reveals that not all PRF methods are the same when dealing with feedback signals of varying quality. These findings are critical to gain a better understanding of the PRF methods and of which and when they should be used, depending on the feedback signal quality, and set the basis for future research in this area.
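For readers unfamiliar with the Rocchio method named in the abstract above, here is a minimal sketch of its positive-feedback form: the new query is a weighted sum of the original query vector and the centroid of the feedback documents. The function name, the use of raw term counts as weights, and the default values of `alpha` and `beta` are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def rocchio(query_terms, feedback_docs, alpha=1.0, beta=0.75):
    """Positive-feedback Rocchio: alpha * query + beta * centroid of feedback docs.

    Queries and documents are lists of terms; raw term counts serve as weights.
    """
    new_query = Counter()
    for term, count in Counter(query_terms).items():
        new_query[term] += alpha * count
    for doc in feedback_docs:
        for term, count in Counter(doc).items():
            # Each feedback document contributes equally to the centroid.
            new_query[term] += beta * count / len(feedback_docs)
    return dict(new_query)
```

The sketch makes the dependence on feedback quality visible: every term in a feedback document is added to the query whether or not the document is actually relevant.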
Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives
Following SimCSE, contrastive learning based methods have achieved the state-of-the-art (SOTA) performance in learning sentence embeddings. However, the unsupervised contrastive learning methods still lag far behind the supervised counterparts. We attribute this to the quality of positive and negative samples, and aim to improve both. Specifically, for positive samples, we propose switch-case augmentation to flip the case of the first letter of randomly selected words in a sentence. This is to counteract the intrinsic bias of pre-trained token embeddings to frequency, word cases and subwords. For negative samples, we sample hard negatives from the whole dataset based on a pre-trained language model. Combining the above two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on STS benchmarks in the unsupervised setting.
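The switch-case augmentation described in the abstract above, flipping the case of the first letter of randomly selected words, can be sketched in a few lines. The function name, whitespace tokenization, and the default selection probability are assumptions; the abstract does not give the sampling rate or tokenization used by CARDS.

```python
import random


def switch_case_augment(sentence, prob=0.15, rng=None):
    """Flip the case of the first letter of each word with probability prob."""
    if rng is None:
        rng = random.Random()
    augmented = []
    for word in sentence.split():
        if rng.random() < prob:
            # swapcase turns "cat" into "Cat" and "Cat" into "cat".
            word = word[0].swapcase() + word[1:]
        augmented.append(word)
    return " ".join(augmented)
```

The augmented sentence keeps the same meaning but is typically tokenized differently by a case-sensitive subword tokenizer, which is what makes it usable as a positive sample.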
Math Word Problem (MWP) solving aims to automatically solve mathematical questions given in texts. Previous studies tend to design complex models to capture additional information in the original text so as to enable the model to gain more comprehensive features. In this paper, we turn our attention in the opposite direction, and work on how to discard redundant features containing spurious correlations for MWP. To this end, we design an Expression Syntax Information Bottleneck method for MWP (called ESIB) based on variational information bottleneck, which extracts essential features of the expression syntax tree while filtering latent-specific redundancy containing syntax-irrelevant features. The key idea of ESIB is to encourage multiple models to predict the same expression syntax tree for different problem representations of the same problem by mutual learning so as to capture consistent information of expression syntax tree and discard latent-specific redundancy. To improve the generalization ability of the model and generate more diverse expressions, we design a self-distillation loss to encourage the model to rely more on the expression syntax information in the latent space. Experimental results on two large-scale benchmarks show that our model not only achieves state-of-the-art results but also generates more diverse solutions.
Existing approaches for sarcasm detection are mainly based on supervised learning, in which the promising performance largely depends on a considerable amount of labeled data or extra information. In real-world scenarios, however, abundant labeled data or extra information comes at a high labor cost, not to mention that sufficient annotated data is unavailable in many low-resource conditions. To alleviate this dilemma, we investigate sarcasm detection from an unsupervised perspective, in which we explore a masking and generation paradigm in the context to extract the context incongruities for learning sarcastic expression. Further, to improve the feature representations of the sentences, we use unsupervised contrastive learning to improve the sentence representation based on the standard dropout. Experimental results on six perceived sarcasm detection benchmark datasets show that our approach outperforms baselines. Simultaneously, our unsupervised method obtains comparable performance with supervised methods for the intended sarcasm dataset.
Owing to the effectiveness of cross-modal attentions, text-vision BERT models have achieved excellent performance in text-image retrieval. Nevertheless, cross-modal attentions in text-vision BERT models incur an expensive computation cost when tackling text-vision retrieval due to their pairwise input. Therefore, it is normally impractical to deploy them for large-scale cross-modal retrieval in real applications. To address the inefficiency issue in existing text-vision BERT models, in this work, we develop a novel architecture, cross-probe BERT. It devises a small number of text and vision probes, and the cross-modal attentions are efficiently achieved through the interactions between text and vision probes. It incurs a lightweight computation cost, and meanwhile effectively exploits cross-modal attention. Systematic experiments on public benchmarks demonstrate the excellent effectiveness and efficiency of our cross-probe BERT.
Fact verification (FV) is a challenging task which aims to verify a claim using multiple evidential sentences from trustworthy corpora, e.g., Wikipedia. Most existing approaches follow a three-step pipeline framework, including document retrieval, sentence retrieval and claim verification. High-quality evidences provided by the first two steps are the foundation of the effective reasoning in the last step. Despite being important, high-quality evidences are rarely studied by existing works for FV, which often adopt the off-the-shelf models to retrieve relevant documents and sentences in an "index-retrieve-then-rank" fashion. This classical approach has clear drawbacks as follows: i) a large document index as well as a complicated search process is required, leading to considerable memory and computational overhead; ii) independent scoring paradigms fail to capture the interactions among documents and sentences in ranking; iii) a fixed number of sentences are selected to form the final evidence set. In this work, we propose GERE, the first system that retrieves evidences in a generative fashion, i.e., generating the document titles as well as evidence sentence identifiers. This enables us to mitigate the aforementioned technical issues since: i) the memory and computational cost is greatly reduced because the document index is eliminated and the heavy ranking process is replaced by a light generative process; ii) the dependency between documents and that between sentences could be captured via sequential generation process; iii) the generative formulation allows us to dynamically select a precise set of relevant evidences for each claim. The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines, with both time-efficiency and memory-efficiency.
Social relations are often used as auxiliary information to improve recommendations. In the real world, social relations among users are complex and diverse. However, most existing recommendation methods assume only a single social relation (i.e., exploit pairwise relations to mine user preferences), ignoring the impact of multifaceted social relations on user preferences (i.e., high order complexity of user relations). Moreover, an observed fact is that similar items always have similar attractiveness when exposed to users, indicating a potential connection among the static attributes of items. Here, we advocate modeling the dual homogeneity from social relations and item connections by hypergraph convolution networks, named DH-HGCN, to obtain high-order correlations among users and items. Specifically, we use sentiment analysis to extract comment relation and use the k-means clustering to construct item-item correlations, and we then optimize those heterogeneous graphs in a unified framework. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model.
Click-through rate (CTR) prediction is fundamental in many industrial applications, such as online advertising and recommender systems. With the development of online platforms, sequential user behaviors grow rapidly, bringing us great opportunity to better understand user preferences. However, it is extremely challenging for existing sequential models to effectively utilize the entire behavior history of each user. First, there is a lot of noise in such long histories, which can seriously hurt the prediction performance. Second, feeding the long behavior sequence directly results in infeasible inference time and storage cost. In order to tackle these challenges, in this paper we propose a novel framework, which we name User Behavior Clustering Sampling (UBCS). In UBCS, short sub-sequences will be obtained from the whole user history sequence with two cascaded modules: (i) Behavior Sampling module samples short sequences related to candidate items using a novel sampling method which takes relevance and temporal information into consideration; (ii) Item Clustering module clusters items into a small number of cluster centroids, mitigating the impact of noise and improving efficiency. Then, the sampled short sub-sequences will be fed into the CTR prediction module for efficient prediction. Moreover, we conduct a self-supervised consistency pre-training task to extract user persona preference and optimize the sampling module effectively. Experiments on real-world datasets demonstrate the superiority and efficiency of our proposed framework.
A conversational recommendation system (CRS) aims to recommend appropriate items to users by directly asking for their preferences on attributes or by recommending a list of items. However, most existing methods employ only the flat item and attribute relationship, and ignore the hierarchical relationship connected by similar users, which can provide more comprehensive information. Moreover, these methods usually use the user-accepted attributes to represent the conversational history and ignore the hierarchical information of sequential transition in the historical turns. In this paper, we propose Hierarchical Information-aware Conversational Recommender (HICR) to model the two types of hierarchical information to boost the performance of CRS. Experiments conducted on four benchmark datasets verify the effectiveness of our proposed model.
In few-shot relational triple extraction (FS-RTE), one seeks to extract relational triples from plain texts by utilizing only a few annotated samples. Recent work first extracts all entities and then classifies their relations. Such an entity-then-relation paradigm ignores the entity discrepancy between relations. To address it, we propose a novel task decomposition strategy, Relation-then-Entity, for FS-RTE. It first detects relations occurring in a sentence and then extracts the corresponding head/tail entities of the detected relations. To instantiate this strategy, we further propose a model, RelATE, which builds a dual-level attention to aggregate relation-relevant information to detect the relation occurrence and utilizes the annotated samples of the detected relations to extract the corresponding head/tail entities. Experimental results show that our model outperforms previous work by an absolute gain (18.98%, 28.85% in F1 in two few-shot settings).
Survivorship bias is the tendency to concentrate on the positive outcomes of a selection process and overlook the results that generate negative outcomes. We observe that this bias could be present in the popular MS MARCO dataset, given that annotators could not find answers to 38-45% of the queries, leading to these queries being discarded in training and evaluation processes. Although we find that some discarded queries in MS MARCO are ill-defined or otherwise unanswerable, many are valid questions that could be answered had the collection been annotated more completely (around two thirds using modern ranking techniques). This survivability problem distorts the MS MARCO collection in several ways. We find that it affects the natural distribution of queries in terms of the type of information needed. When used for evaluation, we find that the bias likely yields a significant distortion of the absolute performance scores observed. Finally, given that MS MARCO is frequently used for model training, we train models based on subsets of MS MARCO that simulate more survivorship bias. We find that models trained in this setting are up to 9.9% worse when evaluated on versions of the dataset with more complete annotations, and up to 3.5% worse at zero-shot transfer. Our findings are complementary to other recent suggestions for further annotation of MS MARCO, but with a focus on discarded queries.
Latency and efficiency issues are often overlooked when evaluating IR models based on Pretrained Language Models (PLMs) because of the many possible hardware and software testing scenarios. Nevertheless, efficiency is an important part of such systems and should not be overlooked.
In this paper, we focus on improving the efficiency of the SPLADE model since it has achieved state-of-the-art zero-shot performance and competitive results on TREC collections. SPLADE efficiency can be controlled via a regularization factor, but solely controlling this regularization has been shown to not be efficient enough. In order to reduce the latency gap between SPLADE and traditional retrieval systems, we propose several techniques including L1 regularization for queries, a separation of document/query encoders, a FLOPS-regularized middle-training, and the use of faster query encoders. Our benchmark demonstrates that we can drastically improve the efficiency of these models while increasing the performance metrics on in-domain data. To our knowledge, we propose the first neural models that, under the same computing constraints, achieve similar latency (less than 4ms difference) as traditional BM25, while having similar performance (less than 10% MRR@10 reduction) as the state-of-the-art single-stage neural rankers on in-domain data.
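Two of the regularizers named in the abstract above have standard definitions that can be written down directly. The sketch below assumes each representation is a dense list of non-negative term weights over the vocabulary and that a batch is a list of such lists; the function names are hypothetical, and how these terms are weighted and combined with the ranking loss in the paper is not shown.

```python
def l1_regularizer(batch):
    """Mean over the batch of the sum of absolute term weights (promotes sparsity)."""
    return sum(sum(abs(w) for w in rep) for rep in batch) / len(batch)


def flops_regularizer(batch):
    """Sum over vocabulary entries of the squared mean absolute weight.

    Penalizes terms that are active across many items in the batch, which
    tends to even out posting-list lengths as well as reduce them.
    """
    n = len(batch)
    vocab_size = len(batch[0])
    return sum(
        (sum(abs(rep[j]) for rep in batch) / n) ** 2 for j in range(vocab_size)
    )
```

The difference matters for latency: L1 shrinks every weight equally, whereas the FLOPS term penalizes vocabulary entries in proportion to how often they fire, which is closer to the cost of scoring with an inverted index.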
Graphs are used in several applications to represent similarities between instances. For text data, we can represent texts by different features such as bag-of-words, static embeddings (Word2vec, GloVe, etc.), and contextual embeddings (BERT, RoBERTa, etc.), leading to multiple similarities (or graphs) based on each representation. The proposal posits that incorporating the local invariance within every graph and the consistency across different graphs leads to a consensus clustering that improves the document clustering. This problem is complex and challenged with the sparsity and the noisy data included in each graph. To this end, we rely on the modularity metric, which effectively evaluates graph clustering in such circumstances. Therefore, we present a novel approach for text clustering based on both a sparse tensor representation and graph modularity. This leads to cluster texts (nodes) while capturing information arising from the different graphs. We iteratively maximize a Tensor-based Graph Modularity criterion. Extensive experiments on benchmark text clustering datasets are performed, showing that the proposed algorithm referred to as Tensor Graph Modularity -TGM- outperforms other baseline methods in terms of clustering task. The source code is available at https://github.com/TGMclustering/TGMclustering.
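As background for the modularity criterion mentioned in the abstract above, the following sketch computes standard (Newman) modularity for a single undirected, unweighted graph and a given cluster assignment. It is not the tensor-based criterion the paper maximizes; the function name and the edge-list and dictionary inputs are assumptions, and the graph is assumed to have at least one edge and no self-loops.

```python
def modularity(edges, cluster_of):
    """Newman modularity of a clustering of an undirected, unweighted graph.

    edges is a list of (u, v) pairs; cluster_of maps each node to a cluster id.
    """
    m = len(edges)
    intra_edges = {}   # number of edges with both endpoints in the cluster
    total_degree = {}  # sum of node degrees in the cluster
    for u, v in edges:
        for node in (u, v):
            c = cluster_of[node]
            total_degree[c] = total_degree.get(c, 0) + 1
        if cluster_of[u] == cluster_of[v]:
            c = cluster_of[u]
            intra_edges[c] = intra_edges.get(c, 0) + 1
    # Q = sum over clusters of (fraction of intra edges - expected fraction).
    return sum(
        intra_edges.get(c, 0) / m - (deg / (2 * m)) ** 2
        for c, deg in total_degree.items()
    )
```

Placing every node in one cluster gives a modularity of zero, so positive values indicate more within-cluster edges than a degree-preserving random graph would produce.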
BERT-based rankers have been shown to be very effective as rerankers in information retrieval tasks. In order to extend these models to full-ranking scenarios, the ColBERT model has been recently proposed, which adopts a late interaction mechanism. This mechanism allows for the representation of documents to be precomputed in advance. However, the late-interaction mechanism leads to large index size, as one needs to save a representation for each token of every document. In this work, we focus on token pruning techniques in order to mitigate this problem. We test four methods, ranging from simpler ones to the use of a single layer of attention mechanism to select the tokens to keep at indexing time. Our experiments show that for the MS MARCO-passages collection, indexes can be pruned up to 70% of their original size, without a significant drop in performance. We also evaluate on the MS MARCO-documents collection and the BEIR benchmark, which reveals some challenges for the proposed mechanism.
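The general shape of index-time token pruning described in the abstract above can be sketched as keeping only the highest-scoring fraction of each document's tokens. The abstract does not list the four scoring methods, so the per-token scores here are treated as given; the function name, the `(token, score)` input format, and the `keep_ratio` parameter are hypothetical.

```python
def prune_tokens(scored_tokens, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving document order.

    scored_tokens is a list of (token, score) pairs in document order.
    """
    n_keep = max(1, int(len(scored_tokens) * keep_ratio))
    ranked = sorted(
        range(len(scored_tokens)),
        key=lambda i: scored_tokens[i][1],
        reverse=True,
    )
    kept_positions = sorted(ranked[:n_keep])
    return [scored_tokens[i][0] for i in kept_positions]
```

Because a late-interaction index stores one vector per token, dropping tokens before indexing reduces index size roughly in proportion to the fraction removed.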
Hypergraphs (i.e., sets of hyperedges) naturally represent group relations (e.g., researchers co-authoring a paper and ingredients used together in a recipe), each of which corresponds to a hyperedge (i.e., a subset of nodes). Predicting future or missing hyperedges bears significant implications for many applications (e.g., collaboration and recipe recommendation). What makes hyperedge prediction particularly challenging is the vast number of non-hyperedge subsets, which grows exponentially with the number of nodes. Since it is prohibitive to use all of them as negative examples for model training, it is inevitable to sample a very small portion of them, and to this end, heuristic sampling schemes have been employed. However, trained models suffer from poor generalization capability for examples of different natures. In this paper, we propose AHP, an adversarial training-based hyperedge-prediction method. It learns to sample negative examples without relying on any heuristic schemes. Using six real hypergraphs, we show that AHP generalizes better to negative examples of various natures. It yields up to 28.2% higher AUROC than the best existing methods and often even outperforms its variants with sampling schemes tailored to test sets.
GraFN: Semi-Supervised Node Classification on Graph with Few Labels via Non-Parametric Distribution Assignment
Despite the success of Graph Neural Networks (GNNs) on various applications, GNNs encounter significant performance degradation when the amount of supervision signals, i.e., number of labeled nodes, is limited, which is expected as GNNs are trained solely based on the supervision obtained from the labeled nodes. On the other hand, recent self-supervised learning paradigm aims to train GNNs by solving pretext tasks that do not require any labeled nodes, and it has shown to even outperform GNNs trained with few labeled nodes. However, a major drawback of self-supervised methods is that they fall short of learning class discriminative node representations since no labeled information is utilized during training. To this end, we propose a novel semi-supervised method for graphs, GraFN, that leverages few labeled nodes to ensure nodes that belong to the same class to be grouped together, thereby achieving the best of both worlds of semi-supervised and self-supervised methods. Specifically, GraFN randomly samples support nodes from labeled nodes and anchor nodes from the entire graph. Then, it minimizes the difference between two predicted class distributions that are non-parametrically assigned by anchor-supports similarity from two differently augmented graphs. We experimentally show that GraFN surpasses both the semi-supervised and self-supervised methods in terms of node classification on real-world graphs.
Real-world web applications such as Amazon and Netflix often provide services in multiple countries and regions (i.e., markets) around the world. Generally, different markets share similar item sets while containing different amounts of interaction data. Some markets are data-scarce while others are data-rich, and leveraging data from similar, data-rich auxiliary markets could improve recommendation in the data-scarce markets. In this paper, we explore multi-market recommendation (MMR), and propose a novel model called M$^3$Rec to improve recommendation in all markets simultaneously. Since items act as the bridge between different markets, we argue that mining the similarities among items is the key point of MMR. M$^3$Rec preprocesses two global item similarities: intra- and inter-market similarities. Specifically, we first learn the second-order intra-market similarity by adopting linear models with closed-form solutions, and then capture the high-order inter-market similarity by random walk. Afterward, we incorporate the global item similarities into each local market. We conduct extensive experiments on five publicly available markets and compare with several state-of-the-art methods. Detailed experimental results demonstrate the effectiveness of our proposed method.
Interpretable Learning to Rank (LtR) is an emerging field within the research area of explainable AI, aiming at developing intelligible and accurate predictive models. While most previous research efforts focus on creating post-hoc explanations, in this paper we investigate how to train effective and intrinsically-interpretable ranking models. Developing these models is particularly challenging, as it requires finding a trade-off between ranking quality and model complexity. State-of-the-art rankers, made of either large ensembles of trees or several neural layers, in fact exploit an unlimited number of feature interactions, making them black boxes. Previous approaches to intrinsically-interpretable ranking models address this issue by avoiding interactions between features, thus incurring a significant performance drop with respect to full-complexity models. Conversely, ILMART, our novel and interpretable LtR solution based on LambdaMART, is able to train effective and intelligible models by exploiting a limited and controlled number of pairwise feature interactions. Exhaustive and reproducible experiments conducted on three publicly-available LtR datasets show that ILMART outperforms the current state-of-the-art solution for interpretable ranking by a large margin, with a gain in nDCG of up to 8%.
In this work, we focus on the task of generating SPARQL queries from natural language questions, which can then be executed on Knowledge Graphs (KGs). We assume that gold entities and relations have been provided, and the remaining task is to arrange them in the right order along with SPARQL vocabulary and input tokens to produce the correct SPARQL query. Pre-trained Language Models (PLMs) have not been explored in depth on this task so far, so we experiment with BART, T5 and PGNs (Pointer Generator Networks) with BERT embeddings, looking for new baselines in the PLM era for this task, on DBpedia and Wikidata KGs. We show that T5 requires special input tokenisation, but produces state-of-the-art performance on the LC-QuAD 1.0 and LC-QuAD 2.0 datasets, and outperforms task-specific models from previous works. Moreover, the methods enable semantic parsing for questions where a part of the input needs to be copied to the output query, thus enabling a new paradigm in KG semantic parsing.
Learning-to-Rank at the Speed of Sampling: Plackett-Luce Gradient Estimation with Minimal Computational Complexity
Plackett-Luce gradient estimation enables the optimization of stochastic ranking models within feasible time constraints through sampling techniques. Unfortunately, the computational complexity of existing methods does not scale well with the length of the rankings, i.e. the ranking cutoff, nor with the item collection size. In this paper, we introduce the novel PL-Rank-3 algorithm that performs unbiased gradient estimation with a computational complexity comparable to the best sorting algorithms. As a result, our novel learning-to-rank method is applicable in any scenario where standard sorting is feasible in reasonable time. Our experimental results indicate large gains in the time required for optimization, without any loss in performance. For the field, our contribution could potentially allow state-of-the-art learning-to-rank methods to be applied to much larger scales than previously feasible.
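To make the sampling step concrete, the sketch below draws a top-k ranking from a Plackett-Luce model using the Gumbel-max trick, whose cost is dominated by a single sort. It is only an illustration of Plackett-Luce sampling, not the PL-Rank-3 gradient estimator itself; the function name `sample_pl_ranking` and its parameters are assumptions made for this example.

```python
import math
import random


def sample_pl_ranking(scores, k, rng):
    """Sample a top-k ranking from a Plackett-Luce model over item scores.

    Adding independent Gumbel(0, 1) noise to each log-score and sorting in
    descending order yields a ranking distributed according to Plackett-Luce.
    """
    perturbed = []
    for item, score in enumerate(scores):
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        perturbed.append((score + gumbel, item))
    # One sort over all items: O(n log n) per sampled ranking.
    perturbed.sort(reverse=True)
    return [item for _, item in perturbed[:k]]


# Usage: items are identified by their index in the score list.
rng = random.Random(0)
ranking = sample_pl_ranking([0.5, 2.0, 1.0, -1.0], k=3, rng=rng)
```

Sampling several such rankings per query is the usual starting point for sampling-based gradient estimation; how the samples are turned into a gradient is where the methods differ.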
In recent years, multi-task learning models based on deep learning in recommender systems have attracted increasing attention from researchers in industry and academia. Accurately estimating the post-click conversion rate (CVR) is often considered the primary task of multi-task learning in recommender systems. However, some advertisers may try to get higher click-through rates (CTR) by over-decorating their ads, which may result in excessive exposure of samples with lower CVR. For example, some eye-catching clickbait has a high CTR while its actual CVR is very low. As a result, the overall performance of the recommender system is hurt. In this paper, we introduce a novel auxiliary task called CTnoCVR, which aims to predict the probability of events with a click but no conversion, into various state-of-the-art multi-task models of recommender systems to promote samples with high CVR but low CTR. Extensive experiments on a large-scale dataset gathered from traffic logs of Taobao's recommender system demonstrate that the introduction of the CTnoCVR task significantly improves CVR prediction under various multi-task frameworks. In addition, we conduct an online test and evaluate the effectiveness of our proposed method in making samples with high CVR and low CTR rank higher.
Click-through rate (CTR) prediction is essential in the modelling of a recommender system. Previous studies mainly focus on user behavior modelling, while few of them consider candidate item representations. This makes the models strongly dependent on user representations, and less effective when user behavior is sparse. Furthermore, most existing works regard the candidate item as one fixed embedding and ignore the multi-representational characteristics of the item. To handle the above issues, we propose a Deep multi-Representational Item NetworK (DRINK) for CTR prediction. Specifically, to tackle the sparse user behavior problem, we construct a sequence of interacting users and timestamps to represent the candidate item; to dynamically capture the characteristics of the item, we propose a transformer-based multi-representational item network consisting of a multi-CLS representation submodule and contextualized global item representation submodule. In addition, we propose to decouple the time information and item behavior to avoid information overwhelming. Outputs of the above components are concatenated and fed into a MLP layer to fit the CTR. We conduct extensive experiments on real-world datasets of Amazon and the results demonstrate the effectiveness of the proposed model.
Sequential prediction is one of the key components in recommendation. In online e-commerce recommendation systems, user behavior consists of the sequential visiting logs and item behavior contains the list of interacting users in order. Most of the existing state-of-the-art sequential prediction methods only consider the user behavior while ignoring the item behavior. In addition, we find that user behavior varies greatly at different times, and most existing models fail to characterize the rich temporal information. To address the above problems, we propose a transformer-based spatial-temporal recommendation framework (STEM). In the STEM framework, we first utilize attention mechanisms to model user behavior and item behavior, and then exploit spatial and temporal information through a transformer-based model. The STEM framework, as a plug-in, can be incorporated into many neural network-based sequential recommendation methods to improve performance. We conduct extensive experiments on three real-world Amazon datasets. The results demonstrate the effectiveness of our proposed framework.
This paper studies correlation-based item-item similarity measures for recommendation systems. While current research on recommender systems is directed toward deep learning-based approaches, nearest neighbor methods have been still used extensively in commercial recommender systems due to their simplicity. A crucial step in item-based nearest neighbor methods is to compute similarities between items, which are generally estimated through correlation measures like Pearson. The purpose of this paper is to re-investigate the effectiveness of correlation-based nearest neighbor methods on several benchmark datasets that have been used for recommendation evaluation in recent years. This paper also provides a more effective estimation method for correlation measures than the classical Pearson correlation coefficient and shows that this leads to significant improvements in recommendation performance.
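As a point of reference for the similarity step, the sketch below computes the classical Pearson correlation between two items over the users who rated both. It shows only the baseline measure; the more effective estimator the abstract refers to is not described there and is not reproduced here. The nested-dictionary rating layout and the function name are assumptions made for this example.

```python
import math


def pearson_item_similarity(ratings, item_a, item_b):
    """Classical Pearson correlation between two items.

    `ratings` maps user -> {item: rating}. Only users who rated both items
    contribute. Returns 0.0 when the correlation is undefined.
    """
    common = [u for u in ratings if item_a in ratings[u] and item_b in ratings[u]]
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][item_a] for u in common]
    ys = [ratings[u][item_b] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant rating vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Usage with a toy rating matrix (hypothetical data).
toy = {
    "u1": {"a": 1, "b": 2, "c": 5},
    "u2": {"a": 2, "b": 4, "c": 3},
    "u3": {"a": 3, "b": 6, "c": 1},
}
sim_ab = pearson_item_similarity(toy, "a", "b")
```

With few co-rating users the estimate is noisy, which is one reason alternative estimators for the correlation are of interest.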
A mixed list of ads and organic items is usually displayed in a feed, and how to allocate the limited slots to maximize the overall revenue is a key problem. Meanwhile, user behavior modeling is essential in recommendation and advertising (e.g., CTR prediction and ads allocation). Most previous works only model point-level positive feedback (i.e., clicks), neglecting the page-level information of feedback and other types of feedback. To this end, we propose the Deep Page-level Interest Network (DPIN) to model page-level user preference and exploit multiple types of feedback. Specifically, we introduce four different types of page-level feedback, and capture user preference for item arrangement under different receptive fields through the multi-channel interaction module. Through extensive offline and online experiments on the Meituan food delivery platform, we demonstrate that DPIN can effectively model page-level user preference and increase revenue.
In recent years, the emergence and development of third-party platforms have greatly facilitated the growth of the Online to Offline (O2O) business. However, the large amount of transaction data raises new challenges for retailers, especially anomaly detection in operating conditions. Thus, platforms begin to develop intelligent business assistants with embedded anomaly detection methods to reduce the management burden on retailers. Traditional time-series anomaly detection methods capture underlying patterns from the perspectives of time and attributes, ignoring the difference between retailers in this scenario. Besides, similar transaction patterns extracted by the platforms can also provide guidance to individual retailers and enrich their available information without privacy issues. In this paper, we pose an entity-wise multivariate time-series anomaly detection problem that considers the time-series of each unique entity. To address this challenge, we propose GraphAD, a novel multivariate time-series anomaly detection model based on the graph neural network. GraphAD decomposes the Key Performance Indicator (KPI) into stable and volatility components and extracts their patterns in terms of attributes, entities and temporal perspectives via graph neural networks. We also construct a real-world entity-wise multivariate time-series dataset from the business data of Ele.me. The experimental results on this dataset show that GraphAD significantly outperforms existing anomaly detection methods.
Top-K metrics such as NDCG@K are frequently used to evaluate ranking performance. Traditional tree-based models such as LambdaMART, which are based on Gradient Boosted Decision Trees (GBDT), are designed to optimize NDCG@K using the LambdaRank losses. Recently, there has been considerable research interest in neural ranking models for learning-to-rank tasks. These models are fundamentally different from decision tree models and behave differently with respect to different loss functions. For example, the most popular ranking losses used in neural models are the Softmax loss and the GumbelApproxNDCG loss. These losses do not connect naturally to top-K metrics such as NDCG@K. How to effectively optimize NDCG@K for neural ranking models remains an open question. In this paper, we follow the LambdaLoss framework and design novel and theoretically sound losses for NDCG@K metrics, whereas the original LambdaLoss paper can only do so using an unsound heuristic. We study the new losses on the LETOR benchmark datasets and show that they work better than other losses for neural ranking models.
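For readers unfamiliar with the metric being optimized, the sketch below computes NDCG@K for a single ranked list. It assumes the common exponential-gain, log2-discount convention; other gain functions are also in use, and this is the evaluation metric only, not any of the losses discussed in the abstract.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions.

    `relevances` lists graded relevance labels in ranked order.
    Gain is 2^rel - 1 and the discount at 1-based rank r is log2(r + 1).
    """
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k normalized by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant items: define NDCG as 0
    return dcg_at_k(relevances, k) / ideal


# Usage: labels of the documents in the order the ranker returned them.
score = ndcg_at_k([1, 0, 2, 0], k=3)
```

Because the cutoff discards everything below rank K, the metric is insensitive to swaps that happen entirely outside the top K, which is part of what makes a loss that targets it directly non-trivial to design.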
Evidence-based fake news detection aims to judge the veracity of news against relevant evidence. However, models tend to memorize dataset biases, in the form of spurious correlations between news patterns and veracity labels, as shortcuts, rather than learning how to integrate the information behind them to reason. As a consequence, models may fail badly under real-life conditions where most news has different patterns. Inspired by the success of causal inference, we propose a novel framework for debiasing evidence-based fake news detection by causal intervention. Under this framework, the model is first trained on the original biased dataset as usual; then, in the testing stage, it makes conventional predictions and counterfactual predictions simultaneously, where the counterfactual predictions are based on the intervened evidence. Relatively unbiased predictions are obtained by subtracting the intervened outputs from the conventional ones. Extensive experiments conducted on several datasets demonstrate our method's effectiveness and generality on debiased datasets. Code is available at https://github.com/CRIPAC-DIG/CF-FEND.
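The subtraction step at test time can be sketched in a few lines. The example assumes the model exposes per-class output scores for both the original input and the input with intervened evidence; the function name and the plain, unweighted difference are assumptions for illustration, and the released code may differ in how the two outputs are scaled or combined.

```python
def debiased_prediction(conventional_scores, intervened_scores):
    """Subtract counterfactual (intervened-evidence) scores from conventional ones.

    Both arguments are per-class score lists of equal length.
    Returns the predicted class index and the adjusted scores.
    """
    if len(conventional_scores) != len(intervened_scores):
        raise ValueError("score lists must have the same length")
    adjusted = [c - i for c, i in zip(conventional_scores, intervened_scores)]
    label = max(range(len(adjusted)), key=adjusted.__getitem__)
    return label, adjusted


# Usage with hypothetical two-class scores. Class 0 wins on the conventional
# scores alone, but most of that score persists after the evidence is
# intervened on, which suggests it comes from the news pattern.
label, adjusted = debiased_prediction([2.0, 1.0], [1.5, 0.2])
```

The intuition is that whatever score survives the intervention on the evidence cannot have come from reasoning over that evidence, so removing it leaves the evidence-dependent part of the prediction.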
Click-through rate (CTR) prediction plays a critical role in recommender systems and other applications. Recently, modeling user behavior sequences has attracted much attention and brought great improvements in the CTR field. Many existing works utilize attention mechanisms or recurrent neural networks to extract user interest from the sequence, but fail to recognize the simple truth that a user's real-time interests are inherently diverse and fluid. In this paper, we propose DisenCTR, a novel dynamic graph-based disentangled representation framework for CTR prediction. The key novelty of our method compared with existing approaches is that it models the evolving, diverse interests of users. Specifically, we construct a time-evolving user-item interaction graph induced by historical interactions. Based on the rich dynamics supplied by the graph, we propose a disentangled graph representation module to extract diverse user interests. We further exploit the fluidity of user interests and model the temporal effect of historical behaviors using a Mixture of Hawkes Process. Extensive experiments on three real-world datasets demonstrate the superior performance of our method compared to state-of-the-art approaches.
In Conversational Recommender Systems (CRSs), conversations usually involve a set of related items and entities e.g., attributes of items. These items and entities are mentioned in order following the development of a dialogue. In other words, potential sequential dependencies exist in conversations. However, most of the existing CRSs neglect these potential sequential dependencies. In this paper, we propose a Transformer-based sequential conversational recommendation method, named TSCR, which models the sequential dependencies in the conversations to improve CRS. We represent conversations by items and entities, and construct user sequences to discover user preferences by considering both mentioned items and entities. Based on the constructed sequences, we deploy a Cloze task to predict the recommended items along a sequence. Experimental results demonstrate that our TSCR model significantly outperforms state-of-the-art baselines.
Neural Query Synthesis and Domain-Specific Ranking Templates for Multi-Stage Clinical Trial Matching
In this work, we propose an effective multi-stage neural ranking system for the clinical trial matching problem. First, we introduce NQS, a neural query synthesis method that leverages a zero-shot document expansion model to generate multiple sentence-long queries from lengthy patient descriptions. These queries are independently issued to a search engine and the results are fused. We find that on the TREC 2021 Clinical Trials Track, this method outperforms strong traditional baselines like BM25 and BM25 + RM3 by about 12 points in nDCG@10, a relative improvement of 34%. This simple method is so effective that even a state-of-the-art neural relevance ranking method trained on the medical subset of MS MARCO passage, when reranking the results of NQS, fails to improve on the ranked list. Second, we introduce a two-stage neural reranking pipeline trained on clinical trial matching data using tailored ranking templates. In this setting, we can train a pointwise reranker using just 1.1k positive examples and obtain effectiveness improvements over NQS by 24 points. This end-to-end multi-stage system demonstrates a 20% relative effectiveness gain compared to the second-best submission at TREC 2021, making it an important step towards better automated clinical trial matching.
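The abstract does not say how the result lists of the synthesized queries are fused. As one plausible illustration, the sketch below uses reciprocal rank fusion (RRF), a widely used rank-based fusion method; this choice, the constant k = 60, and the function name are assumptions and should not be read as the authors' method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank is 1-based); documents are returned by descending fused score,
    with ties broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Usage: one result list per synthesized query (hypothetical trial ids).
fused = reciprocal_rank_fusion([
    ["trial_1", "trial_2", "trial_3"],
    ["trial_2", "trial_1", "trial_4"],
    ["trial_2", "trial_3", "trial_1"],
])
```

Rank-based fusion has the practical advantage of not requiring the retrieval scores of different queries to be comparable, which matters when each query is issued independently.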
Identifying academic experts is crucial for the progress of science, enabling researchers to connect, form networks, and collaborate on the most pressing research problems. A key challenge for ranking experts in response to a query is how to infer their expertise from the publications they coauthored. Profile-centric approaches represent candidate experts by concatenating all their publications into a text-based profile. Despite offering a complete picture of each candidate's scientific output, such lengthy profiles make it inefficient to leverage state-of-the-art neural architectures for inferring expertise. To overcome this limitation, we investigate the suitability of extractive summarization as a mechanism to reduce candidate profiles for semantic encoding using Transformers. Our thorough experiments with a representative academic search test collection demonstrate the benefits of encoding summarized profiles for an improved expertise inference.
User historical behaviors have proved useful for Click-Through Rate (CTR) prediction in online advertising systems. In Meituan, one of the largest e-commerce platforms in China, an item is typically displayed with its image, and whether a user clicks the item or not is usually influenced by its image, which implies that users' image behaviors are helpful for understanding their visual preferences and improving the accuracy of CTR prediction. Existing user image behavior models typically use a two-stage architecture, which extracts visual embeddings of images through off-the-shelf Convolutional Neural Networks (CNNs) in the first stage, and then jointly trains a CTR model with those visual embeddings and non-visual features. We find that the two-stage architecture is sub-optimal for CTR prediction. Meanwhile, precisely labeled categories in online ad systems contain abundant visual prior information, which can enhance the modeling of user image behaviors. However, off-the-shelf CNNs without a category prior may extract category-unrelated features, limiting the CNN's expressive ability. To address the two issues, we propose a hybrid CNN-based attention module, unifying users' image behaviors and the category prior, for CTR prediction. Our approach achieves significant improvements in both online and offline experiments on a billion-scale real serving dataset.
In e-commerce, ad creatives play an important role in effectively delivering product information to users. The purpose of online creative selection is to learn users' preferences for ad creatives, and to select the most appealing design for users to maximize Click-Through Rate (CTR). However, the existing common practices in the industry usually place the creative selection after the ad ranking stage, and thus the optimal creative fails to reflect the influence on the ad ranking stage. To address these issues, we propose a novel Cascade Architecture of Creative Selection (CACS), which is built before the ranking stage to jointly optimize intra-ad creative selection and inter-ad ranking. To improve efficiency, we design a classic two-tower structure and allow the creative embeddings of the creative selection stage to be shared with the ranking stage. To boost effectiveness, on the one hand, we propose a soft-label list-wise ranking distillation method to distill the ranking knowledge from the ranking stage to guide CACS learning; on the other hand, we also design an adaptive dropout network to encourage the model to probabilistically ignore ID features in favor of content features to learn multi-modal representations of the creative. Above all, the ranking model obtains the optimal creative information of each ad from our CACS, and uses all available features to improve the performance of the ranking model. We have launched our solution on the Taobao advertising platform and have obtained significant improvements in both offline and online evaluations.
BERT-based Dense Intra-ranking and Contextualized Late Interaction via Multi-task Learning for Long Document Retrieval
Combining query tokens and document tokens and inputting them to pre-trained transformer models like BERT, an approach known as interaction-based, has shown state-of-the-art effectiveness for information retrieval. However, the computational complexity of this approach is high due to the online self-attention computation. In contrast, dense retrieval methods in representation-based approaches are known to be efficient but less effective. A tradeoff between the two is reached with late interaction methods like ColBERT, which attempt to benefit from both approaches: contextualized token embeddings can be pre-calculated over BERT for fine-grained effective interaction while preserving efficiency. However, despite its success in passage retrieval, it is not straightforward to use this approach for long document retrieval. In this paper, we propose a cascaded late interaction approach using a single model for long document retrieval. Fast intra-ranking by dot product is used to select relevant passages, then fine-grained interaction of pre-stored token embeddings is used to generate passage scores, which are aggregated into the final document score. Multi-task learning is used to train a BERT model to optimize both a dot-product loss and a fine-grained interaction loss. Our experiments reveal that the proposed approach obtains near state-of-the-art effectiveness while being efficient on such collections as TREC 2019.
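The two-stage cascade described above can be sketched as follows, using ColBERT-style MaxSim as the fine-grained late-interaction score. The data layout (one vector plus a list of token embeddings per passage), the `top_n` cutoff, and aggregating passage scores with `max` are assumptions for illustration; the abstract does not specify the aggregation function.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_token_embs, passage_token_embs):
    """Late interaction: for each query token take its best-matching
    passage token, then sum these maxima over the query tokens."""
    return sum(
        max(dot(q, p) for p in passage_token_embs)
        for q in query_token_embs
    )


def score_document(query_vec, query_token_embs, passages, top_n=2):
    """Cascaded scoring of one long document.

    `passages` is a list of (passage_vec, passage_token_embs) pairs whose
    embeddings are assumed to be pre-computed and stored.
    """
    # Stage 1: cheap intra-ranking of passages by a single dot product.
    selected = sorted(
        passages, key=lambda p: dot(query_vec, p[0]), reverse=True
    )[:top_n]
    # Stage 2: fine-grained interaction only on the selected passages.
    passage_scores = [
        maxsim_score(query_token_embs, token_embs) for _, token_embs in selected
    ]
    # Aggregation into a document score (max is an assumption here).
    return max(passage_scores)
```

The efficiency gain comes from running the token-level comparison on only `top_n` passages per document rather than on all of them.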
Neural retrievers based on dense representations combined with Approximate Nearest Neighbors search have recently received a lot of attention, owing their success to distillation and/or better sampling of examples for training -- while still relying on the same backbone architecture. In the meantime, sparse representation learning fueled by traditional inverted indexing techniques has seen a growing interest, inheriting from desirable IR priors such as explicit lexical matching. While some architectural variants have been proposed, a lesser effort has been put in the training of such models. In this work, we build on SPLADE -- a sparse expansion-based retriever -- and show to which extent it is able to benefit from the same training improvements as dense models, by studying the effect of distillation, hard-negative mining as well as the Pre-trained Language Model initialization. We furthermore study the link between effectiveness and efficiency, on in-domain and zero-shot settings, leading to state-of-the-art results in both scenarios for sufficiently expressive models.
Language models generate texts by successively predicting probability distributions for next tokens given past ones. A growing field of interest tries to leverage external information in the decoding process so that the generated texts have desired properties, such as being more natural, non toxic, faithful, or having a specific writing style. A solution is to use a classifier at each generation step, resulting in a cooperative environment where the classifier guides the decoding of the language model distribution towards relevant texts for the task at hand. In this paper, we examine three families of (transformer-based) discriminators for this specific task of cooperative decoding: bidirectional, left-to-right and generative ones. We evaluate the pros and cons of these different types of discriminators for cooperative generation, exploring respective accuracy on classification tasks along with their impact on the resulting sample quality and computational performances. We also provide the code of a batched implementation of the powerful cooperative decoding strategy used for our experiments, the Monte Carlo Tree Search, working with each discriminator for Natural Language Generation.
Online travel platforms (OTPs), e.g., booking.com and Ctrip.com, deliver travel experiences to online users by providing travel-related products. One key problem facing OTPs is to predict users' future travel destinations, which has many important applications, e.g., proactively recommending flight tickets or hotels in the destination city. Although much progress has been made on next-POI recommendation, existing methods are largely sub-optimal for travel destination prediction on OTPs, due to the unique characteristics of users' travel behaviors, such as offline spatial-temporal periodicity and online multi-interest exploration. In this paper, we propose an online-offline periodicity-aware information gain network, OOPIN, for travel destination prediction on OTPs. The key components of the model are (1) an offline mobility pattern extractor, which extracts spatial-temporal periodicity along with the sequential dependencies from the visited city sequence; and (2) an online multi-interest exploration module that discovers, from users' online interaction data, destinations that they might be interested in but have not yet visited. Comprehensive experiments on a real-world OTP demonstrate the superior performance of the proposed model for travel destination prediction compared with state-of-the-art methods.
Long document re-ranking has been a challenging problem for neural re-rankers based on deep language models like BERT. Early work breaks the documents into short passage-like chunks. These chunks are independently mapped to scalar scores or latent vectors, which are then pooled into a final relevance score. These encode-and-pool methods however inevitably introduce an information bottleneck: the low dimension representations. In this paper, we propose instead to model full query-to-document interaction, leveraging the attention operation and modular Transformer re-ranker framework. First, document chunks are encoded independently with an encoder module. An interaction module then encodes the query and performs joint attention from the query to all document chunk representations. We demonstrate that the model can use this new degree of freedom to aggregate important information from the entire document. Our experiments show that this design produces effective re-ranking on two classical IR collections Robust04 and ClueWeb09, and a large-scale supervised collection MS-MARCO document ranking.
With the rapid increase of micro-video creators and viewers, how to make personalized recommendations from a large number of candidates to viewers begins to attract more and more attention. However, existing micro-video recommendation models rely on expensive multi-modal information and learn an overall interest embedding that cannot reflect the user's multiple interests in micro-videos. Recently, contrastive learning provides a new opportunity for refining the existing recommendation techniques. Therefore, in this paper, we propose to extract contrastive multi-interests and devise a micro-video recommendation model CMI. Specifically, CMI learns multiple interest embeddings for each user from his/her historical interaction sequence, in which the implicit orthogonal micro-video categories are used to decouple multiple user interests. Moreover, it establishes the contrastive multi-interest loss to improve the robustness of interest embeddings and the performance of recommendations. The results of experiments on two micro-video datasets demonstrate that CMI achieves state-of-the-art performance over existing baselines.
News recommendation is often modeled as a sequential recommendation task, assuming there are rich short-term dependencies over historical clicked news. However, users usually have strong preferences on the temporal diversity of news information and may not tend to click similar news successively, which is very different from many sequential recommendation scenarios such as e-commerce recommendation. In this paper, we study whether news recommendation can be regarded as a standard sequential recommendation problem. Through extensive experiments on two real-world datasets, we find it suboptimal to model news recommendation as a conventional sequential recommendation problem. To handle this issue, we further propose a temporal diversity-aware sequential news recommendation method that can promote candidate news that are diverse from recently clicked news to help predict future clicks more accurately. Experiments show that our method can empower various news recommendation methods.
The Information Retrieval (IR) community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity have enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit equally from one single dataset. Extensive research in various NLP tasks has shown that using domain-specific training data, as opposed to general-purpose data, improves the performance of neural models. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models finetuned solely on our synthetic datasets outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods. Code, models, and data are available at https://github.com/zetaalphavector/inpars.
We present an approach to identify argumentative questions among web search queries. Argumentative questions ask for reasons to support a certain stance on a controversial topic, such as "Should marijuana be legalized?" Controversial topics entail opposing stances, and hence can be supported or opposed by various arguments. Argumentative questions pose a challenge for search engines since they should be answered with both pro and con arguments in order to not bias a user toward a certain stance.
To further analyze the problem, we sampled questions about 19 controversial topics from a large Yandex search log and let human annotators label them as one of factual, method, or argumentative. The result is a collection of 39,340 labeled questions, 28% of which are argumentative, demonstrating the need to develop dedicated systems for this type of question. A comparative analysis of the three question types shows that asking for reasons and predictions are among the most important features of argumentative questions. To demonstrate the feasibility of the classification task, we developed a BERT-based classifier to map questions to the question types, reaching a promising macro-averaged F$_1$-score of 0.78.
Deep neural networks (DNNs) have been a key technique for click-through rate (CTR) estimation, yet existing DNN-based CTR models neglect the inconsistency between their optimization objectives (e.g., Binary Cross Entropy, BCE) and CTR ranking metrics (e.g., Area Under the ROC Curve, AUC). It is noteworthy that directly optimizing AUC by gradient-descent methods is difficult due to the non-differentiable Heaviside function built into AUC. To this end, we propose a smooth approximation of AUC, called smooth-AUC (SAUC), for rank-based CTR prediction. Specifically, SAUC relaxes the Heaviside function via a sigmoid with a temperature coefficient (which controls the sharpness of the function) in order to facilitate gradient-based optimization. Furthermore, SAUC is a plug-and-play objective that can be used in any DNN-based CTR model. Experimental results on two real-world datasets demonstrate that SAUC consistently improves the recommendation accuracy of current DNN-based CTR models.
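The relaxation described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pairwise formulation over all positive/negative pairs, and the choice to divide the score difference by the temperature are assumptions made for illustration only.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def _split(scores, labels):
    # Separate predicted scores of clicked (1) and non-clicked (0) items.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg


def exact_auc(scores, labels):
    # Pairwise AUC with a Heaviside step: 1 if the positive is ranked
    # above the negative, 0.5 on ties, 0 otherwise. Not differentiable.
    pos, neg = _split(scores, labels)
    total = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p in pos for n in neg)
    return total / (len(pos) * len(neg))


def smooth_auc(scores, labels, temperature=0.1):
    # Assumed relaxation: replace the step with sigmoid(diff / temperature).
    # A smaller temperature gives a sharper curve, closer to the step.
    pos, neg = _split(scores, labels)
    total = sum(sigmoid((p - n) / temperature) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(exact_auc(scores, labels))
print(smooth_auc(scores, labels, temperature=0.01))
```

In an actual training loop, one minus the smooth quantity would be minimized with an automatic differentiation framework; the pure-Python version here only shows how the smooth value approaches the exact AUC as the temperature shrinks.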
Personalized algorithms focusing uniquely on accuracy might provide highly relevant recommendations, but the recommended items could be too similar to current users' preferences. Therefore, recommenders might prevent users from exploring new products and brands (filter bubbles). This is especially critical for luxury fashion recommendations because luxury shoppers expect to discover exclusive and rare items. Thus, recommender systems for fashion need to consider diversity and elevate the shopping experience by recommending new brands and products from the catalog. In this work, we explored a handful of diversification strategies to rerank the output of a relevance-focused recommender system. Subsequently, we conducted a multi-objective offline experiment optimizing for relevance and diversity simultaneously. We measured diversity with commonly used metrics such as coverage, serendipity, and neighborhood distance, whereas, for relevance, we selected ranking metrics such as recall. The best diversification strategy offline improved user engagement by 2% in click-through rate and presented an uplift of 46% in distinct brands recommended when AB tested against real users. These results reinforced the importance of considering accuracy and diversity metrics when developing a recommender system.
Two-tower architecture is commonly used in real-world systems for Unbiased Learning to Rank (ULTR), where a Deep Neural Network (DNN) tower models unbiased relevance predictions, while another tower models observation biases inherent in the training data like user clicks. This two-tower architecture introduces inductive biases to allow more efficient use of limited observational logs and better generalization during deployment than a single-tower architecture that may learn spurious correlations between relevance predictions and biases. However, despite their popularity, it is largely neglected in the literature that existing two-tower models assume that the joint distribution of relevance prediction and observation probabilities is completely factorizable. In this work, we revisit two-tower models for ULTR. We rigorously show that the factorization assumption can be too strong for real-world user behaviors, and existing methods may easily fail under slightly milder assumptions. We then propose several novel ideas that consider a wider spectrum of user behaviors while still under the two-tower framework to maintain simplicity and generalizability. Our concerns about existing two-tower models and the effectiveness of our proposed methods are validated on both controlled synthetic and large-scale real-world datasets.
Count queries, such as "number of songs by John Lennon", are a challenging case in web search and question answering. Prior methods merely answer these with a single, and sometimes puzzling, number or return a ranked list of text snippets with different numbers. This paper proposes a methodology for answering count queries with inference, contextualization and explanatory evidence. Unlike previous systems, our method infers final answers from multiple observations, supports semantic qualifiers for the counts, and provides evidence by enumerating representative instances. Experiments with a wide variety of queries show the benefits of our method. To promote further research on this underexplored topic, we release an annotated dataset of 5k queries with 200k relevant text spans.
When users initiate search sessions, their queries are often ambiguous or lack context, which results in ineffective document ranking. Multiple approaches have been proposed by the Information Retrieval community to add context and retrieve documents aligned with users' intents. While some work focuses on query disambiguation using users' browsing history, a recent line of work proposes to interact with users by asking clarification questions and/or proposing clarification panels. However, these approaches rely on either a limited number (i.e., 1) of interactions with the user or log-based interactions. In this paper, we propose and evaluate a fully simulated query clarification framework allowing multi-turn interactions between IR systems and user agents.
Companies invest a substantial amount of time and resources in ensuring compliance with existing regulations, or pay in the form of fines when compliance cannot be proven in auditing procedures. The topic is not only relevant, but also highly complex, given the frequency of changes and amendments, the complexity of the cases and the difficulty of the juristic language. This paper aims at applying advanced extractive summarization to democratize the understanding of regulations, so that non-jurists can decide which regulations deserve further follow-up. To achieve that, we first create a corpus named EUR-LexSum containing 4595 curated European regulatory documents and their corresponding summaries. We then fine-tune transformer-based models which, applied to this corpus, yield a superior performance (in terms of ROUGE metrics) compared to a traditional extractive summarization baseline. Our experiments reveal that even with limited amounts of data such transformer-based models are effective in the field of legal document summarization.
Graph-based recommender systems (GBRSs) have achieved promising performance by incorporating the user-item bipartite graph using the Graph Neural Network (GNN). Among GBRSs, the information from each user and item's multi-hop neighbours is effectively conveyed between nodes through neighbourhood aggregation and message passing. Although effective, existing neighbourhood information aggregation and passing functions are usually computationally expensive. Motivated by the emerging contrastive learning technique, we design a simple neighbourhood construction method in conjunction with the contrastive objective function to simulate the neighbourhood information processing of GNN. In addition, we propose a simple algorithm based on Multilayer Perceptron (MLP) for learning user and item representations with extra non-linearity while lowering the computational burden compared with multi-layer GNNs. Our extensive empirical experiments on three public datasets demonstrate that our proposed model, i.e., MLP-CGRec, can reduce the GPU memory consumption and training time by up to 24.0% and 33.1%, respectively, without significantly degrading the recommendation accuracy in comparison with competitive baselines.
Spam is a serious problem plaguing web-scale digital platforms which facilitate user content creation and distribution. It compromises a platform's integrity, the performance of services like recommendation and search, and the overall business. Spammers engage in a variety of abusive and evasive behaviors that are distinct from those of non-spammers. Users' complex behavior can be well represented by a heterogeneous graph rich with node and edge attributes. Learning to identify spammers in such a graph for a web-scale platform is challenging because of its structural complexity and size. In this paper, we propose SEINE (Spam DEtection using Interaction NEtworks), a spam detection model over a novel graph framework. Our graph simultaneously captures rich users' details and behavior and enables learning on a billion-scale graph. Our model considers neighborhood along with edge types and attributes, allowing it to capture a wide range of spammers. SEINE, trained on a real dataset of tens of millions of nodes and billions of edges, achieves a high performance of 80% recall with 1% false positive rate. SEINE achieves comparable performance to the state-of-the-art techniques on a public dataset while being practical enough to be used in a large-scale production system.
Pre-trained language models have contributed significantly to relation extraction by demonstrating remarkable few-shot learning abilities. However, prompt tuning methods for relation extraction may still fail to generalize to rare or hard patterns. Note that the previous parametric learning paradigm can be viewed as memorization, regarding training data as a book and inference as a closed-book test. Long-tailed or hard patterns can hardly be memorized in parameters given few-shot instances. To this end, we regard RE as an open-book examination and propose a new semiparametric paradigm of retrieval-enhanced prompt tuning for relation extraction. We construct an open-book datastore for retrieval regarding prompt-based instance representations and corresponding relation labels as memorized key-value pairs. During inference, the model can infer relations by linearly interpolating the base output of PLM with the non-parametric nearest neighbor distribution over the datastore. In this way, our model not only infers relation through knowledge stored in the weights during training but also assists decision-making by unwinding and querying examples in the open-book datastore. Extensive experiments on benchmark datasets show that our method can achieve state-of-the-art performance in both standard supervised and few-shot settings.
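The inference-time interpolation mentioned in the abstract above can be sketched as follows. This is an illustrative toy, not the paper's code: the squared-L2 distance, the exponential kernel, the `temperature` and `lam` parameters, and the relation labels are all assumptions.

```python
import math


def knn_distribution(query, datastore, k=2, temperature=1.0):
    # datastore: list of (key_vector, relation_label) pairs, standing in for
    # prompt-based instance representations and their labels.
    dists = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), label)
        for key, label in datastore
    )[:k]
    # Closer neighbours get larger weight (assumed exponential kernel).
    weights = [math.exp(-d / temperature) for d, _ in dists]
    z = sum(weights)
    probs = {}
    for w, (_, label) in zip(weights, dists):
        probs[label] = probs.get(label, 0.0) + w / z
    return probs


def interpolate(p_model, p_knn, lam=0.5):
    # Linear interpolation of the parametric (PLM) distribution with the
    # non-parametric nearest-neighbour distribution.
    labels = set(p_model) | set(p_knn)
    return {r: lam * p_knn.get(r, 0.0) + (1.0 - lam) * p_model.get(r, 0.0)
            for r in labels}


# Hypothetical datastore and model output.
datastore = [((0.0, 0.0), "born_in"),
             ((1.0, 1.0), "works_for"),
             ((0.1, 0.0), "born_in")]
p_model = {"born_in": 0.4, "works_for": 0.6}
p_knn = knn_distribution((0.0, 0.0), datastore, k=2)
p_final = interpolate(p_model, p_knn, lam=0.5)
print(max(p_final, key=p_final.get))
```

In this toy case the base model prefers one relation, but the two nearest stored examples both carry the other label, so the interpolated distribution flips the decision.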
Distant supervision (DS) has been a prevalent approach to generating labeled data for information extraction (IE) tasks. However, DS often suffers from noisy label problems, where the labels are extracted from the knowledge base (KB), regardless of the input context. Many efforts have been devoted to designing denoising mechanisms. However, most strategies are only designed for one specific task and cannot be directly adapted to other tasks. We propose a general paradigm (Dasiera) to resolve issues in KB-based DS. Labels from KB can be viewed as universal labels of a target entity or an entity pair, while the given context for an IE task may contain only partial or no information about the target entities, or the entailed information may be vague. Hence, a mismatch between the given context and KB labels, i.e., the given context has insufficient information to infer DS labels, can happen in IE training datasets. To solve the problem, during training, Dasiera leverages a retrieval-augmentation mechanism to complete missing information of the given context, where we seamlessly integrate a neural retriever and a general predictor in an end-to-end framework. During inference, we can keep or remove the retrieval component based on whether we want to predict solely on the given context. We have evaluated Dasiera on two IE tasks under the DS setting: named entity typing and relation extraction. Experimental results show Dasiera's superiority to other baselines in both tasks.
We present a novel semantic context prior-based venue recommendation system that uses only the title and the abstract of a paper. Based on the intuition that the text in the title and abstract has both semantic and syntactic components, we demonstrate that a joint training of a semantic feature extractor and syntactic feature extractor collaboratively leverages meaningful information that helps to provide venues for papers. The proposed methodology, which we call DeSCoVeR, first elicits these semantic and syntactic features using a Neural Topic Model and text classifier respectively. The model then executes a transfer learning optimization procedure to perform a contextual transfer between the feature distributions of the Neural Topic Model and the text classifier during the training phase. DeSCoVeR also mitigates the document-level label bias using a Causal back-door path criterion and a sentence-level keyword bias removal technique. Experiments on the DBLP dataset show that DeSCoVeR outperforms the state-of-the-art methods.
Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval
We show that supervised neural information retrieval (IR) models are prone to learning sparse attention patterns over passage tokens, which can result in key phrases including named entities receiving low attention weights, eventually leading to model under-performance. Using a novel targeted synthetic data generation method that identifies poorly attended entities and conditions the generation episodes on those, we teach neural IR to attend more uniformly and robustly to all entities in a given passage. On two public IR benchmarks, we empirically show that the proposed method helps improve both the model's attention patterns and retrieval performance, including in zero-shot settings.
In recent decades, the growing scale of scientific research has led to numerous novel findings. Reproducing these findings is the foundation of future research. However, due to the complexity of experiments, manually assessing scientific research is laborious and time-intensive, especially in social and behavioral sciences. Although reproducibility studies have garnered increased attention in the research community, there is still a lack of systematic ways for evaluating scientific research at scale. In this paper, we propose a novel approach towards automatically assessing scientific publications by constructing a knowledge graph (KG) that captures a holistic view of the research contributions. Specifically, during the KG construction, we combine information from two different perspectives: micro-level features that capture knowledge from published articles such as sample sizes, effect sizes, and experimental models, and macro-level features that comprise relationships between entities such as authorship and reference information. We then learn low-dimensional representations using language models and knowledge graph embeddings for entities (nodes in KGs), which are further used for the assessments. A comprehensive set of experiments on two benchmark datasets shows the usefulness of leveraging KGs for scoring scientific research.
Promoting diversity in ranking while maintaining the relevance of ranked results is critical for enhancing human-centered search systems. While existing ranking algorithms and diversity IR metrics provide a solid basis for evaluating and improving search result diversification in offline experiments, they miss possible divergences and temporal changes in users' levels of Diversity Acceptance, which in this work refers to the extent to which users actually prefer to interact with topically diversified search results. To address this gap between offline evaluations and users' expectations, we proposed an intuitive diversity acceptance measure and ran experiments for diversity acceptance prediction and diversity-aware re-ranking based on datasets from both controlled lab and naturalistic settings. Our results demonstrate that: 1) user diversity acceptance changes across different query segments and session contexts, and can be predicted from search interaction signals; 2) our diversity-aware re-ranking algorithm utilizing predicted diversity acceptance and estimated relevance labels can effectively minimize the gap between diversity acceptance and result diversity, while maintaining SERP relevance levels. Our research presents an initial attempt at balancing user needs, result diversity, and SERP relevance in sessions and highlights the importance of studying diversity acceptance in promoting effective result diversification.
Automatically finding contradictions in text is a fundamental yet under-studied problem in natural language understanding and information retrieval. Recently, topology, a branch of mathematics concerned with the properties of geometric shapes, has been shown to be useful for understanding the semantics of text. This study presents a topological approach to enhancing deep learning models in detecting contradictions in text. In addition, in order to better understand contradictions, we propose a classification with six types of contradictions. Following that, the topologically enhanced models are evaluated with different contradiction types, as well as different text genres. Overall we have demonstrated the usefulness of topological features in finding contradictions, especially the more latent and more complex contradictions in text.
While neural rankers continue to show notable performance improvements over a wide variety of information retrieval tasks, there have been recent studies that show such rankers may intensify certain stereotypical biases. In this paper, we investigate whether neural rankers introduce retrieval effectiveness (performance) disparities over queries related to different genders. We specifically study whether there are significant performance differences between male and female queries when retrieved by neural rankers. Through our empirical study over the MS MARCO collection, we find that such performance disparities are notable and that the performance disparities may be due to the difference between how queries and their relevant judgements are collected and distributed for different gendered queries. More specifically, we observe that male queries are more closely associated with their relevant documents compared to female queries and hence neural rankers are able to more easily learn associations between male queries and their relevant documents. We show that it is possible to systematically balance relevance judgment collections in order to reduce performance disparity between different gendered queries without negatively compromising overall model performance.
Deep neural networks are widely used for text pair classification tasks such as ad hoc information retrieval. These deep neural networks are not inherently interpretable and require additional effort to obtain the rationale behind their decisions. Existing explanation models are not yet capable of inducing alignments between the query terms and the document terms -- which part of the document rationales are responsible for which part of the query? In this paper, we study how input perturbations can be used to infer or evaluate alignments between the query and document spans, which best explain the black-box ranker's relevance prediction. We use different perturbation strategies and accordingly propose a set of metrics to evaluate the faithfulness of alignment rationales to the model. Our experiments show that the defined metrics based on substitution-based perturbation are more successful in preferring higher-quality alignments, compared to the deletion-based metrics.
Current pre-trained language model approaches to information retrieval can be broadly divided into two categories: sparse retrievers (to which belong also non-neural approaches such as bag-of-words methods, e.g., BM25) and dense retrievers. Each of these categories appears to capture different characteristics of relevance. Previous work has investigated how relevance signals from sparse retrievers could be combined with those from dense retrievers via interpolation. Such interpolation would generally lead to higher retrieval effectiveness.
In this paper we consider the problem of combining the relevance signals from sparse and dense retrievers in the context of Pseudo Relevance Feedback (PRF). This context poses two key challenges: (1) When should interpolation occur: before, after, or both before and after the PRF process? (2) Which sparse representation should be considered: a zero-shot bag-of-words model (BM25), or a learned sparse representation? To answer these questions we perform a thorough empirical evaluation considering an effective and scalable neural PRF approach (Vector-PRF), three effective dense retrievers (ANCE, TCTv2, DistillBERT), and one state-of-the-art learned sparse retriever (uniCOIL). The empirical findings from our experiments suggest that, regardless of sparse representation and dense retriever, interpolation both before and after PRF achieves the highest effectiveness across most datasets and metrics.
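The question of where interpolation sits relative to PRF can be made concrete with a small sketch. Everything here is assumed for illustration: the weighted-sum interpolation with weight `alpha`, the averaging form of the PRF step, the `pre`/`post` flags, and the toy vectors and sparse scores. Real systems would also normalize sparse and dense scores before mixing them.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank(scores):
    # Document ids sorted by descending score.
    return sorted(scores, key=scores.get, reverse=True)


def interpolate(sparse, dense, alpha):
    # Assumed linear interpolation of sparse and dense relevance scores.
    return {d: alpha * sparse[d] + (1.0 - alpha) * dense[d] for d in dense}


def vector_prf(query_vec, doc_vecs, ranking, k):
    # Assumed PRF step: average the query vector with the top-k feedback
    # document vectors to form a new query vector.
    vecs = [query_vec] + [doc_vecs[d] for d in ranking[:k]]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def prf_with_interpolation(query_vec, doc_vecs, sparse,
                           alpha=0.5, k=1, pre=True, post=True):
    dense = {d: dot(query_vec, v) for d, v in doc_vecs.items()}
    # Interpolation before PRF changes which documents are used as feedback.
    first = interpolate(sparse, dense, alpha) if pre else dense
    new_query = vector_prf(query_vec, doc_vecs, rank(first), k)
    dense_prf = {d: dot(new_query, v) for d, v in doc_vecs.items()}
    # Interpolation after PRF changes the final ranking scores.
    final = interpolate(sparse, dense_prf, alpha) if post else dense_prf
    return rank(final)


doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
sparse = {"d1": 0.0, "d2": 2.0, "d3": 1.0}  # hypothetical BM25-like scores
print(prf_with_interpolation([1.0, 0.0], doc_vecs, sparse, pre=False, post=False))
print(prf_with_interpolation([1.0, 0.0], doc_vecs, sparse, pre=True, post=True))
```

The two calls show that the same query can end up with different feedback documents and a different final ranking depending on whether the sparse signal is mixed in before and after the feedback step.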
How can we recommend content for a brand agent to use over a series of rounds so as to gain new subscribers to its social network page? The Influence Maximization (IM) problem seeks a set of k users, and its content-aware variants seek a set of k post features, that achieve, in both cases, an objective of expected influence in a social network. However, apart from raw influence, it is also relevant to study gain in subscribers, as long-term success rests on the subscribers of a brand page; classic IM may select k users from the subscriber set, and content-aware IM starts the post's propagation from that subscriber set. In this paper, we propose a novel content recommendation policy to a brand agent for Gaining Subscribers by Messaging (GSM) over many rounds. In each round, the brand agent messages a fixed number of social network users and invites them to visit the brand page aiming to gain their subscription, while its most recently published content consists of features that intensely attract the preferences of the invited users. To solve GSM, we find, in each round, which content features to publish and which users to notify aiming to maximize the cumulative subscription gain over all rounds. We deploy three GSM solvers, named \sR, \sSC, and \sSU, and we experimentally evaluate their performance based on VKontakte (VK) posts by considering different user sets and feature sets. Our experimental results show that \sSU provides the best solution, as it is significantly more efficient than \sSC with a minor loss of efficacy and clearly more efficacious than \sR with competitive efficiency.
Pretrained language models have improved effectiveness on numerous tasks, including ad-hoc retrieval. Recent work has shown that continuing to pretrain a language model with auxiliary objectives before fine-tuning on the retrieval task can further improve retrieval effectiveness. Unlike monolingual retrieval, designing an appropriate auxiliary task for cross-language mappings is challenging. To address this challenge, we use comparable Wikipedia articles in different languages to further pretrain off-the-shelf multilingual pretrained models before fine-tuning on the retrieval task. We show that our approach yields improvements in retrieval effectiveness.
In this paper, we study semi-supervised text classification (SSTC) by exploring both labeled and extra unlabeled data. One of the most popular SSTC techniques is pseudo-labeling, which assigns pseudo labels to unlabeled data via a teacher classifier trained on labeled data. The pseudo-labeled data are then used to train a student classifier. However, when the pseudo labels are inaccurate, the student classifier will learn from inaccurate data and perform even worse than the teacher. To mitigate this issue, we propose a simple yet efficient pseudo-labeling framework called Dual Pseudo Supervision (DPS), which exploits the feedback signal from the student to guide the teacher to generate better pseudo labels. In particular, we alternately update the student based on the pseudo-labeled data annotated by the teacher and optimize the teacher based on the student's performance via meta learning. In addition, we also design a consistency regularization term to further improve the stability of the teacher. With the above two strategies, the learned reliable teacher can provide more accurate pseudo-labels to the student and thus improve the overall performance of text classification. We conduct extensive experiments on three benchmark datasets (i.e., AG News, Yelp and Yahoo) to verify the effectiveness of our DPS method. Experimental results show that our approach achieves substantially better performance than the strong competitors. For reproducibility, we will release the code and data of this paper publicly at https://github.com/GRIT621/DPS.
The importance of entity retrieval, the task of retrieving a ranked list of related entities from big knowledge bases given a textual query, has been widely acknowledged in the literature. In this paper, we propose a novel entity retrieval method that addresses the important challenge that revolves around the need to effectively represent and model context in which entities relate to each other. Based on our proposed method, a model is firstly trained to retrieve and prune a subgraph of a textual knowledge graph that represents contextual relationships between entities. Secondly, a deep model is introduced to reason over the textual content of nodes, edges, and the given question and score and rank entities in the subgraph. We show experimentally that our approach outperforms state-of-the-art methods on a number of benchmarks for entity retrieval.
Mitigating the Filter Bubble While Maintaining Relevance: Targeted Diversification with VAE-based Recommender Systems
Online recommendation systems are prone to create filter bubbles, whereby users are only recommended content narrowly aligned with their historical interests. In the case of media recommendation, this can reinforce political polarization by recommending topical content (e.g., on the economy) at one extreme end of the political spectrum even though this topic has broad coverage from multiple political viewpoints that would provide a more balanced and informed perspective for the user. Historically, Maximal Marginal Relevance (MMR) has been used to diversify result lists and even mitigate filter bubbles, but suffers from three key drawbacks: (1) MMR directly sacrifices relevance for diversity, (2) MMR typically diversifies across all content and not just targeted dimensions (e.g., political polarization), and (3) MMR is inefficient in practice due to the need to compute pairwise similarities between recommended items. To simultaneously address these limitations, we propose a novel methodology that trains Concept Activation Vectors (CAVs) for targeted topical dimensions (e.g., political polarization). We then modulate the latent embeddings of user preferences in a state-of-the-art VAE-based recommender system to diversify along the targeted dimension while preserving topical relevance across orthogonal dimensions. Our experiments show that our Targeted Diversification VAE-based Collaborative Filtering (TD-VAE-CF) methodology better preserves relevance of content to user preferences across a range of diversification levels in comparison to both untargeted and targeted variations of Maximal Marginal Relevance (MMR); TD-VAE-CF is also much more computationally efficient than the post-hoc re-ranking approach of MMR.
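For readers unfamiliar with the MMR baseline that the abstract above criticizes, a minimal greedy sketch follows. It shows the standard trade-off form (relevance minus redundancy, weighted by `lam`) and makes the pairwise similarity computations visible; the function names, the toy relevance scores, and the similarity table are assumptions, and this is the baseline rather than the proposed TD-VAE-CF method.

```python
def mmr(relevance, similarity, k, lam=0.5):
    # relevance: dict item -> relevance score.
    # similarity: function (item, item) -> similarity in [0, 1].
    # Greedy re-ranking: each step picks the item maximizing
    # lam * relevance - (1 - lam) * max similarity to already-selected items.
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(item):
            redundancy = max((similarity(item, s) for s in selected),
                             default=0.0)
            return lam * relevance[item] - (1.0 - lam) * redundancy
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical items: "a" and "b" are near-duplicates, "c" is different.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
pair_sim = {frozenset(("a", "b")): 0.95,
            frozenset(("a", "c")): 0.1,
            frozenset(("b", "c")): 0.1}


def similarity(x, y):
    return pair_sim[frozenset((x, y))]


print(mmr(relevance, similarity, k=2, lam=0.5))
print(mmr(relevance, similarity, k=2, lam=1.0))
```

With `lam=0.5` the less relevant but dissimilar item displaces the near-duplicate, which illustrates drawback (1); the inner loop over selected items illustrates the pairwise similarity cost in drawback (3).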
Mitigating Bias in Search Results Through Contextual Document Reranking and Neutrality Regularization
Societal biases can influence Information Retrieval system results, and conversely, search results can potentially reinforce existing societal biases. Recent research has therefore focused on developing methods for quantifying and mitigating bias in search results and applied them to contemporary retrieval systems that leverage transformer-based language models. In the present work, we expand this direction of research by considering bias mitigation within a framework for contextual document embedding reranking. In this framework, the transformer-based query encoder is optimized for relevance ranking through a list-wise objective, by jointly scoring for the same query a large set of candidate document embeddings in the context of one another, instead of in isolation. At the same time, we impose a regularization loss which penalizes highly scoring documents that deviate from neutrality with respect to a protected attribute (e.g., gender). Our approach for bias mitigation is end-to-end differentiable and efficient. Compared to the existing alternatives for deep neural retrieval architectures, which are based on adversarial training, we demonstrate that it can attain much stronger bias mitigation/fairness. At the same time, for the same amount of bias mitigation, it offers significantly better relevance performance (utility). Crucially, our method allows for a more finely controllable and predictable intensity of bias mitigation, which is essential for practical deployment in production systems.
In recent years, fairness in information retrieval (IR) systems has received increasing research attention. While data-driven ranking models achieve significant improvements over traditional methods, the dataset used to train such models is usually biased, which causes unfairness in the ranking models. For example, an imbalanced dataset collected for expert search usually leads to systematic discrimination against specific demographic groups defined by attributes such as race or gender, which further reduces the exposure of the minority group. To solve this problem, we propose a Meta-learning based Fair Ranking (MFR) model that could alleviate the data bias for protected groups through an automatically-weighted loss. Specifically, we adopt a meta-learning framework to explicitly train a meta-learner from an unbiased sampled dataset (meta-dataset), and simultaneously, train a listwise learning-to-rank (LTR) model on the whole (biased) dataset governed by "fair" loss weights. The meta-learner serves as a weighting function to make the ranking loss attend more to the minority group. To update the parameters of the weighting function and the ranking model, we formulate the proposed MFR as a bilevel optimization problem and solve it using the gradients through gradients. Experimental results on several real-world datasets demonstrate that the proposed method achieves a comparable ranking performance and significantly improves the fairness metric compared with state-of-the-art methods.
Any given information need can be expressed via a wide range of possible queries. Recent work with such query variations has demonstrated that different queries can fetch notably divergent sets of documents, even when the queries have identical intents and superficial similarity. That is, different users might receive SERPs of quite different effectiveness for the same information need. That observation then raises an interesting question: do users have a sense of how useful any given query will be? Can they anticipate the effectiveness of alternative queries for the same retrieval need? To explore that question we designed and carried out a crowd-sourced user study in which we asked subjects to consider an information need statement expressed as a backstory, and then provide their opinions as to the relative usefulness of a set of queries ostensibly addressing that objective. We solicited opinions using two different interfaces: one that collected absolute ratings of queries, and one that required that the subjects place a set of queries into "order". We found that crowd workers are reasonably consistent in their estimates of how effective queries are likely to be, and also that their estimates correlate positively with actual system performance.
Sequential recommendation is often considered as a generative task, i.e., training a sequential encoder to generate the next item of a user's interests based on her historical interacted items. Despite their prevalence, these methods usually require training with more meaningful samples to be effective, which otherwise will lead to a poorly trained model. In this work, we propose to train the sequential recommenders as discriminators rather than generators. Instead of predicting the next item, our method trains a discriminator to distinguish if a sampled item is a 'real' target item or not. A generator, as an auxiliary model, is trained jointly with the discriminator to sample plausible alternative next items and will be thrown out after training. The trained discriminator is considered as the final SR model and denoted as \modelname. Experiments conducted on four datasets demonstrate the effectiveness and efficiency of the proposed approach.
Explainable Session-based Recommendation with Meta-path Guided Instances and Self-Attention Mechanism
Session-based recommendation (SR) is gaining popularity because it greatly helps preserve users' privacy. Aside from its efficacy, explainability is also critical for developing a successful SR model, since it can improve the persuasiveness of the results, the users' satisfaction, and the debugging efficiency. However, the majority of current SR models are unexplainable, and even those that claim to be interpretable cannot provide clear and convincing explanations of users' intentions and how they influence the models' decisions. To solve this problem, we propose a meta-path guided model which uses path instances to capture item dependencies, explicitly reveal the underlying motives, and illustrate the entire reasoning process. To begin with, our model explores meta-path guided instances and leverages the multi-head self-attention mechanism to disclose the hidden motivations beneath these path instances. To comprehensively model the user interest and interest shifting, we search paths in both adjacent and non-adjacent items. Then, we update item representations by incorporating the user-item interactions and meta-path-based context sequentially. Compared with recent strong baselines, our method is competitive with SOTA performance on three datasets while providing sound and clear explanations.
News representation is critical for news recommendation. Most existing methods learn news representations only from news texts while ignoring the visual information of news. In fact, users may click news not only out of interest in news titles but also because they are attracted by news images. Thus, images are useful for representing news and predicting news clicks. Pretrained visiolinguistic models are powerful in multi-modal understanding and can represent news from both textual and visual contents. In this paper, we propose a multimodal news recommendation method that can incorporate both textual and visual information of news to learn multimodal news representations. We first extract regions of interest (ROIs) from news images via object detection. We then use a pre-trained visiolinguistic model to encode both news texts and image ROIs and model their inherent relatedness using co-attentional Transformers. In addition, we propose a crossmodal candidate-aware attention network to select relevant historical clicked news for the accurate modeling of user interest in candidate news. Experiments validate that incorporating multimodal news information can effectively improve the performance of news recommendation.
The cold-start problem has been a long-standing issue in recommendation. Embedding-based recommendation models provide recommendations by learning embeddings for each user and item from historical interactions. Therefore, such embedding-based models perform badly for cold items that do not appear in the training set. The most common solution is to generate the cold embedding for a cold item from its content features. However, cold embeddings generated from content follow a different distribution from warm embeddings, which are learned from historical interactions. As a result, current cold-start methods face a seesaw phenomenon: they improve the recommendation of either the cold items or the warm items but hurt the other. To this end, we propose a general framework named Generative Adversarial Recommendation (GAR). By training the generator and the recommender adversarially, the generated cold item embeddings can have a distribution similar to that of the warm embeddings and can even fool the recommender. Simultaneously, the recommender is fine-tuned to correctly rank the "fake" warm embeddings and the real warm embeddings. Consequently, the recommendation of warm items and cold items will not influence each other, thus avoiding the seesaw phenomenon. Additionally, GAR can be applied to any off-the-shelf recommendation model. Experiments on two datasets show that GAR has strong overall recommendation performance in cold-starting both the CF-based model (improved by over 30.18%) and the GNN-based model (improved by over 17.78%).
Given a query, neural retrieval models predict point estimates of relevance for each document; however, a significant drawback of relying solely on point estimates is that they contain no indication of the model's confidence in its predictions. Despite this lack of information, downstream methods such as reranking, cutoff prediction, and none-of-the-above classification are still able to learn effective functions to accomplish their respective tasks. Unfortunately, these downstream methods can suffer poor performance when the initial ranking model loses confidence in its score predictions. This becomes increasingly important in high-stakes settings, such as medical searches that can influence health decision making.
Recent work has resolved this lack of information by introducing Bayesian uncertainty to capture the possible distribution of a document score. This paper presents the use of this uncertainty information as an indicator of how well downstream methods will function over a ranklist. We highlight a significant bias against certain disease-related queries within the posterior distribution of a neural model, and show that this bias in a model's predictive distribution propagates to downstream methods. Finally, we introduce a multi-distribution uncertainty metric, confidence decay, as a valid way of partially identifying these failure cases in an offline setting without the need of any user feedback.
Video search has become the main routine for users to discover videos relevant to a text query on large short-video sharing platforms. While training a query-video bi-encoder model using online search logs, we identify a modality bias phenomenon: the video encoder relies almost entirely on text matching, neglecting other modalities of the videos such as vision and audio. This modality imbalance results from a) modality gap: the relevance between a query and a video text is much easier to learn as the query is also a piece of text, with the same modality as the video text; b) data bias: most training samples can be solved solely by text matching. Here we share our practices to improve the first retrieval stage, including our solution for the modality imbalance issue. We propose Modality Balanced Video Retrieval, with two key components: manually generated modality-shuffled (MS) samples and a dynamic margin (DM) based on visual relevance. They encourage the video encoder to pay balanced attention to each modality. Through extensive experiments on a real-world dataset, we show empirically that our method is both effective and efficient in solving the modality bias problem. We have also deployed our method in a large video platform and observed a statistically significant boost over a highly optimized baseline in an A/B test and manual GSB evaluations.
The effective fusion of multiple modalities (i.e., text, acoustic, and visual) is a non-trivial task, as these modalities often carry specific and diverse information and do not contribute equally. The fusion of different modalities could even be more challenging under the low-resource setting, where we have fewer samples for training. This paper proposes a multi-representative fusion mechanism that generates diverse fusions with multiple modalities and then chooses the best fusion among them. To achieve this, we first apply convolution filters on multimodal inputs to generate different and diverse representations of modalities. We then fuse pairwise modalities with multiple representations to get the multiple fusions. Finally, we propose an attention mechanism that only selects the most appropriate fusion, which eventually helps resolve the noise problem by ignoring the noisy fusions. We evaluate our proposed approach on three low-resource multimodal sentiment analysis datasets, i.e., YouTube, MOUD, and ICT-MMMO. Experimental results show the effectiveness of our proposed approach with the accuracies of 59.3%, 83.0%, and 84.1% for the YouTube, MOUD, and ICT-MMMO datasets, respectively.
Query-Focused Summarization (QFS) is a task that aims to extract essential information from a long document and organize it into a summary that can answer a query. Recently, Transformer-based summarization models have been widely used in QFS. However, the simple Transformer architecture cannot utilize the relationships between distant words and information from a query directly. In this study, we propose the QSG Transformer, a novel QFS model that leverages structure information on Query-attentive Semantic Graph (QSG) to address these issues. Specifically, in the QSG Transformer, QSG node representation is improved by a proposed query-attentive graph attention network, which spreads the information of the query node into QSG using Personalized PageRank, and it is used to generate a summary that better reflects the information from the relationships of a query and document. The proposed method is evaluated on two QFS datasets, and it achieves superior performances over the state-of-the-art models.
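The abstract above mentions spreading query-node information over the graph with Personalized PageRank. As a rough illustration only (not the QSG Transformer's implementation), the following sketch runs a plain power iteration with restart at a single source node. The function name, the adjacency-dict representation, the restart probability of 0.15, and the toy graph are all assumptions made for this example.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power iteration for Personalized PageRank restarting at `source`.

    adj:    dict mapping every node to a list of its out-neighbours
            (each neighbour must also be a key of adj).
    alpha:  restart probability (0.15 is an illustrative default).
    """
    nodes = list(adj)
    scores = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            out = adj[n]
            if out:
                # Spread the non-restart mass evenly over out-neighbours.
                share = (1.0 - alpha) * scores[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the source.
                nxt[source] += (1.0 - alpha) * scores[n]
        # Restart mass (total score mass is 1, so this is alpha).
        nxt[source] += alpha
        scores = nxt
    return scores


# Hypothetical toy graph: a query node linked to a few word nodes.
graph = {
    "query": ["w1", "w2"],
    "w1": ["w2"],
    "w2": ["query"],
    "w3": ["query"],
}
ppr = personalized_pagerank(graph, "query")
```

Nodes that are close to the source accumulate more score, which is the property a query-attentive mechanism can use to weight graph nodes by their proximity to the query.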
Embedding & MLP has become a paradigm for modern large-scale recommendation systems. However, this paradigm suffers from the cold-start problem, which seriously compromises the ecological health of recommendation systems. This paper attempts to tackle the item cold-start problem by generating enhanced warmed-up ID embeddings for cold items with historical data and limited interaction records. From the perspective of industrial practice, we mainly focus on the following three points of item cold-start: 1) How to conduct cold-start without additional data requirements and make the strategy easy to deploy in online recommendation scenarios. 2) How to leverage both historical records and constantly emerging interaction data of new items. 3) How to model the relationship between item ID and side information stably from interaction data. To address these problems, we propose a model-agnostic Conditional Variational Autoencoder based Recommendation (CVAR) framework, whose advantages include compatibility with various backbones, no extra data requirements, and utilization of both historical data and recently emerging interactions. CVAR uses latent variables to learn a distribution over item side information and generates desirable item ID embeddings using a conditional decoder. The proposed method is evaluated by extensive offline experiments on public datasets and online A/B tests on the Tencent News recommendation platform, which further illustrate the advantages and robustness of CVAR.
Programming has become an important skill for individuals nowadays. As demand for improving personal programming skill grows, tracking programming skill proficiency is becoming more and more important. However, few researchers pay attention to measuring the programming skill of learners. Most existing studies on learner capability portraits only make use of the exercise results, while the rich behavioral information contained in the programming exercise process remains unused. Therefore, we propose a model named Programming Skill Tracing (PST) that measures skill proficiency during the programming exercise process. We design a Code Information Graph (CIG) to represent the features of learners' solution code, and a Code Tracing Graph (CTG) to measure the changes between adjacent submissions. Furthermore, we divide programming skill into programming knowledge and coding ability to obtain a more fine-grained assessment. Finally, we conduct various experiments to verify the effectiveness and interpretability of our PST model.
Interactive Recommender Systems (IRSs) have attracted a lot of attention, due to their ability to model interactive processes between users and recommender systems. Numerous approaches have adopted Reinforcement Learning (RL) algorithms, as these can directly maximize users' cumulative rewards. In IRS, researchers commonly utilize publicly available review datasets to compare and evaluate algorithms. However, user feedback provided in public datasets merely includes instant responses (e.g., a rating), with no inclusion of delayed responses (e.g., the dwell time and the lifetime value). Thus, the question remains whether these review datasets are an appropriate choice to evaluate the long-term effects in IRS. In this work, we revisited experiments on IRS with review datasets and compared RL-based models with a simple reward model that greedily recommends the item with the highest one-step reward. Following extensive analysis, we can reveal three main findings: First, a simple greedy reward model consistently outperforms RL-based models in maximizing cumulative rewards. Second, applying higher weighting to long-term rewards leads to degradation of recommendation performance. Third, user feedbacks have mere long-term effects in the benchmark datasets. Based on our findings, we conclude that a dataset has to be carefully verified and that a simple greedy baseline should be included for a proper evaluation of RL-based IRS approaches. Our code and dataset are available at https://github.com/dojeon-ai/irs_validation.
Next Point-of-Interest Recommendation with Auto-Correlation Enhanced Multi-Modal Transformer Network
Next Point-of-Interest (POI) recommendation is a pivotal issue for researchers in the field of location-based social networks. While many recent efforts show the effectiveness of recurrent neural network-based next POI recommendation algorithms, several important challenges have not been well addressed yet: (i) The majority of previous models only consider the dependence of consecutive visits, while ignoring the intricate dependencies of POIs in traces; (ii) The hierarchical nature of POI sequences and the matching of sub-sequences are hardly modeled in prior methods; (iii) Most of the existing solutions neglect the interactions between the two modalities of POI and the density category. To tackle the above challenges, we propose an auto-correlation enhanced multi-modal Transformer network (AutoMTN) for next POI recommendation. In particular, AutoMTN uses the Transformer network to explicitly exploit connections of all the POIs along the trace. Besides, to discover the dependencies at the sub-sequence level and attend to cross-modal interactions between POI and category sequences, we replace self-attention in Transformer with the auto-correlation mechanism and design a multi-modal network. Experimental results on two real-world datasets demonstrate the superiority of AutoMTN over state-of-the-art methods in next POI recommendation.
Recent studies of extractive text summarization have leveraged BERT for document encoding with breakthrough performance. However, when using a pre-trained BERT-based encoder, existing approaches for selecting representative sentences for text summarization are inadequate since the encoder is not explicitly trained for representing sentences. Simply providing the BERT-initialized sentences to cross-sentential graph-based neural networks (GNNs) to encode semantic features of the sentences is not ideal because doing so fails to integrate other summary-worthy features like sentence importance and positions. This paper presents MuchSUM, a better approach for extractive text summarization. MuchSUM is a multi-channel graph convolutional network designed to explicitly incorporate multiple salient summary-worthy features. Specifically, we introduce three specific graph channels to encode the node textual features, node centrality features, and node position features, respectively, under bipartite word-sentence heterogeneous graphs. Then, a cross-channel convolution operation is designed to distill the common graph representations shared by different channels. Finally, the sentence representations of each channel are fused for extractive summarization. We also investigate three weighted graphs in each channel to infuse edge features for graph-based summarization modeling. Experimental results demonstrate that our model can achieve considerable performance compared with some BERT-initialized graph-based extractive summarization systems.
Most existing recommendation models learn vectorized representations for items, i.e., item embeddings to make predictions. Item embeddings inherit popularity bias from the data, which leads to biased recommendations. We use this observation to design two simple and effective strategies, which can be flexibly plugged into different backbone recommendation models, to learn popularity neutral item representations. One strategy isolates popularity bias in one embedding direction and neutralizes the popularity direction post-training. The other strategy encourages all embedding directions to be disentangled and popularity neutral. We demonstrate that the proposed strategies outperform state-of-the-art debiasing methods on various real-world datasets, and improve recommendation quality of shallow and deep backbone models.
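To make the first strategy above more concrete, the sketch below removes the component of an item embedding that lies along a given popularity direction, using an ordinary orthogonal projection. This is an illustration under stated assumptions rather than the authors' procedure: the abstract does not say how the popularity direction is obtained or exactly how it is neutralized, so the direction is simply passed in, and the function name `neutralize` is hypothetical.

```python
def neutralize(embedding, direction):
    """Remove the component of `embedding` along `direction`.

    Both arguments are equal-length lists of floats. `direction` stands in
    for a popularity direction that is assumed to be known already; it does
    not need to be unit length.
    """
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    # Scalar projection of the embedding onto the popularity direction.
    coeff = sum(e * u for e, u in zip(embedding, unit))
    # Subtract the projected component; the result is orthogonal to `direction`.
    return [e - coeff * u for e, u in zip(embedding, unit)]


# Toy example with made-up numbers.
item = [3.0, 4.0, 5.0]
popularity_axis = [1.0, 0.0, 0.0]
neutral_item = neutralize(item, popularity_axis)
```

After this step, the dot product between the neutralized embedding and the popularity direction is zero, so a dot-product scorer can no longer reward an item for its position along that direction.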
Recently, supervised abstractive summarization using high-resource datasets, such as CNN/DailyMail and XSum, has achieved significant performance improvements. However, most existing high-resource datasets are biased towards a specific domain like news, and annotating document-summary pairs for low-resource datasets is too expensive. Furthermore, the need for low-resource abstractive summarization is emerging, but existing methods for the task, such as transfer learning, still have domain shifting and overfitting problems. To address these problems, we propose a new framework for low-resource abstractive summarization using a meta-learning algorithm that can quickly adapt to a new domain using small data. For adaptive meta-learning, we introduce a lightweight module inserted into the attention mechanism of a pre-trained language model; the module is first meta-learned with high-resource task-related datasets and then fine-tuned with the low-resource target dataset. We evaluate our model on 11 different datasets. Experimental results show that the proposed method achieves the state of the art on 9 datasets in low-resource abstractive summarization.
Current bundle generation studies focus on generating a combination of items to improve user experience. In real-world applications, there is also a great need to produce bundle creatives that consist of mixed types of objects (e.g., items, slogans and templates) to achieve a better promotion effect. We study a new problem named bundle creative generation: for given users, the goal is to generate personalized bundle creatives that the users will be interested in. To take both quality and efficiency into account, we propose a contrastive non-autoregressive model that captures user preferences with an ingenious decoding objective. Experiments on large-scale real-world datasets verify that our proposed model shows significant advantages in terms of creative quality and generation speed.
In recommendation systems, utilizing the user interaction history as sequential information has resulted in great performance improvement. However, in many online services, user interactions are commonly grouped by sessions that presumably share preferences, which requires a different approach from ordinary sequence representation techniques. To this end, sequence representation models with a hierarchical structure or various viewpoints have been developed but with a rather complex network structure. In this paper, we propose three methods to improve recommendation performance by exploiting session information while minimizing additional parameters in a BERT-based sequential recommendation model: using session tokens, adding session segment embeddings, and a time-aware self-attention. We demonstrate the feasibility of the proposed methods through experiments on widely used recommendation datasets.
Posterior Probability Matters: Doubly-Adaptive Calibration for Neural Predictions in Online Advertising
Predicting user response probabilities is vital for ad ranking and bidding. We hope that predictive models can produce accurate probabilistic predictions that reflect true likelihoods. Calibration techniques aim to post-process model predictions into posterior probabilities. Field-level calibration, which performs calibration w.r.t. a specific field value, is fine-grained and more practical. In this paper we propose a doubly-adaptive approach, AdaCalib. It learns an isotonic function family to calibrate model predictions with the guidance of posterior statistics, and field-adaptive mechanisms are designed to ensure that the posterior is appropriate for the field value to be calibrated. Experiments verify that AdaCalib achieves significant improvement in calibration performance. It has been deployed online and outperforms the previous approach.
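For readers unfamiliar with isotonic calibration, the following sketch shows the classic pool-adjacent-violators (PAV) algorithm, which fits a non-decreasing step function to observed response rates ordered by predicted score. This is a swapped-in textbook technique for illustration only; it is not AdaCalib, which learns its isotonic functions and adapts them per field. The function name `pav` and the toy bin rates are assumptions.

```python
def pav(values, weights=None):
    """Pool Adjacent Violators: least-squares non-decreasing fit.

    values:  observed rates (e.g. click rate per score bin), already
             ordered by increasing predicted score.
    weights: optional per-bin sample counts; defaults to equal weights.
    """
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [weighted mean, total weight, bin count]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Made-up observed click rates for four score bins (lowest scores first).
observed = [0.1, 0.3, 0.2, 0.6]
calibrated = pav(observed)
```

In this toy input the second and third bins are out of order, so PAV pools them into a single level, giving a monotone mapping from score bin to calibrated probability.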
The scarcity of Mental Health Professionals (MHPs) available to assist patients underlines the need for developing automated systems to help MHPs combat the grievous mental illness called Major Depressive Disorder. In this paper, we develop a Virtual Assistant (VA) that serves as a first point of contact for users who are depressed or disheartened. In support-based conversations, two primary components have been identified to produce positive outcomes: empathy and motivation. While empathy necessitates acknowledging the feelings of the users with a desire to help, imparting hope and motivation uplifts the spirit of support seekers in distress. A combination of these aspects will ensure a generalized positive outcome and a beneficial alliance in mental health support. The VA, thus, should be capable of generating empathetic and motivational responses while continuously demonstrating positive sentiment. The end-to-end system employs two mechanisms in a pipelined manner: (i) Motivational Response Generator (MRG): a sentiment-driven Reinforcement Learning (RL) based motivational response generator; and (ii) Empathetic Rewriting Framework (ERF): a transformer-based model that rewrites the response from MRG to induce empathy. Experimental results indicate that our proposed VA outperforms several of its counterparts. To the best of our knowledge, this is the first work that seeks to incorporate these aspects together in an end-to-end system.
Recommendation fairness has attracted great attention recently. In real-world systems, users usually have multiple sensitive attributes (e.g. age, gender, and occupation), and users may not want their recommendation results influenced by those attributes. Moreover, which of these user attributes should be considered in fairness-aware modeling, and when, should depend on users' specific demands. In this work, we define the selective fairness task, where users can flexibly choose the sensitive attributes with respect to which the recommendation model should be bias-free. We propose a novel parameter-efficient prompt-based fairness-aware recommendation (PFRec) framework, which relies on attribute-specific prompt-based bias eliminators with adversarial training, enabling selective fairness with different attribute combinations on sequential recommendation. Both task-specific and user-specific prompts are considered. We conduct extensive evaluations to verify PFRec's superiority in selective fairness. The source code is released at https://github.com/wyqing20/PFRec.
In multilingual communities, code-switching is a common phenomenon and code-switched tasks have become a crucial area of research in natural language processing (NLP) applications. Existing approaches mainly focus on supervised learning. However, it is expensive to annotate a sufficient amount of code-switched data. In this paper, we consider the zero-shot setting and improve model performance on code-switched tasks via monolingual language datasets, unlabeled code-switched datasets, and semantic dictionaries. Inspired by the mechanism of code-switching itself, we propose multi-label masked language modeling and predict both the masked word and its synonyms in other languages. Experimental results show that, compared with baselines, our method can further improve the pretrained multilingual model's performance on code-switched sentiment analysis datasets.
In this paper, we evaluate different alternatives to process richer forms of Automatic Speech Recognition (ASR) output based on lattice expansion algorithms for Spoken Document Retrieval (SDR). Typically, SDR systems employ ASR transcripts to index and retrieve relevant documents. However, ASR errors negatively affect the retrieval performance. Multiple alternative hypotheses can also be used to augment the input to document retrieval to compensate for the erroneous one-best hypothesis. In Weighted Finite State Transducer-based ASR systems, using the n-best output (i.e. the top "n" scoring hypotheses) for the retrieval task is common, since they can easily be fed to a traditional Information Retrieval (IR) pipeline. However, the n-best hypotheses are highly redundant, and do not sufficiently encapsulate the richness of the ASR output, which is represented as an acyclic directed graph called the lattice. In particular, we utilize the lattice's constrained minimum path cover to generate a minimum set of hypotheses that serve as input to the reranking phase of IR. The novelty of our proposed approach is the incorporation of the lattice as an input for neural reranking by considering a set of hypotheses that represents every arc in the lattice. The obtained hypotheses are encoded through sentence embeddings using BERT-based models, namely SBERT and RoBERTa, and the final ranking of the retrieved segments is obtained with a max-pooling operation over the computed scores between the input query and the hypotheses set. We present our evaluation on the publicly available AMI meeting corpus. Our results indicate that the proposed use of hypotheses from the expanded lattice improves the SDR performance significantly over the n-best ASR output.
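The final scoring step described above, max-pooling over query-hypothesis scores, can be sketched as follows. This is a minimal illustration under assumptions: embeddings are plain lists of floats that some sentence encoder has already produced, cosine similarity stands in for whatever scoring function the authors use, and the names `segment_score` and `rank_segments` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_score(query_vec, hypothesis_vecs):
    """Max-pool: a segment scores as well as its best-matching hypothesis."""
    return max(cosine(query_vec, h) for h in hypothesis_vecs)


def rank_segments(query_vec, segments):
    """Return segment ids sorted by max-pooled score, best first.

    segments: dict mapping a segment id to a non-empty list of
              hypothesis embeddings for that segment.
    """
    scored = {sid: segment_score(query_vec, hyps) for sid, hyps in segments.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Toy 2-d embeddings with made-up values.
query = [1.0, 0.0]
segments = {
    "a": [[0.0, 1.0], [1.0, 1.0]],
    "b": [[1.0, 0.0], [0.0, 1.0]],
    "c": [[-1.0, 0.0]],
}
ranking = rank_segments(query, segments)
```

Max-pooling means that one well-matching hypothesis is enough for a segment to rank highly, which suits a setting where most alternative hypotheses for a segment contain recognition errors.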
Abstractive summarization focuses on generating concise and fluent text from an original document while maintaining the original intent and containing new words that do not appear in the original document. Recent studies point out that rewriting extractive summaries, using a sentence as the textual unit, helps improve performance by producing a more concise and comprehensible output summary. However, a single document sentence normally cannot supply sufficient information. In this paper, we apply the elementary discourse unit (EDU) as the textual unit of content selection. In order to utilize EDUs for generating a high-quality summary, we propose a novel summarization model that first designs an EDU selector to choose salient content. Then, the generator model rewrites the selected EDUs as the final summary. To determine the relevance of each EDU to the entire document, we apply group tag embedding, which can establish the connection between summary sentences and relevant EDUs, so that our generator does not only focus on selected EDUs, but also ingests the entire original document. Extensive experiments on the CNN/Daily Mail dataset have demonstrated the effectiveness of our model.
LightSGCN: Powering Signed Graph Convolution Network for Link Sign Prediction with Simplified Architecture Design
With both positive and negative links, signed graphs exist widely in the real world. Recently, signed graph neural networks (GNNs) have shown superior performance in the most common signed graph analysis task, i.e., link sign prediction. Existing signed GNNs follow the classic nonlinear-propagation paradigm in unsigned GNNs. However, several recent studies on unsigned GNNs have shown that such a paradigm increases training difficulty and even reduces performance in various unsigned graph analysis tasks. Meanwhile, most of the public real-world signed graph datasets do not provide node features. These motivate us to consider whether the existing complex model architecture is suitable.
In this work, we aim to simplify the architecture of signed GNNs to make it more concise and appropriate for link sign prediction. We propose a simplified signed graph convolution network model called LightSGCN. Specifically, LightSGCN utilizes linear propagation based on balance theory, a widely adopted social theory. Then, the linear combination of hidden representations at each layer is used as the final representation. Moreover, we also propose a tailored prediction function. These finally yield a simple yet effective LightSGCN model, which is more interpretable, easier to implement, and more efficient to train. Experimental results on four real-world signed graphs demonstrate that such a linear method outperforms the state-of-the-art signed GNN methods with significant improvement in the link sign prediction task and achieves more than 100X speedup over the most similar and simplest baseline.
Widely applied in today's recommender systems, sequential recommendation predicts the next interacted item for a given user via his/her historical item sequence. However, sequential recommendation suffers from data sparsity, like most recommenders. To extract auxiliary signals from the data, some recent works exploit self-supervised learning to generate augmented data via a dropout strategy, which, however, leads to sparser sequential data and obscure signals. In this paper, we propose Dual Contrastive Network (DCN) to boost sequential recommendation from a new perspective of integrating auxiliary user sequences for items. Specifically, we propose two kinds of contrastive learning. The first one is dual representation contrastive learning, which minimizes the distances between embeddings and sequence representations of users/items. The second one is dual interest contrastive learning, which aims to self-supervise the static interest with the dynamic interest of next item prediction via auxiliary training. We also incorporate the auxiliary task of predicting the next user for a given item's historical user sequence, which can capture the trends of items preferred by certain types of users. Experiments on benchmark datasets verify the effectiveness of our proposed method. A further ablation study also illustrates the boosting effect of the proposed components upon different sequential models.
Deep learning-based recommendation has become a widely adopted technique in various online applications. Typically, a deployed model undergoes frequent re-training to capture users' dynamic behaviors from newly collected interaction logs. However, the current model training process only uses user feedback as labels, and fails to take into account the errors made in previous recommendations. Inspired by the intuition that humans usually reflect and learn from mistakes, in this paper, we attempt to build a self-correction continual learning loop (dubbed ReLoop) for recommender systems. In particular, a new customized loss is employed to encourage every new model version to reduce prediction errors over the previous model version during training. Our ReLoop learning framework enables a continual self-correction process in the long run and thus is expected to obtain better performance than existing training strategies. Both offline experiments and an online A/B test have been conducted to validate the effectiveness of ReLoop.
In this paper, we propose to formulate the task-oriented dialogue system as a purely natural language generation task, so as to fully leverage large-scale pre-trained models like GPT-2 and simplify complicated delexicalization preprocessing. However, directly applying this method suffers heavily from the dialogue entity inconsistency caused by the removal of delexicalized tokens, as well as the catastrophic forgetting problem of the pre-trained model during fine-tuning, leading to unsatisfactory performance. To alleviate these problems, we design a novel GPT-Adapter-CopyNet network, which incorporates the lightweight adapter and CopyNet modules into GPT-2 to achieve better performance on transfer learning and dialogue entity generation. Experimental results on the DSTC8 Track 1 benchmark and the MultiWOZ dataset demonstrate that our proposed approach significantly outperforms baseline models, with remarkable performance on automatic and human evaluations.
Network-aware cascade size prediction aims to predict the final number of reposts of user-generated information by modeling the propagation process in social networks. Estimating a user's reposting probability from social influence, namely state activation, plays an important role in the information diffusion process. Therefore, Graph Neural Networks (GNNs), which can simulate the information interaction between nodes, have proved to be an effective scheme for this prediction task. However, existing studies, including GNN-based models, usually neglect a vital factor, user preference, which deeply influences state activation. To that end, we propose a novel framework to improve cascade size prediction by enhancing user preference modeling in three stages, i.e., preference topic generation, preference shift modeling, and social influence activation. Our end-to-end method makes the user activation process of information diffusion more adaptive and accurate. Extensive experiments on two large-scale real-world datasets have clearly demonstrated the effectiveness of our proposed model compared to state-of-the-art baselines.
SESSION: Reproducibility Track Papers
Where Does the Performance Improvement Come From? A Reproducibility Concern about Image-Text Retrieval
This article aims to provide the information retrieval community with some reflections on recent advances in retrieval learning by analyzing the reproducibility of image-text retrieval models. Due to the increase of multimodal data over the last decade, image-text retrieval has steadily become a major research direction in the field of information retrieval. Numerous researchers train and evaluate image-text retrieval algorithms using benchmark datasets such as MS-COCO and Flickr30k. Research in the past has mostly focused on performance, with multiple state-of-the-art methodologies being suggested in a variety of ways. According to their assertions, these techniques provide improved modality interactions and hence more precise multimodal representations. In contrast to previous works, we focus on the reproducibility of the approaches and the examination of the elements that lead to improved performance by pretrained and nonpretrained models in retrieving images and text.
To be more specific, we first examine the related reproducibility concerns and explain why our focus is on image-text retrieval tasks. Second, we systematically summarize the current paradigm of image-text retrieval models and the stated contributions of those approaches. Third, we analyze various aspects of the reproduction of pretrained and nonpretrained retrieval models. To complete this, we conducted ablation experiments and obtained some influencing factors that affect retrieval recall more than the improvement claimed in the original paper. Finally, we present some reflections and challenges that the retrieval community should consider in the future. Our source code is publicly available at https://github.com/WangFei-2019/Image-text-Retrieval.
Methods for reinforcement learning for recommendation are increasingly receiving attention as they can quickly adapt to user feedback. A typical RL4Rec framework consists of (1) a state encoder to encode the state that stores the users' historical interactions, and (2) an RL method to take actions and observe rewards. Prior work compared four state encoders in an environment where user feedback is simulated based on real-world logged user data. An attention-based state encoder was found to be the optimal choice as it reached the highest performance. However, this finding is limited to the actor-critic method, four state encoders, and evaluation-simulators that do not debias logged user data. In response to these shortcomings, we reproduce and expand on the existing comparison of attention-based state encoders (1) in the publicly available debiased RL4Rec SOFA simulator with (2) a different RL method, (3) more state encoders, and (4) a different dataset. Importantly, our experimental results indicate that existing findings do not generalize to the debiased SOFA simulator generated from a different dataset and a DQN-based method when compared with more state encoders.
Over two decades ago, Berger and Lafferty proposed "information retrieval as statistical translation" (IRST), a simple and elegant method for ad hoc retrieval based on the noisy channel model. At the time, they lacked the large-scale human-annotated datasets necessary to properly train their models. In this paper, we ask the simple question: What if Berger and Lafferty had access to datasets such as the MS MARCO passage ranking dataset that we take for granted today? The answer to this question tells us how much of recent improvements in ranking can be solely attributed to having more data available, as opposed to improvements in models (e.g., pretrained transformers) and optimization techniques (e.g., contrastive loss). In fact, Boytsov and Kolter recently began to answer this question with a replication of Berger and Lafferty's model, and this work can be viewed as another independent replication effort, with generalizations to additional conditions not previously explored, including replacing the sum of translation probabilities with ColBERT's MaxSim operator. We confirm that while neural models (particularly pretrained transformers) have indeed led to great advances in retrieval effectiveness, the IRST model proposed decades ago is quite effective if provided sufficient training data.
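To make the contrast between summing translation probabilities and using a MaxSim-style operator concrete, the sketch below scores a query against a document in both ways. The translation table, its values, the uniform weighting of document terms, and the probability floor are all assumptions for illustration; this is a simplified Model-1-style scorer without smoothing, not the replicated system.

```python
import math

# Hypothetical translation table: T[(query_term, doc_term)] = P(query_term | doc_term).
T = {
    ("car", "car"): 0.7,
    ("car", "automobile"): 0.5,
    ("cheap", "affordable"): 0.4,
}


def score_sum(query, doc, table, floor=1e-6):
    """Sum variant: average translation probabilities over all document terms."""
    total = 0.0
    for q in query:
        p = sum(table.get((q, d), 0.0) for d in doc) / len(doc)
        total += math.log(max(p, floor))
    return total


def score_maxsim(query, doc, table, floor=1e-6):
    """MaxSim-style variant: keep only the best-matching document term per query term."""
    total = 0.0
    for q in query:
        p = max(table.get((q, d), 0.0) for d in doc)
        total += math.log(max(p, floor))
    return total


query = ["cheap", "car"]
doc = ["affordable", "automobile", "car"]
sum_score = score_sum(query, doc, T)
max_score = score_maxsim(query, doc, T)
```

In the sum variant every document term contributes to each query term's probability, so long documents dilute a strong match; in the MaxSim-style variant only the single strongest match per query term counts.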
Recent work in recommender systems mainly focuses on fairness in recommendations as an important aspect of measuring recommendation quality. A fairness-aware recommender system aims to treat different user groups similarly. Relevant work on user-oriented fairness highlights the discriminatory behavior of fairness-unaware recommendation algorithms towards a certain user group, defined based on users' activity level. Typical solutions include proposing a user-centered fairness re-ranking framework applied on top of a base ranking model to mitigate its unfair behavior towards a certain user group, i.e., the disadvantaged group. In this paper, we reproduce a user-oriented fairness study and provide extensive experiments to analyze the dependency of its proposed method on various fairness and recommendation aspects, including the recommendation domain, the nature of the base ranking model, and the user grouping method. Moreover, we evaluate the final recommendations provided by the re-ranking framework using both user-side (e.g., NDCG, user-fairness) and item-side (e.g., novelty, item-fairness) metrics. We discover interesting trends and trade-offs between the model's performance in terms of different evaluation metrics. For instance, we see that the definition of the advantaged/disadvantaged user groups plays a crucial role in the effectiveness of the fairness algorithm and how it improves the performance of specific base ranking models. Finally, we highlight some important open challenges and future directions in this field. We release the data, evaluation pipeline, and the trained models publicly at https://github.com/rahmanidashti/FairRecSys.
The Search Engine Results Page (SERP) has evolved significantly over the last two decades, moving away from the simple ten blue links paradigm to considerably more complex presentations that contain results from multiple verticals and granularities of textual information. Prior works have investigated how user interactions on the SERP are influenced by the presence or absence of heterogeneous content (e.g., images, videos, or news content), the layout of the SERP (list vs. grid layout), and task complexity. In this paper, we reproduce the user studies conducted in prior works, specifically those of Arguello et al. (2012) and Siu and Chaparro (2014), to explore to what extent the findings from research conducted five to ten years ago still hold today as the average web user has become accustomed to SERPs with ever-increasing presentational complexity. To this end, we designed and ran a user study with four different SERP interfaces: (i) a heterogeneous grid; (ii) a heterogeneous list; (iii) a simple grid; and (iv) a simple list. We collected the interactions of 41 study participants over 12 search tasks for our analyses. We observed that SERP types and task complexity affect user interactions with search results. We also find evidence to support most (6 out of 8) observations from Arguello et al. (2012) and Siu and Chaparro (2014), indicating that user interactions with different interfaces, and when solving tasks of different complexity, have remained mostly similar over time.
Space4HGNN: A Novel, Modularized and Reproducible Platform to Evaluate Heterogeneous Graph Neural Network
Heterogeneous Graph Neural Networks (HGNNs) have been successfully employed in various tasks, but we cannot accurately know the importance of different design dimensions of HGNNs due to diverse architectures and applied scenarios. Besides, in the research community of HGNNs, implementing and evaluating various tasks still requires considerable human effort. To mitigate these issues, we first propose a unified framework covering most HGNNs, consisting of three components: heterogeneous linear transformation, heterogeneous graph transformation, and heterogeneous message passing layer. Then we build a platform, Space4HGNN, by defining a design space for HGNNs based on the unified framework, which offers modularized components, reproducible implementations, and standardized evaluation for HGNNs. Finally, we conduct experiments to analyze the effect of different designs. With the insights found, we distill a condensed design space and verify its effectiveness.
Dense retrieval approaches are of increasing interest because they can better capture contextualised similarity compared to sparse retrieval models such as BM25. Among the most prominent of these approaches is TCT-ColBERT, which trains a light-weight "student" model from a more expensive "teacher" model. In this work, we take a closer look into TCT-ColBERT concerning its reproducibility and replicability. To structure our study, we propose a three-stage perspective on reproducing the training, inference, and evaluation of model-focused papers, each using artefacts produced from different stages in the pipeline. We find that, perhaps as expected, precise reproduction is more challenging when the complete training process is conducted, rather than just inference from a released trained model. Each stage provides the opportunity to perform replication and ablation experiments. We are able to replicate (i.e., produce an effective independent implementation) for model inference and dense indexing/retrieval, but are unable to replicate the training process. We conduct several ablations to cover gaps in the original paper, and make the following observations: (1) the model can function as an inexpensive re-ranker, establishing a new Pareto-optimal result; (2) the index size can be reduced by using lower-precision floating point values, but only if ties in scores are handled appropriately; (3) training needs to be conducted for the entire suggested duration to achieve optimal performance; and (4) student initialisation from the teacher is not necessary.
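Observation (2) can be illustrated with a small sketch: rounding scores to half precision can collapse nearly equal values into exact ties, so the ranking then depends on how ties are broken. The scores and document ids below are hypothetical, and breaking ties by document id is just one possible deterministic policy, not necessarily the one used in the paper.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def rank(scores: dict) -> list:
    """Sort by descending score, breaking ties deterministically by document id."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical full-precision scores; d1 and d2 differ only slightly.
full = {"d2": 0.50010, "d1": 0.50005, "d3": 0.25}
half = {doc_id: to_float16(s) for doc_id, s in full.items()}

full_ranking = rank(full)  # d2 is ahead of d1 at full precision
half_ranking = rank(half)  # d1 and d2 tie at 0.5, so the id decides the order
```

Without an explicit tie-breaking key, the relative order of the tied documents would depend on incidental details such as insertion order, which makes evaluation scores harder to reproduce.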
SESSION: Perspective Papers
In this perspective paper we study the effect of non-independent and identically distributed (non-IID) data on federated online learning to rank (FOLTR) and chart directions for future work in this new and largely unexplored research area of Information Retrieval. In the FOLTR process, clients participate in a federation to jointly create an effective ranker from the implicit click signal originating in each client, without the need to share data (documents, queries, clicks). A well-known factor that affects the performance of federated learning systems, and that poses serious challenges to these approaches, is that there may be some type of bias in the way data is distributed across clients. While FOLTR systems are in their own right a type of federated learning system, the presence and effect of non-IID data in FOLTR has not been studied. To this end, we first enumerate possible data distribution settings that may showcase data bias across clients and thus give rise to the non-IID problem. Then, we study the impact of each setting on the performance of the current state-of-the-art FOLTR approach, the Federated Pairwise Differentiable Gradient Descent (FPDGD), and we highlight which data distributions may pose a problem for FOLTR methods. We also explore how common approaches proposed in the federated learning literature address non-IID issues in FOLTR. This allows us to unveil new research gaps that, we argue, future research in FOLTR should consider.
Feature selection is a common step in many ranking, classification, or prediction tasks and serves many purposes. By removing redundant or noisy features, the accuracy of ranking or classification can be improved and the computational cost of the subsequent learning steps can be reduced. However, feature selection can be itself a computationally expensive process. While for decades confined to theoretical algorithmic papers, quantum computing is now becoming a viable tool to tackle realistic problems, in particular special-purpose solvers based on the Quantum Annealing paradigm. This paper aims to explore the feasibility of using currently available quantum computing architectures to solve some quadratic feature selection algorithms for both ranking and classification.
The experimental analysis includes 15 state-of-the-art datasets. The effectiveness obtained with quantum computing hardware is comparable to that of classical solvers, indicating that quantum computers are now reliable enough to tackle interesting problems. In terms of scalability, current generation quantum computers are able to provide a limited speedup over certain classical algorithms and hybrid quantum-classical strategies show lower computational cost for problems of more than a thousand features.
Recent advances in Information Retrieval utilise energy-intensive hardware to produce state-of-the-art results. In areas of research highly related to Information Retrieval, such as Natural Language Processing and Machine Learning, there have been efforts to quantify and reduce the power and emissions produced by methods that depend on such hardware. Research that is conscious of the environmental impacts of its experimentation and takes steps to mitigate some of these impacts is considered 'Green'. Given the continuous demand for more data and power-hungry techniques, Green research is likely to become more important within the broader research community. Therefore, within the Information Retrieval community, the consequences of non-Green (in other words, Red) research should at least be considered and acknowledged. As such, the aims of this perspective paper are fourfold: (1) to review the Green literature not only for Information Retrieval but also for related domains in order to identify transferable Green techniques; (2) to provide measures for quantifying the power usage and emissions of Information Retrieval research; (3) to report the power usage and emission impacts for various current IR methods; and (4) to provide a framework to guide Green Information Retrieval research, taking inspiration from 'reduce, reuse, recycle' waste management campaigns, including salient examples from the literature that implement these concepts.
The Web is a canonical example of a competitive search setting that includes document authors with ranking incentives: their goal is to promote their documents in rankings induced for queries. The incentives affect some of the corpus dynamics as the authors respond to rankings by applying strategic document manipulations. This well known reality has deep consequences that go well beyond the need to fight spam. As a case in point, researchers showed using game theoretic analysis that the probability ranking principle is not optimal in competitive retrieval settings; specifically, it leads to reduced topical diversity in the corpus. We provide a broad perspective on recent work on competitive retrieval settings, argue that this work is the tip of the iceberg, and pose a suite of novel research directions; for example, a general game theoretic framework for competitive search, methods of learning-to-rank that account for post-ranking effects, approaches to automatic document manipulation, addressing societal aspects and evaluation.
Where do queries -- the words searchers type into a search box -- come from? The Information Retrieval community understands the performance of queries and search engines extensively, and has recently begun to examine the impact of query variation, showing that different queries for the same information need produce different results. In an information environment where bad actors try to nudge searchers toward misinformation, this is worrisome. The source of query variation -- searcher characteristics, contextual or linguistic prompts, cognitive biases, or even the influence of external parties -- while studied in a piecemeal fashion by other research communities has not been studied by ours. In this paper we draw on a variety of literatures (including information seeking, psychology, and misinformation), and report some small experiments to describe what is known about where queries come from, and demonstrate a clear literature gap around the source of query variations in IR. We chart a way forward for IR to research, document and understand this important question, with a view to creating search engines that provide more consistent, accurate and relevant search results regardless of the searcher's framing of the query.
Natural interaction with recommendation and personalized search systems has received tremendous attention in recent years. We focus on the challenge of supporting people's understanding and control of these systems and explore a fundamentally new way of thinking about representation of knowledge in recommendation and personalization systems. Specifically, we argue that it may be both desirable and possible for algorithms that use natural language representations of users' preferences to be developed. We make the case that this could provide significantly greater transparency, as well as affordances for practical actionable interrogation of, and control over, recommendations. Moreover, we argue that such an approach, if successfully applied, may enable a major step towards systems that rely less on noisy implicit observations while increasing portability of knowledge of one's interests.
Although information access systems have long supported people in accomplishing a wide range of tasks, we propose broadening the scope of users of information access systems to include task-driven machines, such as machine learning models. In this way, the core principles of indexing, representation, retrieval, and ranking can be applied and extended to substantially improve model generalization, scalability, robustness, and interpretability. We describe a generic retrieval-enhanced machine learning (REML) framework, which includes a number of existing models as special cases. REML challenges information retrieval conventions, presenting opportunities for novel advances in core areas, including optimization. The REML research agenda lays a foundation for a new style of information access research and paves a path towards advancing machine learning and artificial intelligence.
SESSION: Resource Track Papers
Memes have become a popular means of communication for Internet users worldwide. Understanding Internet memes is one of the trickiest challenges in natural language processing (NLP) due to their informal, non-standard writing and network vocabulary. Recently, many linguists have suggested that memes contain rich metaphorical information. However, existing research ignores this key feature. Therefore, to incorporate informative metaphors into meme analysis, we introduce a novel multimodal meme dataset called MET-Meme, which is rich in metaphorical features. It contains 10045 text-image pairs, with manual annotations of the metaphor occurrence, sentiment categories, intentions, and offensiveness degree. Moreover, we propose a range of strong baselines to demonstrate the importance of combining metaphorical features for meme sentiment analysis and semantic understanding tasks, respectively. MET-Meme and its code are publicly released for research at https://github.com/liaolianfoka/MET-Meme-A-Multi-modal-Meme-Dataset-Rich-in-Metaphors.
Revisiting Bundle Recommendation: Datasets, Tasks, Challenges and Opportunities for Intent-aware Product Bundling
Product bundling is a commonly used marketing strategy in both offline retailers and online e-commerce systems. Current research on bundle recommendation is limited by: (1) noisy datasets, where bundles are defined by heuristics, e.g., products co-purchased in the same session; and (2) specific tasks, holding unrealistic assumptions, e.g., the availability of bundles for recommendation directly. In this paper, we propose to take a step back and consider the process of bundle recommendation from a holistic user experience perspective. We first construct high-quality bundle datasets with rich meta information, particularly bundle intents, through a carefully designed crowd-sourcing task. We then define a series of tasks that together support all key steps in a typical bundle recommendation process, from bundle detection, completion, and ranking to explanation and auto-naming. Finally, we conduct extensive experiments and in-depth analysis that demonstrate the challenges of bundle recommendation, arising from the need for capturing complex relations among users, products, and bundles, as well as the research opportunities, especially in graph-based neural methods. To sum up, our study delivers new data sources, opens up new research directions, and provides useful guidance for product bundling in real e-commerce platforms. Our datasets are available at GitHub (https://github.com/BundleRec/bundle_recommendation).
The past two decades have witnessed the rapid development of personalized recommendation techniques. Despite the significant progress made in both research and practice of recommender systems, to date, there is a lack of a widely recognized benchmarking standard in this field. Many of the existing studies perform model evaluations and comparisons in an ad-hoc manner, for example, by employing their own private data splits or using a different experimental setting. However, such conventions not only increase the difficulty of reproducing existing studies, but also lead to inconsistent experimental results among them. This largely limits the credibility and practical value of research results in this field. To tackle these issues, we present an initiative aimed at open benchmarking for recommender systems. In contrast to some earlier attempts towards this goal, we take one further step by setting up a standardized benchmarking pipeline for reproducible research, which integrates all the details about datasets, source code, hyper-parameter settings, running logs, and evaluation results. The benchmark is designed with comprehensiveness and sustainability in mind. It spans both matching and ranking tasks, and also allows anyone to easily follow and contribute. We believe that our benchmark could not only reduce the redundant efforts of researchers to re-implement or re-run existing baselines, but also drive more solid and reproducible research on recommender systems.
Medical visual question answering (Med-VQA) is a challenging problem that aims to take a medical image and a clinical question about the image as input and output a correct answer in natural language. Current medical systems often require large-scale and high-quality labeled data for training and evaluation. To address the challenge, we present a new dataset, denoted by OVQA, which is generated from electronic medical records. We develop a semi-automatic data generation tool for constructing the dataset. First, medical entities are automatically extracted from medical records and filled into predefined templates for generating question and answer pairs. These pairs are then combined with medical images extracted from corresponding medical records, to generate candidates for visual question answering (VQA). The candidates are finally verified with high-quality labels annotated by experienced physicians. To evaluate the quality of OVQA, we conduct comprehensive experiments with state-of-the-art methods for the Med-VQA task on our dataset. The results show that our OVQA can be used as a benchmarking dataset for evaluating existing Med-VQA systems. The dataset can be downloaded from http://18.104.22.168/.
Fostering Coopetition While Plugging Leaks: The Design and Implementation of the MS MARCO Leaderboards
We articulate the design and implementation of the MS MARCO document ranking and passage ranking leaderboards. In contrast to "standard" community-wide evaluations such as those at TREC, which can be characterized as simultaneous games, leaderboards represent sequential games, where every player move is immediately visible to the entire community. The fundamental challenge with this setup is that every leaderboard submission leaks information about the held-out evaluation set, which conflicts with the fundamental tenet in machine learning about separation of training and test data. These "leaks", accumulated over long periods of time, threaten the validity of the insights that can be derived from the leaderboards. In this paper, we share our experiences grappling with this issue over the past few years and how our considerations are operationalized into a coherent submission policy. Our work provides a useful guide to help the community understand the design choices made in the popular MS MARCO leaderboards and offers lessons for designers of future leaderboards.
False information has a significant negative influence on individuals as well as on the whole society. Especially in the current COVID-19 era, we witness an unprecedented growth of medical misinformation. To help tackle this problem with machine learning approaches, we are publishing a feature-rich dataset of approx. 317k medical news articles/blogs and 3.5k fact-checked claims. It also contains 573 manually and more than 51k automatically labelled mappings between claims and articles. Mappings consist of claim presence, i.e., whether a claim is contained in a given article, and article stance towards the claim. We provide several baselines for these two tasks and evaluate them on the manually labelled part of the dataset. The dataset enables a number of additional tasks related to medical misinformation, such as misinformation characterisation studies or studies of misinformation diffusion between sources.
We address the task of sentence retrieval for open-ended dialogues. The goal is to retrieve sentences from a document corpus that contain information useful for generating the next turn in a given dialogue. Prior work on dialogue-based retrieval focused on specific types of dialogues: either conversational QA or conversational search. To address a broader scope of this task where any type of dialogue can be used, we constructed a dataset that includes open-ended dialogues from Reddit, candidate sentences from Wikipedia for each dialogue, and human annotations for the sentences. We report the performance of several retrieval baselines, including neural retrieval models, over the dataset. To adapt neural models to the types of dialogues in the dataset, we explored an approach to induce large-scale weakly supervised training data from Reddit. Using this training set significantly improved the performance over training on the MS MARCO dataset.
This paper presents the lessons regarding the construction and use of large Cranfield-style test collections learned from the TREC 2021 Deep Learning track. The corpus used in the 2021 edition of the track was much bigger than the corpus used previously and it contains many more relevant documents. The process used to select documents to judge that had been used in earlier years of the track failed to produce a reliable collection because most topics have too many relevant documents. Judgment budgets were exceeded before an adequate sample of the relevant set could be found, so there are likely many unknown relevant documents in the unjudged portion of the corpus. As a result, the collection is not reusable, and furthermore, recall-based measures are unreliable even for the retrieval systems that were used to build the collection. Yet, early-precision measures cannot distinguish among system results because the maximum score is easily obtained for many topics. And since the existing tools for appraising the quality of test collections depend on systems' scores, they also fail when there are too many relevant documents. Collection builders will need new strategies and tools for building reliable test collections for continued use of the Cranfield paradigm on ever-larger corpora. Ensuring that the definition of 'relevant' truly reflects the desired systems' rankings is a provisional strategy for continued collection building.
Ad hoc dataset retrieval is a trending topic in IR research. Methods and systems are evolving from metadata-based to content-based ones which exploit the data itself for improving retrieval accuracy but thus far lack a specialized test collection. In this paper, we build and release the first test collection for ad hoc content-based dataset retrieval, where content-oriented dataset queries and content-based relevance judgments are annotated by human experts who are assisted with a dashboard designed specifically for comprehensively and conveniently browsing both the metadata and data of a dataset. We conduct extensive experiments on the test collection to analyze its difficulty and provide insights into the underlying task.
Link recommendation is an important and compelling problem at the intersection of recommender systems and online social networks. Given a user, link recommenders identify people on the platform the user might be interested in interacting with. We present RELISON, an extensible framework for running link recommendation experiments. The library provides a wide range of algorithms, along with tools for evaluating the produced recommendations. RELISON includes algorithms and metrics that consider the potential effect of recommendations on the properties of online social networks. For this reason, the library also implements network structure analysis metrics, community detection algorithms, and network diffusion simulation functionalities. The library code and documentation are available at https://github.com/ir-uam/RELISON.
We provide a resource for automatically harvesting relevance benchmarks from Wikipedia -- which we refer to as "Wikimarks" to differentiate them from manually created benchmarks. Unlike simulated benchmarks, they are based on manual annotations of Wikipedia authors. Studies on the TREC Complex Answer Retrieval track demonstrated that leaderboards under Wikimarks and manually annotated benchmarks are very similar. Because of their availability, Wikimarks can fill an important need for Information Retrieval research.
We provide a meta-resource to harvest Wikimarks for several information retrieval tasks across different languages: paragraph retrieval, entity ranking, query-specific clustering, outline prediction, relevant entity linking, and many more. In addition, we provide example Wikimarks for English, Simple English, and Japanese derived from the 01/01/2022 Wikipedia dump.
Resource available: https://trema-unh.github.io/wikimarks/
Medical dialogue systems (MDSs) aim to assist doctors and patients with a range of professional medical services, i.e., diagnosis, treatment and consultation. The development of MDSs is hindered because of a lack of resources. In particular, (1) there is no dataset with large-scale medical dialogues that covers multiple medical services and contains fine-grained medical labels (i.e., intents, actions, slots, values), and (2) there is no set of established benchmarks for MDSs for multi-domain, multi-service medical dialogues.
In this paper, we present ReMeDi, a set of resources for medical dialogues. ReMeDi consists of two parts, the ReMeDi dataset and the ReMeDi benchmarks. The ReMeDi dataset contains 96,965 conversations between doctors and patients, including 1,557 conversations with fine-grained labels. It covers 843 types of diseases, 5,228 medical entities, and 3 specialties of medical services across 40 domains. To the best of our knowledge, the ReMeDi dataset is the only medical dialogue dataset that covers multiple domains and services, and has fine-grained medical labels.
The second part of the ReMeDi resources consists of a set of state-of-the-art models for (medical) dialogue generation. The ReMeDi benchmark has the following methods: (1) pretrained models (i.e., BERT-WWM, BERT-MED, GPT2, and MT5) trained, validated, and tested on the ReMeDi dataset, and (2) a self-supervised contrastive learning (SCL) method to expand the ReMeDi dataset and enhance the training of the state-of-the-art pretrained models.
We describe the creation of the ReMeDi dataset and the ReMeDi benchmarking methods, and establish experimental results using the ReMeDi benchmarking methods on the ReMeDi dataset for future research to compare against. With this paper, we share the dataset, implementations of the benchmarks, and evaluation scripts.
ArchivalQA: A Large-scale Benchmark Dataset for Open-Domain Question Answering over Historical News Collections
In the last few years, open-domain question answering (ODQA) has advanced rapidly due to the development of deep learning techniques and the availability of large-scale QA datasets. However, the current datasets are essentially designed for synchronic document collections (e.g., Wikipedia). Temporal news collections such as long-term news archives spanning decades are rarely used in training the models, despite being quite valuable for our society. To foster research in the field of ODQA on such historical collections, we present ArchivalQA, a large question answering dataset consisting of 532,444 question-answer pairs, which is designed for temporal news QA. We divide our dataset into four subparts based on the question difficulty levels and the containment of temporal expressions, which we believe are useful for training and testing ODQA systems characterized by different strengths and abilities. The novel QA dataset-constructing framework that we introduce can also be applied to generate high-quality, non-ambiguous questions over other types of temporal document collections.
Social networks have become an inseparable part of human activities. Most existing social networks follow a centralized system model, which, despite storing valuable user information, raises many critical concerns such as content ownership and over-commercialization. Recently, decentralized social networks, built primarily on blockchain technology, have been proposed as a substitute to eliminate these concerns. Since decentralized architectures are mature enough to be on par with the centralized ones, decentralized social networks are becoming more and more popular. Decentralized social networks can offer both common options like writing posts and comments and more advanced options such as reward systems and voting mechanisms. They provide rich ecosystems for influencers to interact with their followers and other users via staking systems based on cryptocurrency tokens. The vast and valuable data of decentralized social networks open several new directions for the research community to extend knowledge of human behavior. However, accessing and collecting data from these social networks is not easy because it requires strong blockchain knowledge, which is not the main focus of computer science and social science researchers. Hence, our work proposes the SoChainDB framework, which facilitates obtaining data from these new social networks. To show the capacity and strength of SoChainDB, we crawl and publish data from Hive, one of the largest blockchain-based social networks. We conduct extensive analyses to gain insight into the Hive data and discuss some interesting applications, e.g., games and non-fungible token markets built upon Hive. It is worth mentioning that our framework can be adapted to other blockchain social networks with minimal modification. SoChainDB is publicly accessible at http://sochaindb.com and the dataset is available under the CC BY-SA 4.0 license.
Passage retrieval is a fundamental task in information retrieval (IR) research, which has drawn much attention recently. For English, the availability of large-scale annotated datasets (e.g., MS MARCO) and the emergence of deep pre-trained language models (e.g., BERT) have resulted in a substantial improvement of existing passage retrieval systems. However, for Chinese, especially in specific domains, passage retrieval systems are still immature because high-quality annotated datasets are limited in scale. Therefore, in this paper, we present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR). The dataset is collected from three different domains: E-commerce, Entertainment video, and Medical. Each domain's dataset contains millions of passages and a certain amount of human-annotated related query-passage pairs. We implement various representative passage retrieval methods as baselines. We find that the performance of retrieval models trained on general-domain data inevitably decreases on specific domains. Nevertheless, a passage retrieval system built on an in-domain annotated dataset can achieve significant improvement, which demonstrates the necessity of domain-labeled data for further optimization. We hope the release of the Multi-CPR dataset can serve as a benchmark for Chinese passage retrieval in specific domains and also advance future studies.
User intent classification is an important task in information retrieval. In this work, we introduce a revised taxonomy of user intent. We take the widely used differentiation between navigational, transactional and informational queries as a starting point, and identify three different sub-classes for the informational queries: instrumental, factual and abstain. The resulting classification of user queries is more fine-grained, reaches a high level of consistency between annotators, and can serve as the basis for an effective automatic classification process. The newly introduced categories help distinguish between types of queries that a retrieval system could act upon, for example by prioritizing different types of results in the ranking.
We have used a weak supervision approach based on Snorkel to annotate the ORCAS dataset according to our new user intent taxonomy, utilising established heuristics and keywords to construct rules for the prediction of the intent category. We then present a series of experiments with a variety of machine learning models, using the labels from the weak supervision stage as training data, but find that the results produced by Snorkel are not outperformed by these competing approaches and can be considered state-of-the-art. The advantage of a rule-based approach like Snorkel's is its efficient deployment in an actual system, where intent classification would be executed for every query issued.
The resource released with this paper is the ORCAS-I dataset: a labelled version of the ORCAS click-based dataset of Web queries, which provides 18 million connections to 10 million distinct queries. We anticipate the usage of this resource in a scenario where the retrieval system would change its internal workings and search user interface to match the type of information request. For example, a navigational query could trigger just a short result list; and, for instrumental intent the system could rank tutorials and instructions higher than for other types of queries.
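The rule-based weak supervision described above can be illustrated with a minimal sketch. It does not use the Snorkel API and does not reproduce the paper's heuristics: the keyword lists, the rule functions, and the majority vote (standing in for Snorkel's label model) are all assumptions made for illustration, as is the choice to map queries that no rule fires on to the "abstain" class.

```python
from collections import Counter

NO_VOTE = None  # a rule returns this when it has no opinion about a query

# Hypothetical keyword rules; the actual heuristics of the paper are not shown here.
def rule_navigational(query):
    tokens = query.lower().split()
    hit = any(t in {"login", "homepage", "www"} or t.endswith(".com") for t in tokens)
    return "navigational" if hit else NO_VOTE

def rule_transactional(query):
    tokens = set(query.lower().split())
    return "transactional" if tokens & {"buy", "download", "order"} else NO_VOTE

def rule_instrumental(query):
    return "instrumental" if query.lower().startswith("how to ") else NO_VOTE

def rule_factual(query):
    first = query.lower().split()[:1]
    return "factual" if first and first[0] in {"who", "when", "where"} else NO_VOTE

RULES = [rule_navigational, rule_transactional, rule_instrumental, rule_factual]

def weak_label(query):
    """Combine rule outputs by majority vote (a simplification of a label model)."""
    votes = Counter(v for v in (rule(query) for rule in RULES) if v is not NO_VOTE)
    if not votes:
        return "abstain"  # assumption: queries without any rule hit fall back to this class
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    for q in ["facebook login", "buy running shoes", "how to tie a tie", "weather"]:
        print(q, "->", weak_label(q))
```

Because each rule is a cheap string check, such a labeler can run on every incoming query, which is the deployment advantage the abstract points to.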
CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers, e.g., "How has the UK's Open Banking Regulation benefited Challenger Banks". CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. This resource includes expert judgments on 17,509 documents and entities (416.9 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations, providing data for query performance prediction and automatic rewriting evaluation.
CODEC includes analysis of state-of-the-art systems, including dense retrieval and neural re-ranking. The results show the topics are challenging with headroom for document and entity ranking improvement. Query expansion with entity information shows significant gains on document ranking, demonstrating the resource's value for evaluating and improving entity-oriented search. We also show that the manual query reformulations significantly improve document ranking and entity ranking performance. Overall, CODEC provides challenging research topics to support the development and evaluation of entity-centric search methods.
The information retrieval (IR) community has a strong tradition of making the computational artifacts and resources available for future reuse, allowing the validation of experimental results. Besides the actual test collections, the underlying run files are often hosted in data archives as part of conferences like TREC, CLEF, or NTCIR. Unfortunately, the run data itself does not provide much information about the underlying experiment. For instance, the single run file is not of much use without the context of the shared task's website or the run data archive. In other domains, like the social sciences, it is good practice to annotate research data with metadata. In this work, we introduce ir_metadata, an extensible metadata schema for TREC run files based on the PRIMAD model. We propose to align the metadata annotations to PRIMAD, which considers components of computational experiments that can affect reproducibility. Furthermore, we outline important components and information that should be reported in the metadata and give evidence from the literature. To demonstrate the usefulness of these metadata annotations, we implement new features in repro_eval that support the outlined metadata schema for the use case of reproducibility studies. Additionally, we curate a dataset with run files derived from experiments with different instantiations of PRIMAD components and annotate these with the corresponding metadata. In the experiments, we cover reproducibility experiments that are identified by the metadata and classified by PRIMAD. With this work, we enable IR researchers to annotate TREC run files and improve the reuse value of experimental artifacts even further.
Would You Ask it that Way?: Measuring and Improving Question Naturalness for Knowledge Graph Question Answering
Knowledge graph question answering (KGQA) facilitates information access by leveraging structured data without requiring formal query language expertise from the user. Instead, users can express their information needs by simply asking their questions in natural language (NL). Datasets used to train KGQA models that would provide such a service are expensive to construct, both in terms of expert and crowdsourced labor. Typically, crowdsourced labor is used to improve template-based pseudo-natural questions generated from formal queries. However, the resulting datasets often fall short of representing genuinely natural and fluent language. In the present work, we investigate ways to characterize and remedy these shortcomings. We create the IQN-KGQA test collection by sampling questions from existing KGQA datasets and evaluating them with regards to five different aspects of naturalness. Then, the questions are rewritten to improve their fluency. Finally, the performance of existing KGQA models is compared on the original and rewritten versions of the NL questions. We find that some KGQA systems fare worse when presented with more realistic formulations of NL questions. The IQN-KGQA test collection is a resource to help evaluate KGQA systems in a more realistic setting. The construction of this test collection also sheds light on the challenges of constructing large-scale KGQA datasets with genuinely NL questions.
Neural approaches that use pre-trained language models are effective at various ranking tasks, such as question answering and ad-hoc document ranking. However, their effectiveness compared to feature-based Learning-to-Rank (LtR) methods has not yet been well-established. A major reason for this is that present LtR benchmarks that contain query-document feature vectors do not contain the raw query and document text needed for neural models. On the other hand, the benchmarks often used for evaluating neural models, e.g., MS MARCO and TREC Robust, provide text but do not provide query-document feature vectors. In this paper, we present Istella22, a new dataset that enables such comparisons by providing both query/document text and strong query-document feature vectors used by an industrial search engine. The dataset consists of a comprehensive corpus of 8.4M web documents, a collection of query-document pairs including 220 hand-crafted features, relevance judgments on a 5-graded scale, and a set of 2,198 textual queries used for testing purposes. Istella22 enables a fair evaluation of traditional learning-to-rank and transfer ranking techniques on the same data. LtR models exploit the feature-based representations of training samples while pre-trained transformer-based neural rankers can be evaluated on the corresponding textual content of queries and documents. Through preliminary experiments on Istella22, we find that neural re-ranking approaches lag behind LtR models in terms of effectiveness. However, LtR models identify the scores from neural models as strong signals.
Whether to retrieve, answer, translate, or reason, multimodality opens up new challenges and perspectives. In this context, we are interested in answering questions about named entities grounded in a visual context using a Knowledge Base (KB). To benchmark this task, called KVQAE (Knowledge-based Visual Question Answering about named Entities), we provide ViQuAE, a dataset of 3.7K questions paired with images. This is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, and products). The dataset is annotated using a semi-automatic method. We also propose a KB composed of 1.5M Wikipedia articles paired with images. To set a baseline on the benchmark, we address KVQAE as a two-stage problem: Information Retrieval and Reading Comprehension, with both zero- and few-shot learning methods. The experiments empirically demonstrate the difficulty of the task, especially when questions are not about persons. This work paves the way for better multimodal entity representations and question answering. The dataset, KB, code, and semi-automatic annotation pipeline are freely available at https://github.com/PaulLerner/ViQuAE.
Extracting biographical information from online documents is a popular research topic among the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction (RE) are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.
Axiomatic approaches to information retrieval have played a key role in determining basic constraints that characterize good retrieval models. Beyond their importance in retrieval theory, axioms have been operationalized to improve an initial ranking, to "guide" retrieval, or to explain some model's rankings. However, recent open-source retrieval frameworks like PyTerrier and Pyserini, which made it easy to experiment with sparse and dense retrieval models, have not included any retrieval axiom support so far.
To fill this gap, we propose ir_axioms, an open-source Python framework that integrates retrieval axioms with common retrieval frameworks. We include reference implementations for 25 retrieval axioms, as well as components for preference aggregation, re-ranking, and evaluation. New axioms can easily be defined by implementing an abstract data type or by intuitively combining existing axioms with Python operators or regression. Integration with PyTerrier and ir_datasets makes standard retrieval models, corpora, topics, and relevance judgments---including those used at TREC---immediately accessible for axiomatic experimentation. Our experiments on the TREC Deep Learning tracks showcase some potential research questions that ir_axioms can help to address.
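To make the idea of an axiom as a pairwise preference concrete, the following is a minimal sketch in plain Python. It does not use the ir_axioms API; the function names, the length tolerance, and the "net wins" aggregation are assumptions for illustration. The preference is written in the spirit of the TFC1 term-frequency constraint: for documents of (roughly) equal length, the one with more query term occurrences is preferred.

```python
def tfc1_preference(query_terms, doc_a, doc_b, length_tolerance=0.1):
    """Return 1 if doc_a is preferred, -1 if doc_b is preferred, 0 for no preference.

    Documents are token lists. The axiom only applies when the document
    lengths differ by at most `length_tolerance` (a hypothetical parameter).
    """
    longest = max(len(doc_a), len(doc_b))
    if longest == 0 or abs(len(doc_a) - len(doc_b)) > length_tolerance * longest:
        return 0  # precondition not met: the axiom expresses no preference
    tf_a = sum(doc_a.count(t) for t in query_terms)
    tf_b = sum(doc_b.count(t) for t in query_terms)
    return (tf_a > tf_b) - (tf_a < tf_b)

def axiomatic_rerank(query_terms, docs):
    """Re-rank by net pairwise wins; ties keep the initial ranking order."""
    net_wins = [sum(tfc1_preference(query_terms, d, other) for other in docs) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: -net_wins[i])  # stable sort
    return [docs[i] for i in order]

if __name__ == "__main__":
    query = ["neural", "ranking"]
    initial = [
        "neural models for text".split(),
        "neural ranking with neural".split(),
        "cooking pasta at home".split(),
    ]
    for doc in axiomatic_rerank(query, initial):
        print(" ".join(doc))
```

Summing pairwise preferences is only one possible aggregation; other schemes (for example, voting across several axioms) would slot into the same structure.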
Misinformation is becoming increasingly prevalent on social media and in news articles. It has become so widespread that we require algorithmic assistance utilising machine learning to detect such content. Training these machine learning models requires datasets of sufficient scale, diversity and quality. However, datasets in the field of automatic misinformation detection are predominantly monolingual, include a limited number of modalities and are not of sufficient scale and quality. Addressing this, we develop a data collection and linking system (MuMiN-trawl), to build a public misinformation graph dataset (MuMiN), containing rich social media data (tweets, replies, users, images, articles, hashtags) spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade. The dataset is made available as a heterogeneous graph via a Python package (mumin). We provide baseline results for two node classification tasks related to the veracity of a claim involving social media, and demonstrate that these are challenging tasks, with the highest macro-average F1-score being 62.55% and 61.45% for the two tasks, respectively. The MuMiN ecosystem is available at https://mumin-dataset.github.io/, including the data, documentation, tutorials and leaderboards.
CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines
Convincing people to get vaccinated against COVID-19 is a key societal challenge in the present times. As a first step towards this goal, many prior works have relied on social media analysis to understand the specific concerns that people have towards these vaccines, such as potential side-effects, ineffectiveness, political factors, and so on. Though there are datasets that broadly classify social media posts into Anti-vax and Pro-Vax labels, there is no dataset (to our knowledge) that labels social media posts according to the specific anti-vaccine concerns mentioned in the posts. In this paper, we have curated CAVES, the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets labelled with various specific anti-vaccine concerns in a multi-label setting. This is also the first multi-label classification dataset that provides explanations for each of the labels. Additionally, the dataset provides class-wise summaries of all the tweets. We also perform preliminary experiments on the dataset and show that this is a very challenging dataset for multi-label explainable classification and tweet summarization, as is evident from the moderate scores achieved by some state-of-the-art models.
Nowadays, most e-commerce and entertainment services have adopted interactive Recommender Systems (RS) to guide users throughout their entire journey in the system. This task has been addressed as a Multi-Armed Bandit problem where systems must continuously learn and recommend at each iteration. However, despite the recent advances, there is still a lack of consensus on the best practices to evaluate such bandit solutions. Several variables might affect the evaluation process, but most works have only been concerned with the accuracy of each method. Thus, this work proposes an interactive RS framework named iRec. It covers the whole experimentation process by following the main RS guidelines. iRec provides three modules to prepare the dataset, create new recommendation agents, and simulate the interactive scenario. Moreover, it also contains several state-of-the-art algorithms, a hyperparameter tuning module, distinct evaluation metrics, different ways of visualizing the results, and statistical validation.
From Little Things Big Things Grow: A Collection with Seed Studies for Medical Systematic Review Literature Search
Medical systematic review query formulation is a highly complex task done by trained information specialists. Complexity comes from the reliance on lengthy Boolean queries, which express a detailed research question. To aid query formulation, information specialists use a set of exemplar documents, called 'seed studies', prior to query formulation. Seed studies help verify the effectiveness of a query prior to the full assessment of retrieved studies. Beyond this use of seeds, specific IR methods can exploit seed studies for guiding both automatic query formulation and new retrieval models. One major limitation of work to date is that these methods exploit 'pseudo seed studies' through retrospective use of included studies (i.e., relevance assessments). However, we show pseudo seed studies are not representative of real seed studies used by information specialists. Hence, we provide a test collection with real world seed studies used to assist with the formulation of queries. To support our collection, we provide an analysis, previously not possible, on how seed studies impact retrieval and perform several experiments using seed study based methods to compare the effectiveness of using seed studies versus pseudo seed studies. We make our test collection and the results of all of our experiments and analysis available at http://github.com/ielab/sysrev-seed-collection.
With doc2query, we train a neural sequence-to-sequence model that, given an input span of text, predicts a natural language query that the text might answer. These predictions can be viewed as document expansions that feed standard bag-of-words term weighting models such as BM25 or neural retrieval models based on learned sparse lexical representations such as uniCOIL. Previous experiments on the MS MARCO datasets have demonstrated the effectiveness of these methods, and they serve as baselines that are widely used by the community today. Following the recent release of the MS MARCO V2 passage and document ranking test collections, we have refreshed our doc2query and uniCOIL models. This work describes a number of resources that support competitive, reproducible baselines for both the MS MARCO V1 and V2 test collections using our Anserini and Pyserini IR toolkits. Together, they provide a solid foundation for future research on neural retrieval models using the MS MARCO datasets and beyond.
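The document expansion idea can be sketched in a few lines. This is an illustration only: the predicted query is hard-coded as a stand-in for the output of a trained sequence-to-sequence model, and a simple term-overlap count stands in for BM25 or a learned sparse model. The function names are hypothetical and are not part of Anserini or Pyserini.

```python
def expand_document(text, predicted_queries):
    """Append predicted queries to the document text before indexing."""
    return text + " " + " ".join(predicted_queries)

def term_overlap(query, text):
    """Toy matching score: number of distinct query terms present in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

if __name__ == "__main__":
    passage = "The Eiffel Tower was completed in 1889 ."
    # Stand-in for queries a doc2query model might predict for this passage.
    predicted = ["when was the eiffel tower built"]
    expanded = expand_document(passage, predicted)

    query = "eiffel tower built"
    print(term_overlap(query, passage), term_overlap(query, expanded))
```

The expansion adds the term "built", which the original passage lacks, so a bag-of-words scorer can now match a query that uses that word.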
Asking clarification questions is an active area of research; however, resources for training and evaluating search clarification methods are not sufficient. To address this issue, we describe MIMICS-Duo, a new freely available dataset of 306 search queries with multiple clarifications (a total of 1,034 query-clarification pairs). MIMICS-Duo contains fine-grained annotations on clarification questions and their candidate answers and enhances the existing MIMICS datasets by enabling multi-dimensional evaluation of search clarification methods, including online and offline evaluation. We conduct extensive analysis to demonstrate the relationship between offline and online search clarification datasets and outline several research directions enabled by MIMICS-Duo. We believe that this resource will help researchers better understand clarification in search.
Knowledge Graph Question Answering Datasets and Their Generalizability: Are They Enough for Future Research?
Existing approaches to Question Answering over Knowledge Graphs (KGQA) have weak generalizability. That is often due to the standard i.i.d. assumption on the underlying dataset. Recently, three levels of generalization for KGQA were defined, namely i.i.d., compositional, and zero-shot. We analyze 25 well-known KGQA datasets for 5 different Knowledge Graphs (KGs). We show that according to this definition many existing KGQA datasets that are available online are either not suited to train a generalizable KGQA system or are based on discontinued and outdated KGs. Generating new datasets is a costly process and, thus, is not an alternative for smaller research groups and companies. In this work, we propose a mitigation method for re-splitting available KGQA datasets to enable their applicability to evaluate generalization, without any cost and manual effort. We test our hypothesis on three KGQA datasets (i.e., LC-QuAD, LC-QuAD 2.0, and QALD-9). Experiments on the re-split KGQA datasets demonstrate the method's effectiveness towards generalizability. The code and a unified way to access 18 available datasets are online at https://github.com/semantic-systems/KGQA-datasets as well as https://github.com/semantic-systems/KGQA-datasets-generalization.
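The three generalization levels can be illustrated with a small sketch. It is a simplification under stated assumptions, not the paper's re-splitting method: each question is reduced to the set of KG relations in its formal query (the `relations` field is hypothetical), a test question counts as zero-shot if it uses a relation never seen in training, as i.i.d. if its exact relation combination was seen, and as compositional otherwise.

```python
from collections import defaultdict

def generalization_level(question, seen_relations, seen_combinations):
    """Classify one test question relative to the training split."""
    relations = question["relations"]
    if not relations <= seen_relations:
        return "zero-shot"      # at least one relation never seen in training
    if relations in seen_combinations:
        return "i.i.d."         # the same combination occurs in training
    return "compositional"      # known relations, novel combination

def group_by_level(train, test):
    """Group test questions by the generalization level they would evaluate."""
    seen_combinations = {q["relations"] for q in train}
    seen_relations = frozenset().union(*seen_combinations) if seen_combinations else frozenset()
    groups = defaultdict(list)
    for q in test:
        groups[generalization_level(q, seen_relations, seen_combinations)].append(q)
    return dict(groups)

if __name__ == "__main__":
    train = [
        {"id": "t1", "relations": frozenset({"author", "birthPlace"})},
        {"id": "t2", "relations": frozenset({"capital"})},
    ]
    test = [
        {"id": "q1", "relations": frozenset({"capital"})},
        {"id": "q2", "relations": frozenset({"author", "capital"})},
        {"id": "q3", "relations": frozenset({"spouse"})},
    ]
    for level, questions in group_by_level(train, test).items():
        print(level, [q["id"] for q in questions])
```

A real re-split would also look at classes, entities, and query structure; relation sets alone are used here to keep the example short.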
SESSION: Demo Papers
We introduce SparCAssist, a general-purpose risk assessment tool for machine learning models trained for language tasks. It evaluates models' risk by inspecting their behavior on counterfactuals, namely out-of-distribution instances generated based on the given data instance. The counterfactuals are generated by replacing tokens in rationale subsequences identified by ExPred, while the replacements are retrieved using HotFlip or Masked-Language-Model-based algorithms. The main purpose of our system is to help human annotators assess the model's risk on deployment. The counterfactual instances generated during the assessment are a by-product and can be used to train more robust NLP models in the future.
In this demonstration, we present RecDelta, an interactive tool for the cross-model evaluation of top-k recommendation. RecDelta is a web-based information system where users can visually compare the performance of various recommendation algorithms and their recommended items. In the proposed system, we visualize the distribution of the δ scores between algorithms, where δ is a distance metric measuring the intersection between recommendation lists. Such visualization allows for rapid identification of users for whom the items recommended by different algorithms diverge or vice versa; then, one can further select the desired user to present the relationship between recommended items and their historical behavior. RecDelta benefits both academics and practitioners by enhancing model explainability as they develop recommendation algorithms with their newly gained insights. Note that while the system is now online at https://cfda.csie.org/recdelta, we also provide a video recording at https://tinyurl.com/RecDelta to introduce the concept and the usage of our system.
Document-at-a-time (DaaT) and score-at-a-time (SaaT) query evaluation techniques are different approaches to top-k retrieval with inverted indexes. While modern systems are dominated by DaaT, the academic literature has seen decades of debate about the merits of each. Recently, there has been renewed interest in SaaT methods for learned sparse lexical models, where studies have shown that transformers generate "wacky weights" that appear to reduce opportunities for optimizations in DaaT methods. However, researchers currently lack an easy-to-use SaaT system to support further exploration. This is the gap that our work fills. Starting with a modern SaaT system (JASS), we built Python bindings in order to integrate into the DaaT Pyserini IR toolkit (Lucene). The result is a common frontend to both a DaaT and a SaaT system. We demonstrate how recent experiments with a wide range of learned sparse lexical models can be easily reproduced. Our contribution is a framework that enables future research comparing DaaT and SaaT methods in the context of modern neural retrieval models.
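The difference between the two traversal orders can be sketched on a toy in-memory index. This is an illustrative simplification, not the JASS or Lucene implementation: the index contents, function names, and the `budget` parameter are assumptions, and real systems add compression, skipping, and dynamic pruning.

```python
import heapq
from collections import defaultdict

# Toy index: term -> list of (doc_id, impact) postings.
INDEX = {
    "neural": [(1, 3), (2, 1), (4, 2)],
    "ranking": [(2, 4), (3, 1), (4, 1)],
}

def top_k(scored, k):
    """Sort (score, doc_id) pairs by descending score, then ascending doc_id."""
    return [(d, s) for s, d in sorted(scored, key=lambda x: (-x[0], x[1]))[:k]]

def daat(index, query, k):
    """Document-at-a-time: walk all postings lists in doc_id order and
    finish scoring one document before moving to the next."""
    postings = [sorted(index.get(t, [])) for t in query]
    scored, current, score = [], None, 0
    for doc_id, impact in heapq.merge(*postings):
        if doc_id != current:
            if current is not None:
                scored.append((score, current))
            current, score = doc_id, 0
        score += impact
    if current is not None:
        scored.append((score, current))
    return top_k(scored, k)

def saat(index, query, k, budget=None):
    """Score-at-a-time: process postings from highest to lowest impact,
    adding into per-document accumulators. `budget` optionally caps the
    number of postings processed, giving an approximate early answer."""
    postings = [(impact, doc_id) for t in query for doc_id, impact in index.get(t, [])]
    postings.sort(key=lambda p: -p[0])
    accumulators = defaultdict(int)
    for impact, doc_id in postings[:budget]:
        accumulators[doc_id] += impact
    return top_k([(s, d) for d, s in accumulators.items()], k)

if __name__ == "__main__":
    query = ["neural", "ranking"]
    print(daat(INDEX, query, 2))
    print(saat(INDEX, query, 2))
    print(saat(INDEX, query, 2, budget=2))
```

When every posting is processed, both strategies produce the same ranking; the impact-ordered traversal additionally allows stopping early, at the cost of approximate scores.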
The process of model checkpoint validation refers to the evaluation of the performance of a model checkpoint executed on a held-out portion of the training data while learning the hyperparameters of the model. This model checkpoint validation process is used to avoid over-fitting and determine when the model has converged so as to stop training. A simple and efficient strategy to validate deep learning checkpoints is the addition of validation loops to execute during training. However, the validation of dense retriever (DR) checkpoints is not as trivial, and the addition of validation loops is not efficient. This is because, in order to accurately evaluate the performance of a DR checkpoint, the whole document corpus needs to be encoded into vectors using the current checkpoint before any actual retrieval operation for checkpoint validation can be performed. This corpus encoding process can be very time-consuming if the document corpus contains millions of documents (e.g., 8.8M for MS MARCO v1 and 21M for Natural Questions). Thus, a naïve use of validation loops during training will significantly increase training time. To address this issue, we propose Asyncval: a Python-based toolkit for efficiently validating DR checkpoints during training. Instead of pausing the training loop for validating DR checkpoints, Asyncval decouples the validation loop from the training loop and uses another GPU to automatically validate new DR checkpoints, thus allowing validation to be performed asynchronously from training. Asyncval also implements a range of different corpus subset sampling strategies for validating DR checkpoints; these strategies further speed up the validation process. We provide an investigation of these methods in terms of their impact on validation time and validation fidelity. Asyncval is made available as an open-source project at https://github.com/ielab/asyncval.
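The decoupling of validation from training can be sketched as a separate process that polls the checkpoint directory. This is a minimal illustration and not Asyncval's actual interface: the function names, the `checkpoint-` naming convention, the polling parameters, and the particular subset sampling strategy (keep all judged documents plus a random sample of the rest) are assumptions. The `validate` callable is a placeholder for encoding the corpus subset and running retrieval with the given checkpoint.

```python
import os
import random
import time

def new_checkpoints(ckpt_dir, seen):
    """List checkpoint entries in ckpt_dir that have not been validated yet."""
    return sorted(
        name for name in os.listdir(ckpt_dir)
        if name.startswith("checkpoint-") and name not in seen
    )

def sample_corpus_subset(corpus_ids, judged_ids, sample_size, seed=0):
    """Hypothetical strategy: all judged documents plus a random sample of others."""
    rng = random.Random(seed)
    judged = set(judged_ids)
    rest = [d for d in corpus_ids if d not in judged]
    extra = rng.sample(rest, min(sample_size, len(rest)))
    return sorted(judged | set(extra))

def validation_loop(ckpt_dir, validate, poll_seconds=1.0, max_polls=None):
    """Poll for new checkpoints and validate each one exactly once.

    Intended to run in its own process (for example on a second GPU),
    so the training loop never has to pause for validation.
    """
    seen, results, polls = set(), {}, 0
    while max_polls is None or polls < max_polls:
        for name in new_checkpoints(ckpt_dir, seen):
            results[name] = validate(os.path.join(ckpt_dir, name))
            seen.add(name)
        polls += 1
        time.sleep(poll_seconds)
    return results
```

Validating on a sampled subset trades some fidelity for speed, which is the trade-off the abstract says is investigated.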
The role of conversational assistants continues to evolve, beyond simple voice commands to ones that support rich and complex tasks in the home, car, and even virtual reality. Going beyond simple voice command and control requires agents and datasets blending structured dialogue, information seeking, grounded reasoning, and contextual question-answering in a multimodal environment with rich image and video content. In this demo, we introduce Task-oriented Multimodal Agent Dialogue (TaskMAD), a new platform that supports the creation of interactive multimodal and task-centric datasets in a Wizard-of-Oz experimental setup. TaskMAD includes support for text and voice, federated retrieval from text and knowledge bases, and structured logging of interactions for offline labeling. Its architecture supports a spectrum of tasks that span open-domain exploratory search to traditional frame-based dialogue tasks. It's open-source and offers rich capability as a platform used to collect data for the Amazon Alexa Prize Taskbot challenge, TREC Conversational Assistance track, undergraduate student research, and others. TaskMAD is distributed under the MIT license.
In this work, we present the Golden Retriever, a system leveraging state-of-the-art visio-linguistic models (VLMs) for real-time text-image retrieval. The unique feature of our system is that it can focus on words contained in the textual query, i.e., locate and highlight them within retrieved images. An efficient two-stage process provides both the real-time capability and the ability to focus. In the first stage, we drastically reduce the number of images processed by a VLM. Then, in the second stage, we rank the images and highlight the focussed word using the outputs of a VLM. Further, we introduce a new and efficient algorithm based on the idea of TF-IDF to retrieve images for short textual queries. One of multiple use cases where we employ the Golden Retriever is a language learner scenario, where visual cues for "difficult" words within sentences are provided to improve a user's reading comprehension. However, since the backend is completely decoupled from the frontend, the system can be integrated into any other application where images must be retrieved quickly. We demonstrate the Golden Retriever with screenshots of a minimalistic user interface.
To satisfy users' comprehensive information needs, retrieval systems that cross language boundaries are indispensable in today's increasingly multilingual digital space. This work presents the first-of-its-kind Bilingual Text Retrieval Explanations (BiTe-REx) aimed at users performing competitor or wage analysis in the automotive domain. BiTe-REx helps users gather a more comprehensive picture of their query by retrieving results regardless of the query language and enables them to make a more informed decision by exposing how the underlying model judges the relevance of documents. With a user study, we demonstrate statistically significant results on the understandability and helpfulness of the explanations provided by the system.
Technology-assisted review (TAR) is an important industrial application of information retrieval (IR) and machine learning (ML). While a small TAR research community exists, the complexity of TAR software and workflows is a major barrier to entry. Drawing on past open source TAR efforts, as well as design patterns from the IR and ML open source software, we present an open source Python framework for conducting experiments on TAR algorithms. Key characteristics of this framework are declarative representations of workflows and experiment plans, the ability for components to play variable numbers of workflow roles, and state maintenance and restart capabilities. Users can draw on reference implementations of standard TAR algorithms while incorporating novel components to explore their research interests. The framework is available at https://github.com/eugene-yang/tarexp.
Entity Matching (EM) aims to find data instances from different sources that refer to the same real-world entity. The existing EM techniques can be either costly or tailored for a specific data type. We present ZeroMatcher, a cost-off entity matching system, which supports (i) handling EM tasks with different data types, including relational tables and knowledge graphs; (ii) keeping its EM performance always competitive by enabling the sub-modules to be updated in a lightweight manner, thus reducing development costs; and (iii) performing EM without human annotations to further slash the labor costs. First, ZeroMatcher automatically suggests to users a set of appropriate modules for EM according to the data types of the input datasets. Users can specify the modules for the subsequent EM process according to their preferences. Alternatively, users are able to customize the modules of ZeroMatcher. Then, the system proceeds to the EM task, where users can track the entire EM process and monitor the memory usage changes in real time. When the EM process is completed, ZeroMatcher visualizes the EM results from different aspects to make them easier for users to understand. Finally, ZeroMatcher provides EM results evaluation, enabling users to compare the effectiveness of different parameter settings.
Data scientists are constantly facing the problem of how to improve prediction accuracy with insufficient tabular data. We propose a table enrichment system that enriches a query table by adding external attributes (columns) from data lakes and improves the accuracy of machine learning predictive models. Our system has four stages, join row search, task-related table selection, row and column alignment, and feature selection and evaluation, to efficiently create an enriched table for a given query table and a specified machine learning task. We demonstrate our system with a web UI to show the use cases of table enrichment.
Quantities shape our understanding of measures and values, and they are an important means to communicate the properties of objects. Often, search queries contain numbers as retrieval units, e.g., "iPhone that costs less than 800 Euros". Yet, modern search engines lack a proper understanding of numbers and units. In queries and documents, search engines handle them as normal keywords and therefore are ignorant of relative conditions between numbers, such as greater than or less than, or, more generally, the numerical proximity of quantities. In this work, we demonstrate QFinder, our quantity-centric framework for ranking search results for queries with quantity constraints. We also open-source our new ranking method as an Elasticsearch plug-in for future use. Our demo is available at: https://qfinder.ifi.uni-heidelberg.de/
Image retrieval from generative adversarial networks (GANs) is challenging for several reasons. First, there are no clear mappings between the GAN's latent space and useful semantic features, making it difficult for users to navigate. Second, the number of unique images that can be generated is exceptionally high, taxing the scaling properties of existing search algorithms. In this article, we present ROGUE, a system to support exploratory search of images generated from GANs. We demonstrate how to implement features that are commonly found in exploratory search interfaces, such as faceted search and relevance feedback, in the context of GAN search. We additionally use reinforcement learning to help users navigate the image space, trading off exploration (showing diverse images) and exploitation (showing images predicted to receive positive relevance feedback). Finally, we present a usability study where participants were situated in the role of a casting director who needs to explore actors' headshots for an upcoming movie. The system obtained an average SUS score of 72.8 and all participants reported being either satisfied or very satisfied with the images they identified with the system. The system is shown in this accompanying video: https://vimeo.com/680036160.
In this demo paper, we present CHERCHE, a new open-source Python module for building information retrieval pipelines with transformers. Our aim is to offer an easy-to-plug-in tool capable of executing simple but strong state-of-the-art information retrieval models. To do so, we have integrated classical models based on lexical matching as well as recent models based on semantic matching. Indeed, a large number of models available on public hubs can now be tested on information retrieval tasks with only a few lines of code. CHERCHE is aimed at newcomers to the neural information retrieval field who want to use transformer-based models on small collections without struggling with heavy tools. The code and documentation of CHERCHE are publicly available at https://github.com/raphaelsty/cherche
Despite more than two decades of research on temporal tagging and temporal relation extraction, usable tools for annotating text remain very basic and hard to set up from an average end-user perspective, limiting the applicability of developments to a selected group of invested researchers. In this work, we aim to increase the accessibility of temporal tagging systems by presenting an intuitive web interface, called "Online DATEing", which simplifies the interaction with existing temporal annotation frameworks. Our system integrates several approaches in a single interface and streamlines the process of importing (and tagging) groups of documents, as well as making it accessible through a programmatic API. It further enables users to interactively investigate and visualize tagged texts, and is designed with an extensible API for the inclusion of new models or data formats. A web demonstration of our tool is available at https://onlinedating.ifi.uni-heidelberg.de and public code accessible at https://github.com/satya77/Temporal_Tagger_Service.
Are Taylor's Posts Risky? Evaluating Cumulative Revelations in Online Personal Data: A persona-based tool for evaluating awareness of online risks and harms
Searching for people online is a common search task that most of us have performed at some point or other. With so much information about people available online it is often amazing what one can find out about someone else -- especially when information taken from different sources is pieced together to create a more detailed picture of the individual, and then used to make inferences about them (leading to cumulative revelations). As such, the relevance of one piece of information is often conditional and dependent on other pieces of information found. This creates interesting and novel challenges in evaluating information relevance when searching personal profiles, posts and related information about an individual, as well as the potential risks that can arise from such revelations. In this demonstration paper, we present a tool designed to investigate how people assess and judge the relevance and potential risks of small, apparently innocuous pieces of information associated with fictitious personas, such as Taylor Addison, when searching and browsing online profiles and social media. The demonstrator also comprises a cyber-safety tool, which aims to provide education and raise awareness of the potential risks of cumulative revelations. It does so by engaging participants in different scenarios where the relevance of individual information items depends on the searcher and their particular underlying motivation.
We present LawNet-Viz, a web-based tool for the modeling, analysis and visualization of law reference networks extracted from a statute law corpus. LawNet-Viz is designed to support legal research tasks and help legal professionals as well as laymen visually explore the article connections built upon the explicit law references detected in the article contents. To demonstrate LawNet-Viz, we show its application to the Italian Civil Code (ICC), which exploits a recent BERT-based model fine-tuned on the ICC. LawNet-Viz is a system prototype that is planned for product development.
We present SpaceQA, to the best of our knowledge the first open-domain QA system in Space mission design. SpaceQA is part of an initiative by the European Space Agency (ESA) to facilitate the access, sharing and reuse of information about Space mission design within the agency and with the public. We adopt a state-of-the-art architecture consisting of a dense retriever and a neural reader and opt for an approach based on transfer learning rather than fine-tuning due to the lack of domain-specific annotated data. Our evaluation on a test set produced by ESA is largely consistent with the results originally reported by the evaluated retrievers and confirms the need for fine-tuning for reading comprehension. At the time of writing, ESA is piloting SpaceQA internally.
Professional news media organizations have always touted the importance that they give to multiple perspectives. However, in practice, the traditional approach to all-sides has favored people in the dominant culture. Hence it has come under ethical critique under the new norms of diversity, equity, and inclusion (DEI). When DEI is applied to journalism, it goes beyond conventional notions of impartiality and bias and instead democratizes the journalistic practice of sourcing -- who is quoted or interviewed, who is not, how often, from which demographic group, gender, and so forth. There is currently no real-time or on-demand tool in the hands of reporters to analyze the persons they quote. In this paper, we present DIANES, a DEI Audit Toolkit for News Sources. It consists of a natural language processing pipeline on the backend to extract quotes, speakers, titles, and organizations from news articles in real time. On the frontend, DIANES offers WordPress plugins, a Web monitor, and a DEI annotation API service, to help news media monitor their own quoting patterns and push themselves towards DEI norms.
Finding relevant literature is crucial for biomedical research and in the practice of evidence-based medicine, making biomedical search an important application area within the field of information retrieval. This has been recognised by the broader IR community, and in particular by the organisers of the Text Retrieval Conference (TREC), as early as 2003. While TREC provides crucial evaluation resources, to get started in biomedical IR one needs to tackle an important software engineering hurdle of parsing, indexing, and deploying several large document collections. Moreover, many newcomers to the field often face a steep learning curve, where theoretical concepts are tangled up with technical aspects. Finally, many of the existing baselines and systems are difficult to reproduce.
We aim to alleviate all three of these bottlenecks with the launch of A2A-API. It is a RESTful API which serves as an easy-to-use and programming-language-independent interface to existing biomedical TREC collections. It builds upon A2A, our system for biomedical information retrieval benchmarking, and extends it with additional functionalities. Apart from providing programmatic access to the features of the original A2A system - focused principally on benchmarking - A2A-API supports biomedical IR researchers in development of systems featuring reranking and query reformulation components. In this demonstration, we illustrate the capabilities of A2A-API with comprehensive use cases.
NeuralKG is an open-source Python-based library for diverse representation learning of knowledge graphs. It implements three kinds of Knowledge Graph Embedding (KGE) methods, including conventional KGEs, GNN-based KGEs, and Rule-based KGEs. With a unified framework, NeuralKG successfully reproduces link prediction results of these methods on benchmarks, freeing users from the laborious task of reimplementing them, especially for some methods originally written in non-Python programming languages. Besides, NeuralKG is highly configurable and extensible. It provides various decoupled modules that can be mixed and adapted to each other. Thus with NeuralKG, developers and researchers can quickly implement their own designed models and obtain the optimal training methods to achieve the best performance efficiently. We built a website http://neuralkg.zjukg.org to organize an open and shared KG representation learning community. The library, experimental methodologies, and model reimplementation results of NeuralKG are all publicly released at https://github.com/zjukg/NeuralKG.
Information retrieval (IR) evaluation can be considered as a form of competition in matching documents and queries. This paper introduces a learning environment based on gamification of query construction for document retrieval, called IRVILAB (Information Retrieval Virtual Lab). The lab has modules for creating standard evaluation settings, one for topic creation including relevance assessments and another for performance evaluation of user queries. In addition, a multilingual online Wikipedia collection enables a module in which relevance assessments are translated into other languages. The underlying game utilizes IR performance metrics to measure and give feedback on participants' information retrieval performance. It aims to improve participants' search skills and subject knowledge, and it contributes to science education by introducing an experimental method. Distinctive features of the system include algorithmic relevance assessments and automatic recall base translation.
Mobile internet users constantly generate personal data on their devices. In this paper, we demonstrate a novel system for integrating the data of a user from different sources into a Personal Knowledge Graph (PKG). We show how a user's intention can be detected and how the personal data can be aligned and connected through the user's behaviors. The constructed PKG allows the system to make reasonable and accurate recommendations for users by a "neural + symbolic" approach across different services. Our system is shown at https://youtu.be/hWuo8KCDrto.
PISA (Performant Indexes and Search for Academia) provides very efficient implementations of various retrieval algorithms over sparse inverted indices. The highly-optimized C++ implementation, however, has previously only been accessible via command line tools. From indexing to retrieval, 5--6 commands need to be executed in sequence, making the process relatively involved. Further complications when using PISA include a lengthy build process and minimal interoperability with other tools. In this work, we demonstrate a new tool that provides a native Python wrapper around PISA. The wrapper features a simplified interface that adheres to the PyTerrier API, making it easy to use (e.g., via Pandas DataFrames), apply to a multitude of datasets (e.g., those from the ir_datasets package) and combine with other methods (e.g., neural re-ranking and dense retrieval methods).
Arm: Efficient Learning of Neural Retrieval Models with Desired Accuracy by Automatic Knowledge Amalgamation
In recent years, there has been increasing interest in adopting published neural retrieval models learned from corpora for text retrieval. Although these models achieve excellent retrieval performance, in terms of popular accuracy metrics, on the datasets on which they were trained, their performance on new text data might degrade. To obtain the desired retrieval performance on both the data used in training and the latest data collected after training, the simple approach of learning a new model from both datasets is not always feasible since the annotated dataset used in training is often not published along with the learned model. Knowledge amalgamation (KA) is an emerging technique to deal with this problem of inaccessibility of data used in previous training. KA learns a new model (called a student model) from new data by reusing (called amalgamating) a number of trained models (called teacher models) instead of accessing the teachers' original training data. However, in order to efficiently learn an accurate student model, the classical KA approach requires manual selection of an appropriate subset of teacher models for amalgamation. This manual procedure for selecting teacher models prevents the classical KA from being scaled to retrieval tasks for which a large number of candidate teacher models are ready to be reused.
This paper presents Arm, an intelligent system for efficiently learning a neural retrieval model with the desired accuracy on incoming data by automatically amalgamating a subset of teacher models (called a teacher model combination or simply combination) among a large number of teacher models. To filter combinations that fail to produce accurate student models, Arm employs Bayesian optimization to derive an accuracy prediction model based on sampled amalgamation tasks. Then, Arm uses the derived prediction model to exclude unqualified combinations without training the remaining combinations.
To speed up training, Arm introduces a cost model that picks the teacher model combination with the minimal training cost among all qualified teacher model combinations to produce the final student model. This paper will demonstrate the major workflow of Arm and present the produced student models to users.
The use of attributed quotes is the most direct and least filtered pathway of information propagation in news. Consequently, quotes play a central role in the conception, reception, and analysis of news stories. Since quotes provide a more direct window into a speaker's mind than regular reporting, they are a valuable resource for journalists and researchers alike. While substantial research efforts have been devoted to methods for the automated extraction of quotes from news and their attribution to speakers, few comprehensive corpora of attributed quotes from contemporary sources are available to the public. Here, we present an adaptive web interface for searching Quotebank, a massive collection of quotes from the news, which we make available at https://quotebank.dlab.tools.
SESSION: SIRIP Papers
Learning-to-Rank (LTR) is widely used in many Information Retrieval (IR) scenarios, including web search and Location Based Services (LBS) search. However, most existing LTR techniques mainly focus on homogeneous ranking. Taking query auto-completion (QAC) in Dianping search as an example, heterogeneous documents including suggested queries (SQ) and Points-of-Interest (POI) need to be ranked and presented to enhance user experience. Heterogeneous ranking raises new challenges, including an inconsistent feature space and more serious position bias caused by distinct representation spaces. Therefore, we propose Deep Debiasing Experts Network (DDEN), a novel heterogeneous LTR approach based on a Mixture-of-Experts architecture and a gating network, to deal with the inconsistent feature space of documents in a ranking system. Furthermore, DDEN mitigates the position bias by adopting an adversarial debiasing framework embedded with heterogeneous LTR techniques. We conduct reproducible experiments on industrial datasets from Dianping, one of the largest local life platforms, and deploy DDEN in an online application. Results show that DDEN substantially improves ranking performance in offline evaluation and boosts the overall click-through rate in an online A/B test by 2.1%.
ClueWeb22, the newest iteration of the ClueWeb line of datasets, is the result of more than a year of collaboration between industry and academia. Its design is influenced by the research needs of the academic community and the real-world needs of large-scale industry systems. Compared with earlier ClueWeb datasets, the ClueWeb22 corpus is larger, more varied, and has higher-quality documents. Its core is raw HTML, but it includes clean text versions of documents to lower the barrier to entry. Several aspects of ClueWeb22 are available to the research community for the first time at this scale, for example, visual representations of rendered web pages, parsed structured information from the HTML document, and the alignment of document distributions (domains, languages, and topics) to commercial web search.
This talk shares the design and construction of ClueWeb22, and discusses its new features. We believe this newer, larger, and richer ClueWeb corpus will enable and support a broad range of research in IR, NLP, and deep learning.
An Auto Encoder-based Dimensionality Reduction Technique for Efficient Entity Linking in Business Phone Conversations
An entity linking system links named entities in a text to their corresponding entries in a knowledge base. In recent years, building an entity linking system that leverages the transformer architecture has gained a lot of attention. However, deploying a transformer-based neural entity linking system in industrial production environments in a limited resource setting is a challenging task. In this work, we present an entity linking system that leverages a transformer-based BERT encoder (the BLINK model) to connect the product and organization type entities in business phone conversations to their corresponding Wikipedia entries. We propose a dimensionality reduction technique that uses an auto encoder to compress the dimension of the pre-trained BERT embeddings to 256 from the original size of 1024. This allows our entity linking system to significantly reduce its space requirement when deployed on a resource-limited cloud machine, while also reducing inference time and retaining high accuracy.
In its most basic form, advertising video production communicates a message about a product or service to the public. In the age of digital marketing, the most popular way to connect with audiences is through advertising videos. However, advertising video production is a costly and complicated process, from creation, material shooting, and editing to the final commercial video. Therefore, producing qualified advertising videos is a capital- and talent-intensive task, which poses a huge challenge for start-ups or inexperienced ad creators. This paper proposes an intelligent advertising video production system driven by multi-modal retrieval, which only requires the input of descriptive copy. This system can automatically generate scripts, then extract key queries, retrieve related short video materials in the video library, and finally synthesize short advertising videos. The whole process minimizes human input, greatly lowers the threshold for advertising video production and greatly improves output and efficiency. It has a modular design to encourage the study of new multi-modal algorithms, which can be evaluated in batch mode. It can also integrate with a user interface, which allows user studies and data collection in an interactive mode, where the back end can be fully algorithmic or a Wizard-of-Oz setup. The proposed system has been fully verified and has broad prospects in the production of short videos for commodity advertisements within Alibaba.
Large-scale search engines are often designed as tiered systems with at least two layers. The L1 candidate retrieval layer efficiently generates a subset of potentially relevant documents (typically ~1000 documents) from a corpus many orders of magnitude larger in size. L1 systems emphasize efficiency and are designed to maximize recall. The L2 re-ranking layer uses a more computationally expensive, but more accurate model (e.g. learning-to-rank or neural model) to re-rank the candidates generated by L1 in order to maximize precision of the final result list.
Traditionally, candidate retrieval was performed with an inverted index data structure, with exact lexical matching. Candidates are ordered by a dot-product-like scoring function f(q,d) where q and d are sparse vectors containing token weights, typically derived from the token's frequency in the document/query and corpus. The inverted index enables sub-linear ranking of the documents. Due to the sparse vector representation of the documents and queries, lexical match retrieval systems have also been called sparse retrieval.
In contrast, dense retrieval represents queries and documents by embedding the text into lower-dimensional dense vectors. Candidate documents are scored based on the distance between the query and document embedding vectors. In practice, the similarity computations are made efficient with approximate k-nearest neighbours (ANN) systems.
In this panel, we bring together experts in dense retrieval across multiple industry applications, including web search, enterprise and personal search, e-commerce, and out-of-domain retrieval.
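The two candidate-scoring schemes contrasted in this abstract can be sketched side by side. This is a toy illustration under stated assumptions: the function names and data layout are hypothetical, the sparse part implements a dot-product f(q,d) over an inverted index, and the dense part uses exhaustive Euclidean distance where a production system would use an ANN index.

```python
import math
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each token to the (doc_id, weight) pairs of documents containing it."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for token, weight in vector.items():
            index[token].append((doc_id, weight))
    return index


def sparse_scores(index, query_vector):
    """f(q, d) = sum of q[token] * d[token] over tokens shared by q and d.

    Only postings of query tokens are visited, so documents that share
    no token with the query are never touched.
    """
    scores = defaultdict(float)
    for token, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return dict(scores)


def dense_scores(doc_embeddings, query_embedding):
    """Exhaustive Euclidean distance to every document; smaller is closer."""
    return {
        doc_id: math.dist(query_embedding, embedding)
        for doc_id, embedding in doc_embeddings.items()
    }
```

For example, with documents `{"d1": {"cat": 2.0, "sat": 1.0}, "d2": {"dog": 3.0, "sat": 1.0}}` and the query `{"cat": 1.0, "sat": 0.5}`, `sparse_scores` gives `d1` a score of 2.5 and `d2` a score of 0.5. The sparse scorer does work proportional to the postings of the query tokens, whereas the brute-force dense scorer touches every document, which is why ANN search is used at scale.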
The mission of major e-commerce platforms is to enable their customers to find the best products for their needs. In the common case of large inventories, complex User Interfaces (UIs) are required to allow a seamless navigation. However, as UIs often contain many widgets of different relevance, the task of constructing an optimal layout arises in order to improve the customer's experience. This is a challenging task, especially in the typical industrial setup where multiple independent teams conflict by adding and modifying UI widgets. It becomes even more challenging due to the customer preferences evolving over time, bringing the need for adaptive solutions. In a previous work, we addressed this task by introducing a UI governance framework powered by Machine Learning (ML) algorithms that automatically and continuously search for the optimal layout. Nevertheless, we highlighted that naive algorithmic choices exhibit several issues when implemented in the industry, such as widget dependency, combinatorial solution space and cold start problem. In this work, we demonstrate how we deal with these issues using Combinatorial Bandits, an extension of Multi-Armed Bandits (MAB) where the agent selects not only one but multiple arms at the same time. We develop two novel approaches to model combinatorial bandits, inspired by the Natural Language Processing (NLP) and the Evolutionary Algorithms (EA) fields and present their ability to enable scalable UI optimization.
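To make the notion of selecting multiple arms per round concrete, here is a minimal baseline sketch. It is not one of the NLP- or EA-inspired approaches described in the abstract: it is plain Beta-Bernoulli Thompson sampling that plays the top-k sampled arms, and it treats widgets as independent, so it ignores the widget dependency issue the abstract mentions. The class name, widget names, and click probabilities are hypothetical.

```python
import random


class TopKThompsonBandit:
    """Beta-Bernoulli Thompson sampling that plays k arms per round."""

    def __init__(self, arms, k, seed=0):
        self.arms = list(arms)
        self.k = k
        self.rng = random.Random(seed)
        # Uniform Beta(1, 1) prior on each arm's success probability.
        self.successes = {arm: 1 for arm in self.arms}
        self.failures = {arm: 1 for arm in self.arms}

    def select(self):
        """Sample a plausible success rate per arm and play the k highest."""
        samples = {
            arm: self.rng.betavariate(self.successes[arm], self.failures[arm])
            for arm in self.arms
        }
        ranked = sorted(self.arms, key=lambda arm: samples[arm], reverse=True)
        return ranked[: self.k]

    def update(self, rewards):
        """Record one binary reward for each arm that was played."""
        for arm, reward in rewards.items():
            if reward:
                self.successes[arm] += 1
            else:
                self.failures[arm] += 1


def simulate(rounds=2000, seed=0):
    # Hypothetical per-widget click probabilities, used only for simulation.
    click_prob = {
        "search_bar": 0.30,
        "filters": 0.20,
        "banner": 0.02,
        "footer_links": 0.01,
    }
    bandit = TopKThompsonBandit(click_prob, k=2, seed=seed)
    environment = random.Random(seed + 1)
    for _ in range(rounds):
        layout = bandit.select()
        bandit.update({w: environment.random() < click_prob[w] for w in layout})
    return bandit
```

Each round the bandit fills two layout slots and receives one click signal per displayed widget. Over many rounds the posterior counts concentrate on the widgets that are clicked most often, while the sampling step keeps occasionally exploring the others.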
Information retrieval has traditionally been framed in terms of searching and extracting information from mostly static resources. Interactive information retrieval (IIR) has widened the scope, with interactive dialogues largely playing the role of clarifying (i.e., making explicit, and/or refining) the information search space. Informed by market research practices, we seek to reframe IIR as a process of eliciting novel information from human interlocutors, with a chatbot-inspired virtual agent playing the role of an interviewer. This reframing flips conventional IIR into what we call an inverse information seeking dialogue, wherein the virtual agent recurrently extracts information from human utterances and poses questions intended to elicit related information. In this work, we introduce and provide a formal definition of an inverse information seeking agent, outline some of its unique challenges, and propose our novel framework to tackle this problem based on techniques from natural language processing (NLP) and IIR.
Information Ecosystem Threats in Minoritized Communities: Challenges, Open Problems and Research Directions
Journalists, fact-checkers, academics, and community media are overwhelmed in their attempts to support communities suffering from gender-, race- and ethnicity-targeted information ecosystem threats, including but not limited to misinformation, hate speech, weaponized controversy and online-to-offline harassment. Yet, for a plethora of reasons, minoritized groups are underserved by current approaches to combat such threats. In this panel, we will present and discuss the challenges and open problems facing such communities and the researchers hoping to serve them. We will also discuss the current state of the art as well as the most promising future directions, within IR specifically, across Computer Science more broadly, and in work requiring transdisciplinary and cross-sectoral collaborations. The panel will attract both IR practitioners and researchers and include at least one panelist outside of IR, with unique expertise in this space.
Extractive search has been used to create datasets matching queries and syntactic patterns, but less attention has been paid to what to do with those datasets. We present a two-stage system targeted at biomedical texts. First, it creates custom datasets using a powerful mix of keyword and syntactic matching. Second, over those results, it returns lists of related words, provides semantic search, and trains a large language model, a QA model based on synthetic data, a summarization model, and so on. These are then used in downstream biomedical work.
A significant challenge in the legal domain is to organize and summarize a constantly growing collection of legal documents, uncovering hidden topics, or themes, that later can support tasks such as legal case retrieval and legal judgment prediction. This massive amount of digital legal documents, combined with the inherent complexity of judiciary systems worldwide, presents a promising scenario for Machine Learning solutions, mainly those taking advantage of all the advancements in the area of Natural Language Processing (NLP). It is in this scenario that Jusbrasil, the largest legal tech company in Brazil, is situated. Using a dataset partially curated by the Jusbrasil legal team, we explore topic modeling solutions using state of the art language models, trained with legal Portuguese documents, to automatically organize and summarize this complex collection of documents. Instead of using an entire legal case, which usually is composed of many pages, we show that it is possible to efficiently organize the collection using the syllabus (in Portuguese, ementa jurisprudencial) from each court decision as they concisely summarize the main points presented by the entire decision.
In an instant search setting such as Netflix Search, where results are returned in response to every keystroke, determining how a partial query maps onto broad classes of relevant entities or facets (such as videos, talent, and genres) can facilitate a better understanding of the underlying objective of that query. Such a query-to-facet mapping system has a multitude of applications. It can help improve the quality of search results, drive meaningful result organization, and can be leveraged to establish trust by being transparent with Netflix members when they search for an entity that is not available on the service. By anticipating the relevant facets with each keystroke entry, the system can also better guide the experience within a search session. When aggregated across queries, the facets can reveal interesting patterns of member interest. A key challenge for building such a system is to judiciously balance lexical similarity with behavioral relevance. In this paper, we present a high level overview of a Query Facet Mapping system that we have developed at Netflix, describe its main components, provide evaluation results with real-world data, and outline several potential applications.
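The balance between lexical similarity and behavioral relevance can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny catalog, the click counts, the token-prefix match used as the lexical signal, and the mixing weight `alpha`. It is not the system described in the abstract.

```python
from collections import defaultdict

# Hypothetical catalog: entity name -> facet.
CATALOG = {
    "Stranger Things": "video",
    "Steven Spielberg": "talent",
    "Stand-up Comedy": "genre",
}
# Hypothetical click counts observed for (partial query, entity) pairs.
CLICKS = {
    ("st", "Stranger Things"): 80,
    ("st", "Stand-up Comedy"): 15,
    ("st", "Steven Spielberg"): 5,
}


def facet_scores(partial_query, alpha=0.5):
    """Return a normalized score per facet for a partial query.

    alpha weights the lexical signal; (1 - alpha) weights the behavioral one.
    """
    q = partial_query.lower()
    total_clicks = sum(n for (pq, _), n in CLICKS.items() if pq == q)
    scores = defaultdict(float)
    for entity, facet in CATALOG.items():
        # Lexical signal: does any token of the entity name start with the query?
        lexical = 1.0 if any(tok.startswith(q) for tok in entity.lower().split()) else 0.0
        # Behavioral signal: share of clicks this entity received for the query.
        behavioral = CLICKS.get((q, entity), 0) / total_clicks if total_clicks else 0.0
        scores[facet] += alpha * lexical + (1 - alpha) * behavioral
    norm = sum(scores.values())
    return {f: s / norm for f, s in scores.items()} if norm else dict(scores)


scores = facet_scores("st")
top_facet = max(scores, key=scores.get)
```

In this toy setup all three entities match "st" lexically, so the click shares alone decide which facet ranks first.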
A Low-Cost, Controllable and Interpretable Task-Oriented Chatbot: With Real-World After-Sale Services as Example
Though widely used in industry, traditional task-oriented dialogue systems suffer from three bottlenecks: (i) difficult ontology construction (e.g., intents and slots); (ii) poor controllability and interpretability; (iii) a heavy need for annotation. In this paper, we propose to represent utterances with a simpler concept named Dialogue Action, upon which we construct a tree-structured TaskFlow and further build a task-oriented chatbot with TaskFlow as its core component. A framework is presented to automatically construct TaskFlow from large-scale dialogues and deploy it online. Our experiments on real-world after-sale customer services show that TaskFlow can satisfy the major needs and effectively reduce the developer burden.
Recommendation systems suffer from the cold-start problem when observed user-item interactions are insufficient. To alleviate this problem, most existing works aim to learn globally shared prior knowledge across all items that can be quickly adapted to a new item with few interactions. However, such learning techniques are data demanding and work poorly on new items with no interactions. In this applied paper, we present an industrial framework recently deployed on Alipay to address the item cold-start problem in zero-shot scenarios. The proposed framework provides both efficient and high-quality recommendations for cold items with no log data. Specifically, we formulate the cold-start problem as a zero-shot learning problem and build a highly efficient infrastructure to accomplish online zero-shot recommendations used on large-scale platforms. Extensive offline experiments and online A/B testing demonstrate that the proposed framework has superior performance and recommends cold items to preferred users more effectively than other state-of-the-art methods.
The title of a product offering is the consolidation of a product's characteristics in textual format for user consumption. The low quality of the textual content of a product's title can negatively influence the entire shopping experience. The negative experience can start with the impossibility of discovering a desired product, going from problems in identifying a product and its characteristics up to the purchase of an unwanted item. A solution to this problem is to establish an indicator that automatically describes the quality of the product title. With this assessment, it is possible to notify sellers who have registered products with poor quality titles and encourage revisions or suggest improvements. The focus of this work is to show how it is possible to assign a score that indicates the descriptive quality of product offers in an e-commerce marketplace environment using unsupervised methods.
Learning to Rank Instant Search Results with Multiple Indices: A Case Study in Search Aggregation for Entertainment
At Xfinity, an instant search system provides a variety of results for a given query from different sources. For each keystroke, new results are rendered on screen to the user, which could contain movies, television series, sporting events, music videos, news clips, person pages, and other result types. Users are also able to use the Xfinity Voice Remote to submit longer queries, some of which are more open-ended. Examples of queries include incomplete words which match multiple results through lexical matching (e.g., "ali"), topical searches ("vampire movies"), and more specific longer searches ("Movies with Adam Sandler"). Since results can be based on lexical matches, semantic matches, item-to-item similarity matches, or a variety of business logic driven sources, a key challenge is how to combine results into a single list. To accomplish this, we propose merging the lists via a Learning to Rank (LTR) neural model which takes into account the search query. This combined list can be personalized via a second LTR neural model with knowledge of the user's search history and metadata of the programs. Because instant search is under-represented in the literature, we present our learnings from research to aid other practitioners.
Recently, retrieval-augmented text generation has achieved state-of-the-art performance in many NLP tasks and has attracted increasing attention from the NLP and IR communities. This tutorial therefore aims to present recent advances in retrieval-augmented text generation comprehensively and comparatively. It first highlights the generic paradigm of retrieval-augmented text generation, then reviews notable works for different text generation tasks including dialogue generation, machine translation, and other generation tasks, and finally points out some limitations and shortcomings to facilitate future research.
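The idea of merging per-source result lists with a query-aware scorer can be sketched in Python as follows. This is only an illustration under stated assumptions: the hand-weighted linear `score` function stands in for the learned LTR neural model, and the source names, titles, and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    source: str          # hypothetical source name, e.g. "lexical" or "semantic"
    source_score: float  # score assigned by the originating index


def score(query: str, cand: Candidate) -> float:
    """Stand-in for the learned model: a hand-weighted linear scorer."""
    prefix_match = 1.0 if cand.title.lower().startswith(query.lower()) else 0.0
    return 0.6 * prefix_match + 0.4 * cand.source_score


def merge(query: str, result_lists: list[list[Candidate]]) -> list[Candidate]:
    """Flatten per-source lists, keep the best copy of each title, sort by score."""
    best: dict[str, Candidate] = {}
    for results in result_lists:
        for cand in results:
            kept = best.get(cand.title)
            if kept is None or score(query, cand) > score(query, kept):
                best[cand.title] = cand
    return sorted(best.values(), key=lambda c: score(query, c), reverse=True)


lexical = [Candidate("Ali", "lexical", 0.9), Candidate("Alien", "lexical", 0.7)]
semantic = [Candidate("Rocky", "semantic", 0.8), Candidate("Alien", "semantic", 0.5)]
merged = merge("ali", [lexical, semantic])
titles = [c.title for c in merged]
```

A second, personalized pass could re-score `merged` with user-history features in the same way; that step is omitted here.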
Retrieval and Recommendation Systems at the Crossroads of Artificial Intelligence, Ethics, and Regulation
This tutorial aims at providing its audience an interdisciplinary overview about the topics of fairness and non-discrimination, diversity, and transparency of AI systems, tailored to the research fields of information retrieval and recommender systems. By means of this tutorial, we would like to equip the mostly technical audience of SIGIR with the necessary understanding of the ethical implications of their research and development on the one hand, and of recent political and legal regulations that address the aforementioned challenges on the other hand.
In recent years, sequential recommender systems (SRSs) and session-based recommender systems (SBRSs) have emerged as a new paradigm of RSs to capture users' short-term but dynamic preferences for enabling more timely and accurate recommendations. Although SRSs and SBRSs have been extensively studied, there are many inconsistencies in this area caused by the diverse descriptions, settings, assumptions and application domains. No existing work provides a unified framework and problem statement to remove the various and commonly existing inconsistencies in the area of SR/SBR. There is also a lack of work providing a comprehensive and systematic demonstration of the data characteristics, key challenges, most representative and state-of-the-art approaches, typical real-world applications and important future research directions in the area. This work aims to fill in these gaps so as to facilitate further research in this exciting and vibrant area.
Dialogue systems, commonly known as chatbots, have gained escalating popularity in recent years due to their widespread applications in carrying out chit-chat conversations with users and accomplishing various tasks as personal assistants. However, they still have some major weaknesses. One key weakness is that they are typically trained from pre-collected and manually labeled data and/or written with handcrafted rules. Their knowledge bases (KBs) are also fixed and pre-compiled by human experts. Due to the huge amount of manual effort involved, they are difficult to scale and also tend to produce many errors owing to their limited ability to understand natural language and the limited knowledge in their KBs. Thus, when these systems are deployed, the level of user satisfaction is often low.
In this tutorial, we introduce and discuss methods to give chatbots the ability to continuously and interactively learn new knowledge during conversation, i.e., "on the job" by themselves, so that as the systems chat more and more with users, they become more and more knowledgeable and improve their performance over time. The first half of the tutorial focuses on introducing the paradigm of lifelong and continual learning and discussing various related problems and challenges in conversational AI applications. In the second half, we present recent advancements on the topic, with a focus on continuous lexical and factual knowledge learning in dialogues, open-domain dialogue learning after deployment and learning of new language expressions via user interactions for language grounding applications (e.g., natural language interfaces). Finally, we conclude with a discussion on the scope for continual conversational skill learning and present some open challenges for future research.
This tutorial focuses on both theoretical and practical aspects of improving the efficiency and robustness of transformer-based approaches, so that these can be effectively used in practical, high-scale, and high-volume information retrieval (IR) scenarios. The tutorial is inspired and informed by our work and experience while working with massive narrative datasets (8.5 billion medical notes), and by our basic research and academic experience with transformer-based IR tasks. Additionally, the tutorial focuses on techniques for making transformer-based IR robust against adversarial (AI) exploitation. This is a recent concern in the IR domain that we needed to take into account, and we want to share some of the lessons learned and applicable principles with our audience. Finally, an important, if not critical, element of this tutorial is its focus on didacticism: delivering tutorial content in a clear, intuitive, plain-speak fashion. Transformers are a challenging subject, and, through our teaching experience, we observed a great value and a great need to explain all relevant aspects of this architecture and related principles in the most straightforward, precise, and intuitive manner. That is the defining style of our proposed tutorial.
Recent studies have shown that it is possible for stereotypical gender biases to find their way into representational and algorithmic aspects of retrieval methods; hence, exhibit themselves in retrieval outcomes. In this tutorial, we inform the audience of various studies that have systematically reported the presence of stereotypical gender biases in Information Retrieval (IR) systems. We further classify existing work on gender biases in IR systems as being related to (1) relevance judgement datasets, (2) structure of retrieval methods, and (3) representations learnt for queries and documents. We present how each of these components can be impacted by or cause intensified biases during retrieval. Based on these identified issues, we then present a collection of approaches from the literature that have discussed how such biases can be measured, controlled, or mitigated. Additionally, we introduce publicly available datasets that are often used for investigating gender biases in IR systems as well as evaluation methodology adopted for determining the utility of gender bias mitigation strategies.
Recommender systems have become key components for a wide spectrum of web applications (e.g., E-commerce sites, video sharing platforms, lifestyle applications, etc.), so as to alleviate information overload and suggest items for users. However, most existing recommendation models follow a supervised learning paradigm, which notably limits their representation ability with the ubiquitous sparse and noisy data in practical applications. Recently, self-supervised learning (SSL) has become a promising learning paradigm to distill informative knowledge from unlabeled data, without heavy reliance on sufficient supervision signals. Inspired by the effectiveness of self-supervised learning, recent efforts bring SSL's strengths into various recommendation representation learning scenarios with augmented auxiliary learning tasks. In this tutorial, we aim to provide a systematic review of existing self-supervised learning frameworks and analyze the corresponding challenges for various recommendation scenarios, such as the general collaborative filtering paradigm, social recommendation, sequential recommendation, and multi-behavior recommendation. We then raise discussions and future directions of this area. With the introduction of this emerging and promising topic, we expect the audience to gain a deep understanding of this domain. We also seek to promote more ideas and discussions, which will facilitate the development of self-supervised learning recommendation techniques.
Conducting studies involving actual users is a recurring challenge in information retrieval. In this tutorial we will address the main strategic and tactical choices for engaging with, designing and executing user studies, considering both evaluation and formative investigation. The tension between reproducibility and ensuring natural user behaviour will be a recurring focus, seeking to help individual researchers make an intentional and well-argued choice for their research. The presenters have over fifty years of combined experience working in interactive information retrieval, and information interaction in general.
Customer reviews are vital for making purchasing decisions in the Information Age. Such reviews can be automatically summarized to provide the user with an overview of opinions. In this tutorial, we present various aspects of opinion summarization that are useful for researchers and practitioners. First, we will introduce the task and major challenges. Then, we will present existing opinion summarization solutions, both pre-neural and neural. We will discuss how summarizers can be trained in the unsupervised, few-shot, and supervised regimes. Each regime has roots in different machine learning methods, such as auto-encoding, controllable text generation, and variational inference. Finally, we will discuss resources and evaluation methods and conclude with future directions. This three-hour tutorial will provide a comprehensive overview of major advances in opinion summarization. The listeners will be well equipped with knowledge that is useful for both research and practical applications.
A knowledge graph (KG) has nodes and edges representing entities and relations. KGs are central to search and question answering (QA), yet research on deep/neural representation of KGs, as well as deep QA, have moved largely to AI, ML and NLP communities. The goal of this tutorial is to give IR researchers a thorough update on the best practices of neural KG representation and inference from AI, ML and NLP communities, and then explore how KG representation research in the IR community can be better driven by the needs of search, passage retrieval, and QA. In this tutorial, we will study the most widely-used public KGs, important properties of their relations, types and entities, best-practice deep representations of KG elements and how they support or cannot support such properties, loss formulations and learning methods for KG completion and inference, the representation of time in temporal KGs, alignment across multiple KGs, possibly in different languages, and the use and benefits of deep KG representations in QA applications.
Conversational information seeking (CIS) involves interaction sequences between one or more users and an information system. Interactions in CIS are primarily based on natural language dialogue, while they may include other types of interactions, such as click, touch, and body gestures. CIS recently attracted significant attention and advancements continue to be made. This tutorial follows the content of the recent Conversational Information Seeking book authored by several of the tutorial presenters. The tutorial aims to be an introduction to CIS for newcomers to CIS in addition to the recent advanced topics and state-of-the-art approaches for students and researchers with moderate knowledge of the topic. A significant part of the tutorial is dedicated to hands-on experiences based on toolkits developed by the presenters for conversational passage retrieval and multi-modal task-oriented dialogues. The outcomes of this tutorial include theoretical and practical knowledge, including a forum to meet researchers interested in CIS.
While recent progress in the field of machine learning (ML) and information retrieval (IR) has been significant, the reproducibility of these cutting-edge results is often lacking, with many submissions failing to provide the necessary information in order to ensure subsequent reproducibility. Despite the introduction of self-check mechanisms before submission (such as the Reproducibility Checklist), criteria for evaluating reproducibility during reviewing at several major conferences, artifact review and badging framework, and dedicated reproducibility tracks and challenges at major IR conferences, the motivation for executing reproducible research is lacking in the broader information community. We propose this tutorial as a gentle introduction to help ensure reproducible research in IR, with a specific emphasis on ML aspects of IR research.
Perhaps the applied nature of information retrieval research goes some way to explain the community's rich history of evaluating machine learning models holistically, understanding that efficacy matters but so does the computational cost incurred to achieve it. This is evidenced, for example, by more than a decade of research on efficient training and inference of large decision forest models in learning-to-rank. As the community adopts even more complex, neural network-based models in a wide range of applications, questions on efficiency have once again become relevant. We propose this workshop as a forum for a critical discussion of efficiency in the era of neural information retrieval, to encourage debate on the current state and future directions of research in this space, and to promote more sustainable research by identifying best practices in the development and evaluation of neural models for information retrieval.
The goal of the seventh edition of SCAI (https://scai.info) is to bring together and further grow a community of researchers and practitioners interested in conversational systems for information access. The previous iterations of the workshop already demonstrated the breadth and multidisciplinarity inherent in the design and development of conversational search agents. The proposed shift from traditional web search to search interfaces enabled via human-like dialogue leads to a number of challenges, and although such challenges have received more attention in the recent years, there are many pending research questions that should be addressed by the information retrieval community and can largely benefit from a collaboration with other research fields, such as natural language processing, machine learning, human-computer interaction and dialogue systems. This workshop is intended as a platform enabling a continuous discussion of the major research challenges that surround the design of search-oriented conversational systems. This year, participants have the opportunity to meet in person and have more in-depth interactive discussions with a full-day onsite workshop.
A rapidly changing news ecosystem presents new challenges to research, media organizations, consumers, and societies. The 10th edition of the International Workshop on News Recommendation and Analytics (INRA) serves to exchange ideas and discuss recent trends, technological advancements, and open problems concerning news. We welcome contributions in scientific articles, demonstrations, and ideas. We strive to bring together researchers, practitioners, and decision-makers to address crucial challenges. The workshop provides an opportunity to learn about recent research and interactively discuss technical and interdisciplinary aspects related to news. Topics of interest include information access systems for news, advances in natural language processing, multi-modality, mis- and disinformation, trust and user experiences, and personalization.
Steadily increasing numbers of patent applications per year and large amounts of available patent data necessitate highly efficient and interactive next-generation information retrieval systems in the patent domain. AI and Machine Learning (ML) methods such as Deep Learning (DL) are successfully adopted in many domains, so patent researchers and practitioners start to employ AI-based approaches as well, to support experts in the patenting process or to automate patent analysis and retrieval processes. AI-enhanced Information Retrieval systems can improve patent search and analysis but also require millions of annotated sample data for training the ML models. When working with patent data, particular challenges arise that call for adaption of existing IR and AI methods as well as development of novel approaches suited for the patent domain. The focus of the 3rd edition of this workshop will be on two-way communication between industry and academia from all areas of Information Retrieval, such as Natural Language Processing (NLP), Text and Data Mining (TDM), and Semantic Technologies (ST). We want to bring together novel research results and the latest systems and methods employed by the Intellectual Property (IP) industry.
QUARE - measuring the QUality of explAnations in REcommender systems - is the first workshop that aims to promote discussion upon future research and practice directions around evaluation methodologies for explanations in recommender systems. To that end, we bring together researchers and practitioners from academia and industry to facilitate discussions about the main issues and best practices in the respective areas, identify possible synergies, and outline priorities regarding future research directions. Additionally, we want to stimulate reflections around methods to systematically and holistically assess explanation approaches, impact, and goals, at the interplay between organisational and human values. The homepage of the workshop is available at: https://sites.google.com/view/quare-2022/.
Since its inception, the World Wide Web has become a major information source, consulted for a diversity of informational tasks. With an abundance of information available online, Web search engines have been a main entry point, supporting users in finding suitable Web content for ever more complex information needs. The IWILDS workshop series invites research on complex search activities related to human learning. It provides an interdisciplinary platform for the presentation and discussion of recent research on human learning on the Web, welcoming perspectives from computer & information science, education and psychology.
eCommerce Information Retrieval (IR) is receiving increasing attention in the academic literature and is an essential component of some of the world's largest web sites (e.g. Airbnb, Alibaba, Amazon, eBay, Facebook, Flipkart, Lowe's, Taobao, and Target). SIGIR has for several years seen sponsorship from eCommerce organisations, reflecting the importance of IR research to them. The purpose of this workshop is (1) to bring together researchers and practitioners of eCommerce IR to discuss topics unique to it, (2) to determine how to use eCommerce's unique combination of free text, structured data, and customer behavioral data to improve search relevance, and (3) to examine how to build datasets and evaluate algorithms in this domain. Since eCommerce customers often do not know exactly what they want to buy (i.e. navigational and spearfishing queries are rare), recommendations are valuable for inspiration and serendipitous discovery as well as basket building.
The theme of this year's eCommerce IR workshop is Bridging IR Metrics and Business Metrics and Multi-objective Optimization. The workshop includes papers on this topic as well as a panel focused on this area (see Section 3). In addition, Farfetch is sponsoring a recommendation challenge focused on outfit completion: as part of the event, Farfetch will release to the research community a novel, large dataset containing multi-modal information and extensive labels curated by fashion experts. The data challenge reflects themes from prior SIGIR workshops in 2017, 2018, 2019, 2020, 2021.
Information retrieval (IR) systems have become an essential component in modern society to help users find useful information, and they consist of a series of processes including query expansion, item recall, item ranking and re-ranking, etc. Based on the ranked information list, users can provide their feedback. Such an interaction process between users and IR systems can be naturally formulated as a decision-making problem, which can be either one-step or sequential. In the last ten years, deep reinforcement learning (DRL) has become a promising direction for decision-making, since DRL utilizes the high model capacity of deep learning for complex decision-making tasks. Recently, there have been emerging research works focusing on leveraging DRL for IR tasks. However, the fundamental information theory under DRL settings, the principle of RL methods for IR tasks, and the experimental evaluation protocols of DRL-based IR systems have not been deeply investigated.
To this end, we propose the third DRL4IR workshop (https://drl4ir.github.io) at SIGIR 2022, which provides a venue for both academic researchers and industry practitioners to present the recent advances of DRL-based IR systems, and to foster novel research, interesting findings, and new applications of DRL for IR. In the last two years, DRL4IR organized at SIGIR'20/21 was one of the most successful workshops and attracted over 200 workshop attendees each year. This year, we will pay more attention to fundamental research topics and recent application advances, with an expectation of over 300 workshop participants.
SESSION: Doctoral Consortium
Conversational intelligent assistants, such as Amazon Alexa, Google Assistant, and Apple Siri, are a form of voice-only Question Answering (QA) system and have the potential to address complex information needs. However, at the moment they are mostly limited to answering with facts expressed in a few words. For example, when a user asks Google Assistant if coffee is good for their health, it responds by justifying why it is good for their health without shedding any light on the side effects coffee consumption might have. Such limited exposure to multiple perspectives can lead to change in perceptions, preferences, and attitude of users, as well as to the creation and reinforcement of undesired cognitive biases. Getting such QA systems to provide a fair exposure to complex answers -- including those with opposing perspectives -- is an open research problem. In this research, I aim to address the problem of fairly exposing multiple perspectives and relevant answers to users in a multi-turn conversation without negatively impacting user satisfaction.
Classical evaluation of information retrieval systems evaluates a system on a static test collection. In the case of Web search, the evaluation environment (EE) is continuously changing and the hypothesis of using a static test collection is not representative of this changing reality. Moreover, changes in the elements of the evaluation environment, such as the document set, the topic set, the relevance judgments, and the chosen metrics, have an impact on the performance measurement [1, 4]. To the best of our knowledge, there is no way to evaluate two versions of a search engine with evolving EEs.
We aim at proposing a continuous framework to evaluate different versions of a search engine in different evaluation environments. The classical paradigm relies on a controlled test collection (i.e., set of topics, corpus of documents and relevance assessments) as a stable and meaningful EE that guarantees the reproducibility of system results. We define the different EEs as a dynamic test collection (DTC). A DTC is a list of test collections based on a controlled evolution of a static test collection. The DTC allows us to quantify and relate the differences between the test collection elements, called Knowledge delta (KΔ), and the performance differences between systems evaluated on these varying test collections, called Result delta (RΔ). Finally, the continuous evaluation is characterized by KΔs and RΔs. The related changes in both deltas will allow for interpreting the evaluations of system performance. The expected contributions of the thesis are: (i) a pivot strategy based on RΔ to compare systems evaluated in different EEs; (ii) a formalization of DTC to simulate the continuous evaluation and provide significant RΔ in evolving contexts; and (iii) a continuous evaluation framework that incorporates KΔ to explain RΔ of evaluated systems.
It is not possible to measure the RΔ of two systems evaluated in different EEs, because the performance variations are dependent on the changes in the EEs. To get an estimation of this RΔ measure, we propose to use a reference system, called the pivot system, which would be evaluated within the two EEs considered. Then, the RΔ value is measured using the relative distance between the pivot system and each evaluated system. Our results [2, 3] show that using the pivot strategy we improve the correctness of the ranking of systems (RoS) evaluated in two EEs (i.e., similarity with the RoS evaluated in the ground truth), compared to the RoS constructed with the absolute performance values for each system evaluated in the different EEs. The correctness of the RoS depends on the system defined as pivot and the metric.
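The pivot strategy can be illustrated with a minimal Python sketch. The relative-distance formula `(score - pivot) / pivot`, the function names, and the scores below are illustrative assumptions, not the definition used in [2, 3].

```python
def result_delta(system_score, pivot_score):
    """Relative distance of a system from the pivot within one EE (assumed form)."""
    if pivot_score == 0:
        raise ValueError("pivot score must be non-zero")
    return (system_score - pivot_score) / pivot_score


def rank_by_pivot(scores_by_system):
    """Rank systems, each evaluated in its own EE, by their distance to the pivot.

    scores_by_system maps a system name to (system_score, pivot_score), where
    both scores were measured in the same EE with the same metric.
    """
    deltas = {name: result_delta(s, p) for name, (s, p) in scores_by_system.items()}
    return sorted(deltas, key=deltas.get, reverse=True)


# Hypothetical scores: system A is evaluated in EE1, system B in EE2.
scores = {"A": (0.30, 0.25), "B": (0.40, 0.38)}
ranking = rank_by_pivot(scores)
absolute_ranking = sorted(scores, key=lambda name: scores[name][0], reverse=True)
```

With these made-up numbers, the absolute scores place B ahead of A, while the pivot-relative view places A ahead of B, because the pivot itself scores much higher in B's EE.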
The focus of the proposal then moves to continuous evaluation, understood as the repeated assessment of the same or different versions of a web search system across evolving EEs. Current test collections do not consider the evolution of documents, topics and relevance judgements. We require a DTC to extract the RΔs of the compared systems and their relation to the changes in the EEs (KΔ). We provide a method to define a DTC from static test collections based on controlled features, as a way to better simulate the evolving EE. According to our preliminary experiments, a system evaluated in our proposed DTC shows more variable performance, and larger RΔs, than when it is evaluated in several random shards or bootstraps of documents.
As future work, we will integrate the KΔs to formalize an explainable continuous evaluation framework. The pivot strategy tells us when the performance of the system is improving across EEs. The DTC provides us with the required EEs to identify significant RΔs, and the inclusion of KΔs in the framework will define a set of factors that explain the system's performance changes.
Reasonable explanations help to increase users' trust in, and satisfaction with, a recommender system. Among previous studies, there is growing interest in generating explanations based on review text.
Collaborative filtering is one of the most successful approaches to predicting users' preferences. However, most collaborative filtering methods suffer from the data sparsity problem. Researchers often utilize auxiliary data to address this problem, such as reviews, knowledge graphs (KGs), images and so on. Some researchers have shown that recommendation accuracy can be improved by incorporating rating and review data. In addition, neural networks have been applied to learn more powerful user and item representations from the review data. For example, convolutional neural networks (CNNs) are used to extract representations from review text by means of convolutional filters. Recurrent neural networks (RNNs) are another widely used model, which can encode sequential behaviours as hidden states. However, most of these models lack the ability to generate explanations.
Two main approaches are used to generate explanations, i.e., the template-based approach and the generation-based approach. The template-based approach usually requires several templates to be defined, which are then filled with different personalized features or words. Although such methods offer readable explanations, they rely heavily on pre-defined templates, which requires considerable manual effort and limits the expressiveness of the explanations. Thanks to the strong generation ability of natural language models, the generation-based approach is able to generate explanations without templates, which can largely enhance the expressiveness of the generated sentences. Although such methods can generate freer and more flexible explanations, the explanations tend to be uninformative.
To tackle the challenges of the above-mentioned work, we propose Generating Knowledge-based Explanation for Recommendation from Review (GKER) to provide informative explanations. Unlike the traditional generation-based approach with a multi-task framework, we design a single-task framework to simultaneously model the user's preference and the explanation generation, since multi-task training usually needs more manual effort and time overhead. In this unitary framework, we inject the user's sentiment preference into the explanation generation, aiming at capturing the user's interest while producing high-quality explanations. Specifically, we build three graphs: a bipartite graph, a KG and a co-occurrence graph. All of them are integrated to form a unitary graph, thus bringing together the semantics of user-item interactions, the KG and the reviews. Based on this integrated graph, it is possible to learn more effective representations for users and items. To make better use of the integrated KG, a graph convolution network (GCN) is utilized to obtain improved embeddings, owing to its superior representation learning ability. We argue that these embeddings can contain more semantic interaction signals with the help of the integrated KG and the GCN. After obtaining these embeddings, a multilayer perceptron (MLP) layer is further employed to capture non-linear interaction signals between user and item, aiming at predicting the user's rating accurately. The predicted rating is regarded as a sentiment indicator to explore why the user likes or dislikes the target item. To investigate the association between the sentiment indicator and the related review data, a transformer-enhanced encoder-decoder architecture is designed to produce informative and topic-relevant explanations. In addition, aspect semantics are added to this architecture through an attention mechanism. In this framework, the transformer is utilized as a "teacher" model to supervise the generation of the encoder-decoder process.
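As a rough illustration of the embedding and rating steps described above, the following sketch applies one standard GCN propagation step to a toy integrated graph and feeds the resulting user and item embeddings to a small MLP. It is not the GKER implementation: the symmetric normalization, the layer sizes, the random weights, the three-node graph, and all function names are assumptions made for the example.

```python
import numpy as np


def normalized_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with added self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm: np.ndarray, features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One propagation step: ReLU(A_norm @ H @ W)."""
    return np.maximum(a_norm @ features @ weight, 0.0)


def predict_rating(user_vec, item_vec, w1, w2) -> float:
    """Tiny MLP over the concatenated user and item embeddings."""
    hidden = np.maximum(np.concatenate([user_vec, item_vec]) @ w1, 0.0)
    return float(hidden @ w2)


# Toy integrated graph (hypothetical): node 0 = user, node 1 = item, node 2 = KG entity.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4))           # initial node features
a_norm = normalized_adjacency(adj)
emb = gcn_layer(a_norm, features, rng.normal(size=(4, 4)))

# Untrained weights, so the value itself is meaningless; only the data flow matters.
rating = predict_rating(emb[0], emb[1], rng.normal(size=(8, 4)), rng.normal(size=4))
```

In the framework described above, the predicted rating would then act as the sentiment indicator passed to the explanation generator; that part is not sketched here.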
Finally, experiments conducted on three datasets show that GKER achieves state-of-the-art performance.
Some research issues remain open for discussion: 1) although a KG is a useful tool for recommendation accuracy and explainability, it is always incomplete in the real world, so it is worth completing it for recommendation; 2) explainable recommendation still needs more metrics to evaluate the quality of the explanations.
Information elicitation conversations, for example, when a medical professional asks about a patient's history or a sales agent tries to understand their client's preferences, often start with a set of routine questions. The interviewer asks a predetermined set of questions conversationally, adapting them to the unique characteristics and context of an individual. Multiple-choice questionnaires are commonly used as a screening tool before the client sees the professional for more efficient information elicitation. However, recent proof-of-concept studies show that users are more likely to report their symptoms to an embodied conversational agent (ECA) than on a pen-and-paper survey, and rate ECAs highly on user experience. Chatbots allow the user to give free-form responses and ask clarification questions instead of having to interpret and choose from a list of given options. They can also keep the user engaged by sharing relevant information and offering empathetic acknowledgments when appropriate. However, many of the technical challenges involved in building such a conversational agent remain unsolved.
Mathematical formulas are an important tool to concisely communicate ideas in science and education, used to clarify descriptions, calculations or derivations. When searching scientific literature, mathematical notation, which is often written in LaTeX, therefore plays a crucial role that should not be neglected. The task of mathematics-aware information retrieval is to retrieve relevant passages given a query or question, both of which can include natural language and mathematical formulas. As in many domains that rely on Natural Language Understanding, transformer-based models now dominate the field of information retrieval. Apart from their size and the transformer-encoder architecture, pre-training is considered to be a key factor in the high performance of these models. It has also been shown that domain-adaptive pre-training improves their performance on downstream tasks even further, especially when the vocabulary overlap between pre-training and in-domain data is low. This is also the case for the domain of mathematical documents.
Pseudo-relevance feedback mechanisms have long served as an effective technique to improve retrieval effectiveness in information retrieval. Recently, large pre-trained language models, such as T5 and BERT, have shown a strong capacity to capture the latent traits of texts. Given the success of these models, we seek to study their capacity for query reformulation. In addition, BERT models have shown further promise for dense retrieval, where the query and documents are encoded into contextualised embeddings and relevant documents are retrieved by semantic matching between them. Although the success of pseudo-relevance feedback for sparse retrieval is well documented, effective pseudo-relevance feedback approaches for the dense retrieval paradigm are still in their infancy. Thus, we are concerned with exploiting the potential of pseudo-relevance feedback information combined with large pre-trained models to conduct effective query reformulation for both sparse retrieval and dense retrieval.
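To make the notion of pseudo-relevance feedback in dense retrieval concrete, here is a minimal sketch under stated assumptions: documents and the query are already encoded as vectors, matching is done by dot product, and the query is updated with a Rocchio-style interpolation towards the centroid of the top-ranked documents. This is a generic illustration, not the approach proposed in the abstract; the weights `alpha` and `beta`, the function names, and the toy vectors are hypothetical.

```python
import numpy as np


def dense_retrieve(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k documents with the highest dot-product score."""
    scores = doc_matrix @ query_vec
    return np.argsort(-scores)[:k]


def prf_update(query_vec, doc_matrix, feedback_ids, alpha=1.0, beta=0.5):
    """Rocchio-style update: move the query towards the feedback centroid."""
    centroid = doc_matrix[feedback_ids].mean(axis=0)
    return alpha * query_vec + beta * centroid


# Toy 2-dimensional "embeddings" standing in for encoder outputs.
docs = np.array([[1.0, 0.0],
                 [0.8, 0.6],
                 [0.0, 1.0]])
query = np.array([1.0, 0.0])

first_pass = dense_retrieve(query, docs, k=2)      # pseudo-relevant documents
new_query = prf_update(query, docs, first_pass)    # reformulated query vector
second_pass = dense_retrieve(new_query, docs, k=2)
```

The top-ranked documents of the first pass are treated as relevant without any user judgement, which is what makes the feedback "pseudo".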
Streaming services have become one of today's main sources of music consumption, with music recommender systems (MRS) as important components. The MRS' choices strongly influence what users consume, and vice versa. Therefore, there is a growing interest in ensuring the fairness of these choices for all stakeholders involved. Firstly, for users, unfairness might result in some users receiving lower-quality recommendations in terms of accuracy and coverage. Secondly, item provider (i.e., artist) unfairness might result in some artists receiving less exposure, and therefore less revenue. However, it is challenging to improve fairness without a decrease in, for instance, overall recommendation quality or user satisfaction. Additional complications arise when balancing possibly domain-specific objectives for multiple stakeholders at once. While fairness research exists from both the user and artist perspective in the music domain, there is a lack of research directly consulting artists, with Ferraro et al. (2021) as an exception.
When interacting with recommendation systems and evaluating their fairness, the many factors influencing recommendation system decisions can cause another difficulty: lack of transparency. Artists indicate they would appreciate more transparency in MRS, both towards the user and themselves. While, e.g., Millecamp et al. (2019) use explanations to increase transparency for MRS users, to the best of our knowledge, no research has addressed improving transparency for artists this way.
State-of-the-art conversational question answering (ConvQA) operates over homogeneous sources of information: either a knowledge base (KB), or a text corpus, or a collection of tables. This inherently limits the answer coverage of ConvQA systems. Therefore, during my PhD, we would like to tap into heterogeneous sources for answering conversational questions. Further, we plan to investigate the explainability of such ConvQA systems, to identify what helps users in understanding the answer derivation process.
Chronic disease patients, such as diabetics, cancer patients, and heart disease patients, actively seek health information for self-management and decision-making every single day. Patient-focused health recommender systems (PHRSs) that suggest health information relevant to patients' changing needs assist them by making information easily accessible. Nevertheless, patients' needs become more complex with disease progression and their increased knowledge about the disease. Hence, a unique requirement of a PHRS would be to suggest health information in line with patients' changing knowledge about the disease. However, current PHRSs are personalized to patient interest and do not consider patients' knowledge about the disease. When patients are provided with information tailored to their knowledge level, they are not only more likely to understand and engage better in disease management, but can also use the PHRS for disease-related learning. Hence, the overarching goal of my PhD thesis is to explore technologies in the field of recommender systems and personalized learning for the purpose of suggesting health information that accounts for patients' dynamic information needs and level of knowledge about the disease. We will explore these ideas in the context of developing a knowledge-appropriate PHRS (KA-PHRS). A critical innovation of KA-PHRS is the patient knowledge model, which keeps track of patients' changing knowledge level about the disease and enables knowledge-appropriate recommendations. The expectation is that health information suggested by KA-PHRS will both increase and benefit patients' involvement in self-management and treatment.
In this research, we aim to examine the assumptions made about users when searching for non-factoid answers using search engines, that is, the way they approach non-factoid question-answering tasks, the language they use to express their questions, the variability in their queries and their behavior towards the provided answers. The investigation will also examine the extent to which these neglected factors affect retrieval performance and potentially highlight the importance of building more realistic methodologies and test collections that capture the real nature of this task. Through our preliminary work, we have begun to explore the characteristics of non-factoid question-answering queries and to investigate query variability and its impact on modern retrieval models. Our preliminary results demonstrate notable differences between non-factoid questions sampled from a large query log and those used in QA datasets. In addition, our results demonstrate a profound effect of query variability on retrieval consistency, indicating a potential impact on retrieval performance that is worth studying. We highlight the importance of understanding user behaviour while searching for non-factoid answers, specifically the way users behave in response to receiving an answer. This should advance our understanding of the support users require across different types of non-factoid questions and inform the design of interaction models that support learning and encourage exploration.